KR2020lab3ex
Some more code examples to get started with lab 3.
Question Parsing with SpaCy
Loading SpaCy:
    import spacy

    # do this once from the command line:
    # python -m spacy download en
    nlp = spacy.load("en")
A ridiculously simple parser for questions:
    import spacy.symbols as sym

    def parse(txt):
        doc = nlp(txt)
        verbs = []
        for word in doc:
            if word.dep in [sym.nsubj, sym.nsubjpass, sym.pobj]:
                chunk, verb = find_upward_verb(word)
                if verb is not None:
                    verbs.append((verb, chunk))
            # uncomment this to see what the "parser" sees
            #print(word, word.dep_, word.pos_, word.head)
        return verbs
The code above assumes that the input is a single sentence. It matches verbs with their subjects and objects. The following helper collects the relevant parts from the sentence's dependency tree:
    def find_upward_verb(word):
        words = []
        verb = None
        while word.dep_ != "ROOT":
            words.append(word)
            word = word.head
            if word.pos in [sym.VERB, sym.AUX]:
                verb = word
                break
        words.reverse()  # from top to down
        return words, verb
Some examples:
    >>> parse("what is estonia?")
    [(is, [estonia])]
    >>> parse("where is estonia?")
    [(is, [estonia])]
    >>> parse("where is estonia located?")
    [(located, [estonia])]
    >>> parse("what is located in estonia?")
    [(located, [what]), (located, [in, estonia])]
    >>> parse("who is the mayor of tallinn?")
    [(is, [who]), (is, [mayor, of, tallinn])]
This structure should be relatively easy to handle. For example, the third question is about the verb "located", and its subject or object is "estonia". The returned objects are still SpaCy Tokens, so you can access their POS tags and other information. However, you will probably also need to detect the "question" words, such as "where", "what", "who" etc., to properly interpret the question.
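As a starting point, question-word detection can be as simple as a lookup table. The function and mapping below are an illustrative sketch, not part of the lab code:

```python
# Hypothetical mapping from question words to expected answer types.
QUESTION_WORDS = {
    "who": "person",
    "what": "thing",
    "where": "place",
    "when": "time",
    "why": "reason",
    "which": "choice",
    "how": "manner",
}

def question_type(txt):
    """Return the expected answer type based on the first question word found."""
    for token in txt.lower().strip("?!. ").split():
        if token in QUESTION_WORDS:
            return QUESTION_WORDS[token]
    return None  # no question word: possibly a yes/no question

print(question_type("where is estonia located?"))  # -> 'place'
```

A question with no question word (e.g. "is estonia a country?") falls through to None, which you could treat as a yes/no question.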
Obviously, a lot more work is required to parse longer questions or to handle subtleties such as the word "not" appearing somewhere.
Semantic Similarity
Semantic similarity with word vectors is a hot topic and you will find hundreds of articles and guides online. Recent versions of SpaCy have a way to find words with similar meaning.
This example uses pre-trained vectors that ship with SpaCy:
    >>> import numpy as np
    >>> # python -m spacy download en_core_web_lg
    >>> nlp = spacy.load("en_core_web_lg")
    >>> bla = nlp("hill")
    >>> keys, _, sim = nlp.vocab.vectors.most_similar(np.array([bla[0].vector]), n=10)
    >>> for k, s in zip(keys[0], sim[0]):
    ...     print((nlp.vocab.strings[k], s))
    ...
    ('HIll', 1.0)
    ('hill', 1.0)
    ('HILL', 1.0)
    ('Hill', 1.0)
    ('hills', 0.795)
    ('Hills', 0.795)
    ('HILLS', 0.795)
    ('mountain', 0.6898)
    ('MOUNTAIN', 0.6898)
    ('Mountain', 0.6898)
If you want to search within a limited set of terms then you need to make your own Vectors object. This does not necessarily mean training your own vectors. If all your words are in SpaCy's vocabulary, you can simply copy vectors from there.
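To illustrate the underlying idea of searching within your own limited term set, here is a minimal sketch using plain numpy and cosine similarity. The tiny 3-d vectors are made up for the example; in the lab you would instead copy the real 300-d vectors from nlp.vocab for each of your terms:

```python
import numpy as np

# Made-up 3-d "vectors", purely for illustration.
terms = ["hill", "mountain", "banana"]
vecs = np.array([
    [0.9, 0.1, 0.0],   # hill
    [0.8, 0.2, 0.1],   # mountain
    [0.0, 0.1, 0.9],   # banana
])

def most_similar(query_vec, vecs, terms, n=2):
    """Rank terms by cosine similarity to query_vec, highest first."""
    q = query_vec / np.linalg.norm(query_vec)
    m = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per term
    order = np.argsort(-sims)[:n]
    return [(terms[i], float(sims[i])) for i in order]

print(most_similar(np.array([1.0, 0.0, 0.0]), vecs, terms))
```

This is essentially what a Vectors object does for you, but with full control over which terms are searchable.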
Note that in SpaCy vectors are not limited to single words. You can also compare longer texts:
    >>> nlp("banana").similarity(nlp("yellow fruit"))
    0.655803422700308
    >>> nlp("banana").similarity(nlp("red fruit"))
    0.6324075748339544
Fuzzy Search
Instead of words with similar meaning, we might want to find words that look similar. This is an annoying problem to solve efficiently. Using the right type of index can help a bit.
    >>> import editdistance
    >>> import pybktree
    >>> # assuming labels contain the strings that we want to match against
    >>> fuzzy_index = pybktree.BKTree(editdistance.eval, labels)
    >>> fuzzy_index.find("Reikjavik", 3)
    [(1, 'Reykjavik'), (3, 'Miklavik')]
Unfortunately these tools do not scale well beyond ~1M strings. If you have relatively few strings to match against, you can do a simple linear search and find the ones with the lowest distance using editdistance.eval().
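For reference, the distance measure here is plain Levenshtein edit distance, which is easy to implement directly. The sketch below shows both the measure and the linear-search fallback; the helper names are hypothetical, but the result matches what editdistance.eval plus a scan would give:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_search(query, labels, max_dist):
    """Linear scan: (distance, label) pairs within max_dist, closest first."""
    hits = [(levenshtein(query, lab), lab) for lab in labels]
    return sorted(h for h in hits if h[0] <= max_dist)

labels = ["Reykjavik", "Miklavik", "Tallinn"]
print(fuzzy_search("Reikjavik", labels, 3))
# -> [(1, 'Reykjavik'), (3, 'Miklavik')]
```

The BK-tree avoids comparing the query against every label, which is what makes it worthwhile on larger sets.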
More Parsing
The textacy package contains some wrappers for SpaCy functionality, for example matching with expressions:
    >>> import textacy
    >>> doc = nlp("who is the mayor of tallinn?")
    >>> print(list(textacy.extract.matches(doc,
    ...     [{"POS": "ADP"}, {"OP": "*"}, {"POS": "PROPN"}])))
    [of tallinn]
It also adds some new algorithms, such as finding the most important words:
    >>> import textacy.ke
    >>> textacy.ke.textrank(doc)
    [('tallinn', 0.16844059871436393), ('mayor', 0.1573258553341743)]
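For intuition, TextRank is essentially PageRank run on a word co-occurrence graph: words that co-occur within a small window are linked, and the stationary rank of each node measures its importance. A toy sketch (not textacy's actual implementation) looks like this:

```python
def mini_textrank(words, window=2, damping=0.85, iters=50):
    """Toy TextRank: undirected co-occurrence graph + power iteration."""
    nodes = sorted(set(words))
    # link words that co-occur within `window` positions of each other
    nbrs = {w: set() for w in nodes}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                nbrs[w].add(words[j])
                nbrs[words[j]].add(w)
    # power iteration over the graph (standard PageRank update)
    rank = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iters):
        rank = {w: (1 - damping) / len(nodes)
                   + damping * sum(rank[v] / len(nbrs[v]) for v in nbrs[w])
                for w in nodes}
    return sorted(rank.items(), key=lambda kv: -kv[1])

print(mini_textrank(["deep", "learning", "models", "learning", "rate"]))
```

Real TextRank additionally filters by POS tag and collapses adjacent top-ranked words into phrases, which is what textacy.ke.textrank does for you.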