KR2020lab3ex

Some more code examples to get started with lab 3.

Question Parsing with SpaCy

Loading SpaCy:

import spacy
# do this once from the command line
# (the short name "en" is deprecated in newer SpaCy versions):
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

A ridiculously simple parser for questions:

import spacy.symbols as sym                                                     

def parse(txt):
    doc = nlp(txt)
    verbs = []
    for word in doc:
        if word.dep in [sym.nsubj, sym.nsubjpass, sym.pobj]:
            chunk, verb = find_upward_verb(word)
            if verb is not None:
                verbs.append((verb, chunk))
        # uncomment this to see what the "parser" sees
        #print(word, word.dep_, word.pos_, word.head)
    return verbs

The code above assumes that the input is a single sentence. It matches verbs with their subjects and objects. This function collects the relevant parts from the tree of the sentence:

def find_upward_verb(word):
    words = []
    verb = None
    while word.dep_ != "ROOT":
        words.append(word)
        word = word.head
        if word.pos in [sym.VERB, sym.AUX]:
            verb = word
            break
    words.reverse() # order from the top of the tree down
    return words, verb

Some examples:

>>> parse("what is estonia?")
[(is, [estonia])]
>>> parse("where is estonia?")
[(is, [estonia])]
>>> parse("where is estonia located?")
[(located, [estonia])]
>>> parse("what is located in estonia?")
[(located, [what]), (located, [in, estonia])]
>>> parse("who is the mayor of tallinn?")
[(is, [who]), (is, [mayor, of, tallinn])]

This structure should be relatively easy to handle. For example, the third question is about the verb "located" and its subject or object is "estonia". The returned objects are still SpaCy Tokens, so you can access their POS tags and other information. However, you probably need to additionally detect at least the "question" words, such as "where", "what", "who", etc., to properly interpret the question.
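
For example, a minimal sketch for picking out the question word could look like this (the question_word helper and its tag set are our own illustrative choices, not part of the lab code):

def question_word(txt):
    # Penn Treebank WH-tags used by SpaCy's English models:
    # WDT (wh-determiner), WP (wh-pronoun), WP$ (possessive wh-pronoun), WRB (wh-adverb)
    wh_tags = {"WDT", "WP", "WP$", "WRB"}
    for word in nlp(txt):
        if word.tag_ in wh_tags:
            return word.text.lower()
    return None

>>> question_word("where is estonia located?")
'where'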

Obviously, a lot more work is required to parse longer questions or to handle subtleties like the word "not" appearing somewhere.
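
As a small taste, negation can be detected from the dependency labels; this sketch assumes the "neg" label that SpaCy's English models assign to words like "not" and "n't":

def is_negated(txt):
    # "neg" marks a negation word attached to its verb
    return any(word.dep_ == "neg" for word in nlp(txt))

>>> is_negated("estonia is not a city")
True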

Semantic Similarity

Semantic similarity with word vectors is a hot topic and you will find hundreds of articles and guides online. Recent versions of SpaCy have a way to find words with similar meaning.

This example uses pre-trained vectors that ship with SpaCy:

>>> import numpy as np
>>> # python -m spacy download en_core_web_lg
>>> nlp = spacy.load("en_core_web_lg")
>>> bla=nlp("hill")
>>> keys, _, sim = nlp.vocab.vectors.most_similar(np.array([bla[0].vector]), n=10)
>>> for k, s in zip(keys[0], sim[0]): print((nlp.vocab.strings[k], s))
... 
('HIll', 1.0)
('hill', 1.0)
('HILL', 1.0)
('Hill', 1.0)
('hills', 0.795)
('Hills', 0.795)
('HILLS', 0.795)
('mountain', 0.6898)
('MOUNTAIN', 0.6898)
('Mountain', 0.6898)

If you want to search within a limited set of terms, you need to make your own Vectors object. This does not necessarily mean training your own vectors: if all your words are in SpaCy's vocabulary, you can simply copy the vectors from there, as sketched below.
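
A minimal sketch of that, assuming the short list of terms below stands in for the set you actually care about:

from spacy.vectors import Vectors

terms = ["hill", "mountain", "valley", "river"]  # your limited set of terms
data = np.array([nlp.vocab[t].vector for t in terms])
# register the strings so the returned hash keys can be mapped back to text
keys = [nlp.vocab.strings.add(t) for t in terms]
vectors = Vectors(data=data, keys=keys)

keys_out, _, sim = vectors.most_similar(np.array([nlp.vocab["peak"].vector]), n=2)
for k, s in zip(keys_out[0], sim[0]):
    print(nlp.vocab.strings[k], s)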

Note that in SpaCy vectors are not limited to single words. You can match against longer texts:

>>> nlp("banana").similarity(nlp("yellow fruit"))
0.655803422700308
>>> nlp("banana").similarity(nlp("red fruit"))
0.6324075748339544

Fuzzy Search

Instead of words with similar meaning, we might want to find words that look similar. This is an annoying problem to solve efficiently. Using the right type of index, such as a BK-tree, can help a bit.

>>> import editdistance
>>> import pybktree

>>> # assuming labels contain the strings that we want to match against
>>> fuzzy_index = pybktree.BKTree(editdistance.eval, labels)
>>> fuzzy_index.find("Reikjavik", 3)
[(1, 'Reykjavik'), (3, 'Miklavik')]

Unfortunately, these tools do not scale well beyond ~1M strings. If you have relatively few strings to match against, you can do a simple linear search and pick the ones with the lowest distance using editdistance.eval().
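
A brute-force sketch of that (closest is just an illustrative helper name):

def closest(query, labels, k=3):
    # compute the edit distance to every label and keep the k smallest
    scored = sorted((editdistance.eval(query, label), label) for label in labels)
    return scored[:k]

For up to a few tens of thousands of labels this is usually fast enough.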

More Parsing

The textacy package contains some wrappers around SpaCy functionality, for example matching with token patterns:

>>> import textacy
>>> doc = nlp("who is the mayor of tallinn?")
>>> print(list(textacy.extract.matches(doc,
...         [{"POS":"ADP"},{"OP":"*"},{"POS":"PROPN"}])))
[of tallinn]

It also adds some new algorithms, such as finding the most important words:

>>> import textacy.ke
>>> textacy.ke.textrank(doc)
[('tallinn', 0.16844059871436393), ('mayor', 0.1573258553341743)]