KR 2019 homework part 1: overall task

Source: Lambda

Task

The end result of the lab is a presentation of about five minutes during which you explain:

  • which systems you experimented with
  • which gave useful output from your input
  • examples of output from your input (show actual textual examples)

The task of the first lab is to build and demonstrate a system which can

  • tokenize text (split any text into words: trivial)
  • run an NER tool on the text, giving Wikipedia URLs as output. See webaida. Notice that:
    • some NER tools do not give wiki URLs, only entity types. This is also useful.
    • for the second overall scenario (like Wikidata) we do need wiki URLs, though.
  • run and experiment with a (preferably pre-trained) word vectorization system such as word2vec; as an example, see this tutorial. We will come back to that.
  • run and experiment with a parser tool on the text. There are several ways to parse, some more useful for our purposes than others. We will come back to that.
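Tokenization is indeed trivial to sketch in plain Python, and a toy cosine-similarity computation illustrates the kind of comparison word vectors make possible. The vectors below are made up purely for illustration; a real pre-trained word2vec model supplies vectors with hundreds of dimensions per word:

```python
import math
import re

def tokenize(text):
    """Split text into word tokens with a simple regex (the 'trivial' step).
    A real pipeline would use a toolkit tokenizer instead."""
    return re.findall(r"\w+(?:'\w+)?", text.lower())

def cosine(u, v):
    """Cosine similarity between two vectors: the standard way to compare
    word2vec-style word embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-dimensional "embeddings", invented for illustration only:
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

tokens = tokenize("The king met the queen.")
print(tokens)
print(cosine(vectors["king"], vectors["queen"]))  # high: related words
print(cosine(vectors["king"], vectors["apple"]))  # low: unrelated words
```

With a real pre-trained model the dictionary lookup above is replaced by the model's word-to-vector mapping; the cosine comparison stays the same.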

You do not have to be wildly successful with all these subtasks to pass. You will, though, have to try out each of them and report on how it went (got OK results, got bad results, could not get it to work, ...).

The second lab will build upon the results here. In particular, some choices in the second lab depend on what you managed to do with NER, word2vec, and parsers.

Technology and tools

You are free to use any programming language, but Python is recommended.

You are free to use NLP tools, APIs, and datasets, but your program should drive them from beginning to end (i.e. your program takes input files, calls the tools, and modifies and prints the output). It is better to use the tools than not to use them!

Popular toolkits for NLP

The recommendation is to take the first (spaCy) or the second (NLTK) toolkit.
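A minimal spaCy sketch, assuming spaCy is installed: a blank pipeline is already enough for tokenization, while NER and dependency parsing need a pre-trained model such as `en_core_web_sm` (shown in comments, assuming it has been downloaded):

```python
import spacy

# A blank English pipeline tokenizes without any model download.
nlp = spacy.blank("en")
doc = nlp("Angela Merkel visited Tallinn in 2019.")
tokens = [t.text for t in doc]
print(tokens)

# NER and parsing need a pre-trained model, e.g. (assuming
# `python -m spacy download en_core_web_sm` has been run):
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp("Angela Merkel visited Tallinn in 2019.")
#   print([(ent.text, ent.label_) for ent in doc.ents])    # named entities
#   print([(t.text, t.dep_, t.head.text) for t in doc])    # dependency arcs
```

NLTK offers the same steps under different names (`nltk.word_tokenize`, `nltk.ne_chunk`), also after downloading its data packages.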

Web APIs