KR 2019 homework part 1: overall task
Task
The end result of the lab is a presentation: about five minutes during which you explain:
- which systems you experimented with
- which gave useful output from your input
- examples of output from your input (show actual textual examples)
The task of the first lab is to build and demonstrate a system which can
- tokenize text (split any text into words: trivial)
- run an NER tool on the text, giving Wikipedia URLs as output. See WebAIDA. Notice that:
- some NER tools do not give wiki URLs, just entity types; this is also useful.
- for the second overall scenario (like Wikidata), however, we do need wiki URLs.
- run and experiment with a (preferably pre-trained) word vectorization system like word2vec; as an example, see this tutorial. We will come back to that.
- run and experiment with a parser tool on the text. There are several ways to parse, some more useful for our purposes than others. We will come back to that.
You do not have to be wildly successful with all these subtasks to pass. You will, though, have to try out each of these subtasks and report on how it went (got OK results, got bad results, could not get it to work, ...).
The second lab will build upon the results here. In particular, some choices in the second lab depend on what you managed to do with NER, word2vec, and parsers.
Technology and tools
You are free to use any programming language, but Python is recommended.
You are free to use NLP tools, APIs, and datasets, but your program should drive them from beginning to end (i.e. your program takes input files, calls the tools, and modifies and prints the output). It is better to use the tools than not to use them!
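The "drive them from beginning to end" requirement might look like the skeleton below. The file handling and the regex tokenizer are hypothetical placeholders; the real version would call your chosen NLP tools at the marked points.

```python
# Hypothetical end-to-end driver skeleton: read input, call tools, print output.
import re
import sys

def tokenize(text):
    # Placeholder for a real tool call (spaCy, NLTK, ...):
    # words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def process(text):
    tokens = tokenize(text)
    # Here the real program would also call NER, word2vec and a parser.
    return " ".join(tokens)

if __name__ == "__main__":
    # Usage: python lab1.py input.txt [more_inputs.txt ...]
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            print(process(f.read()))
```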
Popular toolkits for NLP
The recommendation is to use the first (spaCy) or the second (NLTK) toolkit.
- spaCy toolkit for Python
- NLTK: the main Python toolkit, see also this tutorial
- Google SyntaxNet (see in github)
- CoreNLP: the main Stanford NLP tool, part of a larger set of Stanford NLP toolkits such as Stanford NER; see also this NER tutorial
- Pattern toolkit for Python
- OpenNLP
- PyNLP for Python
- NER tutorial for Linux in the context of a larger practical tutorial
Web APIs
- Google Cloud Natural Language API
- OpenCalais (free registration required)