Teadmised second lab 2011
The main goal of the second lab is to extend the data obtained during the first (web-scraping) lab using rules and a reasoning engine.
In the first lab you web-scraped uncertain data from public sources using search engines and simple text analysis/statistics.
Conditions
You can do the lab either alone or together with a fellow student. Three students per lab is not ok, though.
Technology (language, operating system, etc.) is completely free, EXCEPT that you have to use Otter version 3.3 as the reasoning engine.
Ideally you should create both (a) your own rules for improving data / deriving new data and (b) rules from the WordNet taxonomy to obtain generalisations of derived facts.
The lab is graded on the following (see below for details):
- your own rules: how good/sensible they are and what results you get for a selected domain of objects
- rules generated from the WordNet database
- quality and interestingness of the overall results
You may skip writing your own rules (a) or the WordNet rules (b), but in that case the grade for the lab will be lower.
What the app should do
Your system should take as input the data obtained during the first lab for some object (say, "Statue of Liberty" or "Andrus Ansip") and derive new facts, each augmented with a confidence number.
The new facts for the object have to be derived using Otter and rules of your own making.
As a result the full app should
- take a phrase like "Statue of Liberty" as input
- scrape raw data from the web for the phrase (use the first lab you have completed)
- derive new data
- output the full set of data, both raw and derived, with a confidence number and a raw/derived indicator for each fact.
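The output step above can be sketched as follows. This is only an illustration: the Fact record, its field names, the 0..1 confidence scale and the sample facts are all assumptions made for the example, not something prescribed by the lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One fact about the input object (hypothetical layout)."""
    subject: str        # the input phrase, e.g. "Statue of Liberty"
    predicate: str      # relation name
    value: str          # relation value
    confidence: float   # assumed to lie between 0 and 1
    derived: bool       # False = scraped in lab 1, True = produced by the reasoner


def format_report(facts):
    """Return one tab-separated line per fact: indicator, confidence, fact."""
    # Raw facts first, then derived ones; higher confidence first within each group.
    ordered = sorted(facts, key=lambda f: (f.derived, -f.confidence))
    lines = []
    for f in ordered:
        kind = "derived" if f.derived else "raw"
        lines.append(f"{kind}\t{f.confidence:.2f}\t{f.predicate}({f.subject}, {f.value})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample facts, for illustration only.
    sample = [
        Fact("Statue of Liberty", "is_a", "sculpture", 0.70, True),
        Fact("Statue of Liberty", "is_a", "statue", 0.90, False),
    ]
    print(format_report(sample))
```

Keeping the raw/derived indicator as a field of every fact means the final listing can be produced from one merged collection.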
Subtasks of the lab
The lab is essentially split into these subtasks:
- the main task: create the ruleset suitable for your data and the reasoner
- format the output of your first lab so that it is suitable for the reasoner: basically, use the correct syntax.
- create the input file for the reasoner: just compose it from a header, data block, rule block and a footer.
- run the reasoner and send output to a file.
- filter out the derived facts from the reasoner output.
- present the full resulting dataset.
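The mechanical subtasks above (formatting, composing the input file, running the reasoner, filtering) could look roughly like the sketch below. Everything specific in it is an assumption. The header settings are stand-ins, and the real ones should come from the lab example file. Facts are encoded as predicate(object, confidence_percent), which is just one possible encoding. The filter assumes that newly kept clauses appear in the Otter output on lines starting with "** KEPT". The function names are hypothetical.

```python
import re
import subprocess


def to_constant(phrase):
    """Turn a phrase into a lowercase identifier.

    The c_ prefix keeps the name from starting with u..z, which Otter's
    default convention would read as a variable inside a clause.
    """
    return "c_" + re.sub(r"[^a-z0-9]+", "_", phrase.lower()).strip("_")


def build_input(facts, rules):
    """Compose header, data block, rule block and footer into one input text."""
    header = "set(hyper_res).\nassign(max_seconds, 10).\n"  # stand-in settings
    data = "list(sos).\n" + "".join(f + ".\n" for f in facts) + "end_of_list.\n"
    rule_block = "list(usable).\n" + "".join(r + ".\n" for r in rules) + "end_of_list.\n"
    footer = "% end of generated input\n"
    return header + data + rule_block + footer


def run_reasoner(input_text, output_path, binary="otter"):
    """Feed the input on stdin and send the reasoner's stdout to a file."""
    with open(output_path, "w") as out:
        # No check=True: assume the reasoner may exit non-zero on a normal stop.
        subprocess.run([binary], input=input_text, stdout=out, text=True)


# Assumed shape of a kept-clause line: "** KEPT (...): <id> [<justification>] <clause>."
KEPT_LINE = re.compile(r"^\*\* KEPT[^:]*:\s*\d+\s*\[[^\]]*\]\s*(.+?)\.\s*$")


def derived_facts(output_text):
    """Collect the clauses from all kept-clause lines of the reasoner output."""
    found = []
    for line in output_text.splitlines():
        match = KEPT_LINE.match(line)
        if match:
            found.append(match.group(1))
    return found
```

A typical call sequence would be build_input(...), then run_reasoner(text, "out.txt"), then derived_facts(open("out.txt").read()). Check the regular expression against your actual output file before relying on it.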
Creating the ruleset
The main task of the lab is the ruleset creation. This requires creative thinking and experimentation with the reasoner.
Start by downloading and installing Otter version 3.3 and experimenting with it a bit.
It is a good idea to use the ITV0060 lab 2 example for experimentation: this file contains suitable settings. Use the Otter manual for additional details and settings.
The ruleset, as said, should contain two parts:
(a) Your own ruleset. This should improve the confidence of already derived facts by combining them, derive completely new facts, or both.
(b) A ruleset generated from the WordNet taxonomy database. This should generate generalisations (hypernyms, in WordNet terminology); see, for example, the hypernyms of car. From the fact that X is a car you should derive that it is a motor_vehicle, a vehicle, a physical_entity, etc.
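As a rough illustration of part (b), the sketch below turns a hypernym chain into Otter clauses of the form -car(x,y) | motor_vehicle(x,y). The small lookup table is a hand-written, abbreviated stand-in for a real WordNet query (the actual chain for car has more intermediate levels). The second argument y is an assumed slot that carries the confidence number through unchanged. The function name is hypothetical.

```python
# Abbreviated, hand-written stand-in for a WordNet hypernym lookup (assumption).
HYPERNYM = {
    "car": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "physical_entity",
}


def hypernym_rules(word, parent_of):
    """Return one clause per step of the hypernym chain starting at word."""
    rules = []
    seen = set()  # guards against a cyclic lookup table
    while word in parent_of and word not in seen:
        seen.add(word)
        parent = parent_of[word]
        # x is the object, y the confidence passed on as is (assumed encoding).
        rules.append(f"-{word}(x,y) | {parent}(x,y).")
        word = parent
    return rules


if __name__ == "__main__":
    for rule in hypernym_rules("car", HYPERNYM):
        print(rule)
```

Generating one clause per step, rather than one clause per ancestor, keeps the rule block small and lets the reasoner chain the steps itself.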