ITV0060 third lab 2013
The main goal of the third lab is to extend the data obtained during the second (API/web-scraping) lab using rules and a reasoning engine, i.e. to generate new information about an object that is not present in the original data. This could mean things like:
- the possible tags (i.e. uncertain types) of the object, their relevance (relative importance), and the trustworthiness of the given relevance measure.
- the relation of the object to some other objects, again with some measure of the strength or trustworthiness of the relation.
As a basis, you should use the API/web-scraped data collected from public sources using search engines. You will likely:
- do simple text analysis/statistics first
- write the results of the analysis/statistics out as facts
- finally, employ your own rules to derive new facts about the tags or relations
In some cases it may be useful to take rules from the WordNet taxonomy to obtain generalisations (hypernyms in WordNet terminology) of derived facts; see the car hypernym. For example, from the fact that X is a car you should derive that it is a motor_vehicle, vehicle, physical_entity etc.
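The first two steps above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the function name text_to_facts, the fact format tag(object, word, confidence) and the use of an integer percentage as the confidence number are all hypothetical choices, not requirements of the lab. The c_ prefix is there because, by default, Otter reads names starting with u through z as variables in clauses (check the Otter manual), so a word such as "wheel" would otherwise not be treated as a constant.

```python
import re
from collections import Counter

def text_to_facts(object_id, text, top_n=3):
    """Count words in scraped text and write the most frequent ones out as facts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    facts = []
    for word, n in counts.most_common(top_n):
        # Assumption: confidence is the word's share of all words, as an integer percentage.
        confidence = round(100 * n / total)
        # Assumption: hypothetical fact format tag(object, c_word, confidence).
        facts.append(f"tag({object_id},c_{word},{confidence}).")
    return facts

facts = text_to_facts("obj1", "car car engine car wheel engine")
for fact in facts:
    print(fact)
```

A real analysis would at least remove stop words and probably use a better relevance measure than raw frequency; the point here is only the shape of the output.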
Conditions
You can do the lab either alone or together with a fellow student. Three students per lab is not ok, though.
Technology (language, operating system etc.) is completely free, EXCEPT that you have to use Otter version 3.3 as the reasoning engine.
Ideally you should create both (a) your own rules for improving data / deriving new data and (b) a ruleset generated from the WordNet taxonomy.
The lab is graded considering (see below for details):
- your own rules: how good/sensible they are and what results you get for a selected domain of objects
- quality and interestingness of the overall results
What should the app do
Your system should:
- take as input the data obtained by API/web scraping and derive new facts, each augmented with a relevance and/or confidence number.
- optionally take as possible input one concrete object for which we want to derive new information (if the whole set is too large)
- for debugging: output the full set of data, including the derived and the raw data, with confidence numbers and the indicator raw/derived for each fact
- for final result: output the interesting/important new facts
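One possible way to keep the debug output and the final result apart is sketched below in Python. The Fact record, the tab-separated debug format and the confidence threshold used to decide what counts as "interesting" are all assumptions made for illustration; the lab does not prescribe any of them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    text: str        # the fact as written for the reasoner, e.g. "car(obj1)."
    confidence: int  # assumption: integer percentage, 0..100
    derived: bool    # False for raw scraped data, True for reasoner output

def debug_dump(facts):
    """Full data set: raw/derived indicator, confidence and the fact itself."""
    return [f"{'derived' if f.derived else 'raw'}\t{f.confidence}\t{f.text}" for f in facts]

def final_result(facts, min_confidence=50):
    """Only derived facts at or above a (hypothetical) confidence threshold."""
    return [f.text for f in facts if f.derived and f.confidence >= min_confidence]

sample = [
    Fact("car(obj1).", 80, False),
    Fact("motor_vehicle(obj1).", 80, True),
    Fact("boat(obj1).", 10, True),
]
print("\n".join(debug_dump(sample)))
print(final_result(sample))
```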
You should give a short presentation of your work.
The new facts for the object have to be derived using Otter and rules of your own making.
Subtasks of the lab
The lab is essentially split into these subtasks:
- the main task: create the ruleset suitable for your data and the reasoner
- if necessary, do additional statistics to be added to the facts
- format the output of api/web scraping/statistics so that it is suitable for the reasoner: basically the correct syntax.
- create the input file for the reasoner: just compose it from a header, data block, rule block and a footer.
- run the reasoner and send output to a debug file.
- filter out the interesting derived facts from the reasoner output.
- present the full resulting dataset of interesting facts.
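The composing, running and filtering subtasks above can be sketched in Python as follows. Several things here are assumptions and should be checked against your own setup: that an otter executable is on the PATH and reads its input from standard input, that the header and footer shown are only placeholders (the real settings should come from the course example file), and that kept clauses appear in the output on lines starting with "** KEPT" followed by a bracketed justification. The sample output line is hand-written for illustration, not copied from a real run.

```python
import subprocess

def compose_input(header, facts, rules, footer):
    """Join header, data block, rule block and footer into one input text."""
    parts = [header.rstrip("\n"), "% data block", *facts,
             "% rule block", *rules, footer.rstrip("\n")]
    return "\n".join(parts) + "\n"

def run_reasoner(input_text, debug_path, command=("otter",)):
    """Feed input_text to the reasoner on stdin and save its stdout to debug_path."""
    result = subprocess.run(list(command), input=input_text,
                            capture_output=True, text=True)
    with open(debug_path, "w") as debug_file:
        debug_file.write(result.stdout)
    return result.stdout

def filter_derived(output, predicates):
    """Return kept clauses whose predicate name is in `predicates`."""
    derived = []
    for line in output.splitlines():
        # Assumption: derived clauses are reported as "** KEPT ...: N [justification] clause".
        if not line.startswith("** KEPT") or "] " not in line:
            continue
        clause = line.split("] ", 1)[1].strip()
        if clause.split("(", 1)[0] in predicates:
            derived.append(clause)
    return derived

# Placeholder header/footer: take the real settings from the course example file.
HEADER = "% settings go here\nlist(sos)."
FOOTER = "end_of_list."

input_text = compose_input(HEADER, ["car(obj1)."], ["-car(x) | motor_vehicle(x)."], FOOTER)
sample_output = "** KEPT (pick-wt=2): 3 [hyper,2,1] motor_vehicle(obj1).\n"
interesting = filter_derived(sample_output, {"motor_vehicle"})
print(input_text)
print(interesting)
```

With Otter installed, run_reasoner(input_text, "debug.out") would produce the debug file, and its return value can be passed to filter_derived in place of sample_output.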
Creating the ruleset
The main task of the lab is the ruleset creation. This requires creative thinking and experimentation with the reasoner.
Start by downloading and installing Otter version 3.3 and experiment a bit.
It is a good idea to use the ITV0060 lab 2 example for experimentation: this file contains suitable settings. Use the Otter manual for additional details and settings.
As noted above, the ruleset may, for example, contain two parts:
(a) Your own ruleset.
(b) A ruleset generated from the WordNet taxonomy database. This should generate generalisations (hypernyms in WordNet terminology); see the car hypernym. For example, from the fact that X is a car you should derive that it is a motor_vehicle, vehicle, physical_entity etc.
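Part (b) could be generated along the lines of the Python sketch below. The HYPERNYM_OF table is a hand-written, abbreviated stand-in for the WordNet data (the real hypernym chain for car has more intermediate levels), and the rule format, one clause of the form -child(x) | parent(x) per taxonomy link, is just one plausible encoding. Confidence numbers are left out to keep the example short.

```python
# Assumption: abbreviated, hand-written child -> parent links, not real WordNet output.
HYPERNYM_OF = {
    "car": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "physical_entity",
}

def hypernym_rules(hypernym_of):
    """One clause per link: 'if x is a child then x is a parent'."""
    return [f"-{child}(x) | {parent}(x)." for child, parent in sorted(hypernym_of.items())]

def generalisations(term, hypernym_of):
    """Follow the links upwards; this is what the reasoner should derive from the rules."""
    chain = []
    seen = {term}
    while term in hypernym_of and hypernym_of[term] not in seen:
        term = hypernym_of[term]
        seen.add(term)
        chain.append(term)
    return chain

rules = hypernym_rules(HYPERNYM_OF)
for rule in rules:
    print(rule)
print(generalisations("car", HYPERNYM_OF))
```

The generalisations function is only a cross-check for the expected result; in the lab itself the derivation has to be done by Otter from the generated rules.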