Teadmised second lab 2012

Source: Lambda

The main goal of the second lab is to extend the data obtained during the first (web-scraping) lab using rules and a reasoning engine: determine the type of an object and estimate its suitability as a tourist attraction.

As a basis, you should web-scrape uncertain data from public sources using search engines and do simple text analysis/statistics. Then write the results of the analysis/statistics out as facts. Finally, employ your own rules to derive new facts about the type and suitability of the object.

Conditions

You can do the lab either alone or together with one fellow student. Groups of three are not OK, though.

The choice of technology (language, operating system, etc.) is completely free, EXCEPT that you have to use Otter version 3.3 as the reasoning engine.

Ideally you should create both (a) your own rules for improving data / deriving new data and (b) rules generated from the WordNet taxonomy to obtain generalisations from derived facts.

The lab is graded considering (see below for details):

  • your own rules: how good/sensible they are and what results you get for a selected domain of objects
  • rules generated from the WordNet database
  • quality and interestingness of the overall results

You may skip writing your own rules (a) or the WordNet rules (b), but in that case the grade for the lab will be lower.

What the app should do

Your system should take as input the data obtained by web scraping for some object (say, "Statue of Liberty" or "Andrus Ansip") and derive new facts, each augmented with a confidence number.

The new facts for the object have to be derived using Otter and rules of your own making.

As a result the full app should

  • take a phrase like "Statue of Liberty" as input
  • scrape raw data from the web for the phrase (use the first lab you have completed)
  • derive new data
  • for debugging: output the full set of data, including both the derived and the raw data, with a confidence number and a raw/derived indicator for each fact
  • for the final result: output the interesting/important new facts, i.e. the tourist suitability info and the type info with high confidence.
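The two output modes above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Fact` record, the predicate names, the 0.0 to 1.0 confidence scale and the 0.7 threshold are all hypothetical choices, not part of the lab specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    predicate: str     # hypothetical predicate name, e.g. "monument"
    subject: str       # the input phrase, e.g. "Statue of Liberty"
    confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (certain)
    derived: bool      # False for raw scraped facts, True for reasoner output


def debug_lines(facts):
    """Full dump: every fact with its raw/derived indicator and confidence."""
    return [
        f"{'derived' if f.derived else 'raw'}\t{f.confidence:.2f}\t"
        f"{f.predicate}({f.subject})"
        for f in facts
    ]


def final_facts(facts, interesting, threshold=0.7):
    """Keep only derived facts about interesting predicates with high confidence."""
    return [
        f for f in facts
        if f.derived and f.predicate in interesting and f.confidence >= threshold
    ]


if __name__ == "__main__":
    # Made-up sample data, only to show the two output modes.
    sample = [
        Fact("mentions_visit", "Statue of Liberty", 0.6, False),
        Fact("monument", "Statue of Liberty", 0.9, True),
        Fact("tourist_suitable", "Statue of Liberty", 0.5, True),
    ]
    print("\n".join(debug_lines(sample)))
    for fact in final_facts(sample, {"monument", "tourist_suitable"}):
        print("RESULT:", fact.predicate, fact.confidence)
```

Keeping the raw/derived flag on each fact means the debug dump and the final report can be produced from the same list.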

Subtasks of the lab

The lab is essentially split into these subtasks:

  • the main task: create the ruleset suitable for your data and the reasoner
  • format the output of web scraping/statistics so that it is suitable for the reasoner: basically the correct syntax.
  • create the input file for the reasoner: just compose it from a header, data block, rule block and a footer.
  • run the reasoner and send output to a debug file.
  • filter out the interesting derived facts from the reasoner output.
  • present the full resulting dataset of interesting facts.
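The plumbing subtasks in the list above (formatting, composing the input file, running the reasoner, filtering) might look roughly like the Python sketch below. Several things here are assumptions rather than course requirements: the header flags are placeholders (the settings in the course example file take precedence), facts are encoded as `predicate(object,confidence)` with an integer percentage, the `o_` prefix is a hypothetical convention to keep constants from starting with u-z (which Otter reads as variables in clauses by default), and the `** KEPT` line format should be checked against real Otter output before relying on it.

```python
import re
import subprocess

# Placeholder header/footer; replace with the settings from the course example.
HEADER = "set(hyper_res).\nset(print_kept).\nlist(sos).\n"
FOOTER = "end_of_list.\n"


def otter_name(phrase, prefix="o_"):
    """Turn a phrase into a lower-case constant; the prefix is a hypothetical
    convention that avoids names starting with u-z."""
    slug = re.sub(r"[^a-z0-9]+", "_", phrase.lower()).strip("_")
    return prefix + slug


def format_fact(predicate, phrase, confidence_pct):
    """Assumed encoding: predicate(object,confidence) with an integer percent."""
    return f"{predicate}({otter_name(phrase)},{int(confidence_pct)})."


def compose_input(facts, rules):
    """Header, data block, rule block, footer ('%' starts an Otter comment)."""
    return (
        HEADER
        + "% data block\n" + "\n".join(facts) + "\n"
        + "% rule block\n" + "\n".join(rules) + "\n"
        + FOOTER
    )


def run_otter(input_text, debug_path="otter_debug.txt", binary="otter"):
    """Feed the input on stdin and save stdout to a debug file. The exit code
    is not checked: Otter is assumed to use non-zero codes even on normal
    termination."""
    result = subprocess.run([binary], input=input_text,
                            capture_output=True, text=True)
    with open(debug_path, "w") as debug_file:
        debug_file.write(result.stdout)
    return result.stdout


# Assumed shape of a kept-clause line in the output.
KEPT = re.compile(r"^\*\* KEPT.*?\]\s*(\w+)\((\w+),\s*(\d+)\)\.\s*$")


def derived_facts(output, interesting):
    """Return (predicate, object, confidence) for interesting derived facts."""
    found = []
    for line in output.splitlines():
        match = KEPT.match(line)
        if match and match.group(1) in interesting:
            found.append((match.group(1), match.group(2), int(match.group(3))))
    return found
```

A typical call chain would be `derived_facts(run_otter(compose_input(facts, rules)), {...})`, with the set naming whichever predicates count as interesting in your domain.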

Creating the ruleset

The main task of the lab is the ruleset creation. This requires creative thinking and experimentation with the reasoner.

Start by downloading and installing Otter version 3.3, then experiment with it a bit.

It is a good idea to use the ITV0060 lab 2 example for experimentation: this file contains suitable settings. Use the Otter manual for additional details and settings.

The ruleset, as said, should contain two parts:

(a) Your own ruleset. This should improve the confidence of already derived facts by combining them, derive completely new facts, or both.
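How confidences are combined is left to you. As one possible scheme, and purely as an assumption (the lab does not prescribe a formula), the "noisy-or" combination treats several pieces of evidence for the same fact as independent and computes 1 - (1 - c1)(1 - c2)...; whether this arithmetic lives in the rules or in post-processing is a design choice. A minimal Python sketch of the formula:

```python
from math import prod


def noisy_or(confidences):
    """Combine independent confidences (each in 0.0..1.0) for the same fact.

    Assumed scheme: 1 - product of (1 - c). Extra evidence never lowers
    the result, and an empty list gives 0.0.
    """
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
    return 1.0 - prod(1.0 - c for c in confidences)
```

For instance, two sources supporting the same fact with confidences 0.6 and 0.5 combine to 1 - 0.4 * 0.5 = 0.8 under this scheme.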

(b) Ruleset generated from the WordNet taxonomy database. This should produce generalisations (hypernyms, in WordNet terminology); see the car hypernym. For example, from the fact that X is a car you should derive that it is a motor_vehicle, a vehicle, a physical_entity, etc.
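Generating the hypernym rules can be sketched in Python as below. The `HYPERNYMS` table is a hand-typed, abbreviated excerpt for illustration only, not read from the WordNet database; a real solution would load the edges from the WordNet files. The clause shape `-child(x,y) | parent(x,y).` assumes a hypothetical encoding in which the second argument carries the confidence through unchanged.

```python
import re

# Hand-typed, abbreviated excerpt (assumption): child lemma -> direct hypernyms.
HYPERNYMS = {
    "car": ["motor_vehicle"],
    "motor_vehicle": ["self-propelled_vehicle"],
    "self-propelled_vehicle": ["wheeled_vehicle"],
    "wheeled_vehicle": ["vehicle"],
}


def predicate_name(lemma):
    """Make a lemma usable as a predicate symbol: hyphens, spaces and other
    punctuation become underscores."""
    return re.sub(r"[^a-z0-9_]", "_", lemma.lower())


def hypernym_rules(hypernyms):
    """One clause per direct hypernym edge; longer chains such as
    car -> vehicle follow when the reasoner applies the clauses repeatedly."""
    rules = []
    for child, parents in sorted(hypernyms.items()):
        for parent in parents:
            rules.append(
                f"-{predicate_name(child)}(x,y) | {predicate_name(parent)}(x,y)."
            )
    return rules


if __name__ == "__main__":
    print("\n".join(hypernym_rules(HYPERNYMS)))
```

Emitting only direct edges keeps the rule block small; the transitive generalisations are then left to the reasoner instead of being precomputed.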