Details of the second lab: rule system
Main goal of the lab: write a rule set for tourism objects and use Otter or Prover9 to derive a list of recommended tourism objects, where each object gets an associated interest number. The tourism object information should be read from RDFa.
The lab builds upon your first RDFa lab.
What will you have to do
Concretely, you will have to:
- Find RDFa pages with touristically interesting objects: they should contain some relevant information about these objects.
- The recommended site is http://dbpedia.org/snorql/
- Try this query (use the Results: XML selection from the box below the query window!)
SELECT ?x WHERE { ?x dbpedia-owl:location <http://dbpedia.org/resource/Berlin>. }
- Try similar queries for other cities (London, etc.)
- Try a more limited query (again, use the Results: XML selection from the box below the query window!)
SELECT ?x ?p ?v WHERE { ?x dbpedia-owl:location <http://dbpedia.org/resource/Berlin>. ?x rdf:type <http://dbpedia.org/ontology/Building>. } ORDER BY ?x
- Your program will have to read all the links in the result listing and scan the corresponding pages for RDFa (example: http://dbpedia.org/page/Palace_of_the_Republic ); see the sketch after this list.
- Your program should be able to work with several cities
- Look at these RDFa pages and invent/create sensible small rule files in Otter syntax for deriving tourist interest data from the RDFa data actually present on the pages.
- Add example data about the interests of a concrete tourist to the file.
- Write a program which takes your first lab's output and your rule/data file and builds a new Otter syntax file containing both the rules and the data obtained from the RDFa page.
- Then let the same program call Otter with the newly built file and send the result to a result file.
- Write a program which takes the output file, parses out the derived interest facts and stores them in either a CSV or an RDFa format file.
- Finally, print out the file.
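For the query and link-reading steps above, a minimal sketch of fetching the result list programmatically could look like the following (Python is assumed here just as an example; any language is fine). It sends the second query above to the public DBpedia SPARQL endpoint and prints the returned resource URIs; the endpoint URL, the format parameter and the XML handling follow the standard SPARQL results format, and the function name is only illustrative. Each returned /resource/ URI corresponds to a /page/ HTML page carrying the RDFa.

import urllib.request, urllib.parse
import xml.etree.ElementTree as ET

SPARQL_NS = "{http://www.w3.org/2005/sparql-results#}"

def city_buildings(city):
    # same query as above, with the prefixes written out explicitly
    query = """
        PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        SELECT ?x WHERE {
          ?x dbpedia-owl:location <http://dbpedia.org/resource/%s>.
          ?x rdf:type <http://dbpedia.org/ontology/Building>.
        } ORDER BY ?x
    """ % city
    url = ("http://dbpedia.org/sparql?"
           + urllib.parse.urlencode({"query": query,
                                     "format": "application/sparql-results+xml"}))
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    # the result XML contains one <uri> element per returned resource
    return [uri.text for uri in tree.iter(SPARQL_NS + "uri")]

if __name__ == "__main__":
    for link in city_buildings("Berlin"):
        print(link)    # e.g. http://dbpedia.org/resource/...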
You may write just one program for all these tasks or several different ones. In the latter case you will also have to write a "master" program (as a shell script or a bat file, for example) that calls these subprograms.
The final resulting program should take a page URL or a file name as a single input and produce a list of relevant derived facts as output.
Example data/rule set
This data/rule set is just a basic example: you should extend it significantly and make it correspond to real tourism object data.
set(hyper_res).
%set(binary_res).
set(factor).
set(print_kept).
%set(prolog_style_variables).

formula_list(sos).

% --- city facts ----

fact(rdf_type,tallinn_Bogapott,tourism_cafe,100).
fact(rdf_type,tallinn_vonkrahl,tourism_pub,100).
fact(tourism_pop,tallinn_Bogapott,50,100).
fact(tourism_pop,tallinn_vonkrahl,90,100).
fact(rdf_type,tallinn_oleviste,tourism_church,100).
fact(rdf_type,tallinn_vabaduserist,tourism_memorial,100).
fact(rdf_type,tallinn_xcafe,tourism_cafe,70).

% ---- city rules ----

all X all U (fact(rdf_type,X,tourism_cafe,U) & $GT($DIV($PROD(U,60),100),50)
  -> fact(rdf_type,X,tourism_drinkinghole, $DIV($PROD(U,60),100))).
all X all U (fact(rdf_type,X,tourism_pub,U) -> fact(rdf_type,X,tourism_drinkinghole,U)).
all X all U (fact(rdf_type,X,tourism_church,U) -> fact(rdf_type,X,tourism_architecture,U)).
all X all U (fact(rdf_type,X,tourism_memorial,U) & $GT($DIV($PROD(U,80),100),50)
  -> fact(rdf_type,X,tourism_architecture, $DIV($PROD(U,80),100))).

% ---- interest application rules ------

% with popularity info
all P all X all Y all U all V all W all M
  (fact(interest,P,Y,U) & fact(rdf_type,X,Y,V) & fact(tourism_pop,X,W,M)
   -> fact(interest,P,X, $DIV($PROD(W,$DIV($PROD(U,V),100)),100)) ).

% without popularity info
all P all X all Y all U all V all W all M
  (fact(interest,P,Y,U) & fact(rdf_type,X,Y,V)
   -> fact(interest,P,X, $DIV($PROD(90,$DIV($PROD(U,V),100)),100)) ).

% ----------- person description ------

fact(interest,p1,tourism_architecture,80).
fact(interest,p1,tourism_drinkinghole,60).

% --- want to get something like this ---

%** KEPT (pick-wt=5): 20 [hyper,13,15,18,demod] fact(interest,p1,tallinn_Bogapott,32).
%** KEPT (pick-wt=5): 21 [hyper,13,15,16,demod] fact(interest,p1,tallinn_vonkrahl,54).
%** KEPT (pick-wt=5): 22 [hyper,13,14,19,demod] fact(interest,p1,tallinn_vabaduserist,57).
%** KEPT (pick-wt=5): 23 [hyper,13,14,17,demod] fact(interest,p1,tallinn_oleviste,72).

end_of_list.
Suggestions and links
How to prepare the input file: probably the simplest way is to prepare separate temporary files for the RDFa output in Otter format, the Otter file header, the Otter file bottom part and the rule file, and then concatenate these files together.
It is significantly harder (and unnecessary) to build a full Otter format input string inside one program than it is to create separate files and concatenate them.
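Along these lines, a minimal sketch of the concatenation step plus the Otter call might look like this (the part file names below are made up for illustration, and the otter binary is assumed to be on the PATH; Otter reads its input from standard input):

import subprocess

# assumed part files: the header with the settings, the facts produced from
# the RDFa output of the first lab, the hand-written rules, and the footer
# containing end_of_list
PARTS = ["header.txt", "facts.txt", "rules.txt", "footer.txt"]

def build_input(out_name="combined.in"):
    with open(out_name, "w") as out:
        for part in PARTS:
            with open(part) as f:
                out.write(f.read())
                out.write("\n")
    return out_name

def run_otter(in_name, out_name="otter.out"):
    # equivalent to: otter < combined.in > otter.out
    with open(in_name) as inp, open(out_name, "w") as out:
        subprocess.run(["otter"], stdin=inp, stdout=out)
    return out_name

if __name__ == "__main__":
    run_otter(build_input())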
As noted, probably the simplest way to create the required system is to write several small programs/scripts for the subtasks and then write a master shell script or bat file that calls all these programs one by one, using their output files as input to the following programs.
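For illustration, such a master program could itself stay very small; in the sketch below the subprogram and file names are made-up placeholders, to be replaced with whatever your own scripts are called:

import subprocess, sys

city_or_url = sys.argv[1]   # single input: a page URL or a file name

# hypothetical subprograms, called one by one on each other's output files
subprocess.run(["python", "fetch_rdfa.py", city_or_url, "facts.txt"], check=True)
subprocess.run(["python", "build_and_run_otter.py", "facts.txt", "otter.out"], check=True)
subprocess.run(["python", "extract_facts.py", "otter.out", "result.csv"], check=True)

# finally print out the result file
print(open("result.csv").read())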
How to get the derived facts from the Otter output file? Just look for rows with a string indicating that this is a derived fact (for example, rows containing KEPT) and cut out the relevant part. Pure string manipulation, in other words. Concrete suggestions for the plan (a small sketch follows the list):
- start processing from an output row containing == SEARCH ==
- use rows starting with kept and cut out the relevant part after the clause number, up to the period .
- stop processing when you reach a row starting with ==
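A small sketch of this parsing step, following the plan above (file names are placeholders, and the exact separator and kept-line strings can differ between Otter and Prover9 versions, so check your own output first):

import csv

def extract_interest_facts(otter_output="otter.out", result_csv="result.csv"):
    rows = []
    in_search = False
    with open(otter_output) as f:
        for line in f:
            line = line.strip()
            if line.startswith("==") and "SEARCH" in line:
                in_search = True          # start of the search section
                continue
            if in_search and line.startswith("=="):
                break                     # end of the search section
            if in_search and "kept" in line.lower() and "fact(" in line:
                # cut out the fact(...) literal up to the first period
                fact = line[line.index("fact("):].split(".", 1)[0]
                if fact.startswith("fact(interest,"):
                    # fact(interest,p1,tallinn_oleviste,72) -> p1, tallinn_oleviste, 72
                    rows.append(fact[len("fact("):-1].split(",")[1:])
    with open(result_csv, "w", newline="") as out:
        csv.writer(out).writerows(rows)

if __name__ == "__main__":
    extract_interest_facts()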
Which strategy/output preferences to use? A good choice is to put the following rows at the beginning of the input file:
clear(auto).
set(print_kept).
set(hyper_resolution).
clear(binary_resolution).
set(initial_nuclei).
Otter prover:
- the old version http://www.cs.unm.edu/~mccune/otter/
- the new version (Prover9) http://www.cs.unm.edu/~mccune/prover9/
Download, experiment, then read documentation:
- http://www.cs.unm.edu/~mccune/otter/Otter33.pdf
- http://www.cs.unm.edu/~mccune/mace4/manual/January-2007/
See also: