KR 2024 homework part 1
You will have to write a small program to convert/filter data to a suitable format for experimenting with a logical reasoner, and then write about ten nontrivial rules for answering questions. Finally, experiment with the reasoner on this data + rules and make a small presentation of your data, rules, successes and failures.
The presentation does not have to be made with special slides and a long writeup (although this is also OK). You are expected to show the data, rules and results and briefly explain your ideas, experiments and the work process. Just using an editor for showing this stuff is OK. Best if you can also make a small demo with the reasoner actually running and proving some of the queries.
Not everything you try out needs to work: some failures are also interesting.
We award a few extra points for adventurous people adding extra data into the fray; see the last section about optional stuff.
Data
Riina took the large datasets from https://yago-knowledge.org/ and prepared relatively small selections (all about something Estonian) from this data. The link to the files is here. The data is presented as plain triplets, no fancy encoding (except namespaces) is involved.
First, please download the files and explore them a bit. Then convert the files to the format our reasoner gk (or its web version http://logictools.org) can eat. Write a program for this! Trying to just replace/filter stuff using an editor will become very cumbersome soon.
For example, the triplet
yago:Lake_Peipsi rdf:type yago:Lake
can be converted to
triple(yago:Lake_Peipsi,rdf:type,yago:Lake).
or with the quotation marks
triple("yago:Lake_Peipsi","rdf:type","yago:Lake").
Notice that I have not bothered to replace the abbreviations of namespaces like yago, rdf etc with full names http://yago-knowledge.org/resource/ and http://www.w3.org/1999/02/22-rdf-syntax-ns#: the abbreviations make the symbols unique enough for our purposes.
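A minimal sketch of such a converter in Python, producing the quoted variant shown above. It assumes one triplet per line with whitespace between the three parts and an optional trailing " ." (as in the label examples further down); the function names are made up for illustration. It also assumes that gk accepts backslash-escaped quotes inside double-quoted constants, which is worth checking on a small sample before converting everything.

```python
import json
import re


def parse_triple(line):
    """Split one data line into (subject, predicate, object); None if malformed."""
    line = re.sub(r"\s+\.$", "", line.strip())  # drop a trailing " ." if present
    parts = line.split(None, 2)  # the object may itself contain spaces
    return tuple(parts) if len(parts) == 3 else None


def strip_literal(term):
    """Turn '"Tartu"@es' or '"..."^^xsd:dateTime' into the bare text; ids pass through."""
    if term.startswith('"'):
        end = term.rfind('"')
        if end > 0:
            return term[1:end]
    return term


def to_clause(subj, pred, obj, name="triple"):
    """Render one triplet as a fact with all three arguments in quotation marks."""
    args = ",".join(
        json.dumps(strip_literal(t), ensure_ascii=False) for t in (subj, pred, obj)
    )
    return f"{name}({args})."


def convert(lines):
    """Yield one clause per well-formed input line, skipping blank or broken lines."""
    for line in lines:
        parsed = parse_triple(line)
        if parsed is not None:
            yield to_clause(*parsed)
```

To use it, pass an open file (or any list of lines) to convert and print or write out what it yields. Note that strip_literal throws away the language tag and the datatype of literals, so keep them in a fourth argument if your rules need them.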
The predicate triple is really arbitrary: you may use any name you like. You may also use the fourth argument for type, if you wish (like "string"/"integer"/"id", in case it helps). Importantly, there are subtle differences between writing constants like yago:Lake_Peipsi with or without quotation marks:
- Without quotation the reasoner does not assume that two different constants necessarily mean different objects. Saying yago:Lake_Peipsi = yago:Lake_Peipus does not lead to a contradiction. Also, non-ascii symbols are generally not ok for non-quoted constants.
- Quotation marks make the reasoner treat constants as strings, and thus "yago:Lake_Peipsi" = "yago:Lake_Peipus" is a contradiction. Also, non-ascii symbols are OK.
Observe that the datasets contain a huge amount of labels (i.e. human names) and comments in different languages, like
yago:Tartu rdfs:label "Tartu"@sh .
yago:Tartu rdfs:label "Tartu"@lfn .
yago:Tartu rdfs:label "Tartu"@es .
yago:Tartu schema:alternateName "Tartu"@ko .
yago:Tartu schema:alternateName "Tartu"@mhr .
yago:Tartu schema:alternateName "Jurjev"@cs .
yago:Tartu_Art_House rdfs:comment "museum di Estonia"@id .
yago:Tartu_Art_House rdfs:comment "Museum in Estland"@de .
You may safely remove all the comments and most (or all) names, except those which you really use in your questions and rules, if any. For example, you may want to keep just one rdfs:label for an object, without indicating a language (the @-stuff) at all.
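One possible way to do this filtering, sketched in Python. It works on already-split (subject, predicate, object) tuples, drops comments and alternate names, and keeps only the first rdfs:label it sees for each subject, with the language tag removed. The function name and the choice of "first label wins" are assumptions made for this sketch; you may well prefer a specific language instead.

```python
import re

# Matches a language tag at the end of a literal, e.g. the @sh in "Tartu"@sh
LANG_TAG = re.compile(r'"@[A-Za-z0-9-]+$')


def filter_triples(triples, drop=("rdfs:comment", "schema:alternateName")):
    """Yield triples without comments/alternate names and with one label per subject."""
    labelled = set()
    for subj, pred, obj in triples:
        if pred in drop:
            continue
        if pred == "rdfs:label":
            if subj in labelled:
                continue  # keep only the first label seen for this subject
            labelled.add(subj)
            obj = LANG_TAG.sub('"', obj)  # '"Tartu"@sh' becomes '"Tartu"'
        yield (subj, pred, obj)
```

The drop tuple is a parameter, so predicates you actually need in your rules can be kept by passing a shorter tuple.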
Rules and questions
The best way to try out rules and questions is to use http://logictools.org for quick experiments along with small selections of data. For example, try
triple("yago:Lake_Peipsi","rdf:type","yago:Lake").
triple("yago:Võrtsjärv","rdf:type","yago:Lake").
triple(X,Y,"yago:Lake") => $ans(X).
and you should get a single answer along with the trivial proof. To get more answers, click the "Advanced" button and copy { "max_answers": 2} into the strategy box. Beware that if you ask for more answers than there really are, the reasoner may just keep looking for more, eventually time out, and not give any answers at all.
Now let us try out something simple, but not totally trivial:
triple(yago:Jaan_Lepp, schema:birthPlace, yago:Anija_Parish).
triple(yago:August_Vaga, schema:birthPlace, yago:Kehra).
triple(yago:Silvi_Vrait, schema:birthPlace, yago:Kehra).
triple(yago:Karmen_Pedaru, schema:birthPlace, yago:Kehra).
triple(X,schema:birthPlace,Z) => bornin(X,Z).
bornin(X,Z) & bornin(Y,Z) => borninsameplace(X,Y).
borninsameplace(X,Y) & -$substr(X,Y) & -$substr(Y,X) => $ans(X,Y).
Put, say, { "max_answers": 3} into the strategy box to get more answers than one.
Here we took a small selection of data from one of the files and added a few rules. Notice the trick with the special builtin predicate $substr(K,Z): it treats the argument constants as strings and checks whether the first one is a substring of the second one. Thus -$substr(X,Y) & -$substr(Y,X) eliminates equal constants, avoiding answers like $ans('yago:August_Vaga','yago:August_Vaga'). It would also drop pairs where one id is contained in the other, which does not happen in this data. Alternatively, if we had used the quotation marks around person id-s, we could have used the rule
borninsameplace(X,Y) & X!=Y => $ans(X,Y).
instead.
Next, a slightly more interesting example (use { "max_answers": 6} or less):
triple("yago:Lake_Peipus","rdf:type", "yago:Lake").
triple("yago:Võrtsjärv","rdf:type","yago:Lake").
triple("yago:Veisjärv","rdf:type","yago:Lake").
triple("yago:Väike_Emajõgi", "yago:flowsInto", "yago:Võrtsjärv").
triple("yago:Tänassilma_River", "yago:flowsInto", "yago:Võrtsjärv").
triple("yago:Tarvastu_River", "yago:flowsInto", "yago:Võrtsjärv").
triple("yago:Võrtsjärv", "yago:flowsInto", "yago:Emajõgi").
triple("yago:Veisjärv","yago:flowsInto","yago:Õhne").
triple("yago:Õhne", "yago:flowsInto", "yago:Võrtsjärv").
triple("yago:Emajõgi", "yago:flowsInto", "yago:Lake_Peipus").
triple(X,"yago:flowsInto",Y) => boattravel(X,Y).
triple(X,"yago:flowsInto",Y) => boattravel(Y,X).
boattravel(X,Y) & boattravel(Y,Z) => boattravel(X,Z).
triple(X,"rdf:type","yago:Lake") & triple(Z,"rdf:type","yago:Lake") & boattravel(X,Y) & boattravel(Y,Z) & X!=Z => laketolake(X,Z).
laketolake(X,Y) => $ans(X,Y).
Interestingly, the Yago dataset uses two different id-s for Peipsi: yago:Lake_Peipus and yago:Lake_Peipsi with several additional triplets indicating equality of these id-s to some other id-s, but not to each other! The related datasets for these id-s are also different. Conceptually, this is an errorish thingy in Yago.
yago:Lake_Peipsi rdf:type yago:Lake
yago:Lake_Peipsi owl:sameAs wd:Q2627792
yago:Lake_Peipsi schema:sameAs "/g/1217z1dh"
yago:Lake_Peipus rdf:type yago:Lake
yago:Lake_Peipus owl:sameAs wd:Q19253
yago:Lake_Peipus schema:sameAs "/m/021yh2"
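When the reasoner returns fewer answers than expected (or times out), it helps to know how many answers there should be. The following Python sketch recomputes the lake-to-lake pairs of the boat travel example outside the reasoner: it treats yago:flowsInto as an undirected edge, as the two boattravel rules do, and collects ordered pairs of distinct lakes that can reach each other. It is only a cross-check written for this small example, not part of gk, and the function name is made up.

```python
FLOWS_INTO = [
    ("yago:Väike_Emajõgi", "yago:Võrtsjärv"),
    ("yago:Tänassilma_River", "yago:Võrtsjärv"),
    ("yago:Tarvastu_River", "yago:Võrtsjärv"),
    ("yago:Võrtsjärv", "yago:Emajõgi"),
    ("yago:Veisjärv", "yago:Õhne"),
    ("yago:Õhne", "yago:Võrtsjärv"),
    ("yago:Emajõgi", "yago:Lake_Peipus"),
]
LAKES = {"yago:Lake_Peipus", "yago:Võrtsjärv", "yago:Veisjärv"}


def lake_pairs(flows_into, lakes):
    """Return ordered pairs of distinct lakes connected by a chain of waterways."""
    neighbours = {}
    for a, b in flows_into:  # symmetric, like the two boattravel rules
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    pairs = set()
    for start in lakes:
        reached, stack = {start}, [start]
        while stack:  # depth-first search plays the role of the transitivity rule
            node = stack.pop()
            for nxt in neighbours.get(node, ()):
                if nxt not in reached:
                    reached.add(nxt)
                    stack.append(nxt)
        pairs.update((start, other) for other in lakes & reached if other != start)
    return pairs


if __name__ == "__main__":
    for pair in sorted(lake_pairs(FLOWS_INTO, LAKES)):
        print(pair)
```

All three lakes are connected here, so there are six ordered pairs, which is why asking for more than six answers in the example makes the reasoner keep searching.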
Ideas for writing your own rules
The main recommendation is thinking what nontrivial things we could ask from the dataset and then looking whether it actually contains sufficient data for this. If you have an idea, try it out with small examples first.
Try to combine different rules requiring the generation of longer nontrivial proofs for answers.
Some ideas, including somewhat crazy ones:
- Categorizing objects into larger categories, using the transitivity of rdf:type (you have to write your own rule for the transitivity). You may also (strictly optional) convert yago-taxonomy.txt to rules and use this for larger categories.
- What city/town/lake/island is inside what, according to located-in-estonia.txt. What can you derive from this, say, using transitivity of inside?
- Who was born where and how could they potentially travel with the boat to a place another person was born :)
- Finding data from people-in-estonia.txt for persons active in created-in-estonia.txt. Finding people born nearby active in the same movies. Etc.
- Convert birth dates like yago:Endel_Ruberg schema:birthDate "1917-05-21T00:00:00Z"^^xsd:dateTime to simple integers and use the built-in arithmetic of the reasoner (see "About" -> "Special symbols and additional constructions" in logictools) to filter out suitable people. Who is surely not alive any more? Who are roughly of the same age and have something more in common?
- Converting triplets in the dataset triple(a,b,c) where the first and second argument should conceptually strictly determine the third (like when c is a birthdate) to equalities like b(a)=c instead and then using terms and equalities in your rules for such cases.
- Generating trips with actual paths shown in answers: you will need function term arguments in your predicates for this, see the example from the first lecture:
railroad(tallinn,tapa).
railroad(tapa,tartu).
highway(tallinn,virtsu).
sealane(virtsu,kuivastu).
highway(kuressaare,kuivastu).
railroad(X,Y) => railroad(Y,X).
highway(X,Y) => highway(Y,X).
sealane(X,Y) => sealane(Y,X).
railroad(X,Y) => easytravel(X,Y,use(train,X,Y)).
highway(X,Y) => easytravel(X,Y,use(bus,X,Y)).
sealane(X,Y) => easytravel(X,Y,use(ship,X,Y)).
easytravel(X,Y,P1) & easytravel(Y,Z,P2) => easytravel(X,Z,combine(P1,P2)).
easytravel(tartu,X,Y) & spacity(X) => $ans(X,Y).
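For the birth date idea in the list above, here is a small Python sketch of the conversion step. It turns an xsd:dateTime literal into an integer of the form YYYYMMDD, so that ordinary integer comparison orders people by birth date, and renders it as a fact. The predicate name birthdate and the function names are made up for illustration, and only four-digit non-negative years are handled.

```python
import re

# Year, month and day at the start of a literal like "1917-05-21T00:00:00Z"^^xsd:dateTime
DATE = re.compile(r'^"(\d{4})-(\d{2})-(\d{2})T')


def date_to_int(literal):
    """Return the date as an integer YYYYMMDD, or None if the literal is not a date."""
    match = DATE.match(literal)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return year * 10000 + month * 100 + day


def birth_clause(person, literal, name="birthdate"):
    """Render a fact with a quoted person id and an unquoted integer date."""
    value = date_to_int(literal)
    if value is None:
        return None
    return f'{name}("{person}",{value}).'
```

For the example above this produces birthdate("yago:Endel_Ruberg",19170521). If only the year matters for your rules, dividing the integer by 10000 before writing it out keeps the numbers smaller.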
Using large datasets: logictools vs command line reasoner gkc
Logictools is unable to eat very large datasets and the options to control it are limited. Also, experimenting only on the web may be cumbersome.
Thus I'd recommend downloading the full-powered version of the reasoner used in logictools from https://github.com/tammet/gkc : it is best to use the somewhat older release https://github.com/tammet/gkc/releases/tag/v0.6.0 which has the same codebase as logictools. Just take the provided Windows, Linux or macOS binary: no need to compile it yourself.
A rather thorough tutorial of gkc with examples is README.md here: https://github.com/tammet/gkc/blob/master/Examples/README.md
Optional: additional datasets
In case you are adventurous, try to use additional datasets and combine these with the ones provided by Riina! For example, the Yago stuff contains triplets like
yago:Lake_Peipsi owl:sameAs wd:Q2627792
with the "wd:" defined above as "http://www.wikidata.org/entity/", thus leading to pages like http://www.wikidata.org/entity/Q2627792 . Now, looking at the source of this page, you find rows like <link rel="alternate" href="https://www.wikidata.org/wiki/Special:EntityData/Q2627792.nt" type="application/n-triples">
with the link
https://www.wikidata.org/wiki/Special:EntityData/Q2627792.nt leading to a large set of triplets related to Lake Peipsi! Most of these represent some specific complex encodings of this wikidata web page itself, but several are actually about Peipsi.
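The lookup described above can be scripted. The Python sketch below builds the N-Triples link from a wd: id found in an owl:sameAs triplet, downloads it, and keeps only the lines whose subject is the entity itself, as a rough heuristic for "actually about Peipsi". The function names and the User-Agent string are made up for this sketch, and the subject filter is an assumption about which lines are useful, so look at the raw file as well before trusting it.

```python
import urllib.request

WD_PREFIX = "http://www.wikidata.org/entity/"


def entity_data_url(wd_id):
    """Map an id like 'wd:Q2627792' to the N-Triples link of that entity."""
    if not wd_id.startswith("wd:"):
        raise ValueError(f"not a wd: id: {wd_id}")
    return f"https://www.wikidata.org/wiki/Special:EntityData/{wd_id[3:]}.nt"


def fetch_nt(wd_id):
    """Download the N-Triples file and return its lines (needs network access)."""
    request = urllib.request.Request(
        entity_data_url(wd_id),
        headers={"User-Agent": "kr-homework-sketch/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8").splitlines()


def about_entity(nt_lines, wd_id):
    """Keep only the lines whose subject is the entity itself."""
    subject = f"<{WD_PREFIX}{wd_id[3:]}> "
    return [line for line in nt_lines if line.startswith(subject)]
```

A typical use would be about_entity(fetch_nt("wd:Q2627792"), "wd:Q2627792"). The resulting lines use full URLs in angle brackets instead of the abbreviated namespaces, so they need their own conversion step before the reasoner can use them.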
A list of well-known and widely used large "generic" fact and knowledge bases can be found at http://lambda.ee/wiki/Teadmiste_formaliseerimine#Block_3:_General-knowledge_databases . You may want to have a look at these, but beware: they take time to understand and process. Another cool source is Kaggle.