Teadmised course: first lab (2011)
The main goal of the first lab is to web-scrape uncertain data from public sources using search engines and simple text analysis/statistics.
Conditions
You can do the lab either alone or together with one fellow student. Three students per lab is not OK, though.
Technology - language, operating system, etc. - is completely up to you.
What should the app do
Your system should take as input a list of arbitrary investigated objects, like "Tower Bridge", "Statue of Liberty", "Tallinna raekoda", "Olde Hansa", "Tallinn University of Technology", "Andrus Ansip", "Ford Focus", "Inception", and use web search and web-scraping to obtain probabilistic data about the objects. We look for
- popularity (search hits)
- object type (building, church, tower, landmark, person, car, ...)
- creation year (building year, birthdate, year of introduction to market, ...)
- other often-used characteristic words/tags (steeple, national, medieval, academic, peaminister, thriller, ...)
The data should be augmented with a numeric confidence measure (0...1) indicating rough trust in the result. The confidence is mostly based on the percentage of investigated pages/sources giving this result, as well as on the perceived quality of the data source (e.g. Wikipedia has high quality, arbitrary pages low).
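One way to read the previous paragraph as code is to treat confidence as the quality-weighted share of sources agreeing on a value. The sketch below is a minimal Python illustration; the quality weights and the combining formula are assumptions, not a required design.

    # Assumed quality weights per source kind; tune to taste.
    SOURCE_QUALITY = {"wikipedia": 1.0, "wolframalpha": 0.9, "web": 0.4}

    def confidence(supporting_sources, all_sources):
        # supporting_sources: sources that reported this value
        # all_sources: every source that was checked for it
        supporting = sum(SOURCE_QUALITY.get(s, 0.4) for s in supporting_sources)
        total = sum(SOURCE_QUALITY.get(s, 0.4) for s in all_sources)
        return supporting / total if total else 0.0

    # e.g. Wikipedia and one of two arbitrary pages agree on "built in 1894":
    print(confidence(["wikipedia", "web"], ["wikipedia", "web", "web"]))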
In order to get the list of pages to web-scrape, you should do several searches, selecting from or using in parallel (see the Wikipedia sketch after this list):
- Wikipedia search
- Wolfram Alpha search
- Google or Bing (or some other) web search
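For the Wikipedia leg, a minimal sketch using the public MediaWiki search API could look as follows; the totalhits field doubles as a crude popularity count, and the client name in the User-Agent header is a made-up placeholder.

    import json
    import urllib.parse
    import urllib.request

    def wikipedia_search(phrase, limit=5):
        params = urllib.parse.urlencode({
            "action": "query", "list": "search", "srsearch": phrase,
            "srlimit": limit, "format": "json",
        })
        req = urllib.request.Request(
            "https://en.wikipedia.org/w/api.php?" + params,
            headers={"User-Agent": "lab-scraper/0.1"},  # placeholder client name
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        hits = data["query"]["searchinfo"]["totalhits"]      # popularity hint
        titles = [hit["title"] for hit in data["query"]["search"]]
        return hits, titles

    print(wikipedia_search("Tower Bridge"))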
The results should be scraped. Use a different scraping method for semi-structured data (Wikipedia, Wolfram Alpha) and for unstructured data (arbitrary web pages). For the latter it makes sense to split the page into sentences (just use the period "." as a separator) and look for sentences containing the investigated phrase. What other words do those sentences contain? What numbers do they contain?
After scraping N pages for the same phrase, look for the most frequently occurring type words, year numbers, and other tags; a minimal aggregation sketch follows.
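A minimal sketch of this aggregation, assuming the N pages have already been downloaded as plain-text strings: among sentences mentioning the phrase, pick the most frequent year and report the fraction of pages backing it as the confidence.

    import re
    from collections import Counter

    def most_likely_year(pages, phrase):
        votes = Counter()
        for text in pages:
            years = set()
            for sentence in text.split("."):          # crude sentence split
                if phrase.lower() in sentence.lower():
                    years.update(re.findall(r"\b1[0-9]{3}\b|\b20[0-9]{2}\b", sentence))
            votes.update(years)                       # one vote per page per year
        if not votes:
            return None, 0.0
        year, hits = votes.most_common(1)[0]
        return year, hits / len(pages)                # confidence = share of pages

    year, conf = most_likely_year(["Tower Bridge was built in 1894."], "Tower Bridge")
    print(year, conf)                                 # -> 1894 1.0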
Hints and suggestions
While developing, it is easier to use a pre-stored search result/Wikipedia page and a pre-stored set of pages instead of doing a full search and download each time.
However, the system should finally work with any phrase, meaning that in the normal case the pre-loaded data cannot be used.
Interesting links and ideas
Sophisticated systems:
- NELL (CMU's never-ending language learning system)
- Peter Norvig's article on statistics-based applied artificial intelligence
Scraping frameworks:
Other:
- http://www.cs.uic.edu/~liub/WebMiningBook.html
- http://www.frenchlane.com/Web-Content-Mining-4.pdf
- http://www.cs.uic.edu/~liub/teach/cs511-spring-06/CS511-opinion-synthesis.ppt
- http://www.cato.org/pubs/pas/pa584.pdf
How to start
None of these suggestions is mandatory, but they help if you are confused:
- A programming language you know, e.g. Java or Python.
- Build an application that downloads a Google search result page. There are three possible variants; prefer whichever seems simplest to you (the order below is the likely order of simplicity):
- plain HTML: put e.g. a Firefox user agent into the request header (a minimal fetch sketch follows this list), like
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2b5) Gecko/20091204 Firefox/3.6b5
- the old API
- the new API
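A minimal sketch of the plain-HTML variant, reusing the Firefox user-agent string above. Google's markup and its tolerance of scripted clients change over time, so treat this as a starting point rather than a guaranteed interface.

    import urllib.parse
    import urllib.request

    UA = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2b5) "
          "Gecko/20091204 Firefox/3.6b5")

    def google_html(phrase):
        url = "https://www.google.com/search?q=" + urllib.parse.quote(phrase)
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8", errors="replace")

    print(google_html("Tower Bridge")[:500])          # print a chunk as a check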
- Once you have the Google page (via whichever variant), write the part of the program that finds all the referenced links on the page. Print them out as a check.
- Then download all the referenced pages one by one. No need to save them. Print them out as a check. A combined sketch of these two steps follows.
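A combined sketch of these two steps: a crude regex pulls out the href targets (a real HTML parser would be more robust; the regex and the placeholder user agent are simplifying assumptions), then each linked page is fetched and a snippet printed.

    import re
    import urllib.request

    def extract_links(html):
        return re.findall(r'href="(https?://[^"]+)"', html)

    def fetch_all(links):
        for link in links:
            req = urllib.request.Request(link, headers={"User-Agent": "lab-scraper/0.1"})
            try:
                with urllib.request.urlopen(req, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except Exception as exc:                  # dead links are common; skip
                print(link, "failed:", exc)
                continue
            print(link, page[:200])                   # no need to save, just print

    # fetch_all(extract_links(google_html("Tower Bridge")))  # reusing the sketch above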
- Add page-source processing to the downloading code (a minimal sketch follows this list):
- split by periods.
- find the sentences containing your search word.
- count the distinct words in those sentences and save them into a global store, e.g.
car: 1 wheels: 2 etc.
- after each page, the word counts in the store grow.
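Put together as a minimal sketch, the list above is just a global Counter that every processed page bumps:

    import re
    from collections import Counter

    word_counts = Counter()                           # the global store

    def process_page(text, search_word):
        for sentence in text.split("."):              # split by periods
            if search_word.lower() in sentence.lower():
                word_counts.update(re.findall(r"[a-zA-Z]+", sentence.lower()))

    process_page("The Ford Focus is a car. A car has wheels, and wheels wear.", "car")
    print(word_counts.most_common(5))                 # counts grow after each page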
- After that, things get more interesting and more complicated, but by then you will already have an intuition for how to proceed.