Teadmised first lab 2011

The main goal of the first lab is to web-scrape uncertain data from public sources using search engines and simple text analysis/statistics.

Conditions

You can do the lab either alone or together with a fellow student. Three students per lab is not OK, though.

Technology - language, operating system, etc. - is completely up to you.

What should the app do

Your system should take as input a list of arbitrary objects to investigate, like "Tower Bridge", "Statue of Liberty", "Tallinna raekoda", "Olde hansa", "Tallinn University of Technology", "Andrus Ansip", "Ford focus", "inception", and use web search and web scraping to obtain probabilistic data about the objects. We look for

  • popularity (search hits)
  • object type (building, church, tower, landmark, person, car, ...)
  • creation year (building year, birthdate, year of introduction to market, ...)
  • other often-used characteristic words/tags (steeple, national, medieval, academic, peaminister, thriller, ...)

The data should be augmented with a numeric confidence measure (0...1) indicating rough trust in the result. The confidence is mostly based on the percentage of investigated pages/sources giving this result, as well as on the perceived quality of the data source (e.g. Wikipedia has high quality, arbitrary pages low).
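For example, a quality-weighted vote share gives a value in 0...1. A minimal sketch in Python; the source weights below are illustrative assumptions, not something the lab prescribes:

    # One possible confidence measure: the quality-weighted share of
    # sources that agree on a result. The weights are made-up assumptions.
    SOURCE_QUALITY = {"wikipedia": 1.0, "wolframalpha": 0.9, "other": 0.3}

    def confidence(sources_for_result, all_sources):
        """Both arguments are lists of source kinds, e.g. ["wikipedia", "other"]."""
        weight = lambda srcs: sum(SOURCE_QUALITY.get(s, 0.3) for s in srcs)
        total = weight(all_sources)
        return weight(sources_for_result) / total if total else 0.0

    # E.g. 2 of 3 sources say "1894", one of them Wikipedia:
    # confidence(["wikipedia", "other"], ["wikipedia", "other", "other"])  -> ~0.81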

To get the list of pages to scrape, you should run several searches, selecting from / using in parallel (a sketch of the Wikipedia variant follows this list):

  • Wikipedia search
  • Wolfram Alpha search
  • Google or Bing (or some other) web search
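For the Wikipedia leg, one concrete option is the standard MediaWiki action API. A minimal sketch (the User-Agent string is a placeholder you should replace with your own identifier):

    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def wikipedia_search(phrase, limit=5):
        """Return (total hit count, list of matching page titles)."""
        params = urllib.parse.urlencode({
            "action": "query", "list": "search",
            "srsearch": phrase, "srlimit": limit, "format": "json",
        })
        req = urllib.request.Request(API + "?" + params,
                                     headers={"User-Agent": "teadmised-lab/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        hits = data["query"]["searchinfo"]["totalhits"]   # a popularity estimate
        return hits, [page["title"] for page in data["query"]["search"]]

The totalhits field doubles as a rough popularity figure, so one call already serves two of the goals above.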

The results should be scraped. Use different scraping methods for semi-structured data (Wikipedia, Wolfram Alpha) and unstructured data (arbitrary web pages). For the latter it makes sense to split the page into sentences (just use the period "." as a separator) and look for sentences containing the investigated phrase. What other words do they contain? What numbers? (See the sketch below.)
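A minimal sketch of this naive treatment for unstructured pages; the regex-based tag stripping and year pattern are simplistic assumptions, real pages will need more care:

    import re

    def relevant_sentences(html, phrase):
        # Naive treatment: strip tags crudely, split on periods, keep the
        # sentences that mention the investigated phrase.
        text = re.sub(r"<[^>]+>", " ", html)
        return [s for s in text.split(".") if phrase.lower() in s.lower()]

    def words_and_years(sentence):
        # Candidate tag words and year-like numbers (1000-2099) in a sentence.
        words = re.findall(r"[a-z]{3,}", sentence.lower())
        years = re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", sentence)
        return words, years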

After scraping N pages for the same phrase, look for the most frequently occurring type words, year numbers, and other tags.
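Tallying across the N pages is a natural fit for collections.Counter. A sketch, assuming the hypothetical helpers relevant_sentences and words_and_years from the previous snippet:

    from collections import Counter

    def summarize(pages, phrase, top=10):
        # Tally words and years over all scraped pages for one phrase.
        word_counts, year_counts = Counter(), Counter()
        for html in pages:
            for sentence in relevant_sentences(html, phrase):
                words, years = words_and_years(sentence)
                word_counts.update(w for w in words if w not in phrase.lower())
                year_counts.update(years)
        return word_counts.most_common(top), year_counts.most_common(top)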

Hints and suggestions

While developing, it is easier to use a pre-stored search result / Wikipedia page and a pre-stored set of pages instead of doing the full search and download each time.

However, the final system should work with arbitrary phrases, so it normally cannot rely on the pre-loaded data.
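One way to reconcile these two points is a small on-disk cache that falls back to a live download for unseen URLs. A sketch; the cache directory name and the file naming are arbitrary choices:

    import hashlib
    import os
    import urllib.request

    CACHE_DIR = "cache"  # assumption: a local directory for development runs

    def fetch(url):
        # Return page HTML, using an on-disk cache so repeated runs are fast.
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return f.read()
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        return html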

Interesting links and ideas from more sophisticated systems:

  • NELL (Never-Ending Language Learning)
  • Peter Norvig's article on statistics-based applied artificial intelligence
  • Scraping frameworks


How to get started

None of these suggestions is mandatory, but they will help if you are confused:

  • A familiar programming language: Java or Python
  • Build an application that downloads a Google search result page. There are three possible variants; prefer whichever of them seems simplest to you (the order below is also the likely order of simplicity; a minimal end-to-end sketch in Python follows this list):
    • plain HTML: put e.g. a Firefox user agent into the request header, a la
        User-Agent: Mozilla/5.0 (Windows; U;
        Windows NT 6.1; ru; rv:1.9.2b5)
        Gecko/20091204 Firefox/3.6b5
    • the old API
    • the new API
  • Once you have the Google page (via whichever API), write the part of the program that finds all the links referenced on the page. Print them out as a check.
  • Then download all the referenced pages one by one. No need to save them. Print them out as a check.
  • Add page-source processing to the downloading code:
    • split on periods.
    • find the sentences containing your search word.
    • count the distinct words in each such sentence and save them into a global database, a la
    car: 1
    wheels: 2
    etc.
    • after each page, the word counts in the database grow.
  • From there on, things get more interesting and more complicated, but by then you will already have an intuition for how to proceed.
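Putting the steps above together under the plain-HTML variant, a loose sketch of how the pipeline might start out. Google's result markup changes often and it may refuse automated clients, so treat the link-extraction regex as a guess and keep the pre-stored pages from the hints section handy:

    import re
    import urllib.parse
    import urllib.request
    from collections import Counter

    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; "
                             "rv:1.9.2b5) Gecko/20091204 Firefox/3.6b5"}

    def get(url):
        req = urllib.request.Request(url, headers=HEADERS)
        return urllib.request.urlopen(req).read().decode("utf-8", errors="replace")

    def google_links(phrase):
        # Fetch a Google result page and pull out linked URLs. Fragile:
        # the markup changes often, and Google may block automated clients.
        page = get("https://www.google.com/search?q=" + urllib.parse.quote(phrase))
        links = re.findall(r'href="(https?://[^"]+)"', page)
        return [u for u in links if "google" not in u]   # drop Google's own links

    def build_word_counts(phrase, max_pages=10):
        counts = Counter()
        for url in google_links(phrase)[:max_pages]:
            try:
                text = re.sub(r"<[^>]+>", " ", get(url))  # crude tag stripping
            except Exception:
                continue                                  # skip unreachable pages
            for sentence in text.split("."):              # naive sentence split
                if phrase.lower() in sentence.lower():
                    counts.update(re.findall(r"[a-z]{3,}", sentence.lower()))
        return counts

    if __name__ == "__main__":
        for word, n in build_word_counts("Ford Focus").most_common(20):
            print(word, n)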