Teadmised course: first lab (2011)
The main goal of the first lab is to web-scrape uncertain data from public sources using search engines and simple text analysis/statistics.
Conditions
You can do the lab either alone or together with one fellow student. Three students per lab is not OK, though.
Technology - language, operating system, etc. - is completely up to you.
What should the app do
Your system should take as input a list of arbitrary investigated objects, like "Tower Bridge", "Statue of Liberty", "Tallinna raekoda", "Olde Hansa", "Tallinn University of Technology", "Andrus Ansip", "Ford Focus", "Inception", and use web search and web-scraping to obtain probabilistic data about the objects. We look for
- popularity (search hits)
- object type (building, church, tower, landmark, person, car, ...)
- creation year (building year, birthdate, year of introduction to market, ...)
- other often-used characteristic words/tags (steeple, national, medieval, academic, peaminister, thriller, ...)
The data should be augmented with a numeric confidence measure (0...1) indicating rough trust in the result. The confidence is mostly based on the percentage of investigated pages/sources giving this result, as well as on the perceived quality of the data source (e.g. Wikipedia has high quality, arbitrary pages low).
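One way to read the previous paragraph as code is to treat confidence as the quality-weighted share of sources agreeing on a value. The sketch below is a minimal Python illustration; the quality weights and the combining formula are assumptions, not a required design.

    # Assumed quality weights per source kind; tune to taste.
    SOURCE_QUALITY = {"wikipedia": 1.0, "wolframalpha": 0.9, "web": 0.4}

    def confidence(supporting_sources, all_sources):
        # supporting_sources: sources that reported this value
        # all_sources: every source that was checked for it
        supporting = sum(SOURCE_QUALITY.get(s, 0.4) for s in supporting_sources)
        total = sum(SOURCE_QUALITY.get(s, 0.4) for s in all_sources)
        return supporting / total if total else 0.0

    # e.g. Wikipedia and one of two arbitrary pages agree on "built in 1894":
    print(confidence(["wikipedia", "web"], ["wikipedia", "web", "web"]))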
In order to get the list of pages to web-scrape, you should do several searches, selecting from or using in parallel (see the Wikipedia sketch after this list):
- Wikipedia search
- Wolfram Alpha search
- Google or Bing (or some other) web search
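For the Wikipedia leg, a minimal sketch using the public MediaWiki search API could look as follows; the totalhits field doubles as a crude popularity count, and the client name in the User-Agent header is a made-up placeholder.

    import json
    import urllib.parse
    import urllib.request

    def wikipedia_search(phrase, limit=5):
        params = urllib.parse.urlencode({
            "action": "query", "list": "search", "srsearch": phrase,
            "srlimit": limit, "format": "json",
        })
        req = urllib.request.Request(
            "https://en.wikipedia.org/w/api.php?" + params,
            headers={"User-Agent": "lab-scraper/0.1"},  # placeholder client name
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        hits = data["query"]["searchinfo"]["totalhits"]      # popularity hint
        titles = [hit["title"] for hit in data["query"]["search"]]
        return hits, titles

    print(wikipedia_search("Tower Bridge"))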
The results should be scraped. Use a different scraping method for semi-structured data (Wikipedia, Wolfram Alpha) and for unstructured data (arbitrary web pages). For the latter it makes sense to split the page into sentences (just use the period "." as a separator) and look for sentences containing the investigated phrase. What other words do those sentences contain? What numbers do they contain?
After scraping N pages for the same phrase, look for the most frequently occurring type words, year numbers, and other tags; a minimal aggregation sketch follows.
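A minimal sketch of this aggregation, assuming the N pages have already been downloaded as plain-text strings: among sentences mentioning the phrase, pick the most frequent year and report the fraction of pages backing it as the confidence.

    import re
    from collections import Counter

    def most_likely_year(pages, phrase):
        votes = Counter()
        for text in pages:
            years = set()
            for sentence in text.split("."):          # crude sentence split
                if phrase.lower() in sentence.lower():
                    years.update(re.findall(r"\b1[0-9]{3}\b|\b20[0-9]{2}\b", sentence))
            votes.update(years)                       # one vote per page per year
        if not votes:
            return None, 0.0
        year, hits = votes.most_common(1)[0]
        return year, hits / len(pages)                # confidence = share of pages

    year, conf = most_likely_year(["Tower Bridge was built in 1894."], "Tower Bridge")
    print(year, conf)                                 # -> 1894 1.0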
Hints and suggestions
While developing, it is easier to use a pre-stored search result/Wikipedia page and a pre-stored set of pages instead of doing a full search and download each time.
However, the system should finally work with any phrase, meaning that in the normal case the pre-loaded data cannot be used.
Interesting links and ideas
Sophisticated systems:
- NELL (CMU's never-ending language learning system)
- Peter Norvig's article on statistics-based applied artificial intelligence
Scraping frameworks:
Other:
- http://www.cs.uic.edu/~liub/WebMiningBook.html
- http://www.frenchlane.com/Web-Content-Mining-4.pdf
- http://www.cs.uic.edu/~liub/teach/cs511-spring-06/CS511-opinion-synthesis.ppt
- http://www.cato.org/pubs/pas/pa584.pdf
How to start
None of these suggestions is mandatory, but they help if you are confused:
- A programming language you know, e.g. Java or Python.
- Build an application that downloads a Google search result page. There are three possible variants; prefer whichever seems simplest to you (the order below is the likely order of simplicity):
- plain HTML: put e.g. a Firefox user agent into the request header (a minimal fetch sketch follows this list), like
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2b5) Gecko/20091204 Firefox/3.6b5
- the old API
- the new API
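A minimal sketch of the plain-HTML variant, reusing the Firefox user-agent string above. Google's markup and its tolerance of scripted clients change over time, so treat this as a starting point rather than a guaranteed interface.

    import urllib.parse
    import urllib.request

    UA = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2b5) "
          "Gecko/20091204 Firefox/3.6b5")

    def google_html(phrase):
        url = "https://www.google.com/search?q=" + urllib.parse.quote(phrase)
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8", errors="replace")

    print(google_html("Tower Bridge")[:500])          # print a chunk as a check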
- Once you have the Google page (via whichever variant), write the part of the program that finds all the referenced links on the page. Print them out as a check.
- Then download all the referenced pages one by one. No need to save them. Print them out as a check. A combined sketch of these two steps follows.
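A combined sketch of these two steps: a crude regex pulls out the href targets (a real HTML parser would be more robust; the regex and the placeholder user agent are simplifying assumptions), then each linked page is fetched and a snippet printed.

    import re
    import urllib.request

    def extract_links(html):
        return re.findall(r'href="(https?://[^"]+)"', html)

    def fetch_all(links):
        for link in links:
            req = urllib.request.Request(link, headers={"User-Agent": "lab-scraper/0.1"})
            try:
                with urllib.request.urlopen(req, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except Exception as exc:                  # dead links are common; skip
                print(link, "failed:", exc)
                continue
            print(link, page[:200])                   # no need to save, just print

    # fetch_all(extract_links(google_html("Tower Bridge")))  # reusing the sketch above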
- Add page-source processing to the downloading code (a minimal sketch follows this list):
- split by periods.
- find the sentences containing your search word.
- count the distinct words in those sentences and save them into a global store, e.g.
car: 1 wheels: 2 etc.
- after each page, the word counts in the store grow.
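Put together as a minimal sketch, the list above is just a global Counter that every processed page bumps:

    import re
    from collections import Counter

    word_counts = Counter()                           # the global store

    def process_page(text, search_word):
        for sentence in text.split("."):              # split by periods
            if search_word.lower() in sentence.lower():
                word_counts.update(re.findall(r"[a-zA-Z]+", sentence.lower()))

    process_page("The Ford Focus is a car. A car has wheels, and wheels wear.", "car")
    print(word_counts.most_common(5))                 # counts grow after each page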
- After that, things get more interesting and more complicated, but by then you will already have an intuition for how to proceed.