Teadmised first lab 2012

Source: Lambda

The main goal of the first lab is to web-scrape uncertain data from public sources using search engines and to calculate a popularity index.

Conditions

You can do the lab either alone or together with a fellow student. Three students per lab is not ok, though.

The choice of technology (programming language, operating system, etc.) is completely free.

What should the app do

Your system should take as input an arbitrary list of cities such as "London", "New York", "Tallinn", etc. (at least 10 cities) and use web search and web scraping to obtain probabilistic and popularity data about them. We look for

  • population
  • area
  • number of search results (popularity) from different search engines (Google, Bing) in different search categories (web, images, videos, news)
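
Population figures and result counts usually appear on pages as formatted text rather than plain integers. The following is a minimal sketch of turning such a snippet into a number. The function name parse_count and the sample strings are assumptions made for illustration; real pages may format numbers differently (for example "8.9 million"), which this sketch does not handle.

```python
import re

# Matches either a digit group with thousands separators (comma, space or
# non-breaking space), or a plain run of digits.
_NUMBER = re.compile(r"\d{1,3}(?:[, \u00a0]\d{3})+|\d+")

def parse_count(text):
    """Return the first integer found in text, or None if there is none.

    Assumption: commas and spaces inside a number are thousands separators.
    """
    match = _NUMBER.search(text)
    if match is None:
        return None
    # Drop every non-digit character (the separators) before converting.
    return int(re.sub(r"\D", "", match.group()))

if __name__ == "__main__":
    print(parse_count("About 1,230,000 results"))
    print(parse_count("Population: 437 619"))
```

Taking only the first number is a deliberate simplification; a real scraper would normally locate the right element of the page first and then parse its text.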

The data should be enriched with a numeric confidence measure (0..1) indicating rough trust in the result. The confidence is mostly based on the percentage of investigated pages/sources giving this result, as well as on the perceived quality of the data source (for example, Wikipedia has high quality, arbitrary pages low). To get the list of pages to scrape, you should do several searches, selecting from

  • Wikipedia, Wolfram Alpha search
  • Google, Bing (web, images, videos, news) web search
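
One possible way to combine the two ingredients of the confidence measure (share of sources agreeing on a value, and source quality) is sketched below. The quality weights, the source labels and the combination rule (weighted share multiplied by the best supporting quality) are all assumptions chosen for illustration, not a prescribed formula.

```python
from collections import defaultdict

# Assumed quality weights per source kind; the numbers are illustrative only.
SOURCE_QUALITY = {"wikipedia": 0.9, "wolframalpha": 0.9, "web": 0.3}

def pick_value(observations, quality=SOURCE_QUALITY, default_quality=0.3):
    """Choose the best-supported value and a confidence in 0..1.

    observations: list of (value, source_kind) pairs scraped for one property.
    """
    if not observations:
        return None, 0.0
    weight = defaultdict(float)        # summed quality per candidate value
    best_quality = defaultdict(float)  # best single source per candidate value
    for value, source in observations:
        q = quality.get(source, default_quality)
        weight[value] += q
        best_quality[value] = max(best_quality[value], q)
    total = sum(weight.values())
    winner = max(weight, key=weight.get)
    # Quality-weighted share of sources agreeing on the winner, scaled by the
    # best source backing it, so a lone low-quality page never reaches 1.0.
    return winner, (weight[winner] / total) * best_quality[winner]

if __name__ == "__main__":
    seen = [(8_800_000, "wikipedia"), (8_800_000, "web"), (9_000_000, "web")]
    print(pick_value(seen))
```

Exact equality between scraped values is a simplification; in practice nearby figures (for example populations from different years) might first be grouped within some tolerance.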

The results should be scraped using a suitable scraping method, and confidence values should be set or calculated depending on the source or the meaning of the search query. All the captured properties and confidence values should be used for the popularity calculation, and the list of cities should be ordered by popularity. What is the mean of the calculated popularity?
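
The popularity formula itself is left open by the assignment. As a hedged sketch, one option is a confidence-weighted sum of log-scaled property values, followed by a descending sort. The formula, the function names and the sample figures below are assumptions for illustration only.

```python
import math

def popularity(properties):
    """Confidence-weighted sum of log-scaled values (an assumed formula).

    properties: dict mapping a property name to a (value, confidence) pair.
    """
    # log10(1 + value) keeps huge hit counts from swamping smaller properties.
    return sum(conf * math.log10(1 + value) for value, conf in properties.values())

def rank_cities(cities):
    """Return city names ordered from most to least popular."""
    return sorted(cities, key=lambda name: popularity(cities[name]), reverse=True)

if __name__ == "__main__":
    # Made-up sample data, not real measurements.
    cities = {
        "A": {"population": (999, 1.0)},
        "B": {"population": (99, 1.0), "google_web": (9, 0.5)},
    }
    for name in rank_cities(cities):
        print(name, round(popularity(cities[name]), 3))
```

Multiplying each term by its confidence means that poorly supported figures contribute less to the ranking, which is one way to use both the properties and their confidence values.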


Hints and suggestions

While developing, it is easier to use a pre-stored search result or Wikipedia page and a pre-stored set of pages instead of doing a full search and download each time.

However, the final system should work with custom place names, meaning that pre-loaded data cannot normally be used.
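
A small on-disk cache is one way to get both behaviours: pages already downloaded are reused during development, while new place names simply trigger a fresh download. This is a minimal sketch; the names cached_fetch and download, the cache directory and the file naming scheme are assumptions.

```python
import hashlib
import urllib.request
from pathlib import Path

def download(url):
    """Fetch the raw bytes of a page over the network."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def cached_fetch(url, cache_dir="cache", fetch=download):
    """Return the page body, downloading only if it is not stored yet."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # A hash of the URL gives a safe, unique file name for each page.
    path = cache / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    return body
```

Passing the fetch function as a parameter lets the scraper be tested offline with a stub; deleting the cache directory forces a full re-download.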

Interesting links and ideas: sophisticated systems.

Scraping frameworks:

Nell:

Peter Norvig's article on statistics-based applied artificial intelligence

Other: