Notes for KR lab 1
Searching
You do not, strictly speaking, need Google: you could also use Bing etc. There are two ways to get data from Google: a new API and an old, deprecated API. The old deprecated one is easier to use: just open a URL like
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=8&start=0&q=Andrus+Ansip
where rsz=8 indicates that there are 8 links per answer page (the default is 4; numbers larger than 8 are not accepted) and start=0 says you want the links starting from link nr 0.
In other words, start with start=0, then use start=8, start=16 etc. as long as you still get results: the maximum number of links you can get is ca 100.
The small example at the end of these notes implements such an iteration over search pages.
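As a rough sketch, assuming a hypothetical helper page_urls (and keeping in mind that this old AJAX API is deprecated and may stop answering at any time), the page URLs could be generated like this:

import urllib.parse

def page_urls(phrase, per_page=8, max_links=100):
    # start=0, 8, 16, ...; the caller stops asking once a page comes back empty
    base = "https://ajax.googleapis.com/ajax/services/search/web?v=1.0"
    for start in range(0, max_links, per_page):
        yield (base + "&rsz=" + str(per_page) +
               "&start=" + str(start) +
               "&q=" + urllib.parse.quote_plus(phrase))

The full example at the end of these notes builds the same URLs inline, fetches them and parses the JSON answers.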
Caching
While caching the results is not strictly necessary, it is a good idea to store both
- Google search results (link lists)
- HTML sources obtained from the links
in files.
Then, while experimenting and debugging, you do not need to search for each name more than once. A large number of searches will simply deplete your Google quota for some period. Also, downloading all these pages repeatedly takes time.
The small example at the end of these notes does not do any caching.
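A minimal caching sketch, assuming a local directory cache and a hypothetical helper fetch_cached: each downloaded page is stored in a file named after the hash of its URL, and the network is used only when that file does not exist yet.

import hashlib, os, urllib.request

CACHE_DIR = "cache"   # illustrative cache directory

def fetch_cached(url):
    # the cache file name is derived from the URL, so repeated runs reuse the file
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode("utf-8")).hexdigest())
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html

The Google search answers (link lists) can be cached in exactly the same way.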
Processing html source
In the end we are only interested in a small number of specific keywords. Thus there is no real need to throw away garbage words, html constructs etc.: you can do this, but you may also skip it.
A good idea is to
- replace potential separators like < and > with whitespace
- replace the period . with whitespace-period-whitespace, like this: " . "
After such elementary replacements you should probably just split the whole html into "words" separated by whitespace.
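A minimal sketch of such processing (page_words is a hypothetical helper; the separator list is just an example):

def page_words(html):
    # turn a few separator characters into whitespace and pad periods,
    # then split on whitespace; the result still contains plenty of garbage words
    for ch in ["<", ">", ",", "\""]:
        html = html.replace(ch, " ")
    html = html.replace(".", " . ")
    return html.split()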
Processing words on the page
Now that you have obtained a word list (containing lots of garbage words), you want to associate each word with its relevance number on the page:
- the closer the word is to a searched phrase, the more relevant it is.
- the more the word occurs on the page, the more relevant it is.
Once you decide upon the concrete "relevance formula" and calculate relevances for the words, you may well throw away all words with a very low relevance: we will not use them anyway.
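One possible sketch of such a computation, assuming the searched name has been split into lowercase parts like {"andrus", "ansip"}; the formula itself is only an illustration, you should design your own:

def word_relevances(words, name_parts):
    # positions where a part of the searched name occurs on the page
    name_positions = [i for i, w in enumerate(words) if w.lower() in name_parts]
    scores = {}
    for i, w in enumerate(words):
        if w.lower() in name_parts:
            continue
        # closeness to the nearest occurrence of the name (0 if the name is absent)
        dist = min((abs(i - p) for p in name_positions), default=None)
        closeness = 1.0 / dist if dist else 0.0
        # every occurrence adds its closeness, so frequent nearby words score highest
        scores[w] = scores.get(w, 0.0) + closeness
    return scores

After this you can simply drop every word whose score falls below some threshold.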
Aggregating words on different pages
The next step from here is to start aggregating the relevance-weighted word lists of the individual html sources. Observe that some html pages themselves are more relevant than others. For example, Wikipedia and Facebook pages are probably highly relevant. Similarly, pages at the top of the search list are probably more relevant than the pages at the end of the search list.
Once you have decided upon the "page relevance" formula, you may want to aggregate the weighted word lists of the separate pages into a common weighted word list for the whole phrase, using the page relevance number as an additional weight while aggregating.
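A sketch of such an aggregation, assuming every page has already been given a relevance number (for example one that decreases with its position in the search results):

def aggregate(pages):
    # pages: list of (page_relevance, {word: score}) pairs
    total = {}
    for page_relevance, scores in pages:
        for w, s in scores.items():
            total[w] = total.get(w, 0.0) + page_relevance * s
    return total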
What to do if you find very few interesting words?
A good idea is to decide upon a small list of highly relevant and common words and search the name together with these words!
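For instance, the query part of the search URL could be built like this (the extra keyword here is purely illustrative):

import urllib.parse

def query_with_keyword(name, keyword):
    # query_with_keyword("Andrus Ansip", "tartu") -> "Andrus+Ansip+tartu"
    return urllib.parse.quote_plus(name + " " + keyword)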
Real estate portals
A good idea in Estonia is to use http://www.kv.ee/ where you can simply search with a GET query.
See also https://www.ehr.ee/
Small example as a start
A very small example program in Python 3 to introduce scraping and the Google search API: it does a Google search for Andrus Ansip with 8 results per page and then goes through the result pages (each containing 8 links).
The m function replaces some characters with whitespace, splits the page into words and prints out the words with their counts, starting with the ones having the highest word count.
No file caching is used here. Also, we do not check for words near the name, etc. As it is, the program is not a suitable answer for the first lab: you can, as said, use it as a starting point.
import urllib.request
import json

def m(url):
    # download one result page and count the words on it
    try:
        u = urllib.request.urlopen(url)
        html = u.read().decode("utf-8")
    except Exception:
        print("failed to read url")
        return
    # pad periods and turn a few separator characters into whitespace
    r = html.replace(".", " . ").replace(",", " ").replace("\"", " ")
    r = r.replace("<", " ").replace(">", " ")
    s = r.split()
    # count the occurrences of each word
    d = {}
    for w in s:
        if w in d:
            d[w] = d[w] + 1
        else:
            d[w] = 1
    l = []
    for k in d:
        l.append((k, d[k]))
    # sort by count, highest first
    t = sorted(l, key=lambda e: e[1], reverse=True)
    print(str(t))

def g():
    print("googling")
    start = 0
    while start < 100:
        print("start " + str(start))
        # build the query URL for one page of results (8 links per page)
        url = "https://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=8"
        url += "&start=" + str(start)
        url += "&q="
        url += "Andrus+Ansip"
        try:
            u = urllib.request.urlopen(url)
            html = u.read().decode("iso-8859-1")
            j = json.loads(html)
            if not j or "responseData" not in j:
                print("failed to read/parse")
                return
            p = j["responseData"]
            if not p or "results" not in p:
                print("no more results")
                return
            p = p["results"]
            # visit each result link on this page
            for r in p:
                url = r["url"]
                print("******* url *****")
                print(str(url))
                m(url)
        except Exception:
            print("failed to read google search url")
        start += 8

g()