Tweet ENG

Source: Lambda


Specification

Write a command-line application that takes a geographical location as a parameter (e.g. Tallinn, Estonia, Rapla, Harjumaa), finds recent tweets on Twitter that were posted from this location, and prints them out.

For example:

    06.09.2008 altexor: Klaxons -a1 disco punk in tallinn. fun   
    06.09.2008 Nath_Milliet: Just booked our plane tickets to go & 
      see Glenn in Tallinn on the 1st of november. I just can't wait :)
    05.09.2008 nolecore: Too much Vana Tallinn....argggghhh    

The assignment is worth a maximum of 10 points: 5 points for the main part and the rest for extra features. An assignment without the main part is not accepted.

It may be worth registering as a Twitter user, although this is not mandatory for the assignment.

Details of the main part: how it works

The program must use the Twitter API - see [1]

You can easily search this API for tweets posted at a geographical location. Note that a Twitter user might not define his/her location correctly, but this is not your concern. Twitter tries to work out the geographical coordinates from the location the user has defined.

API examples:

These URLs return a reasonably sized list of tweets for the given coordinates and the radius around them.

Unless you have a special reason to use the JSON format, the ATOM format is highly recommended (JSON is excellent for use within JavaScript).

Your program has to open a URL that ends with the following comma-separated fields:

  • latitude (59.438862)
  • longitude (24.754472)
  • radius (10km)
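
A minimal sketch of assembling these three fields into the comma-separated tail of the URL. The class and method names are made up for illustration, and the base URL is left out on purpose: take it from the API examples.

```java
import java.util.Locale;

final class GeocodeParam {
    // Builds "latitude,longitude,radius", e.g. "59.438862,24.754472,10km".
    // Locale.ROOT keeps the decimal separator a dot; a locale that prints
    // decimals with a comma would break the comma-separated fields.
    static String build(double latitude, double longitude, int radiusKm) {
        return String.format(Locale.ROOT, "%.6f,%.6f,%dkm", latitude, longitude, radiusKm);
    }

    public static void main(String[] args) {
        System.out.println(build(59.438862, 24.754472, 10));
    }
}
```

The result is meant to be appended to the end of the search URL as-is.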

Google Maps API

How do you get the coordinates of a location given by the user (Estonia, Tallinn, etc.)? Use the Google Maps API:

NB! If you open the example links directly, you will probably get a wrong result with an encoded error code.
Copy and paste the URL into your web browser and press Enter (or open it from the command line using curl); then it works.

The Google CSV format is far simpler than XML, so it is the recommended one.

The fields of a CSV response:

  • Reasonable zoom level (you can use this for defining the radius later)
  • Result code (200 is OK)
  • Latitude
  • Longitude

NB! Latitude and longitude come in inverted order with respect to the Twitter API.

Everything up to this point is the main part: it gives 5 points and is enough to pass the assignment.

First extra assignment: kohad.csv file

An additional feature of the program, intended for

  • defining your own locations (home, university, etc.)
  • saving Google API responses so that you don't have to query the API next time (local cache)

The CSV file has to be available to your program, and it is recommended to keep it in the same folder as the program. It has to contain rows with the following fields:

   official_name, x-coord, y-coord, radius_km, alternative_name_1, ..., alternative_name_N

or rows where the coordinates and radius are missing:

   official_name,,,,alternative_name_1, ..., alternative_name_N

where

  • the official name is what is queried from the Google API (Tallinn instead of Home, Mustamäe instead of University)
  • the alternative names are the names a user may type for that location
  • the coordinates may be left undefined at first!

Algorithm for kohad.csv use:

  • Check the local cache before querying the Google API
  • If the location is found, use it and skip the Google API
  • If it is found but the coordinates are missing:
    • Query the Google API with official_name to find the coordinates
    • Save the coordinates to the local cache so you don't have to hit the Google API next time
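
The steps above could be sketched roughly as follows. Everything here is illustrative: the PlaceCache, Place and Geocoder names are hypothetical, the Geocoder interface merely stands in for your own Google API call, and the handling of a location that is not in the file at all (query by the user's text, then cache it) is an assumption based on the caching goal stated earlier. Reading and writing the file itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class PlaceCache {
    // One row: official_name, x-coord, y-coord, radius_km, alternative names...
    static final class Place {
        final String officialName;
        Double x, y, radiusKm;                       // null while the field is empty
        final List<String> altNames = new ArrayList<>();
        Place(String officialName) { this.officialName = officialName; }
        boolean hasCoordinates() { return x != null && y != null; }
    }

    // Hypothetical stand-in for the Google API call; returns {x, y}.
    interface Geocoder { double[] lookup(String name); }

    static Double parseOrNull(String field) {
        String s = field.trim();
        return s.isEmpty() ? null : Double.valueOf(s);
    }

    // The -1 limit keeps the empty fields of rows such as "Tallinn,,,,Home".
    static Place parseRow(String row) {
        String[] f = row.split(",", -1);
        Place p = new Place(f[0].trim());
        if (f.length > 1) p.x = parseOrNull(f[1]);
        if (f.length > 2) p.y = parseOrNull(f[2]);
        if (f.length > 3) p.radiusKm = parseOrNull(f[3]);
        for (int i = 4; i < f.length; i++) {
            String alt = f[i].trim();
            if (!alt.isEmpty()) p.altNames.add(alt);
        }
        return p;
    }

    static String text(Double d) { return d == null ? "" : d.toString(); }

    // Turns a place back into a CSV row for saving.
    static String toRow(Place p) {
        StringBuilder sb = new StringBuilder(p.officialName);
        sb.append(',').append(text(p.x)).append(',').append(text(p.y));
        sb.append(',').append(text(p.radiusKm));
        for (String alt : p.altNames) sb.append(',').append(alt);
        return sb.toString();
    }

    // Matches the query against official and alternative names, ignoring case.
    static Place find(List<Place> places, String query) {
        for (Place p : places) {
            if (p.officialName.equalsIgnoreCase(query)) return p;
            for (String alt : p.altNames) {
                if (alt.equalsIgnoreCase(query)) return p;
            }
        }
        return null;
    }

    // Cache first; the geocoder is called only when coordinates are missing.
    static Place resolve(List<Place> places, String query, Geocoder geocoder) {
        Place p = find(places, query);
        if (p == null) {                 // assumption: unknown places get cached too
            p = new Place(query);
            places.add(p);
        }
        if (!p.hasCoordinates()) {
            double[] c = geocoder.lookup(p.officialName);
            p.x = c[0];
            p.y = c[1];
        }
        return p;
    }
}
```

After resolve returns, writing every place back with toRow would persist the newly found coordinates.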

This module will give 2 points.

Second additional feature: dataen.txt file

This is another additional part of the program: it uses the population estimates encoded in the large file dataen.txt to judge the size of a place and to calculate a reasonable radius for it.

The file is a downloadable ZIP archive of 6.3 MB (23 MB after extraction).

Format explanation.

This module will give 2 points.

Third extra feature: sorting

This part will give 1 point.

With sorting, the program accepts the following command-line arguments:

  • java findtweets Tallinn
  • java findtweets Tallinn name
  • java findtweets Tallinn date
  • java findtweets Tallinn content

The first form is the normal execution described earlier. The second sorts and prints the results by author name, the third by submission date, and the fourth by tweet content.

You have to implement all of these.
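
One possible way to map the optional second argument to a sort order is sketched here. The TweetSorter and Tweet names are hypothetical, and the date is assumed to be kept as the feed's timestamp string in ISO 8601 form (year first), so that plain text comparison orders it correctly; a date stored as 06.09.2008 would not sort this way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TweetSorter {
    // Hypothetical holder for one tweet; date is an ISO 8601 string (assumption).
    static final class Tweet {
        final String date, name, content;
        Tweet(String date, String name, String content) {
            this.date = date;
            this.name = name;
            this.content = content;
        }
    }

    // Maps the second command line argument to a comparator.
    static Comparator<Tweet> comparatorFor(String key) {
        switch (key) {
            case "name":
                return Comparator.comparing((Tweet t) -> t.name, String.CASE_INSENSITIVE_ORDER);
            case "date":
                return Comparator.comparing((Tweet t) -> t.date);
            case "content":
                return Comparator.comparing((Tweet t) -> t.content, String.CASE_INSENSITIVE_ORDER);
            default:
                throw new IllegalArgumentException("Unknown sort key: " + key);
        }
    }

    // Returns a sorted copy when args carries a sort key, otherwise the list unchanged.
    static List<Tweet> arrange(List<Tweet> tweets, String[] args) {
        if (args.length < 2) return tweets;
        List<Tweet> copy = new ArrayList<>(tweets);
        copy.sort(comparatorFor(args[1]));
        return copy;
    }
}
```

Sorting names and content without regard to case is a design choice of this sketch, not a requirement of the assignment.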

Scoring

Scoring summary:

  • Main part: 5
  • kohad.csv file: 2
  • dataen.txt file: 2
  • sorting: 1

Recommendations and ideas

Reading from web

Study the example program: http://java.sun.com/docs/books/tutorial/networking/urls/readingURL.

If you are unable to read from the web, the most likely cause is that your computer is not connected to the Internet directly but has to go through a proxy.

NB! It is possible (but not certain - try it!) that all computers in the class have to use the proxy. If the given example program cannot read from the web, read the TTÜ AK proxy tutorial, which explains what to do when a proxy is required.

You will probably not have this issue at home or elsewhere, and things will just work.

Use a try-catch block for I/O; this includes web access and local file access.
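
As a rough illustration of I/O wrapped in try-catch, the sketch below reads everything behind a URL into a string. The UrlReader class and its fetch method are hypothetical names, and returning null on failure is just one possible way to report the error.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class UrlReader {
    // Returns the text behind the address, or null if reading fails.
    static String fetch(String address) {
        StringBuilder body = new StringBuilder();
        // try-with-resources closes the stream even when an exception is thrown
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                URI.create(address).toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } catch (IOException e) {
            // Covers network failures as well as missing local files
            System.err.println("Could not read " + address + ": " + e.getMessage());
            return null;
        }
        return body.toString();
    }
}
```

The same method also works for file: URLs, which is handy for testing against a saved response without touching the network.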

The Sun tutorial on network I/O might be good reading.

Using the Twitter XML data

How do you get the necessary information (date, name, message) out of the Twitter XML response? Look at the source of the Twitter XML response! Have your program search it for the following strings, one after another:

  • the string between <published>...</published> is the date/time
  • the string between <title>...</title> is the message
  • the string between <name>...</name>, up to the parenthesis, is the author's name

Then search for the next ones, and so on. I recommend saving the strings you find in a two-dimensional string array, so that they can be sorted later. For the main part of the assignment the array is not even strictly necessary.
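
A minimal sketch of this string search, assuming each tweet sits in its own <entry> element (as Atom feeds are structured) and that the tags carry no attributes, exactly as written in the list. The AtomScan class and its methods are hypothetical names. Rows are collected in a list of string arrays, which serves the same purpose as the two-dimensional array suggested here.

```java
import java.util.ArrayList;
import java.util.List;

final class AtomScan {
    // Returns the text between open and close at or after 'from', or null if absent.
    static String between(String xml, String open, String close, int from) {
        int start = xml.indexOf(open, from);
        if (start < 0) return null;
        start += open.length();
        int end = xml.indexOf(close, start);
        return end < 0 ? null : xml.substring(start, end);
    }

    // Each row is {date, name, message}.
    static List<String[]> scan(String xml) {
        List<String[]> rows = new ArrayList<>();
        int pos = xml.indexOf("<entry>");
        while (pos >= 0) {
            int next = xml.indexOf("<entry>", pos + 1);
            // Limiting the search to one entry skips the feed's own <title>
            String entry = next < 0 ? xml.substring(pos) : xml.substring(pos, next);
            String date = between(entry, "<published>", "</published>", 0);
            String message = between(entry, "<title>", "</title>", 0);
            String name = between(entry, "<name>", "</name>", 0);
            if (date != null && message != null && name != null) {
                int paren = name.indexOf('(');
                if (paren >= 0) name = name.substring(0, paren);  // keep the part before "("
                rows.add(new String[] { date, name.trim(), message });
            }
            pos = next;
        }
        return rows;
    }
}
```

The sketch does not decode XML entities such as &amp;amp;, so messages may still contain them.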

Using dataen.txt

Student Sander sent additional hints:

To compute the radius from the population, I first simply divided the population by a constant. This gave a reasonable radius for Tallinn, but for New York it gave one larger than the whole continent. Then I realised that the number of people in a city is proportional to its area (linear dimension squared), not to its linear dimension. This means the radius is proportional to the square root of the population. For the radius estimate I currently use

Math.sqrt(population) / 40 <- a constant figured out by trial-and-error

Although the populations of Lapland and China are very different, this still gives more or less reasonable results. Additionally, I disallowed radii smaller than 5 km; it doesn't make sense to query tweets from sparsely populated areas anyway.
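
Put together, Sander's hint could look like the sketch below. The class, method and constant names are hypothetical, and the result is assumed to be in kilometres, since the minimum is given as 5 km.

```java
final class RadiusEstimate {
    static final double DIVISOR = 40.0;        // the trial-and-error constant from the hint
    static final double MIN_RADIUS_KM = 5.0;   // radii below this are not allowed

    // Radius grows with the square root of the population, never below the minimum.
    static double radiusKm(long population) {
        return Math.max(MIN_RADIUS_KM, Math.sqrt(population) / DIVISOR);
    }

    public static void main(String[] args) {
        System.out.println(radiusKm(640_000));  // sqrt is 800, so 800 / 40 = 20.0
        System.out.println(radiusKm(10_000));   // 100 / 40 = 2.5, raised to 5.0
    }
}
```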

Snippets and stuff

These example programs can come in handy: