GATE Cloud Twitter Collector
The Twitter Collector is an easy-to-set-up service on GATE Cloud that collects tweets matching given keywords, from given Twitter users, geo-tagged with particular locations, written in particular languages, or a combination of all four. It uses the Twitter streaming API, so it collects tweets continuously until you choose to stop it.
To use the service, go to https://cloud.gate.ac.uk/shopfront/displayItem/twitter-collector. It runs on dedicated Amazon cloud machines, so we have to charge a small amount for its use, since Amazon charges us in turn.
1. Reserving and starting a Twitter Collector
Log in to GATE Cloud and reserve a machine from the Twitter Collector service page. Provided you have enough funds in your GATE Cloud account, this will lead to a page confirming the reservation. There is no set-up cost for reserving the machine, but your account will be charged for the clock time (not the CPU time) that it is left running ("State active" on the dashboard or machine reservation page).
Each link in the Reservation ID column of the Cloud Machines section of your dashboard brings up a page with the https URL, account name, and password for administering the machine. (The hostname will be in the form twitter-collector-<identifier>.services.gate.ac.uk.) You need to click the "Start Instance" button and wait for the machine to become active in order to configure it.
You will receive an e-mail when it is active, and if you refresh the reservation page, the URL will become clickable. Follow the link for access to the machine, and enter the user name and generated password to configure it. (You can change the password once you have logged into the machine.)
2. Configuring a Twitter Collector
Your Twitter Collector's admin page contains the following sections.
Twitter API Access
Use this section of the machine's admin page to authorize the machine to use your Twitter account. GATE Cloud uses your account for read-only access to the Twitter API. If you operate more than one Twitter Collector at a time, you will need a different Twitter account for each one. (This condition is imposed by Twitter, not by GATE Cloud.)
The stream configuration section defines your search. You can specify a set of keywords (including hashtags) to track, a list of users to follow, a set of geolocations, and a list of language codes.
Geolocations are given by selecting rectangles of latitude and longitude on a map of the world, which can be zoomed and panned with the mouse in the usual ways. The rectangles, which do not need to touch each other, will be added together, so a tweet located in any one of them matches. Language codes are the two-letter IANA codes; the list is available from a link in the relevant stream configuration section.
The Twitter API itself does not provide an "and" function for searches, so if you choose that option in our menu, be aware that GATE Cloud collects all tweets that match any of your criteria and then applies the "and" internally. As a result, you may hit Twitter's rate limit more often than you expect. (The limit is imposed by Twitter, not by GATE Cloud.)
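The difference between matching any criterion and matching all of them can be illustrated with a minimal Python sketch. This is not GATE Cloud's implementation: the `Criteria` class, the function names, and the simplified tweet dictionary (with `text`, `user`, `lang`, `lon`, and `lat` keys) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Criteria:
    """Hypothetical container for the four kinds of search criteria."""
    keywords: list = field(default_factory=list)   # lower-case terms or hashtags
    users: list = field(default_factory=list)      # user names to follow
    boxes: list = field(default_factory=list)      # (west, south, east, north) rectangles
    languages: list = field(default_factory=list)  # two-letter language codes


def in_any_box(lon, lat, boxes):
    """Rectangles are added together: a point matches if it lies in any of them."""
    return any(w <= lon <= e and s <= lat <= n for (w, s, e, n) in boxes)


def criterion_results(tweet, c):
    """Return one boolean per criterion that has actually been filled in."""
    results = []
    if c.keywords:
        text = tweet["text"].lower()
        results.append(any(k in text for k in c.keywords))
    if c.users:
        results.append(tweet["user"] in c.users)
    if c.boxes:
        has_point = tweet.get("lon") is not None and tweet.get("lat") is not None
        results.append(has_point and in_any_box(tweet["lon"], tweet["lat"], c.boxes))
    if c.languages:
        results.append(tweet.get("lang") in c.languages)
    return results


def matches(tweet, c, mode="or"):
    """'or' keeps a tweet matching any criterion; 'and' requires all of them."""
    results = criterion_results(tweet, c)
    if not results:
        return False
    return all(results) if mode == "and" else any(results)
```

In this sketch every tweet that passes `matches(..., "or")` would have to be received before the stricter "and" test can be applied, which is why an "and" search can consume as much of the rate limit as the corresponding "or" search.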
This section is very important: if you do not specify a place to save the results, most of them will be lost, and only the latest 2 GB will be available on the Twitter Collector machine.
Tweets will be collected in JSON format and compressed with GZIP. Here you can specify the common part of the filenames, the maximum chunk size, and where to save the files (on your GATE Cloud account, or your Amazon S3 bucket, or both). Save them to your GATE Cloud account in order to process them with GATE Cloud services. Be sure to click the "Update" button.
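Once a chunk has been downloaded, it can be read with a few lines of Python. The following is a minimal sketch that assumes each chunk contains one JSON object per line; that layout is an assumption, not something stated here, so check a downloaded file before relying on it. The function name `read_tweets` is hypothetical.

```python
import gzip
import json


def read_tweets(path):
    """Yield tweets from a GZIP-compressed chunk, assuming one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

For example, `sum(1 for _ in read_tweets("my-collection-chunk.json.gz"))` would count the tweets in a chunk without loading the whole file into memory; the filename shown is a placeholder for whatever common filename part you configured.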
Under "Collector Status", press the "Start collecting" button to start collecting tweets. While collection is running, you can view log messages and use the "Stop collecting" and "Roll over to next chunk" buttons.
This section offers six reports on your data collected so far, updated in real time.
- View the top hashtags as a bar chart or tag cloud, and add hashtags to the tracking list in the collector.
- View a line graph of tweet frequency per hour.
- View a bar chart or cloud of the top topics according to our lists of terms. The lists can be edited from a link on the topics graph page. Topics can be added as terms to the tracker.
- View a bar chart or cloud of the most mentioned users. These can also be added for the collector to track.
- View a bar chart or cloud of the frequency of the terms you are tracking.
- View a bar chart or cloud of the top words of one grammatical type (adjective, verb, pronoun, adverb, noun, proper noun). This is only available for tweets in English. Use the pull-down menu and click "Select" to change the grammatical type. Words here can also be added to the tracker.
Each report can be downloaded as SVG or JSON. The "Twitter collector - GATE Cloud" header at the top of each report page is a link that will take you back to this collector's main (control) page.
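If you prefer to work with the raw tweets, the hashtag and tweet-frequency reports can be approximated offline. The sketch below is not how GATE Cloud computes its reports. It assumes tweets in the Twitter API v1.1 JSON layout, with hashtags under `entities.hashtags` and a `created_at` timestamp, and it assumes an English or C locale for parsing day and month names. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout of the "created_at" field in Twitter API v1.1 tweets,
# for example "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags, ignoring case, as (tag, count) pairs."""
    counts = Counter(
        tag["text"].lower()
        for tweet in tweets
        for tag in tweet.get("entities", {}).get("hashtags", [])
    )
    return counts.most_common(n)


def tweets_per_hour(tweets):
    """Return (hour, count) pairs sorted by hour, suitable for a line graph."""
    counts = Counter()
    for tweet in tweets:
        when = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[when.strftime("%Y-%m-%d %H:00")] += 1
    return sorted(counts.items())
```

Both functions accept any iterable of tweet dictionaries, so they can be fed directly from tweets parsed out of a downloaded chunk.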
You can download recent chunks (GZIP files) from here.
3. Processing data with GATE applications
GATE applications can be obtained from the Services page and added to your Dashboard, where they can be configured to run over GATE Cloud data bundles, as described in the annotation jobs documentation.
If you click the "Roll over to next chunk" button under "Collector Status" in the Twitter Collector, the message "You have some partially-uploaded bundles" will appear in your Dashboard. Clicking the link after that message will give you options to manipulate such bundles, and you can click the "Finished" button on a partially uploaded bundle's page to close it off early and make it available for further processing. The same button can also be used to close the last chunk of data after you click the "Stop collecting" button (under "Collector Status").