matteo-grella/tweetdeck-scraper

TweetDeck Scraper Docker Container

A Docker container wrapping a tool that continuously scrapes TweetDeck tweets, stores them in Elasticsearch, and adds the scraped IDs to a RabbitMQ queue.

The following information is extracted:

  • Id
  • Publish date
  • Download date
  • Author
  • Language
  • Text
  • Full body
  • Image url
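As a rough illustration of the record these fields describe, the sketch below models one scraped tweet as a Python dataclass. The class name, attribute names, and types are assumptions made for this example; the actual Elasticsearch mapping used by the scraper may differ.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ScrapedTweet:
    # Hypothetical field names mirroring the list above;
    # the real index mapping may use different names or types.
    id: str
    publish_date: datetime
    download_date: datetime
    author: str
    language: str
    text: str
    full_body: str
    image_url: Optional[str] = None  # assumed optional: not every tweet has an image

    def to_document(self) -> dict:
        """Return a JSON-friendly dict with dates as ISO 8601 strings."""
        doc = asdict(self)
        doc["publish_date"] = self.publish_date.isoformat()
        doc["download_date"] = self.download_date.isoformat()
        return doc
```

A dict produced by `to_document()` could then be indexed as a single document, with `id` serving as the value pushed to the queue.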

Configuration

Sensitive settings are configured in a .env file. They are exported as environment variables when the container runs, so they can be changed even after the container is built:

Twitter account params:

$ cd tweetdeck-scraper
$ echo "TWITTER_USERNAME=username" > .env
$ echo "TWITTER_PASSWORD=password" >> .env

Elasticsearch settings:

$ echo "ES_HOST=localhost" >> .env
$ echo "ES_PORT=9200" >> .env
$ echo "ES_CURRENT=index_name" >> .env
$ echo "ES_USERNAME=username" >> .env
$ echo "ES_SECRET=password" >> .env

RabbitMQ settings:

$ echo "RMQ_HOST=localhost" >> .env
$ echo "RMQ_PORT=5672" >> .env
$ echo "RMQ_QUEUE=tweetdeck" >> .env
$ echo "RMQ_USERNAME=username" >> .env
$ echo "RMQ_PASSWORD=password" >> .env
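Once the container exports these values, a Python process can read them from its environment. The following is a minimal sketch of that step for the Elasticsearch and RabbitMQ settings, not the project's actual loading code: the `load_settings` function is hypothetical, and only the variable names are taken from the examples above.

```python
import os
from typing import Mapping

# Variable names as they appear in the .env examples above.
SETTING_NAMES = [
    "ES_HOST", "ES_PORT", "ES_CURRENT", "ES_USERNAME", "ES_SECRET",
    "RMQ_HOST", "RMQ_PORT", "RMQ_QUEUE", "RMQ_USERNAME", "RMQ_PASSWORD",
]


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the settings from `env`, failing fast if any is missing or empty."""
    missing = [name for name in SETTING_NAMES if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in SETTING_NAMES}
    # Ports arrive as strings; convert them once here.
    settings["ES_PORT"] = int(settings["ES_PORT"])
    settings["RMQ_PORT"] = int(settings["RMQ_PORT"])
    return settings
```

Checking every variable up front means a misconfigured .env file is reported at startup rather than on the first failed connection.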

Additional settings can be found inside tweetdeck_scraper/settings.py and should be modified before building the container.

DEBUG

When True, additional information is logged.

LOG_PATH

The path of the log file.

SCRAPE_INTERVAL

Seconds between scraping actions.

COLUMNS

Which columns to scrape. The value can be 'ALL' or a list of XPath expressions.
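For orientation, here is a sketch of how these four settings might look in a settings module, together with a helper that normalizes COLUMNS. All values shown are illustrative assumptions, and both `ALL_COLUMNS_XPATH` and `column_xpaths` are hypothetical names that are not taken from tweetdeck_scraper/settings.py.

```python
# Illustrative values only; check tweetdeck_scraper/settings.py for the real defaults.
DEBUG = False
LOG_PATH = "tweetdeck_scraper.log"  # hypothetical path
SCRAPE_INTERVAL = 60                # seconds between scraping actions
COLUMNS = "ALL"                     # or a list of XPath strings

# Hypothetical XPath standing in for "every column on the page".
ALL_COLUMNS_XPATH = "//section[contains(@class, 'column')]"


def column_xpaths(columns):
    """Normalize a COLUMNS value into a list of XPath strings."""
    if columns == "ALL":
        return [ALL_COLUMNS_XPATH]
    if isinstance(columns, str):
        # A bare string other than 'ALL' is most likely a mistake.
        raise ValueError("COLUMNS must be 'ALL' or a list of XPath strings")
    return list(columns)
```

Normalizing to a list lets the scraping loop treat both forms of COLUMNS the same way.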

Usage

Build the container

$ ./build-docker.sh

Run the container

$ ./run-docker.sh

Enjoy :)


Legal

It is your responsibility to ensure that your use of tweetdeck-scraper does not violate applicable laws.

Licensing

TweetDeck Scraper is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.