Analyzing Reddit’s Top Posts & Images With Google Cloud (Part 2 – AutoML)

BMW 128i CloudVision

In the last iteration of this article, we analyzed the top 100 subreddits and tried to understand what makes a reddit post successful by using Google’s Cloud ML tool set to analyze popular pictures.

In this article, we will be extending the last article’s premise – to analyze picture-based subreddits with Dataflow – by using Google’s AutoML Vision toolset, training a model, and exposing it via REST to recognize new images.

The source code for this is available on GitHub under the GNU General Public License v3.0.

What is Reddit?

Reddit is a social network where people post pictures of cats and collect imaginary points, so-called “upvotes”.

Reddit (/ˈrɛdɪt/, stylized in its logo as reddit) is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members. Posts are organized by subject into user-created boards called “subreddits”, which cover a variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing. Submissions with more up-votes appear towards the top of their subreddit and, if they receive enough votes, ultimately on the site’s front page.”(https://en.wikipedia.org/wiki/Reddit)

Reddit is the 3rd most popular site in the US and provides a wonderful basis for a lot of interesting, user-generated data.

Technology & Architecture

We will be partly re-using the architecture of the last article, with some slight adjustments here and there.

Architecture

As we focus on the image recognition part, we upload a training set of images to Cloud Storage (alongside with our manual classifications), train an AutoML model, and access Reddit data via our Desktop using REST.

The latter part can be automated in subsequent steps, e.g. using Dataflow and PubSub (you can find some prep-work on my GitHub page).

AutoML Vision

Google’s AutoML is a managed machine learning framework. While it is technically still in Beta, it already proves a tremendous advantage: It more or less automates complex segments of “traditional” machine learning, such as image recognition (AutoML Vision) or NLP (AutoML Natural Language and AutoML Translation).

Specifically AutoML Vision enables developers and engineers who are not familiar with the mathematical intricacies of image recognition to build, train, and deploy ML models on the fly – which is why we are going to use it here.

AutoML vs. Cloud Vision

While the Cloud Vision API gives us access to Google’s ever-growing set of data (that naturally is used to train the internal ML models), we can use AutoML to train our very own model with specific use cases that a common-ground approach – such as Cloud Vision – might not capture.

Now, let’s use one of my favorite topics: cars. The following picture shows the output of the Cloud Vision API, fed with a picture of my car, an E87 BMW 128i.

BMW 128i CloudVision

While it did classify the car as both “BMW” and “car”, it failed to recognize any specifics.

Let’s take another example, my old E85 BMW Z4 3.0i, from when I was living in Germany:

BMW Z4 3.0i CloudVision

Once again, it figured out we are dealing with a BMW, but neither the massive hood that houses the beauty that is the naturally aspirated three-liter I6 nor the fact that the roof is, in fact, missing told Cloud Vision that this must be a Z4.

The main decision criterion here should be: Is it worth spending the extra effort to train your own model? Is your data set so specific and unique that it warrants its own model? Do you have proper Data Scientists in your organization who could do a (better) custom job?

In our case – yes, it is. So, time to train our own model, without having to deal with TensorFlow, massive coding efforts, or a Master’s in Statistics.

Getting data & classifying images

As we are trying to extend the idea of image recognition, we first need a set of images to get started on. For that, we will use /r/bmw where people show off their BMWs, mostly E30 and F80 M3s (keep at it folks, I cannot get enough of them). What could go wrong with user-generated content for training sets?

A simple script re-uses part of our existing reddit/praw and Python setup to pull the top posts from the subreddit, filter them by the type “extMedia”, save the images under /tmp, and prepare a CSV file that we will use for classification later.

The resulting images wind up on Google Cloud Storage (GCS).

# encoding=utf8
from __future__ import print_function

import config
import os
import praw
import urllib
import re

from reddit.Main import get_top_posts

__author__ = "Christian Hollinger (otter-in-a-suit)"
__version__ = "0.1.0"
__license__ = "GNU GPLv3"


def unixify(path):
    return re.sub('[^\w\-_\. ]', '_', path)


def get_image(post, img_path):
    filename = unixify(post.title)
    tmp_uri = '{}{}'.format(img_path, filename)
    print('Saving {url} as {tmp}'.format(url=post.content, tmp=tmp_uri))
    urllib.urlretrieve(post.content, tmp_uri)
    return tmp_uri, filename


def write_gcp(_input, _output, bucket_name):
    from google.cloud import storage
    # Instantiates a client
    storage_client = storage.Client()

    # Gets bucket
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(_output)

    # Upload
    blob.upload_from_filename(_input)

    print('Uploading {} to bucket {}'.format(_output, bucket_name))


def csv_prep(gcs_loc):
    return '{gcs_loc}|\n'.format(gcs_loc=gcs_loc).encode('utf-8')


def main():
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

    # Get reddit instance
    reddit = praw.Reddit(client_id=config.creddit['client_id'],
                         client_secret=config.creddit['client_secret'],
                         user_agent=config.creddit['user_agent'])
    # Set GCP path
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = config.cgcp['api_key']
    LIMIT = config.limit
    bucket_name = config.cgcp['images']

    # Settings
    subreddit = config.crawler['subreddit']
    img_path = config.crawler['path'] + subreddit

    # Get top posts
    top_posts = get_top_posts(subreddit, reddit, LIMIT)

    # Filter images
    images = filter(lambda p: p.type == 'extMedia', top_posts)

    csv = ''
    # Download images
    for post in images:
        tmp, filename = get_image(post, img_path)
        write_gcp(tmp, subreddit + '/' + filename, bucket_name)
        csv += csv_prep('gs://{bucket_name}/{subreddit}/{filename}'
                        .format(bucket_name=bucket_name, subreddit=subreddit, filename=filename))

    # Dump pre-classifier CSV
    with open(img_path+'images.csv', 'a') as file:
        file.write(csv)


if __name__ == "__main__":
    main()

Now, here’s where the fun begins – we need to classify the training set by hand. It is generally recommended to have at least 100 images per category (Google actually offers a human-driven service for this!), but we are going to stick to less – it’s Saturday.

In order to simplify the model, I dumbed down my categories – 1 and 2 series, 3 and 4 series, 5 and 6 series, concept and modern, Z3, Z4, and Z8, as well as classics, such as the iconic M1 or 850CSi. The latter introduces way too much noise; however, having a folder full of those cars is fun on its own.

➜  bmw ls -d */

1-2series/  3-4series/ 5-6series/  classics/ concept-modern/  z3-4-8/

Setting up AutoML

After labeling the images, we can proceed to AutoML and point to our CSV file. As the CSV contains the gs:// path and label, we are presented with an overview that looks like this:

Reddit AutoML Data
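For reference, the rows in that import CSV follow AutoML Vision’s simple “gs://path,label” convention – the bucket and file names below are just placeholders:

gs://my-reddit-images/bmw/e87_128i.jpg,1-2series
gs://my-reddit-images/bmw/e85_z4.jpg,z3-4-8
gs://my-reddit-images/bmw/e30_m3.jpg,3-4series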

Once the labeling is complete, we can train the model from the Web UI. It will warn you if you don’t have enough images per label.

Label Warning

After the training is complete, we can see the model performance.

Model Performance (Reddit)

That does not look good. How did this happen?

The answer is fairly simple: All of reddit’s images are taken from various angles, in various lighting conditions, and with various model years. The noise in the images is too high to achieve a good result and we don’t have enough data to properly train the model.

Image Recognition Theory

In order to understand the issue, let’s talk theory for a second. Cloud ML uses TensorFlow to classify our images, exposing the model via an easy API.

As usual, this is not a scientific paper – I’m one of those engineer-folks who use the research output, not a researcher. Things will be simplified and maybe even wrong. But hey, it works in the end!

What is TensorFlow?

“TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. […] In a TensorFlow graph, each node has zero or more inputs and zero or more outputs, and represents the instantiation of an operation. Values that flow along normal edges in the graph (from outputs to inputs) are tensors, arbitrary dimensionality arrays where the underlying element type is specified or inferred at graph-construction time. Special edges, called control dependencies, can also exist in the graph: no data flows along such edges, but they indicate that the source node for the control dependence must finish executing before the destination node for the control dependence starts executing.” (Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467)

While AutoML uses Google’s NASNet approach to find the right architecture –

“Our work makes use of search methods to find good convolutional architectures on a dataset of interest. The main search method we use in this work is the Neural Architecture Search (NAS) framework proposed by [71]. In NAS, a controller recurrent neural network (RNN) samples child networks with different architectures. The child networks are trained to convergence to obtain some accuracy on a held-out validation set. The resulting accuracies are used to update the controller so that the controller will generate better architectures over time. The controller weights are updated with policy gradient (see Figure 1).”

Figure 1

(Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le: Learning Transferable Architectures for Scalable Image Recognition. arXiv:1707.07012v4)

…we will quickly talk about Convolutional Neural Networks.

“CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.”

(https://en.wikipedia.org/wiki/Convolutional_neural_network)

Given the way these networks work – by analyzing an image’s binary components in a set of computational layers – it is easy to confuse the network with seemingly different, albeit actually very similar, images.

Bharath Ray’s article (link below) explains it as follows:

“It finds the most obvious features that distinguishes one class from another. Here [an example about cars taken from different angles], the feature was that all cars of Brand A were facing left, and all cars of Brand B are facing right”

Check out this article by Bharath Ray on Medium for more details on how to overcome this manually.
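One of the simplest manual countermeasures is to augment the training set with mirrored copies, so the orientation of the cars stops being a distinguishing feature. A minimal sketch, assuming Pillow is installed and the labeled images sit in folders like the listing shown earlier:

from PIL import Image
import glob
import os

SRC = '/tmp/bmw/3-4series/'  # example label folder, adjust to your setup

for path in glob.glob(os.path.join(SRC, '*.jpg')):
    img = Image.open(path)
    # Mirror horizontally so "facing left" vs. "facing right" is no longer a feature
    flipped = img.transpose(Image.FLIP_LEFT_RIGHT)
    base, ext = os.path.splitext(path)
    flipped.save(base + '_flipped' + ext)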

Adjusting the Model

The solution to our problem is fairly simple – we just need a better training set with more data from more images. Car pictures tend to be taken from very similar angles, in similar lighting conditions, and with similar trim styles of vehicles.

First, we’ll use https://github.com/hardikvasa/google-images-download to get ourselves proper training images from Google. Ironic, isn’t it?

Download Training Set
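The screenshot above shows the CLI; the project also ships a small Python API, which looks roughly like this (keyword, limit, and paths are just examples, and the argument names follow the project’s README at the time of writing):

from google_images_download import google_images_download

downloader = google_images_download.googleimagesdownload()
# Pull a batch of training images for one label into a local folder
downloader.download({
    'keywords': 'BMW 3 series',
    'limit': 100,
    'output_directory': '/tmp/bmw/',
    'image_directory': 'BMW 3 series'
})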

Next, simply create a mock-CSV and quickly populate the classifiers.

cd '/tmp/bmw/BMW 3 series/'
gsutil -m cp  "./*" gs://calcium-ratio-189617-vcm/bmw-1s
gsutil ls gs://calcium-ratio-189617-vcm/bmw-1s >> ../1series.csv

After getting enough training data, we can go back to AutoML Vision and create a new model based on our new images.

Classification CSV

After importing the file, we are greeted with a much more usable training set:

Google AutoML Data

Now, when we evaluate the model, it looks a lot less grim:

Model Performance

Using our model with Reddit data

After we figured out the issue with our training set, let’s try out the REST API.

We’ll use this image from reddit:

by /u/cpuftw at https://www.reddit.com/r/BMW/comments/5ziz5c/my_first_m_2017_nardo_grey_m3/

And simply throw a REST request at it, using a simple Python 2 script:

import sys

from google.cloud import automl_v1beta1
from google.cloud.automl_v1beta1.proto import service_pb2

"""
This code snippet requests an image classification from a custom AutoML Model
Usage: python2 automl_rest.py $img $project $model_id
"""

def get_prediction(_content, project_id, model_id):
    prediction_client = automl_v1beta1.PredictionServiceClient()

    name = 'projects/{}/locations/us-central1/models/{}'.format(project_id, model_id)
    payload = {'image': {'image_bytes': _content }}
    params = {}
    request = prediction_client.predict(name, payload, params)
    return request  # waits till request is returned


if __name__ == '__main__':
    file_path = sys.argv[1]
    project_id = sys.argv[2]
    model_id = sys.argv[3]

    with open(file_path, 'rb') as ff:
        _content = ff.read()

    print(get_prediction(_content, project_id,  model_id))

And the results are…

Results M3

On point! Well, it is an M3, but that is still a 3 series BMW.

Next, remember my old Z4?

Z4
payload {
classification {
    score: 0.999970555305
}
    display_name: "bmwz4"
}

Yes, sir! That is, in fact, a Z4.

Conclusion

Now, what did we learn?

First off, using the Cloud Vision API simplifies things tremendously for the overwhelming majority of use cases. It gives you very accurate output for most standard scenarios, such as detecting images not appropriate for your user base (for filtering user-generated content) or classifying and detecting many features in an image.

However, when the task becomes too specific, AutoML helps us to build our custom model without having to deal with the intricacies of a custom TensorFlow model. All we need to take care of is good training data and careful labeling before training the model. The simple REST API can be used just like the Cloud Vision API in your custom software.

I don’t know about you, but I’m a big fan – we managed to build a system that would otherwise require a lot of very smart Data Scientists. Granted, it will not achieve the accuracy a good Data Scientist can (on AutoML or not) – these folks know more than I do and can figure out model issues that I cannot; however, this is the key point: any skilled Engineer with a basic understanding of ML can implement this system and advance your project with a custom ML model. Neato!

All development was done under Arch Linux on Kernel 4.18.12 with 16 AMD Ryzen 1700 vCores @ 3.6Ghz and 32GiB RAM

Continue Reading

Analyzing Reddit’s Top Posts & Images With Google Cloud (Part 1)

Entity Wordcloud

In this article (and its successors), we will use a fully serverless Cloud solution, based on Google Cloud, to analyze the top Reddit posts of the 100 most popular subreddits. We will be looking at images, text, questions, and metadata.

We aim to answer the following questions:

  • What are the top posts?
  • What is the content of the top posts? What types of images are popular?
  • When is the best time to post to reddit?
  • Is “99% of the Karma” really in the hand of the “top 1% of posters”?

This will be the first part of multiple; we will be focussing on the data processing pipeline and run some exemplary analysis on certain image-based subreddits using the Cloud Vision API.

The source code for this is available on GitHub under the GNU General Public License v3.0.

What is Reddit?

Reddit is a social network where people post pictures of cats and collect imaginary points, so-called “upvotes”.

Reddit (/ˈrɛdɪt/, stylized in its logo as reddit) is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members. Posts are organized by subject into user-created boards called “subreddits”, which cover a variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing. Submissions with more up-votes appear towards the top of their subreddit and, if they receive enough votes, ultimately on the site’s front page.”

(https://en.wikipedia.org/wiki/Reddit)

Reddit is the 3rd most popular site in the US and provides a wonderful basis for a lot of interesting, user-generated data.

Technology & Architecture

We will be using the following technologies:

  • Python 2.7.3
  • Cloud Dataflow / Apache Beam
  • BigQuery
  • Google Cloud Storage (GCS)
  • Cloud ML / Vision API
  • Cloud Datalab

Resulting in the following architecture –

Architecture

Compute Engine or Cloud Shell is used to run the data-gathering Python script and store the data to Cloud Storage.

Dataflow and Cloud Vision API will be used to process the data and store it to BigQuery.

DataLab will be used to analyze & visualize the data.

Gathering Data

For gathering the initial data, we will use a simple Python script using the reddit praw library. You can run this from your Google Cloud Shell or your local desktop (or a Compute Engine instance).

This code will do the following:

  • Pull the “top” posts of all time from the 100 most popular subreddits, up to a limit you define (I took 1,000 for this article)
  • Detect the type of post:
    • Self – text
    • Question – simply a title (like in /r/askreddit)
    • extMedia – external media (images, videos)
    • Link – external links, e.g. to blog posts
  • Add a unique ID to the post by MD5-hashing the title and timestamp
  • Store the result as JSON, split by subreddit
  • Upload the JSON to GCS
class DictEncoder(json.JSONEncoder):
    def default(self, obj):
        return obj.__dict__


class Post:
    def __init__(self, title, subreddit, author, upvotes, date_iso, link, type, num_comments, content):
        self.id = hashlib.md5((title + str(date_iso)).encode('utf-8')).hexdigest()
        self.title = title
        self.subreddit = subreddit
        self.author = author
        self.upvotes = upvotes
        self.date_iso = int(date_iso)
        self.link = link
        self.type = type
        self.num_comments = num_comments
        self.content = content

    def __str__(self):
        return "{title}, upvotes: {up}, date: {date}, link: {link}, content: {content}".format(
            title=self.title.encode('utf8'),
            up=self.upvotes,
            date=time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(self.date_iso)).encode('utf8'),
            link=self.link.encode('utf8'),
            content=self.content.encode('utf-8'))


def get_top_posts(subreddit, reddit, limit):
    # Store posts
    posts = []

    for submission in reddit.subreddit(subreddit).top(limit=limit):
        if submission.pinned:
            continue

        try:
            if submission.is_self and submission.selftext is not None:
                # Self post - text
                content = submission.selftext
                _type = 'self'
            elif submission.is_self and submission.selftext is None:
                # Self post - no header - askreddit etc.
                content = submission.title
                _type = 'question'
            elif submission.url is not None and submission.preview is not None and len(submission.preview) > 0 \
                    and 'images' in submission.preview and len(submission.preview['images']) > 0:
                # External media - store preview if available
                content = submission.preview['images'][0].get('source').get('url')
                _type = 'extMedia'
            elif submission.url is not None and submission.media is not None:
                # External media
                content = submission.url
                _type = 'extMedia'
            elif submission.url is not None and submission.media is None:
                # External link
                if 'imgur' in submission.url or '.jpg' in submission.url or '.png' in submission.url or '.gif' in submission.url:
                    _type = 'extMedia'
                else:
                    _type = 'link'
                content = submission.url
            else:
                # Empty post
                content = None
                _type = 'none'
                continue

            post = Post(submission.title, submission.subreddit_name_prefixed, submission.author.name, submission.ups,
                        submission.created, submission.permalink,
                        _type, submission.num_comments, content)
            posts.append(post)
            print("subreddit: {subreddit}".format(subreddit=submission.subreddit_name_prefixed))
        except Exception as e:
            print(e)
            continue

        # https://github.com/reddit-archive/reddit/wiki/API
        # Honor fair use terms - 60 requests per minute
        time.sleep(1)

    return posts


def write_json_gcp(_input=config.creddit['file'], _output=config.cgcp['file'], bucket_name=config.cgcp['bucket']):
    from google.cloud import storage
    # Instantiates a client
    storage_client = storage.Client()

    # Gets bucket
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(_output)

    # Upload
    blob.upload_from_filename(_input)
    print('Uploaded {} to {} in bucket {}'.format(_input, _output, bucket_name))


def main():
    # Get reddit instance
    reddit = praw.Reddit(client_id=config.creddit['client_id'],
                         client_secret=config.creddit['client_secret'],
                         user_agent=config.creddit['user_agent'])
    # Set GCP path
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = config.cgcp['api_key']
    LIMIT = config.limit

    # Define top subreddits
    csv = 'subreddit|type|title|upvotes|num_comments|content|author|date\n'
    subreddits = ['announcements', 'funny']
    # (Full list on GitHub)
    posts = []
    flat_json = ''
    # Enable for debugging
    #subreddits = ['pics']

    for subreddit in subreddits:
        flat_json = ''
        try:
            top_posts = get_top_posts(subreddit, reddit, LIMIT)
            posts = posts + top_posts

            for post in top_posts:
                flat_json += json.dumps(post.__dict__) + '\n'

            if config.use_json_array == 'true':
                # Write back Json as array
                with open(subreddit + config.creddit['file'], 'a') as file:
                    file.write(json.dumps([ob.__dict__ for ob in posts]))
            else:
                # Write back JSON one line at a time for DataFlow
                with open(subreddit + config.creddit['file'], 'a') as file:
                    file.write(flat_json.encode('utf8'))

            write_json_gcp(subreddit + config.creddit['file'], subreddit + config.creddit['file'])
        except Exception as e:
            print(e)
            print('Encountered error, skipping record')
            continue


if __name__ == "__main__":
    main()

The resulting JSON has the following structure:

{
  "date_iso": 1515704703,
  "author": "unknown_human",
  "num_comments": 4109,
  "title": "Meeting Keanu Reeves at a traffic light",
  "subreddit": "r/pics",
  "content": "https://i.redditmedia.com/txql52xsvYCE8qkOxDL3WZfTt9b_bv2XqVI9mopa4kg.jpg?s=2b315defb2812191eb14fea6111376a8",
  "link": "/r/pics/comments/7pnxv2/meeting_keanu_reeves_at_a_traffic_light/",
  "upvotes": 200378,
  "type": "extMedia",
  "id": "4f6541b8b9b98e26346a228312a1b662"
}

The reason we don’t run this through a distributed system is reddit’s “Fair Use” API policy, which limits us to 1 API call per second, therefore rendering high-performance computing fairly pointless.

Cloud Vision API

Before we dive into the data processing on the cloud, let’s quickly talk about image recognition.

Google’s Cloud Vision API is a powerful tool to quickly analyze an image’s content and detect its relevant features and relative importance within the image.

It abstracts the actual machine learning models from the user and makes it a fantastic tool to integrate in any data processing pipeline, as it doesn’t require you to actually figure out your own model or to train it. While CloudML does enable you to figure all this out with, say, TensorFlow, chances are that a lot of use cases will not require that level of effort.
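As a point of reference, requesting labels via the Python client is only a few lines; this is a sketch using the client library’s label_detection helper (the file name is just an example):

from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open('otter.jpg', 'rb') as image_file:
    content = image_file.read()

# Ask the Vision API for label annotations on the raw image bytes
response = client.label_detection(image=vision.types.Image(content=content))
for label in response.label_annotations:
    print('{}: {}'.format(label.description, label.score))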

Take this example of a picture of an otter I took in the Atlanta Zoo the other day and review the labels the Cloud Vision API returned:

Cloud Vision Example 1

These are the labels the vision API detected –

{
  "labelAnnotations": [
    {
      "mid": "/m/035qhg",
      "description": "fauna",
      "score": 0.9405464,
      "topicality": 0.9405464
    },
    {
      "mid": "/m/089v3",
      "description": "zoo",
      "score": 0.8177689,
      "topicality": 0.8177689
    },
   ...  ]

As you can see, the seemingly main content of the image, the animal, has a relatively low score, as it only makes up a small part of the image. It did, however, infer that this was taken in a zoo (as opposed to in the wild) based on the image’s other features, such as the artificial riverbed.

Now, compare it to another picture of two otters I took in a different Zoo in the UK a couple of years ago:

Cloud Vision Example 2

Here, we can clearly see that the model correctly identified the content and went into a much higher level of detail, given the absence of noise from the surroundings.

Taking this into account, we need to keep a couple of things in mind about our data:

  • We need to filter out low probabilities
  • We need to ensure not to consider too generic terms, for instance “meal” when looking at images from /r/food

You can try out the Cloud Vision API here: https://cloud.google.com/vision/

Introducing Data Flow

In the next step, we utilize Cloud Data Flow to process the data further.

Dataflow is based on the Apache Beam API and is an auto-scaling data-processing framework. It follows a fairly simple programming model in either Python or Java, relying on immutable, distributed collections (PCollections) and functions that get applied to one line of an input file at a time.

Dataflow is fully managed (serverless) and auto-scales to more processing nodes when required.

Similar to Apache Spark, Beam can use the same code for streaming and batch data. You can also run it on e.g. Flink or Spark, but for the sake of having a serverless architecture, we will focus on Data Flow.

For more details, I will refer you to the official Apache Beam documentation.

The Data Flow Pipeline

Data Flow Pipeline

We will be doing the following processing steps:

  • Read the JSON, decode it
  • Split records with images (type extMedia)
  • Get the image*
  • Apply the VisionAPI
  • Store the results (image VisionAPI output and posts) to BigQuery in two separate tables


First off, we read the initial JSON and decode it to a Python dict. The current example reads one JSON at a time; you could also read multiple files.

    with beam.Pipeline(options=pipeline_options) as p:
        records = (
            p |
            ReadFromText(known_args.input, coder=JsonCoder()) |
            'Splitting records' >> beam.ParDo(Split())
        )

The code also splits the inputs by their type tag to identify images.

class Split(beam.DoFn):
    def process(self, record):

        _type = record['type']
        if _type == 'self' or _type == 'link':
            return [{
                'post': record,
                'image': None
            }]
        elif _type == 'extMedia':
            return [{
                'post': record,
                'image': record['content']
            }]
        else:
            return None

Next, we get the image from the data, store it to GCS, apply the VisionAPI, and finally return another dict for our images table. We resize the image to ensure we don’t hit the Vision API’s 10MiB file limit per request.

    def process(self, record):
        logging.info('Image: ' + record['image'])
        tmpuri = self.tmp_image_loc + record['post']['id'] + '.jpg'
        # Download the image, upload to GCS
        urllib.urlretrieve(record['image'], tmpuri)
        self.write_gcp(tmpuri, self.outputloc + record['post']['id'] + '.jpg', self.bucket)
        labels = self.get_vision(tmpuri, record['post']['id'])
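The snippet above omits the resize step; a rough sketch of that guard with Pillow could look like this (only the 10MiB limit is given in the text, the halving strategy is an assumption):

from PIL import Image
import os

MAX_BYTES = 10 * 1024 * 1024  # Vision API limit per request

def shrink_if_needed(path):
    # Halve the resolution until the file fits under the API limit
    while os.path.getsize(path) > MAX_BYTES:
        img = Image.open(path)
        img = img.resize((img.width // 2, img.height // 2), Image.ANTIALIAS)
        img.save(path)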

The resulting data contains the unique ID of the post, the subreddit, the label, its topicality (the relevance of the detected feature within the image), and its score.

Lastly, we write the results back to BigQuery:

            posts | 'Write to BQ' >> beam.io.WriteToBigQuery(
                known_args.output,
                schema='date_iso:INTEGER,author:STRING,type:STRING,title:STRING,subreddit:STRING,content:STRING,link:STRING,num_comments:INTEGER,upvotes:INTEGER,id:STRING',
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

In order to run this pipeline, upload the file to Cloud Storage, download it to Cloud Shell and execute the following script:

python -m DataFlowReddit \
  --project ${PROJECT} \
  --runner DataflowRunner \
  --input ${INPUT} \
  --temp_location gs://${BUCKET}/tmp/ \
  --bucket ${BUCKET} \
  --staging_location gs://${BUCKET}/stg/ \
  --tmp /tmp/ \
  --useBigQuery true \
  --output reddit.posts \
  --imgOutput reddit.images \
  --requirements_file requirements.txt \
  --max_num_workers 24

During execution, you can always launch StackDriver to analyze the logs for any failures or progress:

Stackdriver

* As a disclaimer – it is, generally speaking, not a brilliant idea to run a very expensive operation – like multiple REST calls – inside a pipeline. For the sake of simplicity, we will stick with this approach.

A first look at the data

To take a look at the resulting data, we run a simple BigQuery query in the Web UI to examine the most prominent features of a hand-picked subreddit, for instance /r/pics, where people post pictures of all sorts of things. Keep in mind, the BigQuery preview feature is free.

Big Query Example

We can see an overview of posts, upvotes, comments, the content (in this case, a link to the image), and the id. We will use this data to process the data further in DataLab.

Analyzing the data with DataLab

For analyzing the data, we use DataLab. DataLab is pretty much comparable to Apache Zeppelin – a live, web-based notebook based on Jupyter that enables us to analyze data and visualize it in notebooks that can be updated live and easily shared.

It exposes all Google Cloud components in a simple Python environment where we can work on our data with Pandas Dataframes. If you are familiar with Apache Spark, this will come naturally.

In order to get our data, we use the %%bq directive in DataLab to store our query results from BigQuery into a variable. We then expose it as a Pandas DataFrame to take a look at the top results – you can run further data processing here.

%%bq query --name pics
SELECT 
id,upvotes,num_comments,title, CAST(TIMESTAMP_SECONDS(date_iso) AS DATETIME) AS dt
FROM `reddit.posts`
WHERE lower(subreddit) = 'r/pics'
ORDER BY dt desc
LIMIT 1000; 
%%bq query --name picsImg
SELECT description,count(*) as c,sum(score) as score FROM `reddit.images`
where (lower(subreddit) like '%pics') and score>0.7 
group by description
order by c desc
LIMIT 1000
import pandas as pd
import google.datalab.storage as storage
from google.datalab import Context
import google.datalab.bigquery as bq
import pandas as pd
from io import BytesIO

# Variables
project = Context.default().project_id
BUCKET='your-bucket'
  
# Get dataframe
df_data = pics.execute(output_options=bq.QueryOutput.dataframe()).result()
df_data.head(10)

df_img_data = picsImg.execute(output_options=bq.QueryOutput.dataframe()).result()
df_img_data_25 = df_img_data.head(25)
df_img_data.head(10)
DataLab

Next, we plot our results using our Pandas DataFrame from before.

df_plot_data = df_img_data_25[['description','c']]
df_plot_data.head(10)
ax = df_plot_data.plot(kind='bar',x='description',title='Top image labels')
ax.set_xlabel('Description')
ax.set_ylabel('Count')
Top labels in /r/pics

As we can see, apparently having a lot of (seemingly beautiful) sky in your pictures gets you far. Trees, water or girls seem to help as well.

We can also take a look at another popular subreddit in combination with /r/pics. As the unnamed subreddit (for SEO reasons…) is focussed on nature, we get a much broader result concerning “nature” related labels –

Popular Nature Labels

Finally, let’s look at how the Vision API labeled one of the top posts, “Almost slept through this amazing sunrise at Monument valley, was glad I went out anyway! USA (OC)[1920×1920]” by /u/crpytodesign with 30k+ upvotes and ID ae99d5a9e877f9ce84087516f1170c70.

https://redd.it/8q9m30

By simply running another %%bq directive, we can get the labels:

Example Image Labels

Last but not least, let’s generate a nice and simple wordcloud on DataLab using the wordcloud library:

Entity Wordcloud
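The cell behind it is roughly the following sketch; it assumes the wordcloud package is installed in the DataLab instance and re-uses the df_img_data_25 DataFrame from above:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Turn the label counts into frequencies for the wordcloud
frequencies = dict(zip(df_img_data_25['description'], df_img_data_25['c']))

wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(frequencies)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')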

Granted, this is not exactly an in-depth analysis – but it does illustrate the point on how to work with the data we gathered.

A word about cost

Google Cloud bills you based on usage and the services we are using scale automatically. While certain services are offered free of charge, it is still important to understand that the cost of the deployed solution will be based on your job’s efficiency and data volume.

Dataflow, for instance, has the following cost structure (as of June 2018):

https://cloud.google.com/dataflow/pricing

While you can control the maximum number of workers, an inefficient job will rack up costs fairly quickly.

BigQuery’s pricing is based on the data that your query processes –

https://cloud.google.com/bigquery/pricing

This means that an inefficient query that has to read a huge set of data will result in an equally huge bill. Keep in mind that LIMIT operations will not affect this, as it depends on the columns and resulting data volume that your query processes.

A similar concept applies to the ML APIs – just take a look at my billing data at the time of writing this article:

Vision API Pricing

Conclusion

While this exercise was simple in nature, it did illustrate certain key concepts –

  • How to utilize a fully-managed Cloud environment on Google Cloud
  • How to gather data and use Cloud Storage as a location for further processing
  • How to use Cloud Dataflow to process data without having to worry about scaling nodes and even be prepared for a streaming application
  • How to simply integrate powerful Machine Learning models
  • How to use resources on-demand
  • How to analyze data with simple notebooks and even chart the data

If we compare this effort to a “traditional” Hadoop approach, we spot some major differences and advantages –

  • Simple, non-demanding development environment
  • Existing tools and frameworks most of us are familiar with or at least can get familiar with very quickly (Python, Apache Beam <-> Spark/Storm…, DataLab <-> Zeppelin/Jupyter, BigQuery <-> Hive/Impala etc.)
  • Barely any effort to manage environments, nodes, scaling, high-availability and other administrative tasks
  • High throughput and performance without big optimization work

We also noticed some drawbacks and caveats to the approach –

  • Google Cloud is billed on-demand – while it does prove to lower the overall TCO of a data environment, it is easy to accidentally run expensive queries, start too many workers, or to rack up a high bill by being careless
  • We do lack the full control a completely Open Source solution might provide, given enough developer resources

Regardless of this – and other, more enterprise-level considerations – Google Cloud provided a great backend for a solution which would have been undoubtedly more complicated to realize using traditional Hadoop methodologies and tools.

In the next iteration of the article, we will run some more in-depth analysis and run a similar pipeline on other types of posts, such as text-based submissions.

All development was done under Fedora 27 4.16.13-200.fc27.x86_64 with 16 AMD Ryzen 1700 vCores @ 3.6Ghz and 32GiB RAM

Continue Reading

Analyzing Twitter Location Data with Heron, Machine Learning, Google’s NLP, and BigQuery

Introduction

In this article, we will use Heron, the distributed stream processing and analytics engine from Twitter, together with Google’s NLP toolkit, Nominatim, and some Machine Learning, as well as Google’s BigTable, BigQuery, and Data Studio, to plot Twitter users’ assumed locations across the US.

We will show how much your Twitter profile actually tells someone about you, how it is possible to map your opinions and sentiments to parts of the country without having the location enabled on the Twitter app, and how Google’s Cloud can help us achieve this.

About language, locations, social media, and privacy

While it is safe to assume that most Twitter users do not enable the Location Services while using the Social network, we can also assume that a lot of people still willingly disclose their location – or at least something resembling a location – on their public Twitter profile.

Furthermore, Twitter (for the most part) is a public network – and a user’s opinion (subtle or bold) can be used for various Data Mining techniques, most of which do disclose more than meets the eye.

Putting this together with the vast advances in publicly available, easy-to-use, cloud-driven solutions for Natural Language Processing (NLP) and Machine Learning (ML) from the likes of Google, Amazon, or Microsoft, any company or engineer with the wish to tap this data has more powerful tool sets at their disposal than ever before.

Scope of the Project

For this article, we will write a Heron Topology that does the following –

  • Read from Twitter, given certain seed keywords, filtering out users that do not disclose any location information, either as metadata or profile data
  • Use the Google NLP service to analyze the tweets and a user’s location data
  • Use Nominatim (based on OpenStreetMap data) to apply reverse-geocoding on the results
  • Use DBSCAN clustering with a Haversine distance metric to cluster our results
  • Write the results back to Google BigTable or BigQuery

Then, we’ll visualize the results with Data Studio.

Architecture

The architecture for this process is fairly simple:

Architecture (simplified)

Heron serves as our Stream processing engine and local ML, Nominatim on Postgres serves as Geo-Decoder.

On Google Cloud, we use the NLP API to enrich data, BigTable and BigQuery for storage and Data Studio for visualization.

BigTable (think HBase) is used for simple, inexpensive mass-inserts, while BigQuery is used for analytics. For the sake of simplicity, I’ll refer to one of my old articles which explains quite a bit about when to use BigTable/Hbase and when not to.

Hybrid Cloud

While the notion of “Hybrid Cloud” warrants its own article, allow me to give you an introduction to what it means in this context.

For this article, I heavily utilized the Google Cloud stack. The Google NLP API provides simple access to NLP libraries, without extensive research or complex libraries and training sets.

Google BigTable and BigQuery provide two serverless, powerful data storage solutions that can be easily implemented in your programming language of choice – BigTable simply uses the Apache HBase Interface.

Google Data Studio can access those Cloud-based sources and visualize them similarly to what e.g. Tableau can achieve, without the need to worry about the complexity and up-front cost that come with such tools (which doesn’t imply Data Studio can do all the things Tableau can).

At the same time, my Nominatim instance as well as my Heron Cluster still run on my local development machine. In this case, the reason is simply cost – setting up multiple Compute Engine and/or Kubernetes instances simply quickly exceeds any reasonable expense for a little bit of free-time research.

When we translate this into “business” terminology – we have a “legacy” system which is heavily integrated in the business infrastructure and the capital expense to move to a different technology does not warrant the overall lower TCO. Or something along those lines…

The following section describes the implementation of the solution.

Reading from Twitter

First off, we are getting data from Twitter. We use the twitter4j library for this. A Heron Spout consumes the data and pushes it down the line. We use a set of keywords defined in the conf.properties to consume an arbitrary topic.

Here, we ensure that a tweet contains location information, either from Twitter or via the user’s own profile.

if (location != null) {
    Document locDoc = Document.newBuilder()
            .setContent(location).setType(Type.PLAIN_TEXT).build();

    List<Entity> locEntitiesList = language.analyzeEntities(locDoc, EncodingType.UTF16).getEntitiesList();
    String locQuery = getLocQueryFromNlp(locEntitiesList);


    // Build query
    NominatimSearchRequest nsr = nomatimHelper.getNominatimSearchRequest(locQuery);

    // Add the results to a query, if accurate
    JsonNominatimClient client = nomatimHelper.getClient();
    ArrayList<String> reverseLookupIds = new ArrayList<>();
    List<Address> addresses = new ArrayList<>();
    if (!locQuery.isEmpty()) {
        addresses = client.search(nsr);
        for (Address ad : addresses) {
            logger.debug("Place: {}, lat: {}, long: {}", ad.getDisplayName(),
                    String.valueOf(ad.getLatitude()),
                    String.valueOf(ad.getLongitude()));

            Location loc = LocationUtil.addressToLocation(ad);

            // Filter out points that are not actual data points
            String osmType = ad.getOsmType();

            if (osmType != null && (osmType.equals("node") || osmType.equals("relation") ||
                    osmType.equals("administrative")) && loc.isWithinUSA()) {
                locList.add(loc);
                reverseLookupIds.add(loc.getReverseLookupId());
            }
        }
    }

You could re-route users that do not use location data to an alternate pipeline and still analyze their tweets for sentiments and entities.

Understanding Twitter locations

The next bolt applies the Google Natural Language Toolkit to two parts of the tweet: its content and the location.

For an example, let’s use my own tweet about the pointless “smart” watch I bought last year. If we analyze the tweet’s text, we get the following result:

(Granted, this isn’t the best example – I simply don’t tweet that much)

For this article, we won’t focus too much on the actual content. While it is a fascinating topic in itself, we will focus on the location of the user for now.

When it comes to that, things become a little more intriguing. When people submit tweets, they have the option to add a GPS location to their tweet – but unless you enable it, this information simply returns null. Null Island might be a thing, but not a useful one.

However, we can also assume that many users use their profile page to tell others something about themselves – including their location. As this data gets exposed by Twitter’s API, we get something like this:

While a human as well as any computer can easily understand my profile – it means Atlanta, Georgia, United States Of America, insignificant little blue-green planet (whose ape-descended lifeforms are so amazingly primitive that they still think digital smart watches are a great idea), Solar System, Milky Way, Local Group, Virgo Supercluster, The Universe – it’s not so easy for the more obscure addresses people put in their profile.

A random selection –

  • What used to be Earth
  • Oregon, (upper left)
  • in a free country
  • MIL-WI
  • Oban, Scotland UK 🎬🎥🎥🎬
  • United States
  • your moms house
  • Between Kentucky, Ohio
  • Savannah Ga.
  • Providece Texas A@M Universty
  • USA . Married over 50 yrs to HS Sweetheart
  • 24hrs on d street hustling

(All typos and emojis – the tragedy of the Unicode standard – were copied “as is”)

In case you are wondering, “your moms house” is in Beech Grove, IN.

The sample set I took most of this data from had only 25 entries and was parsed by hand. Out of those, 16 used ANSI standard INCITS 38:2009 for the state names, but in various formats.

A whole 24% used what I called “other” – like “your moms house”.

On a bigger scale, out of a sample set of 22,800 imported tweets, 15,630 (68%) had some type of location in their profile, but only 11 had their actual GPS location enabled.


For the time being, we can conclude that most Twitter users tell us something about their location – intentionally or not. For a more scientific approach, here’s a link – keep in mind that my data is a random selection.

However – using user-entered data always results in messy, fuzzy, non-structured data. This has been a problem way before the terms “Machine Learning” or “Analytics” exceeded any marketing company’s wildest dreams. Levenshtein-Distance record matching, anyone?

Using NLP to identify entities & locations

At this point, Google’s NLP toolkit comes into play again. We use the NLPT to get all locations from the user’s self-entered “place” to identify everything that has the LOCATION metadata flag.

This is simple for something like this:

“Oregon” is clearly the location we need. We were able to strip off “upper left” – and could even weigh this based on the specific salience value.

However, more obscure queries result in more random results:

But even here, we get the core of the statement – USA – with a high confidence level.

The more random place descriptions (such as “a free country”) naturally only produce low-quality results – which is why we should filter results with a low salience score. While this only means “relative to this set of text, this entity is relatively important/unimportant”, it does serve as a rough filter.

In order to use more standardized data, we can also use the wikipedia_url property of the NLP toolkit (if available) and extract a more useful string. This results in “Baltimore, MD” to be turned into “Baltimore Maryland”, for instance.

However, “Atlanta” turns into “Atlanta (TV Series)” – so use it with caution.
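The topology does this in Java with the client shown later on; purely for illustration, the same entity call in Python looks roughly like this (the salience cutoff of 0.1 is an arbitrary example, not the value used in the project):

from google.cloud import language

client = language.LanguageServiceClient()
doc = language.types.Document(content='Oregon, (upper left)',
                              type=language.enums.Document.Type.PLAIN_TEXT)

for entity in client.analyze_entities(document=doc).entities:
    # Keep only LOCATION entities that carry a minimum of relative importance
    if entity.type == language.enums.Entity.Type.LOCATION and entity.salience > 0.1:
        # Prefer the (more standardized) Wikipedia title when the API provides one
        print('{} ({}) {}'.format(entity.name, entity.salience,
                                  entity.metadata.get('wikipedia_url', '')))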


Reverse Geocoding & Clustering indecisive answers

In the next step, assuming we received unstructured location data, we try to convert that data into a query for a location service that supports reverse geocoding. Location services and their respective APIs are plentiful online – as long as a service is able to convert a query of location data into a set of potential matches, we will be able to use that data.

In this case, we are using the Nominatim client from OpenStreetMap. Given the data volume, it is advisable to host Nominatim on Kubernetes or your local development machine – the openstreetmap servers will block your IP if you accidentally DDoS them or simply don’t respect their Fair Use Policy – and the velocity of streaming tends to violate basically every Fair Use clause in existence.
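Querying a self-hosted instance boils down to a plain HTTP call against the standard search endpoint; a minimal Python sketch (host and port of the local instance are assumptions):

import requests

NOMINATIM_URL = 'http://localhost:8080/search'  # hypothetical local instance

def search(query, limit=20):
    params = {'q': query, 'format': 'json', 'limit': limit, 'countrycodes': 'us'}
    response = requests.get(NOMINATIM_URL, params=params)
    response.raise_for_status()
    # Each hit carries lat/lon, display_name, osm_type and an importance score
    return response.json()

for place in search('Springfield'):
    print('{} ({}, {}) importance: {}'.format(place['display_name'], place['lat'],
                                              place['lon'], place['importance']))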

OpenStreetMap will return a list of potential matches. Take this example when our location is “Springfield” and we limit the results to 20:

Springfield data points

As you can see, this is not conclusive. So we need to find a way to figure out which result is most accurate.

Fun Fact: Without a country boundary on the US with Nominatim, this is what “Detroit Michigan” produces:

Using clustering to approximate locations

In order to figure out where on the map our result is, we use Density-based spatial clustering of applications with noise (DBSCAN), a clustering algorithm that maps points by their density and also takes care of any outliers.

DBSCAN illustration

I found this article’s description of the algorithm most conclusive. Short version – for a dataset of n-dimensional data points, an n-dimensional sphere with the radius ɛ is defined, as well as the data points within that sphere. If the number of points in the sphere exceeds a defined number of min_points, a cluster is defined. For all points except the center, the same logic is applied recursively.

DBSCAN requires the ɛ parameter to be set to the maximum distance between two points for them to be considered part of the same neighborhood. In order to set this parameter to a meaningful value, we use the Haversine distance to get the orthodromic distance on a sphere – in our case, a rough approximation of the earth – and therefore a result in kilometers between locations.

The Haversine function is defined as

d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)

where

  • d is the distance between the two points (along a great circle of the sphere; see spherical distance),
  • r is the radius of the sphere,
  • φ1, φ2: latitude of point 1 and latitude of point 2, in radians
  • λ1, λ2: longitude of point 1 and longitude of point 2, in radians

In our case, r is defined as the radius of the earth, 6,371 km.

To combine those, we can use the org.apache.commons.math3.ml package. All we need to do is implement the DistanceMeasure interface (as well as the function for the Haversine distance itself).

public static List<Cluster<Location>> dbscanWithHaversine(ArrayList<Location> input) {
    DBSCANClusterer<Location> clusterer = new DBSCANClusterer<>(EPS, MIN_POINTS, new HaversineDistance());
    return clusterer.cluster(input);
}
public class HaversineDistance implements DistanceMeasure {
    @Override
    public double compute(double[] doubles, double[] doubles1) throws DimensionMismatchException {
        if (doubles.length != 2 || doubles1.length != 2)
            throw new DimensionMismatchException(doubles.length, doubles1.length);

        Location l1 = new Location("A", doubles[0], doubles[1],0,"N/A");
        Location l2 = new Location("B", doubles1[0], doubles1[1],0,"N/A");
        return MathHelper.getHaversineDistance(l1, l2);
    }
}
public static double getHaversineDistance(Location loc1, Location loc2) {
    Double latDistance = toRad(loc2.getLatitude() - loc1.getLatitude());
    Double lonDistance = toRad(loc2.getLongitude() - loc1.getLongitude());
    Double a = Math.sin(latDistance / 2) * Math.sin(latDistance / 2) +
            Math.cos(toRad(loc1.getLatitude())) * Math.cos(toRad(loc1.getLatitude())) *
                    Math.sin(lonDistance / 2) * Math.sin(lonDistance / 2);
    Double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    return R * c;
}

By putting 2 and 2 together, we get a DBSCAN implementation that accepts an ɛ in kilometers.

While choosing the parameters for the DBSCAN algorithm can be tricky, we use 150 km / 93 mi as the maximum radius of a cluster and assume that a single point is valid for a cluster. While this, in theory, produces a lot of noise, it is an accurate statement for clustering our location set.
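The Java code above plugs the Haversine measure into commons-math; the same idea expressed in Python with scikit-learn, just as a sketch (the coordinates are made up, and eps is expressed in radians by dividing the kilometer threshold by the earth radius):

import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0
EPS_KM = 150.0

# Made-up (lat, lon) pairs, converted to radians for the haversine metric
coords = np.radians([[39.78, -89.65], [42.10, -72.59], [43.29, -72.48]])

db = DBSCAN(eps=EPS_KM / EARTH_RADIUS_KM, min_samples=1, metric='haversine').fit(coords)
print(db.labels_)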

For interpreting our clusters (and choosing the one that is seemingly correct), we rely on the average “importance” value of OpenStreetMap, which aggregates multiple important metrics (from e.g., Wikipedia) to “score” a place.

If the user’s location contains part of a state as String (e.g., Springfield, IL), we increase the importance score during execution.

Springfield, data points


Springfield, avg. importance by cluster

In our Springfield example, Springfield, IL is the state capital of Illinois – and that is one of the reasons why the OpenStreetMap data ranks it higher than the entire Springfield, MA cluster (which consists of Springfield, MA, Springfield, VT, and Springfield, NH – despite their combined population being bigger than that of Illinois’ state capital).

Assuming we have multiple points in a cluster, we take the average between all coordinates in it. This is accurate enough in a ~90mi radius and results in the rough center of the (irregular) polygon that is our cluster.
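Averaging the winning cluster is then a matter of a few lines (a sketch; at this ~90 mi scale a plain arithmetic mean of latitudes and longitudes is good enough):

def cluster_center(points):
    # points: list of (lat, lon) tuples belonging to the chosen cluster
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return sum(lats) / len(lats), sum(lons) / len(lons)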

While I’ve just explained that Springfield, IL could be considered an accurate result, in order to illustrate the averaging, we simply remove Springfield, IL from the equation and our “best” cluster looks like this:

Sample Cluster for Springfield, MA

(The red dot in the middle is the average coordinate result)

Finally, we retrofit the calculated location data to a US state. For this, we have 2 options –

  • Call Nominatim again, resulting in another relatively expensive API call
  • Approximate the result by using a local list of rough state boundaries

While both methods have their pros and cons, using a geo provider will undoubtedly produce more accurate results, especially in cities like NYC or Washington, DC, where we have to deal with state borders close to any given point.

For the sake of simplicity and resource constraints, I’m using a singleton implementation of a GSON class that reads a list of US states with rough boundaries that I’ve mapped from XML to JSON.
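That lookup is essentially a point-in-bounding-box test; a rough sketch of the idea in Python (the file name and JSON structure are assumptions, not the actual GSON class used in the project):

import json

# Hypothetical file with rough state boundaries, e.g.
# [{"name": "Illinois", "min_lat": 36.97, "max_lat": 42.51, "min_lon": -91.51, "max_lon": -87.02}, ...]
with open('us_states_bounds.json') as f:
    STATES = json.load(f)

def state_for(lat, lon):
    for state in STATES:
        if state['min_lat'] <= lat <= state['max_lat'] and state['min_lon'] <= lon <= state['max_lon']:
            return state['name']
    return None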

In our case, the result is either New Hampshire or Illinois, depending on whether we remove Springfield, IL or not.

Other examples

Now, what tells us that somebody who states they are from “Springfield” simply likes the Simpsons?

Well, nothing. While it is advisable to store multiple potential location results and re-visit that data (or even use a different algorithm with a proper training set based on it), the architecture and algorithms work surprisingly well – some random profiling produced mostly accurate results, despite various input formats:

Sample Tweet locations across the US by quantity

(The big dot in the middle represents the location “USA”)

Original Location | Result | Accurate
Philadelphia | Philadelphia, Philadelphia County, Pennsylvania, United States of America | TRUE
Brooklyn, NY | BK, Kings County, NYC, New York, 11226, United States of America | TRUE
nebraska | Nebraska, United States of America | TRUE
United States | United States of America | TRUE
Truckee, CA | Truckee, Donner Pass Road, Truckee, Nevada County, California, 96160, United States of America | TRUE
Lafayette, LA | Lafayette, Tippecanoe County, Indiana, United States of America | TRUE
Minot, North Dakota | Minot, Ward County, North Dakota, United States of America | TRUE
Rocky Mountain hey! | Rocky Mountain, Harrisonburg, Rockingham County, Virginia, United States of America | TRUE
Living BLUE in Red state AZ! | Arizona, United States of America | TRUE
Earth Two, Darkest Timeline | Earth Tank Number Two, Fire Road 46, Socorro County, New Mexico, United States of America | FALSE
The Golden State | Golden, Jefferson County, Colorado, United States of America | FALSE
Atlanta, GA | Atlanta, Fulton County, Georgia, United States of America | TRUE
thessaloniki | Thessaloniki Jewelry, 31-32, Ditmars Boulevard, Steinway, Queens County, NYC, New York, 11105, United States of America | FALSE
newcastle | Newcastle, Placer County, California, 95658, United States of America | TRUE
Afton, VA | Afton, Lincoln County, Wyoming, 83110, United States of America | FALSE
Gary, IN / Chicago, IL | Chicago, Cook County, Illinois, United States of America | TRUE
Canada | Canada, Pike County, Kentucky, 41519, United States of America | FALSE
Southern California | Southern California Institute of Architecture, 960, East 3rd Street, Arts District, Little Tokyo Historic District, LA, Los Angeles County, California, 90013, United States of America | TRUE
San Francisco Bay Area | San Francisco Bay Area, SF, California, 94017, United States of America | TRUE
Southern CA | Southern California Institute of Architecture, 960, East 3rd Street, Arts District, Little Tokyo Historic District, LA, Los Angeles County, California, 90013, United States of America | TRUE

(All of these results come with latitude and longitude, state data and the full user profile and tweet metadata)

More importantly, tuning those results is just an exercise in careful profiling. We could filter out obvious countries that are not the US, tune the model parameters or the API calls.

Tweet locations across the US by avg. importance

(In this example, big points indicate a high confidence level; often points in the geographical center of a state hint that the user simply said they were from e.g. “Arizona”, “AZ” or “Living BLUE in Red state AZ!”)

Summary

While the example shown here is a simple proof of concept, extending the concept has plenty of opportunities –

  • Fine-tune the model, filtering obvious outliers
  • Build a data model that connects location data with the tweets, all other available metadata
  • Store multiple salience values per analysis, tuning the model based on the data
  • Run the topology at scale and apply it to all tweets concerning a certain topic, analyzing the big picture and accounting for false positives
  • Run regression or other analysis over users’ entries with the same ID and potential mismatches, tracking changes in the location; write a pipeline that flags users who enable their location and retrofit all old results to an accurate GPS position
  • Store users without location data and apply the above logic to those

One thing we can conclude is that using a combination of well-known, powerful local Big Data tools in combination with managed, equally powerful Cloud solutions opens the door for a massive variety of new analytics opportunities that required a much higher level of involvement and cost only a few years ago.

The next steps will be to combine this data with the actual location results, create heatmaps, fine-tune the model, and eventually move the whole solution to the Google Cloud Platform.

Sample entity analysis

All development was done under Fedora 27 with 16 AMD Ryzen 1700 vCores @ 3.2Ghz and 32GiB RAM. Nominatim planet data from 2018-03-15 was stored on a 3TB WD RED Raid-1 Array

Software Stack: Heron 0.17.6, Google Cloud Language API 1.14.0, Google Cloud BigTable API 1.2.0, Google Cloud BigQuery API 0.32.0-beta, Google Data Studio, Nominatim 3.1.0, PostgreSQL 9.6.8

Continue Reading