June 05, 2017

Integrating Algorithmia with Apache Spark

Intro Slide Algorithmia and Spark

A couple of weeks ago we gave a talk at the Seattle Spark Meetup about bringing together the flexibility of Algorithmia's deep learning algorithms and Spark's robust data processing platform. We highlighted the strengths of both platforms and gave a basic introduction to integrating Algorithmia with Spark's Streaming API. In the talk you'll see how we went from use case idea to implementation in only a few lines of code.

Below are our slides from our Seattle Spark Meetup. Let us know what you think @Algorithmia or @Platypii.

Algorithmia Intro

Algorithmia is a marketplace for algorithms and microservices, created to bridge the gap between those who develop algorithms and those who want to consume them. We bring together university research teams that are developing complex algorithms and the developers who want to add artistic filters to their image apps, use video processing algorithms to transform videos into data, or keep their website family-friendly with nudity detection.

Our platform hosts these algorithms, functions and models and makes them easy to chain together to create complete product solutions that are available via an API. 

Algorithms as building blocks

We have more than 3,000 algorithms and microservices available via an API endpoint.

We are language agnostic and have 14 language clients available to access any algorithm on the platform via our API.

Our algorithms include everything from utility functions to complete machine and deep learning microservices that are all available in a few lines of code.

Spark slide - why we love it

Spark's Machine Learning Library:

  • Spark’s MLlib gives users the ability to apply analytics to streaming data such as click-stream data.
  • With standard statistical libraries and the ability to operate not only on RDDs but now also on DataFrames and Datasets, the machine learning library has become more accessible to users.
  • Contains building blocks for recommender systems, such as collaborative filtering.

Spark Streaming:

  • Can pull from several different data sources: Kafka for messaging data, Flume and S3 for web server log files, Twitter for social media, or any custom TCP connection, making it easy and flexible to consume and integrate data.
  • Access to map, reduce, join and other functions that make data processing easier.
  • You can apply MLlib, Spark’s graph API, or any third-party API such as Algorithmia to data streams.

SQL-based operations:

  • A lot of data processing and analysis can be done with Spark SQL, which lets you perform SQL operations on different data sources.

However, some tasks aren’t easy to do on Spark, such as using GPU environments for deep learning projects.

How Algorithmia Enhances Spark

While Spark has many time-proven machine learning algorithms such as K-Means, LDA, and Random Forest, Algorithmia provides machine and deep learning algorithms as microservices that offer an end-to-end solution.

Even though it is possible to do deep learning on Spark, it isn't as common as you might think: GPUs aren't readily available on Spark, and there is a noted lack of tutorials from developers who have used GPUs on Spark. Here is an interesting conversation about deep learning using Spark that highlights why you might want to try deep learning on Spark and why you may want to try other solutions.

If you don't want to set up deep learning libraries on each of your Spark clusters or train your own models, you can use Algorithmia’s API to run several different types of image processing or other GPU-dependent machine learning algorithms, such as colorizing videos.

Spark excels at scaling data processing, but for the computationally intensive part of the pipeline, Algorithmia's on-demand GPUs and the many deep learning algorithms available make it a solid choice for deep learning.

Algorithmia and Spark diagram pipeline

To give context on how to use Algorithmia in Spark, we took the Spark pipeline diagram and added our piece of the puzzle into the pipeline.

Whether you’re getting data from Kafka, TCP sockets, Flume, Twitter, or S3, you can handle large data streams using Spark Streaming. Spark divides the data into batches, and you can pipe the live streaming input into an Algorithmia algorithm: a few lines of code call the API and return your results as a JSON-formatted blob. Finally, you can perform any number of Spark's SQL-like operations on your results.

Fashion classification algorithm on Algorithmia using Twitter data

Using Spark and Algorithmia, we'll go through an image classification use case using Twitter for our data stream.

First we got tweets using the TwitterUtils library and then grabbed the images from each tweet that fell under our target queries.

Next we piped the streaming data into an algorithm using Algorithmia’s API, calling the Deep Fashion algorithm which classifies the articles of clothing in an image.

Then we simply counted the clothing articles that were found, reducing them by key to show how you would go about using Spark and Algorithmia for a social media marketing campaign.
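
The first of these steps, picking out image URLs from tweets that match the target hashtags, can be sketched in plain Scala with no Spark or Twitter dependency. The `Tweet` case class and the `imageUrls` helper below are hypothetical simplifications made up for this illustration; the project itself reads tweets through TwitterUtils, whose status objects carry many more fields.

```scala
object TweetFilter {
  // Hypothetical, simplified tweet shape used only for this sketch.
  final case class Tweet(text: String, mediaUrls: Seq[String])

  // True when the tweet text contains at least one target hashtag (case-insensitive).
  // Matching is on whitespace-separated words, so "#OOTD!" would not match.
  def matches(tweet: Tweet, hashtags: Set[String]): Boolean = {
    val words = tweet.text.toLowerCase.split("\\s+").toSet
    hashtags.exists(tag => words.contains("#" + tag.toLowerCase))
  }

  // Keep matching tweets, then flatten their media URLs into one sequence.
  def imageUrls(tweets: Seq[Tweet], hashtags: Set[String]): Seq[String] =
    tweets.filter(matches(_, hashtags)).flatMap(_.mediaUrls)
}
```

Tweets that match a hashtag but carry no media simply contribute nothing to the result, so only image URLs move on to the classification step.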

Real-time marketing slide

Say you have an online catalog of clothes and you want to promote certain items on social media. With the combination of Algorithmia and Spark, you can reach new customers by recognizing in real time what people are looking at on Twitter under popular fashion hashtags such as “OOTD” (“Outfit Of The Day”) or “fashionblogger”. You can then serve tweets for the clothing items you want to promote, based on which clothing articles are being tweeted about most often during some time interval.

Deep fashion algorithm image

The Deep Fashion algorithm detects clothing items in images, returns a list of the clothing articles it discovered, and annotates the input image with a bounding box for each article of clothing found.

This is perfect for when you have hundreds of images that would otherwise have to be labeled by hand or when you want to discover what people are wearing as in our social media campaign use case.

Deep fashion algorithm description

The Deep Fashion algorithm is based on the faster-rcnn project found on GitHub, which in turn was inspired by the paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, available on Cornell University's arXiv.

To learn about any algorithm on Algorithmia, all you need to do is check out the algorithm’s description page where you can find details about the inputs, outputs, examples, credits and even permissions and pricing.

Algorithm Pricing page

Here’s an example of the pricing tab where you can find out the estimated cost of running the algorithm and its permissions.
Next is an example of how to call the Deep Fashion algorithm in a few lines of code.

Scala code for calling algorithm on Algorithmia

The above slide shows how to call an algorithm using the Algorithmia API in Scala. Simply import the Algorithmia client in the language of your choice, then create the client object and specify the algorithm using the algorithm path and version that is found at the bottom of each algorithm description page. Then pipe in your data, which in this case is a URL path, and use the .as method to turn the result into the data type you would like to work with.
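
As a rough illustration of those four steps, here is a sketch that runs offline. The `Client`, `Algo`, and `Response` types are stand-ins written for this example and are not the Algorithmia client classes; only the step names (client object, algorithm path, pipe, `.as`) come from the description above. The API key, the path `owner/AlgorithmName/x.y.z`, and the canned comma-separated result are placeholders and assumptions.

```scala
object CallFlowSketch {
  // Stand-in types for illustration only; these are NOT the Algorithmia client classes.
  final case class Response(body: String) {
    // Plays the role of the `.as` step: turn the raw result into the type you want.
    def as[T](convert: String => T): T = convert(body)
  }
  final class Algo(path: String, backend: (String, String) => String) {
    def pipe(input: String): Response = Response(backend(path, input))
  }
  final class Client(apiKey: String, backend: (String, String) => String) {
    def algo(path: String): Algo = new Algo(path, backend)
  }

  // Fake backend: returns a canned label list so the sketch runs without a network call.
  val fakeBackend: (String, String) => String = (_, _) => "skirt,blouse"

  def labelsFor(imageUrl: String): List[String] = {
    val client = new Client("YOUR_API_KEY", fakeBackend) // 1. create the client object
    val algo = client.algo("owner/AlgorithmName/x.y.z")  // 2. algorithm path and version (placeholder)
    val response = algo.pipe(imageUrl)                   // 3. pipe in the data, here a URL
    response.as(_.split(",").toList)                     // 4. convert the result to a chosen type
  }
}
```

In real code the stand-ins would be replaced by the client library's own classes, and the path would be the one listed at the bottom of the algorithm's description page.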

Deep Fashion input code

Here are the JSON-formatted inputs that are available for the Deep Fashion algorithm.

The image string is required while the rest of the fields are optional:

  • “Output”: the output data connection path (hosted files on Algorithmia).
  • “Threshold”: the minimum confidence for a label to be returned.
  • “Tags_only”: set this if you only want the tags returned instead of both the tags and annotated images.
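
A minimal sketch of assembling such an input with one required field and three optional ones might look like the following. The lowercase key spellings (`image`, `output`, `threshold`, `tags_only`) are an assumption, as is the helper itself; check the algorithm's description page for the exact field names. The escaping here covers only backslashes and double quotes, so a JSON library would be the safer choice in production.

```scala
object DeepFashionInput {
  // Escape backslashes and double quotes so the value is a valid JSON string
  // (control characters are not handled in this sketch).
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Build the input object; optional fields are left out entirely when not set.
  def toJson(
      image: String,
      output: Option[String] = None,
      threshold: Option[Double] = None,
      tagsOnly: Option[Boolean] = None
  ): String = {
    val fields = Seq(
      Some("\"image\": " + quote(image)),
      output.map(o => "\"output\": " + quote(o)),
      threshold.map(t => "\"threshold\": " + t.toString),
      tagsOnly.map(b => "\"tags_only\": " + b.toString)
    ).flatten
    fields.mkString("{", ", ", "}")
  }
}
```

Calling `DeepFashionInput.toJson("https://example.com/look.jpg")` yields an object with only the image field, which matches the rule that everything else is optional.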

Deep Fashion algorithm outputs

Here is a sample of the labels taken from a much larger JSON output object, which also includes the coordinates for each label:

  • “Shorts”
  • “Blouse”
  • “Tank top”
  • “Sun glasses”
  • “Top handle bag”

Note that the output also includes a file containing the original image with a bounding box for each item (only the headscarf bounding box is shown, for clarity).

Scala Spark code for calling Algorithmia

The above contains the majority of the code for this project, which you can also find on GitHub and run yourself.

Basically, we use TwitterUtils to create a data stream, then get the media URL from each tweet that contains the hashtags we want. Next we pipe that data into the Deep Fashion algorithm and get the articles found in each image. Finally we create tuples of article names and counts, sum them within each partition, and reduce the partial sums to unique article names with their total counts.
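
The two-stage counting at the end of that description can be sketched with ordinary Scala collections standing in for Spark partitions. The object and function names are made up for this illustration, and the nested `Seq` is only a stand-in for partitioned data; in Spark the same idea is what a reduce-by-key over (article, 1) pairs gives you.

```scala
object ArticleCounts {
  // Stage 1: count article names inside a single partition.
  def countPartition(articles: Seq[String]): Map[String, Int] =
    articles.groupBy(identity).map { case (name, hits) => name -> hits.size }

  // Stage 2: merge two partial count maps, summing counts for shared keys.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (name, n)) =>
      acc.updated(name, acc.getOrElse(name, 0) + n)
    }

  // Count per partition first, then combine the partial results.
  def countAll(partitions: Seq[Seq[String]]): Map[String, Int] =
    partitions.map(countPartition).foldLeft(Map.empty[String, Int])(merge)
}
```

Summing inside each partition first keeps the amount of data that has to be combined across partitions small, since only one (name, count) pair per article leaves each partition.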

Code output Scala

Here are the counts that were retrieved in a sample data stream with the hashtags “fashionblogger” and “Outfit of the day” using the code from the previous slide.

Coming back to our marketer's story, say you grab the outputs every 15 minutes or so and see that there are more skirts posted than any other article of clothing.

You can then pull the marketing material you want to promote from your database or other data source, and tweet it with the hashtags #skirt, #fashionblogger, and #OOTD, among other tags that your target audience follows.
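
A small sketch of that decision step, assuming the counts for one time window are already available as a map, could look like this. The `Promotion` object, its function names, and the default tag list are hypothetical; ties are broken alphabetically purely to keep the result deterministic.

```scala
object Promotion {
  // Most frequent article in the window, or None when nothing was detected.
  // Ties on count are broken by article name.
  def topArticle(counts: Map[String, Int]): Option[String] =
    if (counts.isEmpty) None
    else Some(counts.toSeq.sortBy { case (name, n) => (-n, name) }.head._1)

  // Build the hashtag string for a tweet; spaces are removed from the article name.
  def hashtags(article: String, extra: Seq[String] = Seq("fashionblogger", "OOTD")): String =
    (article.replace(" ", "") +: extra).map("#" + _).mkString(" ")
}
```

For a window where skirts lead the counts, `Promotion.topArticle(counts).map(a => Promotion.hashtags(a))` gives `Some("#skirt #fashionblogger #OOTD")`, ready to append to the promotional tweet.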

Nordstrom scarf

If this were Nordstrom, for example, and they found that scarves were the most tweeted-about item during a time period, they could in turn tweet an image of a scarf they are trying to sell, along with a link to buy it and popular fashion hashtags.

Image similarity algorithm

Another possibility for using algorithms to better target your marketing campaigns is to match an image based on its similarity to another image rather than matching specific articles of clothing. You can use an Image Similarity algorithm to find images that are similar in appearance to ones in your product catalog, and the only thing you would have to change is the algorithm path in your code.

Image Classification algorithms

All of these algorithms are available via a standardized API and integrate with Spark seamlessly. 

The next slide will showcase the Emotion Recognition algorithm to see what else you can do with image classification algorithms.

Emotion Recognition algorithm output

Let's move away from our social media campaign for a minute and look at another use case. For instance, if you were interested in people’s reactions in a usability study, you could apply the Emotion Recognition algorithm to your images, or to videos using one of our video processing algorithms. Or maybe you’re an online publication with a database full of images that you want categorized for easy retrieval later. To tag thousands of photos quickly, you could use Algorithmia and Spark without having to set up a GPU environment yourself.

The Emotion Recognition algorithm takes an image path and returns the bounding box of each detected face along with the emotion of that face and its confidence. If more than one face is detected, it returns the emotion of each person in the photo.
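
For the usability-study scenario, per-face results like these could be tallied as in the sketch below. The `Box` and `FaceEmotion` case classes are a hypothetical shape modelled on the description above, not the algorithm's actual JSON, and the confidence cut-off is an illustrative parameter.

```scala
object EmotionSummary {
  // Hypothetical result shape based on the prose description, not the real output format.
  final case class Box(left: Int, top: Int, right: Int, bottom: Int)
  final case class FaceEmotion(box: Box, emotion: String, confidence: Double)

  // Count how many faces show each emotion, ignoring low-confidence detections.
  def tally(faces: Seq[FaceEmotion], minConfidence: Double): Map[String, Int] =
    faces
      .filter(_.confidence >= minConfidence)
      .groupBy(_.emotion)
      .map { case (emotion, hits) => emotion -> hits.size }
}
```

Running the tally over every frame or photo from a session gives a simple distribution of reactions that can be compared across study participants.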

Algorithmia Spark package

Here is the Algorithmia Spark package to get you started with any of the more than 3,000 algorithms available on Algorithmia.

Note that you get 5,000 free credits every month on Algorithmia, which is plenty to get started.

We'd love to see what you build using Spark and Algorithmia, and if you need some inspiration, check out our use case gallery.

