October 19, 2017

Adding multilingual support to any algorithm: pre-translation in NLP

We often get asked whether we're planning to add any non-English NLP algorithms. As much as we would love to train NLP models on other languages, there aren't many usable training datasets in these languages. And, due to the linguistic structure of these languages, training with pre-existing approaches doesn't always give the best results.

Until better training sets can be generated, one passable solution is to translate the text to English before sending it to the algorithm.

In order to make it easier to integrate language translation within your algorithms, we've added Google Translate as a wrapper in our marketplace. We're going to look at the pros and cons of pre-translation in NLP algorithms.

Use Case 1: Social Media Image Recommender

Last month, we launched a microservice for finding the best social-share image: Social Media Image Recommender. This algorithm helps content creators automate the process of picking the best image to share on social media alongside their article.

In order to determine the similarity between the article's text and the images' tags, the microservice makes use of word2vec. However, word2vec only works on English text -- so we added Google Translate, detecting and translating non-English content before sending it to word2vec.

Now the algorithm works on all of the human languages supported by Google Translate. This is awesome, because most NLP tools don't work on other languages, but here we were able to support a lot of them just by adding a few simple lines of code to pre-translate the text.
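
Those few lines amount to a thin wrapper around the English-only step. Here is a minimal sketch of the idea, not the recommender's actual code: `detect_language` and `translate_to_english` are hypothetical callables standing in for whatever translation service you use, and the toy functions below exist only so the example runs offline.

```python
def pretranslate(text, detect_language, translate_to_english):
    """Return English text, translating only when the input is not already English.

    Both callables are hypothetical stand-ins for a real translation service.
    """
    if detect_language(text) == "en":
        return text
    return translate_to_english(text)


def run_english_only(algorithm, text, detect_language, translate_to_english):
    """Pre-translate the input, then hand it to an English-only algorithm."""
    return algorithm(pretranslate(text, detect_language, translate_to_english))


# Toy stand-ins (assumptions for illustration); a real service replaces these.
_TOY_DICTIONARY = {"Me gustan los aguacates": "I like avocados"}


def toy_detect(text):
    return "es" if text in _TOY_DICTIONARY else "en"


def toy_translate(text):
    return _TOY_DICTIONARY[text]


def toy_word_count(english_text):
    # Stands in for any algorithm that only understands English input.
    return len(english_text.split())


print(run_english_only(toy_word_count, "Me gustan los aguacates", toy_detect, toy_translate))
```

Passing the detector and translator in as arguments keeps the wrapper independent of any one translation provider, so the downstream algorithm never needs to know the text was translated.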

Use Case 2: Sentiment Analysis

Algorithmia's Guide to Sentiment Analysis Algorithms

Like most other NLP algorithms, Sentiment Analysis works well on English because the majority of NLP research has been done on the English language. This is especially true for the Sentiment Analysis algorithm, because it relies heavily on a model that was trained on a golden dataset.

For this popular algorithm, we've added the option of specifying the source language, or letting the algorithm automatically detect it. This allows the algorithm to work on a translated version of the text, which might not yield perfect results, but still works fairly well considering that we merely pre-translated the text.


While pre-translation works for many algorithms, there's one important requirement: the output must be independent of the input. What does that mean?

Sentiment Analysis is an example that follows this rule. Regardless of the input, the output is a number within a fixed range, and it doesn't return any part of the input (such as extracted words or phrases). Pre-translating the input might not give perfect results, but in a world that doesn't have good NLP tools for other languages, it works pretty well.

This would also work for Named Entity Recognition, but not quite as well: NER returns parts of the original text back to the user. We could pre-translate the text, detect the entities, and translate each entity back into the original language along with its corresponding entity tag. The double translation may cause information loss, and might return a completely different word in the output. This is not ideal, and for the sake of consistent outputs it is why pre-translation is not recommended for semi-independent NLP algorithms.
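
To make the round-trip problem concrete, here is a small sketch under stated assumptions. The translators and the entity finder are hypothetical toy functions, not a real NER model or translation API; they only mimic the case where a translation service does not map a word back to the form it started from.

```python
def round_trip_entities(text, to_english, find_entities, from_english):
    """Pre-translate, extract entities from the English text, then translate each entity back."""
    english = to_english(text)
    return [(from_english(entity), tag) for entity, tag in find_entities(english)]


# Toy stand-ins (assumptions for illustration only).
def toy_to_english(text):
    return text.replace("EE. UU.", "USA").replace("Vivo en", "I live in")


def toy_find_entities(english):
    return [("USA", "LOCATION")] if "USA" in english else []


def toy_from_english(entity):
    # The back-translation picks the long form, not the abbreviation the author wrote.
    return {"USA": "Estados Unidos"}.get(entity, entity)


original = "Vivo en EE. UU."
entities = round_trip_entities(original, toy_to_english, toy_find_entities, toy_from_english)
for entity, tag in entities:
    print(entity, tag, "found in original:", entity in original)
```

The entity that comes back is a valid translation, but it is not a substring of the original text, so a caller cannot highlight or look it up in the input.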

A completely non-independent example would be Parsey McParseface. This algorithm breaks a given sentence down into a structured format in which each word is tagged with a part-of-speech tag (noun, verb, etc.). Double translation would make the structured response incoherent, because languages generally have different grammatical structures.
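
A tiny illustration of that mismatch, using Spanish and English word order. The tags are hard-coded to show what a tagger would assign to the English sentence; this is not Parsey McParseface output, just a sketch of why position-based mapping breaks.

```python
original_tokens = ["la", "casa", "blanca"]      # Spanish: determiner, noun, adjective
translated_tokens = ["the", "white", "house"]   # English: determiner, adjective, noun

# Tags for the English sentence, hard-coded here for illustration.
english_tags = ["DET", "ADJ", "NOUN"]

# Naively carrying the tags back by position mislabels the original words:
# "casa" (a noun) receives ADJ and "blanca" (an adjective) receives NOUN.
naive_alignment = list(zip(original_tokens, english_tags))
print(naive_alignment)
```

Fixing this would require a word-level alignment between the two sentences, which a plain translation call does not provide.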

Code Example

  1. First, create a free account on Algorithmia.
  2. After creating your account, go to your profile page and navigate to the Credentials tab. There you will find your API key. Copy this key, and use it instead of "YOUR_API_KEY" in the code below.
  3. Next, install the Python Algorithmia client using the command "pip install algorithmia".
  4. Now, by running the code snippet (in Python) below, we'll be able to translate text on the fly!

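[code python]
import Algorithmia

# Replace YOUR_API_KEY with the key from your Credentials tab
client = Algorithmia.client("YOUR_API_KEY")

algo_input = {
    "action": "translate",
    "text": "Me gustan los aguacates"
}

# Call the Google Translate wrapper and read the translated string from the result
translated_text = client.algo("translation/GoogleTranslate/0.1.1").pipe(algo_input).result["translation"]

# Prints: I like avocados
print(translated_text)
[/code]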


Pre-translation is a valuable tool for supporting multiple languages in NLP algorithms. It is a practical stopgap until reliable training datasets become available or better tools are developed. Until then, we have to get creative with our approaches to machine learning problems.

If you've ever used pre-translation in your own NLP algorithms, please let us know @algorithmia on Twitter!
