October 26, 2017

Deep Dive into Object Detection with Open Images, using Tensorflow

The new Open Images dataset gives us everything we need to train computer vision models, and just happens to be perfect for a demo! Tensorflow's Object Detection API and its ability to handle large volumes of data make it a perfect choice, so let's jump right in...

pears and apples

Open Images is a dataset created by Google that has a significant number of freely licensed annotated images. Initially it contained only classification annotations, or in simpler terms it had labels that described what, but not where. After a major version update to 2.0, more annotations were added; of particular importance was the introduction of object detection annotations. These new annotations describe not only what is in a picture, but where it is located, by defining the bounding box (bbox) coordinates for specific objects in an image.


The object detection dataset consists of 545 trainable labels. These labels cover everything from Bagels to Elephants - a major step up compared to similar datasets such as the Common Objects in Context (COCO) dataset, which contains only 90 labels. Not only that, but the labels in Open Images are organized in a hierarchy. This means it's even possible to create specialist classifiers for individual subsections of the whole dataset, wow!
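To make the idea of a specialist subset concrete, here is a minimal sketch of walking a label hierarchy. The `HIERARCHY` dictionary and its label names are invented for illustration; the real dataset uses its own encoded label identifiers and its own hierarchy definition.

```python
# Hypothetical, hand-written hierarchy: parent label -> list of child labels.
# These names are for illustration only and are not taken from Open Images.
HIERARCHY = {
    "Food": ["Fruit", "Baked goods"],
    "Fruit": ["Apple", "Pear"],
    "Baked goods": ["Bagel"],
    "Animal": ["Elephant"],
}


def descendants(label, hierarchy):
    """Return the label plus every label below it, depth first."""
    found = [label]
    for child in hierarchy.get(label, []):
        found.extend(descendants(child, hierarchy))
    return found


# Labels a "fruit specialist" detector would be trained on.
fruit_labels = descendants("Fruit", HIERARCHY)
print(fruit_labels)
```

Filtering the annotations down to `fruit_labels` before training would then give you a detector that only knows about that branch of the hierarchy.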

This tutorial describes in detail how to create your own object detector trained on the Open Images dataset, and how to export it to the Algorithmia marketplace.

Before we go any further, we should let you know about some caveats regarding this demo.


  • This deep dive tutorial assumes that you have a good working knowledge of git, python, bash, and conventional linux operations. Our example is written strictly for the debian/linux operating system environment; however, with some tweaking it should work for most other environments.

  • The complete dataset is ~6.2 TB downloaded and uncompressed. You might want to tweak our image downloader to resize images as they come in.

  • The Tensorflow framework is super memory hungry. It expects to have sufficient host memory to run, otherwise it will crash with difficult-to-decipher exceptions. It's recommended to have at least 32 GB of RAM, although you can use scratch space instead.

  • The Open Images dataset is comprehensive and large, but many of its classes are unbalanced, which affects our precision on underrepresented classes. As this is an introductory tutorial, we leave more comprehensive dataset improvements such as SMOTE to the reader.
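If you want a quick feel for how unbalanced a label set is before training, counting boxes per class is a reasonable first step. The sketch below uses a made-up list of labels standing in for the real annotations.

```python
from collections import Counter

# Made-up annotation labels standing in for the real bounding box annotations.
labels = ["Person", "Person", "Person", "Person", "Car", "Car", "Bagel"]

counts = Counter(labels)
most_common_label, most_common_count = counts.most_common(1)[0]

# Ratio of the largest class to each class; large values flag rare classes.
imbalance = {label: most_common_count / count for label, count in counts.items()}

for label, ratio in sorted(imbalance.items(), key=lambda kv: kv[1], reverse=True):
    print("%s: %d boxes (largest class is %.1fx bigger)" % (label, counts[label], ratio))
```

Classes with a very high ratio are the ones most likely to show poor precision, and are the natural candidates for resampling.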

All of the scripts and files we describe in this tutorial can be found in our open images github repository.

Still with us? Great, let's get started.

Tensorflow Object Detection

The Tensorflow project has a number of quite useful framework extensions; one of them is the Object Detection API.

As the name suggests, the extension enables Tensorflow users to create powerful object detection models using Tensorflow's directed compute graph infrastructure. It's crazy powerful, but a little difficult to use as the documentation is a bit light. In this article we'll walk you through each step and describe why.

Step 1: Formatting your data

The Open Images dataset is separated into a number of components:

  • the image index file
  • the bounding box annotations file
  • class descriptions
  • trainable classes files.

[code bash]
#!/usr/bin/env bash
# downloads and extracts the openimages bounding box annotations and image path files
mkdir data
wget http://storage.googleapis.com/openimages/2017_07/images_2017_07.tar.gz
tar -xf images_2017_07.tar.gz
mv 2017_07 data/images
rm images_2017_07.tar.gz

wget http://storage.googleapis.com/openimages/2017_07/annotations_human_bbox_2017_07.tar.gz
tar -xf annotations_human_bbox_2017_07.tar.gz
mv 2017_07 data/bbox_annotations
rm annotations_human_bbox_2017_07.tar.gz

wget http://storage.googleapis.com/openimages/2017_07/classes_2017_07.tar.gz
tar -xf classes_2017_07.tar.gz
mv 2017_07 data/classes
rm classes_2017_07.tar.gz

In the Open Images dataset, all data is formatted in the CSV format. CSV is great for having a low footprint and easy for spreadsheets to parse. However, as a format it isn't very human readable and there are other alternatives that are easier to work with programmatically. For these reasons we decided to convert our annotations and images files into JSON, so we can work with them in a simpler fashion.
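As a small, self-contained illustration of the CSV-to-JSON idea, the sketch below converts an in-memory CSV string. The column names and values are illustrative assumptions, not copied from the real files.

```python
import csv
import io
import json

# A tiny in-memory stand-in for an annotations CSV. The column names and
# values here are illustrative, not copied from the real files.
csv_text = """ImageID,LabelName,XMin,XMax,YMin,YMax
img001,/m/0abc,0.10,0.50,0.20,0.80
img002,/m/0xyz,0.00,0.25,0.30,0.60
"""

# DictReader uses the header row as keys, so every row becomes a dict.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values are always strings; convert the coordinates to floats.
for row in rows:
    for key in ("XMin", "XMax", "YMin", "YMax"):
        row[key] = float(row[key])

json_text = json.dumps(rows, indent=2)
print(json_text)
```

Once the rows are dictionaries, filtering and grouping them becomes ordinary Python, which is the main reason for the conversion.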

It should also be mentioned that the annotations file contains 600 different labels, but only 545 of them are strictly trainable. We're going to need to cross-reference with the trainable_classes.txt file to filter out everything except the trainable labels.

The image index file contains the image url and ID for every image in the entire dataset, even images that don't contain bbox annotations!

source file

Translating Class Definitions

The trainable_classes.txt file contains encoded labels, which is totally fine for training but can be a headache during evaluation. Let's quickly use the class_descriptions.csv file to create a translated trainable classes file.

[code python]
import argparse
import csv
import json


# Map each encoded trainable label to its human-readable description.
def translate_class_descriptions(trainable_classes_file, descriptions_file):
    with open(trainable_classes_file, 'r') as file:
        trainable_classes = file.read().replace(' ', '').split('\n')
    description_table = {}
    with open(descriptions_file) as f:
        for row in csv.reader(f):
            if len(row):
                description_table[row[0]] = row[1].replace("\"", "").replace("'", "").replace('`', '')
    output = []
    for elm in trainable_classes:
        # skip empty lines, look up everything else
        if elm != '':
            output.append(description_table[elm])
    return output


def save_classes(formatted_data, translated_path):
    with open(translated_path, 'w+') as f:
        json.dump(formatted_data, f)

And the main procedure, which parses the arguments and calls our functions:

[code python]
parser = argparse.ArgumentParser()
parser.add_argument('--trainable_classes_path', dest='trainable_classes', required=True)
parser.add_argument('--class_description_path', dest='class_description', required=True)
parser.add_argument('--trainable_translated_path', dest='trainable_translated_path', required=True)

if __name__ == '__main__':
    args = parser.parse_args()
    trainable_classes_path = args.trainable_classes
    description_path = args.class_description
    translated_path = args.trainable_translated_path
    translated = translate_class_descriptions(trainable_classes_path, description_path)
    save_classes(translated, translated_path)

As you can see, we perform a simple lookup (skipping empty lines) for each element, keeping the same order as the original trainable_classes.txt file. This will help us considerably when it comes time for evaluation and inference, so it's good that we got it out of the way first.

source file

Formatting Metadata

Let's first format our annotations file. We can do that by translating our CSV rows into JSON elements, and then creating a running list of image ids.

We then run a simple deduplication script over our id list, and save it so that we can filter out images we don't need, saving us bandwidth and disk space.

Since we're here, let's also load the trainable classes file, and cross-reference it with our annotations to filter out any non-trainable class.

[code python]
import csv


# Let's extract not only each annotation, but a list of image ids.
# This id index will be used to filter out images that don't have valid annotations.
def format_annotations(annotation_path, trainable_classes_path):
    annotations = []
    ids = []
    with open(trainable_classes_path, 'r') as file:
        # a set makes the membership test below fast
        trainable_classes = set(file.read().split('\n'))

    with open(annotation_path, 'r') as annofile:
        for row in csv.reader(annofile):
            annotation = {'id': row[0], 'label': row[2], 'confidence': row[3], 'x0': row[4],
                          'x1': row[5], 'y0': row[6], 'y1': row[7]}
            # keep the annotation only if its label is trainable
            if annotation['label'] in trainable_classes:
                annotations.append(annotation)
                ids.append(row[0])
    ids = dedupe(ids)
    return annotations, ids

[code python]
# Remove duplicates while preserving the original order.
def dedupe(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

We then follow suit with our image index file by again translating CSV rows into JSON elements. The image index file contains vast quantities of image-related metadata; in our case, however, we only care about the image id and the URL.

[code python]
def format_image_index(images_path):
    images = []
    with open(images_path, 'r') as f:
        reader = csv.reader(f)
        dataset = list(reader)
        for row in dataset:
            # keep only the id and the URL of each image
            image = {'id': row[0], 'url': row[2]}
            images.append(image)
    return images

Filtering is done by constructing an output array consisting only of image index entries whose ids have bounding box annotations; all other elements are dropped.

[code python]
# Let's check each image and only keep it if its ID has a bounding box annotation associated with it.
def filter_image_index(dataset, ids):
    valid_ids = set(ids)  # set lookups keep this fast for large id lists
    output_list = []
    for element in dataset:
        if element['id'] in valid_ids:
            output_list.append(element)
    return output_list

We then construct an easier to use primitive by refactoring our annotations, grouping them based on image ids. We call these grouped elements "points" for clarity.

[code python]
from tqdm import tqdm


# Gathers annotations for each image id, to be easier to work with.
def points_maker(annotations):
    by_id = {}
    for anno in tqdm(annotations, desc="grouping annotations"):
        # start a new group the first time we see an image id
        if anno['id'] not in by_id:
            by_id[anno['id']] = []
        by_id[anno['id']].append(anno)
    groups = []
    while len(by_id) >= 1:
        key, value = by_id.popitem()
        groups.append({'id': key, 'annotations': value})
    return groups

Finally the saving function and our procedure:

[code python]
import argparse
import json


def save_data(data, out_path):
    with open(out_path, 'w+') as f:
        json.dump(data, f)

parser = argparse.ArgumentParser()
parser.add_argument('--annotations_input_path', dest='anno_path', required=True)
parser.add_argument('--image_index_input_path', dest='index_in_path', required=True)
parser.add_argument('--point_output_path', dest='point_path', required=True)
parser.add_argument('--image_index_output_path', dest='index_out_path', required=True)
parser.add_argument('--trainable_classes_path', dest='trainable_path', required=True)

if __name__ == "__main__":
    args = parser.parse_args()
    anno_input_path = args.anno_path
    image_index_input_path = args.index_in_path
    point_output_path = args.point_path
    image_index_output_path = args.index_out_path
    trainable_classes_path = args.trainable_path
    annotations, valid_image_ids = format_annotations(anno_input_path, trainable_classes_path)
    images = format_image_index(image_index_input_path)
    points = points_maker(annotations)
    filtered_images = filter_image_index(images, valid_image_ids)
    # save the filtered index, not the full one
    save_data(filtered_images, image_index_output_path)
    save_data(points, point_output_path)

Now we have our annotations formatted into labels, our image indices filtered to only contain used ids, and everything is in JSON!
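To make the output concrete, here is a sketch of what a single "point" looks like once annotations are grouped by image id. The ids, labels and coordinates are made up, and `defaultdict` is used here simply as a compact alternative way of doing the grouping.

```python
import json
from collections import defaultdict

# Made-up annotations in the same shape as the annotation dictionaries above.
annotations = [
    {"id": "img001", "label": "/m/0abc", "confidence": "1",
     "x0": "0.1", "x1": "0.5", "y0": "0.2", "y1": "0.8"},
    {"id": "img001", "label": "/m/0xyz", "confidence": "1",
     "x0": "0.3", "x1": "0.9", "y0": "0.1", "y1": "0.4"},
    {"id": "img002", "label": "/m/0abc", "confidence": "1",
     "x0": "0.0", "x1": "0.2", "y0": "0.5", "y1": "0.7"},
]

# Group annotations by image id; missing keys start as an empty list.
grouped = defaultdict(list)
for anno in annotations:
    grouped[anno["id"]].append(anno)

points = [{"id": image_id, "annotations": annos} for image_id, annos in grouped.items()]
print(json.dumps(points[0], indent=2))
```

Each point carries everything we know about one image, which is exactly the unit we'll need later when building one training example per image.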

Still following? Excellent, let's start processing our image URLs then.

source file

Image Downloading

As many of you might have realized, downloading ~660k web-scale images is a monstrous task. Thankfully each download is independent and spends most of its time waiting on the network, which is something we can take advantage of by running many downloads in parallel.

First, let's look at our parallel processing function as it's not quite the standard multiprocessing.pool.starmap affair. We like using this specific version since visualizing our code performance is something that matters to us for long running scripts such as this. Essentially what's important to note is that the array parameter denotes the iterable you plan to parallel map over, and function denotes the function you plan to parallelize.

[code python]
from concurrent.futures import ProcessPoolExecutor, as_completed

from tqdm import tqdm


# This is a nice parallel processing tool that uses tqdm
# to help visualize time-to-completion.
def parallel_process(array, function, n_jobs=16, use_kwargs=False, front_num=3):
    """
    A parallel version of the map function with a progress bar.

    Args:
        array (array-like): An array to iterate over.
        function (function): A python function to apply to the elements of array
        n_jobs (int, default=16): The number of cores to use
        use_kwargs (boolean, default=False): Whether to consider the elements of array as dictionaries of
            keyword arguments to function
        front_num (int, default=3): The number of iterations to run serially before kicking off the parallel job.
            Useful for catching bugs
    Returns:
        [function(array[0]), function(array[1]), ...]
    """
    # We run the first few iterations serially to catch bugs
    front = []
    if front_num > 0:
        front = [function(**a) if use_kwargs else function(a) for a in array[:front_num]]
    # If we set n_jobs to 1, just run a list comprehension. This is useful for benchmarking and debugging.
    if n_jobs == 1:
        return front + [function(**a) if use_kwargs else function(a) for a in tqdm(array[front_num:])]
    # Assemble the workers
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        # Pass the elements of array into function
        if use_kwargs:
            futures = [pool.submit(function, **a) for a in array[front_num:]]
        else:
            futures = [pool.submit(function, a) for a in array[front_num:]]
        kwargs = {
            'total': len(futures),
            'unit': 'it',
            'unit_scale': True,
            'leave': True
        }
        # Print out the progress as tasks complete
        for f in tqdm(as_completed(futures), **kwargs):
            pass
    out = []
    # Get the results from the futures.
    for i, future in tqdm(enumerate(futures)):
        try:
            out.append(future.result())
        except Exception as e:
            out.append(e)
    return front + out

Looking at our download function, we can see that it uses a global save_directory_path defined later in our script; this denotes the directory in which we plan to save our files. Unfortunately in python, most parallel mapping tools do not support "constant" parameter inputs, and in this case it made the most sense to provide this variable as a script-specific global.
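If you'd rather avoid the global, one alternative is to bind the constant argument with `functools.partial` before handing the function to a mapping tool. The sketch below is only an illustration of that pattern: `build_path` and the directory are hypothetical, and a thread pool is used to keep the example self-contained. A partial of a module-level function can also be pickled, so the same pattern should carry over to process pools.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def build_path(element, directory):
    """Hypothetical worker: takes one varying and one constant argument."""
    return directory + "/" + element["id"] + ".jpg"


elements = [{"id": "img001"}, {"id": "img002"}, {"id": "img003"}]

# partial() fixes the constant argument, leaving a one-argument callable
# that any map-style tool can call with each element.
worker = partial(build_path, directory="/tmp/images")

with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map returns results in the same order as the inputs.
    paths = list(pool.map(worker, elements))

print(paths)
```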

Our downloader function primarily uses the requests library and attempts to download each image from its URL. In this example if for any reason an exception is thrown, we skip that image. Obviously there are situations where this approach is substandard, so use at your own risk.

The successfully downloaded image is saved as a binary stream to a file with its name defined by the image id. This makes it easier to search and load images quickly and efficiently.

[code python]
import os
import random
import requests

def download(element):
    image_content = None
    dir_path = save_directory_path
    browser_headers = [
        {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704 Safari/537.36"},
        {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743 Safari/537.36"},
        {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:44.0) Gecko/20100101 Firefox/44.0"}
    ]
    try:
        # pick one of the user agents above for this request
        response = requests.get(element['url'],
                                headers=random.choice(browser_headers))
        image_content = response.content
    except Exception:
        pass  # skip any image that fails to download
    if image_content:
        complete_file_path = os.path.join(dir_path, element['id']+'.'+element['url'].split('.')[-1])
        with open(complete_file_path, "wb") as f:
            f.write(image_content)

Finally we have our procedure:

[code python]
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument('--images_path', dest='images_path', required=True)
parser.add_argument('--images_output_directory', dest='images_output_directory', required=True)

if __name__ == "__main__":
    args = parser.parse_args()
    images_path = args.images_path
    save_directory_path = args.images_output_directory
    try:
        os.makedirs(save_directory_path)
    except OSError:
        pass # already exists
    with open(images_path, 'rb') as f:
        image_urls = json.load(f)
    parallel_process(image_urls, download)

Whoa, that's gonna take a while! Make sure that you don't have bandwidth caps before downloading. ~660k images is a lot of images and we advise you to double check that you have enough storage space to cope.
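One detail of the download function worth a second look is how the file extension is derived: splitting the URL on '.' keeps any query string attached to the extension, and the later steps expect files ending in '.jpg'. The helper below is a hypothetical sketch of a more defensive approach, using only the path portion of the URL; the example URLs are made up.

```python
import os
from urllib.parse import urlparse


def image_filename(image_id, url, default_ext=".jpg"):
    """Build '<id><ext>' using only the path part of the URL."""
    path = urlparse(url).path  # drops '?query' and '#fragment'
    ext = os.path.splitext(path)[1].lower()
    return image_id + (ext or default_ext)


print(image_filename("img001", "https://example.com/photos/cat.JPG?size=large"))
print(image_filename("img002", "https://example.com/photos/dog"))
```

Lowercasing the extension and falling back to a default keeps the file names predictable for whatever script reads them next.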


source file

Image Verification and Dimension Reduction

Now we have a ton of images, but they are all different sizes, and some of them might be broken! Let's go ahead and verify them, but instead of verifying and resizing in two separate commands, let's get efficient and combine the verification and resize operations.

[code python]
import os
from PIL import Image
from tqdm import tqdm

# As we traverse the annotations list, let's check each image id to make sure it's valid.
def process_images(saved_images_path, resized_images_path, points):
    cleaned_points = []
    for point in tqdm(points, desc="checking if images are valid from label index"):
        stored_path = os.path.join(saved_images_path, point['id'] + '.jpg')
        try:
            im = Image.open(stored_path)
            im.verify()
            # Now that the image is verified, reopen it (verify() leaves the
            # object unusable), then rescale it and save.
            im = Image.open(stored_path)
            im.thumbnail((256, 256))
            if resized_images_path:
                resized_path = os.path.join(resized_images_path, point['id'] + '.jpg')
                im.save(resized_path, 'JPEG')
            else:
                im.save(stored_path, 'JPEG')
            cleaned_points.append(point)
        except Exception:
            continue  # broken or unreadable image: drop this point
    return cleaned_points

We check the image behind each point for validity. First we inspect it to ensure that nothing is broken; if it's intact, we go ahead and rescale it. If an output directory is not defined, we overwrite the original file.

If anything goes wrong during image processing, we know that the image is not formatted correctly and we filter it out of our points list.

Note: Our thumbnail dimensions are set to reduce training cost but aren't of any particular "standard". We set something small so as to reduce the overhead when creating TFRecords. Some object detection networks are designed to work with a number of image dimensions and aspect ratios, but resizing here is not strictly necessary for training. It does help, though.
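To get a feel for what a 256x256 bound does to images of different shapes, here is a small sketch of the arithmetic behind an aspect-ratio-preserving downscale. `thumbnail_size` is a hypothetical helper written for this illustration; the imaging library's own rounding may differ by a pixel.

```python
def thumbnail_size(width, height, max_width=256, max_height=256):
    """Largest size that fits the bounds while keeping the aspect ratio.

    Images already inside the bounds are left unchanged, never enlarged.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(thumbnail_size(1024, 768))   # landscape
print(thumbnail_size(600, 1200))   # portrait
print(thumbnail_size(200, 100))    # already small
```

Because the aspect ratio is preserved, only the longer side ends up at 256 pixels, so the resized images still vary in shape.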

Finally, our load/save and procedure components to the script:

[code python]
import argparse
import json


def load_dataset(file_path):
    with open(file_path, 'rb') as f:
        annotations = json.load(f)
    return annotations

def save_dataset(data, file_path):
    with open(file_path, 'w+') as f:
        json.dump(data, f)

parser = argparse.ArgumentParser()
parser.add_argument('--image_directory', dest='image_directory_path', required=True)
parser.add_argument('--image_saving_directory', dest='resized_directory_path')
parser.add_argument('--datapoints_input_path', dest='datapoints_input_path', required=True)
parser.add_argument('--datapoints_output_path', dest='datapoints_output_path', required=True)

if __name__ == "__main__":
    args = parser.parse_args()
    images_directory = args.image_directory_path
    resized_directory = args.resized_directory_path
    points_input_path = args.datapoints_input_path
    # matches the dest of --datapoints_output_path above
    points_save_path = args.datapoints_output_path
    points = load_dataset(points_input_path)
    filtered_points = process_images(images_directory, resized_directory, points)
    save_dataset(filtered_points, points_save_path)

Run that process for the training, testing, and validation sets and we're almost there. If you want to preserve the original files, pass the --image_saving_directory argument, which defines where the resized and verified images are saved.

source file

Defining the Label Map

Tensorflow requires a label_map protobuf file for evaluation. This object essentially just maps a label index (which is an integer value used in training) to a label keyword. If you train without an evaluation step you can avoid this; however, it will help when performing inference later.

[code python]
import argparse


# now we create the pbtxt file, there's no writer for this so we have to make one ourselves
def save_label_map(label_map_path, data):
    with open(label_map_path, 'w+') as f:
        for i in range(len(data)):
            line = "item {\nid: " + str(i + 1) + "\nname: '" + data[i] + "'\n}\n"
            f.write(line)

parser = argparse.ArgumentParser()
parser.add_argument('--trainable_classes_path', dest='trainable_classes', required=True)
parser.add_argument('--label_map_path', dest='label_map_path', required=True)

if __name__ == '__main__':
    args = parser.parse_args()
    trainable_classes_file = args.trainable_classes
    label_map_path = args.label_map_path
    # read one label per line, skipping blank lines
    with open(trainable_classes_file, 'r') as f:
        trainable_classes = [line for line in f.read().split('\n') if line != '']
    save_label_map(label_map_path, trainable_classes)

There are no available writing tools to generate label_map files for Tensorflow, and for large label sets like ours it can be super cumbersome to write one manually. Because of this we decided to create an automated string replacement tool that satisfies the label map format requirements.
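For reference, this is roughly what the generated text looks like for a couple of labels. The sketch below renders the same item layout into a string rather than a file; `label_map_text` and the two labels are made up for the example.

```python
def label_map_text(labels):
    """Render labels as label map items, with ids starting at 1."""
    items = []
    for index, name in enumerate(labels, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (index, name))
    return "".join(items)


# Made-up labels; the real file would hold all trainable classes.
text = label_map_text(["Apple", "Pear"])
print(text)
```

The ids start at 1 because 0 is kept for the background class.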

source file

The last step before we start constructing our model is to create TFRecord files.

TFRecord Creation

Tensorflow records are an interesting construct. They're used nearly universally across Tensorflow objects as a dataset storage medium, and harbour a bunch of complexity, but the documentation on using your own dataset is sparse.

Thankfully we did all the hard work for you. This section will walk you through everything you need to start using a Tensorflow record!

First we must generate a "class number" or label index integer for each label. These integers are used directly by the neural network's cross-entropy loss function, which is used to gauge the performance of the network in the classification task. We define the class number based on the order in which they are defined in the trainable_classes file.
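As a small illustration of that ordering rule, the sketch below builds the label-to-number lookup once as a dictionary, which avoids searching the list for every annotation. The encoded labels are made up.

```python
# Made-up encoded labels, in the order they would appear in the classes file.
trainable_classes = ["/m/0abc", "/m/0def", "/m/0xyz"]

# Build the lookup once: label -> 1-based class number (0 is left for background).
class_numbers = {label: index for index, label in enumerate(trainable_classes, start=1)}

point = {"id": "img001", "annotations": [{"label": "/m/0xyz"}, {"label": "/m/0abc"}]}
for anno in point["annotations"]:
    anno["class_num"] = class_numbers[anno["label"]]

print(point)
```

Whichever way you compute it, the numbering must match the ids in the label map, so both should be generated from the same file in the same order.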

[code python]
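def generate_class_num(points):
    output = []
    # trainable_classes_file is a script-level global set in the main procedure
    with open(trainable_classes_file, 'r') as file:
        trainable_classes = file.read().split('\n')
    for point in tqdm(points):
        for anno in point['annotations']:
            # class numbers start at 1, in the order labels appear in the file
            anno['class_num'] = trainable_classes.index(anno['label'])+1
        output.append(point)
    return output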

To create a record for an object detection project, we need a few components. Some are on a per image basis while some are per annotation.

Unfortunately the API for creating "examples" or single elements in a TFRecord is a bit convoluted. You don't provide an array of annotations, but instead a series of arrays for each individual component of an annotation. For these "per annotation" components, we include bounding box coordinates, the label's "text" or definition, and a unique integer value to denote that particular class.

Note: If using your own dataset, make sure that your bounding box coordinates are relative to the image coordinates, rather than absolute. If your dataset's annotation data is defined in absolute coordinates, make sure you convert them to relative coordinates before resizing your images! We almost got burned by that, learn from us :D
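If your own annotations are in pixels, the conversion is just a division by the image size. Here is a minimal sketch, assuming boxes stored as (x0, y0, x1, y1) tuples; `to_relative` is a hypothetical helper.

```python
def to_relative(box, image_width, image_height):
    """Convert an absolute pixel box (x0, y0, x1, y1) to relative coordinates."""
    x0, y0, x1, y1 = box
    return (x0 / image_width, y0 / image_height,
            x1 / image_width, y1 / image_height)


# A box drawn on a 1000x500 image...
relative = to_relative((100, 50, 600, 450), image_width=1000, image_height=500)
print(relative)

# ...describes the same region after the image is resized to 200x100,
# because relative coordinates do not depend on the image size.
resized_pixels = (relative[0] * 200, relative[1] * 100,
                  relative[2] * 200, relative[3] * 100)
print(resized_pixels)
```

This is why the conversion has to happen against the original image size: once the image is resized, the old pixel coordinates no longer point at the right place.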

[code python]
import hashlib
import os

import tensorflow as tf
from PIL import Image
from object_detection.utils import dataset_util


# Construct a record for each image.
# If we can't load the image file properly let's skip it
def group_to_tf_record(point, image_directory):
    format = b'jpeg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    class_nums = []
    class_ids = []
    image_id = point['id']
    filename = os.path.join(image_directory, image_id + '.jpg')
    try:
        image = Image.open(filename)
        width, height = image.size
        with tf.gfile.GFile(filename, 'rb') as fid:
            encoded_jpg = bytes(fid.read())
    except Exception:
        return None
    key = hashlib.sha256(encoded_jpg).hexdigest()
    # one entry per annotation in every list, so all lists stay the same length
    for anno in point['annotations']:
        xmins.append(float(anno['x0']))
        xmaxs.append(float(anno['x1']))
        ymins.append(float(anno['y0']))
        ymaxs.append(float(anno['y1']))
        class_nums.append(anno['class_num'])
        class_ids.append(anno['label'].encode('utf8'))
    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
        'image/filename': dataset_util.bytes_feature(filename.encode('utf8')),
        'image/source_id': dataset_util.bytes_feature(image_id.encode('utf8')),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(class_ids),
        'image/object/class/label': dataset_util.int64_list_feature(class_nums)
    }))
    return tf_example

Whoa that's a ton of stuff, what's all that code doing?

First, we load the image file for this particular point and encode it as a byte array. It's important to load the image this way since the object detection API's internal image handling logic is fragile and may not represent images how you would expect them to.

Next, we iterate over the point object's annotations and create element arrays. It should be noted that if any of the object arrays are missing or not the same length, Tensorflow will throw a bunch of exceptions.
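Since mismatched lengths are easy to introduce and painful to debug, it can be worth checking the lists yourself before building the example. The helper below is a hypothetical sketch of such a check; the values passed in are made up.

```python
def check_parallel_lists(**columns):
    """Raise ValueError unless every per-annotation list has the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("per-annotation lists differ in length: %r" % lengths)
    return True


# Two annotations: every list must therefore hold exactly two entries.
ok = check_parallel_lists(
    xmins=[0.1, 0.3], xmaxs=[0.5, 0.9],
    ymins=[0.2, 0.1], ymaxs=[0.8, 0.4],
    class_nums=[3, 1], class_ids=[b"/m/0xyz", b"/m/0abc"],
)
print(ok)
```

The error message names each list with its length, which makes it much quicker to spot which component fell out of step.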

Now that we've created what is analogous to a "row" in a database, we should write the data to a file - a TFRecord file!

While we have the write logic contained within the script's main procedure for brevity, it could easily be placed in a separate function if you're so inclined.

[code python]
import json
import sys

from tqdm import tqdm


def load_points(file_path):
    with open(file_path, 'rb') as f:
        points = json.load(f)
    return points

if __name__ == "__main__":
    trainable_classes_file = sys.argv[1]
    record_storage_path = sys.argv[2]
    annotations_file = sys.argv[3]
    saved_images_root_directory = sys.argv[4]
    annotations = load_points(annotations_file)
    with_class_num = generate_class_num(annotations)
    writer = tf.python_io.TFRecordWriter(record_storage_path)
    for group in tqdm(with_class_num, desc="writing to file"):
        record = group_to_tf_record(group, saved_images_root_directory)
        # images that failed to load return None and are skipped
        if record:
            serialized = record.SerializeToString()
            writer.write(serialized)
    writer.close()

Great, we're almost there now. We compiled the individual processing scripts into a series of gists for your use. Be sure to run these steps multiple times to create the Training, Testing, and Validation TFRecord files as we'll need them for our next step.

source file

Step 2: Setting up the Object Detection API

So all of our data is formatted properly into TFRecords files, and we're just about ready to begin training. At this point we should start introducing elements from the object detection API.

The object detection API contains a couple of useful scripts that we can take advantage of. Namely, the eval.py and train.py scripts in the main directory. Installation is a bit of a pain though, so we'll walk you through a quick setup to get things moving.

Setting up our Environment

First, you'll need to get your system dependencies in place, so boot up a terminal and follow along!

Install the necessary Python dependencies from PyPI:

[code bash]
pip install pillow
pip install lxml
pip install jupyter
pip install matplotlib
# quote the requirement so the shell doesn't treat >= as a redirect
pip install "protobuf>=2.6"

And most importantly, (if not already installed):

[code bash]
# For CPU
pip install tensorflow
# For GPU
pip install tensorflow-gpu

It should be noted that tensorflow-gpu is compiled against very specific CUDA and cuDNN versions, so it might make sense to compile the tensorflow project from source if your environment differs.

Next, let's obtain the models git repository which can be found here, and compile its protobuf components:

[code bash]
git clone https://github.com/tensorflow/models.git
# cd to the main project root path first
cd models/research
protoc object_detection/protos/*.proto --python_out=.

That should get you along nicely. Now let's make sure our environment variables are set (you may want to permanently set them, more on this here):

[code bash]
# while still in the models/research directory
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
# and if you're using tensorflow-gpu and haven't set your cuda path yet:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

Great, now we're fully set up. If you want to run a quick test to make sure everything works, try running this:

[code bash]
# while still in the models/research directory
python object_detection/builders/model_builder_test.py

Transfer Learning

Object detection is a difficult challenge that necessitates the use of deep learning techniques. This normally requires that we train a model with potentially hundreds of layers and millions of parameters! As you might imagine even our 660k image dataset would most likely be insufficient.

Thankfully there's a solution! All object detection model configurations in the Object Detection API support transfer learning. What this means is that we're able to take an existing pre-trained image classifier (which is trained on millions of images), and use it to jump start our detector.

Exactly how transfer learning works is beyond the scope of this deep dive, but to get a more intuitive understanding I recommend you check out the link above.

Great, so we can use pre-trained models, but where do we get them from?

Good question! Deep in the object detection API repository you can find this handy guide, which describes each classifier model. All of them are easy to swap in and out which is very convenient for testing.

So go ahead and download one of these files, and unzip it to a special directory - this will help us later.

Configuring our Object Detection Schema

We've accomplished a lot here and we're almost ready to start training, but first we need to configure our graph buffer configuration.

In the Object Detection API, the standard way of defining a model for training is by creating or tweaking a config file. This file defines how tensorflow interprets your request, where to take data from and where to save data to.
There is a bunch of information that's contained within this file, so lets break it down into manageable chunks.

This is the start of the model configuration. We're using the faster_rcnn object detection template here, which is where the faster_rcnn object comes from. This can be replaced with other architectures by comparing against this page, but in this demo we'll only be looking at faster_rcnn.

  • num_classes is the total number of classification labels, with 0 denoting the background class.
  • The image_resizer is important, and there are two main types of resizing, fixed_shape_resizer and keep_aspect_ratio_resizer. Image dimensionality is important for object detection. It should be noted that fixed_shape_resizer resizes every image to exactly the configured height and width without preserving its aspect ratio, whereas keep_aspect_ratio_resizer scales the image so that its sides fall between a minimum and a maximum dimension. The fixed output shape is what makes batching straightforward.
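To see how the two resizers differ in the shapes they produce, here is a rough sketch of the arithmetic. It reflects our reading of the documented behavior rather than the library's actual code; the function names are made up, the 600/1024 bounds are just example values, and the real implementation's rounding may differ.

```python
def fixed_shape(height, width, target_height=350, target_width=350):
    """Every image comes out at exactly the configured size."""
    return target_height, target_width


def keep_aspect_ratio(height, width, min_dimension=600, max_dimension=1024):
    """Scale the short side to min_dimension unless that would push the
    long side past max_dimension; in that case cap the long side instead."""
    scale = min_dimension / min(height, width)
    if max(height, width) * scale > max_dimension:
        scale = max_dimension / max(height, width)
    return round(height * scale), round(width * scale)


for shape in [(480, 640), (300, 1200)]:
    print(shape, "->", fixed_shape(*shape), "vs", keep_aspect_ratio(*shape))
```

With a fixed shape every image in a batch has identical dimensions, while the aspect-ratio-preserving variant yields a different shape for each differently proportioned input.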

[code json]
model {
  faster_rcnn {
    num_classes: 545
    image_resizer {
      fixed_shape_resizer {
        height: 350
        width: 350
      }
    }
    # the remaining faster_rcnn fields continue in the next snippet

The rest of the model class defines the hyper parameters of the various layers. For most circumstances the default hyper parameters will get you pretty far.

[code json]
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.1
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: true
        dropout_keep_probability: 0.5
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.009999999776482582
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

Lots of boilerplate stuff, right? Still, it's important for tensorflow to understand exactly how to construct its computational graph, and exposing that level of detail gives you more fine-grained control when you need it.

Let's look at something that isn't boilerplate, the `train_config`:

[code json]
train_config: {
  batch_size: 20
  optimizer {
    adam_optimizer: {
      learning_rate {
        exponential_decay_learning_rate: {initial_learning_rate:0.00001}
      }
    }
  }
  fine_tune_checkpoint: "/media/deepstorage/model/faster-rcnn/model.ckpt"
  from_detection_checkpoint: True
  batch_queue_capacity: 50
  gradient_clipping_by_norm: 10
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

Lots of important pieces of information here, so let's break it down:

  • batch_size - this defines the number of images in each training batch. Tensorflow requires a fixed number and doesn't take GPU memory or data size into consideration, so the right value depends heavily on your GPU hardware and image dimensions; a large batch isn't strictly necessary for quality results. Tensorflow also requires every input array in a batch to have the same dimensions, which means that any batch_size > 1 requires an image_resizer of type fixed_shape_resizer. For more information on batching, check out this link.
  • optimizer - this is super important as it defines how your weights get updated by backpropagation. The default mode is a standard momentum_optimizer, which is a flexible version of stochastic gradient descent (SGD). This works great for most kinds of systems, but for large sparse arrays like our output array the adam_optimizer works best. If you want to check out the other options, look at this file.
  • fine_tune_checkpoint - here we define the directory and filename prefix of our pre-trained model file. This is why we saved the file in a directory all on its own. Don't worry about the fact that you don't have a file named exactly model.ckpt: the checkpoint is stored as several files sharing that prefix (such as model.ckpt.index and model.ckpt.meta), and Tensorflow resolves them from the prefix.
  • from_detection_checkpoint: True - not described in any of the documentation, but required for your pre-trained object detection checkpoint to work correctly. If you use a pure "classification" checkpoint, leave this as false.
  • batch_queue_capacity - another important parameter. Tensorflow contains a streaming pipeline that loads a reservoir of training batches into memory, but the size of that reservoir isn't set dynamically from your available host memory. This number defaults to 300, which proved too high for our high performance training machine even with our images dramatically downscaled. Adjust accordingly.
  • gradient_clipping_by_norm - this helps avoid exploding gradients. We arrived at the value of 10 through experimentation, but it can be adjusted.
  • data_augmentation_options - setting some augmentation options can dramatically increase our dataset's effective size, while improving the robustness of our detector. For information as to what options are available, take a look at this file.
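
To make gradient_clipping_by_norm a little more concrete, here's a minimal pure-Python sketch of what clipping by norm does to a single gradient vector. It illustrates the idea only and is not Tensorflow's implementation; the function name clip_by_norm and the sample gradients are made up for this example.

```python
import math


def clip_by_norm(gradient, clip_norm):
    """Rescale `gradient` so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        # Small gradients pass through untouched.
        return list(gradient)
    scale = clip_norm / norm
    # Large gradients keep their direction but are shrunk to length clip_norm.
    return [g * scale for g in gradient]


small = clip_by_norm([3.0, 4.0], 10)    # norm is 5, so it is left as-is
large = clip_by_norm([30.0, 40.0], 10)  # norm is 50, so it is scaled down
print(small, large)
```

A lower clip value tames spikes more aggressively at the cost of slower progress when gradients are legitimately large, which is why the value is worth experimenting with.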

Ok, so we've got the train_config completed to our liking, and our GPU is happily able to chug along with no nasty OOM errors. Let's take a look at the eval_config:

[code json]
eval_config: {
  num_examples: 3000
  num_visualizations: 20
}

Much shorter, right? The default parameters are for the most part OK here, especially since the evaluation step is mostly for visualizing generalization and robustness. If your system begins to hang when both the training and evaluation steps are running, it might be worth reducing the num_examples value. If you want to take a look at the whole list of options, check out the eval file.

Finally, let's look at our two reader configurations:

[code json]
train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_RECORD_FILE/train_545.record"
  }
  label_map_path: "PATH_TO_LABEL_MAP/label_map_545.pbtxt"
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_RECORD_FILE/test_545.record"
  }
  label_map_path: "PATH_TO_LABEL_MAP/label_map_545.pbtxt"
  shuffle: false
}

Pretty simple right? We set shuffle to false because we want to see how the network improves from one evaluation to the next, but you can set that to true if you'd rather get a more stochastic result.
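
Before launching anything, it can save a failed run to confirm that no placeholder paths are left in the config. The following is a small standard-library sketch of that check. The helper name find_placeholder_paths, the inline sample config and the /data/... path are our own illustration; it simply looks for the PATH_TO prefix used in the reader configurations.

```python
import re

# A shortened sample config; the label map path is a made-up example.
SAMPLE_CONFIG = '''
train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_RECORD_FILE/train_545.record"
  }
  label_map_path: "/data/label_map_545.pbtxt"
}
'''


def find_placeholder_paths(config_text, marker="PATH_TO"):
    """Return (field, value) pairs whose quoted value still contains `marker`."""
    pairs = re.findall(r'(\w+_path)\s*:\s*"([^"]*)"', config_text)
    return [(field, value) for field, value in pairs if marker in value]


leftovers = find_placeholder_paths(SAMPLE_CONFIG)
for field, value in leftovers:
    print("still a placeholder:", field, value)
```

In practice you would read your real config file into a string and pass that in instead of SAMPLE_CONFIG.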

Ok, so far we've manipulated and formatted our dataset metadata, downloaded, verified and resized all of our image files and created our record files. We've loaded and prepared the object detection API, and now created our config file.

We're finally ready to begin training!

source file

Step 3: Training and Production

Everything is set up to begin training, but first let's quickly describe the training and evaluation process.

There are two important scripts in the object detection API directory: eval.py and train.py. It's true that we don't need to run the eval.py script as it doesn't contribute to training. However, it provides us with invaluable training insight that can be easily viewed and shared using Tensorboard. Describing how to get Tensorboard set up is outside of the scope of this example, but the documentation in the link above should be more than enough to get you started.

The following scripts are used at the command line, and should be run in separate terminal sessions. We recommend using the screen tool for simplicity.

[code bash]
python object_detection/train.py \
--logtostderr \

The training_output directory will contain the all-important checkpoint files necessary for inference and serving once your model is sufficiently trained. Logging to stderr means that you'll have more verbose output, which is useful for debugging.
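
If you'd rather launch the jobs from Python than juggle shell quoting, the sketch below assembles the train and eval command lines as argument lists. Treat it as a sketch under assumptions: the --pipeline_config_path, --train_dir, --checkpoint_dir and --eval_dir flags are the ones the object detection API's train.py and eval.py accepted to the best of our knowledge, and every path (including eval_output) is a placeholder, so check each script's --help before relying on it.

```python
import shlex


def build_commands(config_path, train_dir, eval_dir):
    """Return the train and eval command lines as argument lists."""
    train_cmd = [
        "python", "object_detection/train.py",
        "--logtostderr",
        "--pipeline_config_path=" + config_path,
        "--train_dir=" + train_dir,        # checkpoints are written here
    ]
    eval_cmd = [
        "python", "object_detection/eval.py",
        "--logtostderr",
        "--pipeline_config_path=" + config_path,
        "--checkpoint_dir=" + train_dir,   # eval reads what train writes
        "--eval_dir=" + eval_dir,          # Tensorboard summaries land here
    ]
    return train_cmd, eval_cmd


# Placeholder paths, substitute your own.
train_cmd, eval_cmd = build_commands(
    "/PATH/TO/CONFIG/FILE.config", "training_output", "eval_output")
print(" ".join(shlex.quote(arg) for arg in train_cmd))
print(" ".join(shlex.quote(arg) for arg in eval_cmd))
```

Each list can be handed to subprocess.Popen as-is once you're ready to start the job.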

[code bash]
python object_detection/eval.py \
--logtostderr \

And finally, in another screen, run the Tensorboard daemon:

[code bash]
# from tensorboard source directory
tensorboard \

With all of those scripts running, you're on your way to training your neural network! Training may take some time, so make sure to check back with your running Tensorboard instance to inspect the generalization of your model. It should also be noted that the object detection API will not stop when it "runs out of data". The best signal that it has completed a full pass and is no longer learning much is the average precision beginning to flatline.
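
If you'd like something less subjective than eyeballing the curve, a flatline check can be sketched in a few lines. This is only one possible heuristic, not something the object detection API provides: the function name has_plateaued, the window and min_gain defaults, and the average precision readings are all invented for illustration.

```python
def has_plateaued(scores, window=3, min_gain=0.002):
    """True when the last `window` evaluations improved by less than `min_gain`."""
    if len(scores) <= window:
        return False  # not enough history to judge yet
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent - best_before < min_gain


# Made-up average precision readings, one per evaluation run.
history = [0.10, 0.21, 0.29, 0.34, 0.360, 0.361, 0.3605, 0.3612]
print(has_plateaued(history[:5]), has_plateaued(history))
```

You would feed it the values you read off Tensorboard, and loosen or tighten min_gain depending on how noisy your evaluation curve is.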


With Tensorboard we can even check out some sample images and see what our evaluation looks like at a glance.


Sweet, looks like we actually trained something that's able to detect things. Let's look at putting this into production.

Frozen Graph Generation

Awesome, we're almost at the finish line now. We've trained our model and we like the results, but we can't easily use our model files for inference in their current format.

Tensorflow has a concept known as exporting a metagraph. Freezing a graph allows us to combine the model structure (the configuration file) with the trained weights, stored as constants, into a single binary protobuf file. Training-only state such as gradients isn't needed for inference.

For most inference techniques, we do that by executing a script called export_inference_graph.py which again is found in the object_detection repository.

[code bash]
python export_inference_graph.py --input_type image_tensor \
--pipeline_config_path /PATH/TO/CONFIG/FILE.config \
--trained_checkpoint_prefix /PATH/TO/TRAINED/OUTPUT/DIRECTORY/model.ckpt \
--output_directory /PATH/TO/FROZEN/DIRECTORY

After that's done, you'll have a frozen_inference_graph.pb file in your frozen directory. Ignore the rest of the gobbledygook in there and upload that file to the data API, along with our previously defined label_map.pbtxt, so we can convert our encoded classes into things like cat, dog, and apple.

Serving Inferences with Algorithmia

We have everything we need now to create a useful algorithm on Algorithmia! Our first step is for you to create a new algorithm and define its language as a python3 algorithm for Tensorflow support. Make sure to state that our algorithm requires access to the internet and requires a GPU for processing, or our inferences will take a boatload of time.

Let's look at our actual algorithm file now. We'll break it up into chunks and talk about each section individually.

[code python]
import numpy as np
import tensorflow as tf
from PIL import Image
import Algorithmia
import os
import multiprocessing
from . import label_map_util

We need a couple of extra files from the object_detection repository to get things to work, namely the label_map_util.py and string_int_label_map_pb2.py scripts. Both files are provided in our repository.

[code python]
# This is code for most tensorflow object detection algorithms
# In this example it's tuned specifically for our open images data example.

client = Algorithmia.client()
TEMP_COLLECTION = 'data://.session/'
BOUNDING_BOX_ALGO = 'util/BoundingBoxOnImage/0.1.x'
SIMD_ALGO = "util/SmartImageDownloader/0.2.14"
MODEL_FILE = "data://zeryx/openimagesDemo/ssd.pb"
LABEL_FILE = "data://zeryx/openimagesDemo/label_map.pbtxt"

class AlgorithmError(Exception):
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return repr(self.value)

As per all of our standard Python algorithms - we define any constant, reused parameters in advance, particularly files and algorithms that we may be interacting with multiple times. By defining everything in advance we make it easier to change things later.

We also define the AlgorithmError class, which helps us throw more concise exceptions.

[code python]
# NUM_CLASSES isn't defined anywhere in the listing as shown; we assume 545
# here to match the 545-label map used throughout this tutorial.
NUM_CLASSES = 545

def load_model():
    # Download the label map and frozen graph to local temp files.
    path_to_labels = client.file(LABEL_FILE).getFile().name
    path_to_model = client.file(MODEL_FILE).getFile().name
    detection_graph = tf.Graph()
    with detection_graph.as_default():
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(path_to_model, 'rb') as fid:
            serialized_graph = fid.read()
            # Parse the serialized bytes first, otherwise we would be
            # importing an empty GraphDef.
            od_graph_def.ParseFromString(serialized_graph)
            tf.import_graph_def(od_graph_def, name='')
    label_map = label_map_util.load_labelmap(path_to_labels)
    categories = label_map_util.convert_label_map_to_categories(
        label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
    category_index = label_map_util.create_category_index(categories)
    return detection_graph, category_index

def load_labels(label_path):
    label_map = label_map_util.load_labelmap(label_path)
    categories = label_map_util.convert_label_map_to_categories(
        label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
    category_index = label_map_util.create_category_index(categories)
    return category_index

This is our standard Tensorflow object detection preload snippet. Pay close attention to how path_to_model is used to set up the detection_graph object. As you can see, it is defined through the global tf object, which makes further refinement of this process tricky.

Our label map gets converted into a category_index, which is useful for easy label lookups in our inference function.
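
To give a feel for what that lookup table contains, here's a standard-library sketch that builds the same kind of id-to-name index from label map text. It is a simplified stand-in for label_map_util, not its implementation: the function build_category_index, the regular expressions and the two sample entries (including their name values) are made up for illustration.

```python
import re

# Two made-up entries in the label map text format.
SAMPLE_LABEL_MAP = '''
item {
  name: "/m/example1"
  id: 1
  display_name: "Apple"
}
item {
  name: "/m/example2"
  id: 2
  display_name: "Pear"
}
'''


def build_category_index(label_map_text):
    """Map each integer class id to a {'id', 'name'} dict."""
    index = {}
    for body in re.findall(r'item\s*\{(.*?)\}', label_map_text, re.DOTALL):
        class_id = int(re.search(r'\bid\s*:\s*(\d+)', body).group(1))
        name = re.search(r'display_name\s*:\s*"([^"]*)"', body).group(1)
        index[class_id] = {'id': class_id, 'name': name}
    return index


category_index = build_category_index(SAMPLE_LABEL_MAP)
print(category_index[2]['name'])
```

The inference function only ever needs this one lookup: an integer class id goes in and a human readable name comes out.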

[code python]
def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape(
        (im_height, im_width, 3)).astype(np.uint8)

def get_image(url):
    output_url = client.algo(SIMD_ALGO).pipe({'image': str(url)}).result['savePath'][0]
    temp_file = client.file(output_url).getFile().name
    renamed_path = temp_file + '.' + output_url.split('.')[-1]
    os.rename(temp_file, renamed_path)
    return renamed_path

Here we specify how we download our images using the Smart Image Downloader, and how we load it into a Numpy array of proper dimensions for Tensorflow.

[code python]
def generate_gpu_config(memory_fraction):
    config = tf.ConfigProto()
    # config.gpu_options.allow_growth = True
    config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config

This is a very important component that reins in Tensorflow's memory hogging nature. It also reduces bottlenecks and OOM errors when running the inference script on Algorithmia. If per_process_gpu_memory_fraction is not set, Tensorflow will claim nearly all of the GPU's memory for the process.

The allow_growth option is commented out above. Enabling it instead tells Tensorflow to start with a small allocation and grow it only as needed, rather than reserving a fixed fraction up front.

[code python]
# This function runs a forward pass operation over the frozen graph,
# and extracts the most likely bounding boxes and weights.
def infer(graph, image_path, category_index, min_score, output):
    with graph.as_default():
        with tf.Session(graph=graph, config=generate_gpu_config(0.6)) as sess:
            image_np = load_image_into_numpy_array(Image.open(image_path).convert('RGB'))
            height, width, _ = image_np.shape
            # The graph expects a batch, so add a leading batch dimension.
            image_np_expanded = np.expand_dims(image_np, axis=0)
            image_tensor = graph.get_tensor_by_name('image_tensor:0')
            boxes = graph.get_tensor_by_name('detection_boxes:0')
            scores = graph.get_tensor_by_name('detection_scores:0')
            classes = graph.get_tensor_by_name('detection_classes:0')
            num_detections = graph.get_tensor_by_name('num_detections:0')
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
    # Drop the batch dimension again now that we have plain arrays.
    boxes = np.squeeze(boxes)
    classes = np.squeeze(classes).astype(np.int32)
    scores = np.squeeze(scores)
    for i in range(len(boxes)):
        confidence = float(scores[i])
        if confidence >= min_score:
            # Boxes are normalized to 0..1, so scale them to pixels.
            ymin, xmin, ymax, xmax = tuple(boxes[i].tolist())
            ymin = int(ymin * height)
            ymax = int(ymax * height)
            xmin = int(xmin * width)
            xmax = int(xmax * width)
            class_name = category_index[classes[i]]['name']
            # Results go into the shared list rather than a return value.
            output.append({
                'coordinates': {
                    'y0': ymin,
                    'y1': ymax,
                    'x0': xmin,
                    'x1': xmax
                },
                'label': class_name,
                'confidence': confidence
            })

This is the big meat and potatoes. This is our main inference function, so let's unpack this.

We set the GPU memory fraction to an easy 0.6, but it can be adjusted as necessary. We format our image data into a Numpy array, and extract its dimensions for the inference process. We then look up the Tensorflow tensor handles that are defined as the inputs and outputs of our graph. After that we run the inference step itself by calling the sess.run function.

The inference step is by far the most time consuming part of the process, but once it's complete we can format the results into a useful form. We filter out boxes with a confidence score less than min_score, scale the remaining boxes from normalized coordinates to pixels, and format them into an easy to parse JSON structure.

As you might have noticed we return our results here as updates to our mutable list output instead of a regular return. We'll show you why we do this in our apply function later.
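
The filtering and scaling step is easy to try on its own, without a GPU in sight. Here's a minimal sketch of it in plain Python. The function name to_pixel_boxes and the two detections are invented for illustration; the only thing carried over from the inference function is the [ymin, xmin, ymax, xmax] box layout with values normalized to the 0 to 1 range.

```python
def to_pixel_boxes(boxes, scores, labels, width, height, min_score):
    """Keep detections scoring at least min_score and scale them to pixels."""
    results = []
    for (ymin, xmin, ymax, xmax), score, label in zip(boxes, scores, labels):
        if score < min_score:
            continue  # too uncertain, drop it
        results.append({
            'coordinates': {
                'x0': int(xmin * width), 'x1': int(xmax * width),
                'y0': int(ymin * height), 'y1': int(ymax * height),
            },
            'label': label,
            'confidence': score,
        })
    return results


# Invented detections for a 640x480 image.
detections = to_pixel_boxes(
    boxes=[[0.25, 0.5, 0.75, 1.0], [0.0, 0.0, 0.5, 0.5]],
    scores=[0.9, 0.2],
    labels=['Apple', 'Pear'],
    width=640, height=480, min_score=0.5)
print(detections)
```

Only the first detection survives the 0.5 threshold, and its coordinates come back in pixels rather than fractions of the image.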

[code python]
def draw_boxes_and_save(image, output_path, box_data):
    request = {}
    remote_image = TEMP_COLLECTION + image.split('/')[-1]
    temp_output = TEMP_COLLECTION + '1' + image.split('/')[-1]
    # Assumed step: the listing as shown never uploads the local image,
    # so we push it to the session collection for the algorithm to read.
    client.file(remote_image).putFile(image)
    request['imageUrl'] = remote_image
    request['imageSaveUrl'] = temp_output
    request['style'] = 'basic'
    boxes = []
    for box in box_data:
        coords = box['coordinates']
        coordinates = {'left': coords['x0'], 'right': coords['x1'],
                       'top': coords['y0'], 'bottom': coords['y1']}
        # Confidence is a 0..1 fraction, so scale it before adding the % sign.
        text_objects = [{'text': box['label'], 'position': 'top'},
                        {'text': 'score: {:.0f}%'.format(box['confidence'] * 100), 'position': 'bottom'}]
        boxes.append({'coordinates': coordinates, 'textObjects': text_objects})
    request['boundingBoxes'] = boxes
    temp_image = client.algo(BOUNDING_BOX_ALGO).pipe(request).result['output']
    local_image = client.file(temp_image).getFile().name
    # Assumed step: copy the annotated image to the requested output path.
    client.file(output_path).putFile(local_image)
    return output_path

If the user requires a graphic result, we can use our bounding box on image algorithm to quickly create a graphical representation of our detection results. By using this logic we can quickly create images just like:


[code python]
def apply(input):
    output_path = None
    min_score = 0.50
    # Normalize the input: either a bare URL string or a dict of options.
    if isinstance(input, str):
        image = get_image(input)
    elif isinstance(input, dict):
        if 'image' in input and isinstance(input['image'], str):
            image = get_image(input['image'])
        else:
            raise Exception("AlgoError3000: 'image' missing from input")
        if 'output' in input and isinstance(input['output'], str):
            output_path = input['output']
        if 'min_score' in input and isinstance(input['min_score'], float):
            min_score = input['min_score']
    else:
        raise AlgorithmError("AlgoError3000: Invalid input")
    manager = multiprocessing.Manager()
    box_output = manager.list()
    p = multiprocessing.Process(target=infer,
                                args=(GRAPH, image, CAT_INDEX,
                                      min_score, box_output))
    # Run inference in the child process and wait for it to finish.
    p.start()
    p.join()
    box_output = [x for x in box_output]
    box_output = sorted(box_output, key=lambda k: k['confidence'])
    if output_path:
        path = '/tmp/image.' + output_path.split('.')[-1]
        im = Image.open(image).convert('RGB')
        # Write the converted image to disk so it can be uploaded.
        im.save(path)
        image = draw_boxes_and_save(path, output_path, box_output)
        return {'boxes': box_output, 'image': image}
    else:
        return {'boxes': box_output}

GRAPH, CAT_INDEX = load_model()

Finally, let's look at our apply function, the heart of any algorithm on Algorithmia. In this function we are provided with an input which can be of multiple types. We first must process this input into an expected schema type which is what the first half of the function is doing.

However, as you might notice, we're using some multiprocessing functionality, specifically a managed list and a Process. Why would we ever want to use a multiprocessing suite for what is essentially a sequential algorithm?

Tensorflow today is accessed through the global tf object. When the infer function exits, that object still holds the state it set up. Part of that state is our GPU memory context, which in practice is only released when the process that owns it exits. Because of this, we can run into issues with Tensorflow not releasing GPU memory when it should, which can cause lots of complications later on down the road. By running the Tensorflow inference in a separate process, and then letting that process exit, we release the Tensorflow GPU memory context without influencing performance!
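
The pattern itself is easy to try without Tensorflow. Here's a minimal sketch: run a worker in a child process, collect its results through a managed list, and let the child exit so that everything it allocated goes away with it. The names run_in_child and square_into are made up for this example, and we assume the "fork" start method since this tutorial targets Linux.

```python
import multiprocessing


def square_into(values, output):
    # Stand-in for infer(): do the work, then report through the shared list.
    for v in values:
        output.append(v * v)


def run_in_child(target, *args):
    """Run target(*args, output) in a child process and return its results."""
    # Assumption: "fork" is available, which holds on Linux.
    ctx = multiprocessing.get_context("fork")
    with ctx.Manager() as manager:
        output = manager.list()
        proc = ctx.Process(target=target, args=args + (output,))
        proc.start()
        proc.join()  # once the child exits, its resources are released
        # Copy into a plain list before the manager shuts down.
        return list(output), proc.exitcode


results, exit_code = run_in_child(square_into, [1, 2, 3])
print(results, exit_code)
```

Copying the managed list into a plain list before leaving the with block matters, because the proxy stops working once the manager has shut down.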

After we extract our results from our managed list, we can quickly finish off with some post processing and return it!

And finally, at the bottom of this script, you can see that we call load_model() at module level. This means that the frozen graph is loaded into host memory once, ahead of any request, which dramatically reduces API request latency and variability.

And that's it. We're done! If you want to see a working demo algorithm of this object detector take a look here.
