Modern cyber attacks, such as botnets and ransomware, increasingly depend on (seemingly) randomly generated domain names. Malware uses those domains to establish command and control (C&C) with its operators, a technique called domain fluxing. The recent WannaCry ransomware was famously stopped simply by registering one of those domain names.
Quickly classifying a domain name as *safe* or *malicious* is a critical task in the cybersecurity world. It can help alert security experts to suspicious activity or even block that activity. Such a system has two requirements:
- It needs to be accurate: you don’t want to block your users from accessing safe websites
- It needs to be scalable: it must handle thousands of transactions per second
There are plenty of approaches to this problem, especially in the academic world (S. Yadav - 2010, J. Munro - 2013). The fine folks at H2O.ai also have an excellent code sample we found here. This blog post will briefly describe how H2O’s implementation works and how you can deploy and scale it on Algorithmia.
How it works
The classifier is a basic logistic regression model trained on a pre-processed, labeled dataset. Each line of the dataset contains a domain name and a class, “legit” or “dga”. For the pre-processing stage, a Python script extracts features from each domain name and feeds them as input to the logistic regression model. The extracted features are:
- Shannon entropy
- Character length of the domain name
- Proportion of vowel to non-vowel characters
- Number of common words found in the domain name (from this word list)
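The four features above can be sketched in a few lines of Python. This is a minimal illustration, not H2O’s actual pre-processing script: the `COMMON_WORDS` set is a hypothetical stand-in for the referenced word list, `p_vowels` is assumed to be the share of vowels among all characters, and the input string is used as-is (whether the original strips the TLD first is not stated here).

```python
import math
from collections import Counter

VOWELS = set("aeiou")
# Hypothetical stand-in for the common-word list referenced above.
COMMON_WORDS = {"mail", "book", "face", "news", "shop"}


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(domain):
    """Map a domain name to the four features listed above."""
    name = domain.lower()
    length = len(name)
    vowels = sum(1 for ch in name if ch in VOWELS)
    return {
        "length": length,
        "entropy": shannon_entropy(name),
        # Assumption: share of vowels among all characters.
        "p_vowels": vowels / length if length else 0.0,
        # Counts how many known words appear as substrings.
        "num_words": sum(1 for word in COMMON_WORDS if word in name),
    }
```

Randomly generated names tend to have higher entropy, fewer vowels, and no recognizable words, which is what makes these simple features useful.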
After extracting those features from each domain in the dataset, H2O’s H2OGeneralizedLinearEstimator was used to train the model and print out the confusion matrix. Here’s the code used for training:
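from h2o.estimators.glm import H2OGeneralizedLinearEstimator

print('\nModel: Logistic regression with regularization')
# family='binomial' gives logistic regression; alpha=0 selects pure L2 (ridge) regularization
model = H2OGeneralizedLinearEstimator(model_id='MaliciousDomainModel', family='binomial', alpha=0, Lambda=1e-5)
model.train(x=['length', 'entropy', 'p_vowels', 'num_words'], y='malicious', training_frame=train, validation_frame=valid)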
You can take a look at the entire script used for training here.
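For intuition, a binomial GLM scores a domain by passing a weighted sum of its features through the logistic function. The sketch below shows that computation only; the intercept and weights are made up for illustration and are not the coefficients of the trained model.

```python
import math

# Made-up values for illustration only; the real ones live in the trained model.
HYPOTHETICAL_INTERCEPT = -4.0
HYPOTHETICAL_WEIGHTS = {
    "length": 0.1,
    "entropy": 1.2,
    "p_vowels": -2.0,
    "num_words": -1.5,
}


def malicious_probability(features):
    """Logistic (sigmoid) of the weighted feature sum, as in a binomial GLM."""
    z = HYPOTHETICAL_INTERCEPT + sum(
        weight * features[name] for name, weight in HYPOTHETICAL_WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))
```

A domain would then be labeled malicious when this probability exceeds a chosen threshold, and that threshold is what trades false positives against missed detections.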
Deploying to Algorithmia
The great thing about H2O is the ability to extract trained models as POJO files for high-performance classification. Those POJO files are easy to deploy to Algorithmia as regular Java algorithms. You can take a look at the extracted POJO file here - keep in mind this is mostly computer generated.
One caveat about this algorithm is that the pre-processing stage is done in Python and the classification stage in Java. Typically this would be easy to handle with Algorithmia, since it enables chaining algorithms written in different programming languages. However, because this algorithm was originally developed outside of Algorithmia, the author chose to use Jython to run the pre-processing Python script. The downside is that it complicates the runtime environment; the upside is that everything runs within the same compute node, achieving much lower latency.
Our objective is to run the algorithm as-is, without many changes to the original code. We created a simple wrapper that points to the location of the Jython JAR file and the trained model. You can see our code here.
Scaling the H2O.ai model
At this stage the algorithm was running smoothly on Algorithmia with an average runtime of 10ms. One trick we’ve learned from scaling similar pipelines is to enable batch scoring. In this case, classifying 10 domains takes only 17ms. Now our algorithm can take a single domain name (a single string) or a batch of domain names (an array of strings).
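Accepting either a single string or an array of strings comes down to a small dispatch on the input type. The sketch below illustrates the idea in Python; it is not the actual wrapper, and `classify_one` is a hypothetical placeholder for the real model call.

```python
def classify_one(domain):
    """Hypothetical placeholder for the real model call; flags digit-heavy names."""
    digits = sum(ch.isdigit() for ch in domain)
    return "dga" if digits > len(domain) / 2 else "legit"


def apply(data):
    """Accept a single domain (str) or a batch (list of str)."""
    if isinstance(data, str):
        return classify_one(data)
    if isinstance(data, (list, tuple)) and all(isinstance(d, str) for d in data):
        # One result per input domain, in the same order.
        return [classify_one(d) for d in data]
    raise TypeError("input must be a string or a list of strings")
```

With the timings quoted above, a batch of 10 works out to about 1.7ms per domain instead of 10ms, because the fixed per-call overhead is paid once per batch rather than once per domain.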
So how does it scale? It scales pretty well.
Algorithmia works with the concept of “Slots”. A compute node in the Algorithmia cluster can hold a configurable number of Slots, which are Docker containers initialized just-in-time to fulfill an incoming request.
When an API call is made, the request is routed to a compute node, that compute node assigns it to a Slot, loads the algorithm (or model) into that Slot, feeds it the JSON input, and returns the JSON output all the way back to the client that made the API call. The Algorithmia cluster makes intelligent decisions as to what Slot to leave “loaded” (i.e. in memory) to process additional requests, or “evacuate” (i.e. destroy the container) to release those resources for another API call from another user. An initialized Slot is never shared across users or algorithms, ensuring complete memory isolation in a multi-tenant environment. You can read more about how all this works from our blog post on Building an OS for AI.
In our benchmark above, we make 50 parallel API calls, each classifying a batch of 10 domain names, over and over. That’s 10 transactions per API call.
The Algorithmia cluster distributes those incoming API calls across the available compute nodes, resulting in a fully horizontally distributed experience. The client making those calls does not need to configure anything or do any devops planning ahead of time, such as launching servers or warming up containers. In the benchmark, we go from 0tps (transactions per second) to 4750tps in a couple of seconds - completely devops free.
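The shape of such a benchmark client can be sketched as follows. This is an illustration under stated assumptions, not the script we ran: `call_api` is a hypothetical stand-in that sleeps instead of making a network request, so the throughput it prints says nothing about the real service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

PARALLEL_CALLS = 50   # concurrent API calls, as in the benchmark
BATCH_SIZE = 10       # domain names per call


def call_api(batch):
    """Hypothetical stand-in for one remote API call; returns one label per domain."""
    time.sleep(0.01)  # simulated round trip, not a measured value
    return ["legit"] * len(batch)


def run_round():
    """Fire one round of parallel batch calls and return (transactions, seconds)."""
    batches = [
        [f"domain{i}-{j}.com" for j in range(BATCH_SIZE)]
        for i in range(PARALLEL_CALLS)
    ]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=PARALLEL_CALLS) as pool:
        results = list(pool.map(call_api, batches))
    elapsed = time.perf_counter() - start
    return sum(len(r) for r in results), elapsed


transactions, seconds = run_round()
print(f"{transactions} transactions in {seconds:.3f}s ({transactions / seconds:.0f} tps)")
```

Each round carries 50 × 10 = 500 transactions, so transactions per second is simply that count divided by the wall-clock time of the round, repeated over many rounds.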
This was a great exercise to show how Algorithmia supports H2O models right out of the box and, more importantly, the level of production readiness and scale you can achieve in a multi-tenant environment. If you have an H2O model you’d like to productionize, send us a note at firstname.lastname@example.org and we’ll be happy to help!