Scraping and extracting structured data from web pages can often be a challenge. There are typically issues with fetching data, dealing with pagination, handling AJAX, and more.
The Analyze URL microservice is a useful tool for transforming messy unstructured data into clean structured data via a simple REST API.
We created this microservice by combining several simple functions into a more complex web service. Check out the Analyze URL source code here.
By chaining these functions together, where the result of one is applied to the next, we get what computer science calls a pipeline.
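To make the pipeline idea concrete, here is a minimal sketch in Python. The three step functions are hypothetical stand-ins, not the actual functions inside Analyze URL; the point is only to show how the result of one function is handed to the next.

```python
from functools import reduce
from typing import Any, Callable


def pipeline(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Return a function that feeds its input through each step in order."""
    def run(value: Any) -> Any:
        # The output of each step becomes the input of the next one.
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# Hypothetical stand-ins for real scraping steps (assumptions for illustration).
def strip_whitespace(text: str) -> str:
    return text.strip()


def first_sentence(text: str) -> str:
    return text.split(". ")[0]


def to_record(sentence: str) -> dict:
    return {"summary": sentence, "length": len(sentence)}


summarize = pipeline(strip_whitespace, first_sentence, to_record)
record = summarize("  Build a health dashboard. Then graph the trends.  ")
print(record)
```

Each step stays small and testable on its own, and swapping one step for another does not touch the rest of the chain.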
Let’s take a look at how Analyze URL is an example of a microservice composed of a few functions used to scrape a URL and return structured data from the web page.
tl;dr Don’t like reading? Just want to extract structured data from web pages? We got you. Check out the Web Page Inspector demo built off of Analyze URL. Web Page Inspector instantly retrieves clean, structured data from any URL.
What is Analyze URL?
The Analyze URL microservice is an easy way to scrape metadata from any URL. The service accepts a URL as a string, and returns a summary of the web page (using the Summarizer microservice), timestamp, thumbnail, plain text content, the title, URL, and status code for the page.
The microservice is built from several underlying functions (or algorithms, if you will) that extract and produce structured data.
Here we’re making an important distinction between algorithms and microservices (for convenience, we’ll skip the semantics and use algorithm and function interchangeably):
- Algorithms are deterministic, meaning that given the same input, an algorithm will always produce the same output.
- When we compose several algorithms together we get a microservice, a sort of self-contained service that achieves a specific task predictably.
When we say Analyze URL is a microservice, we’re implying that it’s the result of several algorithms (or functions) working behind the scenes to produce some desired result. In this case, a bunch of structured data from a URL.
Check out the source code behind Analyze URL to see what we mean by functions working together.
Why You Need Analyze URL
While it’s pretty straightforward to scrape a single web page, it’s not as simple when you want to scrape any URL and consistently extract structured data. It’s harder still when scraping hundreds or thousands of different URLs at a time.
Analyze URL works as a simple API endpoint that’s always on and available.
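When you are working through a long list of URLs, one bad page should not stop the rest. The sketch below is one way to handle that, under stated assumptions: `analyze_many` and `fake_analyze` are hypothetical names, and `analyze` stands for any callable that takes a URL and returns a dict (in practice it could wrap a call to Analyze URL).

```python
from typing import Callable, Dict, Iterable, List, Tuple


def analyze_many(
    urls: Iterable[str],
    analyze: Callable[[str], dict],
) -> Tuple[List[dict], Dict[str, str]]:
    """Run `analyze` on every URL, collecting results and failures separately."""
    results: List[dict] = []
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            results.append(analyze(url))
        except Exception as exc:
            # Record the error message and keep going with the remaining URLs.
            failures[url] = str(exc)
    return results, failures


# Hypothetical stand-in so the sketch runs without network access.
def fake_analyze(url: str) -> dict:
    if "broken" in url:
        raise ValueError("could not fetch page")
    return {"url": url, "title": "Example"}


results, failures = analyze_many(
    ["http://example.com/a", "http://example.com/broken"],
    fake_analyze,
)
print(len(results), "succeeded;", len(failures), "failed")
```

Keeping failures in their own mapping makes it easy to retry just those URLs later.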
How to Extract Structured Data
import Algorithmia

input = "//blog.algorithmia.com/predictive-algorithms-track-real-time-health-trends/"
client = Algorithmia.client('YOUR API KEY')
algo = client.algo('web/AnalyzeURL/0.2.14')
print(algo.pipe(input).result)  # an excerpt of the output follows
"summary": "In this tutorial, we're going to build a real-time health dashboard using predictive algorithms to track a person's blood pressure trends over time.",
"text": "Build tomorrow's smart apps today This is a guest post by Chris Hannam, a professional Python and Java developer. Want to contribute your own how-to post Let us know contact us here. We’ve shown how to use predictive algorithms to track economic development. In this tutorial, we’re going to build a real-time health dashboard for tracking a person’s blood pressure readings, do time series analysis, and then graph the trends over time using predictive algorithms. ...",
"title": "Predictive Algorithms to Track Your Health Data In Real Time - Algorithmia",
This is just the start of composing algorithms into microservices on Algorithmia. Think of it as Algorithms as a Microservice where you have access to an ever-growing library of more than 2,200 algorithms.
For instance, you could create a new microservice by running a URL through site map, and then piping the output into Analyze URL to extract structured data for an entire domain. Oh, hey, look: a simple Python script to do just that.
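As a rough sketch of that composition (not the script mentioned above), the function below pipes the links from one step into another. `analyze_domain`, `list_links`, and `analyze` are hypothetical names; in practice each callable would wrap a call to the corresponding algorithm, and the fakes here exist only so the example runs offline.

```python
from typing import Callable, Iterable, List


def analyze_domain(
    root_url: str,
    list_links: Callable[[str], Iterable[str]],
    analyze: Callable[[str], dict],
    limit: int = 10,
) -> List[dict]:
    """Pipe the links found for `root_url` into `analyze`, one record per page."""
    seen = set()
    records: List[dict] = []
    for link in list_links(root_url):
        if link in seen:
            continue  # skip duplicate links reported by the site map step
        seen.add(link)
        records.append(analyze(link))
        if len(records) >= limit:
            break  # cap the number of pages analyzed per run
    return records


# Hypothetical stand-ins for the site map and Analyze URL steps.
def fake_site_map(url: str) -> List[str]:
    return [url + "/", url + "/about", url + "/about", url + "/blog"]


def fake_analyze(url: str) -> dict:
    return {"url": url, "title": "Page at " + url}


pages = analyze_domain("http://example.com", fake_site_map, fake_analyze, limit=3)
print([page["url"] for page in pages])
```

Passing the two steps in as arguments keeps the chaining logic independent of any particular algorithm version.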
Because Algorithmia makes it easy to turn business logic into a web service, you can now instantly deploy your backend code as an API for public or private consumption. Every algorithm runs as its own microservice, making them composable, interoperable, and secure. No servers, NoOps. You’re good to go.
Check out the Web Page Inspector demo that uses Analyze URL to instantly retrieve clean, structured data from any URL.
Let us know what you think @Algorithmia.