August 23, 2017

Rapidly Extract Information from Public Websites

We have a lot of fun, heavy-hitting algorithms in our marketplace: deep-learning tools like Image Tagger and pipelining mechanisms such as Video Metadata Extraction are designed to bring the power of Machine Learning to your app via easy-to-use APIs.

But sometimes, all you need to do is extract some simple information from publicly available sources: for example, finding all the email addresses of a company's C-Suite, or summarizing the topic pages of a FAQ. You could accomplish some of it with a Python script and some RegEx magic, but that wouldn't bring the benefits of a remote API: datacenter-grade network connections, multiple IPs, and distributed parallel processing. And it wouldn't give you access to more complex algos such as automatic tagging or sentiment analysis. With Algorithmia, you get all the benefits of the cloud without having to build and host your own workers, plus the combined expertise of our growing network of algorithm developers.

Site Spidering

  • You've got to start somewhere! Give a URL to GetLinks, and it hands back all the links it can find in the page. SiteMap goes even further, recursing down to a specified depth and finding the links inside each linked page.
  • Eliminate unreachable links from your results with FindBrokenLinks (or recursively via BrokenLinksScanner).
  • If you want to figure out which pages are the most important, consider running PageRank on your lists of links, or checking their social importance via ShareCounts, so you can focus on just the most linked-to and talked-about content.
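To make the first step concrete, here is a minimal local sketch of what a link-gathering pass does, using only the Python standard library. It is an illustration, not the implementation behind GetLinks; the names `LinkCollector` and `get_links` are hypothetical, and it works on HTML you have already fetched rather than fetching anything itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def get_links(html, base_url):
    """Return the de-duplicated links found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    # dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(collector.links))
```

Recursing to a fixed depth, as SiteMap is described as doing, would amount to calling a function like this on each discovered page while tracking the URLs already visited.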

Page Summarization, Tagging, and Analysis

  • To extract just the text of the main page content (minus HTML and navigation/header/footer), try out Url2Text.
  • If what you need is an automatic summary of the content, fire up SummarizeURL (or Summarizer if you have already extracted the content's text).
  • Get a list of suggested topics for each page via AutoTagURL (or AutoTag if you already have the content).
  • Run SentimentAnalysis to figure out if a reader might consider the content "positive" or "negative".
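For a sense of what "just the text, minus HTML and navigation/header/footer" involves, here is a rough standard-library sketch. It is far cruder than a real content extractor and is not how Url2Text works internally; the names `TextExtractor` and `html_to_text`, and the choice of which tags to skip, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as non-content (an assumption for this sketch).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Gathers visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the page's remaining text as one whitespace-normalized string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())
```

The resulting string is the kind of input you would then hand to a summarizer, tagger, or sentiment analyzer.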

Data Extraction

  • Need to pull out the images from a site? Use Getimagelinks to find the images, then download them with SmartImageDownloader.
  • If you're looking for email addresses, EmailExtractor will scan a piece of text and find all the email addresses, returning them as a simple list (this works better with raw HTML than the output of Url2Text). If what you're after is phone numbers, try out PhoneNumberExtraction instead.
  • For more detailed data extraction, such as "get all the italicised links inside the first DIV", HTMLDataExtractor lets you use XPath to select very specific elements out of a URL or raw HTML.
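The "Python script and some RegEx magic" approach to email extraction might look like the sketch below. The pattern is deliberately simple and will miss or over-match unusual addresses; `EMAIL_PATTERN` and `extract_emails` are hypothetical names, and this is not the logic inside EmailExtractor.

```python
import re

# A deliberately simple pattern (an assumption for this sketch),
# not a full RFC 5322 address matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique email-like strings in text, in first-seen order."""
    return list(dict.fromkeys(EMAIL_PATTERN.findall(text)))


raw_html = (
    '<p>Contact <a href="mailto:ceo@example.com">our CEO</a> '
    "or write to press@example.com.</p>"
)
visible_text = "Contact our CEO or write to press@example.com."

emails_from_html = extract_emails(raw_html)
emails_from_text = extract_emails(visible_text)
```

One plausible reason raw HTML gives better results than extracted text: addresses that appear only inside `mailto:` attributes, like the first one here, disappear once the markup is stripped.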


  • If you want something that wraps up a bunch of the above algos into a single call, try out MegaAnalyzeURL
  • Do you have a website you like, but want to know which others might be relevant? DiscoverSimilarWebsites can help you locate similar content.
  • To create screenshots or PDF extracts of a page, you can rely on URL2PNG, URL2Thumb, or URL2PDF so long as Flash and WebGL don't need to be rendered.
  • If the content isn't in your language, use LanguageIdentification and GoogleTranslate to identify and/or translate it to your own.

As with any content you might acquire, be sure you have permission to use and/or republish anything you grab. We make it easy to pull down information, but don't want to see you involved in a copyright dispute or receive complaints about sending spam without permission!

Have fun out there, and if you build something awesome, let us know!

Icon comes from the Noun Project

Jon Peck
