Proxies and Bax

Bax is the library that we use for retrieving and parsing data from the broader internet (our crawler).

In the architecture, there are (the sketch after this list shows how they fit together):

  • Pages - these parse retrieved data into a usable format.

  • Retriever - this retrieves data directly, through proxies, or through APIs.

  • Configuration - this class handles configuration data.

  • Utils - various text, HTML, and URL methods needed to create usable data.

  • Parsers - tools that parse and extract data.
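
As a rough illustration of how these pieces relate, here is a minimal sketch. Only Bax::Page.get and the parser methods on the page are confirmed in this doc; the flow in the comments is an assumption drawn from the list above.

    # Illustrative flow only -- the component roles in the comments come
    # from the architecture list above, not from Bax's actual internals.
    page = Bax::Page.get("https://audienti.com")
    # Conceptually, behind that one call:
    #   1. Configuration supplies proxy/API settings
    #   2. The Retriever fetches the raw page (directly, via a proxy, or via an API)
    #   3. Utils clean up text, HTML, and URLs
    #   4. Parsers turn the cleaned page into usable data
    page.social_profiles  # parser results are exposed as methods on the page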

Base page: "Bax::Page"

The base "page" is "Bax::Page".

There are several parsers available that use the data from that page. These do things like find the ad networks on the page, extract the content tags, summarize the content, provide key phrases from the page for topic identification, etc. Each of these parsers is separate. Their "result" methods are then mixed back into the main page, so you can call them without calling the "page_parser" class that generated them.
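
A minimal sketch of that mixing/delegation pattern in plain Ruby. The real mechanism in Bax may differ; Forwardable is just one idiomatic way to express it.

    require "forwardable"

    # Stand-in for a parser such as Bax::Page::SocialParser.
    class SocialParser
      def initialize(html)
        @html = html
      end

      # The "result" method that gets surfaced on the page.
      def social_profiles
        # ...scan @html for Twitter, Facebook, YouTube, LinkedIn links...
        []
      end
    end

    class Page
      extend Forwardable

      # Mix the parser's result method back into the page, so callers
      # never have to touch the parser class directly.
      def_delegators :social_parser, :social_profiles

      def initialize(html)
        @html = html
      end

      private

      # Built lazily: parsing happens on demand, then is memoized.
      def social_parser
        @social_parser ||= SocialParser.new(@html)
      end
    end

    Page.new("<html>...</html>").social_profiles  # => []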

So, for example, to get the social accounts that are on a page, you retrieve the page:

    page = Bax::Page.get("https://audienti.com")

and you can then do:

    page.social_profiles

and you will get a list of social profiles, returned as an array of hashes with the various identified accounts: Twitter, Facebook, YouTube, LinkedIn, etc. This is generated by the social parser, and the actual method is Bax::Page::SocialParser.social_profiles. Bax::Page delegates the "social_profiles" method to this class, and it is parsed on demand.
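
The exact key names aren't documented here, but the returned structure is along these lines (illustrative shape, assumed keys):

    # Illustrative output only -- the real keys and values may differ.
    page.social_profiles
    # => [
    #      { network: "twitter",  url: "https://twitter.com/..." },
    #      { network: "facebook", url: "https://facebook.com/..." }
    #    ]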

Then, there are special cases where you need to do some custom parsing. The /pages/ classes are child classes of the Bax::Page class, with custom attributes on them.

For example, if you want to retrieve the company page of a person on LinkedIn and get their company name, company size, etc., this is done with a custom page that uses Bax::Page and all its parsers as a foundation. There are page parsers for LinkedIn, Instagram, Google search results, etc.
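
A sketch of what such a child class might look like. The class name, selectors, and the "doc" accessor are assumptions for illustration, not the real /pages/ code.

    # Hypothetical child page -- the name, selectors, and "doc" accessor
    # are illustrative assumptions.
    class LinkedinCompanyPage < Bax::Page
      # Custom attributes layered on top of the base page and its parsers.
      def company_name
        doc.at_css("h1")&.text&.strip
      end

      def company_size
        doc.at_css(".company-size")&.text&.strip
      end
    end

    page = LinkedinCompanyPage.get("https://www.linkedin.com/company/...")
    page.company_name     # custom attribute
    page.social_profiles  # still available from the base Bax::Page parsers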

Why not use APIs? Scraping versus crawling versus API retrievals

We do not scrape; we crawl. And our crawlers respect corporate norms that keep us from collecting private data: if sites don't want their data viewed and parsed, they need to mark it as such.
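
"Mark it as such" in practice usually means robots.txt, the standard signal crawlers check before fetching. Here is a deliberately naive illustration of that convention, not Bax's actual implementation.

    require "net/http"
    require "uri"

    # Naive robots.txt check, for illustration only. A real crawler
    # would use a proper robots.txt parser that honors per-agent
    # rules and path prefixes.
    def allowed_to_crawl?(url)
      uri    = URI(url)
      robots = Net::HTTP.get(URI("#{uri.scheme}://#{uri.host}/robots.txt"))
      !robots.match?(/^Disallow:\s*\/\s*$/i)  # blanket "Disallow: /" means keep out
    rescue StandardError
      true  # no reachable robots.txt conventionally means crawling is allowed
    end

    allowed_to_crawl?("https://audienti.com")  # => true or false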

Clearly, businesses want us to use their APIs, but the data is often different or unavailable through their API, which is why you sometimes have to retrieve pages from them directly rather than going through the API.

An example: LinkedIn doesn't let you do a company search effectively through their API, or retrieve the people who work at a company. That violates their business model; they want you instead to run ads at those people and to target an entire company's domain, which makes them more money, at the cost of our customers' money.

Another example: Google crawls the entire web, storing everyone's content, yet no one calls them a "scraper." Google calls other people "scrapers" because they don't want other people having that content, or doing to them what they did to others, namely crawling their public content. But the reality is that most scaled services do what you could call "scraping or crawling" for verification, data enrichment, and so on.

Another example: in Slack or Facebook, when you share a link you get a "snippet" that shows the page, the page title, a summary, and an image. That's a crawl/scrape of that page. Almost every service does it.
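
A rough illustration of how those snippets are built: fetch the page and read its Open Graph tags. This is the general technique (here using the nokogiri gem), not Bax's code.

    require "open-uri"
    require "nokogiri"

    # Pull an Open Graph property from the page head, if present.
    def og(doc, prop)
      tag = doc.at_css(%(meta[property="og:#{prop}"]))
      tag && tag["content"]
    end

    doc = Nokogiri::HTML(URI.open("https://audienti.com"))

    snippet = {
      title:       og(doc, "title") || doc.title,
      description: og(doc, "description"),
      image:       og(doc, "image")
    }
    # snippet now holds the title/summary/image a Slack-style unfurl shows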

And services that are "walled gardens" limit you through their API to drive you toward paying for ads, or to keep their users inside their walls, where they make more money.

Our intention in using crawled data isn't nefarious; it isn't an attempt to bypass legal rules or do anything illegal. Rather, it's simply doing what everyone else does to gather supplemental data that our service can use.
