Bax

Bax is a gem we wrote to encapsulate our HTML parsing.

THIS GEM IS DEPRECATED. We have broken it into a number of smaller gems.

It was originally extracted from the crawler in a previous version of the platform. It handles much of the parsing groundwork automatically, so you can focus on your own business logic.

The Bax gem talks to api.audienti.com and uses a key to route requests through our proxies when accessing various services. You can bypass the proxies by passing the via: 'direct' parameter on your request.

#doc - This is a Nokogiri::HTML document object. It lets you query with CSS and XPath selectors to retrieve elements from a page. It is available on every page we return.

#meta - This is all the meta information from the head of the page, extracted and turned into a hash. Any page with a 200 response will have a meta hash. In many cases, some of the information you need when parsing is already in here, and reading it is much easier than writing a CSS extractor. Most sites implement OpenGraph (og:) or Twitter (twitter:) meta information, so you can pull data directly from the hash. Some sites, like Pinterest, put ALL of the page's information in meta, which makes extraction easy.
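
The idea of preferring meta over selectors can be sketched as follows. The hash below is a hypothetical stand-in for what #meta returns; its exact key format (string keys such as "og:title") is an assumption, and first_meta is an illustrative helper, not part of Bax.

```ruby
# Hypothetical stand-in for the hash #meta returns; the real key format
# (strings vs. symbols, how prefixes are stored) is an assumption.
meta = {
  "og:title" => "Blue Widget",
  "og:image" => "https://example.com/widget.png",
  "twitter:description" => "A very blue widget.",
  "description" => "Generic description"
}

# Illustrative helper: return the first non-blank value among the given keys.
def first_meta(meta, *keys)
  keys.each do |key|
    value = meta[key]
    return value unless value.nil? || value.strip.empty?
  end
  nil
end

# Prefer OpenGraph, then Twitter, then the plain meta tag.
title       = first_meta(meta, "og:title", "twitter:title", "title")
description = first_meta(meta, "og:description", "twitter:description", "description")
image       = first_meta(meta, "og:image", "twitter:image")

puts title
puts description
puts image
```

Falling back through several keys keeps the extraction working on sites that implement only one of the two conventions.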

#write_file_to_temp - This does what the name says: it takes the current page and writes its HTML to the TMP folder. You can then review the file or open it in a browser. The JavaScript does not execute, but you can see the structure of the page. It is helpful when you are working out how to parse a page and want to look at it.
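
For readers without the gem at hand, the same debugging trick can be approximated with the standard library. This is a rough sketch, not Bax's implementation: write_html_to_temp and the file name are hypothetical, and the location Bax actually writes to may differ.

```ruby
require "tmpdir"

# Hypothetical helper mirroring the idea behind #write_file_to_temp:
# dump the page's HTML into the system temp directory and return the path.
def write_html_to_temp(html, name: "page.html")
  path = File.join(Dir.tmpdir, name)
  File.write(path, html)
  path
end

html = "<html><body><h1>Blue Widget</h1></body></html>"
path = write_html_to_temp(html, name: "bax_debug_page.html")

# Open this path in a browser or editor to inspect the page structure.
puts "Wrote #{path}"
```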

#cassettes folder - This uses the VCR gem, which is used in tests. VCR records each HTTP interaction the first time it happens and stores it as a YAML (.yml) file in the cassettes folder. On subsequent runs it plays back the recorded response instead of making the request. When you want a fresh version, delete the cassette and the results will be retrieved and recorded again.
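
The record-once, replay-afterwards cycle can be illustrated with a standard-library sketch. This is a simplified stand-in for the behavior described above, not VCR's API or implementation; with_cassette, the demo directory, and the fake fetch lambda are all hypothetical.

```ruby
require "yaml"
require "tmpdir"
require "fileutils"

# Simplified stand-in for record/replay: if a cassette file exists, return
# its contents; otherwise run the block once and save the result as YAML.
def with_cassette(dir, name)
  path = File.join(dir, "#{name}.yml")
  return YAML.load_file(path) if File.exist?(path) # replay

  response = yield                                 # record: do the real work once
  FileUtils.mkdir_p(dir)
  File.write(path, response.to_yaml)
  response
end

dir = File.join(Dir.tmpdir, "bax_cassettes_demo")
FileUtils.rm_rf(dir) # start from a clean folder for the demo

# Fake "request" that counts how often it is really executed.
calls = 0
fetch = -> { calls += 1; { "status" => 200, "body" => "<html>hello</html>" } }

first  = with_cassette(dir, "home_page") { fetch.call } # recorded
second = with_cassette(dir, "home_page") { fetch.call } # replayed, no request

# Deleting the cassette forces the next run to record again.
FileUtils.rm(File.join(dir, "home_page.yml"))
third = with_cassette(dir, "home_page") { fetch.call }

puts "real requests made: #{calls}"
```

In a real test suite you would rely on VCR itself for this; the sketch only shows why deleting a cassette is enough to refresh the recorded data.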
