
Proxies and Bax

Bax is the library that we use for retrieving and parsing data from the broader internet (our crawler).

The architecture has the following components (a rough sketch of how they fit together follows the list):

  • Pages - parse retrieved data into a usable format.

  • Retriever - retrieves data directly, through proxies, or through APIs.

  • Configuration - handles configuration data.

  • Utils - various text, HTML, and URL helpers needed to create usable data.

  • Parsers - tools that parse and extract data.
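
As a rough, hypothetical sketch of how the Retriever and Page pieces relate (everything below except Bax::Page.get is illustrative, not the real Bax source):

    require "net/http"
    require "uri"

    module Bax
      class Retriever
        # Fetches a URL; in the real library this is also where proxy or API
        # retrieval would be chosen. Shown here as a plain HTTP GET.
        def self.fetch(url)
          Net::HTTP.get(URI(url))
        end
      end

      class Page
        attr_reader :html

        # Bax::Page.get(url) is the entry point used later on this page.
        def self.get(url)
          new(Retriever.fetch(url))
        end

        def initialize(html)
          @html = html
        end

        # Parser result methods (social_profiles, etc.) are mixed back in here.
      end
    end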

Basepage "Bax::Page"

The base "page" is the "Bax::Page".

There are several parsers available that use the data from that page. They do things like find the ad_networks on the page, give the content tags, summarize the content, and provide key phrases from the page for topic identification. Each of these parsers is separate, but their "result" methods are mixed back into the main page class, so you can call them without calling the "page_parser" class that generated them.

So, for example, to get the social accounts that are on a page, you retrieve the page with page = Bax::Page.get("https://audienti.com") and then call page.social_profiles, which returns an array of hashes with the various identified accounts, like Twitter, Facebook, YouTube, LinkedIn, etc. (see the snippet below). This is generated by the social parser; the actual method is Bax::Page::SocialParser.social_profiles. Bax::Page delegates this method to that class, and the parsing happens on demand.
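
For concreteness, here is that flow as a snippet (Bax::Page.get and page.social_profiles come from the description above; the hash keys in the loop are assumptions, since the exact schema isn't documented here):

    # Retrieve and parse the page.
    page = Bax::Page.get("https://audienti.com")

    # social_profiles returns an array of hashes describing the identified
    # accounts (Twitter, Facebook, YouTube, LinkedIn, etc.). The key names
    # below are illustrative only.
    page.social_profiles.each do |profile|
      puts "#{profile[:network]}: #{profile[:url]}"
    end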

Then there are special cases where you need to do some custom parsing. The /pages/ classes are child classes of the Bax::Page class, with custom attributes on them.

For example, if you want to retrieve a person's company page on LinkedIn and get the company name, company size, etc., this is done with a custom page that uses Bax::Page and all of its parsers as a foundation (see the sketch below). There are page parsers for LinkedIn, Instagram, Google search results, etc.
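
A purely hypothetical sketch of what such a child class might look like (the class name, CSS selectors, and the doc helper are invented for illustration; only the "child class of Bax::Page with custom attributes" pattern comes from the text above):

    # Illustrative only; not the real Bax code.
    module Bax
      module Pages
        class LinkedinCompany < Bax::Page
          # Custom attributes layered on top of the shared Bax::Page parsers.
          # Assumes a doc helper that returns a parsed (e.g. Nokogiri) document.
          def company_name
            doc.at_css("h1.company-name")&.text&.strip
          end

          def company_size
            doc.at_css(".company-size")&.text&.strip
          end
        end
      end
    end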

Why not use APIs? Scraping versus crawling versus API retrievals

We do not scrape; we crawl. And our crawlers respect the corporate norms that keep us from collecting private data: if a site doesn't want its data viewed and parsed, it needs to mark it as such.

Clearly, businesses want us to use their APIs, but the data is often different or unavailable through the API, which is why you sometimes have to retrieve pages from them directly rather than going through it.

An example: LinkedIn doesn't let you do a company search effectively from its API, or retrieve the people who work at a company. That would undercut its business model; LinkedIn wants you to run ads at those people instead and to target an entire company's domain, which makes it more money at the cost of our customers' money.

Another example: Google crawls the entire web and stores everyone's content, yet no one calls Google a "scraper." Google calls other people "scrapers" because it doesn't want them having that content, or doing to Google what Google did to others, namely crawling their public content. But the reality is that most scaled services do what you could call scraping or crawling, whether for verification, data enrichment, or something else.

Another example: in Slack or on Facebook, when you share a link you get a "snippet" that shows the page title, a summary, and an image. That's a crawl/scrape of that page, and almost every service does it.

And services that are "walled gardens" limit you through their API to drive you toward paying for ads, or to keep their users inside their walls, where they make more money.

Our intention in using crawled data isn't nefarious, an attempt to bypass legal rules, or anything illegal. Rather, it's simply doing what everyone else does: gathering supplemental data that our service can use.
