
Clustering

What is Clustering

Clustering (grouping profiles into influencers that make sense) is a core algorithm of our platform. The goals of clustering are to:

  • Group all the profiles that belong together onto a single influencer

  • Keep influencers that shouldn't be together from being merged (aggregation mistakes)

Problems we are trying to solve

The problems we have in enrichment stem from false positives and mis-inclusion. False positives happen when we parse a page and find profiles that don't actually belong to the influencer, for example profiles of someone else whom the influencer has mentioned or referenced. Alternatively, the user themselves may have referenced a third-party site in their profile that isn't valid.

Every time we add a mention, we go through enrichment, which can change which profiles are associated with each other. This allows us to groom the influencer as we get more data. The current approach is as follows:

  • A new influencer is created from a profile. This profile is labeled with source_type = mention and source_id = mention_id.

  • As an influencer goes through enrichment, the discovered profiles are labeled with a source_type = profile and a source_id = the profile_id that was being enriched.

  • Every profile is simhashed.

  • This creates parent_profiles and child_profiles. A parent can be either a profile, or a mention. A child can only be a mention.
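
The labeling and simhashing steps above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual implementation: the `Profile` class, its field names, the `label_*` helpers, and the token-based 64-bit simhash are all assumptions made for the example. Only `source_type` and `source_id` come from the text above.

```python
import hashlib
from dataclasses import dataclass


def simhash(text: str, bits: int = 64) -> int:
    """Compute a basic simhash fingerprint over whitespace tokens (illustrative)."""
    weights = [0] * bits
    for token in text.lower().split():
        # Hash each token to a stable 64-bit integer.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when its accumulated weight is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


@dataclass
class Profile:  # hypothetical structure, for illustration only
    profile_id: str
    text: str
    source_type: str  # "mention" or "profile"
    source_id: str    # mention_id, or the profile_id that was being enriched
    simhash_value: int = 0


def label_root_profile(profile_id: str, text: str, mention_id: str) -> Profile:
    """Profile that creates a new influencer: sourced from a mention."""
    return Profile(profile_id, text, "mention", mention_id, simhash(text))


def label_discovered_profile(profile_id: str, text: str, parent: Profile) -> Profile:
    """Profile discovered during enrichment: sourced from the enriched profile."""
    return Profile(profile_id, text, "profile", parent.profile_id, simhash(text))
```

Two fingerprints are then compared by their Hamming distance; a smaller distance means the underlying texts are more similar.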

General processing and guidelines

  • After a new mention is added, the influencer will be checked through engagement to see whether their profiles have changed. If they have changed (a new profile was added due to the mention), the influencer will go through enrichment.

  • The influencer will be enriched. At the end of enrichment, the influencer profiles will be clustered.

  • During clustering, based on the inclusion/exclusion criteria below, the influencer will be "repacked", potentially breaking up an influencer into smaller influencers, or aggregating two influencers into a single influencer.

  • The influencer will be sent for scoring, and will be rescored.

Rules for making a collection of profiles the same influencer

  • If all profiles have the same source mention from enrichment, leading to the same root profile, and are within the SIMHASH_PROXIMITY_VALUE, then they are part of the same influencer.

  • If a profile does not have the same parent profile from enrichment, but is within the HIGH_SIMHASH_PROXIMITY_VALUE, then it will be merged and included in the same influencer.

  • If the profile has a parent profile that is sourced from FullContact, then this child profile should be included in the influencer.
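
The inclusion rules above could be expressed along these lines. This is a sketch under several assumptions: proximity is read as a maximum Hamming distance between simhashes, HIGH_SIMHASH_PROXIMITY_VALUE is read as the stricter (smaller) threshold, the numeric values are placeholders, and the `Profile` fields `root_id` and `parent_source` are hypothetical.

```python
from dataclasses import dataclass

# Placeholder thresholds (assumed): maximum allowed Hamming distance.
SIMHASH_PROXIMITY_VALUE = 10
HIGH_SIMHASH_PROXIMITY_VALUE = 3


@dataclass
class Profile:  # hypothetical structure, for illustration only
    profile_id: str
    root_id: str             # root profile reached by following parents
    simhash_value: int
    parent_source: str = ""  # e.g. "fullcontact"


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def should_include(candidate: Profile, root: Profile) -> bool:
    """Decide whether a candidate profile joins the root profile's influencer."""
    # A parent sourced from FullContact is trusted regardless of distance.
    if candidate.parent_source == "fullcontact":
        return True
    distance = hamming_distance(candidate.simhash_value, root.simhash_value)
    # Same root from enrichment: the normal threshold applies.
    if candidate.root_id == root.root_id:
        return distance <= SIMHASH_PROXIMITY_VALUE
    # Different lineage: only merge when the simhashes are very close.
    return distance <= HIGH_SIMHASH_PROXIMITY_VALUE
```

Checking the FullContact rule first reflects the reading that it applies unconditionally; if it is meant to be combined with a proximity check, the order would need to change.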

Rules for excluding a collection of profiles from the same influencer

  • If the profile is not a mention-created root profile, and its simhash is not within the SIMHASH_PROXIMITY_VALUE of any root URL in the project, it will be removed from the main influencer.
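
A rough sketch of this exclusion check, assuming proximity means a maximum Hamming distance between simhashes. The threshold value and the function and parameter names are placeholders, not taken from the codebase.

```python
from typing import Iterable

SIMHASH_PROXIMITY_VALUE = 10  # placeholder threshold (assumed)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def should_exclude(source_type: str, simhash_value: int,
                   root_simhashes: Iterable[int]) -> bool:
    """Return True when a profile should be removed from the main influencer."""
    # Mention-created root profiles are never excluded by this rule.
    if source_type == "mention":
        return False
    # Exclude only if the profile is far from every root in the project.
    return all(
        hamming_distance(simhash_value, root) > SIMHASH_PROXIMITY_VALUE
        for root in root_simhashes
    )
```

Note that with an empty list of roots, `all()` returns True, so a non-mention profile would be excluded; whether that matches the intended behavior is an open question.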

Rules for rebuilding the remaining influencers

  • If profiles are left that are not part of the original URL, and do not have a mention-derived profile, then they will be merged based on their simhash proximity using the SIMHASH_PROXIMITY_VALUE into a new influencer.
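
One way to picture this rebuilding step is as single-linkage grouping: any two leftover profiles whose simhashes fall within the threshold end up in the same new influencer. The sketch below uses a small union-find for this. The grouping strategy, the threshold value, and all names are assumptions for illustration; the text does not say how the merge is actually performed.

```python
from typing import Dict, List, Set

SIMHASH_PROXIMITY_VALUE = 10  # placeholder threshold (assumed)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def rebuild_influencers(leftovers: Dict[str, int],
                        threshold: int = SIMHASH_PROXIMITY_VALUE) -> List[Set[str]]:
    """Group leftover profile ids (mapped to simhashes) into new influencers."""
    parent = {pid: pid for pid in leftovers}

    def find(x: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(leftovers)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if hamming_distance(leftovers[a], leftovers[b]) <= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, Set[str]] = {}
    for pid in ids:
        groups.setdefault(find(pid), set()).add(pid)
    return list(groups.values())
```

The pairwise comparison is quadratic in the number of leftover profiles, which is fine for a sketch but would need indexing for large sets.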

Rules for not allowing a discovered profile into the system

  • The discovered profile is not a site profile (Twitter/Facebook), but references a site profile of an Alexa Top 1000 site (NYTimes, etc).
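
As an illustration, this rule might be checked by comparing the discovered URL's domain against two lists. Both sets below are tiny stand-ins: the real list of recognized profile sites and the Alexa Top 1000 list are not given in the text, and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Stand-in lists (assumed); the real ones would be much longer.
SOCIAL_PROFILE_DOMAINS = {"twitter.com", "facebook.com"}
TOP_SITE_DOMAINS = {"nytimes.com"}


def domain_of(url: str) -> str:
    """Extract the host from a URL, dropping a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def should_reject(url: str) -> bool:
    """Return True when a discovered profile should not enter the system."""
    domain = domain_of(url)
    # Recognized profile sites are allowed even though they are top sites.
    if domain in SOCIAL_PROFILE_DOMAINS:
        return False
    # Anything else that lives on a top site is treated as a false positive.
    return domain in TOP_SITE_DOMAINS
```

The social-site check comes first because sites such as Twitter and Facebook would themselves appear on a top-sites list.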


Last updated 7 years ago