Clustering

What is Clustering

Clustering, the process of grouping profiles into influencers that make sense, is a core algorithm of our platform. The goals of clustering are to:

Group all the profiles that belong together under a single influencer
Keep influencers that shouldn't be together from being merged (aggregation mistakes)

Problems we are trying to solve

The problems we have in enrichment stem from false positives and mis-inclusion. False positives happen when we parse a page and find profiles that don't actually belong to the influencer: for example, profiles of someone else whom the influencer has mentioned or referenced. Alternatively, the user themselves has referenced a third-party site in their profile that isn't valid.

Every time we add a mention, we go through enrichment, which can change how profiles are associated with one another. This allows us to groom the influencer as we get more data. The current approach is as follows:

A new influencer is created from a profile. This profile is labeled with source_type = mention and source_id = mention_id.
As an influencer goes through enrichment, the discovered profiles are labeled with source_type = profile and source_id = the profile_id of the profile that was being enriched.
Every profile is simhashed.
This creates parent_profiles and child_profiles. A parent can be either a profile or a mention. A child can only be a mention.
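The labeling and simhashing steps above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the Profile class, its text field, the token-based simhash, and the 64-bit fingerprint size are all assumptions made for the example. Only source_type and source_id come from the description above.

```python
import hashlib
from dataclasses import dataclass


def simhash(text: str, bits: int = 64) -> int:
    """Illustrative simhash over whitespace-separated tokens (assumed scheme)."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


@dataclass
class Profile:  # hypothetical structure, for illustration only
    profile_id: str
    text: str
    source_type: str  # "mention" or "profile"
    source_id: str    # mention_id, or profile_id of the enriched profile
    simhash: int = 0

    def __post_init__(self) -> None:
        # Every profile is simhashed when it is created.
        self.simhash = simhash(self.text)


# A root profile created from a mention, and a profile discovered from it.
root = Profile("p1", "travel blogger and photographer", "mention", "m1")
child = Profile("p2", "travel blogger and photographer", "profile", root.profile_id)
```

Two profiles with similar text end up with fingerprints that differ in only a few bits, which is what the proximity thresholds in the rules below rely on.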

General processing and guidelines

After a new mention is added, the influencer is checked through engagement to see whether their profiles have changed. If they have changed (a new profile was added due to the mention), the influencer goes through enrichment.
The influencer is enriched. At the end of enrichment, the influencer's profiles are clustered.
During clustering, based on the inclusion/exclusion criteria below, the influencer is "repacked", potentially breaking up an influencer into smaller influencers, or aggregating two influencers into a single influencer.
The influencer is then sent for scoring and rescored.
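As a rough sketch, the processing order above might look like the function below. The function name, the dictionary layout, and the enrich, cluster, and score callables are hypothetical stand-ins for the real pipeline stages. It also assumes that an influencer whose profiles have not changed skips the remaining steps, which the description above does not state explicitly.

```python
from typing import Callable, Dict, List

Influencer = Dict[str, List[str]]  # hypothetical shape: {"profile_ids": [...]}


def process_new_mention(
    influencer: Influencer,
    mention_profile_ids: List[str],
    enrich: Callable[[Influencer], None],
    cluster: Callable[[Influencer], None],
    score: Callable[[Influencer], None],
) -> bool:
    """Run enrichment, clustering, and scoring if the mention added profiles."""
    known = set(influencer["profile_ids"])
    added = [pid for pid in mention_profile_ids if pid not in known]
    if not added:
        # Assumption: nothing changed, so nothing is reprocessed.
        return False
    influencer["profile_ids"].extend(added)
    enrich(influencer)   # discover further profiles
    cluster(influencer)  # repack based on inclusion/exclusion rules
    score(influencer)    # rescore the repacked influencer
    return True
```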

Rules for making a collection of profiles the same influencer

If all profiles have the same source mention from enrichment, leading to the same root profile, and are within the SIMHASH_PROXIMITY_VALUE, then they are part of the same influencer.
If a profile does not have the same parent profile from enrichment, but is within the HIGH_SIMHASH_PROXIMITY_VALUE, then it will be merged and included in the same influencer.
If the profile has a parent profile that is sourced from FullContact, then this child profile should be included in the influencer.
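The three inclusion rules above could be expressed as a single predicate along these lines. The numeric thresholds are placeholders, since the real values are not given here. The sketch also assumes that both constants are maximum Hamming distances between simhashes and that HIGH_SIMHASH_PROXIMITY_VALUE is the stricter (smaller) of the two. The Candidate class and its fields are hypothetical.

```python
from dataclasses import dataclass

# Placeholder values; the real thresholds are not specified in this document.
SIMHASH_PROXIMITY_VALUE = 8
HIGH_SIMHASH_PROXIMITY_VALUE = 3


@dataclass
class Candidate:  # hypothetical structure, for illustration only
    simhash: int
    root_profile_id: str     # root profile reached through enrichment
    parent_source: str = ""  # e.g. "fullcontact" if the parent came from there


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def should_include(candidate: Candidate, root_id: str, root_simhash: int) -> bool:
    """Return True if the candidate belongs to the influencer rooted at root_id."""
    # Rule 3: a FullContact-sourced parent always includes the child.
    if candidate.parent_source == "fullcontact":
        return True
    distance = hamming_distance(candidate.simhash, root_simhash)
    # Rule 1: same root profile, within the normal proximity value.
    if candidate.root_profile_id == root_id:
        return distance <= SIMHASH_PROXIMITY_VALUE
    # Rule 2: different parent, but within the stricter proximity value.
    return distance <= HIGH_SIMHASH_PROXIMITY_VALUE
```

For example, a candidate five bits away from the root is included when it shares the root profile, but not when it was discovered through a different parent.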

Rules for excluding a collection of profiles from the same influencer

If a profile is not a mention-created root profile, and its simhash is not within the SIMHASH_PROXIMITY_VALUE of any root URL in the project, it will be removed from the main influencer.
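A minimal sketch of the exclusion rule follows, assuming each profile is a dictionary carrying its source_type and simhash, and that the root URLs of the project are represented by their simhashes. The function name, the dictionary keys, and the default threshold of 8 are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def split_excluded(
    profiles: List[Dict],
    root_simhashes: List[int],
    max_distance: int = 8,  # stands in for SIMHASH_PROXIMITY_VALUE
) -> Tuple[List[Dict], List[Dict]]:
    """Split profiles into (kept, removed) for the main influencer."""
    kept, removed = [], []
    for profile in profiles:
        is_mention_root = profile["source_type"] == "mention"
        near_a_root = any(
            hamming_distance(profile["simhash"], root) <= max_distance
            for root in root_simhashes
        )
        if is_mention_root or near_a_root:
            kept.append(profile)
        else:
            removed.append(profile)
    return kept, removed
```

The removed list is what the rebuilding rules in the next section would then operate on.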

Rules for rebuilding the remaining influencers

If profiles remain that are not part of the original URL and do not have a mention-derived profile, they will be merged into a new influencer based on their simhash proximity, using the SIMHASH_PROXIMITY_VALUE.
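One way to merge leftover profiles by simhash proximity is single-linkage grouping with a union-find structure, sketched below. This is only one possible interpretation of the rule: the document does not say which grouping strategy is used, and the function name and the placeholder threshold are assumptions.

```python
from itertools import combinations
from typing import Dict, List

SIMHASH_PROXIMITY_VALUE = 8  # placeholder; the real value is not given here


def rebuild_influencers(
    simhashes: Dict[str, int],
    max_distance: int = SIMHASH_PROXIMITY_VALUE,
) -> List[List[str]]:
    """Group leftover profile ids whose simhashes are close to each other."""
    parent = {pid: pid for pid in simhashes}

    def find(pid: str) -> str:
        # Walk up to the group representative, compressing the path.
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    # Join every pair of profiles that is within the proximity threshold.
    for a, b in combinations(simhashes, 2):
        if bin(simhashes[a] ^ simhashes[b]).count("1") <= max_distance:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for pid in simhashes:
        groups.setdefault(find(pid), []).append(pid)
    # Each group would become one new influencer.
    return sorted(sorted(group) for group in groups.values())
```

Note that single linkage is transitive: if A is close to B and B is close to C, all three end up together even when A and C are not within the threshold of each other. A stricter strategy may be preferable if that causes aggregation mistakes.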

Rules for not allowing a discovered profile into the system

The discovered profile is not a site profile (Twitter/Facebook), but references a site profile of an Alexa Top 1000 site (NYTimes, etc.).
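The rejection rule above might be checked by hostname, as in the sketch below. Both host sets are tiny illustrative stand-ins: the real list of recognized site profiles and the Alexa Top 1000 list are not reproduced here, and matching on the hostname alone is an assumption about how the rule is applied.

```python
from urllib.parse import urlparse

# Illustrative stand-ins only; the real lists are not part of this document.
SITE_PROFILE_HOSTS = {"twitter.com", "facebook.com"}
TOP_SITE_HOSTS = {"nytimes.com"}


def normalized_host(url: str) -> str:
    """Lowercased hostname without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def should_reject(url: str) -> bool:
    """Reject URLs on a top site that is not a recognized site profile host."""
    host = normalized_host(url)
    return host not in SITE_PROFILE_HOSTS and host in TOP_SITE_HOSTS
```

With these stand-in lists, a link to an author page on nytimes.com would be rejected, while a twitter.com profile or a personal blog would be allowed through to the other rules.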


Last updated 7 years ago