# Clustering

## What is Clustering

Clustering, the process of assembling profiles into influencers that make sense, is a core algorithm of our platform. The goals of clustering are to:

* Group all the profiles that belong together under a single influencer
* Keep influencers that shouldn't be together apart (avoiding aggregation mistakes)

## Problems we are trying to solve

The problems we have in enrichment center on false positives and mis-inclusion. False positives happen when we parse a page and find profiles that don't actually belong to the influencer, for example profiles of someone else whom the influencer has mentioned or referenced. Mis-inclusion also happens when the user themselves has referenced a third-party site in their profile that isn't a valid profile of theirs.

Every time we add a mention, we go through enrichment, which can change how profiles are associated with each other. This allows us to groom the influencer as we get more data. The current approach is as follows:

* A new influencer is created from a profile. This profile is labeled with source\_type = mention and source\_id = mention\_id.
* As an influencer goes through enrichment, the discovered profiles are labeled with a source\_type = profile and a source\_id = the profile\_id that was being enriched.
* Every profile is simhashed.
* This creates parent\_profiles and child\_profiles. A parent can be either a profile or a mention. A child can only be a profile.
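The labeling scheme above can be sketched as a small data model. This is a minimal illustration, not the platform's actual schema: the `Profile` class, its field layout, and the `root_profile` helper are hypothetical names, and only `source_type` and `source_id` come from the text above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Profile:
    """Hypothetical record for a profile and where it was discovered."""
    profile_id: int
    url: str
    source_type: str  # "mention" or "profile", as described above
    source_id: int    # mention_id or the profile_id that was being enriched
    simhash: int      # fingerprint of the profile content


def root_profile(profile: Profile, profiles_by_id: Dict[int, Profile]) -> Profile:
    """Follow source_id links upward until a mention-created profile is reached."""
    seen = set()
    current = profile
    while current.source_type == "profile":
        if current.profile_id in seen:
            raise ValueError("cycle in parent links")
        seen.add(current.profile_id)
        current = profiles_by_id[current.source_id]
    return current
```

Walking the parent links this way is one possible way to find the mention-created root that a discovered profile ultimately traces back to.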

## General processing and guidelines

* After a new mention is added, engagement checks whether the influencer's profiles have changed. If they have (for example, a new profile was added because of the mention), the influencer will go through enrichment.
* The influencer will be enriched. At the end of enrichment, the influencer profiles will be clustered.
* During clustering, based on the inclusion/exclusion criteria below, the influencer will be "repacked", potentially breaking up an influencer into smaller influencers, or aggregating two influencers into a single influencer.
* The influencer will be sent for scoring, and will be rescored.
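The flow above can be sketched as a single function. This is an illustration under assumptions: `handle_new_mention` and its four callables are hypothetical stand-ins for the real pipeline stages, and skipping enrichment, clustering, and scoring when nothing changed is an assumed reading of the first step.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def handle_new_mention(
    influencer: T,
    profiles_changed: Callable[[T], bool],
    enrich: Callable[[T], T],
    cluster: Callable[[T], List[T]],
    score: Callable[[T], None],
) -> List[T]:
    """Run the post-mention steps in order and return the resulting influencers."""
    # Assumption: an influencer whose profiles did not change is left alone.
    if not profiles_changed(influencer):
        return [influencer]

    enriched = enrich(influencer)
    # Clustering may "repack" one influencer into several (or merge others in).
    repacked = cluster(enriched)
    for result in repacked:
        score(result)  # every resulting influencer is rescored
    return repacked
```

Passing the stages in as callables keeps the sketch independent of how each stage is actually implemented.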

## Rules for making a collection of profiles the same influencer

* If all profiles have the same source mention from enrichment, leading to the same root profile, and are within the SIMHASH\_PROXIMITY\_VALUE, then they are part of the same influencer.
* If a profile does not have the same parent profile from enrichment, but is within the HIGH\_SIMHASH\_PROXIMITY\_VALUE, then it will be merged and included in the same influencer.
* If the profile has a parent profile that is sourced from FullContact, then this child profile should be included in the influencer.
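The three inclusion rules can be sketched as one predicate. Several things here are assumptions rather than facts from this page: proximity is measured as the Hamming distance between integer simhashes, HIGH\_SIMHASH\_PROXIMITY\_VALUE is the stricter (smaller) threshold, both numeric values are placeholders, and the `Candidate` class and `should_include` function are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder thresholds; the real values are not given on this page.
SIMHASH_PROXIMITY_VALUE = 6
HIGH_SIMHASH_PROXIMITY_VALUE = 2


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two simhashes differ."""
    return bin(a ^ b).count("1")


@dataclass(frozen=True)
class Candidate:
    simhash: int
    root_id: int  # id of the mention-created root this profile traces back to
    parent_from_fullcontact: bool = False


def should_include(candidate: Candidate, anchor: Candidate) -> bool:
    """Decide whether candidate belongs to the same influencer as anchor."""
    # Rule 3: a parent sourced from FullContact always admits the child.
    if candidate.parent_from_fullcontact:
        return True
    distance = hamming_distance(candidate.simhash, anchor.simhash)
    # Rule 1: same root, normal proximity threshold.
    if candidate.root_id == anchor.root_id:
        return distance <= SIMHASH_PROXIMITY_VALUE
    # Rule 2: different parentage, stricter proximity threshold.
    return distance <= HIGH_SIMHASH_PROXIMITY_VALUE
```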

## Rules for excluding a collection of profiles from the same influencer

* If the profile is not a mention-created root profile, and its simhash is not within the SIMHASH\_PROXIMITY\_VALUE of any root URL in the project, it will be removed from the main influencer.
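A minimal sketch of the exclusion rule, assuming proximity means Hamming distance between integer simhashes and using a placeholder threshold. The `Profile` class, the `split_by_exclusion_rule` function, and the `root_simhashes` argument (the simhashes of the project's root URLs) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

SIMHASH_PROXIMITY_VALUE = 6  # placeholder; the real value is not given here


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two simhashes differ."""
    return bin(a ^ b).count("1")


@dataclass(frozen=True)
class Profile:
    url: str
    simhash: int
    source_type: str  # "mention" or "profile"


def split_by_exclusion_rule(
    profiles: Iterable[Profile],
    root_simhashes: List[int],
    threshold: int = SIMHASH_PROXIMITY_VALUE,
) -> Tuple[List[Profile], List[Profile]]:
    """Return (kept, removed) for the main influencer."""
    kept: List[Profile] = []
    removed: List[Profile] = []
    for profile in profiles:
        is_mention_root = profile.source_type == "mention"
        near_a_root = any(
            hamming_distance(profile.simhash, root) <= threshold
            for root in root_simhashes
        )
        # Removed only when both conditions of the rule fail.
        if is_mention_root or near_a_root:
            kept.append(profile)
        else:
            removed.append(profile)
    return kept, removed
```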

## Rules for rebuilding the remaining influencers

* If profiles are left that are not part of the original URL, and do not have a mention-derived profile, then they will be merged based on their simhash proximity using the SIMHASH\_PROXIMITY\_VALUE into a new influencer.
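One way to merge leftover profiles by proximity is single-linkage grouping with a union-find structure, sketched below. This is a swapped-in technique for illustration, not necessarily what the platform does: it assumes Hamming distance on integer simhashes and treats proximity as transitive, so two profiles end up together if a chain of close neighbors connects them. `group_leftovers` is a hypothetical name.

```python
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two simhashes differ."""
    return bin(a ^ b).count("1")


def group_leftovers(simhashes: List[int], threshold: int) -> List[List[int]]:
    """Group indices of leftover profiles whose simhashes are within threshold."""
    parent = list(range(len(simhashes)))

    def find(i: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair and union the ones that are close enough.
    for i in range(len(simhashes)):
        for j in range(i + 1, len(simhashes)):
            if hamming_distance(simhashes[i], simhashes[j]) <= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(simhashes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group would become one new influencer. The pairwise comparison is quadratic, which is fine for the handful of leftovers from a single influencer but would need an index for larger sets.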

## Rules for not allowing a discovered profile into the system

* The discovered profile is not a site profile (Twitter/Facebook), but references a site profile of an Alexa Top 1000 site (NYTimes, etc).
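A minimal sketch of this filter, assuming the check is made on the hostname of the discovered URL. The two host sets are tiny hypothetical stand-ins (the real lists of site-profile hosts and Alexa Top 1000 sites are not given here), and `should_reject` is a hypothetical name.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the real lists.
SITE_PROFILE_HOSTS = {"twitter.com", "facebook.com"}
TOP_SITE_HOSTS = {"nytimes.com"}


def normalized_host(url: str) -> str:
    """Lowercased hostname without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def should_reject(url: str) -> bool:
    """Reject a discovered URL that points at a top site but is not a site profile."""
    host = normalized_host(url)
    if host in SITE_PROFILE_HOSTS:
        return False  # real site profiles are allowed in
    return host in TOP_SITE_HOSTS
```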

