Overview of the Mention/Profile/Cluster Process
You can see the current generation of our platform at: https://audienti.com/.
In this picture, each "row" is a cluster. Under each cluster's name and description are the "profiles" that were identified as part of that cluster. More on that in a bit.
Application overview
The goal of our platform is to generate a list of prospects for a company that can be targeted through multiple channels (paid ads and organic) simultaneously. The platform tries to find people who are talking about, or interested in, particular topics right now.
The entity relationship diagram (ERD)/data structure for our system is basically this:
A keyword has many mentions.
A profile has many mentions.
A parent has many profiles. Each parent is a cluster.
A parent is of type Person or Company.
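To make these relationships concrete, here is a minimal sketch of the data structure as Python dataclasses. The class and field names (Keyword, Mention, Profile, Parent, ParentType, parent_id, profile_ids) are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class ParentType(Enum):
    PERSON = "person"
    COMPANY = "company"


@dataclass
class Mention:
    keyword: str     # the tracked phrase that matched
    profile_id: str  # author of the mention, e.g. "twitter:wef3"
    text: str


@dataclass
class Keyword:
    phrase: str
    mentions: List[Mention] = field(default_factory=list)  # keyword has many mentions


@dataclass
class Profile:
    id: str                                                # "<network>:<handle>"
    parent_id: Optional[int] = None                        # cluster, once assigned
    mentions: List[Mention] = field(default_factory=list)  # profile has many mentions


@dataclass
class Parent:
    id: int
    type: ParentType                                       # Person or Company
    profile_ids: Set[str] = field(default_factory=set)     # parent has many profiles
```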
Our customer adds keywords, which are phrases that would indicate someone they are targeting. For example, if they are targeting a pregnant woman, they might use the keyword "having a baby". A "mention" is an occurrence of this keyword on the internet: in social networks (Twitter, Facebook, etc.) or in blogs, forums, and news sites. Each mention has a source, the author of the mention. This mention author is a social network profile, also known as a profile.
For example, if we are tracking the keyword "marketing tools", and the Twitter account "wef3" says "marketing tools are great", then there will be one mention in our database and one profile, with the id twitter:wef3.
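The example above can be sketched in a few lines. This is only an illustration: the record_mention helper and the in-memory dict/list stores are assumptions, standing in for whatever the real database layer does.

```python
from typing import Dict, List


def profile_id(network: str, handle: str) -> str:
    """Build a profile id in the "<network>:<handle>" form used above."""
    return f"{network}:{handle}"


def record_mention(mentions: List[dict], profiles: Dict[str, dict],
                   keyword: str, network: str, handle: str, text: str) -> str:
    """Store one mention and make sure its author exists as a profile."""
    pid = profile_id(network, handle)
    # Create the profile only the first time this author is seen.
    profiles.setdefault(pid, {"id": pid, "enriched_at": None})
    mentions.append({"keyword": keyword, "profile_id": pid, "text": text})
    return pid


mentions: List[dict] = []
profiles: Dict[str, dict] = {}
record_mention(mentions, profiles, "marketing tools", "twitter", "wef3",
               "marketing tools are great")
```

After this call there is one mention and one profile. A second mention by the same account would add a mention but reuse the existing profile.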
By default, most social networks don't give you the full details of a user. To get them, you have to make a separate query. So, for each profile that is generated, we run a process called "enrichment" that retrieves this information, formats it in a standard way, and saves it into the profile record (here, "twitter:wef3"). Our system runs in a batch mode, updating these records every 30 days to keep information current.
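As a rough sketch of the 30-day refresh rule, the batch job could select profiles like this. The names REFRESH_INTERVAL, is_due, and due_profiles are hypothetical, and the sketch assumes each profile stores the time of its last enrichment (None if it has never been enriched).

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

REFRESH_INTERVAL = timedelta(days=30)  # from the 30-day cycle described above


def is_due(enriched_at: Optional[datetime], now: datetime) -> bool:
    """A profile is due if it was never enriched or its data is 30+ days old."""
    return enriched_at is None or now - enriched_at >= REFRESH_INTERVAL


def due_profiles(last_enriched: Dict[str, Optional[datetime]],
                 now: datetime) -> List[str]:
    """Return the ids of profiles the next batch run should enrich."""
    return [pid for pid, ts in last_enriched.items() if is_due(ts, now)]


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
last_enriched = {
    "twitter:wef3": None,                       # new profile, never enriched
    "twitter:old": now - timedelta(days=31),    # stale
    "twitter:fresh": now - timedelta(days=5),   # still current
}
batch = due_profiles(last_enriched, now)
```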
During the enrichment process, one social profile often mentions another social profile. So, our system identifies these references and adds each one as a new profile as well (which then goes through enrichment). We have a configuration for how deep we will go down this tree. We also store the fact that one profile references another profile. This reference is stored in the parent profile record.
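The depth-limited discovery described above can be sketched as a breadth-first walk. This is an assumption about the traversal order, not a description of our code; discover_profiles, get_references, and max_depth are hypothetical names, and get_references stands in for whatever enrichment returns for a profile.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple


def discover_profiles(seed: str,
                      get_references: Callable[[str], Iterable[str]],
                      max_depth: int) -> Tuple[Set[str], List[Tuple[str, str]]]:
    """Follow profile-to-profile references up to max_depth levels from seed.

    Returns the set of profile ids seen and the (referencing, referenced) pairs.
    """
    seen = {seed}
    edges: List[Tuple[str, str]] = []
    queue = deque([(seed, 0)])
    while queue:
        pid, depth = queue.popleft()
        if depth >= max_depth:
            continue  # configured depth reached: do not expand this profile
        for ref in get_references(pid):
            edges.append((pid, ref))  # remember that pid references ref
            if ref not in seen:
                seen.add(ref)
                queue.append((ref, depth + 1))
    return seen, edges


# Toy reference graph standing in for real enrichment results.
graph = {
    "twitter:wef3": ["linkedin:wef3", "twitter:acme"],
    "twitter:acme": ["facebook:acme"],
    "facebook:acme": ["twitter:deep"],
}
found, links = discover_profiles("twitter:wef3", lambda p: graph.get(p, []), 2)
```

With a depth of 2, "twitter:deep" is not reached, because it is three references away from the seed.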
Every time a new profile is added (either by a new mention or by identification during enrichment), it has to be associated with a Person or a Company. This is the clustering algorithm I'm looking for help with. Essentially, a "parent" (person/company) is really just a collection of profiles. But having them collected allows us to do things like run Facebook ads for people who have mentioned "having a baby" on Twitter. This cross-channel view of actions and activity is a unique value proposition in the market.
The goals for the process/algorithm are:
To produce a person that looks like a person, and a company that looks like a company. I want to err on the side of clean data.
If the profile matches an existing Person/Company, we want to add the profile to that existing person or company.
If it does not match an existing Person/Company, we will create a new person or company with one profile (the new one).
As profiles are re-enriched, we will determine that some are invalid. It should be possible to remove them without a big recalculation.
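To illustrate the shape of these goals (not to propose the final algorithm, which is the open question), here is a minimal sketch. Everything in it is an assumption: the Clusters class, the idea of comparing profiles by sets of normalized feature strings, the MATCH_THRESHOLD value, and the premise that the caller already knows whether the profile looks like a person or a company.

```python
from typing import Dict, Iterable, Set

MATCH_THRESHOLD = 2  # hypothetical: shared features required before merging


class Clusters:
    def __init__(self) -> None:
        self.members: Dict[int, Set[str]] = {}   # parent id -> profile ids
        self.kind: Dict[int, str] = {}           # parent id -> "person"/"company"
        self.features: Dict[str, Set[str]] = {}  # profile id -> feature strings
        self.parent_of: Dict[str, int] = {}      # profile id -> parent id
        self._next_id = 1

    def _shared(self, feats: Set[str], parent_id: int) -> int:
        """Count features the new profile shares with any member of a parent."""
        parent_feats: Set[str] = set()
        for pid in self.members[parent_id]:
            parent_feats |= self.features[pid]
        return len(feats & parent_feats)

    def assign(self, profile_id: str, kind: str, feats: Iterable[str]) -> int:
        """Add the profile to the best matching parent, or create a new one."""
        feats = set(feats)
        best, best_score = None, 0
        for parent_id in self.members:
            if self.kind[parent_id] != kind:
                continue  # never mix persons and companies
            score = self._shared(feats, parent_id)
            if score > best_score:
                best, best_score = parent_id, score
        if best is None or best_score < MATCH_THRESHOLD:
            # Weak evidence: prefer a new single-profile parent (clean data).
            best = self._next_id
            self._next_id += 1
            self.members[best] = set()
            self.kind[best] = kind
        self.members[best].add(profile_id)
        self.features[profile_id] = feats
        self.parent_of[profile_id] = best
        return best

    def remove(self, profile_id: str) -> None:
        """Drop an invalid profile; only its own parent is touched."""
        parent_id = self.parent_of.pop(profile_id)
        del self.features[profile_id]
        self.members[parent_id].discard(profile_id)
        if not self.members[parent_id]:
            del self.members[parent_id]
            del self.kind[parent_id]
```

A high threshold trades recall for clean clusters: a profile with weak evidence becomes its own parent rather than polluting an existing one. One limitation of this sketch is that removing a profile does not split a parent whose remaining members were only linked through the removed profile.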
The technical/architectural desirables would be:
We should not need to recalculate the entire list of persons or companies, or rerun something like K-means, when we want to associate a single new profile. Our system is adding data all the time. We would prefer to associate each profile dynamically as it arrives, without having to do a big batch run.
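One possible way to meet this requirement is an inverted index from features to parents, so that a new profile is compared only against parents it shares something with, and adding or removing a profile touches only its own entries. This is a sketch under assumptions, not a recommendation: FeatureIndex and the feature strings are hypothetical, and which profile attributes make good features is left open.

```python
from collections import Counter, defaultdict
from typing import Iterable


class FeatureIndex:
    def __init__(self) -> None:
        # feature -> Counter{parent id: number of member profiles carrying it}
        self.by_feature = defaultdict(Counter)

    def add(self, parent_id: int, feats: Iterable[str]) -> None:
        """Index one profile's features under its parent."""
        for f in feats:
            self.by_feature[f][parent_id] += 1

    def remove(self, parent_id: int, feats: Iterable[str]) -> None:
        """Undo add() for one profile, pruning entries that become empty."""
        for f in feats:
            counts = self.by_feature[f]
            counts[parent_id] -= 1
            if counts[parent_id] <= 0:
                del counts[parent_id]
            if not counts:
                del self.by_feature[f]

    def candidates(self, feats: Iterable[str]) -> Counter:
        """Return Counter{parent id: how many of feats that parent contains}."""
        shared = Counter()
        for f in feats:
            for parent_id in self.by_feature.get(f, ()):
                shared[parent_id] += 1
        return shared


idx = FeatureIndex()
idx.add(1, {"site:a.example", "name:ann"})
idx.add(2, {"name:ann"})
hits = idx.candidates({"site:a.example", "name:ann", "city:x"})
```

The work for one new profile depends on how many features it has and how many parents share them, not on the total number of persons and companies in the system.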