Adding a New Service

Our system performs clustering using the features from the Profile.

Profile Features

name - This is the name of the individual or company
- Transformations
  - Removal of extra attributes such as suffixes, reducing the name to first name and last name, in that order. For example, "John Smith III" becomes "John Smith".
  - Removal of .com/.net and other domain name tails. "audienti.com" becomes "audienti".
  - Removal of common company/corporate endings. "Juvomaster LLC" becomes "Juvomaster".
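The three name transformations above can be sketched in Python as follows. This is a minimal illustration, not the production normalizer: the function name `normalize_name` and the suffix, company-ending, and domain-tail lists are assumptions, since the lists actually used are not given here.

```python
import re

# Hypothetical lists for illustration only; the real lists are not documented here.
PERSON_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
COMPANY_ENDINGS = {"llc", "inc", "ltd", "corp", "gmbh"}
TRAILING_NOISE = PERSON_SUFFIXES | COMPANY_ENDINGS
DOMAIN_TAIL = re.compile(r"\.(com|net|org|io)$", re.IGNORECASE)


def normalize_name(raw: str) -> str:
    """Apply the three name transformations to a raw profile name."""
    name = raw.strip()
    # Drop a trailing domain tail: "audienti.com" -> "audienti".
    name = DOMAIN_TAIL.sub("", name)
    # Split into words and strip punctuation such as the comma in "Acme, Inc.".
    tokens = [t.strip(",.") for t in name.split()]
    tokens = [t for t in tokens if t]
    # Drop trailing personal suffixes and company endings, keeping at least one word.
    while len(tokens) > 1 and tokens[-1].lower() in TRAILING_NOISE:
        tokens.pop()
    return " ".join(tokens)
```

With these assumed lists, the three examples from the list above come out as "John Smith", "audienti", and "Juvomaster".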
description - This is a description that is provided by the profile.
- The description, unlike the name, is not consistent across profiles, because it is written for the context of each service. Simply comparing it as a single literal value won't work.
- Transformations
  - Stop word removal
  - Stemming of the words
  - Lowercasing of the words
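A rough Python sketch of the description transformations follows. The stop-word list and the suffix-stripping `naive_stem` function are deliberately tiny stand-ins chosen so the example is self-contained. They are assumptions, not the stop list or stemmer used in our system, and a real implementation would more likely use an established stemmer such as Porter.

```python
import re

# Tiny illustrative stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to",
              "is", "are", "we", "our", "with"}

# Suffixes tried in order by the naive stemmer below (assumption).
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def description_tokens(text: str) -> list:
    """Lowercase the text, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


print(description_tokens("We are building tools for marketers"))
# ['build', 'tool', 'market']
```

The result is a list of normalized tokens rather than one string, which makes descriptions from different services comparable word by word.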
location - the stated location of the profile
- Some profiles will have this, some will not.
- We also carry versions of this in the form of the fields: country, territory (state), city, address.
- Transformations
  - This should be transformed into a latitude and longitude.
  - The approximate distance between two points should be the criterion used for proximity.
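One common way to approximate the distance between two latitude/longitude points is the haversine formula, sketched below. The function names, the spherical-Earth radius of 6371 km, and the 50 km default threshold are assumptions for illustration. The geocoding step that turns a stated location into coordinates is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treats the Earth as a sphere


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Approximate great-circle distance in kilometers between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_nearby(a: tuple, b: tuple, max_km: float = 50.0) -> bool:
    """Treat two (lat, lon) pairs as proximate if within max_km (hypothetical threshold)."""
    return haversine_km(a[0], a[1], b[0], b[1]) <= max_km
```

Profiles with no location would simply skip this check rather than count as a mismatch.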
references - Other profiles listed in the profile.
- This is a common way we cluster. When a Twitter profile, for example, mentions an Instagram profile, it creates a relationship between the two. The reference is "validated" only if it is bidirectional; a one-way reference is not validated. References can also "loop".
- Transformation
  - Convert to a standard profile_id
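The conversion to a standard profile_id and the bidirectional "validated" check can be sketched as below. The `service:handle` id format and both function names are hypothetical, since the actual profile_id scheme is not described here.

```python
def to_profile_id(service: str, handle: str) -> str:
    """Build a standard id such as 'twitter:jsmith' (hypothetical format)."""
    return f"{service.strip().lower()}:{handle.strip().lstrip('@').lower()}"


def validated_pairs(references: dict) -> set:
    """Return the pairs of profile ids that reference each other.

    `references` maps a profile_id to the set of profile_ids it mentions.
    A pair counts as validated only when the reference goes both ways.
    """
    validated = set()
    for source, targets in references.items():
        for target in targets:
            if target == source:
                continue  # ignore a profile that references itself
            if source in references.get(target, set()):
                validated.add(frozenset((source, target)))
    return validated


refs = {
    "twitter:jsmith": {"instagram:jsmith"},
    "instagram:jsmith": {"twitter:jsmith", "facebook:jsmith"},
    "facebook:jsmith": set(),
}
print(validated_pairs(refs) == {frozenset({"twitter:jsmith", "instagram:jsmith"})})  # True
```

Storing each pair as a `frozenset` means a two-way reference is recorded once, regardless of which side it was found from.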
gender - the gender of the profile
- If a profile is for a person, then our current system uses a gender detector we have written to try to identify the gender from the name. If this works, we mark the profile with this gender.
- Transformation
  - We use our gender identifier to do this. It contains roughly the 10k most common names, by gender, and we score/match the profile name against them.
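A name-based lookup of this kind might look like the sketch below. The counts in `NAME_COUNTS` are made-up placeholders, not data from our gender identifier, and the scoring rule (share of the most frequent gender, with a `min_confidence` cutoff) is an assumption about how "score/match" could work.

```python
from typing import Optional

# Placeholder counts for illustration only; not real name statistics.
NAME_COUNTS = {
    "john": {"male": 990, "female": 10},
    "mary": {"male": 5, "female": 995},
    "jordan": {"male": 600, "female": 400},
}


def guess_gender(full_name: str, min_confidence: float = 0.9) -> Optional[str]:
    """Guess a gender from the first name, or return None when unsure."""
    parts = full_name.split()
    if not parts:
        return None
    counts = NAME_COUNTS.get(parts[0].lower())
    if not counts:
        return None  # first name is not in the table
    best = max(counts, key=counts.get)
    share = counts[best] / sum(counts.values())
    # Only mark the profile when the name is strongly associated with one gender.
    return best if share >= min_confidence else None
```

Returning `None` for unknown or ambiguous names keeps the profile unmarked, which matches the "if this works" behavior described above.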
image_url - a picture of the profile
- If the profile is a person, in theory we could try to do facial recognition. But right now, this field is not used in any way.
lang - the language of the profile
- This can be used to validate a match, but is not distinctive enough to cluster on.
Other attributes that could be used are: follower counts, friend counts, like counts, share counts.

Existing clustering algorithm

The current version of the application already performs clustering.
The algorithm first does a "rough" clustering using the name as a single feature (with the modifications above).
Once this is done, a second classification pass separates "people" profiles from "company" profiles, and then performs a secondary classification within each group to create a person or a company.
Note that while I expected the algorithm to use the references between profiles, this does not seem to have made it into production.
Net: our current algorithm is VERY simplistic. Too simplistic to work.
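To make the two passes concrete, here is a minimal Python sketch of name-only clustering followed by a person/company split. It is an illustration of the idea, not the production code: the `name` and `kind` keys and the function name are hypothetical, and how a profile is actually classified as a person or a company is not covered here.

```python
from collections import defaultdict


def two_pass_cluster(profiles: list) -> dict:
    """Group profiles by name, then split each group into people and companies.

    Each profile is a dict with a 'name' and a 'kind' key, where 'kind' is
    'person' or 'company' (hypothetical fields for this sketch).
    """
    # Pass 1: rough clustering with the name as the only feature.
    rough = defaultdict(list)
    for profile in profiles:
        rough[profile["name"].strip().lower()].append(profile)

    # Pass 2: separate people from companies inside each rough cluster.
    result = {}
    for name, members in rough.items():
        split = {"person": [], "company": []}
        for profile in members:
            split[profile["kind"]].append(profile)
        result[name] = split
    return result
```

Because the first pass keys on the name alone, two unrelated profiles that share a name land in the same cluster, which is the weakness noted above.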


Last updated 7 years ago