Adding a New Service
Our system performs clustering using the features from the Profile.
Profile Features
name - This is the name of the individual or company
Transformations
Removal of other attributes: the name is simplified to first name and last name, in that order. So, "John Smith III" becomes "John Smith".
Removal of .com/.net and other domain name tails. "audienti.com" becomes "audienti".
Removal of common company/corporate endings. "Juvomaster LLC" becomes "Juvomaster".
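The three name transformations above could be sketched roughly as follows. This is a minimal illustration, not the production code: the function name normalize_name and the contents of the three lists are assumptions, and reordering "Smith, John" into first-name-last-name order is not handled here.

```python
# Assumed, non-exhaustive lists; the real lists are not given in this document.
DOMAIN_TAILS = (".com", ".net", ".org", ".io")
COMPANY_ENDINGS = {"llc", "inc", "ltd", "corp", "gmbh"}
NAME_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_name(name: str) -> str:
    """Strip domain tails, company endings, and personal-name suffixes."""
    cleaned = name.strip()

    # "audienti.com" -> "audienti"
    lowered = cleaned.lower()
    for tail in DOMAIN_TAILS:
        if lowered.endswith(tail):
            cleaned = cleaned[: -len(tail)]
            break

    # Drop trailing tokens such as "LLC", "Inc." or "III",
    # but always keep at least one token.
    droppable = COMPANY_ENDINGS | NAME_SUFFIXES
    tokens = cleaned.replace(",", " ").split()
    while len(tokens) > 1 and tokens[-1].lower().rstrip(".") in droppable:
        tokens.pop()

    return " ".join(tokens)
```

For example, normalize_name("Juvomaster, Inc.") and normalize_name("Juvomaster LLC") both reduce to "Juvomaster", so the two profiles end up with the same name feature.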
description - This is a description that is provided by the profile.
The description, unlike the name, is not consistent across profiles, because it is contextualized to each service. Simply treating it as a single word to match on won't work.
Transformations
Stop word removal
Stemming of the words
Downcasing of words
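As a rough sketch, the description transformations could be chained like this. The stop word list is a tiny assumed sample, and crude_stem is a deliberately naive suffix stripper standing in for a real stemmer (for example a Porter stemmer), so its output will differ from what a proper stemming library produces.

```python
import re

# Assumed sample only; a real stop word list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to",
              "is", "are", "we", "our", "with"}

# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def description_tokens(description: str) -> list[str]:
    """Downcase, tokenize, remove stop words, then stem each word."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return [crude_stem(word) for word in words if word not in STOP_WORDS]
```

The result is a list of tokens rather than a single string, which allows two descriptions written for different services to be compared by the words they share.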
location - the stated location of the profile
Some profiles will have this, some will not.
We also carry versions of this in the form of the fields: country, territory (state), city, address.
Transformation
This should be transformed into a longitude and latitude.
The approximate distance between two points should be the criterion used for proximity.
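One common way to approximate the distance between two latitude/longitude points is the haversine (great-circle) formula, sketched below. The document does not say which distance measure is intended, so this is only one option; the is_nearby helper and its 50 km threshold are assumptions for illustration. Geocoding the stated location into coordinates is a separate step and is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_nearby(a: tuple[float, float], b: tuple[float, float],
              max_km: float = 50.0) -> bool:
    """Treat two (lat, lon) pairs as close if within max_km (assumed threshold)."""
    return haversine_km(a[0], a[1], b[0], b[1]) <= max_km
```

Profiles with no location would simply skip this check, since some profiles will not have the field.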
references - Listings of other profiles in the profile.
This is a common way we cluster. When a Twitter profile, for example, mentions an Instagram profile, it creates a relationship between the two. The reference is "validated" if it is bidirectional, and not validated otherwise. References can also "loop".
Transformation
Convert to a standard profile_id
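The "validated" check described above could be sketched as follows, assuming references have already been converted to standard profile IDs. The ID format ("twitter:1") and the function name validated_references are hypothetical; the document does not specify either.

```python
def validated_references(references: dict[str, set[str]]) -> set[frozenset[str]]:
    """Return the pairs of profile IDs that reference each other.

    `references` maps a profile_id to the set of profile_ids it mentions.
    Each validated pair is returned once, regardless of direction.
    """
    validated = set()
    for source, targets in references.items():
        for target in targets:
            # Skip self-references; require the link in both directions.
            if source != target and source in references.get(target, set()):
                validated.add(frozenset((source, target)))
    return validated
```

With references = {"twitter:1": {"instagram:9", "facebook:3"}, "instagram:9": {"twitter:1"}}, only the Twitter/Instagram pair is validated, because the Facebook profile does not point back.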
gender - the gender of the profile
If a profile is for a person, then our current system uses a gender detector we have written to try to identify the gender from the name. If this works, we mark the profile with this gender.
Transformation
We use our gender identifier to do this. It has the most common 10k or so names in it by gender, and we score/match them up.
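A name-based lookup of this kind might look roughly like the sketch below. The table contents, the counts, and the 0.9 confidence threshold are made-up placeholders for illustration, not data or settings from our gender identifier.

```python
from typing import Optional

# Placeholder table: first name -> made-up occurrence counts by gender.
# The real identifier is described as holding roughly 10k common names.
NAME_COUNTS = {
    "john": {"male": 990, "female": 10},
    "mary": {"male": 5, "female": 995},
    "taylor": {"male": 450, "female": 550},
}


def detect_gender(full_name: str, min_confidence: float = 0.9) -> Optional[str]:
    """Return the likely gender for a name, or None if unknown or ambiguous."""
    parts = full_name.split()
    if not parts:
        return None
    counts = NAME_COUNTS.get(parts[0].lower())
    if not counts:
        return None
    gender, count = max(counts.items(), key=lambda item: item[1])
    # Only mark the profile when one gender clearly dominates.
    return gender if count / sum(counts.values()) >= min_confidence else None
```

Returning None for unknown or ambiguous names matches the behavior described above, where the profile is only marked with a gender if detection works.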
image_url - a picture of the profile
If the profile is a person, in theory we could try to do facial recognition. But right now, this field is not used in any way.
lang - the language of the profile
This can be used to validate, but is not unique enough to cluster with.
Other attributes that could be used are: follower counts, friend counts, like counts, share counts.
Existing clustering algorithm
In the current version of the application, we do clustering.
Our current clustering algorithm does a "rough" clustering by using the name as a single feature (with the modifications above).
Once this is done, a second classification is done. This classification then breaks apart "people" and "company" profiles, and then performs a secondary classification on these to create a person/company.
Note that while I expected/believed that the algorithm used the references between profiles, this does not seem to have been in production.
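To make the two stages concrete, here is a minimal sketch of what such an approach amounts to. It is an illustration under assumptions, not the production code: profiles are represented as dicts, the name is assumed to be already normalized, and each profile is assumed to carry a "kind" field ("person" or "company"), although how the real system decides that is not described here.

```python
from collections import defaultdict


def rough_clusters(profiles: list[dict]) -> dict[str, list[dict]]:
    """Stage 1: group profiles on the name as the single feature."""
    clusters = defaultdict(list)
    for profile in profiles:
        clusters[profile["name"].strip().lower()].append(profile)
    return dict(clusters)


def split_by_kind(cluster: list[dict]) -> dict[str, list[dict]]:
    """Stage 2: break one rough cluster into people and companies."""
    split = defaultdict(list)
    for profile in cluster:
        split[profile.get("kind", "unknown")].append(profile)
    return dict(split)
```

Because the name is the only grouping feature, two unrelated people who share a name land in the same cluster, and nothing in either stage can separate them.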
Net: Our current algorithm is VERY simplistic. Too simplistic to work.