Adding a New Service

Our system performs clustering using the features from the Profile.

Profile Features

  • name - This is the name of the individual or company

    • Transformations

      • Removal of other attributes, simplifying to the first name and last name in "firstname lastname" order. So, "John Smith III" becomes "John Smith".

      • Removal of .com/.net and other domain name tails. "audienti.com" becomes "audienti".

      • Removal of common company/corporate endings. "Juvomaster LLC" becomes "Juvomaster".

  • description - This is a description that is provided by the profile.

    • Unlike the name, the description is not consistent across profiles, because it is written for the context of each service. Simply treating it as a single word to match on won't work.

    • Transformations

      • Stop word removal

      • Stemming of the words

      • Downcasing of the words

  • location - the stated location of the profile

    • Some profiles will have this, some will not.

    • We also carry versions of this in the form of the fields: country, territory (state), city, address.

    • Transformation

      • This should be transformed into a latitude and longitude.

      • The approximate distance between two points should be the criterion used for proximity.

  • references - Other profiles that are listed in the profile.

    • This is a common way we cluster. When a Twitter profile, for example, mentions an Instagram profile, it creates a relationship between the two. The reference is "validated" if it is bidirectional, and not validated otherwise. References can also "loop".

    • Transformation

      • Convert to a standard profile_id

  • gender - the gender of the profile

    • If a profile is for a person, then our current system uses a gender detector we have written to try to identify the gender from the name. If this works, we mark the profile with this gender.

    • Transformation

      • We use our gender identifier to do this. It contains roughly the 10k most common names by gender, and we score and match profile names against it.

  • image_url - a picture of the profile

    • If the profile is a person, in theory we could try to do facial recognition. But right now, this field is not used in any way.

  • lang - the language of the profile

    • This can be used to validate, but is not unique enough to cluster with.

  • Other attributes that could be used are: follower counts, friend counts, like counts, share counts.
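
The name, location, and references transformations above can be sketched in Python as follows. This is a minimal sketch, not the production code. The function names, the suffix, company-ending, and domain-tail lists, the choice of the haversine formula for the approximate distance, and the representation of references as (source, target) pairs of profile_id strings are all assumptions made for illustration. The sketch does not try to reorder "lastname, firstname" inputs.

```python
import math
import re

# Assumed, non-exhaustive lists; the real system's lists are not documented here.
PERSON_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
COMPANY_ENDINGS = {"llc", "inc", "ltd", "corp", "co", "gmbh"}
TRAILING_WORDS = PERSON_SUFFIXES | COMPANY_ENDINGS
DOMAIN_TAIL = re.compile(r"\.(com|net|org|io)$", re.IGNORECASE)


def normalize_name(name: str) -> str:
    """Strip domain tails, corporate endings, and personal suffixes from a name."""
    name = DOMAIN_TAIL.sub("", name.strip())
    words = name.replace(",", " ").split()
    # Drop trailing words such as "LLC" or "III", but never the last remaining word.
    while len(words) > 1 and words[-1].lower().rstrip(".") in TRAILING_WORDS:
        words.pop()
    return " ".join(words)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Approximate great-circle distance in kilometres between two points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def validated_references(references):
    """Return the unordered profile_id pairs that reference each other both ways."""
    refs = set(references)
    return {frozenset((a, b)) for (a, b) in refs if a != b and (b, a) in refs}


# Hypothetical profile_id values, for illustration only.
example_refs = [
    ("twitter:jsmith", "instagram:jsmith"),
    ("instagram:jsmith", "twitter:jsmith"),
    ("twitter:jsmith", "facebook:other"),
]
validated = validated_references(example_refs)
```

Here only the Twitter/Instagram pair ends up in `validated`, because the Facebook profile does not point back. A proximity check would compare `haversine_km(...)` against a threshold, which this document does not specify.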

Existing clustering algorithm

  • In the current version of the application, we do clustering.

  • Our current clustering algorithm does a "rough" clustering by using the name as a single feature (with the modifications above).

  • Once this is done, a second classification pass is run. This pass separates "people" profiles from "company" profiles, and then performs a secondary classification on each group to create a person or company.

  • Note that while I expected/believed that the algorithm used the references between profiles, this does not seem to have made it into production.

  • Net: Our current algorithm is VERY simplistic. Too simplistic to work.
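
The two-stage process described above might look roughly like the following. This is a hypothetical sketch, not the production algorithm. The `Profile` dataclass, its `kind` field, and the use of the lowercased name as the cluster key are assumptions, and the names are assumed to have already been through the name transformations listed earlier. The second-stage classifier is not described here, so the sketch only splits each rough cluster by a pre-assigned kind.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    profile_id: str  # hypothetical identifier, e.g. "twitter:jsmith"
    name: str        # assumed to be already normalized
    kind: str        # assumed label: "person" or "company"


def rough_clusters(profiles):
    """Stage 1: group profiles using the name as the single feature."""
    clusters = defaultdict(list)
    for profile in profiles:
        clusters[profile.name.lower()].append(profile)
    return dict(clusters)


def split_by_kind(clusters):
    """Stage 2 (simplified): separate people from companies in each rough cluster."""
    entities = {}
    for key, members in clusters.items():
        for profile in members:
            entities.setdefault((key, profile.kind), []).append(profile.profile_id)
    return entities


profiles = [
    Profile("twitter:jsmith", "John Smith", "person"),
    Profile("instagram:john.smith", "john smith", "person"),
    Profile("twitter:juvomaster", "Juvomaster", "company"),
]
entities = split_by_kind(rough_clusters(profiles))
```

Because the name is the only feature, two unrelated people who are both called John Smith would land in the same entity, which is one way to see why this approach is too simplistic on its own.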
