EngineeringWiki

Adding a New Service

Our system performs clustering using the features from the Profile.

Profile Features

  • name - This is the name of the individual or company

    • Transformations

      • Removal of other attributes, simplifying the name to first name and last name in firstname lastname order. For example, "John Smith III" becomes "John Smith".

      • Removal of .com/.net and other domain name tails. "audienti.com" becomes "audienti".

      • Removal of common company/corporate endings. "Juvomaster LLC" becomes "Juvomaster".
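
The three name transformations above can be sketched roughly as follows. This is a minimal illustration, not the production code: the function name `normalize_name` and the lists of suffixes, domain tails, and company endings are assumptions made for the example, and reordering of "Lastname, Firstname" input is not handled.

```python
# Hypothetical lists for illustration; the real system's lists are not documented here.
DOMAIN_TAILS = (".com", ".net", ".org")
NAME_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
COMPANY_ENDINGS = {"llc", "inc", "ltd", "corp", "co"}
DROPPABLE = NAME_SUFFIXES | COMPANY_ENDINGS


def normalize_name(raw: str) -> str:
    """Simplify a profile name using the three transformations listed above."""
    name = raw.strip()

    # Remove a domain-name tail: "audienti.com" -> "audienti".
    lowered = name.lower()
    for tail in DOMAIN_TAILS:
        if lowered.endswith(tail):
            name = name[: -len(tail)]
            break

    # Drop trailing personal suffixes and corporate endings,
    # but never drop the only remaining token.
    tokens = name.split()
    while len(tokens) > 1 and tokens[-1].strip(".,").lower() in DROPPABLE:
        tokens.pop()

    # Clean up a dangling comma left by inputs such as "Juvomaster, LLC".
    return " ".join(tokens).rstrip(",")


if __name__ == "__main__":
    for example in ("John Smith III", "audienti.com", "Juvomaster LLC"):
        print(example, "->", normalize_name(example))
```

The examples in the `__main__` block are the ones given in the list above.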

  • description - This is a description that is provided by the profile.

    • The description, unlike the name, is not consistent across profiles, because it's contextualized to each service. Simply treating it as a single word to match on won't work.

    • Transformations

      • Stop word removal

      • Stemming of the words

      • Downcasing of the words
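
As a rough sketch of the description transformations listed above, the snippet below downcases the text, removes stop words, and stems what remains. The stop word list and the `crude_stem` suffix stripper are assumptions made for illustration. The suffix stripper is a deliberately naive stand-in for a real stemmer (for example a Porter or Snowball stemmer), which the production system may or may not use.

```python
import re

# Hypothetical, very small stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "is", "at", "we", "i"}

# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def description_terms(description: str) -> list[str]:
    """Downcase, tokenize, remove stop words, then stem each remaining word."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return [crude_stem(word) for word in words if word not in STOP_WORDS]


if __name__ == "__main__":
    print(description_terms("Building the tools for marketers"))
```

The resulting term lists from two profiles can then be compared as sets or bags of words rather than as literal strings.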

  • location - the stated location of the profile

    • Some profiles will have this, some will not.

    • We also carry versions of this in the form of the fields: country, territory (state), city, address.

    • Transformation

      • This should be transformed into a latitude and longitude.

      • The approximate distance between two points should be the criterion used for proximity.
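
One common way to get an approximate distance between two latitude/longitude points is the haversine formula, sketched below. This is an assumption about how proximity could be computed, not a description of existing code. The 50 km threshold in `is_nearby` is an arbitrary illustrative value, and the geocoding step that turns a stated location into coordinates is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_nearby(a: tuple[float, float], b: tuple[float, float], max_km: float = 50.0) -> bool:
    """Treat two (lat, lon) points as close when within max_km (hypothetical threshold)."""
    return haversine_km(a[0], a[1], b[0], b[1]) <= max_km


if __name__ == "__main__":
    new_york = (40.7128, -74.0060)
    los_angeles = (34.0522, -118.2437)
    print(round(haversine_km(*new_york, *los_angeles)), "km")
    print(is_nearby(new_york, los_angeles))
```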

  • references - Listings of other profiles that appear in the profile.

    • This is a common way we cluster. When a Twitter profile, for example, mentions an Instagram profile, it creates a relationship between the two. The reference is "validated" if it's bidirectional; otherwise it is not validated. References can also "loop".

    • Transformation

      • Convert to a standard profile_id
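
The "validated" rule for references can be illustrated with a small sketch. It assumes references have already been converted to standard profile IDs and are held as a mapping from each profile ID to the set of IDs it mentions. The `service:handle` ID format and the function name `validated_pairs` are hypothetical and used only for this example.

```python
def validated_pairs(references: dict[str, set[str]]) -> set[frozenset[str]]:
    """Return the pairs of profiles that reference each other (bidirectional)."""
    pairs = set()
    for source, targets in references.items():
        for target in targets:
            # A reference is validated only when the target points back at the source.
            if source != target and source in references.get(target, set()):
                pairs.add(frozenset((source, target)))
    return pairs


if __name__ == "__main__":
    refs = {
        "twitter:acme": {"instagram:acme"},
        "instagram:acme": {"twitter:acme", "facebook:acme"},
        "facebook:acme": set(),  # never points back, so that reference is not validated
    }
    for pair in validated_pairs(refs):
        print(sorted(pair))
```

In this sketch a loop such as A to B to C to A produces no validated pairs, because no single reference in it is returned directly.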

  • gender - the gender of the profile

    • If a profile is for a person, then our current system uses a gender detector we have written to try to identify the gender from the name. If this works, we mark the profile with this gender.

    • Transformation

      • We use our gender identifier to do this. It has the most common 10k or so names in it by gender, and we score/match them up.

  • image_url - a picture of the profile

    • If the profile is a person, in theory we could try to do facial recognition. But right now, this field is not used in any way.

  • lang - the language of the profile

    • This can be used to validate, but is not unique enough to cluster with.

  • Other attributes that could be used are: follower counts, friend counts, like counts, share counts.

Existing clustering algorithm

  • In the current version of the application, we do clustering.

  • Our current clustering algorithm does a "rough" clustering by using the name as a single feature (with the modifications above).

  • Once this is done, a second classification breaks apart "people" and "company" profiles, and then performs a secondary classification on each group to create a person/company.

  • Note that while I expected/believed that the algorithm used the references between profiles, this does not seem to have been in production.

  • Net: Our current algorithm is VERY simplistic. Too simplistic to work.
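
To make the two passes described above concrete, here is a minimal sketch of name-only clustering followed by a person/company split. It is an illustration under assumptions, not the production algorithm: the profile dictionaries, the `kind` field, and the default normalizer are hypothetical, and how the real system tells people from companies is not described on this page.

```python
from collections import defaultdict
from typing import Callable


def rough_clusters(
    profiles: list[dict],
    normalize: Callable[[str], str] = lambda name: name.strip().lower(),
) -> dict[str, list[dict]]:
    """First pass: group profiles by normalized name, the single feature used."""
    clusters = defaultdict(list)
    for profile in profiles:
        clusters[normalize(profile["name"])].append(profile)
    return dict(clusters)


def split_by_kind(members: list[dict]) -> dict[str, list[dict]]:
    """Second pass: separate a rough cluster into person and company profiles."""
    kinds = defaultdict(list)
    for profile in members:
        kinds[profile.get("kind", "unknown")].append(profile)
    return dict(kinds)


if __name__ == "__main__":
    profiles = [
        {"id": "twitter:1", "name": "John Smith", "kind": "person"},
        {"id": "instagram:2", "name": "john smith ", "kind": "person"},
        {"id": "linkedin:3", "name": "John Smith", "kind": "company"},
        {"id": "twitter:4", "name": "Audienti", "kind": "company"},
    ]
    for name, members in rough_clusters(profiles).items():
        print(name, {kind: len(group) for kind, group in split_by_kind(members).items()})
```

The sample data also shows the weakness noted above: two unrelated people named John Smith would land in the same cluster, since nothing but the name is compared.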

Last updated 7 years ago