Mention Pipeline
The mention pipeline is the primary data generator for the system; its actions trigger other systems and pipelines to run.
The goal of the mention pipeline is to retrieve, once a day, a list of "mentions" of a keyword. This is done in a map/reduce-style flow with 4 stages: Queueing, Retrieving, Converting, and Segmenting.
Queueing (.queue__mention__retrievals on Wordsmaster)
Each keyword should be called every hour, invoking its .queue__mention__retrievals method. This simply calls the ActivityChecks::WordsmasterMentions class, which checks whether the Wordsmaster needs to retrieve. How does a Wordsmaster know if it needs to retrieve? It checks its Activity for a log item stating that it has been queued in the last 24 hours. If there is one, it does NOT need to be queued. If there is none, it queues the item (and writes to the activity log).
Each Wordsmaster should be called every hour by an hourly Job.
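The 24-hour check described above can be sketched roughly as follows. This is a minimal illustration, not the actual ActivityChecks::WordsmasterMentions code: the ActivityLog class, the queue_mention_retrieval helper, and the :mention_retrieval_queued event name are all hypothetical stand-ins, and a plain array stands in for the job queue.

```ruby
# Hypothetical stand-in for a Wordsmaster's Activity log (the real model is not shown here).
class ActivityLog
  def initialize
    @entries = []
  end

  def record(event, at)
    @entries << { event: event, at: at }
  end

  # True if an entry for `event` was recorded at or after `since`.
  def any_since?(event, since)
    @entries.any? { |entry| entry[:event] == event && entry[:at] >= since }
  end
end

QUEUE_WINDOW_SECONDS = 24 * 60 * 60 # "queued in the last 24 hours"

# Returns true if a retrieval was queued on this call, false if it was skipped.
def queue_mention_retrieval(activity, queue, keyword, now: Time.now)
  return false if activity.any_since?(:mention_retrieval_queued, now - QUEUE_WINDOW_SECONDS)

  queue << keyword                                # stands in for enqueueing the retrieve job
  activity.record(:mention_retrieval_queued, now) # write to the activity log
  true
end

# Simulate three hourly-job calls at different times.
activity = ActivityLog.new
queue = []
start = Time.at(0)
first  = queue_mention_retrieval(activity, queue, "acme", now: start)             # queued
second = queue_mention_retrieval(activity, queue, "acme", now: start + 3600)      # skipped
third  = queue_mention_retrieval(activity, queue, "acme", now: start + 25 * 3600) # queued again
```

Because the check is idempotent within the window, the hourly job can call every Wordsmaster without worrying about double-queueing.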
Retrieving (Pipeline::Mention::RetrieveJob)
Each Wordsmaster is retrieved by the retrieve job, which uses Bax, our HTML and proxy processing library. The entire job of the retrieve_job is to retrieve the raw mentions from the source. This is done by calling the Retriever, which (currently) uses the following syntax in the job.
Retriever is a universal "wrapper" for a number of different retrievers, providing a consistent front-end interface for them. The :from in the config is the name of the retrieval source. As of right now, we support the following for mention retrievals:
These return the raw retriever items for each source. Note that each retriever is able to produce mentions itself. However, we process and write each mention in a separate job, because each write can take several seconds to retrieve and resolve all the information from the mention's page (for example, a web page).
Each mention is queued to the "convert_job" and then the job is finished.
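As a rough sketch of this stage, the snippet below shows a wrapper that dispatches on :from and a retrieve step that fans each raw item out to its own convert job. The names RetrieverSketch, ExampleSourceRetriever, run_retrieve_job, and the :example source are assumptions made for illustration; the real Retriever API and the supported sources are not reproduced here, and an array stands in for the convert_job queue.

```ruby
# Hypothetical source-specific retriever; a real one would fetch pages through Bax.
class ExampleSourceRetriever
  def retrieve(keyword)
    [
      { source: :example, keyword: keyword, url: "https://example.com/1" },
      { source: :example, keyword: keyword, url: "https://example.com/2" }
    ]
  end
end

# Wrapper that picks a retriever based on the :from value in the config.
class RetrieverSketch
  SOURCES = { example: ExampleSourceRetriever }.freeze

  def initialize(config)
    # fetch raises KeyError for a missing :from or an unsupported source.
    @impl = SOURCES.fetch(config.fetch(:from)).new
  end

  def retrieve(keyword)
    @impl.retrieve(keyword)
  end
end

# The retrieve step: get raw items, enqueue one convert job per item, then finish.
def run_retrieve_job(keyword, config, convert_queue)
  raw_items = RetrieverSketch.new(config).retrieve(keyword)
  raw_items.each { |item| convert_queue << item }
  raw_items.size
end

convert_queue = []
queued = run_retrieve_job("acme", { from: :example }, convert_queue)
```

Keeping the retrieve step this thin means a slow page lookup for one mention only delays that mention's convert job, not the whole batch.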
Converting (Convert Job)
The convert job takes the raw mention, converts it to our standard format, retrieves additional information related to the mention, and writes it to our database. Each ConvertJob handles exactly 1 mention. The ConvertJob uses a .convert method on the Retriever itself for the conversion.
The conversion also updates a number of other tables, such as DailyWordsmasterCounts and other tables that store data about rates and trends in our mention pipeline.
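A minimal sketch of one convert job, under stated assumptions: the Mention struct, the raw item keys (:link, :headline), ExampleRetriever, and run_convert_job are hypothetical, and an array and a hash stand in for the mentions table and DailyWordsmasterCounts. The step that retrieves additional information from the mention's page is omitted.

```ruby
require "date"

# Hypothetical standard mention shape; the real schema is not given on this page.
Mention = Struct.new(:keyword, :url, :title)

# Hypothetical retriever exposing the .convert step for its own raw format.
class ExampleRetriever
  # Maps a source-specific raw item onto the standard shape.
  def convert(raw)
    Mention.new(raw.fetch(:keyword), raw.fetch(:link), raw[:headline].to_s.strip)
  end
end

# Handles exactly one raw item: convert, store, then bump the per-day counter.
def run_convert_job(retriever, raw, mentions, daily_counts, today)
  mention = retriever.convert(raw)
  mentions << mention
  daily_counts[[mention.keyword, today]] += 1
  mention
end

mentions = []
daily_counts = Hash.new(0) # stand-in for a daily count table keyed by [keyword, date]
day = Date.new(2020, 1, 1)
retriever = ExampleRetriever.new

run_convert_job(retriever, { keyword: "acme", link: "https://example.com/1", headline: "  Hello " }, mentions, daily_counts, day)
run_convert_job(retriever, { keyword: "acme", link: "https://example.com/2" }, mentions, daily_counts, day)
```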
Segmenting (SegmentJob)
The SegmentJob takes each mention and determines whether it matches a segment. If so, it marks the match in a MentionSegment table in Cassandra. This table is what is 'queried' for mention segments, so each Agent/workflow will only access a mention one time.
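The matching and marking step might look something like the sketch below. It assumes a segment is a name plus a predicate, and it uses an in-memory class in place of the Cassandra-backed MentionSegment table; Segment, MentionSegmentTable, and run_segment_job are illustrative names, not the real classes.

```ruby
require "set"

# Hypothetical segment: a name plus a predicate over the mention.
Segment = Struct.new(:name, :matcher) do
  def match?(mention)
    matcher.call(mention)
  end
end

# In-memory stand-in for the MentionSegment table.
class MentionSegmentTable
  def initialize
    @ids_by_segment = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Returns true only the first time a (segment, mention) pair is marked.
  def mark(segment_name, mention_id)
    !@ids_by_segment[segment_name].add?(mention_id).nil?
  end

  def mention_ids_for(segment_name)
    @ids_by_segment[segment_name].to_a
  end
end

# Marks every segment the mention matches and returns the matched segment names.
def run_segment_job(mention, segments, table)
  matched = segments.select { |segment| segment.match?(mention) }
  matched.each { |segment| table.mark(segment.name, mention[:id]) }
  matched.map(&:name)
end

segments = [
  Segment.new("pricing", ->(m) { m[:text].include?("price") }),
  Segment.new("hiring",  ->(m) { m[:text].include?("hiring") })
]
table = MentionSegmentTable.new
mention = { id: 1, text: "the price went up" }

matched_names = run_segment_job(mention, segments, table)
run_segment_job(mention, segments, table) # re-running does not duplicate the row
```

Storing the match once and reading from the table afterwards is what lets a consumer look up a segment's mentions without re-evaluating the predicates.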
MentionRouter and MentionRouteJob: post-creation processing
When a mention is created, a MentionRouter is called for it. The MentionRouter processes the mention to see what other jobs should be called for that mention. In the case of a profile, it checks whether the profile is present in our system. If not, it creates it and then queues an enrichment job for the profile. If the profile is present but out of date (hasn't been enriched in 30+ days), it also calls the enrich job on the profile.
Profile enrichment is handled by the ProfilePipeline, which is covered in a separate wiki entry.
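The profile routing rule (create if missing, enrich if new or stale) can be sketched as below. Profile, route_profile, and the returned symbols are hypothetical; a hash stands in for the profile store and an array for the enrichment queue. Treating a profile that has never been enriched as stale is also an assumption of this sketch.

```ruby
STALE_AFTER_SECONDS = 30 * 24 * 60 * 60 # "hasn't been enriched in 30+ days"

# Hypothetical profile record; the real model is not shown on this page.
Profile = Struct.new(:handle, :enriched_at)

# Ensures the mention's profile exists and queues enrichment when it is new or stale.
# Returns :created, :stale, or :fresh so the decision is visible to the caller.
def route_profile(handle, profiles, enrich_queue, now: Time.now)
  profile = profiles[handle]
  if profile.nil?
    profiles[handle] = Profile.new(handle, nil)
    enrich_queue << handle
    :created
  elsif profile.enriched_at.nil? || now - profile.enriched_at >= STALE_AFTER_SECONDS
    enrich_queue << handle
    :stale
  else
    :fresh
  end
end

now = Time.at(100 * 86_400)
profiles = {
  "old"    => Profile.new("old", now - 31 * 86_400),
  "recent" => Profile.new("recent", now - 86_400)
}
enrich_queue = []

results = %w[missing old recent].map { |handle| route_profile(handle, profiles, enrich_queue, now: now) }
```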