Setting up OMA on local

(created by nicholas; last update by william, apr 2014)

Required infrastructure

  • elasticsearch - 0.9 branch

  • redis

  • mongodb

  • postgresql

Setting up local

There is a postgresql backup on the team's dropbox.

Create a local development database:

createdb oma_dev

Load the backup data:

gunzip -c marketfu_production.sql.gz  | psql oma_dev

Make sure you have a config/database.yml which sets up the PG database. Example:

development:
  username: postgres_username
  database: oma_dev
  adapter: postgresql
test:
  username: postgres_username
  database: oma_test
  adapter: postgresql

Now copy config/application.yml.example to config/application.yml and make sure the names of the development and test Postgres databases match those in database.yml. At this point you should be able to start the application:

bundle exec thin start -R oma.ru -p 3000 -e development

Viewing the landing page

After running the app you'll notice that http://localhost:3000 redirects you to http://getoma.com and you're no longer viewing the development app.

The problem is that the landing page expects a subdomain tied to a company. First you need to make sure that domains like omadev.localhost.dev (this is an example) also point to localhost.

Ubuntu 12.04: you can achieve this by editing the /etc/hosts file:

127.0.0.1       localhost
127.0.0.1       localhost.dev
127.0.0.1       omadev.localhost.dev

Note: this works only for explicitly declared domains; if you want a generic solution (*.localhost.dev), consider using dnsmasq.
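For the dnsmasq route, a one-line config along these lines (the file path is just an example) resolves localhost.dev and every subdomain of it to 127.0.0.1:

```
# /etc/dnsmasq.d/localhost-dev.conf (example path)
address=/localhost.dev/127.0.0.1
```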

Now you should "Sign Up" the omadev company and expect its landing page to be available at http://omadev.localhost.dev:3000. The signup form is here: http://localhost.dev:3000/company_signup.

In the process you'll receive an activation email with an incorrect link. Replace the production domain (omadev.omaengine.com) with the local one (omadev.localhost.dev:3000) and paste it into the browser to activate the newly created account. This isn't right and should be fixed.

Running the backend processors

Crawler

The crawler is run with rake tasks. The entire crawler consists of several running processors:

  1. crawler

  2. link

  3. hydra

  4. attribute

  5. issues

  6. writer

In addition, there's a token_tap task that provides rate limiting, and there are commands to turn the crawler on and off.
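Conceptually, each processor pops pages off one redis queue, does its work, and pushes them onto the next stage's queue. A toy in-memory sketch of that hand-off (the method and queue names are illustrative, not the real implementation):

```ruby
# Toy sketch of the stage hand-off; the real processors read and write
# redis queues. Stage names follow the list above.
STAGES = %i[crawler link hydra attribute issues writer]

# Push pages through every stage in order and return the writer's queue.
def run_pipeline(pages)
  queues = Hash.new { |h, k| h[k] = [] }
  queues[:crawler].concat(pages)
  STAGES.each_cons(2) do |stage, next_stage|
    until queues[stage].empty?
      page = queues[stage].shift
      # ...stage-specific work (fetching, link extraction, etc.) goes here...
      queues[next_stage] << page
    end
  end
  queues[:writer]
end

run_pipeline([{ url: "http://example.com/" }]).size  # => 1
```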

The token tap needs to be running at all times:

bundle exec rake opportunity_pipeline:token_tap
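The token tap is essentially a token-bucket rate limiter: it periodically drips tokens into a bucket and the crawler consumes one per request. A minimal in-memory sketch of the idea (the class and method names are illustrative; the real task stores its tokens in redis):

```ruby
# Minimal token-bucket sketch; illustrative only. The real token_tap
# task drips tokens into redis for the crawler processes to consume.
class TokenBucket
  attr_reader :tokens

  def initialize(capacity:, refill_per_tick:)
    @capacity = capacity
    @refill_per_tick = refill_per_tick
    @tokens = capacity
  end

  # Called on every tick of the tap: add tokens, capped at capacity.
  def tick
    @tokens = [@tokens + @refill_per_tick, @capacity].min
  end

  # A crawler request takes one token; returns false when rate-limited.
  def take
    return false if @tokens.zero?
    @tokens -= 1
    true
  end
end

bucket = TokenBucket.new(capacity: 2, refill_per_tick: 1)
bucket.take  # => true  (1 token left)
bucket.take  # => true  (0 left)
bucket.take  # => false (rate limited)
bucket.tick  #          (tap drips one token back)
bucket.take  # => true
```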

Queue a domain for crawling

rails console:

rc = RedisCrawler::Console.new  # console helper for the crawler queues
rc.queue domain_id              # replace domain_id with a Domain record's id

If the crawler processors are already running, crawling starts immediately.

Run the crawler

Crawler and link processor

The link and crawler processors need to run at the same time. The crawler task fetches pages from the internet and stores them in redis, while the link processor analyzes those pages for new links to crawl and forwards each crawled page to the hydra.

bundle exec rake redis_crawler:crawler
bundle exec rake redis_crawler:link

Hydra processor

The hydra processor checks links that are not part of the crawl domain for their status codes. When all the links are checked it pushes the page to the attribute processor.

bundle exec rake redis_crawler:hydra
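A sketch of the internal/external split the hydra cares about (the helper name is hypothetical); it would then request the external links to collect their status codes:

```ruby
require "uri"

# Hypothetical helper: which of a page's links are off the crawl domain
# (and therefore need their status codes checked by the hydra)?
def external_links(links, crawl_domain)
  links.select do |link|
    host = URI.parse(link).host
    # Keep only hosts that are neither the crawl domain nor a subdomain of it.
    host && host != crawl_domain && !host.end_with?(".#{crawl_domain}")
  end
end

links = [
  "http://example.com/about",
  "http://blog.example.com/post",
  "http://other.org/page",
]
external_links(links, "example.com")  # => ["http://other.org/page"]
```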

Attribute processor

The attribute processor analyzes the page and extracts attributes from it.

bundle exec rake redis_crawler:attribute
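As an illustration of the kind of attributes that get extracted (the real processor is far more thorough, and these rules are invented examples), a sketch pulling a couple of attributes out of a stored page:

```ruby
# Illustrative only: pull a couple of simple attributes from page HTML.
# The real attribute processor extracts much more than this.
def extract_attributes(html)
  title = html[%r{<title[^>]*>(.*?)</title>}im, 1]
  {
    title: title && title.strip,            # page <title>, stripped
    h1_count: html.scan(/<h1[\s>]/i).size,  # number of <h1> headings
  }
end

extract_attributes("<html><head><title> Home </title></head><body><h1>Hi</h1></body></html>")
# => { title: "Home", h1_count: 1 }
```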

Issue processor

The issue processor analyzes the page for issues.

bundle exec rake redis_crawler:issue
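A sketch of the kind of checks the issue processor runs over a page's attributes (the rules here are invented examples, not the real rule set):

```ruby
# Invented example rules; the real issue processor has its own rule set.
def page_issues(attrs)
  issues = []
  issues << :missing_title  if attrs[:title].nil? || attrs[:title].empty?
  issues << :title_too_long if attrs[:title].to_s.length > 70
  issues << :missing_h1     if attrs[:h1_count].to_i.zero?
  issues
end

page_issues(title: "", h1_count: 0)  # => [:missing_title, :missing_h1]
```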

Writer

This processor writes the page out to mongodb, elasticsearch, and S3.

bundle exec rake redis_crawler:writer

Opportunities

The opportunity pipeline is another important part of our infrastructure and pulls in data from social media and third party resources.

Sources:

  1. facebook

  2. twitter

  3. bing news

  4. forum

  5. serps

  6. profile providers for enrichment, e.g. the fullcontact api

Initiating opps retrieval for a project

rails console:

> include OpportunityPipeline::Console
> queue_project_keywords_for(project.id)
> queue_states
2013-03-13 15:56:51 UTC
Enrichment queue: 0
Facebook queue: 9
Twitter queue: 9
News queue: 9
Forum queue: 9
Serps queue: 9
Twitter Write Count: 0

Retrieving news opps

News opportunities are retrieved by running a news retriever and a news processor:

bundle exec rake opportunity_pipeline:news_retriever
bundle exec rake opportunity_pipeline:news

Retrieving facebook opps

Facebook opportunities are retrieved by running a facebook retriever and a facebook processor:

bundle exec rake opportunity_pipeline:facebook_retriever
bundle exec rake opportunity_pipeline:facebook

Retrieving twitter opps

Twitter opportunities are retrieved by running a twitter retriever and a twitter processor:

bundle exec rake opportunity_pipeline:twitter_retriever
bundle exec rake opportunity_pipeline:twitter

Retrieving forum opps

Forum opportunities are keyword mentions on forums. They are retrieved by running a forum retriever and a forum processor:

bundle exec rake opportunity_pipeline:forum_retriever
bundle exec rake opportunity_pipeline:forum

Retrieving serps opps

Serps are loaded from the serps table and treated as a source of opportunities: each one generates a mention and a potential lead, which are then queued up for enrichment.

bundle exec rake opportunity_pipeline:serps_retriever
bundle exec rake opportunity_pipeline:serps
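A sketch of turning a serp row into a mention and a potential lead (the field names here are guesses at the shape, not the real schema):

```ruby
require "uri"

# Field names are illustrative, not the real serps schema.
def serp_to_opportunity(serp)
  mention = { source: :serps, url: serp[:url], keyword: serp[:keyword] }
  lead    = { domain: URI.parse(serp[:url]).host }
  { mention: mention, lead: lead }  # both get queued for enrichment
end

serp_to_opportunity(url: "http://other.org/page", keyword: "widgets")[:lead]
# => { domain: "other.org" }
```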

Enrichment

During opportunity retrieval we've identified entities: potentially contactable items such as websites or social media users. The enrichment phase consists of digging in, trying to find out more about them, and hopefully isolating contact details.

bundle exec rake opportunity_pipeline:enrichment
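Conceptually, enrichment merges profile data from providers into the entity. A stubbed sketch (the provider call is faked here; a real run would hit something like the fullcontact api):

```ruby
# Stubbed sketch: a real run would query a profile provider such as the
# fullcontact api instead of this fake in-memory lookup.
FAKE_PROFILES = {
  "other.org" => { name: "Other Org", twitter: "@otherorg" },
}.freeze

# Merge whatever the provider knows into the entity; unknown domains
# come back unchanged.
def enrich(entity)
  profile = FAKE_PROFILES[entity[:domain]] || {}
  entity.merge(profile)
end

enrich(domain: "other.org")
# => { domain: "other.org", name: "Other Org", twitter: "@otherorg" }
```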

Some more processing

The mentions and leads section still won't work at this point, because we still need to write to elasticsearch and do some twitter processing:

resqueworker: bundle exec rake resque:work QUEUE=twitter_processor
mention_resqueworker: bundle exec rake resque:work QUEUE=opp_mention_writer_queue
