Setting up OMA on local
(created by nicholas; last update by william, apr 2014)
Required infrastructure
Elasticsearch - 0.9 branch
Redis
MongoDB
PostgreSQL
Setting up local
There is a PostgreSQL backup on the team's Dropbox.
Create a local development database:
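For example (the database name oma_development is an assumption; use whatever name your database.yml expects):

    createdb oma_development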
Load the backup data:
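Assuming the backup is a plain-SQL dump called oma_backup.sql (both the filename and the format are assumptions):

    psql oma_development < oma_backup.sql
    # or, if it's a custom-format dump:
    # pg_restore -d oma_development oma_backup.dump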
Make sure you have a config/database.yml which sets up the PG database. Example:
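A minimal sketch (database names and credentials are assumptions; adjust to your local setup):

    development:
      adapter: postgresql
      encoding: unicode
      database: oma_development
      username: postgres
      password:
      host: localhost

    test:
      adapter: postgresql
      encoding: unicode
      database: oma_test
      username: postgres
      password:
      host: localhost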
Now copy config/application.yml.example to config/application.yml and make sure that the names of the development and test Postgres databases match those in database.yml. At this point you should be able to start the application.
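A minimal sketch of those steps (standard Rails commands):

    cp config/application.yml.example config/application.yml
    bundle install
    rails server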
Viewing the landing page
After running the app you'll notice http://localhost:3000 gets you redirected to http://getoma.com and you're not viewing the development app anymore.
The problem is that the landing page expects a subdomain tied to a company. First you need to make sure that domains like omadev.localhost.dev (this is an example) also point to localhost.
Ubuntu 12.04: You can achieve this by editing the /etc/hosts file:
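For example, add a line like the following (omadev.localhost.dev is just the example subdomain used below; add one entry per subdomain you need):

    127.0.0.1   localhost.dev omadev.localhost.dev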
Note: this works only for explicitly declared domains; if you want a generic solution (*.localhost.dev), consider using dnsmasq.
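A minimal dnsmasq sketch (assuming dnsmasq is installed and your system resolver points at it); this resolves every *.localhost.dev name to localhost:

    # /etc/dnsmasq.conf
    address=/localhost.dev/127.0.0.1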
Now you should "Sign Up" the omadev company and expect its landing page to be available at http://omadev.localhost.dev:3000. The signup form is here: http://localhost.dev:3000/company_signup.
In the process you'll receive an activation email with an incorrect link. Replace the production domain (omadev.omaengine.com) with the local one (omadev.localhost.dev:3000) and paste it into the browser to activate the newly created account. This isn't right and should be fixed.
Running the backend processors
Crawler
The crawler is run with rake tasks. The entire crawler consists of several running processors:
crawler
link
hydra
attribute
issues
writer
In addition there's a token_tap task which provides rate limiting, and there are commands to turn the crawler on and off.
The token tap needs to be running at all times.
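The exact rake invocation isn't recorded here, so the task name below is an assumption (check lib/tasks for the real one):

    bundle exec rake crawler:token_tap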
Queue a domain for crawling
rails console:
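A minimal sketch of what queueing might look like; the model and method names (Domain, queue_for_crawling) are assumptions, not the app's confirmed API:

    # in the rails console
    domain = Domain.find_by_name("example.com")   # look up the domain record
    domain.queue_for_crawling                     # hypothetical method that enqueues it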
If the crawler is already running, the queued domain will be picked up immediately.
Run the crawler
Crawler and link processor
The crawler and link processors need to run at the same time. The crawler task fetches pages from the internet and stores them in Redis, while the link processor analyzes those pages for new links to crawl and forwards the crawled page to the hydra.
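For example, in two separate terminals (the task names are assumptions; check lib/tasks for the actual ones):

    # terminal 1: fetch pages and store them in redis
    bundle exec rake crawler:run

    # terminal 2: extract new links and forward pages to the hydra
    bundle exec rake crawler:link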
Hydra processor
The hydra processor checks links that are not part of the crawl domain for their status codes. When all the links are checked, it pushes the page to the attribute processor.
Attribute processor
The attribute processor analyzes the page and extracts attributes from it.
Issue processor
The issue processor analyzes the page for issues.
Writer
This processor writes out the page to MongoDB, Elasticsearch, and S3.
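The remaining processors are started the same way, one per terminal (again, these task names are assumptions and should be checked against lib/tasks):

    bundle exec rake crawler:hydra       # check external links' status codes
    bundle exec rake crawler:attribute   # extract attributes
    bundle exec rake crawler:issues      # detect issues
    bundle exec rake crawler:writer      # write to MongoDB, Elasticsearch and S3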
Opportunities
The opportunity pipeline is another important part of our infrastructure and pulls in data from social media and third-party resources.
Sources:
1. Facebook
2. Twitter
3. Bing News
4. forums
5. SERPs
6. profile providers for enrichment, e.g. the FullContact API
Initiating opps retrieval for a project
rails console:
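A minimal sketch; the model and method names (Project, retrieve_opportunities) are assumptions about the app's API:

    # in the rails console
    project = Project.find_by_name("omadev")   # look up the project
    project.retrieve_opportunities             # hypothetical method that kicks off the opps pipeline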
Retrieving news opps
News opportunities are retrieved by running a news retriever and a news processor.
Retrieving facebook opps
Facebook opportunities are retrieved by running a facebook retriever and a facebook processor.
Retrieving twitter opps
Twitter opportunities are retrieved by running a twitter retriever and a twitter processor.
Retrieving forum opps
Forum opportunities are keyword mentions on forums; they are retrieved by running a forum retriever and a forum processor.
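Each source follows the same retriever/processor pattern. A sketch for news, with assumed task names (substitute facebook, twitter or forum for the other sources):

    bundle exec rake opportunities:news_retriever
    bundle exec rake opportunities:news_processor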
Retrieving serps opps
SERPs are loaded from the serps table and then treated as a source of opportunities: each one generates a mention and a potential lead, which are loaded up for enrichment.
Enrichment
During opportunity retrieval we've identified entities; these are potentially contactable items such as websites or social media users. The enrichment phase consists of digging in, trying to find out more about them, and hopefully isolating contact details.
Some more processing
The mentions and leads section still doesn't work, as we still need to write to Elasticsearch and do some Twitter processing.