
ElasticSearch



Elasticsearch migrations come from our own code and from an external gem. They were developed for two basic cases and later extended to a third:

  1. Change data in the current index.

  2. Pipe data into a new index with a different mapping.

Workflow

Create a new migration class

# oma-models/db/elasticsearch/migrations/my_migration.rb
class MyMigration < Oma::Elasticsearch::DataMigration
end

Override the needed methods

I suggest you read the overridable methods and the initialize method from Oma::Elasticsearch::DataMigration so you understand what the options are.

# oma-models/db/elasticsearch/migrations/add_keyword_in_url_to_rank.rb
class AddKeywordInUrlToRank < Oma::Elasticsearch::DataMigration

  # The options decide whether the migration pipes into a new index or not.
  # Read the source code to see the available options.
  # def initialize opts={}
  #   super opts.merge new_index: true
  # end

  # Which ES model class are we using?
  def model
    Rank
  end

  # Which records do we want to process? All by default.
  def query
  end

  # List the needed fields; only the listed fields will be loaded.
  # By default all fields are loaded.
  def needed_fields
    ['url', 'keyword']
  end

  # 1. This method gets a hit (a standard ES result hash) and is expected to return a hit.
  # 2. This is where we write changes to the individual record.
  # 3. When sending data to a new index without changes, we just return the hit unchanged.
  # 4. When updating an existing index, we only need to return a hit with the changed fields.
  # 5. When piping to a new index, we need to return all the fields we want stored.
  def process_hit(hit)
    kyw = hit['_source']['keyword']
    url = hit['_source']['url']
    hit['_source']['keyword_in_url'] = Oma::Text.keyword_in_url?(kyw, url)
    hit # return the modified hit
  end
end
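If you need the second case (piping into a new index), the commented-out initialize above is the hook. Here is a minimal sketch, assuming new_index: true is the only option required; the class name is hypothetical:

# Hypothetical migration that copies Rank into a new index with an
# updated mapping, leaving each record unchanged.
class CopyRankToNewIndex < Oma::Elasticsearch::DataMigration
  def initialize opts={}
    super opts.merge new_index: true
  end

  def model
    Rank
  end

  # Return the hit as-is; by default all fields are loaded,
  # so they all get stored in the new index.
  def process_hit(hit)
    hit
  end
end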

Set up a rake task to run the migration

# oma-models/db/elasticsearch/migrations/es_migrations.rake
namespace :es do
  namespace :migrations do
    desc "Adds a boolean (keyword_in_url) field to Rank."
    task :add_keyword_in_url_to_rank do
      require_relative './add_keyword_in_url_to_rank'
      AddKeywordInUrlToRank.perform
    end
  end
end
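To confirm the task is registered, you can list the namespace (standard rake behavior, since the task has a desc):

bundle exec rake -T es:migrations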

Run the migration

  1. Commit the migration files.

  2. Push them to GitHub.

  3. SSH to worker_1.

  4. Check out the oma-models code with the migration in it.

  5. Optional - pause processing:

    • bundle exec rake console OMA_ENV=production

    • Oma::Resque.pause

  6. Run it:

    OMA_ENV=production nohup bundle exec rake es:migrations:add_keyword_in_url_to_rank >> /var/log/app/add_keyword_in_url_to_rank.log 2>&1 &

The migration will run in the background and produce logs that are viewable on Papertrail. The logs may be buffered, so they might not appear in real time.
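If Papertrail lags, you can also tail the nohup log directly on worker_1, since the command above redirects output there:

tail -f /var/log/app/add_keyword_in_url_to_rank.log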

Other notes

You can and should unit test migrations; there are examples in the code. Developing a migration alongside a unit test makes it much easier to confirm it works.
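For instance, a minimal RSpec sketch for the migration above might exercise process_hit directly. This is illustrative, not copied from our suite: it assumes the migration file is loadable from the spec and that instantiating the migration with no options does not touch Elasticsearch.

# Hypothetical spec for AddKeywordInUrlToRank
require 'spec_helper'

describe AddKeywordInUrlToRank do
  it 'adds the keyword_in_url flag to the hit' do
    # Stub the text helper so the spec does not depend on its internals.
    allow(Oma::Text).to receive(:keyword_in_url?)
      .with('widgets', 'http://example.com/widgets')
      .and_return(true)

    hit = { '_source' => { 'keyword' => 'widgets', 'url' => 'http://example.com/widgets' } }
    result = described_class.new.process_hit(hit)

    expect(result['_source']['keyword_in_url']).to be true
  end
end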

You should not run migrations against Elasticsearch with authentication. At the time of writing, authentication is implemented with an Apache proxy that limits HTTP request sizes. Since migrations work in batches, that limit will prevent some data from being stored. That is one of the reasons we don't run them from development machines.

Migrating to a new index is problematic because it requires us to pause data generation. It would be better to migrate the existing data first and then keep moving newly generated data, fetched based on the updated_at field. That would limit the pause in processing to the time needed to switch over to the new index.
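A catch-up pass along those lines could restrict the query override to recently updated records. Here is a sketch under two assumptions the source does not confirm: that query may return a standard Elasticsearch query hash, and that the documents carry an updated_at field.

# Hypothetical catch-up migration: only process records updated after the cutoff.
class RankCatchUpMigration < Oma::Elasticsearch::DataMigration
  CUTOFF = '2016-01-01T00:00:00Z' # e.g. when the initial bulk migration started

  def model
    Rank
  end

  # Restrict the scan to records changed since the cutoff.
  def query
    { query: { range: { updated_at: { gte: CUTOFF } } } }
  end

  def process_hit(hit)
    hit
  end
end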

When you create a migration, please name it with a date so we know approximately when it was run.

Also, tmux is available on the main worker box. If you are running multiple migrations, it can make sense to start a tmux session and tail the various workers to make sure they are working. A basic tutorial on tmux and splitting terminals is linked below.

http://lukaszwrobel.pl/blog/tmux-tutorial-split-terminal-windows-easily
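For example (standard tmux commands; the session name is arbitrary):

# Start a named session on the worker box
tmux new -s migrations

# Inside tmux: Ctrl-b % splits the window, Ctrl-b o switches panes.
# In each pane, tail one migration's log:
tail -f /var/log/app/add_keyword_in_url_to_rank.log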