Elasticsearch

Elasticsearch migrations come from our own code and from an external gem. These were developed for two basic cases and extended into a third one.

  1. Change data in the current index.

  2. Pipe data into a new index with a different mapping.

Workflow

Create a new migration class

# oma-models/db/elasticsearch/migrations/my_migration.rb
class MyMigration < Oma::Elasticsearch::DataMigration
end

Override the needed methods

I suggest you read the overridable methods and the initialize method from Oma::Elasticsearch::DataMigration so you understand what the options are.

# oma-models/db/elasticsearch/migrations/add_keyword_in_url_to_rank.rb
class AddKeywordInUrlToRank < Oma::Elasticsearch::DataMigration

  # With options we decide whether the migration pipes into a new index or not.
  # Read the source code to see the available options.
  # def initialize(opts = {})
  #   super opts.merge(new_index: true)
  # end

  # Which ES model class are we using?
  def model
    Rank
  end

  # Which records do we want to process? All by default.
  def query
  end

  # List the needed fields; only the listed fields will be loaded.
  # By default it loads all fields.
  def needed_fields
    ['url', 'keyword']
  end

  # 1. This method gets a hit (a standard ES result hash) and is expected to return a hit.
  # 2. This is where we write changes to the individual record.
  # 3. When sending data to a new index without changes, we just return the hit unchanged.
  # 4. When updating an existing index, we only need to return a hit with the changed fields.
  # 5. When piping to a new index, we need to return all fields we want stored.
  def process_hit(hit)
    kyw = hit['_source']['keyword']
    url = hit['_source']['url']
    hit['_source']['keyword_in_url'] = Oma::Text.keyword_in_url?(kyw, url)
    hit # return the modified hit
  end
end
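For the second case (piping into a new index without changing the data), a minimal sketch might look like the following. The class name CopyRankToNewIndex is hypothetical, and the stand-in base class exists only so the snippet runs outside oma-models; the real Oma::Elasticsearch::DataMigration may handle its options differently, so check its initialize method before relying on this.

```ruby
# Stand-in for Oma::Elasticsearch::DataMigration so this sketch runs on its own.
# Assumption: the real class accepts an options hash, as the commented-out
# initialize in the example above suggests.
unless defined?(Oma::Elasticsearch::DataMigration)
  module Oma
    module Elasticsearch
      class DataMigration
        attr_reader :opts

        def initialize(opts = {})
          @opts = opts
        end
      end
    end
  end
end

# Hypothetical migration: copies Rank documents into a new index unchanged.
class CopyRankToNewIndex < Oma::Elasticsearch::DataMigration
  def initialize(opts = {})
    # new_index: true asks the migration to pipe into a new index
    super(opts.merge(new_index: true))
  end

  def model
    Rank
  end

  # needed_fields is not overridden, so all fields are loaded and every
  # field ends up in the hit we return for storage in the new index.
  def process_hit(hit)
    hit
  end
end
```

Leaving needed_fields at its default matters here: when piping to a new index, the returned hit has to carry every field you want stored.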

Set up a rake task to run the migration

# oma-models/db/elasticsearch/migrations/es_migrations.rake
namespace :es do
  namespace :migrations do
    desc "Adds a boolean (keyword_in_url) field to Rank."
    task :add_keyword_in_url_to_rank do
      require_relative './add_keyword_in_url_to_rank.rb'
      AddKeywordInUrlToRank.perform
    end
  end
end

Run the migration

  1. Commit the migration files.

  2. Push them to GitHub.

  3. SSH to worker_1.

  4. Check out the oma-models code with the migration in it.

  5. Optional - pause processing:

    • bundle exec rake console OMA_ENV=production

    • Oma::Resque.pause

  6. Run it:

    OMA_ENV=production nohup bundle exec rake es:migrations:add_keyword_in_url_to_rank >> /var/log/app/add_keyword_in_url_to_rank.log 2>&1 &

The migration will run in the background and produce logs that are viewable on Papertrail. The logs might be buffered, so they may not appear in real time.

Other notes

You can and should unit test migrations. Developing a migration against a unit test really speeds up writing it and helps make sure it works. See the code for examples.
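As an illustration, a unit test for process_hit could look roughly like this. It is a sketch, not one of the tests in the repository: it assumes Minitest, and it defines simplified stand-ins for Oma::Text.keyword_in_url? and the base class so that it runs on its own. In oma-models you would require the real migration file instead, and the real keyword matching may behave differently.

```ruby
require "minitest/autorun"

# Stand-ins so the sketch runs outside oma-models (assumptions, not the real code).
module Oma
  module Text
    # Simplified guess at the behaviour: the keyword, with spaces replaced
    # by dashes, appears somewhere in the URL.
    def self.keyword_in_url?(keyword, url)
      url.downcase.include?(keyword.downcase.tr(" ", "-"))
    end
  end

  module Elasticsearch
    class DataMigration
      def initialize(opts = {}); end
    end
  end
end

# Same process_hit logic as the migration shown earlier.
class AddKeywordInUrlToRank < Oma::Elasticsearch::DataMigration
  def process_hit(hit)
    source = hit['_source']
    source['keyword_in_url'] = Oma::Text.keyword_in_url?(source['keyword'], source['url'])
    hit
  end
end

class AddKeywordInUrlToRankTest < Minitest::Test
  def setup
    @migration = AddKeywordInUrlToRank.new
  end

  def test_sets_flag_when_keyword_is_in_url
    hit = { '_source' => { 'keyword' => 'red shoes', 'url' => 'http://example.com/red-shoes' } }
    assert_equal true, @migration.process_hit(hit)['_source']['keyword_in_url']
  end

  def test_clears_flag_when_keyword_is_not_in_url
    hit = { '_source' => { 'keyword' => 'red shoes', 'url' => 'http://example.com/hats' } }
    assert_equal false, @migration.process_hit(hit)['_source']['keyword_in_url']
  end

  def test_returns_the_hit_it_was_given
    hit = { '_source' => { 'keyword' => 'red shoes', 'url' => 'http://example.com/hats' } }
    assert_same hit, @migration.process_hit(hit)
  end
end
```

Because process_hit only takes a hash and returns a hash, it can be tested without a running Elasticsearch instance.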

You should not run migrations while connecting to Elasticsearch with authentication. At the time of writing, authentication is implemented with an Apache proxy that limits HTTP request sizes. Since migrations work in batches, that limit will prevent some data from being stored. That is one of the reasons we don't run them from development machines.

Migrating to a new index is problematic because it requires us to pause data generation. It would be better to migrate the existing data first and then keep moving newly generated data until the switch. This could be done by fetching records based on the updated_at field. That way we would only need to pause processing for the time needed to switch to the new index.
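The updated_at idea could be sketched with a standard Elasticsearch range query. The helper name updated_since_query is hypothetical, and whether a migration's query method expects exactly this hash shape is an assumption; check Oma::Elasticsearch::DataMigration before using it.

```ruby
require "time"

# Builds an Elasticsearch range query matching documents whose updated_at
# is at or after the given time. Hypothetical helper, not part of oma-models.
def updated_since_query(since)
  { range: { updated_at: { gte: since.utc.iso8601 } } }
end

# Example: catch up on everything written since the first pass started.
first_pass_started_at = Time.utc(2015, 6, 1, 12, 0, 0)
catch_up_query = updated_since_query(first_pass_started_at)
# catch_up_query == { range: { updated_at: { gte: "2015-06-01T12:00:00Z" } } }
```

A catch-up migration could return such a hash from its query method, so that a second, much shorter run only touches records changed since the first run began.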

WHEN YOU CREATE A MIGRATION, PLEASE NAME IT WITH A DATE SO WE KNOW APPROXIMATELY WHEN IT WAS RUN.

Also, tmux is installed on the main worker box. If you are running multiple migrations, it might make sense to start a tmux session and tail the various workers to make sure they are working. A basic tmux tutorial that covers splitting terminals is available here: http://lukaszwrobel.pl/blog/tmux-tutorial-split-terminal-windows-easily
