Interface to EMR Hadoop jobs

Goal

Create an "interface" (written in Ruby) between OMA's processor boxes and Amazon's Elastic MapReduce (EMR) service.

The interface should make it possible to:

  • specify a job flow (collection of related jobs)

  • provide parameters to the job flow

  • specify callbacks (on success and on failure)

Additional options (should be taken into account, but not implemented immediately):

  • ability to monitor active jobs (flows)

  • ability to shut down active jobs (flows)
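
The three required capabilities listed above (a job flow, its parameters, and success/failure callbacks) could be sketched roughly as follows. This is only an illustration: JobFlowRequest, its methods, and the example parameters are hypothetical names that do not exist in the codebase, and nothing here talks to EMR.

```ruby
# Hypothetical sketch: none of these names come from the existing codebase.
class JobFlowRequest
  attr_reader :kind, :params

  def initialize(kind, params = {})
    @kind = kind
    @params = params
    @callbacks = { success: [], failure: [] }
  end

  # Register a block to run when the flow finishes successfully.
  def on_success(&block)
    @callbacks[:success] << block
    self
  end

  # Register a block to run when the flow fails.
  def on_failure(&block)
    @callbacks[:failure] << block
    self
  end

  # Called by whichever component learns the final outcome of the flow.
  # Raises KeyError for anything other than :success or :failure.
  def finish(outcome, details = {})
    @callbacks.fetch(outcome).each { |callback| callback.call(details) }
  end
end

log = []
request = JobFlowRequest.new(:pages_es_index_updater, domain_id: 42)
                        .on_success { |d| log << "indexed #{d[:pages]} pages" }
                        .on_failure { |d| log << "failed: #{d[:error]}" }

request.finish(:success, pages: 10)
```

Because the callbacks are blocks, they can capture local state (here the log array) from the place where the flow is requested.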

Implementation

ActiveRecord based implementation (rejected)

Create an ActiveRecord model representing a single job flow instance. Create a model for each kind of flow using ActiveRecord single table inheritance (STI).

A cron task (in oma-processors) checks the active and finished job flow records every hour and calls the callbacks for the finished ones.

Possible usage:

# oma-models/lib/models/postgres/job_flows/emr_base.rb

module JobFlows
  class EmrBase < ActiveRecord::Base
  ...

# oma-models/lib/models/postgres/job_flows/pages_es_index_updater.rb
module JobFlows
  class PagesEsIndexUpdater < EmrBase
  ...

# oma-processors/...

job_flow = ::JobFlows::PagesEsIndexUpdater.create!(domain_id: domain.id)
job_flow.run

active_flow = ::JobFlows::PagesEsIndexUpdater.active.first

JobFlows::EmrBase (and its subclasses) use the rslifka/elasticity gem under the hood.
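
The hourly cron check described above might look roughly like the sketch below. It is an assumption-laden illustration: FlowRecord is a plain Struct standing in for the ActiveRecord model, the state names are illustrative, and the remote status lookup is passed in as a callable instead of calling the elasticity gem.

```ruby
# Hypothetical sketch: FlowRecord and the status lookup are stand-ins,
# not the ActiveRecord models or the elasticity gem's API.
FlowRecord = Struct.new(:id, :state, :notified)

# Illustrative state names; the real set depends on what EMR reports.
FINAL_STATES = %w[COMPLETED FAILED TERMINATED].freeze

# fetch_state is any callable that returns the current remote state of a record.
# The block receives each newly finished record and a success flag.
def check_flows(records, fetch_state)
  records.reject(&:notified).each do |record|
    record.state = fetch_state.call(record)
    next unless FINAL_STATES.include?(record.state)

    yield record, record.state == 'COMPLETED'
    record.notified = true # never call the callbacks twice for one flow
  end
end

records = [FlowRecord.new(1, 'RUNNING', false), FlowRecord.new(2, 'RUNNING', false)]
remote = { 1 => 'COMPLETED', 2 => 'RUNNING' }
finished = []

check_flows(records, ->(record) { remote[record.id] }) do |record, ok|
  finished << [record.id, ok]
end
```

Marking a record as notified is what keeps a later run of the cron task from firing the same callback again.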

Pros

  • History. Finished jobs are stored in Postgres, which provides information about their initial arguments, final statuses, and created artifacts (URLs of created files etc.).

Cons

  • The new ActiveRecord classes pollute oma-models with processor implementation details. In particular, oma-models would depend on the rslifka/elasticity gem.

  • Callbacks (on job flow success or failure) are implemented as methods of an AR class, so they cannot take advantage of closures.

S3 based implementation (rejected)

Create a Ruby class (module?) representing a single job flow instance. Use Amazon S3 as the persistence layer: save the list of active (unfinished) job flows as a file on S3 (CSV?). Create a Ruby class for each particular kind of job flow.
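
As a rough illustration of keeping the list of active job flows in a CSV file, consider the sketch below. Everything in it is an assumption: the column layout, the ActiveFlowList class, the object key, and the in-memory MemoryStore that stands in for S3 so the example runs without AWS access.

```ruby
require 'csv'

# Hypothetical stand-in for S3: maps object keys to string bodies.
class MemoryStore
  def initialize
    @objects = {}
  end

  def read(key)
    @objects.fetch(key, '')
  end

  def write(key, body)
    @objects[key] = body
  end
end

# Hypothetical list of active (unfinished) job flows, stored as one CSV file.
class ActiveFlowList
  HEADERS = %w[job_flow_id kind].freeze # assumed column layout

  def initialize(store, key)
    @store = store
    @key = key
  end

  # Returns the active flows as an array of hashes keyed by header name.
  def all
    CSV.parse(@store.read(@key), headers: true).map(&:to_h)
  end

  def add(flow)
    save(all << flow)
  end

  def remove(job_flow_id)
    save(all.reject { |row| row['job_flow_id'] == job_flow_id })
  end

  private

  # Rewrites the whole file: header row first, then one row per flow.
  def save(rows)
    body = CSV.generate do |csv|
      csv << HEADERS
      rows.each { |row| csv << row.values_at(*HEADERS) }
    end
    @store.write(@key, body)
  end
end

list = ActiveFlowList.new(MemoryStore.new, 'emr/active_flows.csv')
list.add('job_flow_id' => 'j-EXAMPLE1', 'kind' => 'pages_es_index_updater')
list.add('job_flow_id' => 'j-EXAMPLE2', 'kind' => 'other_flow')
list.remove('j-EXAMPLE1') # the flow finished, so drop it from the list
```

Every change in this sketch reads the whole file, modifies it, and writes it back, so two processes updating the list at the same time could overwrite each other's changes.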
