
App maintenance crash course notes

Don’t be afraid!

The point is to ask the right questions and to be able to find the answers.

Checks

Watch Papertrail, Airbrake, and syren, and don’t ignore errors when they come into Slack. Some might not be critical, but you should develop a sense of what is going on and which ones can be ignored.

Queues being too big:

Sometimes it is better to clear the queue and unstick things by re-queuing than to stay stuck for weeks. Losing some data is better than a total halt. That still doesn’t mean you can ignore the underlying problem.

worker_2 doesn’t auto deploy by itself:

This is because of a silly error in generating the config JSON. I would be really happy if you figure it out; I did try :( As a workaround, ssh to worker_2 and edit hot_update.sh with vim.

change from:

{"run_list": [ "["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]","["role[processor_box_worker_2]"]" ]}'

to:

'{"run_list": ["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]}'

then it will auto update
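
Before editing hot_update.sh, it can help to confirm which of the two payloads is actually valid JSON. The sketch below is only an illustration: the strings are copied from above without the surrounding shell quotes, and the helper name is_valid_run_list is hypothetical, not something in oma-chef.

```python
import json

# Payloads copied from the notes above, without the surrounding shell quotes.
BROKEN = '{"run_list": [ "["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]","["role[processor_box_worker_2]"]" ]}'
FIXED = '{"run_list": ["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]}'


def is_valid_run_list(payload: str) -> bool:
    """Return True if payload parses as JSON and run_list is a list of strings."""
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError:
        return False
    run_list = doc.get("run_list") if isinstance(doc, dict) else None
    return isinstance(run_list, list) and all(isinstance(item, str) for item in run_list)


if __name__ == "__main__":
    print("broken:", is_valid_run_list(BROKEN))
    print("fixed: ", is_valid_run_list(FIXED))
```

The stray quotes around the inner brackets are what make the first payload unparseable, which is a reasonable place to start when looking for the bug in the generator.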

SSH-ing around

  • You have ssh access to all servers listed on AWS console.

  • Ssh to the aud server first; from there you can ssh to all the others using their private IP. The private IP is shown on the AWS console.

  • Aliases are set up for easy access, such as:

    • worker_1, worker_2, spot pri.vat.e.ip, elasticsearch_00x, redis

Papertrail and Airbrake are your friends!

  • Logs are prefixed with class names so you know where they come from.

  • Errors on Airbrake contain context information, so you know how the code was started (request params for Rails, self.perform params for the backend).

What is running?

  • Spot instances should have 10 workers each.

  • The two permanent workers (worker_1, worker_2) should have fewer.

  • Remember it all runs on EC2, and you can always ssh to the instances from the aud server, which you all have access to.

What is running on an individual spot instance?

  • Ssh to it.

  • Look at crontab to see recurring processes.

  • Who set up those crontab entries? oma-chef did.

  • Look in the /etc/init/ folder for files starting with app-

  • Who set them up? A script. Who set up the script? oma-chef did.

  • Where are the logs? /var/logs/app

  • Why are the logs there? Because oma-chef said so.

What is running on the permanent instances?

  • ssh aud, ssh worker_1

  • ssh aud, ssh worker_2

  • see previous question

How do you manage oma-chef?

  • All code is in oma-chef git repository.

  • Check it out, make changes, and push them back to master.

  • Ssh to a worker node and run chef_rerun.sh in the home folder.

  • In that script you can find commands that you can use to shorten the code change / test run cycle.

Statistics - graphite.omamatic.com

  • Click through the folders to stats.counters.oma_processors.production. From that point on, folders are organized the same way class names are organized, so it’s the same as navigating through the code. If you find a stat_event call in the code, you should be able to find its chart, and vice versa.
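
As a rough illustration of that mapping, the sketch below turns a class name into a metric path under stats.counters.oma_processors.production and builds a Graphite render URL for it. The http scheme, the example class and event names, and the rule that namespaces become folders unchanged are all assumptions, so check the real folder names in the Graphite tree. The /render endpoint with target, from, and format parameters is Graphite's standard render API.

```python
from urllib.parse import urlencode

GRAPHITE_BASE = "http://graphite.omamatic.com"  # assumption: plain http
STATS_PREFIX = "stats.counters.oma_processors.production"


def metric_path(class_name: str, event: str) -> str:
    """Map e.g. 'Foo::Bar' + 'failed' to '<prefix>.Foo.Bar.failed'.

    Assumption: each namespace becomes a folder with its name unchanged;
    the real tree may use a different casing (for example snake_case).
    """
    parts = [p for p in class_name.split("::") if p]
    return ".".join([STATS_PREFIX, *parts, event])


def render_url(target: str, since: str = "-1h") -> str:
    """Build a render API URL that returns the series as JSON."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{GRAPHITE_BASE}/render?{query}"


if __name__ == "__main__":
    # Hypothetical class and event names, for illustration only.
    print(render_url(metric_path("Foo::Bar", "failed")))
```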

Alerts - syren

  • graphite.omamatic.com:8080 is a simple UI where you pick a metric from the statistics to monitor, and it starts sending messages into the alerts room on Slack. Some error counters are sensitive but not dramatic. Others are not so sensitive but dramatic.

Panic mode

front-end maintenance mode

PG crash or timing out

  • start pg console

  • kill long running queries

  • restart the server (if killing the queries doesn’t help)

  • think about the queries you just killed and why they take so long or take so much CPU
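
The "kill long running queries" step could look roughly like the sketch below. It assumes PostgreSQL 9.2 or newer (where pg_stat_activity exposes pid, state, and query) and a DB-API cursor that uses %s placeholders, such as psycopg2's; the function name and the 5 minute default are hypothetical. It defaults to a dry run so you can look at the queries before killing anything.

```python
# SQL assumes PostgreSQL 9.2+ column names in pg_stat_activity.
FIND_LONG_RUNNING = """
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > %s::interval
ORDER BY runtime DESC
"""
TERMINATE = "SELECT pg_terminate_backend(%s)"


def kill_long_running(cursor, older_than="5 minutes", dry_run=True):
    """List queries running longer than older_than; terminate them unless dry_run.

    cursor is any DB-API cursor using %s placeholders (e.g. psycopg2).
    Returns the (pid, runtime, query) rows that matched.
    """
    cursor.execute(FIND_LONG_RUNNING, (older_than,))
    rows = cursor.fetchall()
    for pid, runtime, query in rows:
        # Print first, so there is a record of what was killed and why.
        print(f"{pid}\t{runtime}\t{query[:80]}")
        if not dry_run:
            cursor.execute(TERMINATE, (pid,))
    return rows
```

pg_cancel_backend is a gentler alternative that cancels the running query without closing the connection. Keep the printed query text: it is the input for the last step, working out why those queries were slow.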

redis crash

pause processing

  • ssh to redis

  • process is alive and responding

    • try to dump data to disk, restart, reload from disk - this happened after deleting a big chunk of data

  • process is dead

    • start it

data in redis is lost

  • this may cause some problems but not the end of the world

  • restore important settings

    • queues config

    • reload alexa1m - read the class’s code

    • search in code for “set_setting” and think, for each of them, whether the default is ok

ES cluster problems

  • pause processing

  • disable shard re-allocations

  • try to get cluster up, read logs, restart failing nodes
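
For the "disable shard re-allocations" step, one common approach is a transient cluster setting. The sketch below only builds the request, so you can inspect it before sending. It assumes Elasticsearch 1.0 or newer, where the cluster.routing.allocation.enable setting exists; the node URL and the function name are placeholders.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: replace with a real cluster node


def allocation_request(enable: bool, es_url: str = ES_URL) -> Request:
    """Build (not send) a PUT /_cluster/settings request toggling shard allocation."""
    body = {
        "transient": {
            # "none" stops shards from being moved around while nodes restart.
            "cluster.routing.allocation.enable": "all" if enable else "none"
        }
    }
    return Request(
        url=f"{es_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = allocation_request(enable=False)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

To send it, pass the request to urllib.request.urlopen. Remember to call it again with enable=True once the cluster is healthy, otherwise new or recovered shards stay unassigned.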

ES queries time out and make the cluster slow

  • pause processing

  • try to delete some data

    • old data

    • data belonging to deleted projects

    • keywords

    • domains

  • investigate the query

    • can you optimize it by filtering the data better and performing aggregations on a smaller dataset?

bigger ES problems

  • partition the data differently

    • routing

    • specialized indexes

  • accept the limits of combinatorial explosions and do it differently, not in real time

  • no easy way

Deploy - failures

front-end

  • in essence there are 3 things you need to think of

    • deploying code is just pushing a specific revision to heroku instead of github

    • the database needs to be compatible with the code you’re deploying; migrating is just a rake task, which you can always change to get you out of the problem

    • assets need to be compatible with the code you’re deploying; building and pushing them is just another rake task

  • all these tasks are executed by the deploy scripts in the root dir and can also be executed just by copy-pasting them into the console
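
The first point, pushing a specific revision to heroku, can be sketched as below. This is not the project's deploy script: it assumes a git remote named heroku and that the app builds from the remote's master branch, and the function names are hypothetical.

```python
import subprocess


def heroku_push_command(revision: str, remote: str = "heroku", force: bool = False) -> list:
    """Return the git command that deploys revision to the remote's master branch."""
    cmd = ["git", "push", remote, f"{revision}:refs/heads/master"]
    if force:
        # Needed when the revision is older than what is deployed (a rollback).
        cmd.insert(2, "--force")
    return cmd


def deploy(revision: str, **kwargs) -> None:
    """Run the push; raises CalledProcessError if git exits non-zero."""
    subprocess.run(heroku_push_command(revision, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it; "abc123" is a placeholder SHA.
    print(" ".join(heroku_push_command("abc123", force=True)))
```

Pushing the code covers only the first of the three points; the database and assets still have to match the revision you push.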

back-end

  • a backend failure is usually not immediately perceived by users

  • you can always just kill all spot instances and start fresh with new ones

Updating ES indexes

  • examples in oma-models/db/elasticsearch

  • there are cases for reindexing data in a new index

  • there are cases for updating data on an existing index

  • there are cases for reindexing data from one cluster to another (development import)


Last updated 7 years ago

http://app.omamatic.com/resque_web/overview
https://devcenter.heroku.com/articles/maintenance-mode