App maintenance crash course notes

Don’t be afraid!

The point is to ask the right questions and to be able to find the answers.

Checks

Look at Papertrail, Airbrake and syren. When errors come into Slack, don't ignore them. Some might not be critical, but you should develop a sense of what is going on and which ones can be ignored.

Queues being too big:

Sometimes it is just better to clear the queue and unstick things by re-queuing, instead of staying in a stuck state for weeks. Losing some data is better than a total halt. Still, this doesn't mean you can ignore the problem.
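
If the queues live in Redis as plain lists (the key name below is made up - check with KEYS/SCAN for the real ones), a minimal sketch for inspecting and flushing one, run on the redis box:

    redis-cli LLEN queue:domains          # how big is the backlog? ("queue:domains" is an example name)
    redis-cli LRANGE queue:domains 0 10   # peek at the first few jobs before dropping anything
    redis-cli DEL queue:domains           # clear it, then re-queue the work from the app side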

worker_2 doesn’t auto deploy:

by itself because of a silly error when generating the config JSON; I would be really happy if you figured it out. I did try :( SSH to worker_2 and open hot_update.sh (vim hot_update.sh),

change from:

{"run_list": [ "["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]","["role[processor_box_worker_2]"]" ]}'

to:

'{"run_list": ["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]}'

then it will auto update

SSH-ing around

  • You have ssh access to all servers listed on the AWS console.

  • SSH to the aud server first; from there you can ssh to all the others using their private IPs, which are shown on the AWS console (see the sketch after this list).

  • Aliases are set up for easy access, e.g.:

  • worker_1, worker_2, spot pri.vat.e.ip, elasticsearch_00x, redis
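
A minimal sketch of hopping through aud; the local "aud" host entry, the user name and the IP are placeholders, use whatever your ssh config and the AWS console say:

    ssh aud            # the jump host everyone has access to
    ssh worker_1       # the aliases above work once you are on aud
    # or in one hop with plain OpenSSH, using the private IP from the console:
    ssh -J aud ubuntu@10.0.1.23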

Papertrail and Airbrake are your friends!

  • Logs are prefixed with class names so you know where they come from.

  • Errors on Airbrake contain context information so you know how the code was started (request params for Rails, self.perform params for the backend).

What is running?

  • Spot instances should have 10 workers each.

  • Two permanent workers (worker_1, worker_2) should have fewer.

  • Remember it all runs on EC2 and you can always ssh to them from the aud server, which you all have access to.

What is running on an individual spot instance?

  • Ssh to it.

  • Look at the crontab to see recurring processes (a few quick checks are sketched after this list).

  • Who set up those crontab entries? oma-chef did.

  • Look at the /etc/init/ folder and see the files starting with app-

  • Who set them up? A script? Who set up the script? oma-chef did.

  • Where are the logs? /var/logs/app

  • Why are the logs there? Because oma-chef said so.
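
A few quick checks on a spot instance, assuming Upstart (since the jobs live in /etc/init/) and the log path above; exact file names will differ:

    crontab -l                       # the recurring processes oma-chef set up
    ls /etc/init/ | grep '^app-'     # the app jobs oma-chef generated
    sudo initctl list | grep app-    # which of them are actually running
    tail -f /var/logs/app/*.log      # the logs, where oma-chef said they go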

What is running on the permanent instances?

  • ssh aud, ssh worker_1

  • ssh aud, ssh worker_2

  • see previous question

How do you manage oma-chef?

  • All code is in oma-chef git repository.

  • Check it out, make changes, push it back to master.

  • Ssh to a worker node and in the home folder run chef_rerun.sh.

  • In that script you can find commands you can use to shorten the code change / test cycle (the full loop is sketched below).
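
The whole loop, sketched; the repository URL is a guess, use wherever oma-chef actually lives:

    git clone git@github.com:ORG/oma-chef.git && cd oma-chef
    # ...edit recipes...
    git commit -am "tweak worker recipes" && git push origin master
    ssh aud
    ssh worker_1
    ./chef_rerun.sh      # in the home folder; open it to find the shortcuts for faster iteration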

Statistics - graphite.omamatic.com

  • Click through the folders: stats.counters.oma_processors.production. From that point on, folders are organized the same way class names are organized, so it's the same as navigating through the code. If you find a stat_event call in the code you should be able to find its chart, and vice versa (a render API sketch follows).
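
If you prefer raw numbers over the UI, Graphite's render API works too; the segment after "production." is made up, substitute the real class/metric path:

    curl -s 'http://graphite.omamatic.com/render?from=-1h&format=json&target=stats.counters.oma_processors.production.SomeProcessor.processed'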

Alerts - syren

  • graphite.omamatic.com:8080 is a simple UI where you set up monitoring on metrics from the statistics, and it sends messages into the alerts room on Slack. Some error counters are sensitive but not dramatic; others are not so sensitive but dramatic.

Panic mode

front-end maintenance mode

https://devcenter.heroku.com/articles/maintenance-mode
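
A minimal sketch with the Heroku CLI; the app name is a placeholder:

    heroku maintenance:on  --app oma-front-end    # users get the maintenance page
    # ...fix things...
    heroku maintenance:off --app oma-front-end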

PG crash or timing out

  • start pg console

  • kill long-running queries (see the psql sketch below)

  • restart the server (if killing the queries doesn’t help)

  • think about the queries you just killed: why do they take so long or use so much CPU?
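
A minimal psql sketch, assuming Postgres 9.2+ column names and that you have a connection string handy (DATABASE_URL is a placeholder):

    psql "$DATABASE_URL" <<'SQL'
    -- what has been running for more than 5 minutes?
    SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle' AND now() - query_start > interval '5 minutes'
    ORDER BY runtime DESC;
    SQL
    # cancel one of them (12345 is a pid from the list above);
    # use pg_terminate_backend if pg_cancel_backend is ignored
    psql "$DATABASE_URL" -c "SELECT pg_cancel_backend(12345);"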

redis crash

pause processing

  • ssh to redis

  • if the process is alive and responding: try to dump the data to disk, restart, reload from disk - this happened once after deleting a big chunk of data (see the redis-cli sketch below)

  • if the process is dead: start it
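
A redis-cli sketch of the same steps; the service name is a guess, check how the process is actually managed on that box:

    redis-cli ping                      # alive? should answer PONG
    redis-cli bgsave                    # dump the data to disk before touching anything
    redis-cli lastsave                  # the timestamp moves once the dump has finished
    sudo service redis-server restart   # or however the process is started on that box
    redis-cli info persistence          # after the restart, confirm the RDB file was loaded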

data in redis is lost

  • this may cause some problems but it's not the end of the world

  • restore important settings

  • queues config

  • reload alexa1m - read the class's code

  • search the code for “set_setting” and think, for each of them, whether the default is OK (grep sketch below)
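
Grep is enough for this; run it from the root of the repo you are checking:

    grep -rn "set_setting" .    # every setting that might need restoring after a Redis wipe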

ES cluster problems

  • pause processing

  • disable shard re-allocation

  • try to get the cluster up, read the logs, restart failing nodes (curl sketch below)
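
A curl sketch, run from any node after ssh-ing to elasticsearch_00x; this uses the standard cluster settings API:

    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # stop shards from shuffling around while nodes are flapping:
    curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
    # once the cluster is stable again, re-enable allocation:
    curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'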

ES queries time out and make the cluster slow

  • pause processing

  • try to delete some data (see the curl sketch after this list):

  • old data

  • data belonging to deleted projects

  • keywords

  • domains

  • investigate the query

  • can you optimize it by filtering the data better and performing the aggregations on a smaller dataset?
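
A curl sketch for finding and dropping old data; the index name is made up, check _cat/indices for the real ones:

    curl -s 'http://localhost:9200/_cat/indices?v'          # what exists and how big it is
    curl -XDELETE 'http://localhost:9200/mentions-2015.01'  # drop a whole old index if data is split by time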

bigger ES problems

  • partition the data differently

  • routing (sketched after this list)

  • specialized indexes

  • accept the limits of combinatorial explosions and do it differently, not in real time

  • no easy way
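
For the routing idea above, a hedged sketch of what it can look like in ES (index, type and field names are made up; the syntax assumes an older ES with types): send all of a project's documents to one shard, then search with the same routing value so only that shard does the work.

    curl -XPUT 'http://localhost:9200/mentions/mention/1?routing=project_42' \
      -d '{"project_id": 42, "text": "example document"}'
    curl -XGET 'http://localhost:9200/mentions/_search?routing=project_42' \
      -d '{"query": {"term": {"project_id": 42}}}'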

Deploy - failures

front-end

  • in essence there are 3 things you need to think of:

    • deploying code is just pushing a specific revision to heroku instead of github

    • the database needs to be compatible with the code you're deploying; migrating is just a rake task you can always change to get out of trouble

    • assets need to be compatible with the code you're deploying; building and pushing them is just another rake task. All these tasks are executed by the deploy scripts in the root dir and can be run just by copy-pasting them into the console (see the sketch after this list).
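
A sketch of the moving parts; the app name and the sha are placeholders, and the exact rake tasks live in the deploy scripts:

    git push -f heroku 1a2b3c4:master                 # deploy (or roll back to) a specific revision
    heroku run rake db:migrate --app oma-front-end    # keep the database compatible with that code
    # assets: copy the asset build/push rake task out of the deploy scripts in the root dir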

back-end

  • a backend failure is usually not immediately perceived by users

  • you can always just kill all the spot instances and start fresh with new ones (AWS CLI sketch below)
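
An AWS CLI sketch for the nuke-and-restart option; the filters and instance ids are placeholders, check the console for how the spot workers actually appear:

    aws ec2 describe-instances \
      --filters "Name=instance-lifecycle,Values=spot" "Name=instance-state-name,Values=running" \
      --query 'Reservations[].Instances[].InstanceId' --output text
    aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # list every id from the command above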

Updating ES indexes

  • examples in oma-models/db/elasticsearch

  • there are cases for reindexing data in a new index

  • there are cases for updating data on an existing index

  • there are cases for reindexing data, and for development imports from one cluster to another
