App maintenance crash course notes
Don’t be afraid!
The point is to ask the right questions and to be able to find the answers.
Checks
Look at Papertrail, Airbrake, and Syren. When errors come into Slack, don't ignore them. Some might not be critical, but you should develop a sense of what is going on and which ones can be ignored.
Queues being too big:
Sometimes it is just better to clear the queue and unstick things by re-queuing instead of staying in a stuck state for weeks. Losing some data is better than a total halt. Still, this doesn't mean you can ignore the problem.
worker_2 doesn't auto deploy:
It doesn't deploy by itself because of a silly error generating the config JSON; I would be really happy if you figure it out. I did try :( To fix it by hand: ssh to worker_2, open hot_update.sh in vim, then
change from:
{"run_list": [ "["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]","["role[processor_box_worker_2]"]" ]}'
to:
'{"run_list": ["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]}'
then it will auto update
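For reference, a minimal sketch of how the corrected run list would typically be consumed, assuming hot_update.sh hands the JSON to chef-client (the temp file path is a placeholder):
cat > /tmp/run_list.json <<'EOF'
{"run_list": ["recipe[processor_box::worker_2_attributes]", "recipe[processor_box::deploy]", "recipe[processor_box::hotdeploy]"]}
EOF
sudo chef-client -j /tmp/run_list.json   # -j / --json-attributes feeds the run_list to chef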
SSH-ing around
You have ssh access to all servers listed on the AWS console.
SSH to the aud server first; from there you can ssh to all the others using their private IPs. The private IPs are shown on the AWS console.
Aliases are set up for easy access, like:
worker_1, worker_2, spot pri.vat.e.ip, elasticsearch_00x, redis
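These aliases typically live in the ssh config on the aud server; a hypothetical sketch (hostname, IP, and user are placeholders, check the actual ~/.ssh/config):
Host worker_2
    HostName 10.0.1.23     # private IP from the AWS console (placeholder)
    User deploy            # placeholder user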
Papertrail and Airbrake are your friends!
Logs are prefixed with class names so you know where they come from.
Errors on Airbrake contain context information so you know how the code was invoked (request params for Rails, self.perform params for the backend).
What is running?
Spot instances should have 10 workers each.
Two permanent workers (worker_1,worker_2) should have less.
Remember it all runs on EC2 and you can always ssh to the instances from the aud server, which you all have access to.
What is running on an individual spot instance?
Ssh to it.
Look at crontab to see recurring processes.
Who set up those crontab entries? oma-chef did.
Look at the /etc/init/ folder and see the files starting with app-.
Who set them up? A script? Who set up the script? oma-chef did.
Where are the logs? /var/logs/app
Why are the logs there? Because oma-chef said so.
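A quick hands-on version of the checks above (log path as per these notes; adjust if it differs on the box):
crontab -l                        # recurring processes
ls /etc/init/ | grep '^app-'      # init jobs set up by oma-chef
ls /var/logs/app                  # where the logs live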
What is running on the permanent instances?
ssh aud, ssh worker_1
ssh aud, ssh worker_2
see previous question
How do you manage oma-chef?
All code is in oma-chef git repository.
Check it out, make changes, and push them back to master.
SSH to a worker node and run chef_rerun.sh in the home folder.
In that script you can find commands you can use to shorten the change/test/re-run cycle; see the sketch below.
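A hypothetical end-to-end cycle, assuming the usual git workflow (the repository URL is a placeholder):
git clone <oma-chef repo URL> && cd oma-chef     # placeholder URL
# edit recipes, then:
git commit -am "tweak worker recipe" && git push origin master
ssh aud
ssh worker_1
./chef_rerun.sh            # re-run chef on the node with the updated recipes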
Statistics - graphite.omamatic.com
Click through the folders: stats.counters.oma_processors.production. From that point on, folders are organized the same way class names are organized, so it's the same as navigating through the code. If you find a stat_event call in the code you should understand how to find its chart, and vice versa.
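If you prefer the command line, Graphite's render API returns the same series as the UI; a sketch with a made-up metric name under the prefix above:
# 'SomeWorker.events_processed' is a hypothetical metric name
curl 'http://graphite.omamatic.com/render?target=stats.counters.oma_processors.production.SomeWorker.events_processed&from=-1h&format=json'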
Alerts - syren
graphite.omamatic.com:8080 is a simple UI where you monitor metrics from the statistics above, and it sends messages into the alerts room on Slack. Some error counters are sensitive but not dramatic. Others are not so sensitive but dramatic.
Panic mode
front-end maintenance mode
https://devcenter.heroku.com/articles/maintenance-mode
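With the Heroku CLI this is one command (the app name is a placeholder):
heroku maintenance:on --app your-frontend-app     # placeholder app name
# ...fix things...
heroku maintenance:off --app your-frontend-app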
PG crash or timing out
start a pg console
kill long-running queries (see the sketch after this list)
restart the server (if killing the queries doesn't help)
think about the queries you just killed and why they took so long or used so much CPU
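A minimal sketch for finding and killing long-running queries, assuming Heroku Postgres and a reasonably recent PG version (the app name and the pid are placeholders):
heroku pg:psql --app your-frontend-app            # placeholder app name
SELECT pid, now() - query_start AS runtime, query
  FROM pg_stat_activity
 WHERE state = 'active'
 ORDER BY runtime DESC;                           -- list active queries, longest first
SELECT pg_terminate_backend(<pid>);               -- kill one by pid (gentler: pg_cancel_backend)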
redis crash
pause processing
ssh to redis
process is alive and responding
try to dump data to disk, restart, reload from disk (this happened once after deleting a big chunk of data); see the sketch after this list
process is dead
start it
data in redis is lost
this may cause some problems but not the end of the world
restore important settings
queues config
reload alexa1m - read the class's code
search the code for "set_setting" and think, for each of them, whether the default is OK
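A hypothetical sketch of the dump/restart cycle for a Redis that is alive but struggling (the service name is an assumption and may differ on the box):
redis-cli BGSAVE                                           # dump data to disk (RDB snapshot)
redis-cli INFO persistence | grep rdb_bgsave_in_progress   # wait until this is 0
sudo service redis-server restart                          # service name is an assumption
redis-cli PING                                             # should answer PONG once it has reloaded the dump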
ES cluster problems
pause processing
disable shard re-allocation (see the sketch after this list)
try to get cluster up, read logs, restart failing nodes
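A sketch of toggling shard allocation via the cluster settings API (elasticsearch_001 stands for one of the elasticsearch_00x aliases; the exact setting name depends on the ES version):
# disable allocation while nodes are restarting
curl -XPUT 'http://elasticsearch_001:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
# read the logs, restart the failing nodes, then re-enable
curl -XPUT 'http://elasticsearch_001:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'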
ES queries time out / make the cluster slow
pause processing
try to delete some data (see the sketch after this list)
old data
data belonging to deleted projects
keywords
domains
investigate the query
can you optimize it by filtering the data better and performing aggregations on a smaller dataset?
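A hypothetical example of deleting data for removed projects (index name, field name, and IDs are assumptions; older ES versions expose delete-by-query as DELETE /<index>/_query, newer ones as POST /<index>/_delete_by_query):
curl -XPOST 'http://elasticsearch_001:9200/keywords/_delete_by_query' -d '{"query": {"terms": {"project_id": [123, 456]}}}'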
bigger ES problems
partition the data differently
routing
specialized indexes
accept the limits of combinatorial explosion and do it differently, not in real time
no easy way
Deploy - failures
front-end
in essence there are three things you need to think of:
deploying code is just pushing a specific revision to Heroku instead of GitHub
the database needs to be compatible with the code you're deploying; migrating is just a rake task you can always change to get yourself out of a problem
assets need to be compatible with the code you're deploying; building and pushing them is just another rake task. All these tasks are executed by the deploy scripts in the root dir and can be run by copy-pasting them into the console (see the sketch below).
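A minimal manual version of that flow, assuming the standard Heroku git remote and rake tasks (the app name, revision, and the asset task are placeholders; check the deploy scripts in the repo root for the real ones):
git push heroku <sha-or-branch>:master                # deploy a specific revision to Heroku instead of GitHub
heroku run rake db:migrate --app your-frontend-app    # placeholder app name
# assets: copy the asset build/push rake task out of the deploy scripts and run it the same way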
back-end
a backend failure is usually not immediately perceived by users
you can always just kill all spot instances and start fresh with new ones
Updating ES indexes
examples in oma-models/db/elasticsearch
there are cases for reindexing data into a new index
there are cases for updating data on an existing index
there are cases for importing data from one cluster to another (e.g. for development)