How to establish and run a techops team
As a company grows, there comes a point when it’s no longer feasible for the founders and programming gurus to keep maintaining the servers. Your clients are calling in the middle of the weekend just to tell you that a server is down, and you didn’t even notice. Sound familiar? It’s time to spin up a TechOps team.
TechOps stands for Technical Operations. It’s very closely related to a DevOps (Development and Operations) team, and in some organizations the two are the same. The TechOps task is to maintain your fleet of servers, most importantly your production servers, and to make sure that production is working by constantly monitoring the performance of both hardware and software. This goes somewhat beyond the traditional sysadmin role, because a TechOps team is primarily responsible for the production environment. If it also needs to get its hands dirty maintaining servers, then so be it, but the idea is that these are your valuable guys who make sure that everything is working as it should.
In a small company it’s usually not necessary to have dedicated guys on the TechOps team. In our company, for example, we have one dedicated guy and two additional guys who share the TechOps duties. Those two are actually the software designers of the two different products our company runs, which has been a great benefit: the TechOps team always has direct knowledge of all the products we are supposed to be operating. The one full-time TechOps guy writes automation scripts and monitoring scripts and does most of the small day-to-day operations.
Knowing how your production runs
TechOps should know everything that’s happening on the production servers. This means that they should understand how the products actually work, at least in some detail. They can also be your DBAs (Database Administrators), taking a close look at the actual database operations and talking constantly with the programmers who design the applications’ database layers.
TechOps should manage how new builds of the software are distributed and deployed on the production servers, and they should be the only guys who actually need to log into the production boxes. You don’t give your developers access to the production servers, because they might not know how the delicate TechOps automation scripts work on those boxes, and you don’t want any outsiders messing the systems up.
Monitoring your production
Monitoring is a key part of TechOps. You should always have some kind of monitoring software running which gathers data from your servers, applications and networks, displays it in an easy, human-readable way and triggers warnings and alerts if something unexpected happens. The most common tools for this job are Nagios and Zabbix; I prefer Zabbix. Your monitoring software should also store metrics from the past, allowing operators to look for odd patterns and helping with root cause analysis.
Whatever your monitoring solution is, you need to maintain it constantly. You can’t let old triggers and items lie around in a degraded state. If you remove a piece of software or a server, you need to remove the matching items from your monitoring software as well. When you add something new to your production environment, it’s the TechOps guys who work out how the new piece is monitored and who set up that monitoring. They also need to watch the new component for a while so they can adjust the alerts for the different meters (Zabbix calls these “items”), so that an appropriate alert is escalated when needed.
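The cleanup described above can be checked mechanically. Here is a minimal sketch in Python, assuming you can export two plain lists of host names: one from whatever inventory you keep and one from your monitoring software. The function name and the sample host names are made up for illustration, and nothing here talks to Zabbix or Nagios directly.

```python
def find_monitoring_drift(inventory_hosts, monitored_hosts):
    """Compare the hosts you actually run against the hosts you monitor.

    Both arguments are plain iterables of host names. How you export them
    from your inventory and monitoring systems is up to you (assumption).
    """
    inventory = set(inventory_hosts)
    monitored = set(monitored_hosts)
    return {
        # Still monitored, but the host is gone: remove these from monitoring.
        "stale": sorted(monitored - inventory),
        # Running in production, but nobody is watching: add monitoring.
        "unmonitored": sorted(inventory - monitored),
    }


# Hypothetical sample data.
drift = find_monitoring_drift(
    inventory_hosts=["web1", "web2", "db1", "cache1"],
    monitored_hosts=["web1", "web2", "web3", "db1"],
)
print(drift)
```

Running something like this on a schedule would surface both forgotten triggers and new components that nobody set up monitoring for.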
TechOps usually also take shifts to be on standby in case there are problems in production. This is mostly combined with the monitoring solution, which sends alerts as emails, SMS messages and push notifications to the phone of the on-duty TechOps engineer. Because of this, you need to design proper escalation procedures for your environment.
As an example, our company tracks about 7800 Zabbix items from around 100 hosts. We also run Graphite to get analytics from inside our production software, which tracks an additional 2800 values.
Design and implement meaningful alerts for your environment
Zabbix divides alerts into the following severities: Information, Warning, Average, High and Disaster. You should plan your triggers so that each one raises an alert with the appropriate severity. We have found the following rule of thumb to work very well:
- Information: Indicates some anomaly which is not dangerous but should be noted.
- Warning: Indicates a warning condition within the system, but does not affect production in any way.
- Average: Action needed during working hours. Does not affect production, but the system is in a degraded state and needs maintenance. You can go for lunch before taking action on this.
- High: Immediate action needed during working hours and during alert service duty. Production systems affected.
- Disaster: Immediate action needed around the clock. Money is leaking, production is not working. All is lost.
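The rule of thumb above can be written down as a small routing function. This is only a sketch of one way to read it: the three actions ("page", "ticket", "log") and the two flags are my own assumptions, and I’m assuming that a High alert outside working hours and outside alert service duty can wait as a ticket. The numeric values are there only to give the severities an order.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFORMATION = 1
    WARNING = 2
    AVERAGE = 3
    HIGH = 4
    DISASTER = 5


def route_alert(severity, working_hours, on_duty):
    """Decide what to do with an alert.

    working_hours: True if it is currently office hours.
    on_duty: True if an alert service duty shift is currently active.
    Returns "page", "ticket" or "log" (hypothetical action names).
    """
    if severity is Severity.DISASTER:
        return "page"  # immediate action around the clock
    if severity is Severity.HIGH:
        # Immediate action during working hours and during duty shifts.
        return "page" if (working_hours or on_duty) else "ticket"
    if severity is Severity.AVERAGE:
        return "ticket"  # handle during working hours, lunch can come first
    return "log"  # Information and Warning are only noted


print(route_alert(Severity.HIGH, working_hours=False, on_duty=True))
print(route_alert(Severity.AVERAGE, working_hours=True, on_duty=False))
```

Keeping the decision in one place like this makes it easy to review with the whole team and to adjust when your duty arrangements change.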
For example, say you have three web frontend servers in a high-availability configuration behind a load balancer and one of those servers goes down. This should not affect your production, because the two remaining servers have enough capacity to handle the load just fine, so this is an Average problem. If another server goes down, this will most likely affect your production performance and you have no servers left to spare, so this is a High problem. If all your servers go down, it’s a Disaster.
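The frontend example can be expressed as a small function. This is a sketch only: the function and parameter names are made up, and `needed` (the smallest number of servers that can still carry the full load) is a number you have to know for your own system. In the example above it is 2.

```python
def pool_severity(total, healthy, needed):
    """Map the state of a load-balanced server pool to an alert severity.

    total:   number of servers in the pool
    healthy: number of servers currently up
    needed:  minimum number of servers that can carry the full load (assumption)
    Returns a severity name, or None when the whole pool is healthy.
    """
    if healthy <= 0:
        return "Disaster"  # nothing is serving traffic
    if healthy < needed:
        return "High"      # production performance is affected
    if healthy < total:
        return "Average"   # degraded, but production is fine
    return None


# Three frontends, two of which are enough to handle the load.
for up in (3, 2, 1, 0):
    print(up, "healthy ->", pool_severity(total=3, healthy=up, needed=2))
```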
Do not accept single points of failure
TechOps should require that all parts of the production environment are robust enough to handle failures. Anything can break, so you must have redundancy designed into your architecture from the ground up. You simply should not accept a solution which can’t be clustered for high availability. TechOps must be in constant discussion with the software developers to determine how a certain piece of software can be deployed safely into the production environment.
TechOps should log each incident somewhere, so you can use the log later to spot trends and to determine whether some frequently occurring problem can be fixed permanently. The log also doubles as an audit trail for paying the on-duty TechOps engineers appropriate compensation for waking up at 4 AM to fix your money-making machine.
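An incident log does not need to be fancy to be useful. Below is a minimal sketch, assuming each incident is recorded with a start time, a host, a short cause label, the engineer who handled it and whether it happened outside working hours. The field names and the sample records are invented for illustration; a spreadsheet or a ticketing system would do the same job.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    started: datetime
    host: str
    cause: str          # short free-form label, e.g. "disk full"
    engineer: str
    out_of_hours: bool  # True if handled outside working hours


def recurring_causes(incidents, min_count=2):
    """Causes seen at least min_count times: candidates for a permanent fix."""
    counts = Counter(i.cause for i in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


def out_of_hours_callouts(incidents):
    """Number of out-of-hours incidents per engineer, for compensation."""
    return Counter(i.engineer for i in incidents if i.out_of_hours)


# Hypothetical sample log.
log = [
    Incident(datetime(2024, 1, 3, 4, 0), "db1", "disk full", "alice", True),
    Incident(datetime(2024, 1, 9, 14, 30), "web2", "out of memory", "bob", False),
    Incident(datetime(2024, 1, 17, 4, 10), "db1", "disk full", "alice", True),
]
print(recurring_causes(log))
print(out_of_hours_callouts(log))
```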
Automate as much as you can
Modern tools, especially when your infrastructure is in the cloud, allow you to automate a great deal of day-to-day operational tasks. Replacing a failed database instance should be just two clicks away (one for the action and another for confirmation). Remember, you can use scripts to do everything that a human operator does! It’s not always cost-efficient to write scripts for everything, but once you find yourself doing the same thing for the third time, you know that you should be writing a script for it instead of doing it manually yet again.
Chef and Puppet are examples of great pieces of automation software which will help you to manage your servers and the software running inside. They both have a steep learning curve, but it’s well worth it.
Running TechOps is a never-ending learning process. Starting a team is hard, but as the team progresses, the day-to-day operations become more and more efficient and you are rewarded with a better-running production environment and higher uptime. It also helps your helpdesk operations, because your clients won’t see your servers being down. Ultimately this results in a good ROI, because you will have less downtime and faster reaction times when things go wrong.