One of the biggest challenges for a fast-growing startup is keeping everything running while dealing with an ever-increasing load from an ever-increasing number of users. Here at WePay, that’s a challenge we’re happy to have: we’re on track to more than double payment volume this year, and looking to double again the year after that.
That’s why we’re excited to have David Nye onboard as our Director of DevOps and IT, working to make our system even more robust to meet the needs of our partners as we grow. I caught up with David to hear a little more about DevOps and what he’s working on here at WePay.
Tell me a bit about your background.
My background goes back 30 years in the computer industry, to when I came out of college. One of the first things I learned in college was punch cards, and from there I moved on to reel-to-reel tapes and disk packs. I worked for Digital Equipment on what would be considered Big Iron: those old two-story data centers with disk drives that were as big as this room. Then I ended up working at a company called Bolt Beranek and Newman. We were the lead DARPA contractor for what we now know as the Internet, you know, the sort of pre-commercial, research-only pre-internet. I worked on that for about 5 years.
I worked at Bay Networks, which was the big competitor to Cisco at the time. I worked at a number of other places — I’ve worked for Intel and Verizon, I’ve worked in the entertainment industry, I built server clusters for render farms. So I’ve really done a lot of different things over the course of my career.
So you’re Director of DevOps. What does that mean for the non-technical folks?
DevOps is a thing where there’s a lot of different definitions or meanings for it. It’s probably best termed as a methodology that development and operations folks use throughout the process to operationalize the product they’re dealing with. DevOps means always thinking about how the code you’re writing will be managed at scale. It’s thinking about metrics.
So the DevOps teams that are set up to handle this are usually evangelists: they’re working with the developers, the Q/A teams, the operations teams. Sometimes they’re the active participants in those groups, sometimes they’re the support.
And what we’re trying to do here is sort of institutionalize that methodology across those groups, so we can give management the telemetry into what we’re doing on a daily basis.
So what are some concrete steps you’re taking to do that?
Well one of the first things I came in to do was stabilize what we currently have. There was some technical debt that had developed over several months that needed to be looked at. So we added some services, added some servers, created some redundancy, and we’re beginning to add metrics so we can get a better look at what’s happening on our network.
We’ve added a lot of redundancy around server maintenance, so that if one server goes down there’s a backup server to go to. Basically, we’ve been systematically locating and eliminating single points of failure across our system. Those always occur, of course. But your ability to stay up during maintenance or during other events depends on hunting out those points of failure and eradicating them at all costs.
How do you define success?
Success for us is 100 percent availability, 100 percent uptime for our partners, combined with very low latency on that availability. Those have to work together: you can have 100 percent uptime, but if the latency is high, it doesn’t really matter. And we’re getting closer to that goal.
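To make the point about uptime and latency working together concrete, here is a minimal sketch in Python. It is not WePay's tooling: the sample format (a success flag plus a latency in milliseconds), the function names, and the thresholds are all assumptions made for illustration.

```python
import math


def availability(samples):
    """Fraction of requests that succeeded; samples are (ok, latency_ms) pairs."""
    return sum(1 for ok, _ in samples if ok) / len(samples)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(samples, min_availability, max_p99_ms):
    """True only if BOTH the availability and the latency goals are met."""
    latencies = [ms for ok, ms in samples if ok]
    if not latencies:
        return False
    return (availability(samples) >= min_availability
            and percentile(latencies, 99) <= max_p99_ms)


# Hypothetical data: every request succeeds, but 5 out of 100 are very slow.
slow_tail = [(True, 50)] * 95 + [(True, 2000)] * 5
print(availability(slow_tail))              # all requests succeeded
print(meets_target(slow_tail, 0.999, 300))  # still fails on latency
```

In this made-up data set nothing ever fails, yet the check does not pass, because the slowest requests push the 99th-percentile latency far above the assumed 300 ms limit.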
What is one bit of advice you’d give companies that are looking to improve how they implement DevOps?
The big thing is metrics. If you can’t measure it, you don’t know if it happened or not. That’s an old saying, but it’s true. You have to be able to measure. Actually, the thing that DevOps is known for, our methodology, is metrics. So my advice is to work on that first: you will have no idea what’s really wrong until you establish good metrics, and if you don’t know what’s wrong, you can’t fix it.
Most companies don’t measure enough. I mean, there’s a tradeoff, you don’t want to log everything because it causes latency. But what I try to tell everyone is that they need to develop with measurement layers available that I can turn on when I need them. If it’s 3 a.m. and there’s a problem, if I can flip a switch and be logging heavily for 5 to 10 minutes I can probably find out what’s happening. Then I can turn it off. But if it’s not available in the code, that switch that lets me do that, everything is so much harder. So you have to build with that in mind.
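The "switch" described above could look something like the following minimal Python sketch, built on the standard `logging` module. It is an illustration, not WePay's implementation: the `TimedVerboseSwitch` class, the `payments` logger name, and the 10-minute window are all hypothetical.

```python
import logging
import time


class TimedVerboseSwitch:
    """Raise a logger to DEBUG for a limited window, then fall back."""

    def __init__(self, logger, normal_level=logging.WARNING, clock=time.monotonic):
        self.logger = logger
        self.normal_level = normal_level
        self.clock = clock          # injectable so the timing can be tested
        self.expires_at = None
        logger.setLevel(normal_level)

    def enable(self, seconds):
        """Flip the switch on: log heavily for the next `seconds` seconds."""
        self.expires_at = self.clock() + seconds
        self.logger.setLevel(logging.DEBUG)

    def disable(self):
        """Flip the switch off and return to the normal, quieter level."""
        self.expires_at = None
        self.logger.setLevel(self.normal_level)

    def tick(self):
        """Call periodically (for example once per request) to expire the window."""
        if self.expires_at is not None and self.clock() >= self.expires_at:
            self.disable()

    @property
    def verbose(self):
        self.tick()
        return self.expires_at is not None


log = logging.getLogger("payments")  # hypothetical logger name
switch = TimedVerboseSwitch(log)

switch.enable(10 * 60)               # 3 a.m.: log heavily for 10 minutes
log.debug("detailed state that is normally too expensive to record")
switch.tick()                        # once the window has passed, this turns it off
```

Because the window expires on its own, heavy logging cannot be left running by accident, which keeps the latency cost mentioned above limited to the minutes when the extra detail is actually needed.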