Giant Robots Smashing Into Other Giant Robots

403: Mission Control with Joe Ferris

December 9th, 2021

Joe Ferris is thoughtbot's CTO and Managing Director of the thoughtbot DevOps and maintenance team known as Mission Control. Mission Control is our newest team, doing DevOps support, maintenance, and SRE (Site Reliability Engineering). Rather than building products or pairing with team members to improve their teams like the rest of thoughtbot, the goal of Mission Control is to support those teams and other client teams in deploying and scaling applications. They have an on-call team and do more complex cloud build-outs, with the goal of empowering and educating the teams we work with so that they are more capable of working in those ecosystems on their own.

Become a Sponsor of Giant Robots!

Transcript:

CHAD: This is the Giant Robots Smashing Into Other Giant Robots Podcast, where we explore the design, development, and business of great products. I'm your host, Chad Pytel. And with me today is Joe Ferris, thoughtbot's CTO and Managing Director of the thoughtbot DevOps and maintenance team known as Mission Control. Joe, welcome back to the show.

JOE: Thanks, Chad. It's been a while.

CHAD: It has been a while. I think you were the first-ever guest, if I'm not mistaken.

JOE: I believe that's right. We talked about null, I think.

[laughter]

CHAD: Yeah. And it would have been with Ben back when I was just a listener and maybe producer. So welcome back to the show. It's been a long time, and a lot has changed at thoughtbot over the years. I've been talking to each of the managing directors of the new teams, and I wanted to be sure to have you on. Why don't we take a little bit of a step back and talk about Mission Control? When we say DevOps and maintenance, what do we mean? And what does Mission Control do?

JOE: Sure. Mission Control is our newest team doing DevOps support, and maintenance, and SRE. It came out of our experiments with DevOps a while ago now, coming up on two years. Historically, thoughtbot has shied away from getting too much into DevOps. I think a lot of us had some unpleasant experiences earlier in our careers around sysadmin tasks and the expectations there. Not a lot of people have wanted to be on call historically.

So we've heavily leveraged services like Heroku that take a lot of that burden away from you and avoided doing things like direct to AWS deployments or getting too involved with CI/CD pipelines that were particularly complex. But we've had clients over the years that have requested more interesting or more difficult deployments.

And finally, we had one a couple of years ago, where we said, "All right, let's just handle this instead of saying no or trying to outsource it." We thought it made sense for them. And after going through it, we came to the conclusion that it was actually pretty good, that the ecosystem had evolved a lot, and that it was a service worth offering. That began our journey into DevOps, so to speak. So we did some smart DevOps work for a variety of clients over the next year or so before we decided to form an official team doing this new kind of work, which is how we ended up with Mission Control.

Rather than building products or pairing with team members to improve their teams like the rest of thoughtbot, the goal of Mission Control is to support those teams and support other client teams in deploying and scaling their applications. And we have an on-call team. We will do more complex cloud build-outs. And our goal is to empower and educate the teams that we work with so that they are more capable of working in those ecosystems on their own.

CHAD: You used the acronym SRE earlier in that little spiel. I'm not sure that everyone knows what that is. [laughs] So it stands for Site Reliability Engineer, right?

JOE: That's right. And that's been newer for us. So DevOps is supposed to be the fusion of development and operations. But the operations world is really big. So similar to how everybody has problems getting people to be full-stack enough given the complexity of front end and back end, we have similar problems in design.

We also have that problem in DevOps where both development and operations are huge, rich ecosystems. And so, having developers that are fully experienced at both is hard. So the path of least resistance, when you say you're doing DevOps, is definitely just to do operations. And it's been a struggle for us to actually break down those silos and have teams work more on the operations side on their own.

So one of the things that caught our eye with SRE was some of the built-in mechanisms for engaging with the team. The one-sentence pitch for SRE is that it is operations if you approach it like a software problem. It has these concepts of SLOs, Service Level Objectives, and error budgets, which cap the amount of time you can spend violating your SLO. And part of the process is getting buy-in from the entire team, from the stakeholders down to the developers and the operations team. And so, it provides a natural interaction point between the operations folks and the rest of the team because nobody wants to break the error budget. Once the error budget is exhausted, everybody has to stop building new features and focus on stability until the error budget builds back up again.

So rather than having this unpleasant give or take where we're more coming from the operations side, and we're always pushing for more stability, and everybody else is coming from the product side, and they're always pushing for more features, SRE gives you this useful metric to have that conversation around where we're not always just pushing for more. We're trying to hit a specific goal that we've agreed on. And when we hit the goal, we know that we can keep moving out new features at full throttle.
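
To make the error budget concrete, here's a rough sketch of the arithmetic Joe is describing; the 99.9% target, 30-day window, and burned minutes are illustrative assumptions, not numbers from the episode.

```python
# Illustrative error budget arithmetic (numbers are assumptions, not from the episode).
WINDOW_MINUTES = 30 * 24 * 60        # a 30-day rolling window
SLO_TARGET = 0.999                   # 99.9% availability objective

# The error budget is the slice of the window you're allowed to spend out of SLO.
error_budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)   # ~43.2 minutes

# Suppose 28 of those minutes have already been burned this window.
burned_minutes = 28
remaining_minutes = error_budget_minutes - burned_minutes

print(f"budget={error_budget_minutes:.1f} min, remaining={remaining_minutes:.1f} min")
# When the remaining budget hits zero, feature work pauses and the team focuses
# on stability until the rolling window replenishes the budget.
```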

CHAD: Now, is the SRE a developer who is also working on resolving errors before the budget is hit?

JOE: Yeah, a Site Reliability Engineer is a developer. But that's actually not too different from other forms of DevOps. DevOps is supposed to be developers in general. When I say we built an operations team, even if you look at the work that we're doing, a lot of it is development work. We build scripts, and automations, and so on. We don't manually set up EC2 instances, and not everything is toil, even outside of SRE.

But the idea in SRE is that somebody will be more integrated with the development team and make changes to not just the operational stack but also the development stack in service of reliability. I've heard it said that SRE is a particular implementation of DevOps. That makes sense to me.

CHAD: Let's start back at the beginning because you made reference to the fact that historically, a lot of what we deployed was deployed to Heroku. And we did that because, for a lot of the applications that we're building, it made sense. It minimized the operational overhead of deployments. There is a point in some systems where you cross a line. Where do we typically see that line being, where you need to start looking at something else?

JOE: I think there can be a few different instigating factors. One of the fastest ways for somebody to want to move to AWS is if they have significant security concerns, particularly for healthcare applications. The security model in AWS makes it more straightforward to get better isolation. There are options on Heroku, but it requires going to a different Heroku platform using Shield. And you just don't get the same power in terms of network isolation models that you get on AWS with your own VPC.

So if you're already at the point where you want to start out with a VPC out of the gate and do that kind of isolation, my opinion is you may as well own it and go to AWS. So that's one reason. Another is if you start hitting scaling issues, Heroku is easier for the developers because it's simple and it's very streamlined. But doing complex deployments is difficult, which eliminates some of the options available to somebody doing something like SRE.

So to give one example, one mechanism people can use to make it safer to deploy without potentially introducing bugs or performance degradations is a canary release where when you release, you put the new version out as the canary build. And you route maybe 5% of traffic to that, and you actually collect metrics on performance and error rates on the canary traffic versus the regular traffic.

And then you have some period where you're in experiment mode, which varies depending on the level of stability you're looking to achieve. Once you're confident that the canary release didn't introduce a regression, then it gets promoted to the stable build, and you do that every time you deploy. I have no idea how you would do that on Heroku.
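
As a rough illustration of that promotion decision, the sketch below compares the canary's error rate to the stable release's. The deployment names, threshold, and the fetch_error_rate helper are hypothetical; in practice this comparison is usually driven by the deploy tooling or service mesh rather than hand-written code.

```python
# Hypothetical canary promotion check. `fetch_error_rate` stands in for querying
# whatever metrics backend is in use; deployment names and thresholds are made up.

def fetch_error_rate(deployment: str) -> float:
    # Placeholder: in real life this would query the metrics system for the
    # fraction of failed requests over the experiment window.
    return {"web-stable": 0.002, "web-canary": 0.004}[deployment]

def should_promote(max_regression: float = 0.005) -> bool:
    stable = fetch_error_rate("web-stable")   # receiving ~95% of traffic
    canary = fetch_error_rate("web-canary")   # receiving ~5% of traffic
    # Promote only if the canary is not meaningfully worse than stable.
    return canary <= stable + max_regression

if should_promote():
    print("promote canary to stable")
else:
    print("roll back the canary and investigate")
```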

CHAD: I think you'd have to do it at the application level. You'd have to do it with a feature flag system. And you'd only be able to do some of the things you could do if you controlled the whole system.

JOE: Right. And I guess you could do weighted random numbers to try and decide whether to canary or not. But one of the benefits of doing it outside the application is there's no way to make a mistake. So, for example, if you introduce a bug in your canary mechanism in the application or you forget to put it behind a feature flag, then you've now deployed to production, and you have an error. Whereas if it's managed by the CI/CD pipeline, you're just deploying a new version of the application. In Heroku land, that would mean you deploy the new slug as a canary build. In most other areas, it means you're deploying a container image.

That's one example of why if you get to the point that you have a lot of traffic in production and you need to manage that traffic while continuing to release features, it can be helpful to work on a platform like AWS where you have a lot more deployment options.

Another one is that SRE is heavily built on observability and metrics, which can be difficult to collect on Heroku. Some of that is just a matter of lineage. Like, the SRE community was built up around tools like Prometheus that are scrape-based. That means you need to have a special metrics endpoint exposed on all of your containers.

In Heroku, there isn't a way to access any of your dynos directly except through the web router, and you can't control which one you get. So using Prometheus on Heroku is not really practical, which means you need to re-implement what everybody else has built for SRE using a different observability tool.
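
For readers unfamiliar with scrape-based metrics, here's a minimal sketch using the Python prometheus_client library; the metric names and port are illustrative. The point is that Prometheus pulls from an HTTP /metrics endpoint on each instance, which is exactly the kind of direct-to-dyno access Heroku's router doesn't provide.

```python
# Minimal scrape-based instrumentation with the Python prometheus_client library.
# The process serves an HTTP /metrics endpoint that Prometheus pulls on an interval.
# Metric names, labels, and the port are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                       # observe how long the work took
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real request handling
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics on port 8000 for the scraper
    while True:
        handle_request()
```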

And observability out of the box on Heroku is easy to get set up, but it's more limited. So doing something like complex SLOs and setting up error budget dashboards and alerting is going to be a significant task. Versus on a platform like Kubernetes, it doesn't sound like it would be easier, but it is, because there are open-source tools that you can just deploy.

CHAD: You mentioned Kubernetes. It's probably worth calling out that that's pretty much what we are using across the board, right?

JOE: For our AWS and other cloud deployments, we have standardized largely on Kubernetes. We started out using simpler containerization platforms like ECS on AWS. But what we found is that the developer tooling is generally not particularly good because there's not enough community momentum behind any of those, and the open-source ecosystem is limited. Versus something like Kubernetes, where there's a massive open-source community.

There is a ton of different tooling that people build that's available for developers and for DevOps. And for these things like SRE, you can use almost entirely open-source software to build out all of the interesting parts of that and deploy that. So what we've been building is basically an SRE Platform as a Service where we collect these open-source components. We deploy them to a managed Kubernetes cluster. And then, applications can immediately start exposing metrics to Prometheus and defining SLOs.

CHAD: So much in the same way where we talked about some of the boundaries where it starts to make sense to not be on Heroku, what are some of the boundaries that teams hit where it makes sense to start thinking about SRE or even just having someone on the team that's focused on that kind of work?

JOE: I think as soon as people start hitting their first scaling challenges. So for an MVP where you're validating a product where you don't actually have production traffic yet, I don't think it makes sense. And I also think I would avoid deploying to something like Kubernetes if you can help it for an MVP.

But for anybody who has scaling concerns, SRE is a very useful mindset. And the sooner you start adopting it, the sooner you'll start to build an application that's made to scale. It can be very difficult to put out those fires when something is not on a platform where you have many options and nobody has been thinking about observability. It means that you need to be guessing at how to put out the fire while simultaneously introducing metrics and potentially planning a cloud migration.

So I think as soon as you start feeling nervous about deploying to production or as soon as you notice that you're spending a lot of time working on performance, it makes sense to bring in SRE. I also think anybody that needs to provide an SLA should for sure implement SRE. It can be used to measure whether or not you're on track to hit an SLA because you basically set SLOs that are stricter than your SLA, and you make sure that you meet it.

CHAD: Is there a way that existing teams can layer on some of the SRE activities without having full-time SRE people?

JOE: I think you can have a team member who does development that also acts as the SRE. If you have a small team, I could see the commitment to it being daunting. I think that could be one good reason to bring in outside specialists if you're not at the point where you can afford to have a full-time SRE in-house. Working with a team that can provide an SRE on-demand like Mission Control could be valuable.

CHAD: I didn't realize that that was going to be a perfect segue into part of the value proposition of Mission Control [laughter] when I asked the question. But I guess that's a really good point. Part of what we're doing to help people is offering monthly contracts that provide this to them, even if their team can't do it 100% of the time.

JOE: Right. Except for pretty large teams, I don't think it makes sense to hire a full-time SRE. It's much easier to work with a team like ours that has the experience and has more than one person. Even if you do hire a full-time SRE, you will only have one. So if they go on vacation, or if they get sick, or if it's the middle of the night, do you still have an SRE? I think that's one of the benefits of working with a team.

CHAD: And that's been interesting with Mission Control because we introduced Mission Control and made it a formal thing at the same time as going entirely remote. And it's interesting how doing that freed us up in terms of being able to commit to building a different kind of team.

Nobody necessarily needs to be on call after hours if we're going to have an entirely remote team. We can have people on that team who span different time zones. And so, from a thoughtbot perspective, it's interesting how those things went hand in hand for us.

JOE: Yes, it's been immensely helpful for Mission Control, in particular, to be fully remote. There are a lot of options that wouldn't have been available to us if we were a U.S.-centric team. I've built out development teams before that were focused on a location. And it's been really interesting to build out this team with a focus on availability and distribution instead.

For example, one thing that has helped us is having somebody in South America because they don't celebrate U.S. holidays. So even discounting time zones, which are a challenge when you're trying to provide around-the-clock availability, just having that kind of diversity in holiday schedules really helps. So we've been able to build it totally differently than we would have if we were trying to put a bunch of people in an office. And I think it's made it possible for us to have much better coverage with a much smaller team.

Mid-roll Ad

I wanted to tell you all about something I've been working on quietly for the past year or so, and that's AgencyU. AgencyU is a membership-based program where I work one-on-one with a small group of agency founders and leaders toward their business goals.

We do one-on-one coaching sessions and also monthly group meetings. We start with goal setting, advice, and problem-solving based on my experiences over the last 18 years of running thoughtbot. As we progress as a group, we all get to know each other more. And many of the AgencyU members are now working on client projects together and even referring work to each other.

Whether you're struggling to grow an agency, taking it to the next level and having growing pains, or a solo founder who just needs someone to talk to, in my 18 years of leading and growing thoughtbot, I've seen and learned from a lot of different situations, and I'd be happy to work with you. Learn more and sign up today at thoughtbot.com/agencyu. That's A-G-E-N-C-Y, the letter U.

CHAD: So I introduced Mission Control as maintenance and DevOps. We're also helping people with different kinds of things beyond operations, right?

JOE: Yeah, particularly with SRE, there's a focus on stability and scaling. And we're also helping people with CI/CD. One of the focuses for us this quarter has been helping people develop CI/CD pipelines that provide safer deploys and providing guidance and a system for developers to implement things like feature flags and beta flags.

Because one of the challenges of making performance improvements is that you don't actually know if you've solved the problem until it's deployed, and deploying something that changes performance is inherently risky. And so, in addition to helping people actually make the performance improvements, we have to demonstrate the process for deploying and testing those improvements.

CHAD: I've worked on fairly big systems in the past. But there have been a couple of different instances over the last maybe year where we've approached the problem in a different way than we have in the past, which has been really interesting to me from a development standpoint. It's the idea of…if you remember, for the food delivery application, we had that conversation about the different ways to build APIs rather than versioning APIs explicitly. And that has been a different approach than the way I would have done things in the past. And it's been a really powerful approach. So, can we talk a little bit more about that approach?

JOE: Sure.

CHAD: Well, specifically, so we have mobile applications that use a back-end API, and not everyone updates their mobile application at the same time instantly. So you basically have bugs in the wild that you're fixing, or things that you're changing in your API, or you're just introducing API changes.

And so the idea is, instead of explicitly versioning the API on the server side and having clients write to a specific API version, building much more flexible APIs, in particular, having the client tell you what version of the API they're expecting, but through consolidated API endpoints, so that the server is much more in control of the behavior than the client.

JOE: Yeah, I think the two big changes that were helpful on that project were using GraphQL for some of the APIs, which provides more flexibility generally than a typical REST API and the minimum version requirement. So the application sends the version of the application. And the API will tell the client they have to upgrade if it's a version that isn't compatible with the newer APIs. So when we do have to break backwards compatibility, we force an app upgrade.
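
A minimal sketch of that minimum-version check, for illustration only. The version format, status code, and helper names are assumptions; the project Joe describes implements this in its Rails/GraphQL API, not in code like the following.

```python
# Hypothetical minimum-client-version gate: the mobile app reports its version,
# and the API refuses to serve versions older than the minimum it still supports.
MINIMUM_SUPPORTED = (2, 4, 0)   # made-up floor version

def parse_version(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

def check_client_version(reported: str) -> dict:
    if parse_version(reported) < MINIMUM_SUPPORTED:
        # Breaking API changes only ever raise this floor, so old clients are
        # told to upgrade instead of hitting endpoints they can't handle.
        return {"status": 426, "error": "upgrade_required"}
    return {"status": 200}

print(check_client_version("2.3.9"))  # -> {'status': 426, 'error': 'upgrade_required'}
print(check_client_version("2.5.1"))  # -> {'status': 200}
```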

CHAD: But in general, you're taking the approach not to break backward compatibility. And you're meeting the client where it's at whenever possible and maintaining backward compatibility in the APIs.

JOE: Backwards and forwards compatibility is something that we have been teaching developers about generally. We do that with deployments as well. For some of the larger deployments we have, where there might be dozens of containers running for a service, it certainly doesn't make sense to stop them all and start new ones, because the app would be down for a long time. And it would take too long to catch up on the backlog of requests.

But even a typical blue-green deployment is problematic. So if we have 30 containers running and we spin up 30 new containers, and they all need 15 database connections, then during the deploy, you potentially overload your database or exhaust your connection limit. Plus, you will need to allocate the compute resources for double the normal workload.

So what we've been doing instead is rolling deploys almost everywhere where we spin up a few new containers using the new version and wait until they're fully online, spin down a few old ones, and then repeat that process until everything is up to date. But to do a rolling deploy like that requires backwards compatibility with the services it uses, in particular, at the database. And so, writing Rails migrations that are backwards compatible for one version has been a challenge.
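
To make the connection arithmetic concrete, here's a back-of-the-envelope comparison using the 30 containers and 15 connections mentioned above; the surge of three new containers at a time is an illustrative assumption.

```python
# Peak database connections during a deploy, using the numbers from the
# conversation (30 containers, 15 connections each); SURGE is an assumption.
CONTAINERS = 30
CONNECTIONS_EACH = 15

steady_state = CONTAINERS * CONNECTIONS_EACH             # 450 connections

# Blue-green: the entire new fleet comes up before the old one is torn down.
blue_green_peak = 2 * CONTAINERS * CONNECTIONS_EACH      # 900 connections

# Rolling: bring up a few new containers, wait for them to be healthy,
# retire a few old ones, and repeat until everything is replaced.
SURGE = 3
rolling_peak = (CONTAINERS + SURGE) * CONNECTIONS_EACH   # 495 connections

print(steady_state, blue_green_peak, rolling_peak)
# The trade-off: old and new versions run side by side during a rolling deploy,
# so every change -- database migrations included -- must stay backwards
# compatible for at least one version.
```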

CHAD: And there's not really good tooling in Rails to do multiple stages of things. So if you really want to do that, you have to manage that in your source control basically and say, "Here's a new migration. We're going to merge in and deploy after this one," and that's not so great.

JOE: Right. The other way to do that in the CI/CD pipeline would be to release commits one at a time and wait for them to be rolled out. But depending on how you structure your commit log, that could be pretty tedious. [laughs]

CHAD: Yeah. I've seen that as I've worked on this other project where we're really striving to do continuous deployment. It's a high-traffic, very complex deployment with lots of individually configured tenants. Separating out the concept of a deploy from a release has been very valuable for the application and for the clients. It changes the way that you need to think about how development progresses.

I never before really worked in a system where you're literally sometimes duplicating and preserving old code, putting new code in place, having them both deployed, and then being able to switch between them as part of the release, and then cleaning up the old code later. At the scale that this is at, at the complexity that this is at, it makes sense for that application. It obviously doesn't make sense for everybody to be working that way.

JOE: Right. Breaking up applications to be a little smaller, having components that could be experimented with individually, would make some of that easier. The experimentation there, separating the release from the deploy, some of that is necessary because it's monolithic in so many ways. Like, it's a very big Rails application with one database with ACID compliance, which is a very powerful model. And it provides simplicity in some ways. But then it requires you to take on the complexity of making sure that you release things correctly.

I do think that it would be difficult in this particular situation. But for applications that reach that level of traffic, where you need to manage the risk of deploying, having smaller components, having some services broken out, would make that easier, because you could do, for example, a canary deploy with one release rather than duplicating the code and having the old and new versions.

CHAD: Right. The services create boundaries with contracts about behavior and reduce how tightly things, and their behavior, are coupled together. So, for example, on this application, we do have that one service that is completely managed independently from the main monolith and has its own deploy schedule. And we can, for the most part, change them independently without needing to go through all of that process that we go through to manage change. I think you're absolutely right.

JOE: Another experiment we've been trying for another client is it's another Rails monolith. There are different audiences for it. So this is the food delivery application again. And there are customers who are placing orders. There are drivers who are delivering orders. There are restaurants that are fulfilling orders. And then there are admins who are managing everything in the back end. And there's some overlap in the data they use. But the actual requests, and controllers, and pieces of the Rails application they use are almost entirely isolated.

So one challenge we had was being able to provide different reliability contracts for those different audiences and also scaling them and configuring them differently. So, for example, if you've done tuning for a Rails application before, you've probably tweaked things like how many threads will I have for each of my Puma workers? How many Puma workers will I have per container? How many database connections do I need in the pool?

And what we were able to do for this application using Kubernetes and Istio was running the same application, the same container, so like one monolithic Rails container, but running it more than once in different configurations and routing traffic to different pools of containers based on the audience.

And so, for example, if the customer is making requests, those all go to the customer pool of containers, which are scaled independently and have their own configuration tweaks for the kinds of requests that customers tend to make, which are generally small, high-throughput requests with lots of little writes.

And then compare that to the admin panel, where they typically view dashboards and big lists of records. And so, the requests tend to be larger, but the number of users is much smaller. There are way more customers than there are admins. And so, for those, we have fewer connections. We have more memory allocated for the kind of bloat that results from those types of requests.

And we also have a different performance objective for admins. It's more acceptable for those pages to respond a little bit slower. And admins understand it's their job. They have to use the software. So they'll reload the page if they have to, versus a customer who, if they're having trouble placing an order, might just buy somewhere else. So that's been a pretty powerful mechanism we were able to leverage.

CHAD: Is that switching on the URL, like endpoints?

JOE: Yeah, it's based on the path. But the mechanisms available to us are actually pretty powerful. At that point, we have access to the full request. So we could really route based on anything we wanted right down to the user.
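
A toy illustration of the audience-based routing rule being described. In practice this lives in Istio configuration rather than application code, and the path prefixes and pool names below are assumptions, not the client's actual setup.

```python
# Toy illustration of audience-based routing. The real routing is expressed as
# Istio configuration, not application code; prefixes and pool names are made up.
AUDIENCE_POOLS = {
    "/admin":      "web-admin",       # few users, large read-heavy requests
    "/driver":     "web-driver",      # drivers delivering orders
    "/restaurant": "web-restaurant",  # restaurants fulfilling orders
}
DEFAULT_POOL = "web-customer"         # high-throughput customer traffic

def pool_for(path: str) -> str:
    for prefix, pool in AUDIENCE_POOLS.items():
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL

print(pool_for("/admin/orders"))  # -> web-admin
print(pool_for("/cart"))          # -> web-customer
# Each pool runs the same monolithic Rails container, but with its own replica
# count, Puma worker/thread settings, database pool size, and latency objectives.
```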

CHAD: I guess that's a really good example. You don't have access to that routing on Heroku.

JOE: No, I think any Platform as a Service where they manage the routing if they don't provide that feature, you don't get that feature.

CHAD: This is the first time we're talking about this. That is a really interesting example of how scaling a monolith that way solves some of the problems that services often get you, without having to break everything up right off the bat in order to do that.

JOE: Yeah. I also think it provides kind of an inside-out approach to doing that. One of the problems with breaking out services is you have to plan what the services are going to be to a certain degree. And so, I think the best way to do it is to extract services from a monolith the same way you extract classes to break them up.

And this audience-based approach is almost like a dry run. You can see if the boundaries you're drawing make sense in terms of traffic. And if those make sense, it probably makes sense to break up the front end at those boundaries eventually into different applications. And then figure out what services you need to extract to provide the common infrastructure for those front-end services.

The same way test-driven development makes it much easier to find the correct tests to write, I think this approach of audience boundary discovery is an interesting approach to finding service boundaries, versus trying to guess at what the services are, which very frequently leads people to wrap services around database tables, which doesn't help at all.

CHAD: Yeah, that's the wrong thing to be looking at when you're looking at how to do services.

JOE: Right. It's almost like deciding what your database tables would be upfront before you've seen the UI for the application.

CHAD: Cool. So heading into 2022, we're looking ahead at the upcoming year. And so what's on the docket for Mission Control?

JOE: We didn't start experimenting fully with SRE until the third quarter of this year. And so far, we've loved it. So I think we'll make a pretty heavy investment into our SRE offering. The goal is for us to have an open-source set of Terraform modules that effectively deploy a platform ready to go for SRE. What we want to do is maintain and curate that platform and then deploy it and maintain it for our clients.

I think another big thing we'll be doing (this might be incredibly boring) is restructuring the way our agreements work a little bit. One of the things we wanted to test out when we built Mission Control was how much we could build into a monthly recurring contract versus billing for time and materials like we usually do. So we tried putting a lot into that contract and really pushing the boundaries of what would be reasonable.

And there was definitely a lot of pain there for us and a lot of difficult conversations with clients. So I think for 2022, we will be shifting a lot of our work back towards time and materials. So I guess that's a lesson out there for anybody else that's providing [laughs] support contracts: make sure that the responsibilities contained in a fixed monthly amount actually scale linearly with that amount.

CHAD: I think when we originally conceived of Mission Control, we also saw it handling a lot more things that, it turns out, we're just not doing as part of Mission Control, like regular Rails upgrades.

JOE: Yeah, a lot of the things that we included in contracts originally were not particularly important to clients or at least were not outside of what they were capable of doing already. So it wasn't that much of a value-add. There are a lot of people out there that will upgrade your Rails version. And having somebody who just does it in the background but isn't aware of some of the impacts that might have in the application turned out to be not much of a value prop. Whereas stability turns out to be a big pain point for a lot of people, people don't know how to do it.

And then with our maintenance offering, I think what ended up providing the most value is not the keeping-the-code-fresh parts; it was more that teams that don't have a large continuous development team get access to somebody who can fix quick bugs and things like that without needing to first negotiate a contract with a provider. I think that provides a lot of value. Those are pretty separate and different offerings. But those are the pieces that we found have really been valuable to clients.

CHAD: Well, great. If people want to find out more about Mission Control or get in touch with you, where are the best places for them to do that?

JOE: Well, we have a website thoughtbot.com/mission-control with a dash between mission and control. There are a few ways to reach out there. You can also find us on Twitter. We are @thoughtbot, and I am @joeferris.

CHAD: Cool. You can subscribe to the show and find notes for this episode at giantrobots.fm. If you have questions or comments, email us at hosts@giantrobots.fm. And you can find me on Twitter @cpytel.

This podcast is brought to you by thoughtbot and produced and edited by Mandy Moore. Thanks for listening. See you next time.

Announcer: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.
