How my biggest work failure led to success
From my tiny West Berkeley apartment, with my family asleep upstairs, I typed into the search box: “python Berkeley”. It was 2015, and for the past six years I had commuted hours a day all over the Bay Area. My goal was eventually to create a startup. But my latest gig had imploded and I was tired; mis-wanting to be a startup founder was reason number one. While I still enjoyed programming, I needed a refuge from the hustle, ideally a job I could walk to. I didn’t care if it put me any closer to being a startup founder.
Despite my overly specific search parameters, I found what I was looking for on the first page of results: a position at a company that sold hosted services to academic libraries. Delighted by its mission and location, I applied, interviewed, and got the job. A few weeks later, past gardens of fragrant rosemary and craftsman houses, I walked from my apartment to Downtown Berkeley. Happy to be free of the hustle, I only wanted to tuck into this project and write some code.
Clueless about what lay ahead, I was embarking on the biggest project failure of my life (to date). To clear one thing up: I don’t feel like I was the root cause of this failure, but I was absolutely a contributing human factor. Luckily though, this isn’t just about an epic failure. While it was stressful, the mistakes I was about to make were invaluable experiences. In the best turnabout ever, the failure directly led to my favorite success story. (I recounted this story on the StaffEng podcast as well.)
The largest source of revenue for the company came from a 16-year-old service. In the late 90s, the company had successfully pivoted a hosted academic journal service into a new, developing product category. Over the years they captured a profitable portion of the market. By the time I left, the company had sold for a price that made the shareholders quite happy. When I joined, though, the product’s past success was slowing down, and that was leading to problems.
The codebase was at least as old as the service. While attempts had been made to modernize it, the changes were more like layers on top of the original flat-file database than a full architectural refactoring. Instead of replacing the previous architectures, they were wrappers on wrappers. This created a huge amount of cognitive load when trying to change anything; even getting a development environment set up was a multi-week proposition. Whether or not those past attempts were successful at the time, the codebase was now the main culprit behind a number of business problems.
While the company was profitable and had clearly made the right call 16 years earlier, the codebase and the engineering org weren’t in a great place. Latency was poor and there were plenty of code quality issues, but the most telling issue was the release process. Every 3 months, the entire engineering staff would stay after work on a Friday and attempt to deploy a new version, often staying well into the night. After every release, there would be a period of stabilization where hot patches were applied directly to production. This led to its own quality issues: for instance, it was not uncommon for changes made in production to never get committed to the trunk branch, a time bomb for the next release. If your only experience with deployments is automated pipelines, this might seem barbaric, but it was not an uncommon deployment strategy in the early 2000s. Even now there is probably a sizable population of companies that deploy like this. As William Gibson has said, “The future is already here – it’s just not evenly distributed.”
The engineering difficulties were a growing business issue. The reliability of the system was shaky at best, and it was increasingly difficult to ship features that customers wanted. Some customers were leaving, others stopped using the product, and growth had slowed. The business started to push harder on the engineering organization to fix the situation. It’s fair to say there was a growing trust issue between the business and engineering. The engineering organization knew this and wanted to do something about it.
The team I joined had the solution: they would dump the old codebase entirely and rewrite it in a new, modern language, replacing Perl with Python. Now, this team knew that a big bang rewrite was “a bad idea”, so they weren’t going to replace the whole thing at once. Instead, they decided to deconstruct the main codebase into a set of microservices. By the time I joined they were already years behind schedule and wanted to get the process done as soon as possible. I was ready to dive in. A big messy code problem was exactly what I needed. At the time this felt like exactly the right thing to do; not only could I do it, it felt good to worry only about the code.
I dove in and tried to focus on writing code. While I tried to ignore all other concerns, within weeks I started asking myself, “Is standing up a new service a good idea? Would it be easier to ‘fix’ the thing in place?” I never brought this up to my boss or co-workers, though. I desperately wanted to ship the damn thing. So I pushed those thoughts off to the side; I’d figure out the problems later. I wanted approval that I was a good engineer. At the time I didn’t consider stopping to align on the problem we were solving. That would have required talking, and talking would waste time, which meant I wouldn’t deliver any code, which would prove that I wasn’t a good engineer. So I put my head down and focused on the code. Unfortunately, buckling down was the start of a process that would end in disaster.
With the power of hindsight, the train was already going off the rails. In the year leading up to the release, things went from bad to worse:
- One month after I joined, the only engineer who knew anything about the project left
- My boss (who coded part-time) and I were left to finish the job
- The VP of engineering left one month before release
By the time I realized all of this, the train had left the tracks, and I couldn’t do anything to prevent the crash.
As the year came to a close, we were also finishing the project. To ship it, though, we needed to shut down the service we were rewriting for “a couple of days” in order to sync the data from the old system to the new. This was when the project went from bad to disaster.
As soon as we turned off the site and started to sync the data, we realized there were a number of issues with the sync, and they started to pile up. Slowly, we realized this would not take a couple of days, and we didn’t know how long it would take. The site went down on Dec 20th and wouldn’t come back up until Jan 8th. Even after we put a band-aid on the sync issues, we still had more bad luck:
- The site’s p95 response time was 60 seconds, with a p50 of 3 seconds (for relatively static data)
- We had lost one major customer
- My boss resigned (He had hoped to resign after the launch, but alas…)
Besides all the technical failures, the launch process was a gut-wrenching experience for me personally. On top of the overall stress, the sync issues kept piling up as we were trying to get things out the door. During this time the CEO of the company was running the daily stand-ups (never a good sign). One day, when things were particularly bad, my boss and the CEO were discussing a new issue that had cropped up. Being one of the few people who could fix anything at that point, and being kind of pissed off that I was expected to stand around in a circle while the world was burning down, I sort of yelled at them both: “I’ve got it! I’ll take care of it! Can we move on?” This was a professional low point for me. I appreciate the grace of everyone in that room who still chose to work with me after that day. I can still feel that moment in my body as I write it out. It was sickening.
While the project’s launch was a disaster, and caused more human carnage than was necessary, after the two-week shutdown and a 30-day stabilization period we were able to finish the project.
And despite some rather large technical issues, not everything was bad. The site’s design was better, which delighted a number of our users, and I had implemented a continuous delivery process that let us go from merge to deploy in minutes. This allowed us to patch the system at a rapid pace, and after a few weeks we were able to stabilize the product. Once stabilized, we could even start work on new features, which happened at a faster rate than previously possible. The engineering organization even got in trouble once when we shipped a feature before a product manager had approved it. They hadn’t needed to worry about that happening before.
Deploying code that inadvertently shipped a feature before everyone was ready led to an incredible new capability for the organization. Once we aligned on the fact that shipping fast let us fix bugs faster, we were able to use feature flags to give product control over feature release, while we continued to control deployment cadence. These kinds of innovations weren’t possible previously, but with the worst behind us, we could begin to build a new future.
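The division of labor feature flags gave us can be sketched in a few lines. This is a hypothetical illustration, not code from the actual system: engineering deploys code whenever it is ready, with the new path dark behind a flag, and product flips the flag to release the feature without a new deploy.

```python
# Minimal feature-flag sketch (all names here are hypothetical, not from
# the real codebase). Engineering controls *deployment*; product controls
# *release* by toggling the flag.

FLAGS = {
    # Shipped in the deploy, but dark until product flips it on.
    "new_search_ui": False,
}

def is_enabled(flag: str) -> bool:
    """Check whether product has released a feature; unknown flags are off."""
    return FLAGS.get(flag, False)

def search_page() -> str:
    """Render the search page, choosing the code path by flag state."""
    if is_enabled("new_search_ui"):
        return "render new search UI"
    return "render old search UI"

# Engineering can deploy this code at any time; the old path still runs.
print(search_page())  # old UI while the flag is off

# Later, product "releases" the feature without another deploy:
FLAGS["new_search_ui"] = True
print(search_page())  # new UI after the flag flips
```

In a real system the flag state would live somewhere both product and the running app can reach (a database row, a config service), but the contract is the same: deploy cadence and release cadence become independent decisions.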
Looking back, I’m not sure we had a better way to do this as an organization. Technically, I don’t know if microservices are “easier” for most orgs. There is a strong argument that, given the local context of the team and the organization, it made no sense to use microservices1. We didn’t have the correct prerequisites: we didn’t understand what our interfaces were or what our data models were. While the organization wasn’t in a good state to do anything big, a situation had been created where we either had to ship the new thing or somehow throw away a million-plus dollars in investment. The organization was therefore forced to adapt, and luckily we adapted in a way that created capacity. That capacity allowed us to iterate faster, which proved more valuable over time than rewriting one part of the product. This new capacity in part led me to my favorite success story.
Long after we stabilized the new service, we continued to be plagued by various consistency and latency issues. Both stemmed from our reliance on the monolith; the new service wasn’t truly independent. The more I dug into the inconsistency, the more certain I became that I needed to dig into the monolith to understand what was going on.
At that time there were other reasons to focus more on the monolith. We were managing our own infrastructure, literally renting our own cage at a data center and running the whole thing ourselves, with only 2 folks who knew how to operate the cage. We were also running our own in-house blob store, which was increasingly creaky.
We wanted to migrate to the cloud, we wanted to migrate a very large blob store to S3, and, if we could, we wanted to break up the monolith even more. But after my experience with the first microservice, I wanted to modularize first and then, if things really required it, move to services.
While this made logical sense, digging into the monolith scared me. I am the kind of person who takes pride in speed; I want to fix bugs in minutes, not hours or days. That is hard to do when you can only deploy updates every 3 months! Based on our experience with continuous deployment of the new service, I suggested that we build it out for the old service as well. Luckily, this made sense to my co-workers at the time. Given the amount of change we foresaw, we decided as a team that continuous deployment was a good idea so that we could ship small diffs.
The details of how we did this are interesting but more harrowing than technical. I did things that should not see the light of day 2. After a few months of work, we took a system that deployed once every 3 months and made it a push-button deployment that took 10-15 minutes. It went from shipping a versioned branch to shipping trunk. It went from “you must deploy in off-hours” to “you can deploy in the middle of the day.”
While this required a monumental amount of work and a huge change in how folks worked, things rolled out fairly smoothly. In classic fashion, I started looking for the next thing, not stopping to savor the win. Around this time the company was purchased by another company, which made it quite difficult to stick around, and I chose to leave. I was sad, though; I had big plans for the “real work” I was going to undertake on the now almost 20-year-old codebase. I had barely changed anything when I implemented CI/CD. At least, that’s what I thought.
One day, I was explaining this to a co-worker who had been at the company since its inception. He stopped me. I don’t remember what he said word for word, but it was something like: “I know you feel like you didn’t achieve much, but for the last decade, every 3 months I had to give up a Friday night to deploy a new version of the app. That’s no fun. That’s a Friday night I don’t get to spend with my family. Now, I can push a button anytime I want and deploy the latest changes. No more lost Friday nights.” So much for not doing any “real work.”
What he said hit me hard. Previously, I hadn’t thought about how my work made things better for my co-workers. I hadn’t considered how these kinds of process changes could take a significant amount of load off of them. In retrospect, it makes a ton of sense. If the folks who work in a system are overloaded with process work, it will be hard for them to change it for the better. Freeing them up allows them to focus on growing capabilities instead of firefighting. Even if you aren’t fixing the real technical issues (if that even is a thing), you are creating the extra capacity folks need to do so.
I started this journey hoping to escape a certain kind of grind, only to find a new kind of grind. Through the process, though, I learned that if you genuinely listen to the folks you work with, build relationships with them, and include them in the prioritization process, you can eventually make an incredibly substantial impact. Even if you don’t know what that looks like at the start, if you trust the process and listen, you will figure it out.
If you are curious about the debate around the microservice/monolith issue I would recommend these pieces. I think both of these posts contextualize the debate in interesting ways. Modules, monoliths, and microservices and Deconstructing the Monolith: Designing Software that Maximizes Developer Productivity. ↩︎
There are many folks dealing with aging infrastructure right now, and it’s totally possible to make such systems do things like continuous deployment without rewriting everything from scratch. If you are working on such a system, I highly recommend Kill It With Fire by Marianne Bellotti. Unfortunately, the book did not exist when I was doing this work. ↩︎