How I started believing in Cycle Time over Estimation

How I started believing in Cycle Time over Estimation

Cycle time is the time it takes to produce value. In software engineering, this is the time between a team starting to develop a feature and delivering this to the customer. Most backlog tools track this data and show it in control charts. You can use cycle time data to predict how long an individual item will take. You can find more explanation here.

I keep telling the following story over and over when I see teams trying to do Estimation while everyone knows that Estimation is a charade. So I thought I could write it down and share it. I believe a story like this will explain the theory how Cycle Time works better than Estimation in a much more lively way.

Background

At the time of the story, I worked with a team that had a hard time planning. I joined after a Lift & Shift to AWS and the company asked me to help get into control. The platform consisted of multiple legacy systems, all working together in complex ways. Before the migration, the hosting of the platform was outsourced resulting in  developers not spending any time on pipelines and maintainability in the decade before.

We had trouble supporting other development teams consuming our platform due to the many operational incidents. Most of the work on new features stalled. We could never estimate when we would finish new features because work could be disrupted by sudden firefighting in production. After some time and by applying engineering to increase reliability, we started to build features again and people began requesting expected dates again. We estimated those dates, but we did not make those dates time and time again due to unforeseen complexity.

Fire Fighting can happen in DevOps teams.

That we had a major problem became apparent to me when the team started to talk about a new project. I asked what it was about, and the team said they were already refining it for over a year but never came to it. The company expected the team to deliver this big project that would be strategical to the future. But it was unthinkable that we would make any headway in this in the coming time.

So I proposed we push out this project and have a new team work on it. The new team started delivering after some ramp-up time, but this new team still needed some support from us for some small features on our platform.

Predicting a Feature Toggle

This new team requested a significant feature in our platform's critical component. We already built this feature and were running it on Acceptance for months. It was turned off in production by a feature flag. Producing this feature would be a simple change in our Configuration as Code and push it to production.

At the end of a Friday afternoon, the project manager asked our team when we would deliver this feature. The weeks before, he already gave some estimations that we didn't make. So he was reluctant to provide an estimate. At the time, I was telling the team that we should stop with estimation and do more in-depth data-driven planning. So he asked me over: "Hey, what would you say?"

Estimation can feel like drawing Tarot cards.

I answered: "Let's pull up the Control Chart to see our Cycle Time.". There we are leaned over the display: the project manager, the team lead, and me. I moved the mouse over the control chart. A pop-up showed our rolling average: '5d 6hr. I stood up straight and said: "Our average is five days 6 hours with a standard deviation of up to two weeks.".

I could tell confusion from the look of my team lead. How could I say it would take over five days for a change that we estimate to take 10 minutes! I didn't really have the answer, but I just said that it takes us that time to deliver on average. But somehow, it did feel right. We agreed that it would be the first thing picked up on Monday. Satisfied that we at least would work on it, the project manager accepted the answer. Probably this was his goal anyway, not expecting any real date from us seeing our previous performance.

Delivering a feature toggle

On Monday, we started with the stand-up and discussed picking up this task. Another engineer proposed fixing a bug on another component first before turning on the flag. We would generate a lot of extra load by turning on the feature flag. We could not test this in our acceptance environment. This bug was related to load. The component would make it more likely to handle it by fixing it. We decided to do this because if that component fails, the whole platform will go down. So that is already a day delay.

At the end of the day, I looked at the board and didn't see the bug moving to the release column. I gathered the team and asked if the bug fix is ready to be deployed after hours. The bug fix was reviewed and tested, ready to be deployed. But as it is such a crucial component for the whole platform, someone reminded us that we had to ask the release manager.

So we asked the release manager if we could deploy this. He exploded: "No way we can deploy this! We need to request all teams to test this change! I have to notify management if we deploy any changes to that component! That takes a full day." So we planned to release it the next day.

Tuesday, we notified all teams of our change and asked them to test it. They tested it on Acceptance during the day. They found no problem, so the bug fix was ready to be released in the combined release. Little did we know that a feature by another team was also in the combined changeset. This feature would break on production. After the release, customers started calling, so the release engineer reverted the release. The other team had to fix this bug the next day.

On Wednesday, the other team saw that the release failed due to their new bug and started to fix it. That evening a new release was scheduled with their and our bug fix. So our bug fix was finally in production.

So our bug fix was finally in production on Thursday. We turned on the feature flag. The feature started to be available on production and it worked.

Capturing Complexity in Data

Our estimation of 10 minutes resulted in a Cycle Time of 4 days and 5 hours. The combination of the level of practices, the complexity of systems, and collaboration between teams caused this delay and are present in any sociotechnical system. These can all be improved, but that is not the point of this story. That takes time and work, while some features need to be delivered in the meantime. As humans, it is impossible to consider all the potential complexity. That is why we cannot estimate correctly. But Cycle Time data can and does.

We did not consider a legacy system with bugs and scaling problems. But we have run into this before while delivering other tasks. That complexity is captured in the data.

We did not consider a combined release process with a Release Manager. But previous releases failed before due to bugs introduced by other teams. But the tracking tools do measure these delays in previous task completion times.

This story is how I turned into a firm believer in using Cycle Time as a planning tool. Afterward, I have seen the same story play out time and time again but due to different complexities: the maturity of third parties, CISO, and so much more. I now never give any estimations anymore. Not using estimation require training. But planning is so important that using a charade like estimation is harmful.

If you liked this story, I would recommend to follow me on twitter: @snorberhuis. I regularly tweet about Software Development. If you need any help to get into control in Agile or DevOps, feel free to contact me!

A major thank you goes out to Stanislav Pogrebnyak to introduce me to #NoEstimates. All photos are from unsplash.

Show Comments