How to Measure Latency Properly in 7 minutes.
Measuring latency properly requires quality data. There is a reason KPMG’s “2016 Global CEO Outlook” found that 84% of CEOs are concerned about the quality of the data they’re basing decisions on: all too often, data can mislead.
The difference between companies that care about their data and those that don’t is huge. MIT researchers found that companies that have adopted a data-driven design have an output that is 5–6% higher than what would be expected given their other investments and information technology use. This is why understanding latency matters so much.
In the next 7 minutes, you will learn how to measure latency, why properly measuring it matters, common pitfalls when looking at your latency data, why you need instant computing and why you need unsampled data.
So What is Latency?
Dictionary.com defines latency as “the period of delay when one component of a hardware system is waiting for an action to be executed by another component”. In simpler terms, this means the amount of time between calling a function and its actual execution. Latency is inherent in all systems; even a perfect system (which doesn’t exist) would still have latency equal to the time it takes for the electrons in the computer to switch the transistors from on to off or vice versa.
Latency in a single small operation isn’t a big deal, but when handling millions of operations, there are millions of latencies that add up fast. Latency is not defined as work units per unit of time; it describes how an operation behaves. Monitoring tools report how long it takes from the start of a function until the end of the function.
Latency can have a major impact on your business. For example: “When it comes to mobile speed, every second matters — for each additional second it takes a mobile page to load, conversions can drop by up to 20%” (Source). So it’s very important to understand your latency as best you can.
Common Pitfalls when Looking at Your Latency Data:
Latency almost never follows a normal (Gaussian) or Poisson distribution. Even if your latency does follow one of these distributions, the way we observe latency makes averages, medians and even standard deviations useless! If, for example, you are measuring page loads, 99.9999999999% of these loads may be worse than your median. This is part of the reason that randomly sampling your latency produces inaccurate data, but more on this later.
At this point, you’re probably asking yourself: if we aren’t using standard deviations, how do we meaningfully describe latencies? The answer is that we must look at percentiles and maximums. Most people think to themselves, “Okay, so I look at P95 and I understand the common case.” The issue with this is that P95 is going to hide all the bad stuff. As Gil Tene, CTO of Azul Systems, says, “it’s a ‘marketing system’. Someone is getting duped.”
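One way to arrive at a figure of this size is to assume that a single page load issues many requests, and that the load is "worse than the median" as soon as any one of them is. The sketch below works through that arithmetic. The request counts (1, 10, 40) and the independence of requests are illustrative assumptions, not numbers taken from this article, and the function name is hypothetical.

```python
def prob_any_request_worse_than(percentile: float, requests_per_load: int) -> float:
    """Chance that at least one of N independent requests is slower than the given percentile.

    Assumes requests are independent, which real page loads only approximate.
    """
    if not 0.0 <= percentile <= 100.0:
        raise ValueError("percentile must be between 0 and 100")
    if requests_per_load < 1:
        raise ValueError("requests_per_load must be at least 1")
    # Probability that one request is at or below the percentile.
    p_single_ok = percentile / 100.0
    # All N requests must be "ok" for the page load to be "ok".
    return 1.0 - p_single_ok ** requests_per_load


if __name__ == "__main__":
    # Hypothetical request counts per page load.
    for n in (1, 10, 40):
        share = prob_any_request_worse_than(50.0, n)
        print(f"{n:>3} requests per load: {share:.12%} of loads see a worse-than-median request")
```

Under these assumptions, the share of page loads that experience at least one worse-than-median request climbs toward 100% very quickly as the number of requests per load grows.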
Take for example this graph:
When you see this graph, you can clearly see why the median and average have no real significance: they don’t show the problem area. When you see the 95th percentile shoot up to the left, you think you are seeing the heart of the problem.
This, of course, is not true. When you go to investigate why your program had a hiccup, you are failing to see the worst 5% of what happened. To get this kind of spike, the top 5% of the data must be significantly worse.
Now look at the same graph that also shows the 99.99th percentile:
The red line is the 95th percentile, whereas the green line is the 99.99th percentile. As you can clearly see, the 95th percentile only shows 2 out of 22 of your issues! This is why you must look at the full spectrum of your data.
Many people may think that the last 5% of the data does not hold much significance. Sure, it could just be a virtual machine restarting or a hiccup in your system. But by ignoring it, you are saying that it just doesn’t happen, when it could be one of the most important things for you to target!
Gil Tene likes to make the bold claim that “The number one indicator you should never get rid of is the maximum value. That is not noise, that is the signal. The rest of it is noise.” While the maximum is indeed a great signal, in a system at large scale it is often not practical to pursue just the maximum case. No system is perfect and hiccups do occur; in a large-scale practical system, pursuing the maximum case exclusively is often a good way to burn out your development team.
When looking at the 99.99th percentile, you are seeing what happens to the large majority of your customers, and any spikes you see there are actual issues, whereas a spike in your maximum may just be a hiccup in your system. When your devops teams focus their effort on these small hiccups, they do so at a large opportunity cost, as they cannot work on more major issues instead.
It is worth noting that if your 99.99th percentile and your maximum are very close to each other (and both are spiking), then it is a great signal that this is an issue your team should work on. In this way, Gil is right that the maximum is a great signal, but wrong that the rest of your data is just noise. As you can see in this graph:
Our 99.99th percentile and maximum from our previous example match up exactly. This is a great signal that what you are looking at is a real bug and not just a hiccup.
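To make this concrete, here is a minimal sketch with made-up numbers: 10,000 requests where 99% take 10 ms and 1% take 2,000 ms. The data set and the nearest-rank percentile helper are illustrative assumptions, not output from any real monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples):
    """Return the statistics discussed in the text for one batch of latencies (ms)."""
    return {
        "median": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99.99": percentile(samples, 99.99),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Hypothetical data: 9,900 fast requests and 100 very slow ones.
    latencies_ms = [10.0] * 9_900 + [2_000.0] * 100
    for name, value in summarize(latencies_ms).items():
        print(f"{name:>7}: {value} ms")
```

In this made-up data set the median and the 95th percentile look exactly like those of a healthy system, while the 99.99th percentile and the maximum expose the slow requests.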
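The rule of thumb above could be sketched as a small check. The function name, the baseline argument, and both thresholds (`spike_factor` and `closeness`) are hypothetical values chosen for illustration; you would need to tune them for your own system.

```python
def looks_like_real_bug(p9999_ms, max_ms, baseline_ms, spike_factor=5.0, closeness=0.9):
    """Return True when the 99.99th percentile and the maximum spike together.

    baseline_ms is whatever you consider a normal latency for this operation.
    spike_factor and closeness are illustrative thresholds, not recommendations.
    """
    if min(p9999_ms, max_ms, baseline_ms) <= 0:
        raise ValueError("latencies must be positive")
    # Both values must be well above the normal level...
    both_spiked = (
        p9999_ms >= spike_factor * baseline_ms
        and max_ms >= spike_factor * baseline_ms
    )
    # ...and the 99.99th percentile must sit close to the maximum.
    close_together = p9999_ms >= closeness * max_ms
    return both_spiked and close_together


if __name__ == "__main__":
    # Maximum spikes alone: more likely a one-off hiccup.
    print(looks_like_real_bug(p9999_ms=25.0, max_ms=2_000.0, baseline_ms=20.0))
    # Both spike and nearly match: worth investigating as a bug.
    print(looks_like_real_bug(p9999_ms=1_900.0, max_ms=2_000.0, baseline_ms=20.0))
```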
Averaging Percentiles: How Precomputation is Causing You to Mismeasure Latency:
An even worse pitfall than just looking at the 95th percentile is failing to recognize that your percentiles are averaged. Averaging percentiles is statistically absurd; it removes all significance from what you are looking at. We have already shown that averages are not useful when looking at latency, and if you are looking at averaged percentiles, you are right back to square one. Many software packages average your percentiles. Take, for example, this Grafana chart:
Whether or not you realized it before, all the percentiles on this chart are averaged! It says so right there in the x-axis legend. NEARLY ALL MONITORING SERVICES AVERAGE YOUR PERCENTILES! This is a reality due to precomputation. When your monitoring service takes in your data, it computes the percentile of the data for each minute.
Then, when you go to take a look at your 95th percentile, it shows you an average of all those per-minute percentiles. This shortcut, taken “for your own good” to make the service faster, in reality removes all statistical significance from your data.
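The following sketch illustrates the problem with invented numbers: two quiet minutes and one busy minute in which 10% of requests are slow. It compares the average of the three per-minute 95th percentiles with the 95th percentile computed over all raw data points. The traffic pattern and helper names are assumptions made for the example.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile over the raw samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def averaged_percentile(windows, pct):
    """What a precomputing tool would show: the mean of per-window percentiles."""
    return mean(percentile(window, pct) for window in windows)


def true_percentile(windows, pct):
    """Percentile over every raw data point from all windows combined."""
    combined = [value for window in windows for value in window]
    return percentile(combined, pct)


if __name__ == "__main__":
    # Hypothetical traffic: latencies in ms, one list per minute.
    minutes = [
        [10.0] * 100,                        # quiet minute, all fast
        [10.0] * 100,                        # quiet minute, all fast
        [10.0] * 9_000 + [1_000.0] * 1_000,  # busy minute, 10% slow
    ]
    print("averaged p95:", averaged_percentile(minutes, 95), "ms")
    print("true p95:    ", true_percentile(minutes, 95), "ms")
```

In this example the averaged value is a latency that no request actually experienced, because each minute counts equally no matter how many requests it contained.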
Why You Must have Unsampled Data to Measure Latency Properly:
Whether or not you know it, monitoring tools that sample their data are producing averaged data. Almost every monitoring tool samples its data. Take, for example, DataDog: they have major data loss. If you send them 3 million points in a minute, they will not take them all. Instead, they will randomly sample the points and then aggregate them into 1 point per minute.
You must have unsampled data to understand your latency. It is inherent that with sampled data you can’t access the full distribution! Your maximum is not your true maximum, nor is your global percentile an accurate representation of what is going on!
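As a rough illustration, the simulation below keeps a uniform random sample of a batch of latencies and checks how often the sample still contains the true maximum. The batch size, sample size, and uniform sampling strategy are assumptions made for the sketch; it does not model how any particular vendor samples data.

```python
import random


def random_sample_max(latencies_ms, sample_size, rng):
    """Maximum seen after keeping only a uniform random sample of the points."""
    return max(rng.sample(latencies_ms, sample_size))


def share_of_samples_seeing_true_max(latencies_ms, sample_size, trials, seed=0):
    """Fraction of trials in which the sampled maximum equals the true maximum."""
    rng = random.Random(seed)
    true_max = max(latencies_ms)
    hits = sum(
        random_sample_max(latencies_ms, sample_size, rng) == true_max
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # Hypothetical minute of traffic: one 5-second outlier among 100,000 points.
    latencies_ms = [10.0] * 99_999 + [5_000.0]
    share = share_of_samples_seeing_true_max(latencies_ms, sample_size=1_000, trials=200)
    print(f"true maximum visible in {share:.1%} of sampled minutes")
```

Keeping every point (a sample size equal to the batch size) is the only setting in which the maximum is guaranteed to survive.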
Sampled Data exacerbates Coordinated Omission!
When you sample data, you are omitting data. Say, for example, you have 10,000 operations happening in a minute, each sending 2 data points to your monitoring system, for 20,000 data points in total. Say you have a bug in your system, and one of these data points per 10,000 operations shows it. Your monitoring system only has a 1/20,000 chance of choosing this as the data point it shows you as the maximum!
If you run long enough, the data point will show up eventually, but as a result, it will look like a sporadic edge case, even though it is happening to one of your customers every minute! When you don’t sample data, and you have one of these spikes, it will show up clearly in your 99.99th percentile, and your maximum will show up close to it, signalling you that you have a bug in your program. When you sample your data, however, it won’t show up as often, meaning you won’t see it as a bug but rather as a hiccup. This means your engineering team will fail to realize the significance of it!
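The arithmetic behind this example can be written out as a short sketch. It uses the simplified model from the text, in which the tool shows one randomly chosen data point per minute; the function names and the one-hour and one-day horizons are illustrative assumptions.

```python
def chance_single_pick_is_spike(operations_per_minute, points_per_operation, spike_points=1):
    """Chance that one randomly chosen data point is the spike."""
    total_points = operations_per_minute * points_per_operation
    if total_points <= 0 or not 0 <= spike_points <= total_points:
        raise ValueError("invalid point counts")
    return spike_points / total_points


def chance_seen_within(minutes, per_minute_chance):
    """Chance the spike is picked at least once over a number of independent minutes."""
    return 1.0 - (1.0 - per_minute_chance) ** minutes


if __name__ == "__main__":
    per_minute = chance_single_pick_is_spike(10_000, 2)
    print(f"chance per minute: {per_minute:.6%}")
    # Hypothetical observation windows: one hour and one day.
    for label, minutes in (("one hour", 60), ("one day", 1_440)):
        print(f"chance of seeing it within {label}: {chance_seen_within(minutes, per_minute):.2%}")
```

Even though the spike happens every single minute in this model, it shows up in the sampled view only occasionally, which is why it reads as a rare edge case.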
Don’t let your monitoring tool fool you into thinking you know what is going on with your Latency.
Choose a tool that doesn’t provide sampled data. Choose a tool that doesn’t average your global percentiles. Choose BeeInstant!
To learn more about why you should choose BeeInstant, check out this video comparing us to DataDog.