One of the major challenges I have faced in my career is keeping performance up in a large, rapidly evolving system with many developers. As a Java developer writing enterprise and web applications, I generally have the luxury of not worrying about the minutiae of the L1 cache, or having to drop down to assembly to squeeze out performance. But even enterprise applications have their performance limits, and users don't want to wait 10+ seconds for results to come back, or watch their website freeze during a stop-the-world garbage collection. In the land of e-commerce web apps, performance is even more essential: the bounce rate may double when a result takes 4 seconds versus 1 second, and that means dollars we will never realize. So I take performance seriously.
We've all heard the old chestnut "premature optimization is the root of all evil". And indeed, when you find yourself spending half a day worrying about the O(n) time of an operation where n is never bigger than 10, you're probably falling victim to the siren call of premature optimization. A more subtle way people fall into the premature optimization trap is by getting sucked into profiling tools. I love a good profile as much as the next programmer, but profiling is devilish work that should only be undertaken when you know you have a problem to solve. But how do you know you have a problem before it is affecting your users and costing you business?
The easiest thing to do is simply capture the runtime of important chunks of work. If you happen to be using a service model of system design, capturing the runtime of each of your exposed endpoints is a good start. For those who happen to be using the Play framework (1.X), this is easily accomplished with a @Before in the controller that records the start time in a thread local, and an @After that logs the total time for the request, as in the sketch below. Now you can at least see what's running and how long it is taking.
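Here is a minimal sketch of that idea for a Play 1.X controller. The controller name, log format, and the choice of milliseconds are my own illustration; the essential pieces are the @Before/@After hooks and the thread-local start time.

    import play.Logger;
    import play.mvc.After;
    import play.mvc.Before;
    import play.mvc.Controller;

    public class TimedController extends Controller {

        // Play 1.X handles a request on a single thread (barring await/continuations),
        // so a thread local is a reasonable place to stash the start time.
        private static final ThreadLocal<Long> START = new ThreadLocal<Long>();

        @Before
        static void startTimer() {
            START.set(System.currentTimeMillis());
        }

        @After
        static void logElapsed() {
            Long start = START.get();
            if (start != null) {
                // request.action identifies the controller method that handled the request
                Logger.info("%s took %d ms", request.action, System.currentTimeMillis() - start);
                START.remove();
            }
        }
    }

Controllers that should be timed can simply extend TimedController (or you can put the hooks on a controller you already have).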
Unfortunately, that alone does very little for detecting performance problems before they hit your live traffic. It also does very little to show you trends in performance without some additional work on your part. The first thing you have to do is actually save the information you're collecting over time. We live in a big data world, but you may not want to save every bit of output from your service logs forever and ever, and most of it will be uninteresting except as an aggregate trend. You want to see whether your queries are trending up over time as your data or user load grows, but you probably don't care that at 3:43pm on December 13th the getAvailableSkusForDate endpoint took 100ms to return 20 results. You need something that you can look at quickly, that you can keep around for a long time, and that will warn you in advance if something is going to cause you problems. I'm sure there are many ways to skin this cat, but the way that has worked best for me in the past is basic performance testing.
Starting Simple
Basic performance testing requires a small amount of setup. The main requirement is a data set that is as close as possible to your production data set; a nightly dump of the production database is great. This data source should be dedicated to the performance test for the duration of its run. The database (or file system, or whatever) doesn't have to be on exactly the same hardware as production, but it should be configured with appropriate indexes and the like so that it matches production as closely as possible. For the most basic test, you can simply pick the queries you know to be worrisome, spin up the latest successful version of your trunk code, and time the execution of a set of important endpoint calls against this data. Record the results. You can use a Jenkins build to drive this whole process with fairly trivial setup, and at the laziest just record the results in a log message. Compare this to yesterday: if the increase exceeds a certain threshold, or the total time exceeds some cutoff, fail the build and email the team (a sketch of such a check follows). This is in fact the only thing I have done so far in my new role, but it already gives me more confidence that I have a historic record of our most problematic queries.
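Here is a rough sketch of the kind of check a nightly Jenkins job could run. The endpoint URLs, file names, and thresholds are hypothetical placeholders; the point is just to time a fixed set of calls against the refreshed data set, persist today's numbers, and exit non-zero when they regress against yesterday's.

    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.*;
    import java.util.*;

    public class NightlyPerfCheck {

        private static final double MAX_INCREASE = 1.25;      // fail if 25% slower than yesterday
        private static final long ABSOLUTE_CUTOFF_MS = 2000;  // fail outright past this point

        public static void main(String[] args) throws Exception {
            // The endpoints you know to be worrisome, pointed at the perf-test deployment
            List<String> endpoints = Arrays.asList(
                    "http://perf-test-host:9000/skus/availableForDate?date=2013-12-13",
                    "http://perf-test-host:9000/orders/recent");

            Map<String, Long> yesterday = load(Paths.get("perf-results/yesterday.properties"));
            Properties today = new Properties();
            boolean failed = false;

            for (String endpoint : endpoints) {
                long elapsed = timeCall(endpoint);
                today.setProperty(endpoint, Long.toString(elapsed));

                Long previous = yesterday.get(endpoint);
                if (elapsed > ABSOLUTE_CUTOFF_MS
                        || (previous != null && elapsed > previous * MAX_INCREASE)) {
                    System.err.printf("REGRESSION: %s took %dms (yesterday: %s)%n",
                            endpoint, elapsed, previous);
                    failed = true;
                }
            }

            try (OutputStream out = Files.newOutputStream(Paths.get("perf-results/today.properties"))) {
                today.store(out, "nightly perf results");
            }
            if (failed) {
                System.exit(1);  // non-zero exit fails the Jenkins build
            }
        }

        private static long timeCall(String endpoint) throws IOException {
            long start = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            try (InputStream in = conn.getInputStream()) {
                byte[] buffer = new byte[8192];
                while (in.read(buffer) != -1) {
                    // drain the response so we measure the full round trip
                }
            }
            return System.currentTimeMillis() - start;
        }

        private static Map<String, Long> load(Path path) throws IOException {
            Map<String, Long> result = new HashMap<>();
            if (Files.exists(path)) {
                Properties props = new Properties();
                try (InputStream in = Files.newInputStream(path)) {
                    props.load(in);
                }
                for (String key : props.stringPropertyNames()) {
                    result.put(key, Long.parseLong(props.getProperty(key)));
                }
            }
            return result;
        }
    }

After a successful run the Jenkins job would rotate today.properties into yesterday.properties, and archiving those files gives you the historic record for free.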
Slightly More
Now that you're tracking performance over time, a really slow change is not likely to slip into your codebase unnoticed. But what you would really like to compare is not just today's trunk vs. yesterday's trunk, but today's trunk vs. today's production. Now things get a little trickier. To do this effectively, you may need to refresh the database twice if you are doing any modifications of the data. And if there has been schema evolution between prod and trunk, you need an automated way to apply that evolution to the data before running the trunk code (this may apply to the basic case too, but surely you already have an automated way to apply schema changes, eh?). There are several nice things about comparing production to trunk. First, you can also use it as a regression test by validating that the results match across the two versions of the code, as in the sketch below. Second, you can feel pretty confident that it is the code, and not the database or the hardware or anything else, that is causing any performance difference between prod and trunk.
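A minimal sketch of that comparison for a single endpoint might look like the following, assuming the production build and the trunk build are deployed side by side against freshly refreshed copies of the same data dump. The host names, ports, endpoint path, and the 25% slowdown threshold are all hypothetical.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ProdVsTrunkCheck {

        public static void main(String[] args) throws Exception {
            String path = "/skus/availableForDate?date=2013-12-13";

            Timed prod  = fetch("http://perf-prod:9000" + path);
            Timed trunk = fetch("http://perf-trunk:9001" + path);

            // Regression check: the two versions should return the same results.
            if (!prod.body.equals(trunk.body)) {
                System.err.println("RESULT MISMATCH for " + path);
                System.exit(1);
            }
            // Performance check: flag trunk if it is meaningfully slower than prod.
            if (trunk.millis > prod.millis * 1.25) {
                System.err.printf("SLOWDOWN for %s: trunk=%dms prod=%dms%n",
                        path, trunk.millis, prod.millis);
                System.exit(1);
            }
            System.out.printf("OK %s: trunk=%dms prod=%dms%n", path, trunk.millis, prod.millis);
        }

        private static Timed fetch(String endpoint) throws Exception {
            long start = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            StringBuilder body = new StringBuilder();
            try (InputStream in = conn.getInputStream()) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    body.append(new String(buffer, 0, read, StandardCharsets.UTF_8));
                }
            }
            return new Timed(body.toString(), System.currentTimeMillis() - start);
        }

        private static class Timed {
            final String body;
            final long millis;
            Timed(String body, long millis) { this.body = body; this.millis = millis; }
        }
    }

In a real suite you would loop over the same list of endpoints as the nightly check and normalize anything that legitimately differs between runs (timestamps, generated ids) before comparing bodies.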
This setup will ultimately be as simple or as complex as your production system. If your production system is a small service talking to a database, setup is trivial. If, on the other hand, your production system is a complex beast of an application that pulls in data from several different sources, warms up large caches, or generally requires a lot of things to be just so before it starts, the setup will be correspondingly complex. As with pretty much all good software engineering, the earlier you start running automated performance monitoring, the easier it will be to set up, and the more it will push you to spend a little extra time making your system easy to deploy and automate.
Gaga for Performance Monitoring
I have only touched the tip of the iceberg of performance testing and monitoring. There's a world of tools that sysops and DBAs have for tracking system load and the performance of individual database queries. You can use log analysis tools like Splunk to identify hotspots while your system is running under real user load (a weakness of this basic performance testing setup). But I have found that all these complex tools cannot replace the sense of security and the historical record that a good performance testing suite provides. So give it a shot. If nothing else, it can give you confidence that all your fun time playing with profiling tools actually makes a trackable difference in the performance of your code base.