Thursday 13 December 2007

Lies, damn lies and measurements

This blog started with a chance meeting with a business colleague (how grown up, I'll be wearing a suit and tie next). His basic question was "could you give a talk on something?"... my reply was that the only thing I think you might like is my current line of thought: software measurements, quality etc.

OK, let's move on to where to measure the data. The answer is everywhere and anywhere it is produced. But specifically, let's look at unit and module tests. Unit tests are generally run very regularly, many times a day (especially for XP developments). One may use unit tests for profiling and run the occasional test looking for which unit tests take the longest. But there is more information to be gleaned from them... if you care to log it. Imagine a database being populated in the following manner:
  1. Test suite resource utilisation: Continuously (well, every 5 minutes) the CPU, memory, network and disk utilisation of all the systems in the test suite is logged and the results automatically loaded into a database
  2. Test results: Whenever unit tests are run, the results are automatically uploaded to a database with date, time, package, build no, time taken to run each test and number of times or loops in the test
  3. Test suite configuration: On a weekly basis the test suite uploads its configuration: CPU type & speed, total memory, OS and version, patches, java/c/c++/c# version, number of disks and so on
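To make the three feeds above a bit more concrete, here is a minimal sketch of such a database using Python's built-in sqlite3 module. The table names, column names and the two helper functions are all my own assumptions for illustration; any real set-up would pick its own layout.

```python
import sqlite3

# Hypothetical schema: one table per feed described above.
# All table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE resource_samples (
    host TEXT, sampled_at TEXT,
    cpu_pct REAL, mem_pct REAL, net_kbps REAL, disk_pct REAL
);
CREATE TABLE test_results (
    run_at TEXT, package TEXT, build_no INTEGER, test_name TEXT,
    passed INTEGER, duration_ms REAL, loops INTEGER
);
CREATE TABLE suite_config (
    host TEXT, recorded_on TEXT, cpu_type TEXT, cpu_mhz INTEGER,
    total_mem_mb INTEGER, os_version TEXT, runtime_version TEXT,
    disk_count INTEGER
);
"""


def open_metrics_db(path=":memory:"):
    """Open (or create) the metrics database and create the three tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_test_result(conn, run_at, package, build_no, test_name,
                       passed, duration_ms, loops=1):
    """Store the outcome of one unit test run."""
    conn.execute(
        "INSERT INTO test_results VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_at, package, build_no, test_name, int(passed), duration_ms, loops),
    )
    conn.commit()
```

A test runner hook would call record_test_result once per test; the resource and configuration tables would be filled by separate scheduled jobs.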
So now we have a database with not only the software metrics such as latencies, throughput etc, but also resource utilisation, and suite configurations. What could we do with this data? How do we make this data work for us?
  • Firstly we can track how each test is performing over time, from its first run until now
  • Secondly we can spot any step changes (either up or down) in the way a single test is performing
  • Lastly, we could pick some specific tests and use these as our KPIs, a standard unit of performance that future versions of the code can be compared to. This could be a simple measure or a string of system states
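Spotting a step change in a single test's history can be sketched quite simply. The version below is one possible approach, not the only one: it compares the average of the runs just before a point with the average of the runs just after it, and reports the point where the jump is largest. The function name, the window of 5 runs and the 1.5x threshold are all assumptions of mine.

```python
from statistics import mean


def detect_step_change(durations_ms, window=5, ratio=1.5):
    """Return the index of the largest step change (up or down), or None.

    A step is reported only when the mean of the `window` runs after an
    index differs from the mean of the `window` runs before it by at
    least a factor of `ratio`.
    """
    best_index, best_change = None, ratio
    for i in range(window, len(durations_ms) - window + 1):
        before = mean(durations_ms[i - window:i])
        after = mean(durations_ms[i:i + window])
        if before <= 0 or after <= 0:
            continue  # cannot form a meaningful ratio
        change = max(after / before, before / after)
        if change >= best_change:
            best_index, best_change = i, change
    return best_index


# Hypothetical history: a test that ran at 1 ms for ten runs, then 3 ms.
history = [1.0] * 10 + [3.0] * 10
step_at = detect_step_change(history)
```

Here step_at is the index of the first slow run. A longer window smooths out noisy runs at the cost of needing more history on either side of the change.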
The above is merely a fairly top level analysis of the software metrics, albeit at a fairly detailed level. In essence the system is being profiled as you test it, and every time you test it. The next trick is correlation. This is slightly harder, as the resource utilisation data will probably be measured as a five minute average, while the unit tests may only take 90 seconds or so to run. Nonetheless, in the five minute period when the system is running the unit tests, the CPU, memory, disk utilisation etc will all rise. Correlate this rise with the work going on, and you will start to see a pattern of what resources are being consumed as the system is being run.
  • Correlate software measurements such as throughput and latency with resource utilisation (CPU, disk IO etc.)
  • Look at resource utilisation over time.
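As a rough sketch of that correlation, the code below adds up how many seconds of testing started in each five minute window and then computes a plain Pearson correlation against the CPU average logged for the same windows. The function names, the (start, duration) tuple format and the sample numbers are assumptions for illustration, and assigning a run to the window it started in is a deliberate simplification.

```python
from math import sqrt


def busy_seconds_per_window(runs, n_windows, window_s=300):
    """Sum test run durations into the window each run started in.

    `runs` is a list of (start_s, duration_s) pairs, with start_s measured
    in seconds from the beginning of the first window.
    """
    busy = [0.0] * n_windows
    for start_s, duration_s in runs:
        idx = int(start_s // window_s)
        if 0 <= idx < n_windows:
            busy[idx] += duration_s
    return busy


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(vx * vy)


# Hypothetical data: four 5-minute windows of test runs and CPU averages.
runs = [(10, 90), (310, 90), (320, 90), (900, 90)]
busy = busy_seconds_per_window(runs, n_windows=4)
cpu_avg_pct = [30.0, 50.0, 10.0, 30.0]
r = pearson(busy, cpu_avg_pct)
```

A value of r near 1 suggests the CPU load in a window tracks the amount of testing done in it; the same comparison can be repeated for memory, disk and network.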
Note that we are not striving to ensure each test's footprint is well known (though this is a perfectly good approach); the overall effect on the resources required to run the tests is sufficient. Whilst this is a gross simplification, it should be noted that working out what the actual resource utilisation is under specific load cases is really the job of validation and verification. I'm not suggesting that the V&V stages be replaced by a continuous monitoring and measurement process, merely that what happens in V&V is useful during the development phase of a project. As more data is gathered, it may well (actually should) transpire that the system's performance is a relatively well known and quantified value. This means when it does become time to perform V&V, they have a head start, and may even have a good, well established set of tools to work with.

So how would all this work on a day-to-day basis?

Bedding in: After an initial period of development of the infrastructure and tools, data would be collected. At first it will seem complicated, with far too much detail.
Gaining acceptance: Then after a while, the general data patterns will be recognisable, and some general rules of thumb may develop (e.g. each unit test adds about 1-2ms to total test time, each unit test generates 20 IOs, 15 reads and 5 writes, and so on).
Acceptance: Then deeper patterns will start to emerge. What will also become apparent is when things start to differ. If a new unit test (note new, not existing) takes more than 5ms, look carefully. Or what about a unit test that used to take 1ms to run and is now consistently taking 3ms? Once these features are being spotted, the performance of the system is far more quantitative, and far less subjective.
Predictive: This is when the results are being used as engineering data. Performance of module x (as determined by one or more unit test results) is degrading. It will become a bottleneck. Redesign activity required. Or even, a better disk subsystem would improve the performance of the tests by 2/3 and mean the unit tests run 1 minute faster.
Historical: Imagine being able to give quantitative answers to questions like "Architecture have decided to re-arrange the system layout like so (insert diagram here), what will be the effect of this layout compared to the current?". Usually this would require re-coding the system under the various architectures and performing tests, but all that is now required is to understand the average effects of each of the components, and a good engineering estimate can be arrived at. Taking this a step further, imagine "The system's performance indicates that the following architecture would be best, like so (insert diagram here), and we/I believe that it would allow 20% more throughput".
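To show what such an engineering estimate might look like, here is a deliberately naive sketch that simply adds up the measured average latency of each component a request passes through. The additive model, the function name, the component names and the numbers are all assumptions for illustration; real components interact, so treat the result as a first guess only.

```python
def estimate_path_latency_ms(path, avg_latency_ms):
    """Sum the measured average latency of each component on a request path."""
    missing = [c for c in path if c not in avg_latency_ms]
    if missing:
        raise KeyError(f"no measurements for: {missing}")
    return sum(avg_latency_ms[c] for c in path)


# Hypothetical per-component averages taken from the logged test results.
avg_latency_ms = {"parser": 1.5, "cache": 0.5, "database": 4.0}

current_layout = ["parser", "database"]
proposed_layout = ["parser", "cache"]

current_ms = estimate_path_latency_ms(current_layout, avg_latency_ms)
proposed_ms = estimate_path_latency_ms(proposed_layout, avg_latency_ms)
```

Comparing current_ms with proposed_ms gives a number to argue about before anyone re-codes the system, which is the whole point of collecting the history.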

Taking stock... Once upon a time tests were done infrequently, but it was realised that testing is the only sure-fire way of ensuring features remain stable and bug free. Reducing the time from bug introduction to bug detection is crucial. Why stop at bugs? Why not use the same approach for performance? When is it easiest to change the software architecture? At the beginning. When is it hardest? At the end. Why change the software architecture? There are many reasons, but performance is one.

Lastly, the quantification of software performance is a key feature, because this is the only way to truly make engineering decisions. Developing software is full of contradictions and pros and cons; it is only by quantifying them that you can confidently take the decisions.

Thursday 6 December 2007

How do you measure the good?

Well, a new week, a new topic. I've not finished with the Java stuff, but something else is nibbling at my consciousness. It is vaguely related to the previous blogs. The question I'm going to spout about today is "How do you measure the good?"

Having read "Zen and the Art of Motorcycle Maintenance" [and "Zen in the Art of Archery"] I'm of the opinion that you cannot easily measure or quantify good (or quality, or excellence, or angles). But we all seem to know what is good when we see it. That TV program is good, that dinner is good, that film is good. When you push someone to explain WHY it was so good, you get rather less than quantifiable answers: it made me laugh, it was tasty, I enjoyed it. In essence, not very objective quantities, but certainly positive subjective comments.

This spin on quality (as more than eloquently expressed in the Zen & bikes book above) is not new. But it was not until a recent project that I realised the reason you can't quantify how good something is is because you can't count the number of good points... BUT you can count the bad points; moreover, you can count specific bad points. Say, the number of spelling mistakes in a text: divide this by the number of words and bingo, you have a spelling error rate (e.g. 1 spelling mistake per 100 words, or 5 spelling mistakes in a 500 word article).
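For what it's worth, counting that particular bad point is only a few lines of code. The sketch below assumes you already have a set of correctly spelt words to check against; the function name and the tiny word list are made up for illustration, and a real check would use a proper dictionary.

```python
import re


def spelling_error_rate(text, known_words):
    """Return (mistakes, total_words, mistakes_per_100_words)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    mistakes = sum(1 for w in words if w not in known_words)
    return mistakes, len(words), 100.0 * mistakes / len(words)


# Hypothetical word list and a ten word text with one mistake ("teh").
known_words = {"the", "cat", "sat", "on", "mat", "and"}
text = "The cat sat on teh mat and the cat sat"
mistakes, total, per_100 = spelling_error_rate(text, known_words)
```

One mistake in ten words comes out as a rate of 10 per 100 words. Note that the number says nothing about whether the text is any good, only how many of this one kind of bad point it has.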

So let us assume that we want to measure the quality of a text and decide to count the number of spelling mistakes and the number of grammatical errors. Let's say we now have an article which scores zero on those counts... does that make it a high quality article? Does it make the article inherently good? Interesting? Compulsive reading? The answer to that is obviously no. But then again, correct spelling and grammar are characteristics of high quality writing. That said, if an article is well written with respect to grammar and spelling, it probably means that the author has taken more than a passing interest in the subject, and so has a fair chance of grabbing your attention.

If we move on to code quality... told you this was related to Java. I know I'm on a losing footing if I talk about code quality, so I'm going to change the rules (sign of a cheat!!). For the sake of argument I will not call it "code quality", I will call it "code QQnty". QQnty is like quality, but is a lot more quantifiable. QQnty will have a value, or set of values, that you can point at and measure. You can take someone's product and say it has a QQnty of X, and someone else's product and say it is X+1. To judge if a piece of code is quality code is far too subjective; rather you will need to measure its QQnty. What you want to measure the QQnty against is an open question, but nonetheless, once defined, the QQnty of that code will be a measurable number. The reason for all this is that judging code quality is far harder than judging the quality of a piece of writing, or how good a film was, and that is for two reasons:
  1. Code developers are a mixed bunch of people, and the chances of any two of them agreeing on any subject, trivial or not, are about as slim as winning the lottery. It does happen, to someone, but I've never witnessed it first hand
  2. Code developers are generally quite passionate about their art (science). Anyone who may have even the slightest difference of opinion from themselves is a heretic. Heretics must die, or at least be argued with.
But there is one extra objective measure that can be performed with code... does the code work in the way it was intended (irrespective of the quality of the underlying code base)? If the answer to this is yes, then you are home and dry. So define QQnty as, say, the number of functional points (or requirements) minus the number of bugs since release, and now you have quantified the code.

Well, that is at least the theory. When you think a bit harder about it, QQnty will itself become a subjective measure, as different people will put different models together to measure what they want. What is interesting (to me anyway) is that it is extremely difficult to measure the good in a piece of code, but much easier to count bad points...
  • number of bugs (# bugs). This is certainly a quantifiable point; how bad each bug is may or may not be quantifiable, but each bug counts as one.
  • number of functional points achieved/failed (FPA/FPF). It is more usual to use the failed version as there are fewer to count. Again, how important each functional point is is subjective, but you can count it
  • number of lines of code (LOC). This is very misleading. A large code base may mean that the code is highly complex, but it also probably means it is inflexible, or difficult to maintain. Similarly it could be lots of lines of junk. However, it is a measure nonetheless.
  • number of passed/failed tests (# PT/ # FT). Again the negative is more common than the positive
With the exception of number of lines of code (LOC), the above is mostly counting poor/failed points. To my mind LOC is a necessary measure, but it should not be used as an absolute measure; it works best when used as a denominator

QQnty = total FP/LOC
This measure means that if you succeed in producing the total functional points in fewer LOC, then you do better!! This may lead to more code re-use (which should not add to LOC) and simpler code (one hopes)

Or

QQnty = total tests/LOC
This way the programmer can either increase total tests or reduce LOC, either of which would likely (though not certainly) improve quality

However, watch out for... QQnty = # bugs/LOC
This way, short of actually fixing bugs, the only way to improve (well, reduce) QQnty is to increase LOC
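The three ratios above are easy to compute, and doing so makes the trap in the last one obvious. This is only a sketch: the helper name and the project numbers are invented for illustration.

```python
def per_loc(count, loc):
    """Return count divided by lines of code, guarding against a zero LOC."""
    if loc <= 0:
        raise ValueError("LOC must be positive")
    return count / loc


# Hypothetical project: 40 functional points, 200 tests, 10 bugs, 1000 LOC.
fp_per_loc = per_loc(40, 1000)      # higher is better
tests_per_loc = per_loc(200, 1000)  # higher is better
bugs_per_loc = per_loc(10, 1000)    # lower is better

# The trap: pad the same code out to 2000 LOC without fixing a single bug.
padded_bugs_per_loc = per_loc(10, 2000)
```

The padded version halves the bugs per LOC figure while the code has got no better, whereas the same padding makes the other two ratios worse. That asymmetry is why LOC is safer as a denominator under something you want more of.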

This is the end of my splurge. Now you've reached the bottom of this text... you can go to bed/work/school satisfied you completed one task today (how good it is is subjective, but it still counts as one!!)