Thursday, 13 December 2007

Lies, damn lies and measurements

This post started with a chance meeting with a business colleague (how grown up, I'll be wearing a suit and tie next). His basic question was "could you give a talk on something?"... my reply was that the only thing I think you might like is my current line of thought: software measurements, quality etc.

OK, let's move on to where to measure the data. The answer is everywhere and anywhere it is produced. But specifically, let's look at unit and module tests. Unit tests are generally run very regularly, many times a day (especially on XP developments). One may use unit tests for profiling, running the occasional check to find which unit tests take the longest. But there is more information to be gleaned from them... if you care to log it. Imagine a database being populated in the following manner:
  1. Test suite resource utilisation: Continuously (well, every 5 minutes), the CPU, memory, network and disk utilisation of all the systems in the test suite is logged and the results automatically loaded into a database
  2. Test results: Every time the unit tests are run, the results are automatically uploaded to a database with date, time, package, build number, the time taken to run each test and the number of iterations or loops in the test (a sketch of this follows the list)
  3. Test suite configuration: On a weekly basis the test suite uploads its configuration: CPU type & speed, total memory, OS and version, patches, Java/C/C++/C# version, number of disks and so on
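To make item 2 concrete, here is roughly what I mean: a JUnit 4 RunListener that times each test and squirrels the result away over JDBC. The table, its columns, the build.no property and the connection URL are all made up for illustration; only the JUnit and JDBC calls are real.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import org.junit.runner.Description;
    import org.junit.runner.notification.RunListener;

    public class DbResultListener extends RunListener {
        private final Connection db;   // the metrics database
        private long startedAt;

        public DbResultListener() throws Exception {
            // Hypothetical connection string; point it at your metrics database.
            db = DriverManager.getConnection("jdbc:mysql://metrics-host/testdb");
        }

        public void testStarted(Description d) throws Exception {
            startedAt = System.currentTimeMillis();
        }

        public void testFinished(Description d) throws Exception {
            // Hypothetical table: test_results(run_at, test_name, build_no, millis)
            PreparedStatement ps = db.prepareStatement(
                "INSERT INTO test_results (run_at, test_name, build_no, millis) " +
                "VALUES (NOW(), ?, ?, ?)");
            ps.setString(1, d.getDisplayName());
            ps.setString(2, System.getProperty("build.no", "unknown"));
            ps.setLong(3, System.currentTimeMillis() - startedAt);
            ps.executeUpdate();
            ps.close();
        }
    }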
So now we have a database with not only the software metrics such as latencies, throughput etc, but also resource utilisation, and suite configurations. What could we do with this data? How do we make this data work for us?
  • Firstly we can track how each test is performing over time, from its first run till now
  • Secondly we can spot any step changes (either up or down) in the way a single test is performing
  • Lastly, we could pick some specific tests and use these as our KPIs: a standard unit of performance that future versions of the code can be compared against. This could be a single simple test or a whole string of system states
The above is merely a top-level analysis of the software metrics, albeit at a fairly detailed level. In essence the system is being profiled as you test it, and every time you test it. The next trick is correlation. This is slightly harder, as the resource utilisation data will probably be measured as a five-minute average, while the unit tests may only take 90 seconds or so to run. Nonetheless, in the five-minute period when the system is running the unit tests, the CPU, memory, disk utilisation etc. will all rise. Correlate this rise with the work going on, and you will start to see a pattern of what resources are being consumed as the system is being run (a toy sketch follows the list below).
  • Correlate software measurements such as throughput and latency with resource utilisation: CPU, disk IO etc.
  • Look at resource utilisation over time.
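As a toy illustration of the correlation step: once you have pulled two aligned series out of the database (say per-interval test throughput and the matching CPU utilisation samples), a plain Pearson correlation coefficient tells you how strongly they move together. The sample values in main() are invented.

    public class Correlate {
        // Pearson correlation coefficient of two equal-length series.
        static double pearson(double[] x, double[] y) {
            int n = x.length;
            double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i];
                sxx += x[i] * x[i]; syy += y[i] * y[i];
                sxy += x[i] * y[i];
            }
            double cov = sxy - sx * sy / n;
            double vx = sxx - sx * sx / n;
            double vy = syy - sy * sy / n;
            return cov / Math.sqrt(vx * vy);
        }

        public static void main(String[] args) {
            double[] throughput = { 120, 135, 90, 150, 110 }; // invented samples
            double[] cpu        = {  55,  60, 40,  70,  50 };
            // r close to 1 means the two rise and fall together
            System.out.println("r = " + pearson(throughput, cpu));
        }
    }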
Note that we are not striving to ensure each test's footprint is well known (though that is a perfectly good approach); the overall effect of the resources required to run the tests is sufficient. Whilst this is a gross simplification, it should be noted that working out the actual resource utilisation under specific load cases is really the job of validation and verification. I'm not suggesting that the V&V stages be replaced by a continuous monitoring and measurement process, merely that what happens in V&V is useful during the development phase of a project. As more data is gathered, it may well (actually should) transpire that the system's performance becomes a relatively well known and quantified value. This means that when it does become time to perform V&V, they have a head start, and may even have a good, well-established set of tools to work with.

So how would all this work on a day-to-day basis?

Bedding in: After an initial period of developing the infrastructure and tools, data would be collected. At first it will seem complicated, and far too detailed.
Gaining acceptance: Then after a while, the general data patterns will become recognisable, and some general rules of thumb may develop (e.g. each unit test adds about 1-2ms to the total test time, each unit test generates 20 IOs: 15 reads, 5 writes, and so on).
Acceptance: Then deeper patterns will start to emerge. What will also become apparent is when things start to differ. If a new unit test (note new, not existing) takes more than 5ms, look carefully. Or what about a unit test that used to take 1ms to run and is now consistently taking 3ms? Once these features are being spotted, the performance of the system is far more quantitative, and far less subjective.
Predictive: When the results are being used as engineering data. The performance of module x (as determined by one or more unit test results) is degrading; it will become a bottleneck; redesign activity is required. Or even: a better disk subsystem would improve the performance of the tests by 2/3 and mean the unit tests run 1 minute faster.
Historical: Imagine being able to give quantitative answers to questions like "Architecture have decided to re-arrange the system layout like so (insert diagram here); what will be the effect of this layout compared to the current one?". Usually that would require re-coding the system under the various architectures and performing tests, but all that is now required is to understand the average effects of each of the components, and a good engineering estimate can be arrived at. Taking this a step further, imagine "The system's performance indicates that the following architecture would be best (insert diagram here), and we/I believe that it would allow 20% more throughput".

Taking stock... Once upon a time tests were done infrequently, but it was realised that testing is the only sure-fire way of ensuring features remain stable and bug free. Reducing the time from bug introduction to bug detection is crucial. Why stop at bugs? Why not use the same approach for performance? When is it easiest to change the software architecture? At the beginning. When is it hardest? At the end. Why change software architecture? There are many reasons, but performance is one.

Lastly, the quantification of software performance is a key feature, because it is the only way to truly make engineering decisions. Developing software is full of contradictions and pros and cons; it is only by quantifying them that you can confidently take the decisions.

Thursday, 6 December 2007

How do you measure the good?

Well, a new week, a new topic. I've not finished with the Java stuff, but something else is nibbling at my consciousness. It is vaguely related to the previous posts. The question I'm going to spout about today is "How do you measure the good?"

Having read "Zen and the art of motor cycle maintenance" [and "Zen and the art of archery"] I'm of the opinion that you cannot easily measure or quantify good (or quality, or excellence, or angles). But we all seem to know what is good when we see it. That.. TV program is good, that dinner is good. That film is good. When you push someone to explain WHY it was so good, you get rather less than quantifiable answers it made it made me laugh, it was tasty, I enjoyed it. In essence not very objective quantities, but certainly positive subjective comments.

This spin on quality (as more than eloquently expressed in the Zen & bikes book above) is not new. But it was not until a recent project that I realised why you can't quantify how good something is: you can't count the number of good points... BUT you can count the bad points; moreover, you can count specific bad points. Take the number of spelling mistakes in a text, combine it with the number of words, and bingo, you have a spelling error rate (e.g. 1 spelling mistake per 100 words, or 5 spelling mistakes in a 500 word article).

So let us assume that we want to measure the quality of a text and decide to count the number of spelling mistakes and the number of grammatical errors. Let's say we now have an article which scores zero on both counts... does that make it a high quality article? Does it make the article inherently good? Interesting? Compulsive reading? The answer is obviously no. But then again these are characteristics of high quality writing. That said, if an article is well written with respect to grammar and spelling, it obviously means that the author has taken more than a passing interest in the subject, and so has a fair chance of grabbing your attention.

If we move on to code quality... told you this was related to Java. I know I'm on a losing footing if I talk about code quality, so I'm going to change the rules (sign of a cheat!!). For the sake of argument I will not call it "code quality", I will call it "code QQnty". QQnty is like quality, but is a lot more quantifiable. QQnty will have a value, or set of values, that you can point at and measure. You can take someone's product and say it has a QQnty of X, and someone else's product and say it is X+1. To judge whether a piece of code is quality code is far too subjective; instead you measure its QQnty. What you measure the QQnty against is an open question, but nonetheless, once defined, the QQnty of that code will be a measurable number. All this because judging code quality is far harder than judging the quality of a piece of writing, or how good a film was. The reason is twofold:
  1. Code developers are a mixed bunch of people, and the chances of any two of them agreeing on any subject, trivial or not, are about as slim as winning the lottery. It does happen, to someone, but I've never witnessed it first hand
  2. Code developers are generally quite passionate about their art (science). Anyone who may have even the slightest difference of opinion from themselves is a heretic. Heretics must die, or at least be argued with.
But there is one extra objective measure that can be performed on code... does the code work in the way it was intended (irrespective of the quality of the underlying code base)? If the answer is yes, then you are home and dry. So define QQnty as, say, the number of functional points (or requirements) achieved minus the number of bugs since release. Now you have quantified the code.

Well, that is at least the theory. When you think a bit harder about it, QQnty will itself become a subjective measure, as different people will put different models together to measure what they want. What is interesting (to me anyway) is that it is extremely difficult to measure the good in a piece of code, but much easier to count the bad points...
  • number of bugs (# bugs).. this is certainly a quantifiable point; how bad each bug is may or may not be quantifiable, but each bug counts as one.
  • number of functional points achieved/failed (FPA/FPF).. it is more usual to use the failed version as there are fewer to count. Again, how important each functional point is is subjective, but you can count it
  • number of lines of code (LOC).. this is very misleading. A large code base may mean that the code is highly complex, but it also probably means it is inflexible, or difficult to maintain. Similarly it could be lots of lines of junk. However, a measure nonetheless.
  • number of passed/failed tests (# PT/# FT).. again the negative is more common than the positive
With the exception of number of lines of code (LOC), the above is mostly counting poor/failed points. To my mind LOC is a necessary measure, but it should not be used as an absolute measure; it is best used as a denominator, for example:

QQnty = total FP/LOC
This measure means that if you succeed in producing the total functional points in fewer LOC, then you do better!! This may lead to more code re-use (which should not impact LOC) and simpler code (one hopes)

Or

QQnty = total tests/LOC
This way the programmer can either increase total tests or reduce LOC; either way, both would likely (though not certainly) improve quality

However, watch out for... QQnty = # bugs/LOC
This way, the easiest way to improve (well, reduce) QQnty is to increase LOC!
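To see the perverse incentive in code (the numbers are invented; the last two lines show how padding the code base "improves" # bugs/LOC while nothing real has changed):

    public class QQnty {
        static double byFunctionalPoints(int fp, int loc) { return (double) fp / loc; }    // higher is better
        static double byTests(int tests, int loc)         { return (double) tests / loc; } // higher is better
        static double byBugs(int bugs, int loc)           { return (double) bugs / loc; }  // lower is better

        public static void main(String[] args) {
            // Same product, before and after padding it with 5,000 junk lines:
            System.out.println(byBugs(40, 10000)); // 0.004
            System.out.println(byBugs(40, 15000)); // ~0.00267 -- "better", yet no bug was fixed
        }
    }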

This is the end of my splurge. Now you've reached the bottom of this text... you can go to bed/work/school satisfied you have completed one task today (how good it was is subjective, but it still counts as one!!)

Monday, 5 November 2007

The Journey ... caught in a trap

The last exciting episode left me on the verge of discovering new vistas in the realm of Yahtzee. I had successfully started to breed slightly better solutions. This got me excited. I like to play Yahtzee for fun, but agonise over the best strategy at various points in the game. Now I need never play another game of Yahtzee, I'll just let my trusty computer do it for me. Get the man out of the loop, automate the whole process. In effect make myself redundant in the fun department. Yep, I need never have fun again; now that's progress. Only a true engineer/scientist would truly embrace this sort of human obsolescence.

One of the driving forces behind doing this was to see if the computer would be able to answer this question for me. I did tentatively try a pen and paper analytical approach. A small tip for anyone who would like to pursue this further than I was able to: if you come at the problem quickly from behind, you can distract it with the pen and bash it with the paper. This is why it is advisable to have lots of paper for these types of problems; a hefty weight of paper can stun or even knock out problems of this sort. But be warned, problems have a habit of striking back!!

Back to the analysis stuff.
  1. Should you be aggressive from the start? If you have 2x6, 2x1 and a 3, is it better to be optimistic and hold the 6s, or play defensively and go for the 1s?
  2. Is it ever right to give up on an attempt, hold nothing, and just re-roll all 5 dice?
  3. If a game is going badly, is it worth logging a poor score against Yahtzee? You are unlikely to get a Yahtzee anyway, and logging it in another slot means you lose the chance of achieving that score.
  4. If you do stick to never filling the Yahtzee slot with a non-Yahtzee roll, when logging the score in a less than optimal slot, do you log it in the highest scoring (least likely) slot (say large straight) or in a low scoring slot (say 1s) where you might be able to make up the difference later in the game?
These questions (even after full discussions) were never answered satisfactorily or conclusively. It always boiled down to... do you feel lucky... punk. This is semi-ironic: the whole point of the game is to see how lucky you can be, what your best score is, the maximum number of Yahtzees etc. The reality is the game shows how unlucky you are, and it is usually... you're nothing special, and sometimes you're really poor... Do lottery winners get excellent scores when they play Yahtzee? Will a triple rollover winner get 10 Yahtzees and have a score in the 1,000 point range? Is there a fixed amount of luck in the universe? If so, why do the lottery winners get so much? Or is it that you have a fixed amount of luck for your life, and it is a case of "luck management"? The luckiest people actually succeed on the same amount of luck as other people, so it is really a case of each individual managing the luck they are given? If so, should companies hire lucky people, as they are actually successful "good luck managers", and it is probable that they will be good at other styles of management? If so, does hiring them decrement their luck total? ... Time to stop here... you probably get the drift.

So my basic questions about Yahtzee were strategy based. My attempts to write a genetic algorithm to solve the above have been described in earlier posts. The fundamental thing is... the methods I've used (up to now) will not be any use in answering the questions!!! Why? Because:
  1. My maps/genes to play the game took no account of which roll one was on. This means you cannot re-roll (otherwise you would end up in an infinite loop, and never finish the game)
  2. If you make no differentiation between the first and third roll, you cannot truly say you are throwing away points, as you may be on the first roll being optimistic (e.g. with 2x6 you hold the 6s going for 3x6s, 4x6s or 5x6s, whereas logging 2x6s against 6s is a different matter).
  3. It was a gross over-simplification to only have one map/gene for the game (but it sort of worked!!)
So after chastising myself, and realising how far from the original motivation I had strayed, I decided to use 3 maps/genes for each attempt: one map/gene for each roll. There are a couple of problems though...
  1. Using a genetic algorithm to search a 20x18 space is a large problem, in the order of 20x18! (that contains lots of zeros). Now I had tripled this space. It took a long time previously; now it will take even longer
  2. The code I had written was expressly written with the assumption that there was only one map... I had to unpick the code to accommodate the new maps. This is proof that my Java design was poor (I can at least say I'm a beginner at Java... but it is a fair criticism)
As such, I have adapted the code to read 3 maps (an array of 3 maps/genes, one for each roll). But alas, I was not able to insert a "roll again" command in the first two maps/genes. The code is too complicated for me to unpick at the moment... Possibly time to re-design the code and review it. Something like the sketch below is what I have in mind.
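For what it's worth, this is the sort of structure I mean. It is my sketch, not the actual code; the idea that a map is a ranked list of 18 response codes for each of 20 dice-pattern categories comes from the earlier posts, and the names are hypothetical.

    public class PlayerGenes {
        // rollMaps[roll][category] is the ranked response list for roll 0, 1 or 2.
        private final int[][][] rollMaps = new int[3][20][18];

        // Pick the top-ranked response for the current roll and dice pattern.
        int respond(int roll, int category) {
            return rollMaps[roll][category][0];
        }

        // A "roll again" instruction would be a 19th response code, legal only
        // in the first two maps -- the bit I have not yet managed to wire in.
    }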

This is my trap: can I re-write the code, plus unit tests, to reliably re-structure (re-factor) my code? Or should I continue on my current path with less than perfect maps/genes (in that they do not have a "roll again" genome/instruction)?

Well, that is Tim inTheBlogOSphere signing out. Boldly going nowhere fast, but finding all trains to nowhere are delayed or cancelled.

Saturday, 3 November 2007

The Journey Continues .. to Darwin and beyond

Where did I get to... Hmmm, I remember. Darwin & Yahtzee. Well, I concluded my genetic algorithm for playing Yahtzee was not giving good results. My first thought was "damn, blast & pooh!!", my next was... "what did you expect". I had a sudden rush of "Darwinism is sort of ridiculous". How can random shuffling + pruning the worst lead to an optimum solution? It's way too simple... It needs to have more subtlety, elegance... or some confusion... So my previous reasons for this were:
  1. Not enough games being played, so the actual game statistics are too unreliable. In other words you can't tell the difference between lucky games and good maps (genes).
  2. Not enough generations.. well I suppose we took millions of years to evolve
  3. Not enough mixing of the maps (genes)
  4. Expectations too high. I was rather hoping that this approach would yield a good set of results. In fact I was half expecting someone to say.. of course this sequence is correct because.. insert some mumbo jumbo here.. If you were to ask the same person beforehand, you would get a very non-committal reply. This is the difference between doers and thinkers. Sometimes doers think and do, whereas thinkers just think, and usually of reasons why inaction is the best course of action (or is it inaction).
I think 3 & 4 are probably right. After some digging with my colleague (brain size of planet sort of guy). He educated me (well, in a nice way said, I had the brain the size of a peanut). A summary of what he said is
  1. My generation breeding technique was unique, in his experience (translation, wrong!) the pruning was far too sever, usually 40%ish makes it through to next generation.
  2. Because of this I was probably only able to "hill climb" to a local maximum value.
His suggestion was to:
  1. Select the best 40% (there are a couple of methods for this)
  2. Mutate 1% of the population
  3. Breed the remaining 59% from randomly selected parents in the population
What this does, compared to my previous method, is provide far more random mixing and keep a few poor maps/genes in the mix. This should mean a better chance of jumping from one local hill (maximum) to the next set of hills/mountains (maxima). A sketch of this generation step is below.
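Here is a sketch of that generation step. Gene, fitness(), mutate() and breed() are stand-ins for whatever representation you use (I'm also assuming parents are drawn from the surviving 40%, which the suggestion leaves open):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    interface Gene {
        double fitness();                    // average score over N games
        Gene mutate();                       // return a randomly tweaked copy
        Gene breed(Gene other, Random rnd);  // cross-over with another gene
    }

    public class Generation {
        static List<Gene> next(List<Gene> pop, Random rnd) {
            List<Gene> sorted = new ArrayList<Gene>(pop);
            Collections.sort(sorted, new Comparator<Gene>() {
                public int compare(Gene a, Gene b) {
                    return Double.compare(b.fitness(), a.fitness()); // best first
                }
            });
            int keep = (int) (pop.size() * 0.40);             // 1. best 40% survive
            List<Gene> next = new ArrayList<Gene>(sorted.subList(0, keep));
            int mutants = Math.max(1, (int) (pop.size() * 0.01));
            for (int i = 0; i < mutants; i++) {               // 2. mutate ~1%
                next.add(next.get(rnd.nextInt(keep)).mutate());
            }
            while (next.size() < pop.size()) {                // 3. breed the rest
                Gene mum = sorted.get(rnd.nextInt(keep));
                Gene dad = sorted.get(rnd.nextInt(keep));
                next.add(mum.breed(dad, rnd));
            }
            return next;
        }
    }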

This basically means, in Darwin terms, you can't be too selective. If you are, you risk settling on less than optimum values (breeds), as you end up not being able to cross the valley (because you throw away results that are too far from where you are, your hill top). To put this a different way, if you breed too selectively, it is likely that the resulting population will contain defects. Isn't this what dog breeders do!!! No wonder some animal activists are up in arms about breeding and selection. I did not realise!!! (OK... this looks strange. I must have my geek hat on today. I don't believe people when they say "selective breeding is bad", but am quite convinced by running an abstract computer simulation of a game of chance, and bingo, I see the light... hmmm, too much beer tonight)

Objective 1: use the above method to breed from totally randomly generated game rules (see previous blog).

Objective 2: use the above method but seeded with the best maps I have achieved so far.

My main problem is that these missions will eat severely into my free time. Kids need attention, wife needs compliments, work needs attending to. Hey ho, who said this was going to be simple? When the results are in I'll publish a few more...

Result of objective 2: I did some tests and it does look like things are getting better. I used a population of 101, and each gene/map was given 100 games to prove its worth (probably too few, but 100 games takes about 1 second). I ran this for 20 generations, which takes 30 minutes or so. I then took the best map and ran it for 100,000 games (to get a really good statistical average), which takes about 15 minutes. This really shows what the map/genes are made of. Results for 100,000 games:
Max score 720, max number of Yahtzees 5, average score 218, top 1% score 477, bottom 1% 105, average number of Yahtzees per game 0.48. The game logic takes 25% of the computer's time and the response (interpreting the map/gene) takes 75%.

Pretty good results. Let's take a look at how the system converged...
What you can see here is all the results per generation, plotted at each generation. (Generations 0-6 disappeared, never to be seen again... probably a good explanation for this, but I suspect a higher intelligence has whipped them from my computer files; they may appear on some remote desktop somewhere at some point in the future... let's get back to the graph.) Where was I? Oh yeah: generation 7 is essentially 3 groups of solutions, the top one around a 215 average score, the next at 175, and the lowest at 90 ish. The lowest one disappears and becomes extinct by generation 12; no way for me to know whether this group migrated to one of the other two solutions or simply died out. The other two remain until the last generation, but the 175 solution set thins out (like my hair). This leaves the biggest set, which not only grows in size but whose best value increases to 235 ish. This is the map/gene I used for the 100,000 games run.

The astute reader (that means you!!) will realise that the above set of runs topped out at a 235 average score, but the 100,000-game run ended up at a 218 average. Well, I'm putting this down to statistical variation. The sceptical reader (at the back, wearing the black tee shirt and very dirty shoes) may suggest poor Java programming or sloppy organisation. I must confess that may be true as well, but I'm trying to put the kids to bed, chasing them with a toothbrush saying "brush your teeth to keep the tooth-decay gnome away" (evil uncle of the tooth fairy), while doing this.

More results as and when they come in... I might give up on objective 1... objective 2 is more fruitful than I thought...

Boldly going where no sane Java programmer would dare admit. Continually repeating "your mistakes will guide you to a safer harbour than fair winds and a good navigator ever would".

Tim inTheBlogOSphere signing out.

Sunday, 28 October 2007

The Journey - Java - Yahtzee - Darwin

Well, I decided to create my own blog. The title is "The Journey - Java - Yahtzee - Darwin", but only because I could not think of anything more useful to put in its place.

The root of this is that I wanted to learn some of the basics of programming in Java. The problem with this is that it seems everyone who knows Java has their head up their a** and will confuse the hell out of you with excessively grand terms (inheritance, polymorphism). If you ask... what is polymorphism... the reply is akin to a teacher looking down at a pupil and saying "you can look these things up... that question is beneath my intellectual powers". I NOW translate the above to mean: "This is a fundamental question to which I should know the answer, but I either don't know it or am too limited in my powers of expression to give it to you".

That said, I decided to write a game of Yahtzee in Java (there are loads of sites that explain Yahtzee, and there are some handheld electronic games; these should come with an addiction warning... I've worn one down already). Basically Yahtzee is a simple game, relatively easy to understand. The tactics are not so easy. You can play defensively and get an average score while throwing away some opportunities, or be optimistic, which occasionally pays off (my top score is 620; the average is in the region of 200-250).

Objectives: Write Yahtzee program in Java. Learn Java in the process.

Simple objectives... The big hurdle is the learning Java bit. But actually this was not as hard as I thought. I am experienced in writing programs (I started life as a FORTRAN developer, then used C shell, awk, Bourne shell, and Perl). All of these were process based and Java is object oriented (this may not mean much, but Perl, C shell etc. are based on writing modules that "do" things or process data, whereas Java is based on modelling objects or data; in reality both need to do both, it is just where the focus is). From my discussions with various people it should have been a totally different experience. My attitude is that learning Java can't be too hard (or not as complicated as Java aficionados make out) as people do use it; even 12 year old children write Java. (That said, 12 year old children also use mobile phone features that I can only dream of... hey ho.) In essence I bought a book, Learn Java in 60 Minutes... I practiced some of the exercises and bingo, I can now do a bit of Java (I would not say I'm the best programmer, but I can usually get things done in it).

The Yahtzee program was really much easier than I thought. Because of the modular nature of Java programs I simply broke the process down into natural chunks: a module to throw the dice; a module to hold the game state or position (score, bonuses, what was free, how many turns were left etc.); outputting the result. To ensure that I had not introduced any bugs I was able to use JUnit tests (the J stands for Java; a unit test is a type of test that focuses on a single feature or function). I really wish I had had a unit testing framework back when I was developing. Back then (wheels weren't round, hills were young, you get the idea) unit testing was not considered useful. You merely said that the program worked and that was enough. You did unit testing, but hid it from your colleagues as "irrelevant" data. As such it was very much a hit and miss affair as to how good your code was, or even whether you had broken someone else's function. It is very easy to edit someone else's code and make it do what you think it should do, and in the process break someone else's functions. This is why you should write lots of unit tests, and they should be run regularly and maintained as part of the code. That way, if you do break someone else's functions, at least you know what you've done. Upshot: code written, works, and continues to do so.
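For illustration, a test looks something like this. The scoring function here is a hypothetical stand-in for my actual scoring module (it only does the upper section), but the JUnit mechanics are real.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class ScorerTest {
        // Stand-in scorer: total of all dice showing the given face (upper section).
        static int upperScore(int[] dice, int face) {
            int total = 0;
            for (int d : dice) if (d == face) total += face;
            return total;
        }

        @Test
        public void twoFoursScoreEight() {
            assertEquals(8, upperScore(new int[] {4, 4, 1, 3, 6}, 4));
        }

        @Test
        public void noSixesScoreZero() {
            assertEquals(0, upperScore(new int[] {1, 2, 3, 4, 5}, 6));
        }
    }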

Success: Yahtzee code written, learned some Java (enough to write a Yahtzee program)

Result: now I had the program, I was curious whether the computer could play it. If so, it would play really fast and get really high scores... loads of Yahtzees...

Objective 2: write a module that will play Yahtzee very quickly.

I think this is very Douglas Adams (or Douglas Adams might have found it amusing enough to comment/write about). Can I write a program such that I am no longer required? It will do everything I can do, far quicker. I will become redundant... Is this a good or bad thing?

So writing the module that will play Yahtzee was interesting. Basically, it does not need to be very intelligent, because the computer I was using was able to play about 150-200 games per second; what you lose in intelligence you can make up for in sheer number of games. That said, you do need to write something with some intelligence, such that you have some chance of it giving genuine results.

Attempt1: "Max score" was my first algorithm. It was simplicity itself. Basically
  1. look at the dice & score it against all the different options
  2. Hold the highest scoring combination
  3. Use all 3 attempts to improve score.
  4. The logic was the same no matter if it was the first throw or the last throw.
As an example if attempt 1 was to throw 4,4,1,3,6, you would score 8 against 4s, 1 3 & 6 against 1s, 3s, and 6s respectively, zero against everything else. So hold 4 & 4. This may yield another 4 (12 against 4s) or it may yield 12344, which is a small straight. If so the system would hold 1234 & un-hold the last 4. The last throw would essentially mean you could get either a small straight or a large straight. This seemed to be a good algorithm, except when you realise that if you roll 1,1,3,4,6, it will go for 6s and not 1s. Time to improve the system...
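Here is a sketch of the upper-section part of that rule (the names are mine, not the original code). It reproduces both the good case above (4,4,1,3,6 holds the 4s) and the flaw (1,1,3,4,6 goes for the single 6).

    public class MaxScore {
        // Decide which dice to hold: score each face, keep all dice of the best face.
        static boolean[] holds(int[] dice) {
            int bestFace = 1, bestScore = 0;
            for (int face = 1; face <= 6; face++) {
                int score = 0;
                for (int d : dice) if (d == face) score += face; // upper-section score
                if (score > bestScore) { bestScore = score; bestFace = face; }
            }
            boolean[] hold = new boolean[dice.length];
            for (int i = 0; i < dice.length; i++) hold[i] = (dice[i] == bestFace);
            return hold;
        }
    }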

Now for some stats on the system. Because the logic was so simple it could play many games very quickly (263 games per second, with logging turned off). I managed 9,000,000 games in under 10 hours (run overnight). My best score was 835 with 6 Yahtzees. The average score was 207, with a top 1% of 445 and a bottom 1% of 113. It also managed an average of 0.39 Yahtzees per game. Not too bad. I reckon my average manual score is about 200, and I certainly don't get anywhere near 0.39 Yahtzees per game (admittedly the computer's figure is measured over millions of games). For game logic vs game response, 51% of the time was spent computing the response and 49% on game logic.

Attempt 2: Count good dice. Well, the above was a good start, but I could see that the max score method would not yield a good strategy in some fairly simple scenarios. A different type of approach was needed. I basically wanted to solve the problem of two 1s being ignored in favour of a single higher number. To this end it seemed sensible to simply count the number of good dice: so 11234 would score 2/5 for 1s, 1/5 for 2s, 3s and 4s, 4/4 for the small straight and 4/5 for the large straight. What was also needed was a way of ordering them, so that with 11366 it would go for 6s in preference to 1s; similarly 11111 would be logged under Yahtzee in preference to 1s, 3 of a kind, 4 of a kind etc. The way I did this was to assign a weighting to each score (1s, 2s etc.). My basic premise was to fill the 1s, 2s, 3s before the 3 of a kind, fill 4 of a kind in preference to 3 of a kind, and so on. The weighting looked like this, where count is the number of good dice:
1s, 2s, 3s .. 4x(count-2)
3 of a kind 3x(count-2)
4 of a kind 7x(count-3)
full house 4x(count-4)
small straight 3x(count-3)
large straight 7x(count-4)
yahtzee 45x(count-4)-30
chance -1
The above was done the way it was after ranking some of the responses by hand. I did not go into the details of all permutations, just enough to get some weighting factors. In code the weighting might look like the sketch below.
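Directly as code (I am assuming the 1s/2s/3s formula also applies to 4s, 5s and 6s, which the table above does not spell out; count is the number of good dice for the slot in question):

    public class Weights {
        static int weight(String slot, int count) {
            if (slot.length() == 2 && slot.endsWith("s")) // "1s" .. "6s"
                return 4 * (count - 2);
            if (slot.equals("threeOfAKind"))  return 3 * (count - 2);
            if (slot.equals("fourOfAKind"))   return 7 * (count - 3);
            if (slot.equals("fullHouse"))     return 4 * (count - 4);
            if (slot.equals("smallStraight")) return 3 * (count - 3);
            if (slot.equals("largeStraight")) return 7 * (count - 4);
            if (slot.equals("yahtzee"))       return 45 * (count - 4) - 30;
            return -1; // chance
        }
    }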

There are some anomalies. The first is that if you have already got a Yahtzee and you are lucky enough to roll a second one, you should really make the most of it: log it under the highest scoring slot you can (usually large straight) and so on. However, this method would use 4 of a kind next, then whatever the number was (say 1s), then 3 of a kind, full house, etc., with small straight and large straight very low down. The next major anomaly was that it was difficult to tune the weighting for a specific logic; the weighting was arrived at using only a few examples. The last problem was the same as with the first solution: the system did not differentiate between the first and last throw.

What happened from here was that I spotted that the system would log the numbers in an order of preference, a ranking if you like. As such I used this as a unit test. From this I could look at which order it would log the dice under for a specific pattern.

On the game statistics side of things: because the game logic was more complex, the system slowed down to 136 games per second with logging off (and 74 with logging on). 1,000,000 games were played in just over two hours. I stopped running many millions of games... getting lazy. The max score was 927 with 7 Yahtzees. Average score was 225, top 1% was 475 and bottom 1% was 106. The system managed 0.49 Yahtzees per game. I think this was very good scoring. For game logic vs game response, it spent 75% of its time calculating responses and 25% doing game logic. This is why the speed dropped so much.

Attempt 3: Rank order response. From the unit testing of count good dice I realised that a way of scoring was to order the responses. I then realised that the algorithm could simply be a ranked order sequence of what to go for when certain patterns were seen. All I had to do was generate the patterns and set the order. I already had a good head start on what the orders might be from count good dice.

The details of this method are even more complex. However, I did not spend too much time on this phase as it was quite plain that I could use genetic algorithms to breed the best solution, and hence my hand-crafted logic would become redundant.

As such, more work is required in this phase to optimise the responses and run the tests. The only ordering I did produce was a reverse-engineering of the count good dice response. There ARE differences between the two, so the results will be different. I basically categorised all patterns into one of 20 categories, the easiest of which are things like five of the same number etc. (if anyone wants details of the 20 categories, post a comment and I will gladly oblige). Broadly, the categories split into 5 groups based on 5 of a kind, 4 of a kind, 3 of a kind, 2 of a kind and 1 of a kind (runs); 3 of a kind and 2 of a kind are the most complex sub-categories.

On the game statistics side of things: there are two factors here, the fact that everything is passed around as strings (and computers do not like strings), and the complexity of the logic. The upshot is that the system slowed down to 100 games per second with logging off (no results for logging on). 100,000 games were played in 16 minutes. I stopped running many millions of games... getting lazier. The max score was 698 with 5 Yahtzees. Average score was 215, top 1% was 471 and bottom 1% was 98. The system managed 0.49 Yahtzees per game. I think this was pretty good scoring. For game logic vs game response, it spent 82% of its time calculating responses and 18% doing game logic. This is why the speed dropped (again).

TODO: get a better ranking and a higher score. Run the game for 1,000,000 games and stop being lazy...

The age of Darwin beckons, or genetic algorithms... The last attempt was really a proof of concept. It showed that the game can be broken down into a map, and this map will control the game logic. All that was required was for a genetic algorithm to optimise the whole map. Just a couple of minor problems... I don't know about genetic algorithms... and the learning curve for Java never seems to stop (what is a shallow copy of an array!!). Hey ho... I wonder if 12 year old children know about genetic algorithms. They certainly are a product of them.
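For the record, since it bit me: in Java, clone() on a 2-D array is a shallow copy. You get a new outer array, but both copies share the same inner row arrays, so you must clone each row yourself for a deep(er) copy.

    import java.util.Arrays;

    public class ShallowCopy {
        public static void main(String[] args) {
            int[][] original = { {1, 2}, {3, 4} };
            int[][] shallow = original.clone();   // new outer array, shared rows
            shallow[0][0] = 99;
            System.out.println(Arrays.toString(original[0])); // [99, 2] -- oops!

            int[][] deep = new int[original.length][];
            for (int i = 0; i < original.length; i++)
                deep[i] = original[i].clone();    // copy each row individually
        }
    }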

On the genetic algorithm front I actually sit next to a chap at work who is an expert in computer algorithms. So after a conversation with him I felt sufficiently well armed that I could at least give the gene thing a go. Also, Wikipedia has a few articles on it. I think there is some philosophical statement along the lines of... just do it...

My approach..
  1. Generate 4 random parents
  2. From these 4 parents breed 12 more children (two parents can breed two children)
  3. Run these 16 and keep the top 4 scorers (average score over 1000 games)
  4. Repeat steps 2 & 3 for 50 generations or so
Some details..
Generate 4 random parents... not too hard. I have 20 categories, and each category has 18 possible responses (13 are the 1s, 2s, 3s, full house etc., and 5 others which are the 1st, 2nd, 3rd... 5th most frequent numbers).

The breeding phase. I took what is called "one point cross-over", which basically means I pick a random number from 1-18 (corresponding to the 18 responses) and split each of the 20 categories at this point. The father donates up to this point, and the mother's sequence fills in the rest. This is why you breed father-mother and then mother-father. I do not use the same random number for both breedings (perhaps I should). My reading of this in code is sketched below.
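My reading, sketched in code (assuming each category's 18 response codes are numbered 0-17 and each category must remain a permutation, so the mother's contribution skips codes the father has already donated; swap the arguments to get the second child):

    import java.util.Random;

    public class CrossOver {
        // One-point cross-over of two 20x18 maps of ranked response codes.
        static int[][] breed(int[][] father, int[][] mother, Random rnd) {
            int k = 1 + rnd.nextInt(18);                   // cut point, 1..18
            int[][] child = new int[20][18];
            for (int cat = 0; cat < 20; cat++) {
                boolean[] taken = new boolean[18];
                for (int i = 0; i < k; i++) {              // father donates up to the cut
                    child[cat][i] = father[cat][i];
                    taken[child[cat][i]] = true;
                }
                int pos = k;
                for (int i = 0; i < 18 && pos < 18; i++) { // mother fills in the rest
                    int r = mother[cat][i];
                    if (!taken[r]) { child[cat][pos++] = r; taken[r] = true; }
                }
            }
            return child;
        }
    }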

Top scores: not too much to say here, simply pick the top 4.

Problems..
Well, the above is pretty rubbish at breeding anything approaching an optimal solution. Even inserting the best solution from attempt 3 (rank order response, which was based on counting good dice) does not improve things. Why is this?
  1. Not enough games being played, so the actual game statistics are too unreliable. In other words you can't tell the difference between lucky games and good maps.
  2. Not enough generations.. well I suppose we took millions of years to evolve
  3. Not enough mixing of the genes
  4. expectations too high. I was rather hoping that this approach would yield a good set of results. In fact I was half expecting someone to say .. of course this sequence is correct because .. insert some mumbo jumbo here .. If you were to ask the same person before hand, you would get a very non committal reply. This is the difference between doers and thinkers. Sometimes doers think and do, where as thinkers just think, and usually of reasons of why inaction is the best course of action (or is it inaction).