This is a piece I wrote for the Irish Times last year looking at the Predicted Grades
Software tends to get caught up in jargon: artificial intelligence, machine learning, algorithms. At its simplest, an algorithm is a set of instructions, or a recipe, that a computer follows. Computers are stupid and have to be told exactly what to do and what not to do. Your shampoo bottle contains an algorithm: wash hair, rinse, repeat. A robot following these instructions would use all the shampoo, as the ‘repeat’ instruction means that it would keep repeating the earlier steps until all the shampoo was gone. Programmers often talk about 90% of a developer’s work being defensive: stopping bad things from happening. Stopping the robot from emptying the shampoo bottle.
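For readers who like to see it written down, here is a minimal sketch of the shampoo robot in Python. The bottle size, the dose and the `max_washes` limit are all numbers invented for illustration. Without the limit the robot follows "repeat" literally; with it, the defensive check stops the loop early.

```python
def wash_hair(shampoo_ml, dose_ml=10, max_washes=None):
    """Follow 'wash, rinse, repeat' and report (washes done, shampoo left).

    All quantities are made up for illustration. max_washes=None means
    the robot takes 'repeat' literally and stops only when the bottle
    cannot supply another full dose.
    """
    washes = 0
    while shampoo_ml >= dose_ml:
        # Defensive check: stop once the sensible limit has been reached.
        if max_washes is not None and washes >= max_washes:
            break
        shampoo_ml -= dose_ml  # wash, then rinse
        washes += 1            # ...and go around again ("repeat")
    return washes, shampoo_ml


print(wash_hair(250))                # literal robot: (25, 0)
print(wash_hair(250, max_washes=2))  # defensive robot: (2, 230)
```

The defensive part is a single extra check, but somebody has to think of it, write it and test it.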
The ideal way to build software is to write out the detailed list of things that the computer is to do, then turn this into code, then rigorously test it to make sure it really does what you think it needs to do (spacecraft have crashed because a few lines of code weren’t written or tested properly). That means a lot of test planning, test cases and expected results to check the models against, using people who are expert in testing. This isn’t trivial, and it works well when you know exactly what you need to do. The HSE’s Covid app is a good example of this done well (massively oversimplifying here: ping phones within a few meters, note the codes from those phones every 15 minutes and, if someone develops Covid, use the app to notify their close contacts).
The problem comes when you don’t know exactly what you want and you spend time figuring it out. If you’re painting a wall you’ll usually get a few tester pots, decide on a colour and paint. And if you change your mind later you need to repaint the whole wall again. This appears to be partially what happened with the Leaving Cert Calculated Grades. There was a desire for the distribution of grades this year to be similar to previous years.
To get this result they created at least 20 different models, some with multiple variations, to try to create a broadly similar pattern of results to previous years.
The process was designed to keep the spread of grades close to previous years (similar numbers of H1s, H2s, H3s etc.) while keeping the system fair. This feels a bit like painting one wall 20 times to try to get its colour as close as possible to the colour of another wall (the grades from previous years). In addition, “In light of public disquiet” in other countries, they removed previous school history from the models late in the process. This made the grade matching even harder statistically. This level of change in software development projects under severe time constraints frequently leads to problems.
It is clear that a huge amount of work went into the calculated grades process over a relatively short period of time. In the documentation, the expert group recognises that “statistical prediction models are inherently biased.” I suspect that because of this there may have been a focus on testing the overall model, comparing overall results to previous years, without a detailed testing focus on either schools or individual students. Given that the overall output looked broadly correct, the specific error in the code was missed and may never have been tested for. As far as I can tell, there is no discussion in any documentation of how the models were tested other than by reference to overall comparison to the grade distribution in previous years.
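To make the point about overall versus detailed testing concrete, here is a small sketch in Python. The schools, students and grades are entirely invented, and this is not the calculated grades code or the actual error; it only shows how a check on the overall grade distribution can pass while every school's results have shifted.

```python
from collections import Counter

# Invented data: one (school, grade) pair per student.
previous = [("A", "H1"), ("A", "H2"), ("A", "H3"),
            ("B", "H1"), ("B", "H2"), ("B", "H3")]
current = [("A", "H1"), ("A", "H1"), ("A", "H2"),
           ("B", "H2"), ("B", "H3"), ("B", "H3")]


def overall_distribution(results):
    """Count how many students got each grade, ignoring school."""
    return Counter(grade for _, grade in results)


def distribution_by_school(results):
    """Count grades separately for each school."""
    per_school = {}
    for school, grade in results:
        per_school.setdefault(school, Counter())[grade] += 1
    return per_school


def schools_that_changed(prev, curr):
    """List schools whose grade counts differ between the two years."""
    p, c = distribution_by_school(prev), distribution_by_school(curr)
    return sorted(s for s in p if p[s] != c.get(s))


# The overall check passes: two H1s, two H2s and two H3s in both years.
print(overall_distribution(previous) == overall_distribution(current))
# The per-school check does not: school A moved up and school B moved down.
print(schools_that_changed(previous, current))
```

Real school results move around from year to year, so a real check would need a tolerance rather than exact equality. The point is only that the second kind of test has to exist at all.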
At this point there are lots of questions that need to be addressed. Why were so many models run? Did anyone notice the problems experienced by specific schools, with extreme levels of downgrades in some schools compared to others? Why was this specific problem discovered now? Is this tied to the cases currently before the courts? How were the models tested and validated originally? Were external teams and testing experts used to support code review and quality checking of the code? Why weren’t the overall algorithms, the models and the assumptions in these models published? In contrast, the HSE Covid app has set a gold standard globally for clear, open-sourced code that is being shared and used internationally.
Given the role of the Leaving Cert in Irish society and its impact on a key life event for thousands of Irish students, these questions, which go well beyond a single error, urgently need to be answered.
Dermot Casey is an Innovation and Technology expert with over 25 years’ experience, from developer to CTO to investor and advisor, working across multinationals and early-stage technology companies as well as lecturing on Strategy, Technology and Innovation. @dermotcasey