Friday, 16 August 2013

Bayesian testing

If you are starting here, I recommend that you go to the next entry down, or use this link: Bayesian statistics. That will give you some background.

Done that?  Fine, now back to the education examples that I started there.  In the mid-1970s, I was working on making mastery learning possible, by constructing mastery tests.  Now neither mastery learning nor mastery tests could be called new: responding to the push for mastery learning, E. F. Lindquist had written of "mastery tests" in 1936, pointing out that they differed from achievement tests. (See E. F. Lindquist, 'The theory of test construction' in Hawkes, H. E., Lindquist, E. F. and Mann, C. R. (eds.) The Construction and Use of Achievement Examinations: a manual for secondary teachers. Cambridge, Mass.: The Riverside Press, Houghton Mifflin, 1936, page 36.)

One of the things I learned quite early in my working life was that many of my colleagues were careerists. These were people who would stop at nothing to enhance their careers. One future (now dead) Director-General of Education in New South Wales even went so far as to call in journalists to show them the work of a dying man, claiming it as his own.  That is how the great and powerful rise.

Others were more subtle, taking old ideas and giving them a new name, so they could say with a simulacrum of truth "I introduced autorical hermeneutics into the curriculum, you know."

As a rule, the reintroductions come around about every 40 years, and I suspect this is because two-score years sees all the old hands who had experienced the idea before swept off into retirement.  For complex reasons, I avoided that.  I used the terms "mastery learning" and "mastery testing", but I concentrated on the mastery testing, because my skills lay there, and because I knew that the 1936 attempts to use mastery learning had fallen through because there were no mastery tests.

I worked as a classroom teacher in the bureaucracy, and I had little time for daffy theoreticians. A few years earlier, a Master's degree seminar was ear-bashed endlessly about an obscure Dutch philosopher and his ideas.  At the end of 110 minutes, the din ceased, and we sat stunned.  Then somebody (me, actually) asked of the presenter: "What exactly does your philosopher have to say to me about managing 2S7 (that was the lowest science class in a streamed Year 8) for a double period at the end of Friday afternoon?"

The plugger had no answer, because he had never, contrary to the by-laws for the Master of Education, taught in a school.  He was apparently deemed too bright to be put through that mill, but to practising teachers he was, in the parlance of those days, as useful as a screen door on a submarine.

Bottom line: bright-eyed theorists don't impress me unless their ideas are firmly based in practice.  Keep that in mind in what follows.

As I came to the end of my master's studies, I needed to write something up, and started researching a "long essay" of 15,000 words on the mastery testing programs that were running around the world. The paradigm had taken root, if I may mix my metaphors, and many higher degree students in the USA were combining it with tailored testing and messing around with Bayesian jiggery-pokery, but they all seemed to have it, if you will excuse the vernacular (and even if you don't), arse-up.

Their fuzzy thinking resulted in sanguine acceptance of asking students 50 questions or more, just to make certain that mastery was attained.  This would provoke riots in the classroom and revolt in the staffrooms, and rightly so.  These people were clearly all post-grads with no experience with their equivalent of 2S7.

Yet there was the nub of an idea there, and based on the figures I had available from pre-testing about 60 mastery tests, I knew that very few students stayed in the middle.  They either got 19 or 20 items correct (most of the tests had 4 sets of five items), or they got about 5.  Making any kid answer 50 questions was just stupid.

Bayesian mastery testing

And so I came up with a model:

First: there would be about 15 questions in the test.
Second: most students would answer just a few of them.
Third: they would be re-assessed for mastery/non-mastery after each response.
Fourth: they would only be asked the minimum number of questions.
Fifth: the only way to do this was with what we then called a microcomputer.
Sixth: as each student was classified, item statistics would be updated.
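
For anyone who wants to see the shape of that loop, here is a minimal sketch in Python. It is not the original BASIC program, just one way the six points could fit together. The function names, the starting probability of 0.5 and the cut-offs of 0.95 and 0.05 are all assumptions made for illustration; none of them comes from the thesis.

```python
def update_posterior(prior, correct, p_master, p_non):
    """Bayes' rule for a single response: returns P(master | response)."""
    like_master = p_master if correct else 1.0 - p_master
    like_non = p_non if correct else 1.0 - p_non
    numerator = prior * like_master
    return numerator / (numerator + (1.0 - prior) * like_non)


def run_test(answers, item_stats, prior=0.5, upper=0.95, lower=0.05):
    """Ask items one at a time and stop as soon as the student is classified.

    answers:    booleans (True = correct), in the order the items are asked
    item_stats: (p_master, p_non) for each item, in the same order
    prior, upper, lower: assumed values, not taken from the original scheme
    Returns (decision, questions_asked, final_posterior).
    """
    posterior = prior
    asked = 0
    for correct, (p_master, p_non) in zip(answers, item_stats):
        posterior = update_posterior(posterior, correct, p_master, p_non)
        asked += 1
        if posterior >= upper:
            return "master", asked, posterior
        if posterior <= lower:
            return "non-master", asked, posterior
    # Ran out of items without a clear answer: send the student back to study.
    return "undecided", asked, posterior


if __name__ == "__main__":
    items = [(0.8, 0.2)] * 15
    print(run_test([True] * 15, items))
    print(run_test([True, False] * 7 + [True], items))
```

With every item sitting at 0.8 for masters and 0.2 for non-masters, three correct answers in a row push the posterior to 64/65, so under these assumed cut-offs a clear master is done after three questions, and a student who alternates right and wrong runs through all 15 without being classified.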

There were several risks that needed to be eliminated. First up, there needed to be a base level of probabilities that would not be skewed by one or two students acting up, right at the start. That was fixed by seeding the response array with dummy data giving ten masters a probability of 0.8 of answering each question correctly, and ten non-masters a probability of 0.2 of getting each answer right. These were arbitrary figures, and as each student was classified, his or her responses, either as a master or a non-master, would be added to the array, swamping the dummy data.
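
A rough sketch of that seeding in Python, assuming the dummy data amounts to 8 correct out of 10 for the masters and 2 correct out of 10 for the non-masters on every item. The dictionary layout and the names `seeded_counts`, `p_correct` and `record` are invented for this example and say nothing about how the original array was laid out.

```python
N_ITEMS = 15  # "about 15 questions in the test"


def seeded_counts(n_items=N_ITEMS):
    """Start each item with ten dummy masters and ten dummy non-masters."""
    return {
        "master": {"correct": [8] * n_items, "total": [10] * n_items},
        "non_master": {"correct": [2] * n_items, "total": [10] * n_items},
    }


def p_correct(counts, group, item):
    """Current estimate of P(correct) on one item for one group."""
    g = counts[group]
    return g["correct"][item] / g["total"][item]


def record(counts, group, responses):
    """Add a classified student's answers to that group's counts.

    responses maps item index -> True/False for the items actually asked.
    Undecided students are simply never passed to this function.
    """
    g = counts[group]
    for item, correct in responses.items():
        g["total"][item] += 1
        g["correct"][item] += int(correct)


if __name__ == "__main__":
    counts = seeded_counts()
    for _ in range(30):                      # thirty real masters, all right on item 0
        record(counts, "master", {0: True})
    print(p_correct(counts, "master", 0))    # drifts away from the seeded 0.8
    print(p_correct(counts, "master", 1))    # untouched item still at the seed
```

After thirty real masters have answered item 0 correctly, the estimate is 38/40, and the ten dummy students are already a minority of the evidence for that item.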

The occasional student who hovered in the middle, getting some right and some wrong, would be asked to go and do some more study before trying again, and the student's responses would not be stored. To avoid any form of skewing, the test items were in a circular queue, so that if student A's last question was item 4, student B would begin on item 5, with the last item always being followed by item 1.
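
The circular queue is only a line or two of arithmetic. A small Python sketch, with items numbered from 1 as in the text; the function name `question_order` and the default of 15 items are assumptions for the example.

```python
def question_order(last_item, n_items=15):
    """Order in which the next student meets the items.

    last_item is the number (1..n_items) of the final item the previous
    student answered. The next student starts on the item after it, and
    the last item in the test is followed by item 1.
    """
    start = last_item % n_items  # zero-based position of the next item
    return [(start + k) % n_items + 1 for k in range(n_items)]


if __name__ == "__main__":
    print(question_order(4))   # student A finished on item 4
    print(question_order(15))  # previous student finished on the last item
```

The student only ever sees the first few entries of that list, because the test stops as soon as a classification is reached.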

We got the whole thing programmed in BASIC (my thanks here to David Matheson, who was far better at that sort of thing than I and knew his way around the 64k Apple II), and I found some curious effects.  For example, items with negative discrimination, meaning poor students get them right and good students get them wrong, are usually avoided. In fact, it turned out that such items were just as useful in determining mastery or non-mastery!
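
The arithmetic behind that paradox is easy to check with Bayes' rule. The sketch below uses made-up figures: a "normal" item that masters get right 80% of the time and non-masters 20%, a reversed item with those probabilities swapped, and a starting probability of mastery of 0.5. The function name `posterior` is likewise just for illustration.

```python
def posterior(prior, correct, p_master, p_non):
    """P(master | one response) by Bayes' rule."""
    like_master = p_master if correct else 1.0 - p_master
    like_non = p_non if correct else 1.0 - p_non
    numerator = prior * like_master
    return numerator / (numerator + (1.0 - prior) * like_non)


if __name__ == "__main__":
    # Normal item: a right answer points towards mastery.
    print(posterior(0.5, True, 0.8, 0.2))
    # Negatively discriminating item: a WRONG answer points towards mastery,
    # and by exactly the same amount.
    print(posterior(0.5, False, 0.2, 0.8))
    # An item both groups answer equally well tells us nothing.
    print(posterior(0.5, True, 0.5, 0.5))
```

The first two calls move the probability of mastery from 0.5 to about 0.8, and the third leaves it at 0.5. What makes an item useful for classification is the size of the gap between the two groups, not which group is on top.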

What happened next

Not much. My "long essay" was in fact a thesis, and I packed it with lots of other curious findings (like the hidden fallacy underlying item analysis using the point-biserial correlation coefficient and dendrogram/cluster analysis of mastery test items), but the paradox of the negative discrimination items was nearly my undoing.

My examiner told a friend that when he read that, he knew it was wrong, and he was about to fail me, but then he did some figuring and studied the effect, which I had clearly labelled as counter-intuitive and paradoxical, so as to avoid numpties (like him) howling.  In the end, he realised I was right.  Lucky me.

I was asked to scrap most of the long essay and rewrite chapter 7 (the Bayesian bit) as I would be guaranteed an M. Ed. with merit, which would get me into a PhD program.  I had taken 7 years to get my first degree and 10 to get my second (in each case by cunningly misusing the by-laws which forbade such tortoise-like progress).  I thought "Thirteen years more? No thanks."  I went off and wrote books instead, because it's what I do best.

So I got my ordinary Master's, but the scheme was ahead of its time.  Nobody was prepared to tie up the only computer in the school as a testing machine, and I was moving on to other things. The only public record, the copy of what I justifiably call my thesis, was stolen from the Education library at the University of Sydney.

Well, the 40-year cycle is almost here, and I notice that Salman Khan of the Khan Academy is using a mastery approach very effectively, and now the technology is there, but the mastery measure he uses is too simplistic and oriented to low-cognitive-level arithmetic: 10 questions have to be answered correctly in a row.

So given that some scrote may be out there with my thesis and might be poised to launch my scheme in his or her name, I have decided to place this in the public domain.  I ain't dead yet, but I enjoy cutting scrotes off.

If you like it, run with it: if you need more detail, I am ready and willing to share all the stuff that I have, but any half-way competent programmer, given what I have set out here, could easily implement a similar system using networked devices.

