## Friday, 16 August 2013

### Bayesian statistics

This is an introduction, some parts of which are essential for most readers before they read the next post, Bayesian testing. That post, as is normal in blogs, appears above this one.

Consider the case of the teacher who asks a "checking" question to see whether some student has grasped a principle.  If the answer is satisfactory, the teacher may either think "Good, I thought X understood that", or "Funny, I didn't think X had caught onto that one: I'd better ask another question."

On the other hand, if the answer is unsatisfactory, the teacher may think "Yes, I thought X didn't know that", or "Funny, I could've sworn X understood that: I'd better ask again in another way".

In either case, the teacher has taken a prior probability into account, and used the new information to modify that probability to come up with a posterior probability.  The probability that we are dealing with is a continuous variable.

Yet if we were to rely on classical probability theory, the response to a single question can only yield one of two completely discontinuous values.  Either we are 100% certain that the student has understood the business under consideration, or we are equally 100% certain that the student has no clue at all.

The possibilities that the student guessed the answer, made a silly mistake while really understanding the principle, or heard the answer whispered by somebody are all rejected in favour of a narrow, rigid, black-and-white view of probability.

Well, you don't have to be a teacher to recognise that this is daft, but let me make it even easier for you.  Suppose we have just used a multiple choice question to assess where the student is at in terms of understanding the principle involved.

Now people who have little real understanding of probability reject these questions on the grounds that you might just get lucky and guess all the answers.  Let me assure you here and now that this isn't possible for any large number of well-written questions, but that isn't what I want to debate right now.

With a single question, offering four choices, however, there is a reasonable chance of guessing with no understanding at all, a 25% chance, in fact.  Anybody who thinks about it can recognise that: what is less obvious is that some students who do know and understand the principle involved will make a clumsy mistake, and get their answer wrong.
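For anybody who wants to check the guessing figures, here is a rough Python sketch. It assumes every question has four equally attractive choices and that a guesser picks at random, independently each time; the twenty-question test and the "half right" pass mark are assumptions chosen purely for illustration.

```python
from math import comb


def prob_at_least(k, n, p=0.25):
    """Chance of getting at least k of n questions right by pure guessing.

    p is the chance of guessing one question correctly (1 in 4 here).
    This is just the upper tail of a binomial distribution.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


single = prob_at_least(1, 1)        # one question: a 25% chance
all_twenty = prob_at_least(20, 20)  # guessing every one of twenty questions
half_right = prob_at_least(10, 20)  # guessing at least ten of twenty

print(f"one question:         {single:.2%}")
print(f"all of twenty:        {all_twenty:.2e}")
print(f"ten or more of twenty: {half_right:.2%}")
```

On these assumptions, a lucky guess on one question is common, but guessing all twenty is vanishingly unlikely, and even scraping half marks by luck happens only a little over 1% of the time.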

Anybody who knows about testing knows for a fact that this happens. In the pressure of a test or exam, students enter the results in the wrong place or do something else silly.  Sometimes, the fault lies in the question, which is badly worded.

So it's crazy to go around assuming that we can assert 100% probabilities about anything.  We teachers can, however, assert that we are pretty certain that somebody understands whatever the principle is, to the extent that we are willing to move on to something new, and teach that.

It's easy when we are talking about something simple like teaching sums, or dates of famous battles, any sort of rote learning: even Blind Freddy can see that we have to be flexible in how we calculate the probability of something.

Now let's turn to something far more important: the batting performance of our nation's cricketers.  Once again, I will start off with a simple and easy exercise: the average performance of Sir Donald Bradman in test matches.

If we look at The Don's last score, he was out for a duck.  Should that be his lasting record?  Of course not!  And if we look at a modern-day batter with a sequence of low scores in the past few months, should we write him or her off?  Tabloid journalists call for the executioner, sager minds look at the longer term.

Every measurement involves elements of chance, and even a consummate wielder of the willow (that's the bat, for heathens) will sometimes "blow it", sometimes several times in a row, and wise selectors usually look at prior performance, or as mathematicians say, prior probability.

### The technical stuff

You don't need to read this: the maths-free description will do for most readers.  In what follows, if you do read it, the numbers in a1, a2 etc. ought to be subscripts, but this blog does not support that, so far as I can see.  This needs to be kept in mind when you see things like an and p(an).  Sorry!

Suppose we have a set of discrete alternatives a1, a2, a3, a4 . . . an, for a given set of trials, and that we can write the probability of a1 as p(a1). To make this easier, suppose we are looking at a set of test scores, and the probability that a student has mastered the skill being assessed in the test, which has twenty questions. The alternatives are the test scores, from 0 to 20, and what we need to assess is the probability that the student is a master of that skill, given a particular score.

Beginning with a prior estimate of probability p(a1|b), the probability of a particular score being obtained by a student who has mastered the skill, we can then use a simple formula to estimate the probability that a particular student has mastered the skill, given that student's score:

p(ai|b) = [p(b|ai) x p(ai)] / [p(b|a1) x p(a1) + p(b|a2) x p(a2) + . . . + p(b|an) x p(an)]

Or we may take this form of the equation, where there are two events, A and B:

p(A|B) = [p(B|A) x p(A)] / [p(B|A) x p(A) + p(B|~A) x p(~A)]

Here we may define event A as 'mastery' and event B as a particular score, or we may look at them in terms of the likelihood of guilt in a particular situation, or almost anything else. If we only have a limited amount of information available, or a limited number of data points, this will tend to give us a better average estimate of the true situation.
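The two-event form can be tried out with a short Python sketch, taking A as 'mastery' and B as a particular score. The numbers are assumptions, not measurements: a master is taken to get each question right 85% of the time (allowing for slips), a non-master 25% of the time (four-choice guessing), and the prior probability of mastery is set at 0.5. The function names are made up for this example.

```python
from math import comb


def posterior(priors, likelihoods):
    """Bayes' rule over a list of alternatives.

    priors[i] is p(ai) and likelihoods[i] is p(b|ai);
    the result is the list of p(ai|b).
    """
    joint = [p * like for p, like in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def binomial(k, n, p):
    """Probability of exactly k successes in n independent tries."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed figures, for illustration only.
P_MASTER_CORRECT = 0.85  # a master still slips now and then
P_GUESS_CORRECT = 0.25   # a non-master guessing among four choices


def p_mastery_given_score(score, n_questions=20, prior=0.5):
    """p(A|B): probability of mastery (A) given a test score (B)."""
    likelihoods = [
        binomial(score, n_questions, P_MASTER_CORRECT),  # p(B|A)
        binomial(score, n_questions, P_GUESS_CORRECT),   # p(B|~A)
    ]
    return posterior([prior, 1 - prior], likelihoods)[0]


one_right = p_mastery_given_score(1, n_questions=1)
one_wrong = p_mastery_given_score(0, n_questions=1)
print(f"one question, right: {one_right:.3f}")
print(f"one question, wrong: {one_wrong:.3f}")
for score in (5, 10, 15):
    print(f"{score}/20: {p_mastery_given_score(score):.4f}")
```

With these assumed figures, a single right answer moves the estimate from 0.5 to about 0.77, and a single wrong answer moves it to about 0.17: a shift in belief, not a leap to certainty.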

To take a simple example, if four experimenters are trying to find out what the frequency of heads and tails should be when you toss two coins, there are four possibilities, which would give results of two heads, two tails, a head followed by a tail or a tail followed by a head. Taken at face value, two heads or two tails would suggest that the same result will always be achieved.

Now suppose we take a Bayesian approach, beginning with the reasonable assumption that there should be a 'half and half' chance for heads and tails. Under the same conditions, the head-head and tails-tails observations will lead to a conclusion that the coin is biased to a particular result, rather than suggesting that the same result will always be achieved. The head-tail and tail-head cases will still lead to the conclusion that there is an equal probability of getting heads or tails, so the overall set of results is more accurate.
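One way to put numbers on the coin example is sketched below in Python. The 'half and half' starting assumption is represented here by a Beta(2, 2) prior, which behaves like having already seen two heads and two tails. That particular prior is an assumption made for this sketch; the text above does not specify one.

```python
def raw_frequency(heads, tosses):
    """Estimate of p(heads) taken straight from the observed tosses."""
    return heads / tosses


def posterior_mean(heads, tosses, prior_heads=2.0, prior_tails=2.0):
    """Posterior mean of p(heads) under a Beta(prior_heads, prior_tails) prior.

    The prior counts act like imaginary earlier tosses that get
    blended with the real ones.
    """
    return (prior_heads + heads) / (prior_heads + prior_tails + tosses)


# Number of heads seen by each of the four experimenters in two tosses.
outcomes = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}

results = {
    name: (raw_frequency(heads, 2), posterior_mean(heads, 2))
    for name, heads in outcomes.items()
}

for name, (raw, bayes) in results.items():
    print(f"{name}: raw estimate {raw:.2f}, Bayesian estimate {bayes:.3f}")
```

For a fair coin, the two-heads experimenter's raw estimate of 1.0 is as wrong as it can be, while the Bayesian estimate of about 0.67 merely leans towards heads. The head-tail experimenters get 0.5 either way.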

The examples here tend to relate to educational settings, simply because the writer devoted two years of his life to researching and developing such applications, but the same reasoning can just as easily be applied to estimating baseball batting averages, or almost any other measure which is mathematically equivalent to a probability.

### Uses in the courts (still technical!)

You can skim this one as well, or leave it for now and come back to it, as it is a side-issue. The main thing to note is that Bayesian statistics have many uses.

Bayesian probability has now become important in the law courts of the world, where it provides the most appropriate way of dealing with DNA evidence or blood grouping. The need comes from a peculiar situation that crops up when a lawyer says something like "there is a one in a hundred thousand chance of somebody else having this DNA profile".

Suppose a random citizen (we will call him Fred) has been accused on the basis of a blood spot left behind at a murder, which matches Fred's profile to a level that the prosecution are calling "one in a hundred thousand". Further, they say, somebody of Fred's racial group was seen leaving the area. That makes the odds even better, because his race are just 10% of the population. "That makes it one in a million", says the prosecutor.

In fact, it brings the probability down, not up, says the defence lawyer, who has read up on this topic. The defence may well be correct, if they can show that most of the people of that DNA profile type are in Fred's racial group, so what we need to do is use a Bayesian probability, but this example gets a bit confusing.

So let us look at a case where paternity has been alleged, and DNA evidence seems to support the claim. Once again, the frequency of that DNA type in a small community can be quite different to what you get in the whole nation, so we do a calculation of the probability of the accused being the father of the child at the centre of the case.

We have, from all of the tests, a Combined Paternity Index (CPI). This is calculated as the product of the paternity indices for each individual system tested. The CPI tells us how likely it is that the alleged father (or a man genetically identical to the alleged father) contributed the paternal genes to the child, divided by the likelihood of another unrelated man of the same race contributing the paternal genes.

As well, we have a Prior Probability (Pr). This is a numerical value in the range 0-1 (that is, ranging from impossibility to total certainty) which indicates the likelihood of a certain event occurring. This value is estimated, before genetic testing, on the basis of known, non-genetic circumstances surrounding the event.

That means taking into account non-statistical evidence, such as casual acquaintance versus an intimate relationship. Since the laboratory does not know of the existence or the substance of these circumstances, a prior probability of 0.5 is customarily assigned for the purpose of neutrality, but this can be varied.

Now we can calculate the probability of paternity:

P = (CPI x Pr) / [(CPI x Pr) + (1 - Pr)], where P = Posterior Probability of Paternity, CPI = Combined Paternity Index and Pr = Prior Probability.
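The paternity formula is easy to try out in a few lines of Python. This is only a sketch: the function name is made up, and the CPI of 1000 is an invented figure for illustration, not a value from any real case.

```python
def probability_of_paternity(cpi, prior=0.5):
    """Posterior probability of paternity from the CPI and a prior.

    cpi   -- Combined Paternity Index (a likelihood ratio, so not negative)
    prior -- prior probability Pr, between 0 and 1 (0.5 is the customary
             'neutral' value)
    """
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must lie between 0 and 1")
    if cpi < 0:
        raise ValueError("CPI cannot be negative")
    numerator = cpi * prior
    denominator = numerator + (1.0 - prior)
    if denominator == 0:
        raise ValueError("CPI of 0 with a prior of 1 gives no answer")
    return numerator / denominator


# Invented CPI, with the customary neutral prior and a more sceptical one.
neutral = probability_of_paternity(1000)
sceptical = probability_of_paternity(1000, prior=0.1)
print(f"prior 0.5: {neutral:.4f}")
print(f"prior 0.1: {sceptical:.4f}")
```

Two things are worth noticing. With the neutral prior of 0.5, the formula reduces to CPI / (CPI + 1), and a CPI of exactly 1 (evidence that favours neither side) leaves the prior unchanged.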

### Uses against spam

The common methods of filtering spam, back in 2003, such as rejecting mail from known spammers (black lists), and only accepting mail from friends and colleagues (white lists), were not enough. Merely filtering known spam messages was always one step behind clever spammers. More aggressive filtering posed an unacceptable risk of killing legitimate messages.

Take a simple trap that rejected e-mails mentioning the word 'Viagra' in the subject line: the word 'V1AGRA' will pass straight through, but in nine out of ten cases, it will still be read by a human as 'VIAGRA'. New filtering methods were brought in to analyze e-mail messages in their entirety, instead of just looking at a handful of key words.

These filters (and we still use them) make sophisticated models, based on probability and statistics theory going back to the ideas of the 18th-century mathematician and cleric, Thomas Bayes, that determine whether new messages are spam or not.

Such a system means that a message about sextants which mentions that the pen is mightier than the sword will be examined and passed, rather than being examined and hurled into the outer darkness.

The basic notion of Bayesian statistics is that it begins with a certain assumed probability that a message should be rejected, and then uses a variety of observations to adjust that probability, only acting if the probability rises above (or falls below) a certain level.

Some findings may increase the probability, and others may reduce it. In more sophisticated forms, testing may be exited fast by the use of white lists and black lists, while indeterminate messages can be given a more thorough scrutiny, even to looking for any of a few thousand terms and phrases, such as all of the usual weasel claims about mail not being sent unless people have opted in.

By the same token, the name of a known sender might be used to validate e-mail, so that a message from the New England Journal of Medicine or Nature, for example, would be allowed to pass, even if it mentioned a number of otherwise 'black mark' terms. It would also get around the difficulty encountered by some Thai people, whose names end in '-porn'. In the past, words like Middlesex and Essex have also been known to trigger poorly designed guardian software.
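The general idea can be sketched as a toy filter in Python. This is a simple "naive Bayes" scheme, which treats each word as independent evidence (an assumption, and a crude one), and it is not a description of any particular product. The word table, the addresses and the 0.9 threshold are all invented for illustration; a real filter would learn thousands of such figures from sorted mail.

```python
import math
import re

# Invented figures: word -> (p(word | spam), p(word | legitimate mail)).
WORD_LIKELIHOODS = {
    "viagra": (0.30, 0.001),
    "opted": (0.20, 0.01),
    "sextant": (0.001, 0.02),
    "journal": (0.005, 0.05),
}

# Invented addresses for the fast exits.
WHITE_LIST = {"editor@nejm.example"}
BLACK_LIST = {"bulk@spammer.example"}


def spam_probability(sender, text, prior=0.5):
    """Estimate the probability that a message is spam."""
    # Fast exits: known senders are settled without any scoring.
    if sender in WHITE_LIST:
        return 0.0
    if sender in BLACK_LIST:
        return 1.0

    # Start from the prior, expressed as log-odds, then let each
    # recognised word push the odds up or down.
    log_odds = math.log(prior / (1 - prior))
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    for word in words:
        if word in WORD_LIKELIHOODS:
            p_spam, p_legit = WORD_LIKELIHOODS[word]
            log_odds += math.log(p_spam / p_legit)

    # Convert the log-odds back into a probability.
    return 1 / (1 + math.exp(-log_odds))


def is_spam(sender, text, threshold=0.9):
    """Act only if the probability rises above the threshold."""
    return spam_probability(sender, text) > threshold


print(is_spam("stranger@mail.example", "You opted in for cheap Viagra"))
print(is_spam("stranger@mail.example", "Journal article on the sextant"))
print(is_spam("editor@nejm.example", "New trial results for Viagra"))
```

A message with no recognised words simply stays at the prior, which is why a filter of this kind needs a large vocabulary: 'V1AGRA' on its own would slip past this toy table, and it is the rest of the message that has to give the game away.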

All of this filtering is why we now rarely see this once-popular tagline on e-mails:

When they come for the anarchists, I shall speak up even though I am not an anarchist. When they come for the Jews, I shall speak up even though I am not a Jew. When they come for the Muslims, I shall speak up even though I am not a Muslim. When they come for the Christians, I shall speak up even though I am not a Christian. When they come for the spammers, I'll say "You missed one over there!"