## Monday, 15 April 2013

### Trusting statistics, part 3

 Remember this one? Now it's about to become relevant!

In part 2, I suggested that statistics are best regarded as convenient ways of wrapping a large amount of information up into a small volume.  A sort of short-hand condensation of an unwieldy mess of bits and pieces.

And one of the handiest of these short-hand describers is the correlation coefficient, a measure of how two variables change at the same time, the one with the other.

Now here I'll have to get technical for a moment.  You can calculate a correlation coefficient for any two variables, things like number of cigarettes smoked, and probability of getting cancer.  The correlation coefficient is a simple number which can suggest how closely related two sets of measurements really are.

It works like this: if the variables match perfectly, rising and falling in perfect step, the correlation coefficient comes in with a value of one.  But if there's a perfect mismatch, with one variable falling exactly as the other rises (the more you smoke, the smaller your chance of surviving), then you get a value of minus one.
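For anybody who wants to see the wheels turn, here is a minimal sketch of the usual (Pearson) correlation coefficient in Python.  The function name `pearson_r` and the figures fed to it are invented for illustration, not real smoking data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations from the mean
    dy = [y - mean_y for y in ys]
    co_variation = sum(a * b for a, b in zip(dx, dy))
    spread_x = sum(a * a for a in dx)
    spread_y = sum(b * b for b in dy)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a variable that never changes has no correlation")
    return co_variation / math.sqrt(spread_x * spread_y)

# Made-up figures, purely to show the two extremes.
cigarettes = [0, 5, 10, 20, 40]
in_step = [2 * c + 1 for c in cigarettes]        # rises exactly as smoking rises
opposite = [100 - 3 * c for c in cigarettes]     # falls exactly as smoking rises

r_match = pearson_r(cigarettes, in_step)
r_mismatch = pearson_r(cigarettes, opposite)
```

Here `r_match` comes out at one and `r_mismatch` at minus one, give or take a little rounding in the last decimal place.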

With no match at all, no relationship, you get a value somewhere around zero.  But consider this: if you have a whole lot of tennis balls bouncing around together, quite randomly, some of them will move together, just by chance.  No cause, nothing in it at all, just a chance matching up.  And random variables can match up in the same way, just by chance.  And sometimes, that matching-up may have no meaning at all.

So this is why we have tests of significance.  We calculate the probability of getting a given correlation by chance, and we only accept the fairly improbable values, the ones that are unlikely to be caused by mere chance.  In the example above, you will see r=0.9971 (p<0.0001), where r is the correlation coefficient, and p is the probability of getting such a result by chance.  This result was highly improbable, so I guess that proves the case, huh?
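One way to put a number on "by chance" is to shuffle one of the variables over and over, and count how often the shuffled data match up as well as the real data did.  The sketch below does that in Python.  It is a shuffle (permutation) test, which is an assumption on my part and not necessarily how the p value quoted above was worked out, and the data and the names `chance_probability`, `shuffles` and `seed` are made up for the purpose.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )

def chance_probability(xs, ys, shuffles=2000, seed=1):
    """Estimate how often shuffled ys correlate with xs at least as
    strongly (in either direction) as the real ys do."""
    rng = random.Random(seed)              # fixed seed so the run repeats
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add one to top and bottom so the estimate is never exactly zero.
    return (hits + 1) / (shuffles + 1)

# Invented data: ys follows xs closely, with a small wobble.
xs = list(range(1, 11))
ys = [2.0 * x + (0.3 if x % 2 else -0.3) for x in xs]

r = pearson_r(xs, ys)
p = chance_probability(xs, ys)
```

With data this tightly matched, almost no shuffle does as well as the real thing, so `p` comes out very small.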

Nope.

The trouble is, all sorts of improbable things do happen by chance.  Winning the lottery is improbable, although the lotteries people won't like me saying that.  But though it's highly improbable, it happens every day, to somebody.  With enough tries, even the most improbable things happen.

So here's why you should look around for some plausible link between the variables, some reason why one of the variables might cause the other.  But even then, the lack of a link proves very little either way.  There may be an independent linking variable.

Suppose smoking was a habit which most beer drinkers had, suppose most beer drinkers ate beer nuts, and just suppose that some beer nuts were infected with a fungus which produces aflatoxins that cause slow cancers which can, some time later, cause secondary lung cancers.

In this case, we'd get a correlation between smoking and lung cancer which still didn't mean smoking actually caused lung cancer.  And that's the sort of grim hope which keeps those drug pushers, the tobacco czars, going, anyhow.  It also keeps the smokers puffing away at their cancer sticks.

It shouldn't, of course, for people have thrown huge stacks of variables into computers before this.  The only answer which keeps coming out is a direct and incontrovertible link between smoking and cancer.  The logic is there, when you consider what the cigarettes contain, and how the amount of smoking correlates with the incidence of cancer.  It's an open and shut case.

I'm convinced, and I hope you are too.  Still, just to tantalise the smokers, I'd like to tell you about some of the improbable things I once got out of the computer.  They aren't really what you might call damned lies, and they are only marginally describable as statistics, but they show you what can happen if you let the computer out for a run without a tight lead.

Rabat, Morocco: just one stork per tower or chimney.

Now anybody who's been around statistics for any time at all knows the folk-lore of the trade, the old faithful standbys, like the price of rum in Havana being highly correlated with the salaries of Presbyterian ministers in Massachusetts, and the Dutch (or sometimes it's Danish) family size which correlates very well with the number of storks' nests on the roof.

More kids in the house, more storks on the roof.  Funny, isn't it?  Not really.  We just haven't sorted through all of the factors yet.

The Presbyterian rum example is the result of correlating two variables which have increased with inflation over many years.  You could probably do the same with the cost of meat and the average salary of a vegetarian, but that wouldn't prove anything much either.
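The inflation effect is easy to fake.  In this sketch, two invented series have nothing in common except that both grow at the same assumed 4 per cent a year, each with its own independent wobble.  The names `rum_price` and `salary`, the starting values and the growth rate are all hypothetical, not real Havana or Massachusetts figures.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )

def yearly_change(series):
    """Ratio of each year's value to the previous year's."""
    return [later / earlier for earlier, later in zip(series, series[1:])]

rng = random.Random(2013)   # fixed seed so the run repeats
inflation = 1.04            # assumed: 4 per cent a year, for 50 years
years = range(50)

# Each series is the shared growth times its own random wobble of up to 10%.
rum_price = [2.0 * inflation ** t * rng.uniform(0.9, 1.1) for t in years]
salary = [900.0 * inflation ** t * rng.uniform(0.9, 1.1) for t in years]

r_levels = pearson_r(rum_price, salary)
r_changes = pearson_r(yearly_change(rum_price), yearly_change(salary))
```

The raw levels correlate very highly, because both ride the same inflation.  Compare the year-to-year changes instead, which takes the shared growth out, and most of the match-up goes away.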

In the case of the storks on the roof, large families have larger houses, and larger houses in cold climates usually have more chimneys, and chimneys are what storks nest on.  So naturally enough, larger families have more storks on the roof.  With this information, the observed effect is easy to explain, isn't it?

There are others, though, where the explanation is less easy.  Did you know, for example, that Hungarian coal gas production correlates very highly with Albanian phosphate usage?  Or that South African paperboard production matches the value of Chilean exports, almost exactly?

Or did you know the number of iron ingots shipped annually from Pennsylvania to California between 1900 and 1970 correlates almost perfectly with the number of registered prostitutes in Buenos Aires in the same period?  No, I thought you mightn't.

These examples are probably just a few more cases of two items with similar natural growth, linked in some way to the world economy, or else they must be simple coincidences.  There are some cases, though, where, no matter how you try to explain it, there doesn't seem to be any conceivable causal link.  Not a direct one, anyhow.

There might be indirect causes linking two things, like my hypothetical beer nuts.  These cases are worth exploring, if only as sources of ideas for further investigation, or as cures for insomnia.  It beats the hell out of calculating the cube root of 17 to three decimal places in the wee small hours, my own favourite soporific.

Now let's see if I can frighten you off listening to the radio, that insomniac's stand-by [that's a hint that this was once a radio script].  Many years ago, in a now-forgotten source, I read there was a very high correlation between the number of wireless receiver licences in Britain, and the number of admissions to British mental institutions.

At the time, I noted this with a wan smile, and turned to the next taxing calculation exercise, for in those far-off days, all correlation coefficients had to be laboriously hand-calculated.  It really was a long time ago when I read about this effect.

It struck me, just recently, while wearing my scientist hat, that radio stations pump a lot of energy into the atmosphere.  In America, the average five-year-old lives in a house which, over the child's life to the age of five, has received enough radio energy to lift the family car a kilometre into the air.  That's a lot of energy.

Suppose, just suppose, that all this radiation caused some kind of brain damage in some people.  Not all of them necessarily, just a susceptible few.  Then, as you get more licences for wireless receivers in Britain, so the BBC builds more transmitters and more powerful transmitters, and more people will be affected.  And so it is my sad duty to ask you all: are the electronic media really out to rot your brains?  Will cable TV save us all?

Presented in this form, it's a contrived and, I hope, unconvincing argument.  Not that it matters much: even switching off right now won't stop the radiation coming into your home, so lie back and enjoy it while you can!  My purpose in citing these examples is to show you how statistics can be misused to spread alarm and despondency.  But why bother?

Well, just a few years ago, problems like this were rare.  As I mentioned, calculating just one correlation coefficient was hard yakka in the bad old days.  Calculating the several hundred correlation coefficients you would need to get one really improbable lulu was virtually impossible, so fear and alarm seldom arose.

That was before the day of the personal computer and the hand calculator.  Now you can churn out the correlation coefficients faster than you can cram the figures in, with absolutely no cerebral process being involved.

As never before, we need to be warned to approach statistics with, not a grain, but a shovelful, of salt.  The statistic which can be generated without cerebration is likely also to be considered without cerebration.  Which brings me, slowly but inexorably, to the strange matter of the podiatrists, the public telephones, and the births.

Seated one night at the keyboard, I was weary and ill at ease.  I had lost one of those essential connectors which link the parts of one's computer.  Then I found the lost cord, connected up my computer, and fed it a huge dose of random data.

Well, not completely random, just, well, deliberately different.  I told it about the rattiest things I could dredge up, all sorts of odds and sods from a statistical year-book that just happened to be lying around.  In all, I found twenty ridiculously and obviously unrelated things, so there were one hundred and ninety correlation coefficients to sift through.  That seemed about right for what I was trying to do.
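The one hundred and ninety is just the number of ways of picking two things from twenty, and a few lines of Python show why that many tries is plenty.  The 5 per cent cut-off (`alpha`) is an assumption of mine for illustration, and so is treating the tests as independent of one another, which they are not quite.

```python
import math

variables = 20
pairs = math.comb(variables, 2)      # ways to choose 2 of 20: 20 * 19 / 2

alpha = 0.05                         # assumed cut-off for "improbable"

# If nothing at all were going on, and the tests were independent,
# this many pairs would still sneak under the cut-off on average...
expected_flukes = pairs * alpha

# ...and this is the chance of getting at least one fluke.
p_at_least_one = 1 - (1 - alpha) ** pairs

print(pairs, expected_flukes, p_at_least_one)
```

On those assumptions you would expect nine or ten "significant" results from pure noise, and getting none at all would be the real surprise.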

When I was done, I pressed button B, switched on the printer, and sat back to wait for the computer to churn out the results of its labours.  The first few lines of print-out gave me no comfort, then I got a good'n, then nothing again, then a real beauty, and so it went.

At the end, I looked over the results.  I saw that NSW podiatrists' registrations showed a correlation of minus point nine eight with the number of South Australian public telephones, and minus point nine six with the Tasmanian birth rate.  The Tasmanian birth rate in turn correlated plus point nine four with the South Australian public phones. All highly improbable!

And proving nothing: I had done enough tests to get at least a few unlikely results, and I was choosing things that were all likely to vary over the years.  So I looked at the figures with a sober eye (no, don't ask about the other one, it was having the night off).

Well of course the podiatrists and phones part is easy.  Quite clearly, New South Wales podiatrists are moving to South Australia and metamorphosing into public phone boxes.

Or maybe they're going to Tasmania to have their babies, or maybe Tasmanians can only fall pregnant in South Australian public phone booths.

Or maybe codswallop grows in computers which are treated unkindly.  As I said in the first part, figures can't lie, but liars can figure.

I would trust statistics any day, so long as I can find out where they came from, and I'd even trust statisticians, so long as I knew they knew their own limitations.  Most of the professional ones do know their limitations: it's the amateurs who are dangerous.

I'd even use statistics to choose the safest hospital to go to, if I had to go.  But I'd still rather not go to hospital in the first place.

After all, statistics show clearly that more people die in the average hospital than in the average home.

This would have made more sense if you started with Part 1 and Part 2.  You did?  Well done!