I've been largely silent, consumed by the news of the day, but I have decided to break my silence, realizing that it was the sheer volume of my thoughts that was preventing me from speaking.
So often, I hear or read that computers are the problem: that our relatively newfound ability to locate and process mass quantities of data somehow 'caused' the current financial crisis. What is odd and funny to me (not in the ha-ha sense) is that the machines are not really the problem. The problem is that those using them -- by omission, commission, or both -- are wielding tools about which they know very little. (I refer to the mathematical formulae, the hardware, and the software necessary to process mass quantities of data, which together comprise a key element of the house of cards collapsing around us.)
Please forgive what may seem like an oversimplification, but statistical analyses basically boil down to two types of methods:
- Those that rely on experimental data (basis=ANOVA, the analysis of variance), and
- Those that rely on post-hoc analysis of trend or pattern-level data (most basic building block=the Pearson product-moment correlation, also known as the Pearson r. It is the basis of regression analysis, latent variable analysis--AKA factor analysis and all its permutations--path analysis, and more).
Both of these are inferential statistics. That is, they are statistics that result from calculations on a sample; if the results are dramatic enough that they are unlikely to be due to chance alone, one can infer that the result will hold true for the population as a whole, within a range of variation known as the confidence interval and limited by the likelihood of Type I (false positive) or Type II (false negative) errors. Any inferential statistic, by definition, contains something called an error term, because one is predicting something that applies to an entire population (be it human, financial, or otherwise) from a sample. Predictive models simply cannot predict a single case of anything. Note: In the case of a census, no inference is necessary because the population parameters are known.
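To make the sample-versus-population idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the population is simulated, the function name `mean_confidence_interval` is made up, and the interval uses a simple normal approximation (z = 1.96 for roughly 95%). It estimates a population mean from a sample, then checks how many individual cases actually fall inside that interval.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (mean, low, high) using a normal approximation (z=1.96 is ~95%)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    margin = z * std_err
    return mean, mean - margin, mean + margin

random.seed(0)
# Hypothetical population: 100,000 values, roughly normal around 100.
population = [random.gauss(100, 15) for _ in range(100_000)]
sample = random.sample(population, 200)

mean, low, high = mean_confidence_interval(sample)
print(f"sample mean = {mean:.2f}, approx. 95% CI = ({low:.2f}, {high:.2f})")
print(f"population mean = {statistics.fmean(population):.2f}")

# The interval describes the population mean, not any one member of it.
share_inside = sum(low <= x <= high for x in population) / len(population)
print(f"share of individual cases inside the interval: {share_inside:.1%}")
```

The interval is a statement about the average. Most individual cases land outside it, which is the sense in which a model built on inference says very little about any single case.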
Moreover, there are certain assumptions built into all of these models which, if violated, render the outcome invalid. My favorite is called homoscedasticity. This is a basic assumption of regression analysis and means that the scatter of the observed y scores around the regression line is roughly the same at every value of x, rather than tight in one place and all over the place in another.
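For the curious, a rough way to see this assumption hold or fail is to fit a line and compare the spread of the residuals at low x against the spread at high x. The sketch below is only an illustration on simulated data, not a formal test; the function names `fit_line` and `residual_variance_ratio` are my own inventions. A ratio near 1 is consistent with the assumption, and a ratio far from 1 suggests it is violated.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def residual_variance_ratio(xs, ys):
    """Residual variance in the upper half of x divided by that in the lower half."""
    intercept, slope = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))  # order the points by x
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    half = len(residuals) // 2
    return statistics.variance(residuals[half:]) / statistics.variance(residuals[:half])

random.seed(1)
xs = [random.uniform(1, 10) for _ in range(2000)]
# Simulated data: same underlying line, two different noise patterns.
constant_spread = [2 * x + random.gauss(0, 1) for x in xs]       # assumption holds
growing_spread = [2 * x + random.gauss(0, 0.5 * x) for x in xs]  # spread grows with x

print(f"constant spread ratio: {residual_variance_ratio(xs, constant_spread):.2f}")
print(f"growing spread ratio:  {residual_variance_ratio(xs, growing_spread):.2f}")
```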
Several points about statistical analysis, inference, and prediction of outcomes:
- When you start putting together multiple calculations using multiple sources of data upon which you have made inferences, you'd better remember to include the error term for each in your calculations. BTW, error terms accumulate: every inference you stack on top of another adds its own uncertainty to the total. If you put together too many equations and data sources, the accumulated error gets bigger and bigger, and you might as well flip the script and admit that the error is larger than the effect you are trying to predict, and that if any prediction is to be made with confidence, it is that your result will be wrong.
- If you also fail to take seriously the assumptions of each type of analysis you are using and make sure they are met, you are doubly doomed.
- Oftentimes, there is no empirical basis for the input data other than wishful thinking on the part of the data source. For instance, just because you believe that your privately held company is worth 20X gross revenue, even though you have never tested that theory, and no company like yours has ever sold for more than 2X revenue, doesn't make it so. My hunch is that this sort of wishful -- and in some cases delusional -- thinking is also a factor in the nation's current economic implosion.
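The first point above, about error terms piling up, can be sketched with a little arithmetic. This is a simplified illustration under stated assumptions, not a description of how any particular financial model works: the function names are hypothetical, every source is given the same made-up standard error of 1.0, and the estimates are simply summed. If the errors are independent, their variances add, so the combined error grows with the square root of the number of sources. If they all lean the same way (perfectly correlated), the errors themselves add.

```python
import math

def combined_error_independent(std_errors):
    """Standard error of a sum of independent estimates (the variances add)."""
    return math.sqrt(sum(e ** 2 for e in std_errors))

def combined_error_worst_case(std_errors):
    """Upper bound: errors that are perfectly positively correlated add directly."""
    return sum(std_errors)

for n_sources in (1, 4, 16, 64):
    errors = [1.0] * n_sources  # hypothetical: each source has a standard error of 1.0
    independent = combined_error_independent(errors)
    worst = combined_error_worst_case(errors)
    print(f"{n_sources:>2} sources: independent = {independent:.1f}, worst case = {worst:.1f}")
```

Either way the total only goes up as sources are added. The real question is how fast, and that depends on how correlated the errors are, which is exactly the sort of thing that tends to go unexamined.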
The overarching point is that the machines are just doing the bidding of the people who run them. Any self-respecting statistician knows the above points to be true, but the statisticians have never been in charge. The people who run the show are the ones who hired the statisticians who used technology to perform calculations.
Whether the statisticians bowed to the wishes of their employers, or had themselves forgotten that no inference can ever predict any one specific outcome (no matter how strong the association or how low the Type I or Type II error rates) -- or whether they were clear about the limits of prediction and were simply ignored by their employers -- is immaterial. The point is that it's easy to blame a machine, even for doing what you told it to do. The last time I checked, machines weren't able to defend themselves.
Oh, by the way, there is no such thing as AI, unless one is referring to a certain extremely talented basketball player with a mind of his own. To think that a bunch of equations could ever mimic the complexity, the quirkiness, and the multidimensionality (not sure if this is a real word--if not, hope the meaning is clear) of the human mind -- which exists not just in the head but also in the finger, the small intestine, etc. -- is surely delusional.
Truly, it is a modern-day version of Pygmalion but with a less happy outcome. At least the original Pygmalion fell in love with the statue of a woman. After he prayed to Venus to bring his beloved statue to life and had his wish granted, the couple had a son and a daughter. I think we are now seeing just how unappealing the offspring of a person and an algorithm can be.