Explaining Machine Learning to Fido 2.0
In part one of my machine learning primer, we compiled a table of the relevant diagnoses, medical history, and proximity to work accidents of three claimants with knee injuries. From that, we fit a predictive model, a classification tree, to estimate the likelihood of these claimants receiving surgery. The model is forward looking; we intend to use it to predict the outcome for a new claimant. But are we ready to do that? Intuitively we understand no, probably not; whatever pattern we uncover in three people is unlikely to generalize. But let’s say we have 100,000 claimants in our sample, or even more; we have big data. Would we be confident our predictions would generalize then? Could we be confident that claimants with a dislocated kneecap who have yet to receive surgery have a 7.3% chance of getting it, if that’s been the incidence of surgery among 75,000 similar claimants?
In machine learning, the convention is to split sample data into two groups, one to train a model and the other to test it. Imagine that we fit our model on 100,000 claims in a training split and hold out another 25,000 claims in a testing split. We can make predictions for the claims in the testing split and, by comparing those predictions with the actual outcomes of those claims, evaluate our model’s performance. If we are interested in the calibration of our predictions, we can ask: do 50% of the claimants in our holdout set with a torn meniscus and a recent accident receive surgery, as our model predicts? A model that performs well on a training split but does not perform commensurately on holdout data is said to be overfit.
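As a rough illustration of that split, here is a minimal Python sketch. Everything in it is an assumption made for the example: the claims are simulated, the surgery rates in `TRUE_RATES` are invented, and the "model" is just the incidence of surgery within each group of the training split, which stands in for the leaves of a classification tree rather than reproducing the tree from part one.

```python
import random
from collections import defaultdict

# Hypothetical surgery rates, used only to simulate claims (not real figures).
# Each key is (diagnosis, recent_accident).
TRUE_RATES = {
    ("torn_meniscus", True): 0.50,
    ("torn_meniscus", False): 0.30,
    ("dislocated_kneecap", True): 0.15,
    ("dislocated_kneecap", False): 0.07,
}

def simulate_claims(n, rng):
    """Return n simulated claims as (group, received_surgery) pairs."""
    groups = list(TRUE_RATES)
    claims = []
    for _ in range(n):
        group = rng.choice(groups)
        surgery = rng.random() < TRUE_RATES[group]
        claims.append((group, surgery))
    return claims

def surgery_rates(claims):
    """Incidence of surgery within each group of claims."""
    counts = defaultdict(lambda: [0, 0])  # group -> [surgeries, total]
    for group, surgery in claims:
        counts[group][0] += int(surgery)
        counts[group][1] += 1
    return {group: s / t for group, (s, t) in counts.items()}

rng = random.Random(0)
claims = simulate_claims(125_000, rng)
rng.shuffle(claims)  # shuffle before splitting so both splits look alike

train, test = claims[:100_000], claims[100_000:]

predicted = surgery_rates(train)  # what the "model" predicts for each group
observed = surgery_rates(test)    # what actually happened in the holdout set

# Calibration check: the gap between predicted and observed rates per group.
gaps = {group: abs(predicted[group] - observed[group]) for group in predicted}
for group in sorted(gaps):
    print(group, round(predicted[group], 3), round(observed[group], 3))
```

Small gaps between the predicted and observed rates suggest the predictions are well calibrated on the holdout data; large gaps would be a sign of overfitting.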
To close out part two, I want to emphasize that the entire edifice hinges on the data itself. If we have unrepresentative data (because, for example, new standards of care postdate our claim records, the providers for our claimants are relatively idiosyncratic, or the records are simply corrupted), we might still build a model that performs well on a testing split, because that split shares the same flaws. But the object is really to maintain performance out of sample. Our testing split is a proxy for out of sample data, but it’s still part of the sample. To the extent possible, we want to continue assessing how our model performs on new data. If real world, out of sample data is significantly different from our training data, no amount of model selection, parameter tuning, or gold standard validation technique is going to yield viable performance.
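The point about unrepresentative data can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the outcomes are simulated, and the two surgery rates (0.50 in the historical records, 0.30 after a hypothetical change in the standard of care) are made up for the example.

```python
import random

def simulate_outcomes(n, surgery_rate, rng):
    """Return n simulated outcomes; True means the claimant received surgery."""
    return [rng.random() < surgery_rate for _ in range(n)]

def incidence(outcomes):
    """Share of claimants who received surgery."""
    return sum(outcomes) / len(outcomes)

rng = random.Random(1)

# Historical records: hypothetical 50% surgery rate.
historical = simulate_outcomes(125_000, 0.50, rng)
train, test = historical[:100_000], historical[100_000:]

# New claims after a hypothetical change in the standard of care: 30% rate.
new_claims = simulate_outcomes(25_000, 0.30, rng)

predicted_rate = incidence(train)
holdout_gap = abs(predicted_rate - incidence(test))
new_data_gap = abs(predicted_rate - incidence(new_claims))

print("gap on testing split:", round(holdout_gap, 3))
print("gap on new claims:", round(new_data_gap, 3))
```

The testing split comes from the same records as the training split, so the prediction looks accurate there, while the gap on the new claims is large. Nothing about the validation step can reveal that on its own.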
To quote Sir Josiah Charles Stamp (1880-1941), "The government [is] extremely fond of amassing great quantities of statistics. These are raised to the nth degree, the cube roots are extracted, and the results are arranged into elaborate and impressive displays. What must be kept ever in mind, however, is that in every case, the figures are first put down by a village watchman, and he puts down anything he damn well pleases."