Wow Scott, what an amazingly thoughtful comment — thanks!

Based on this (and other comments) I realize I’m lacking some knowledge of statistics and how it fits in.

I also have a thought that the most expensive data-modeling mistake in human history (and for a while to come) was almost certainly made in the past two months. I’m not sure what, exactly, it was, but there were so many obviously-flawed models projecting serious things on gigantic stages…

When the dust settles, I hope we find out what the mistake was and who made it.

And I hope we add to our general education so the underlying flaw is much less likely to happen again.

It is my belief that the Coronacrisis will be the watershed moment when data science retrenches and admits the inherent limitations of monomaniacal predictive ‘big data’ machine learning, deep learning in particular.

This will not be a surprise, as even predictive machine learning advocates admit that the discipline is dangerously over-hyped. Non-specialists simply misunderstand and expect too much from predictive machine learning, pursuing ‘magic box’ solutions.

We can consider this a natural and healthy adjustment, one which will lead to improvements in how people think about and frame data science more generally (the ‘Plateau of Productivity’ in the Gartner Hype Cycle (https://en.wikipedia.org/wiki/Hype_cycle)).

What is needed to build a future data science competency in the post-COVID future? The field of data science will need to shift its attitude and approach in two particular respects:

1) Abandon the false dichotomy that predictive machine learning has made classical statistics somehow obsolete (e.g. the summit of the Gartner ‘Peak of Inflated Expectations’ as driven by writings such as: The Big Idea: The Next Scientific Revolution (https://hbr.org/2010/11/the-big-idea-the-next-scientific-revolution) and The Fourth Paradigm: Data-Intensive Scientific Discovery (https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/)).

2) Retrench and reinvigorate data science by focusing on methods that integrate explanatory (more associated with classical statistics and econometrics) and machine learning approaches (as more associated with correlation and prediction in large datasets).

Regarding treating data prior to analysis, statisticians generally cleave to some version of Tukey’s EDA process (https://en.wikipedia.org/wiki/Exploratory_data_analysis) to develop causal-explanatory models. Those pursuing predictive machine learning generally follow an analytics life cycle process (https://www.amazon.com/Analytics-Lifecycle-Toolkit-Practical-Capability/dp/1119425069/), or a particular ML model development approach such as CRISP-DM (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining).
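To make the contrast concrete, here is a minimal sketch of a Tukey-style first pass: before any model is fit, inspect completeness, robust summaries, and skew. The dataset is entirely synthetic, invented for illustration only:

```python
import numpy as np

# Synthetic, right-skewed measurements with some values deliberately
# knocked out to simulate missing data (all numbers are made up).
rng = np.random.default_rng(3)
values = rng.lognormal(mean=2.0, sigma=0.5, size=500)
values[rng.choice(500, size=25, replace=False)] = np.nan

# EDA-style questions, asked before any modeling:
present = values[~np.isnan(values)]
print(f"missing fraction: {np.isnan(values).mean():.1%}")
print(f"median {np.median(present):.1f}, IQR "
      f"{np.percentile(present, 25):.1f}-{np.percentile(present, 75):.1f}")
# Mean well above median flags right skew -- a hint that a symmetric
# error assumption downstream would be misleading.
print(f"mean {present.mean():.1f} vs median {np.median(present):.1f}")
```

Nothing here is sophisticated; the point is that the questions about missingness and distribution come first, not after the model misbehaves.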

There is a somewhat contrived debate among some statisticians and some machine learning advocates regarding the virtues and vices of the other camp. Some of this stems from those who would simply rather not invest time in building bridges between the two domains. However, well-respected statisticians and econometricians (e.g. H. Varian (http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf), T. Hastie (https://web.stanford.edu/~hastie/TALKS/SLBD_new.pdf), J. Pearl (https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X/ref=sr_1_1), D. Donoho (https://www.youtube.com/watch?v=QTzNXYcZLbU)) advocate, in one form or another, bridging explanatory and predictive methods. I would recommend absorbing the works linked here.

It is a fairly heavy commitment to be fluent in both the range of multivariate inferential statistical methods (causal-explanatory) and machine learning methodologies (supervised-predictive). However, pursuing predictive (generally correlative) ML models alone invokes ‘correlation not causation’ and risks overfitting to noise and spurious correlations (see ‘The Parable of Google Flu (https://gking.harvard.edu/files/gking/files/0314policyforumff.pdf)’ for an epidemiological big data ML morality tale).
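The spurious-correlation trap is easy to demonstrate. In this sketch (pure noise, no real data), a modest-length target series is screened against many unrelated candidate predictors; with enough candidates, some correlate strongly by chance alone, which is exactly the failure mode behind the Google Flu story:

```python
import numpy as np

rng = np.random.default_rng(0)

# A target series (think: weekly case counts) and thousands of candidate
# predictors -- all pure noise, none causally linked to the target.
n_weeks, n_candidates = 50, 10_000
target = rng.normal(size=n_weeks)
candidates = rng.normal(size=(n_candidates, n_weeks))

# Pearson correlation of each candidate against the target.
t = (target - target.mean()) / target.std()
c = (candidates - candidates.mean(axis=1, keepdims=True)) \
    / candidates.std(axis=1, keepdims=True)
corrs = (c @ t) / n_weeks

# The best 'predictor' looks impressive, yet it is noise by construction.
print(f"best spurious correlation: {np.abs(corrs).max():.2f}")
```

A classical statistician would demand multiple-comparison corrections or held-out validation before believing any of these ‘signals’; a naive big-data screen would ship the top correlate.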

This is not to say statistics has an answer or cure in these cases, but it certainly is much better at quantifying uncertainty formally and incorporating that uncertainty in rigorous assessments.

If I had to put my money down, I believe in 2 years we will see more adherence to and respect for, in particular, Bayesian statistical decision theory. Deep learning isn’t going to get us out of this situation.

For those data scientists focused on deep learning / predictive machine learning, I would recommend expanding your skillset by reading up on Bayesian methods, particularly statistical decision theory. Here is a good introduction: Introduction to Statistical Decision Theory: Utility Theory and Causal Analysis (https://www.routledge.com/Introduction-to-Statistical-Decision-Theory-Utility-Theory-and-Causal/Bacci-Chiandotto/p/book/9781138083561)
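As a small taste of the Bayesian approach to quantifying uncertainty, here is a conjugate Beta-Binomial sketch for an uncertain fatality rate. The counts are hypothetical, not real COVID-19 figures; the point is that the output is a full posterior distribution, not a single number:

```python
import numpy as np

# Hypothetical observed outcomes (illustrative numbers only).
deaths, resolved_cases = 30, 1000

# Uniform Beta(1, 1) prior on the fatality rate. The Beta prior is
# conjugate to the Binomial likelihood, so the posterior is also Beta.
a_post = 1 + deaths
b_post = 1 + (resolved_cases - deaths)

# Monte Carlo draws from the posterior give the whole uncertainty
# picture: a mean plus a credible interval, rather than a point estimate.
rng = np.random.default_rng(1)
draws = rng.beta(a_post, b_post, size=100_000)
lo, hi = np.percentile(draws, [2.5, 97.5])
print(f"posterior mean: {draws.mean():.3f}, "
      f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

Decision theory then enters by weighing actions against this posterior (expected loss), rather than against a single best-guess rate.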

The future of data science will be:

* Not ignoring classical statistics as ‘something superseded by machine learning’

* Less arrogant and presumptive of being able to solve every problem ever and more careful concerning the inherent limitations of insights provided (hint: garbage-in-garbage-out rule)

* Data scientists will be properly viewed as support for domain experts, not (Dunning-Kruger (https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect)) leaders in whatever domain they jump into (e.g. epidemiologists should take the lead in COVID-19 analytics and be supported by data scientists, not the other way around)

* Paying attention to the data generating process (DGP) – source, quality, fit with respect to the framed research question under scrutiny, etc.

* Blending explanatory (statistical / econometric) and predictive (machine learning) approaches

Predictive machine learning depends on having high-quality training data. The COVID-19 crisis has exposed quite dramatically how obtaining high-quality data is not a given and cannot be taken for granted. While some machine learning techniques help accommodate poor-quality data, no technique can, by its nature, overcome data that is fundamentally lacking in quality or is marked by ‘real world’ uncertainty to such a degree that there are inherent doubts concerning its representativeness.

For instance, as epidemiologists know well, key variables such as the transmission (spread) rate and fatality rate of a disease may not be easy to determine during an epidemic. There are many factors associated with spread, and it may be difficult to track individual transmission events. Essentially, robust contact records are needed to determine transmission vectors. This is why there has been a sudden interest in potentially using cellphones to track spread and infection.

Mortality rate is also subject to a great deal of uncertainty, as the impact of comorbidity (when a condition is influenced by pre-existing underlying or co-occurring conditions) may be difficult to disentangle. Age, lifestyle (e.g. drinking, smoking), demographic, and genetic factors also play a role. There are fundamental questions and uncertainties concerning the degree to which the progression of COVID-19 infection is related to or affected by a whole range of complex health factors. There is also the factor of exposure – whether one receives a larger versus smaller initial exposure to the virus, the latter potentially being easier for the immune system to overcome.

In short, there are a range of fundamental variables which are subject to uncertainty and incomplete data. No fancy algorithm or magic box will fix this automatically.

This gets back to a fundamental issue in data science: the degree to which it embraces classical statistics versus the degree to which it (over-) focuses on machine learning, predictive machine learning in particular.

A number of popular predictive machine learning algorithms are deployed without conducting deeper investigations into the origins (how was the data collected?), quality (how much can the data be trusted? is the dataset complete and representative?), and probabilistic nature (parametric characteristics) of the data.

Many predictive machine learning algorithms are deployed and operate under the assumption that correlation is a reasonable proxy, in aggregate, for causation. That is, that ‘big’ data (many variables and/or many instances) is an acceptable way to ‘get around’ problems associated with data quality and completeness.

The classical statistician methodologically disagrees, having a primary focus on qualifying and quantifying the dataset prior to statistical analysis.

In classical statistics there is a notion of the ‘data generating process’ (DGP): that what we see (the data sample) is restricted by what and how we measure, often with strange and unexpected distortions.

For instance, during Ebola outbreaks, reported infections appear to dip just as outbreaks become serious, because people hide sick family members rather than let strange hazmat-suited foreigners come and take them away to ‘kill them’. As in the COVID-19 outbreak, reporting on the outbreak, and the actions taken in response, distort how people behave. This makes it more difficult to collect high-quality data.

Dr. Fauci and other experts have been attempting to communicate this: that not only is there a great amount of uncertainty associated with the data we have, but that our reporting on what we believe about the uncertain data affects people’s behavior, causing follow-on distortions on subsequent events. This means, essentially, that the DGP for COVID-19 is not only riven with uncertainty, but that the nature of the uncertainty shifts as reporting and social behavior evolves. We will never have a complete ‘real-time real world’ picture, as we are only seeing shadows in data samples.
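The reporting-distortion dynamic described above can be rendered as a toy simulation (all numbers are invented for illustration): the true epidemic keeps growing, but the reporting rate collapses once interventions drive cases underground, so the observed series dips even as reality worsens:

```python
import numpy as np

rng = np.random.default_rng(2)

# True infections grow geometrically (hypothetical epidemic).
weeks = np.arange(10)
true_cases = 100 * 1.5 ** weeks

# The DGP distortion: reporting is steady at 80% until week 5, when
# people begin hiding cases and the reporting rate halves each week.
report_rate = np.where(weeks < 5, 0.8, 0.8 * 0.5 ** (weeks - 4))

# What we actually observe is a noisy, thinned sample of reality.
observed = rng.binomial(true_cases.astype(int), report_rate)

print("true:    ", true_cases.astype(int))
print("observed:", observed)  # rises, then falls -- while truth keeps rising
```

An analyst who treats the observed series as the DGP itself would conclude the epidemic is receding at precisely the moment it accelerates.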

This classical-statistics notion of the DGP has been somewhat sidelined in the hubristic age of BIG DATA and data science. There is an assumption that we always have ALL the data and know everything about a situation.

However, this whole situation has revealed that collecting good data is and always will be tricky and difficult. And then even when we try our best, the messy real world intervenes and introduces uncertainty (if not outright distortion due to unanticipated social behaviors resulting from reporting).
