[ Video version here. ]
My first job was as a machine learning researcher in a product group at one of the big tech companies.
I remember thinking I needed a backup plan, because I was pretty sure the whole machine learning thing would turn out to be a fad, and then I’d have to figure out if I wanted to be a software engineer or a program manager…or if I’d just try to move back in with mom and dad.
A few years later I started managing other machine learning researchers. Every year at review time I encouraged them to keep up their non-machine learning skills. You know, because of the whole ‘machine learning might be a fad’ thing…
Boy was I wrong. Today machine learning is more than a single stable career path: there are many different types of careers you can have in machine learning, depending on your interests and skills.
For example, my titles have included: researcher, applied researcher, program manager, applied research manager, principal software engineer, architect, machine learning scientist, and software engineering manager. And over the past 15 years I’ve worked with: data scientists, decision scientists, data analysts, machine learning engineers, data quality engineers, scientists, applied scientists, research scientists, and ranking engineers. And all of these people were doing similar data- and ML-focused work.
So what the heck is going on?
Well, the large scale use of data is relatively new and we’re inventing stuff as we go. Different organizations have different data cultures, and there are many strange evolutionary forces at work.
For example, one company I worked for got rid of the ‘software test engineer’ function. In the process, many software test engineers were given the option to change their titles to ‘data scientist’…and then to try to figure out what the heck a data scientist did for a living…
As you can imagine, this led to some chaos, and it strongly affected the data culture at the company: everyone who was a data scientist before the change did everything they could to avoid getting sucked into management chains that had no data experience but huge data ambition.
The result? If you’re looking for a job at this company and search for ‘data science’ you might not end up with what you expect.
Another company I worked for thinks data skills will consolidate over time and eventually every software engineer will add ‘data and ML’ to their toolkit…and there won’t be any specialized data people…eventually. So if you search for ‘data scientist’ at this company you might find nothing – but the company would love to hire people with strong data science skills as ‘engineers’.
I’m sure there are hundreds of similar stories across the industry. And as people who learn data in one culture move between companies, things are diffusing and blending in crazy ways.
The point is: when you’re looking for a job in data or machine learning, keep an open mind – don’t over-index on a particular title or a particular way of looking at the field.
So where to start? Here are five data professional job functions that I think will become stable over time. The names may vary, but the functions (hopefully) won’t. These are:
- Machine Learning Researcher
- Data Scientist
- Machine Learning Scientist (Modeler)
- Machine Learning Engineer
- Machine Learning Architect (Program Manager)
Keep in mind that these are functions, not jobs. Most jobs will blend these to various degrees. I’ll go through and give a bit more detail.
Machine Learning Researcher
Machine Learning Researchers advance the state of human knowledge. They come up with theories about how the world works and they create experiments to test those theories. When they are right, the result is new algorithms or approaches that allow us to accomplish more than we thought we could.
And it is an exciting time to be a researcher in machine learning. Things are advancing crazy fast and small groups of people have accomplished incredible things.
Machine Learning researchers might be ‘applied’ in that they work in the context of a specific product, like a search engine or a self-driving car. But fundamentally research is not about building products. It is about understanding why a particular approach works in a particular setting, and creating knowledge that transcends any single feature or product.
Core skills for success with machine learning research include: scientific method, ability to deal with ambiguity, a high level of comfort with advanced math, good communication, the ability to advocate for crazy ideas, and just enough engineering skills to carry out some experiments.
To become a professional Machine Learning Researcher – like to get some company to sponsor you to sit around and try to advance human knowledge – you really need to publish papers at top scientific conferences. And the only practical way to learn how to do that is to get a PhD in Machine Learning.
There was a time when the only way to learn machine learning was to get a PhD; so there was a time where just about every professional machine learning practitioner had a PhD. But this is no longer necessary. In fact, unless you really want to advance human knowledge and write papers: a PhD is an inefficient way to become a data professional.
Data Scientist
Data Scientists find the stories in data and share them with others. They explore large data sets and answer questions, measure performance, track down problems, and find unexpected connections. A great data scientist is like a detective – they know how to interpret the clues they find in the data and track those clues to uncover valuable insights.
Most data scientists have a background in statistics or applied mathematics, coupled with enough programming skill to independently get at log data, process and clean it, query it, and automate repetitive tasks.
And data science requires a specific mentality. You have to like staring at data and dreaming up stories that explain it. But you also have to be meticulous and technical enough to prove or disprove your stories (before spouting off random theories and confusing everyone around you).
Core skills for success with data science include: A curious and flexible mind, deep statistical knowledge, familiarity with data querying languages, and moderate programming, probably in R or Python.
And what does this have to do with machine learning?
You might say that data science is about understanding what is happening in a big complicated system, while machine learning is about predicting what is going to happen in the future. There is a lot of overlap in tools and approaches; the difference is one of focus.
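To make the data-science side of that distinction concrete, here’s a toy sketch in pure Python – the log records, field names, and numbers are entirely made up – of the kind of question-answering a data scientist does all day: pull some logs, compute a few summary statistics, and find the day where something went wrong.

```python
from collections import Counter
from statistics import mean

# Hypothetical log records: (day, latency_ms, had_error) tuples standing in
# for data a data scientist would pull from a real logging system.
logs = [
    ("mon", 120, False), ("mon", 130, False), ("mon", 115, False),
    ("tue", 125, False), ("tue", 140, False), ("tue", 980, True),
    ("wed", 118, False), ("wed", 122, False), ("wed", 119, False),
]

# Question 1: what is the error rate per day?
errors = Counter(day for day, _, err in logs if err)
totals = Counter(day for day, _, _ in logs)
error_rate = {day: errors[day] / totals[day] for day in totals}

# Question 2: which day has unusually high average latency?
avg_latency = {day: mean(lat for d, lat, _ in logs if d == day)
               for day in totals}
worst_day = max(avg_latency, key=avg_latency.get)

print(error_rate)
print(worst_day)  # 'tue' stands out on both measures
```

The interesting part is never the code – it’s the next question: *why* did Tuesday spike, and is there a story in the data that explains it?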
Machine Learning Scientist (Modeler)
Machine Learning Scientists build models. They find or create training data, do feature engineering, know which learning algorithm to use for any particular task, and tune model parameters. They measure, measure, measure, and they know how to evaluate the output of modeling runs and what changes to make based on those observations.
Modeling is an open-ended, exploratory task – kind of like constantly debugging a program written in a language you can’t understand. A machine learning scientist might spend weeks or months working on a single modeling task, making the model just a little bit better every single day.
Core skills for success at machine learning science include: a deep intuition for the core modeling algorithms and approaches, expertise in one or more domains (like NLP or computer vision), strong programming in a language like Python, a lot of comfort with data processing and querying, and a passion for measuring and debugging.
The best way to become a machine learning scientist is to get a degree in a field with a computational focus, like computer science, applied math, statistics, or even a hard science like physics or chemistry. This will give you the core statistics and computation skills. Then go to Kaggle.com and start entering their modeling competitions. Start doing well on Kaggle, and you’re well on your way to becoming a machine learning scientist.
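As a (very) simplified sketch of that modeling loop – synthetic data, a hand-rolled one-feature logistic regression, and a held-out measurement; a real modeling task would use established libraries, real data, and far richer features:

```python
import math
import random

random.seed(0)

# Synthetic training data: a single feature x in [0, 1]; the label is 1 when
# x > 0.5. In a real task, finding and engineering features is most of the work.
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(200))]
train, test = data[:150], data[150:]

# A minimal logistic-regression "model": two parameters tuned by
# stochastic gradient descent on the log loss.
w, b = 0.0, 0.0
learning_rate = 0.5
for epoch in range(500):
    for x, y in train:
        p = 1 / (1 + math.exp(-(w * x + b)))  # predicted probability
        w -= learning_rate * (p - y) * x      # gradient step on w
        b -= learning_rate * (p - y)          # gradient step on b

def predict(x):
    return int(1 / (1 + math.exp(-(w * x + b))) > 0.5)

# Measure, measure, measure: accuracy on held-out data,
# never on the data the model was trained on.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The professional version of this loop is the same shape – change one thing, retrain, re-measure on held-out data – repeated for weeks.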
Machine Learning Engineer
Machine Learning Engineers integrate machine learning into working systems to produce successful end-to-end experiences. They implement the runtimes where models execute, they build systems to deploy new models reliably, they connect model output into user experiences, and they build systems that collect telemetry about interactions between users and models, producing future training data.
Machine learning engineers create the systems that put guardrails around the machine learning modeling process, allowing creative exploration, but providing simple, reliable ways to take the resulting models and ingest them into the broader system.
Core skills for success in machine learning engineering start, of course, with a strong software engineering base. Beyond that, a good conceptual understanding of machine learning is key. Not the math behind the algorithms – that’s not super important to a machine learning engineer – but the pieces that make up a machine learning implementation, and where in the system they should live.
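Here’s a minimal sketch of what those guardrails might look like in practice – all names, feature keys, and thresholds are hypothetical, and a production runtime would be far more involved – a wrapper that validates inputs, bounds model outputs, and records the telemetry that becomes future training data:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model_runtime")

# A stand-in for a trained model: any callable from features to a raw score.
# In a real system this would be loaded from a model store.
def loaded_model(features):
    return 0.3 * features["clicks"] + 0.1 * features["dwell_seconds"]

REQUIRED_FEATURES = {"clicks", "dwell_seconds"}
FALLBACK_SCORE = 0.0

def score(features):
    """Guardrails around the model: validate inputs, clamp outputs to a
    sane range, and log telemetry that can become future training data."""
    if not REQUIRED_FEATURES <= features.keys():
        log.warning("missing features, using fallback: %s", features)
        return FALLBACK_SCORE
    raw = loaded_model(features)
    bounded = min(max(raw, 0.0), 1.0)  # model output must stay in [0, 1]
    log.info("telemetry: %s", json.dumps(
        {"features": features, "score": bounded}))
    return bounded

print(score({"clicks": 2, "dwell_seconds": 3}))  # normal path
print(score({"clicks": 2}))                      # fallback path
```

The point of the design: modelers can swap in whatever model they like behind `loaded_model`, and the rest of the system keeps its simple, reliable contract.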
With a little study, any software engineer can get into machine learning engineering. You could take an online course or do a few Kaggle tutorials. But in my opinion, the best place to start is by reading this book: Building Intelligent Systems. Which I wrote. It has all the stuff I wish I knew when I got started doing machine learning professionally.
Machine Learning Architect / Program Manager
Machine learning architects / program managers design ML-based solutions to real world problems. They know when machine learning is the right tool (and when it isn’t); they understand how to optimize a system end to end so that the machine learning is in position to shine; they know how to design around the mistakes that machine learning is guaranteed to make; and they know how to nurture a machine learning system through its lifecycle from a technical demo, to a viable product, to a world class solution.
When machine learning architects look at a problem they don’t ask: can my organization model that? They ask: should my organization model that? And if so, what’s the best approach to make it efficient and reliable? You can learn more by watching this video or reading this blog post.
Core skills for a machine learning architect or program manager include: strong software design, customer empathy, and a strong conceptual understanding of machine learning (but not the math and not the specific algorithms).
And the best way to get into machine learning architecture or program management? Work as a program manager or engineer for a while… and then read Building Intelligent Systems. I don’t know. I’m sorry. I guess I’m a bit biased.
It’s an exciting time to be a data professional. Data and machine learning are making the world a better place – and things are changing fast. Good luck. Stay safe!
| ML Career | Core Activity | Core Skills |
| --- | --- | --- |
| Machine Learning Researcher | Advance human knowledge | Scientific method, math, basic programming |
| Data Scientist | Stories from data | Statistics, data manipulation, communication |
| (Applied) Machine Learning Scientist | Build predictive models | Machine learning algorithms, domain-specific feature engineering, basic programming |
| Machine Learning Engineer | Integrate machine learning into systems | Software engineering, conceptual machine learning |
| Machine Learning Architect / Program Manager | Design solutions that leverage machine learning | Software design skills, customer empathy, strong conceptual machine learning |
2 thoughts on “Top five career paths for data professionals”
In the evolving scope of the corona crisis, I am concerned that machine learning advocates have not been forthcoming concerning lessons learned from ‘The Parable of Google Flu’ https://science.sciencemag.org/content/343/6176/1203.full
It is my belief that the Coronacrisis will be the watershed moment when data science retrenches and admits the inherent limitations of monomaniacal predictive ‘big data’ machine learning, deep learning in particular.
This will not be a surprise, as even predictive machine learning advocates admit that the discipline is dangerously over-hyped. Non-specialists simply misunderstand and expect too much from predictive machine learning, pursuing ‘magic box’ solutions.
We can consider this as a natural and healthy adjustment which will lead to improvements concerning how people think about and frame data science more generally (the ‘Plateau of Productivity’ in the Gartner Hype Cycle (https://en.wikipedia.org/wiki/Hype_cycle)).
What is needed to build a future data science competency in the post-COVID future? The field of data science will need to shift its attitude and approach in two particular respects:
1) Abandon the false dichotomy that predictive machine learning has made classical statistics somehow obsolete (e.g. the summit of the Gartner ‘Peak of Inflated Expectations’ as driven by writings such as: The Big Idea: The Next Scientific Revolution (https://hbr.org/2010/11/the-big-idea-the-next-scientific-revolution) and The Fourth Paradigm: Data-Intensive Scientific Discovery (https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/)).
2) Retrench and reinvigorate data science by focusing on methods that integrate explanatory (more associated with classical statistics and econometrics) and machine learning approaches (as more associated with correlation and prediction in large datasets).
Regarding treating data prior to analysis, statisticians generally cleave to some version of Tukey’s EDA process (https://en.wikipedia.org/wiki/Exploratory_data_analysis) to develop causal-explanatory models. Those pursuing predictive machine learning generally follow an analytics life cycle process (https://www.amazon.com/Analytics-Lifecycle-Toolkit-Practical-Capability/dp/1119425069/), or a particular ML model development approach such as CRISP-DM (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining).
There is a somewhat contrived debate among some statisticians and some machine learning advocates regarding the virtues and vices of the other camp. Some of this comes from those who would just rather not invest time in building bridges between the two domains. However, well-respected statisticians and econometricians (i.e. H. Varian (http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf), T. Hastie (https://web.stanford.edu/~hastie/TALKS/SLBD_new.pdf), J. Pearl (https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X/ref=sr_1_1), D. Donoho (https://www.youtube.com/watch?v=QTzNXYcZLbU)), in one form or another, advocate bridging explanatory and predictive methods. I would recommend absorbing the works linked here.
It is a fairly heavy commitment to be fluent in both the range of multivariate inferential statistical methods (causal-explanatory) and machine learning methodologies (supervised-predictive). However, pursuing predictive (generally correlative) ML models alone invokes ‘correlation not causation’ and risks overfitting to noise and spurious correlations (see ‘The Parable of Google Flu (https://gking.harvard.edu/files/gking/files/0314policyforumff.pdf)’ for an epidemiological big data ML morality tale).
This is not to say statistics has an answer or cure in these cases, but it certainly is much better at quantifying uncertainty formally and incorporating that uncertainty in rigorous assessments.
If I had to put my money down, I believe in 2 years we will see more adherence to and respect for, in particular, Bayesian statistical decision theory. Deep learning isn’t going to get us out of this situation.
For those data scientists focused on deep learning / predictive machine learning, I would recommend expanding your skillset by reading up on Bayesian methods, particularly statistical decision theory. Here is a good introduction: Introduction to Statistical Decision Theory: Utility Theory and Causal Analysis (https://www.routledge.com/Introduction-to-Statistical-Decision-Theory-Utility-Theory-and-Causal/Bacci-Chiandotto/p/book/9781138083561)
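As a minimal taste of what formally quantifying uncertainty looks like in practice – the case counts below are invented purely for illustration and are not real COVID-19 figures – a Beta-binomial model gives a full posterior distribution over a rate, not just a point estimate:

```python
import random

random.seed(1)

# Hypothetical data: 9 deaths observed out of 120 confirmed cases.
deaths, cases = 9, 120

# Beta-binomial model with a uniform Beta(1, 1) prior: the posterior over
# the fatality rate is Beta(deaths + 1, cases - deaths + 1).
alpha, beta = deaths + 1, cases - deaths + 1
posterior_mean = alpha / (alpha + beta)

# Monte Carlo 90% credible interval from posterior samples.
samples = sorted(random.betavariate(alpha, beta) for _ in range(10000))
lo, hi = samples[500], samples[9500]
print(f"rate ~ {posterior_mean:.3f}, "
      f"90% credible interval [{lo:.3f}, {hi:.3f}]")
```

The interval is the point: instead of reporting a single fatality rate as if it were known, the analysis carries its own statement of how uncertain it is given the data observed so far.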
The future of data science will be:
* Not ignoring classical statistics as ‘something superseded by machine learning’
* Less arrogant and presumptive of being able to solve every problem ever and more careful concerning the inherent limitations of insights provided (hint: garbage-in-garbage-out rule)
* Data scientists will be properly viewed as support for domain experts, not (Dunning-Kruger (https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect)) leaders in whatever domain they jump into (e.g. epidemiologists should take the lead in COVID-19 analytics and be supported by data scientists, not the other way around)
* Paying attention to the data generating process (DGP) – source, quality, fit with reference to the framed research question under scrutiny, etc.
* Blending explanatory (statistical / econometric) and predictive (machine learning) approaches
Predictive machine learning depends on having high quality training data. The COVID-19 crisis has exposed quite dramatically how obtaining high-quality data is not a given and cannot be taken for granted. While some machine learning techniques assist in accommodating poor quality data, it cannot by nature overcome data that is fundamentally lacking in quality or is marked by ‘real world’ uncertainty to such a degree that there are inherent doubts concerning the representativeness of the data.
For instance, as epidemiologists know well, key variables such as the transmission (spread) rate and fatality rate of a disease may not be easy to determine during an epidemic. This is because there are many factors associated with spread and it may be difficult to track transmission discretely. Essentially robust contact records are needed to determine transmission vectors. This is why there has been a sudden interest in potentially using cellphones to track spread and infection.
Mortality rate is also subject to a great deal of uncertainty, as the impact of comorbidity (when a condition is influenced by pre-existing underlying or co-occurring conditions) may be difficult to assess. Age, lifestyle (e.g. drinking, smoking), demographic, and genetic factors also play a role. There are fundamental questions and uncertainties concerning the degree to which the progression of COVID-19 infection is related to or affected by a whole range of complex health factors. There is also the factor of exposure – whether one gets a larger versus smaller exposure to the virus, the latter potentially being easier for the immune system to overcome.
In short, there are a range of fundamental variables which are subject to uncertainty and incomplete data. No fancy algorithm or magic box will fix this automatically.
This gets back to a fundamental issue in data science: the degree to which it embraces classical statistics versus the degree to which it (over-) focuses on machine learning, predictive machine learning in particular.
A number of popular predictive machine learning algorithms are deployed without conducting deeper investigations into the origins (how was the data collected?), quality (how much can the data be trusted? is the dataset complete and representative?), and probabilistic nature (parametric characteristics) of the data.
Many predictive machine learning algorithms are deployed and operate under the assumption that correlation is a reasonable proxy, in aggregate, for causation. That is, that ‘big’ data, many variables and/or many instances, is an acceptable way to ‘get around’ problems associated with data quality and completeness.
The classical statistician methodologically disagrees, having a primary focus on qualifying and quantifying the dataset prior to statistical analysis.
In classical statistics there is a notion of the ‘data generating process’ (DGP): that what we see (the data sample) is restricted by what and how we measure, often with strange and unexpected distortions.
For instance, during Ebola outbreaks infections appear to dip when they become serious because people hide their sick family members as otherwise strange hazmat-suited foreigners come and take them away and ‘kill them’. As in the COVID-19 outbreak, the reporting of and subsequent actions reacting to the outbreak cause distortions in how people behave. This makes it more difficult to collect high-quality data.
Dr. Fauci and other experts have been attempting to communicate this: that not only is there a great amount of uncertainty associated with the data we have, but that our reporting on what we believe about the uncertain data affects people’s behavior, causing follow-on distortions on subsequent events. This means, essentially, that the DGP for COVID-19 is not only riven with uncertainty, but that the nature of the uncertainty shifts as reporting and social behavior evolves. We will never have a complete ‘real-time real world’ picture, as we are only seeing shadows in data samples.
This concept of the classical statistics notion of the DGP has been somewhat sidelined in the hubristic age of BIG DATA and data science. There is this assumption that we always have ALL the data and know everything about a situation.
However, this whole situation has revealed that collecting good data is and always will be tricky and difficult. And then even when we try our best, the messy real world intervenes and introduces uncertainty (if not outright distortion due to unanticipated social behaviors resulting from reporting).
Wow Scott, what an amazingly thoughtful comment — thanks!
Based on this (and other comments) I realize I’m lacking some knowledge of statistics and how it fits in.
I also have a thought that the most expensive data-modeling mistake in human history (and for a while to come) was almost certainly made in the past two months. I’m not sure what, exactly, it was, but there were so many obviously-flawed models projecting serious things on gigantic stages…
When the dust settles, I hope we find out what the mistake was and who made it.
And I hope we add to our general education so the underlying flaw is much-less-likely to happen again.