If you have been following the breathless hype of AI and ML over these past few years, you might have noticed the increasing pace at which vendors are scrambling to roll out “platforms” that service the data science and ML communities. The “Data Science Platform” and “Machine Learning Platform” are at the front lines of the battle for the mind share and wallets of data scientists, ML project managers, and others that manage AI projects and initiatives. But what exactly are these platforms and why is there such an intense market share grab going on?
One insight is the realization that ML and data science projects are nothing like typical application or hardware development projects. Whereas in the past hardware and software development aimed to produce functionality that individuals or businesses could individually run or control, data science and ML projects are really about managing data, continuously evolving learning gleaned from data, and the evolution of data models based on constant iteration. Typical development processes and platforms simply don’t work from a data-centric perspective.
It should be no surprise then that technology vendors of all sizes are focused on developing platforms that data scientists and ML project managers will depend on to develop, run, operate, and manage their ongoing data models for the enterprise. The thought from these vendors is that the ML platform of the future is like the operating system or cloud environment or mobile development platform of the past and present. If you can dominate market share for data science / ML platforms, you will reap rewards for decades to come. As a result, everyone with a dog in this fight is fighting to own a piece of this market.
However, what does a Machine Learning platform look like? How is it the same as or different from a Data Science platform? What are the core requirements for ML Platforms, and how do they differ from more general data science platforms? Who are the users of these platforms, and what do they really want? Let’s dive deeper.
What is the Data Science Platform?
In our earlier article on Data Scientists vs. Data Engineers, we talked a bit about what data scientists do and what they want to accomplish with the technology to support their missions. In summary, data scientists are tasked with wrangling useful information from a sea of data and translating business and operational informational needs into the language of data and math. Data scientists need to be masters of statistics, probability, mathematics, and algorithms that help to glean useful insights from huge piles of information. A data scientist is a scientist who creates hypotheses, runs tests and analyses on the data, and then translates the results so that someone else in the organization can easily view and understand them. So it follows that a pure data science platform would meet the needs of helping craft data models, determining the best fit of information to a hypothesis, testing that hypothesis, facilitating collaboration amongst teams of data scientists, and helping to manage and evolve the data model as information continues to change.
Furthermore, data scientists don’t focus their work in code-centric Integrated Development Environments (IDEs), but rather in notebooks. First popularized by academically-oriented math-centric platforms like Mathematica and Matlab, but now prominent in the Python, R, and SAS communities, notebooks are used to document data research and simplify reproducibility of results by allowing the notebook to run on different source data. The best notebooks are shared, collaborative environments where groups of data scientists can work together and iterate models over constantly evolving data sets. While notebooks don’t make great environments for developing code, they make great environments to collaborate, explore, and visualize data. Indeed, the best notebooks are used by data scientists to quickly explore large data sets, assuming sufficient access to clean data.
Indeed, data scientists can’t perform their jobs effectively without access to large volumes of clean data. Extracting, cleaning, and moving data is not really the role of a data scientist, but rather that of a data engineer. Data engineers are tasked with taking data from a wide range of systems, in structured and unstructured formats, that is usually not “clean”: it has missing fields, mismatched data types, and other data-related issues. In this way, a data engineer is an engineer who designs, builds, and arranges data. Good data science platforms also enable data scientists to easily leverage compute power as their needs grow. Instead of copying data sets to a local computer to work on them, platforms allow data scientists to easily access compute power and data sets with minimal hassle. A pure data science platform is challenged to provide these data engineering capabilities as well. As such, a practical data science platform will combine pure data science capabilities with the necessary data engineering functionality.
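To make the idea of data that is not “clean” a little more concrete, here is a minimal Python sketch of the kind of repair work a data engineer might do before a data scientist ever sees the data. The field names (`customer_id`, `amount`, `region`), the sample rows, and the rules for dropping or defaulting values are all hypothetical, chosen only to illustrate missing fields and mismatched data types.

```python
from typing import Optional

def clean_record(raw: dict) -> Optional[dict]:
    """Return a normalized record, or None if it cannot be repaired.

    The field names and repair rules here are assumptions for illustration.
    """
    customer_id = raw.get("customer_id")
    if customer_id in (None, ""):
        return None  # missing key field: drop the row
    try:
        # Mismatched types: amounts may arrive as strings or numbers.
        amount = float(raw.get("amount", ""))
    except (TypeError, ValueError):
        return None  # unparseable amount: drop the row
    # Missing optional field: fall back to a default value.
    region = (raw.get("region") or "unknown").strip().lower()
    return {"customer_id": str(customer_id), "amount": amount, "region": region}

# Hypothetical raw rows pulled from different source systems.
raw_rows = [
    {"customer_id": 101, "amount": "12.50", "region": " East "},
    {"customer_id": "102", "amount": 7, "region": None},
    {"customer_id": "", "amount": "3.00", "region": "west"},
    {"customer_id": 104, "amount": "n/a", "region": "west"},
]

# Keep only the rows that could be repaired.
clean_rows = [r for r in map(clean_record, raw_rows) if r is not None]

if __name__ == "__main__":
    for row in clean_rows:
        print(row)
```

Real pipelines involve far more than this, but the sketch shows why the work is a distinct discipline: every rule about what to drop, default, or convert is a decision someone has to make and maintain.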
What is the Machine Learning Platform?
We just spent several paragraphs talking about data science platforms and not even once mentioned AI or ML. How are data science platforms relevant to ML? Well, simply put, Machine Learning is the application of specific algorithms, additional unsupervised or supervised training approaches, and learning-focused iteration to the large sets of data that would otherwise be operated on by data scientists. The tools that data scientists use on a daily basis have significant overlap with the tools used by ML-focused scientists and engineers. However, these tools aren’t the same, because the needs of ML scientists and engineers are not the same as those of more general data scientists and engineers.
Rather than just focusing on notebooks and the ecosystem to manage and collaboratively work with others on those notebooks, folks tasked with managing ML projects need access to the range of ML-specific algorithms, libraries, and infrastructure to train those algorithms over large and evolving datasets. ML platforms help ML data scientists and engineers discover which machine learning approaches work best, tune hyperparameters, and deploy compute-intensive ML training across on-premises or cloud-based CPU, GPU, and/or TPU clusters, and they provide an ecosystem for managing and monitoring both unsupervised and supervised modes of training.
Clearly a collaborative, interactive, visual system for developing and managing ML models in a data science platform is necessary, but it’s not sufficient for an ML platform. As hinted above, one of the more challenging parts of making ML systems work is the setting and tuning of hyperparameters. The whole concept of a machine learning model is that it’s a mathematical formula that requires various parameters to be learned from the data. In other words, what machine learning actually learns are the parameters of the formula, and new data is then fitted to that learned model. Hyperparameters are configuration values that are set before training an ML model and that can’t be learned from the data. These hyperparameters control various factors such as model complexity, speed of learning, and more. Different ML algorithms require different hyperparameters, and some don’t need any at all. ML platforms help with the discovery, setting, and management of hyperparameters, among other things, including algorithm selection and comparison, that non-ML-specific data science platforms don’t provide.
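The distinction between parameters and hyperparameters is easier to see in a small example. The following Python sketch is only an illustration under stated assumptions, not how any particular ML platform works: it fits a straight line `y = w*x + b` with plain gradient descent, where `w` and `b` are the parameters learned from the data, and `learning_rate` and `epochs` are hyperparameters chosen before training. The function names, the toy data, and the candidate learning rates are all hypothetical.

```python
def train(xs, ys, learning_rate, epochs):
    """Learn parameters w and b from data.

    learning_rate and epochs are hyperparameters: they are set before
    training and are not learned from the data.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the given data."""
    errors = [(w * x + b - y) for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(xs)

def search_learning_rate(xs, ys, candidates, epochs=500):
    """Try each candidate learning rate and return the one with the lowest error."""
    results = {}
    for lr in candidates:
        w, b = train(xs, ys, lr, epochs)
        results[lr] = mse(xs, ys, w, b)
    best = min(results, key=results.get)
    return best, results

# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

if __name__ == "__main__":
    best, results = search_learning_rate(xs, ys, [0.001, 0.01, 0.1])
    print("error per learning rate:", results)
    print("best learning rate:", best)
    print("learned parameters:", train(xs, ys, best, 500))
```

The loop in `search_learning_rate` is a bare-bones version of hyperparameter search. For simplicity it scores each candidate on the training data itself, where a real search would normally use held-out validation data. Doing this across many algorithms, many hyperparameters, and large datasets is where an ML platform can help.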
What do ML Project Managers Really Want?
At the end of the day, ML project managers simply want tools to make their jobs more efficient and effective. While we have written earlier that not all ML is AI, and perhaps some of the ML approaches are used primarily for non-AI predictive analytics, those seeking to add true intelligence as part of their mission need the same capabilities regardless of how ML is being applied. The real winners in the ML platform race will be the ones that simplify ML model creation, training, and iteration. They will make it quick and easy for companies to move from unintelligent systems to ones that leverage the power of ML to solve problems that previously could not be addressed by machines. This is the ultimate vision of ML as applied to AI: make systems autonomous and intelligent, and have them generate knowledge and action that otherwise would require human capabilities.
ML platforms that enable this capability are winners. Data science platforms that don’t enable ML capabilities will be relegated to non-ML data science tasks. Vendors who pretend that their rebranded business intelligence, data analytics, big data engineering, programming-centric, or other tools are AI / ML platforms are in for a rude awakening. We know who you are, and no, you are not an AI / ML platform vendor. Stay tuned for our big report on Data Science and Machine Learning Platforms as we sort out who is doing what in the ML platform space, which data science platform vendors are the ones worth paying attention to in the ML space, what is necessary functionality for ML platforms and what is not, and who is starting to win the race for market share in this constantly evolving, but significantly attractive market.