3 Early Questions to Indicate the Quality of a Machine Learning Setup

By Simon Brendel, Philipps & Byrne

What should I ask an AI company early in the funnel to figure out whether their machine learning is good? This is something we are asked quite often by clients, especially VCs who screen their deal flow. With the recent trends in generative AI, the GPT models for instance, we see more AI companies emerging than in previous years. VCs are now tasked with selecting the most promising of these companies to present to their investment committees. Quite often, the question arises whether a company is really doing “good ML”. Companies will claim to be using AI, but what is behind the claim, and how strong is it? As technology experts, we are frequently asked by investors how they can separate the good from the bad AI. That is why we want to offer an early heuristic for thinking about the quality of a setup, in the form of three questions.

First Question: Are You Dealing With Machine Learning or Traditional Rule-Based Systems?

The first distinction to draw is whether you are really dealing with ML or with a hand-crafted rule-based system. In other words: does the technology involve statistical learning from data or not? Check whether it is a system that improves when trained with more data; if so, it is ML. Beyond supervised training, ML also includes self-supervised and unsupervised methods, and some systems keep improving from incoming data as they run, an approach known as “online learning”. Or is it mostly a system that works on fixed rules, set by developers beforehand, that is neither trained with data nor self-learning?

The latter is usually not real AI, at least not in the more recent understanding. So that is the first question: you can distinguish between a largely rule-based technology and an AI or ML technology, where the company trains a statistical learning model, or even builds a model itself, and thereby has real ML capabilities in-house.
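To make the distinction concrete, here is a minimal sketch in Python, using a hypothetical spam-filtering task with made-up keywords and training messages. The rule-based variant behaves exactly as its developer coded it and never changes; the tiny learned classifier (a naive Bayes word-count model) derives its behavior from labelled examples and improves as more of them arrive.

```python
# Hypothetical contrast: hand-crafted rules vs. a model learned from data.
from collections import Counter

RULE_KEYWORDS = {"winner", "free", "prize"}  # fixed in advance by a developer

def rule_based_is_spam(text: str) -> bool:
    """Rule-based: flags spam only if a hardcoded keyword appears."""
    return any(word in RULE_KEYWORDS for word in text.lower().split())

class TinyNaiveBayes:
    """Learned: per-class word frequencies are estimated from training data."""
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        for word in text.lower().split():
            self.counts[label][word] += 1
            self.totals[label] += 1

    def is_spam(self, text: str) -> bool:
        vocab = len(self.counts["spam"]) + len(self.counts["ham"]) + 1
        scores = {}
        for label in ("spam", "ham"):
            # Product of per-word probabilities with add-one smoothing.
            score = 1.0
            for word in text.lower().split():
                score *= (self.counts[label][word] + 1) / (self.totals[label] + vocab)
            scores[label] = score
        return scores["spam"] > scores["ham"]

model = TinyNaiveBayes()
model.train("claim your free prize now", "spam")
model.train("cheap meds guaranteed winner", "spam")
model.train("meeting moved to friday", "ham")
model.train("lunch tomorrow with the team", "ham")

print(rule_based_is_spam("cheap meds guaranteed"))  # False: no hardcoded keyword matches
print(model.is_spam("cheap meds guaranteed"))       # True: pattern learned from data
```

The point of the sketch: feeding the rule-based function more examples changes nothing, while every extra `train()` call shifts the learned model's decisions. That sensitivity to data is the signature of ML.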

Second Question: Do They Make or Buy?

The second level of distinction: are they developing the ML themselves, or is the model provided by a third party? Third-party models can be open source or proprietary. If a model is open source, companies will very often adapt or fine-tune it themselves; in that case they are not developing something from scratch, but not simply taking something off the shelf either. A current example is ChatGPT: you can build your service on the GPT models of OpenAI, which are not open source but proprietary models offered via an API under certain licensing terms. So here you can distinguish whether a company uses external ML capabilities, adapted for its purposes (which can also be a great business model), or self-developed ML, which is required for certain types of technology companies and which usually goes much deeper.

Third Question: What Kind of Data Is Being Used for Training, and Who Owns It?

With either of the previously mentioned setups, there is an even deeper heuristic for judging whether the ML is good. Once you have figured out whether the company's model is self-made or bought, the bigger question is: how has this model been trained? Models will become a commodity in the coming years; there will be plenty of models and computing capacity, and merely getting to a model will not be a differentiator. The really interesting questions will then be: how have these models been trained? With what type and quantity of data? And were they trained by the company you are looking into? Owning data will be a real moat in the AI/ML-centric years to come.

If the company uses its own data, there are two vectors to distinguish. The first is the quantity of data points in the data set, i.e. how statistically significant certain effects in that data set will be. A larger data set is generally more valuable for training a model, because it reduces statistical error and yields higher confidence. The aspect of data diversity should not be underestimated either: how well does the available data cover the space the model operates in? Some companies can produce huge data sets from their own operations, but often those data sets are too narrow for the model to improve significantly.
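The quantity argument can be made precise with a back-of-the-envelope calculation. The sketch below (a hypothetical example, using the standard normal approximation) shows how the 95% confidence interval around an estimated rate, say a 10% machine-failure rate, narrows with the number of samples: the interval width shrinks like one over the square root of n.

```python
# Why more data points mean tighter confidence: the standard error of an
# estimated rate p shrinks as 1/sqrt(n), so the confidence interval narrows
# as the data set grows.
import math

def ci_width_95(p: float, n: int) -> float:
    """Approximate width of a 95% confidence interval for a rate p
    estimated from n independent samples (normal approximation)."""
    return 2 * 1.96 * math.sqrt(p * (1 - p) / n)

p = 0.10  # e.g., a hypothetical observed 10% failure rate
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9}: 95% CI width ~ {ci_width_95(p, n):.4f}")
```

Going from 100 to 10,000 samples cuts the interval width by a factor of ten, which is exactly why an effect that looks like noise in a small data set can become a reliable signal in a large one. Diversity is the caveat: a million samples from one machine type still tell you little about the rest of the space.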

The second vector is how proprietary this data set is. The interesting question here is: how did the company get the data? Is it publicly available? Is it obtained from certain suppliers? Or is it perhaps even collected by the company itself? The more private the data set, the stronger the company's ability may be to serve a certain use case and thereby build a successful business model.

Public Data

A good example of a model built on publicly available data would be an application that predicts certain weather movements. This is a standard case: anybody could build a model based on these data points, because the data is publicly available for everyone to obtain. You may have to pay for it, but you can obtain it easily.

Semi-Private Data

An example of using data that is a little more private would be a model in the predictive maintenance space. You get input data on machine lifecycles, machine usage, and so on from certain machines, OEMs, manufacturers, etc., and use it to train your model to predict certain maintenance intervals. Here the data is not publicly available, but it is not exclusive either: it is still available from a certain number of suppliers that you would need to reach out to. Another company, a competitor for instance, could potentially sign a contract with these suppliers and obtain the same data. So it is more private, but still available to others.

Exclusively Private Data

The third, and in our understanding the most interesting, case is the non-available or exclusive data set. Here the company training the model collects the data set itself, for example with its own machinery, IoT infrastructure, sensors, etc. It thereby creates a data set that is available only to itself for training its models. This in itself creates a very strong moat. Why? Because it will be really hard for other companies, for competitors, to replicate that type of ML model: they would first have to build up the database themselves, then build the model, and then combine these capabilities.

Bonus Question: Is There a Verticalization Strategy?

Now, there is one interesting development that we see regarding these very private data sets that companies are trying to obtain to build their moat. More and more AI companies understand that they need to privatize the data they want to train their models with in order to build a successful business model. How can they achieve this? Collecting data, or even building machines to do so, is really asset-heavy and takes a lot of time. It is not the fastest way, and not really the startup way. So, what we are starting to see is AI companies, often with strong IP and academic roots, entering M&A deals with more traditional industry companies, quite often with some IoT hardware components involved. Hospitals in the healthcare space, for example, or OEMs, small manufacturers, or even service companies in the industrial space can be interesting targets for these AI companies. The strategy is to obtain these data sets by buying out the companies themselves and securing exclusive data access.

So, what we see here is companies deciding to verticalize not by building the assets from scratch and thereby becoming asset-heavy themselves, but by understanding that, in order to privatize their training data and train their ML properly, an early-stage acquisition of a traditional industry player is quite an interesting move. Why? Because these players usually come with a vast quantity of unused data whose value they do not realize, as they lack the ML capabilities. Combining data from the old industry with the ML capacities of an AI startup will be key to building very interesting and predictive models in the long run. So we assume we will see a lot of companies buying out companies that own data sets to train their ML models, perhaps even an M&A spree in that sense. We are more than interested to see which AI companies will take over traditional industry companies to obtain that data in an effort to successfully verticalize.

So, to sum it up, there is a simple early heuristic for judging the quality of ML. First, figure out whether the company is really doing ML or just running a rule-based system. Second, clarify whether the model is bought or built. Third, question the data the model was trained with. The common understanding here is: the bigger, more diverse, and more private the data set you train with, the better, in terms of building a differentiated moat and potentially becoming successful as a company. A major key to that will be the acquisition of traditional industry players with vast untapped data in order to verticalize.