What should I ask an AI company early in the funnel to figure out whether their Machine Learning is good? This is something clients ask us quite often, especially VCs who screen their deal flow. With the recent trends in Generative AI, the GPT models for instance, we see more AI companies emerging than in previous years. VCs are now tasked with selecting the most promising of these companies to present to their investment committee members. Quite often the question comes up whether this is really “good ML”. Companies will claim to be using AI, but what is it really, and how strong is it? As technology experts, we are frequently asked by investors how they can separate the good AI from the bad. That is why we want to offer an early heuristic for thinking about the quality of a setup, in the form of three questions.
First Question: Are You Dealing With Machine Learning or Traditional Rule-Based Systems?
The first distinction one can draw is whether you are really dealing with ML or with a hand-crafted rule-based system. In other words: does the tech involve statistical learning from data or not? Try to find out whether it is a system that improves when trained with more data. If so, it is ML. This includes self-supervised and unsupervised methods, which learn from data without manually provided labels. A related question is whether the model keeps updating itself as new data arrives during operation, which is known as “online learning”. So, is this the case, or is it mostly a system that follows rules set by developers beforehand and that is neither trained with data nor self-learning?
The latter is usually not real AI, at least not in the more recent understanding. So that is the first question: you can distinguish between a rule-based technology and an AI or ML technology, where the company trains a statistical learning model or even builds the model itself and thereby has real ML capabilities in-house.
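The distinction can be illustrated with a toy sketch. Everything in it is hypothetical and chosen purely for illustration: the temperature-alarm scenario, the function names `rule_based_alarm`, `fit_threshold` and `learned_alarm`, and the sample readings. The first function encodes a fixed rule set by a developer; the second derives its decision threshold from labelled examples, so its behaviour changes when the training data changes.

```python
from statistics import mean

# Hand-crafted rule: the threshold is fixed by a developer and never changes.
def rule_based_alarm(temperature: float) -> bool:
    return temperature > 80.0

# Minimal "learning": place the threshold halfway between the average
# temperature of normal examples and the average of faulty examples.
def fit_threshold(samples: list[tuple[float, bool]]) -> float:
    faulty = [t for t, is_faulty in samples if is_faulty]
    normal = [t for t, is_faulty in samples if not is_faulty]
    return (mean(faulty) + mean(normal)) / 2

def learned_alarm(temperature: float, threshold: float) -> bool:
    return temperature > threshold

# Hypothetical labelled readings: (temperature, was the machine faulty?)
small_data = [(60.0, False), (90.0, True)]
more_data = small_data + [(62.0, False), (70.0, True), (72.0, True)]

threshold_small = fit_threshold(small_data)  # learned from two examples
threshold_more = fit_threshold(more_data)    # shifts once more data is seen
```

With the additional examples the learned threshold moves downwards, so a reading of 72 degrees triggers the learned alarm while the fixed rule still ignores it. A real ML system is far more sophisticated, but the screening question is the same: does the behaviour come from data or from hand-written rules?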
Second Question: Do They Make or Buy?
The second level of distinction: are they developing the ML themselves, or is the model provided by a third party? Third-party models can be open source or proprietary. If a model is open source, companies will very often adapt or fine-tune it themselves. In that case they are not really developing something from scratch, but not simply taking something that already exists either. An example of the proprietary route is the recent rise of ChatGPT: you can use OpenAI’s GPT models for your own service, but the models behind ChatGPT are not open source and are accessed through OpenAI’s API under its commercial terms. So here you can distinguish whether it is external ML capabilities, used and adapted for the company’s purposes (which can also be a great business model), or self-developed ML, which is required for certain types of technology companies and which usually goes much deeper.
Third Question: What Kind of Data is Being Used for Training, and Who Owns it?
For both types of model mentioned above, there is an even deeper heuristic for judging whether the ML is good. Once you have figured out whether the company uses a model that is self-made or bought, the bigger question is how this model has been trained. Models will be a commodity in the coming years, and there will be plenty of models and computing capacity. Getting to a model will not really be a differentiator. The really interesting questions will be: How have these models been trained? With what type and quantity of data? And have these models been trained by the company you are looking into? Owning data will be a real moat in the AI/ML-centric years to come.
If the company uses its own data, there are two vectors to distinguish. The first is the quantity of data points in the data set, which determines how strong and how statistically significant certain effects in that data set will be. All else being equal, a larger data set is more interesting for training your model, because it reduces statistical error and gives you a higher confidence level. The aspect of data diversity should not be underestimated here either: how well does the available data cover the space the model operates in? Some companies are able to produce huge datasets from their own operations, but often these datasets are too narrow for the model to improve significantly.
The second vector is the proprietariness of the data set. The interesting question here is: how did the company get this data set? Is it publicly available? Is it obtained from certain suppliers? Or is it collected by the company itself? The more private the data set is, the stronger the company’s ability may be to serve a certain use case and thereby build a successful business model.
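Quantity and diversity can be told apart with two simple measures, sketched below. The standard error of a mean shrinks with the square root of the sample size, which is why more data gives more confidence. The `coverage` function is a deliberately crude, hypothetical proxy for diversity (the share of equal-width bins of the relevant range that contain any data at all), and the two example datasets are invented for illustration.

```python
import math

def standard_error(std_dev: float, n: int) -> float:
    """Standard error of a mean estimated from n independent samples."""
    return std_dev / math.sqrt(n)

def coverage(values: list[float], low: float, high: float, bins: int) -> float:
    """Fraction of equal-width bins in [low, high] containing at least one value."""
    width = (high - low) / bins
    hit = set()
    for v in values:
        if low <= v <= high:
            # Clamp so that v == high falls into the last bin.
            hit.add(min(int((v - low) / width), bins - 1))
    return len(hit) / bins

# Hypothetical example: many readings, all from a narrow operating range.
narrow = [20.0 + 0.001 * i for i in range(10_000)]
# Far fewer readings, but spread over the whole range of interest.
diverse = [float(x) for x in range(0, 100, 5)]

narrow_cov = coverage(narrow, 0.0, 100.0, 10)
diverse_cov = coverage(diverse, 0.0, 100.0, 10)
```

In this sketch the 10,000-point dataset covers only one of ten bins, while the 20-point dataset touches all of them. A huge dataset can therefore still be too narrow, which is exactly the problem described above.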
Public Data
A good example of a model built on publicly available data would be an application that predicts certain weather movements. This is a standard case: anybody could build a model based on these data points. The data is publicly available and everybody can obtain it. You may have to pay for it, but you can obtain it easily.
Semi-Private Data
An example of using data that is a little more private would be a model in the predictive maintenance space. You get data from certain machines, OEMs, manufacturers, etc. You get input data on machine lifecycles, machine usage, and so on, and use it to train your model to predict maintenance intervals. Here the data is not publicly available, but it is not exclusive either. It is still available from a certain number of suppliers that you would need to reach out to. Another company, a competitor for instance, could potentially get a contract with these suppliers and obtain the same data. So it is more private, but still available to others.
Exclusively Private Data
Then the third and, at least in our understanding, most interesting case is non-available or exclusive data sets. This means the company training the model collects the data set itself, for example with its own machinery, IoT infrastructure, sensors, etc. It thereby creates a data set that is available only to itself for training its models. This in itself creates a very strong moat. Why? Because it will be really hard for other companies, for competitors, to replicate that type of ML model. After all, they would first have to build up the database themselves, then build the model, and then combine these capabilities.
Bonus Question: Is there a Verticalization Strategy?
Now, there is one interesting development that we see regarding these very private data sets that people are trying to obtain to build their moat. More and more AI companies are understanding that they need to privatize the data they want to train their model with in order to build a successful business model. How can they achieve this? Collecting data, or even building machines to do so, is asset heavy and takes a lot of time. It is not the fastest way, and not really the startup way. So, what we are starting to see is that IP-driven, academic AI companies are entering into M&A deals with more traditional industry companies. Quite often, some IoT hardware components are part of this. For example, hospitals in the healthcare space, OEMs or small manufacturers, or even service companies in the industrial space can be interesting for these AI companies. The strategy is to obtain these data sets by buying the companies themselves and thereby gaining exclusive data access.
So, what we see here is companies deciding to verticalize not by building the assets from scratch and thereby becoming asset heavy themselves, but by understanding that, in order to prioritize their ML and train it properly, an early-stage acquisition of a traditional industry player is quite an interesting case. Why? Because these players usually come with a vast quantity of unused data whose value they do not realize, as they lack the ML capabilities. Combining the data of the old industry with the ML capabilities of an AI startup will be key to building very interesting and predictive models in the long run. So, we expect to see a lot of companies trying to buy out companies that own data sets in order to train their ML models, and we even expect an M&A spree in that sense. We are more than interested to see which AI companies will take over traditional industry companies to obtain that data in an effort to successfully verticalize.
Conclusion
So to sum it up, there is a simple early heuristic for identifying the quality of ML. First, figure out whether the company is really doing ML or whether it is just a rule-based system. Second, clarify whether the model is bought or built. Third, question the data the model was trained with. Here the common understanding is: the bigger, more diverse, and more private the data set you train with, the better, in terms of building a differentiated moat and potentially becoming successful as a company. A major key to that will be the acquisition of traditional industry players with vast untapped data in order to verticalize.
Generative AI has been making waves in the VC and startup scene in recent weeks. It is a refreshing and energizing debate, especially after months of rather unpleasant news about market correction, investor pullbacks, valuation drops and layoffs. Even better, it is a debate driven by tech. As tech experts ourselves, who have assessed startups working in the space of Generative AI before, we are of course super hyped by the exposure the topic is currently getting within the startup ecosystem.
The topic was pushed to the forefront by diffusion models taking over Generative Adversarial Networks (GANs) as state-of-the-art AI models in image generation. Now they are expanding into text-to-video, text generation, audio, and other modalities.
Stability.ai and Midjourney are pushing the envelope there with text-to-image models that rival those of established AI labs. While Midjourney is reportedly profitable, Stability.ai secured $101M in funding from Coatue, Lightspeed Venture Partners and O’Shaughnessy Ventures LLC after releasing Stable Diffusion in August 2022. Stable Diffusion is an open source text-to-image model that, unlike other generators, was made publicly available for free. Diffusion-based text-to-video generation also took major steps forward earlier this year, with Google and Meta announcing models for text-to-video generation sooner than expected.
In October, Sequoia Capital brought the topic to everyone’s attention by putting together a Market Map on Generative AI, which laid out the main players for Code, Text, Image, Audio, Video, and other areas. Verve Venture then enhanced Sequoia’s map by adding the European players in the respective areas. Unsurprisingly, the map included AI startups we have worked with in the past as well.
Prospects are promising: MIT Technology Review described Generative AI as one of the most promising advances in the world of AI in the past decade. Sequoia estimates that Generative AI has the potential to become a trillion-dollar business, and analyst firm Gartner predicts a time-to-market of 6-8 years, with mass adoption in the near-ish future. Whether or not these predictions come true, Generative AI will revolutionize tens of millions of creative and knowledge-based jobs and play a vital role in driving future efficiency and value.
What is Generative AI and How Does it Work?
To begin, let us first get the terminology straight. What is Generative AI and on which models is it based? Generally speaking, Generative AI uses existing content such as text, audio files, images, or code as source material to create new and plausible artifacts. Underlying patterns are learned and used to create new, similar content. This differs from the well-known Analytical AI, which analyzes data, identifies patterns, and predicts outcomes. One could say that Analytical AI mimics the left side of the human brain, which is said to be more analytical and methodical, while Generative AI mimics the right side, the creative and artistic one. Moving past the automation of routine and repetitive tasks, Generative AI is able to replicate capabilities that to date have been unique to humans: inspiration and creativity.
Moving on to the types of model. To produce new and original content, Generative AI typically relies on unsupervised or self-supervised learning algorithms. Rather than being given labelled examples, the model adjusts its parameters during training and is essentially forced to draw its own conclusions about the most important characteristics of the input data. Currently, two types of model are most widely used in Generative AI: Generative Adversarial Networks and Transformer-Based Models.
Generative Adversarial Networks (GANs)
A Generative Adversarial Network or GAN is a machine learning model that pits two neural networks, a generator and a discriminator, against each other, hence the term “adversarial”. Generative modeling tries to understand the structures within datasets and generates similar examples. In general, it is part of unsupervised or semi-supervised machine learning. Discriminative modeling, on the other hand, classifies existing data points into respective categories. It mostly belongs to supervised machine learning. Put differently, the job of the generator is to produce realistic images (or fake photographs) from random input, while the discriminator attempts to distinguish between real and fake images.
In the GAN model, the two neural networks contest one another in the form of a zero-sum game, one side’s gain being the other side’s loss. GANs have long been among the most popular Generative AI models, although diffusion models have recently overtaken them as the state of the art in image generation.
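The zero-sum game can be written as a single value function that the discriminator tries to maximize and the generator tries to minimize. The sketch below evaluates the value function from the original GAN formulation for given discriminator outputs. The function name `gan_value` and the example probabilities are assumptions made for illustration; a real GAN computes these outputs with neural networks and optimizes them by gradient descent.

```python
import math

def gan_value(d_real: list[float], d_fake: list[float]) -> float:
    """Value V(D, G) = mean(log D(x)) + mean(log(1 - D(G(z)))).

    d_real: discriminator outputs (probability "real") on real samples.
    d_fake: discriminator outputs on generated samples.
    The discriminator tries to maximise V, the generator to minimise it.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well yields a high value.
strong_discriminator = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# A generator that fools it completely (all outputs 0.5) pushes the value
# down to -log(4), the theoretical equilibrium of the game.
fooled_discriminator = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

Because both networks act on the same quantity in opposite directions, any improvement for one side is by construction a loss for the other.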
Transformer-Based Models
The second model widely used in Generative AI is based on transformers, which are deep neural networks that learn context and meaning by tracking relationships in sequential data. An example would be the sequence of words in a sentence. NLP (Natural Language Processing) tasks are a typical use case for Transformer-Based Models.
Context is provided around items in the input sequence. Attention is not paid to each word separately; rather, the model tries to understand the context that brings meaning to each data point of the sequence. Furthermore, Transformer-Based Models can process all elements of a sequence in parallel rather than one after another, thereby speeding up the learning phase significantly.
Sequence-to-sequence learning is already widely used, for example when an application predicts the next word in a sentence. This happens by passing the input through a stack of encoder layers (and, for generating output, decoder layers). Transformer models apply attention or self-attention mechanisms to identify ways in which even distant data elements in a series influence one another.
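How attention relates distant elements can be shown with a minimal sketch of scaled dot-product self-attention, written in plain Python. It is a simplification under stated assumptions: real transformers first map the input through learned projection matrices to obtain queries, keys and values and use several attention heads, all of which is omitted here. The function names and the three toy embeddings are hypothetical.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def self_attention(x: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    """Scaled dot-product self-attention with queries = keys = values = x."""
    d = len(x[0])
    # One row of weights per position: how much it attends to every position.
    weights = [
        softmax([dot(q, k) / math.sqrt(d) for k in x])
        for q in x
    ]
    # Each output is the attention-weighted average of all input vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights

# Three hypothetical 2-dimensional token embeddings; the first and last
# are identical although they sit at opposite ends of the sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
```

In this toy input the first token attends as strongly to the identical last token as to itself, and less to its direct neighbour. Distance in the sequence plays no role in the weights, only similarity of content does.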
How Generative AI Will Transform Creative Work
Narratives and Storytelling in general as a form of engagement will remain powerful, as humans are inherently drawn to stories – be it about a person, business, or an idea. However, good storytelling is difficult and requires content creation in different formats. While we see plenty of other areas being automated and made more efficient, the process of content creation remains manual and quite complex.
Generative AI will help content creators by generating plausible drafts that can function as a first or early iteration. AI will also help by reviewing and scrutinizing existing human-written text, from grammar and punctuation to style, word choice, narrative, and thesis. By creating content that seems to be made by humans, Generative AI will be able to take over some part of the creative processes that until now only humans were capable of. Generative AI will be able to review raw data, craft a narrative around it, and put together something that’s readable, consumable, and enjoyable for humans.
Previously, Generative AI was mostly known for deep fakes and data journalism, but it is playing an increasingly significant role in automating repetitive processes in digital imaging and audio correction. In manufacturing, AI is being used for rapid prototyping and in business to improve data augmentation for robotic process automation (RPA).
Generative AI will be able to reduce much of the manual work and speed up content creation. Most likely, every creative area will be impacted by this in one way or another – from entertainment, media, and advertising, to education, science, and art.
Challenges and Dangers
While Generative AI brings enormous potential and the steps taken forward this year are truly astonishing, there is the danger of misuse. As with every technology, it can be used for both good and bad. Copyright, trust, safety, fraud, fakes, and costs are questions that are far from resolved.
Violent imagery and non-consensual nudity, as well as AI-generated propaganda and misinformation, are a real danger. Apparently, Stable Diffusion and its open-source offshoots have been used to create plenty of offensive images, as more than 200,000 people have downloaded the code since it was released in August, according to Stability.ai.
Pseudo-images and deep fakes can be misused for propaganda and misinformation. With more and more applications such as FakeApp, Reface, and DeepFaceLab being publicly available to all users, deep fakes are being used not only for fun and games, but for malicious or even criminal activities too. Fraud and scamming are another problem, as is data privacy: health-related apps, for example, run into privacy concerns over individual-level data.
Also, due to the self-learning nature of Generative AI, it’s difficult to predict and control its behavior. The results generated therefore can often be far from what was expected.
As with AI in general, machine learning bias in training data is a tremendous problem in Generative AI. AI bias is a phenomenon in which algorithms reflect human biases because the data used during training was itself biased. An example would be a facial recognition algorithm that recognizes a white person more easily than a non-white person because of the type of data that was used in training.
Therefore, we need to be sensitive to AI bias and understand that algorithms are not necessarily neutral when weighing data and information. These biases are not intentional, and it is difficult to identify them until they have actually been programmed and poured into software. Understanding these biases and developing solutions to create unprejudiced AI systems will be necessary to ensure that existing biases and forms of oppression are not perpetuated by technology.
Despite these challenges, it is worth remembering that technology does not develop and grow without them. Responsible AI offers a way to limit such drawbacks of innovation to a certain degree, or even eliminate them altogether.
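One simple way to make such a bias visible is to break a model’s accuracy down by group instead of reporting a single overall figure. The following is a minimal sketch under stated assumptions: the function name `accuracy_by_group`, the group labels and the evaluation results are all invented for illustration, and a real fairness audit would look at more metrics than accuracy alone.

```python
from collections import defaultdict

def accuracy_by_group(records: list[tuple[str, bool]]) -> dict[str, float]:
    """records: (group label, whether the model's prediction was correct)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, is_correct in records:
        total[group] += 1
        correct[group] += int(is_correct)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical evaluation results for a face recognition model.
results = (
    [("group_a", True)] * 95 + [("group_a", False)] * 5
    + [("group_b", True)] * 80 + [("group_b", False)] * 20
)
per_group = accuracy_by_group(results)

# Difference between the best-served and the worst-served group.
gap = max(per_group.values()) - min(per_group.values())
```

In this invented example the overall accuracy of 87.5% hides a gap of 15 percentage points between the two groups, which is the kind of disparity that only shows up once results are split by group.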
What Founders and Investors Should Prioritize When Building & Scaling a Generative AI Startup
Research & Development: As so much regarding Generative AI is still in its infancy, research and development will have to be prioritized in any startup that wants to push the envelope in this area. In most cases, a strong research team with enough senior members who have multiple years of experience in Machine Learning will have to form the basis. With a strong dedication to facilitating focus within research and accelerating research efforts, AI startups can differentiate themselves from competitors and gain a competitive edge.
Modeling and Product Management: Building up a mature product organization is key for the commercialization of companies in the space. Strong product management competence with in-depth technical understanding is of the essence when operationalizing an AI business strategy. Implementing a product framework that supports the growing engineering organization and sets clear priorities should be on the to-do list as well. Investors should focus on this especially from Series A onwards, since most scientific founder teams in the space lack productization experience and need to hire experienced product leaders. This should be accounted for rather early in the process.
Security and Compliance: Both need to be a priority. It is important to actively track and manage any security vulnerabilities in the system. Guidelines to fulfill the necessary compliance and security requirements should be defined and implemented to achieve production-readiness. This is important particularly in a governance context, but also in general.
Responsible teams need to be aware of and understand the security requirements. There needs to be visibility over changes made to critical infrastructure, so possible malicious changes do not only become noticeable when they start affecting end-users. The tech organization should be able to quickly respond to security incidents in an automated way. Otherwise, detecting and resolving issues would need considerable manual effort. With startups and young companies with only loosely defined processes that often are still manual, this can become a security risk that needs to be on the radar.
Scalable Infrastructure: Generative AI startups should build a secure, scalable and automatically provisioned infrastructure that is easy to manage and controls the cost of computing and data training. The AI models described above require a lot of computing power, since the more combinations they try, the better the chance to achieve higher accuracy.
As startups and growth companies are competing in the Generative AI space, they are under pressure to improve data training and lower the cost of it. In addition, the carbon footprint of data training is an important factor in times in which impact is becoming an increasingly important measurement for investors. AI companies therefore need to strive for more efficiency in training methods as well as in data centers, hardware and cooling.
There should also be a plausible trade-off between the cost of training models and the cost of using them. If a model is used many times over its lifetime, it can bring a proper return on the initial investment in training and computing power.
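This trade-off is simple arithmetic: the one-off training cost is spread over every use of the model, while the cost of serving a request is paid each time. The sketch below makes that explicit. The function names and all figures are hypothetical and stated in an arbitrary currency unit; they are not estimates for any real model.

```python
def cost_per_use(training_cost: float, inference_cost: float, uses: int) -> float:
    """Average cost of one model call once training is spread over all uses."""
    return training_cost / uses + inference_cost

def break_even_uses(training_cost: float, inference_cost: float,
                    value_per_use: float) -> float:
    """Number of uses after which cumulative value covers the training cost."""
    margin = value_per_use - inference_cost
    if margin <= 0:
        raise ValueError("each use must be worth more than it costs to serve")
    return training_cost / margin

# Hypothetical figures in an arbitrary currency unit.
few_uses = cost_per_use(training_cost=100_000, inference_cost=0.01, uses=1_000)
many_uses = cost_per_use(training_cost=100_000, inference_cost=0.01,
                         uses=10_000_000)
needed = break_even_uses(training_cost=100_000, inference_cost=0.01,
                         value_per_use=0.05)
```

With these invented numbers, a model used a thousand times costs about 100 units per call, while the same model used ten million times costs about 0.02 units per call, and it takes roughly 2.5 million uses to recoup the training cost.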
Conclusion
With Generative AI, content creators will have technology at their disposal that can extract patterns from existing data and use them to generate new content that can be considered an original artifact.
Generative AI will increasingly be important in the creation of synthetic data that can be used by companies for different purposes and scaled throughout different formats. AI-generated synthetic audio and video data, derived from texts which were triggered by some initial human input, can remove the need to manually shoot films or record audio: Content creators can simply type what they want their audience to see and hear and let Generative AI tools create the content in different formats.
We believe that Generative AI will progress quickly with regard to scientific progress, technological innovation, and commercialization. While we are still at the beginning of this trend, a wide range of applications is on its way and plenty of use cases are being introduced to the market, ranging from media and entertainment to life sciences, healthcare, energy, manufacturing and more. Innovative startups tackling problems around manual and time-consuming processes in the creative industry stand at the heart of this development, alongside established platform companies such as Google and Meta. Generative AI will extend into the metaverse and web3, as they have an increasing need for auto-generated synthetic and digital content.
Safety concerns and harmful use of Generative AI, such as deep fakes, pose a challenge and might impact mass adoption with consumers and corporations. Security and compliance guidelines will have to take the growing challenge of bias and general importance of Generative AI governance into account.
As with other types of AI, repetitive and time-consuming tasks will be automated, eliminating certain portions of tasks and activities that are currently done by humans. However, instead of eliminating creative jobs, Generative AI will most likely support processes in the creative industry through automation, while there will still be a human in the loop as a controlling and refining instance at some point. As an assistive technology that helps humans produce faster, we will see humans and AI work together for better and possibly more accurate results.