Although we live in an increasingly data-driven world, most companies don’t operate data-driven business models. The virtuous circle of network effects driving the success of enterprises like Alphabet, Meta, and Amazon isn’t available to organizations selling traditional products and services. However, the tools to get more from the proprietary data you generate from everyday business processes are becoming widely accessible, and could help your company develop a competitive edge.
As markets become more competitive, building a defensible moat from data can make all the difference. McKinsey estimates that leveraging internal data for sales and marketing insights can result in above-average market growth and increases of 15 to 25% in EBITDA. LLMs offer a new and unique way to extract this value, and training them on proprietary data to achieve specific business objectives could transform many companies.
Data quality outstrips quantity
As AI guru and former director of research at Google Peter Norvig once said, “More data beats better algorithms, but better data beats more data.” This is becoming increasingly true as gen AI models are adapted for use in the enterprise. While frontier models have been trained on massive quantities of data scraped from the internet and other public sources, their utility for specific business purposes is limited.
The ability of these LLMs to extract meaning from data needs to be combined with proprietary data unique to an organization for real benefits to be realized. Making sure that data is ready is a key step once business objectives have been set. Gartner estimates that preparing data for AI improves business outcomes by 20%, and poor-quality data inputs are, according to Gartner, a key reason why 30% of internal AI projects are abandoned. Data must therefore be appropriate for the intended use cases, whether it is structured or unstructured. Preparing it involves removing corrupt data and duplicates, and filling gaps where inputs are incomplete.
And while quality is key, there also needs to be sufficient quantity. Depending on the objectives and how the LLM is tuned, this means thousands of records at a minimum and possibly significantly more.
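The preparation steps described above (removing corrupt data and duplicates, and filling gaps) can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended pipeline: the feedback records, the field names, the rule for what counts as corrupt, and the choice to fill missing ratings with the median are all assumptions made for the example.

```python
from statistics import median

# Hypothetical customer-feedback records; field names and values are illustrative only.
RAW_RECORDS = [
    {"id": 1, "product": "Lawn Feed", "rating": 4, "comment": "Works well"},
    {"id": 1, "product": "Lawn Feed", "rating": 4, "comment": "Works well"},  # duplicate
    {"id": 2, "product": "Weed Killer", "rating": None, "comment": "Slow delivery"},  # gap
    {"id": 3, "product": "", "rating": 5, "comment": "Great"},  # corrupt: no product name
    {"id": 4, "product": "Lawn Feed", "rating": 2, "comment": "Bag was torn"},
]


def is_corrupt(record):
    """Assumed rule: a record is corrupt if it has no product name
    or a rating outside the 1-5 range."""
    if not record.get("product"):
        return True
    rating = record.get("rating")
    return rating is not None and not (1 <= rating <= 5)


def clean(records):
    # Step 1: remove corrupt records.
    valid = [r for r in records if not is_corrupt(r)]

    # Step 2: remove duplicates, keeping the first record seen for each id.
    seen, unique = set(), []
    for r in valid:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)

    # Step 3: fill missing ratings with the median of the known ratings
    # (one possible choice; the right fill strategy depends on the use case).
    known = [r["rating"] for r in unique if r["rating"] is not None]
    fill = median(known) if known else None
    return [
        {**r, "rating": r["rating"] if r["rating"] is not None else fill}
        for r in unique
    ]


cleaned = clean(RAW_RECORDS)
for record in cleaned:
    print(record)
```

In practice each rule would be agreed with the people who own the data, since a value that looks corrupt in one business process may be legitimate in another.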
Tuning up
Using unique proprietary data is where the greatest competitive benefits may be realized. This might include anonymized customer data and purchasing patterns, customer feedback, web analytics, and supply chain information. Open-source data can be a useful supplement, too, but, by definition, it is available to everyone, so it is not a differentiating factor on its own. Using proprietary data, provided it meets privacy regulations, also reduces legal complexities relating to data sovereignty.
But most organizations don’t have the resources, financial and human, to build and train their own domain-specific models from the ground up. Fine-tuning an existing LLM needs less compute power and data than building from scratch, but it still requires considerable time and skills beyond the capabilities of mid-size enterprises. Prompt tuning and prompt engineering are the most common and straightforward approaches. Because they leave the model’s parameters unchanged, these techniques consume far fewer resources and, although specialist skills are required, can be adopted relatively easily.
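To make the idea of prompt engineering concrete, here is a minimal sketch of placing proprietary data in front of a model without changing the model itself. Everything in it is hypothetical: the catalogue entries and prices are made up, the word-overlap ranking is a crude stand-in for a real search step, and the resulting prompt would still need to be sent to whichever LLM a company uses.

```python
# Hypothetical product catalogue; names, descriptions, and prices are invented.
CATALOGUE = [
    {"name": "Lawn Feed", "description": "Slow-release fertilizer for established lawns", "price": "12.99"},
    {"name": "Rose Food", "description": "Nutrient mix for roses and flowering shrubs", "price": "8.49"},
    {"name": "Weed Killer", "description": "Selective herbicide for lawns", "price": "10.99"},
]


def relevant_entries(question, catalogue, limit=2):
    """Rank entries by the number of words they share with the question.
    This is an assumed, deliberately simple selection rule."""
    words = set(question.lower().replace("?", "").split())
    scored = []
    for entry in catalogue:
        text = f"{entry['name']} {entry['description']}".lower().split()
        overlap = len(words & set(text))
        if overlap:
            scored.append((overlap, entry))
    # Sort on the score only, highest first; ties keep catalogue order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:limit]]


def build_prompt(question, catalogue):
    """Assemble a prompt that asks the model to answer from internal data only."""
    entries = relevant_entries(question, catalogue)
    context = "\n".join(
        f"- {e['name']}: {e['description']} (price: {e['price']})" for e in entries
    )
    return (
        "Answer using only the product information below. "
        "If the answer is not there, say so.\n\n"
        f"Product information:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt("Which fertilizer suits established lawns?", CATALOGUE)
print(prompt)
```

The model’s parameters are never touched here; only the text it receives changes, which is why this kind of approach demands far less compute than fine-tuning.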
In the real world
Some early LLM deployments trained on internal data have come from the larger banks and consulting firms. Morgan Stanley, for instance, used prompt tuning to train GPT-4 on a set of 100,000 documents relating to its investment banking workflows. The objective was to help its financial advisers provide more accurate and timely advice to clients. BCG has adopted a similar approach to help its consultants generate insights and client advice, alongside an iterative process that fine-tunes its models based on user feedback. This has helped improve outputs and reduce the chance of the hallucinations that are more common in consumer-facing GPTs.
We’re now starting to see less technology-intensive, service-oriented firms customizing LLMs with internal data. Garden-care company ScottsMiracle-Gro has collaborated with Google Cloud to create an AI-powered “gardening sommelier” to provide customers with gardening advice and product recommendations. This has been trained on the firm’s product catalogues and internal knowledge base, and will soon be rolled out to its 1,000 field sales associates to help them advise retail and market garden clients on prices and availability. It’s anticipated that, depending on results, it’ll then be available to consumers, with the aim of driving sales and customer satisfaction.
Just as ScottsMiracle-Gro is using AI to add value to its traditional sales catalogue, so is Volkswagen of America with its car manuals. Trained on vehicle instruction guides and supplemented with the customer’s connected car data, the AI-powered virtual assistant can help drivers better understand their vehicles. This includes providing guidance on changing tires and understanding what dashboard indicator lights mean.
As LLMs become increasingly commoditized in terms of feature sets and processing capabilities, and as the rise of open-source models lowers the barriers to entry for application developers, data will become increasingly important. Content owners are already pushing back on allowing companies such as OpenAI and Anthropic to freely amass their data, moves that will further highlight the value of proprietary information.
Companies of all sizes would be wise to start valuing and guarding their internal data assets more carefully, and thinking about how they can be leveraged through AI for competitive advantage. Even the humble product catalogue or user manual, as we’ve seen, can be an asset ripe for capitalizing on.