
Solving Large Language Model Hallucinations With Digital “Acid Tests”

When most of us think of AI, we tend to imagine helpful or hurtful virtual entities like Tony Stark's assistant Jarvis or HAL 9000 from 2001: A Space Odyssey. We think of talking, thinking beings with digital minds that work much like those of their creators, only smarter, faster, and capable of calculations far beyond what a human brain can manage. In other words, we tend to think of artificial general intelligence (AGI): a system that can learn, infer, deduce, and solve any number of different problems just as a human can.

The truth is, that’s not what AI today looks like.

Models like GPT-4 and its contemporaries are the closest humanity has come to mirroring our own cognition. They're incredibly advanced, sophisticated programs that seem to carry as much potential and possibility as the computer itself, and they've already proven remarkably useful across numerous use cases in the very short time they've been available.

For all their sophistication and advanced technological underpinnings, however, none of the programs we refer to as AI today even come close to meeting the definition of a true AGI. ChatGPT and the like are known as large language model (LLM) generative AI (GAI), and that distinction can and should make all the difference when you think about how to integrate them into your workflow.

The Issue With Generative AI

Hallucinations, confabulations, delusions – you can call them whatever you want. At the end of the day, they all refer to the same tendency for LLMs, like ChatGPT, to fabricate facts, details, quotes, sources, and data in the middle of the content they generate. These fabrications are common and sometimes egregious enough to make ChatGPT and its fellow GAIs too unreliable to trust with certain tasks, which presents a significant hurdle for enterprise usage.

Using a tool that can and will “lie” to you is obviously problematic outside of the most casual applications. It’s one thing to have ChatGPT write you a poem about the reasons why your favorite flavor of ice cream is the best in the world, but it’s another thing entirely when you rely on it to write legal briefs, white papers, contracts, and other content that requires a high degree of factual accuracy.

The problem with hallucinations is one that OpenAI, Google, and every other owner and producer of GAI platforms are working hard to solve. This isn't a case of isolated glitches or bugs in the system; this is a systemic issue that must be solved before enterprises can rely on AI tools without human supervision. OpenAI has had some success building guardrails into GPT-4 to prevent hallucinatory flights of fancy, though it can't yet promise that the GAI will deliver consistently fabrication-free content.

Part of the problem lies in how the GAI models themselves operate. Take GPT-3, for instance. It's trained on massive datasets of content spanning an equally massive range of subjects and media, which is both the reason it can produce convincingly human copy and the reason it is so prone to hallucinations.

GPT-3 works by generating each word in a sequence based on the words it has already generated – essentially using its architecture and training data to identify and write the words that statistically "make sense" given the prompt and everything it has written in response so far. The more words the GAI generates in a row, the more likely it is to hallucinate.
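To make that concrete, here is a deliberately simplified sketch of autoregressive generation in Python. The model.next_token_distribution call is a hypothetical stand-in for the real network, not an actual GPT-3 API; the point is that each step only asks what is statistically likely to come next, never whether it is true.

```python
import random

def generate(model, prompt_tokens, max_new_tokens=50):
    """Toy autoregressive loop: each new token is sampled from a
    probability distribution conditioned on everything written so far."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Hypothetical call: returns {token: probability} given the context so far.
        distribution = model.next_token_distribution(tokens)
        candidates, weights = zip(*distribution.items())
        next_token = random.choices(candidates, weights=weights, k=1)[0]
        tokens.append(next_token)
        # Nothing here checks facts: a plausible-but-wrong token early on
        # becomes context for every later token, so errors can compound.
    return tokens
```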

It's hard to say exactly why this happens without a computer science degree and several years spent working on these models. There could be encoding errors, problems with the data used to train the GAI, issues with the training process itself, historical sequence bias, or any number of other causes. The question is, what can you do about it?

Acid Testing

Large language models in their current state are most useful for generating text that sounds like natural language. Unfortunately, that is a separate and very different problem from generating accurately sourced, computationally correct text that sounds like natural language. For the end user, this distinction isn't intuitive, since we've traditionally used computer and internet technology to find information and perform computations.

At Daizy, as we've built products that bridge the worlds of language generation, data, and analysis, we've learned the hard way over the past four years that solving each of these pieces takes careful consideration, let alone connecting them all.

In our industry, WealthTech, the stakes are about as high as they come. There is no room to be wrong with someone’s money, financial data, or financial knowledge. Trust and reliability are crucial.

Financial advisors can’t risk providing their clients with performance reports, fact sheets, portfolio reports, or research on specific markets or assets that contain information that’s even misleading, let alone outright incorrect. And therein lies the problem. LLMs should, in theory, be one of the most powerful tools at a financial advisor’s disposal, but there is already a widespread trust deficit that is making it difficult for financial professionals to adopt and implement this technology.

At Daizy, we’ve come up with a process for building trust and verification back into LLM outputs. We call it “Acid Testing”.

Think of it like this. A Daizy Scribe user asks for a specific analysis to be performed on certain securities and for the AI to generate a report on its findings. Daizy Scribe generates the report, but the user doesn't know whether the analysis was performed 100% correctly or whether any data points were fabricated in the process.

The user then runs an Acid Test against the report: a check of the output against the source data that asks the model to identify fabricated data, inconsistencies, and mislabeling. The Acid Test runs a new prompt in which Daizy compares its own original output to the source data.
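Daizy's internal tooling isn't shown here, but a minimal sketch of the idea, assuming the OpenAI Python client and a hypothetical acid_test helper, might look like this: the generated report and the raw source data are fed into a fresh prompt whose only job is to flag discrepancies.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ACID_TEST_INSTRUCTIONS = (
    "You are checking a previously generated report against its source data. "
    "List every figure, label, or claim in the report that is not supported "
    "by the source data, flagging fabrications, inconsistencies, and mislabeling."
)

def acid_test(report: str, source_data: str, model: str = "gpt-4") -> str:
    """Run a fresh verification prompt comparing the report to the raw data."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ACID_TEST_INSTRUCTIONS},
            {
                "role": "user",
                "content": f"SOURCE DATA:\n{source_data}\n\nREPORT:\n{report}",
            },
        ],
    )
    return response.choices[0].message.content
```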

The Acid Test works for three key reasons:

  • LLMs evaluate each prompt input independently. They do not “double down” on past answers or suffer from a “checking your own homework” bias.
  • LLMs are fantastic at comparison.
  • Daizy Scribe can point to the source data used in its outputs.

Using this “acid test” process, we can rapidly verify LLM outputs without blind trust or hallucination anxiety. By using LLMs for what they are best at, and actively integrating the source data into the ultimate output, we can create repeatable reliability and trust.

Iterative Solutions

We are excited to be on the cutting edge of LLM technology and to be proposing solutions in real time. We believe that understanding the unique capabilities of LLMs in comparison to other AI applications, focusing on quality data sourcing, and staying open to new developments in technology are all key to future LLM use cases. The current version of the acid test is a bit clunky, so Daizy is developing internal tools to make the process smoother and easier to run.