The Problem with Testing LLM Apps

Byron Salty
3 min read · Dec 14, 2023


via DALL-E by author

OpenAI’s release of Custom GPTs, the accompanying price drop for GPT-4, and hints at an upcoming GPT-5 all highlight a problem with LLM-based apps, but it’s not what you think it is.

LLM-based apps are extremely popular right now, and with good reason: they are powerful and produce magical-feeling results. But no matter which type of LLM app is being built, or which LLM is used under the hood, text is being generated, and that text is unpredictable. It is this fact that makes testing interesting.

But this unpredictability introduces a new problem for building and maintaining high-quality applications.

When your app relies on the unique and interesting text generation of AI:

How do you know if what is being created is good?

The primary goal of tests is to ensure that, as your application changes over time, everything is still working well. Tests help prevent inadvertent negative side effects from being introduced.

When you first add an LLM integration you’ll certainly make sure that you are getting good results. You’ll do this by triggering some requests and verifying that the results look pretty good in your app. Maybe you’ll even need a subject matter expert to look at the results and tell you whether they are acceptable. For instance, the software my company is working on requires a lawyer to review the results to ensure high quality.

But this does not scale…

In traditional (read: deterministic) software development, you might have a test that looks like:

assert add(2,2) == 4 

While obviously overly simple, this paints the picture that no matter what I change in my application, my add() function will sum 2 and 2 correctly or I’ll get an alert when my test suite fails.

Put another way — I’m ensuring that my add() function is giving me good results.

How then do you test that your LLM-based, non-deterministic results are still good? How do you finish this test case?

assert generate_greeting() == // "???"
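
To see why there is no stable expected value to put on the right-hand side, here is a hypothetical sketch of what generate_greeting() might look like behind the scenes (assuming the OpenAI Python SDK; the model, prompt, and temperature are placeholders). With any non-zero temperature, two calls rarely return the same string, so an exact-match assert has nothing to hold onto.

from openai import OpenAI  # assumes the v1 OpenAI Python SDK

client = OpenAI()

def generate_greeting() -> str:
    # Hypothetical wrapper: ask the model for a short, friendly greeting.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write a one-sentence friendly greeting."}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Two calls, two different strings: there is no single "expected" value to assert against.
print(generate_greeting())
print(generate_greeting())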

Now imagine that, instead of a simple case like this, we are asking for complex responses, especially ones that require expensive subject matter expertise:

assert create_plot_synopsis(book_1) == // "???"
assert create_plot_synopsis(book_2) == // "???"
// ...
assert create_plot_synopsis(book_N) == // "???"

You don’t want to be in the business of reviewing the results of N questions every time you make an application change.
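
One stopgap is to assert coarse properties of the text instead of its exact value. Here is a minimal sketch along those lines; create_plot_synopsis, book_1, and the expected keyword are hypothetical stand-ins for whatever your app produces. A check like this will catch a response that comes back empty or wildly off-topic, but it cannot tell you whether the synopsis is actually good, which is exactly the judgment the subject matter expert was providing.

def test_plot_synopsis_smoke():
    # Coarse property checks: a smoke test, not a quality test.
    synopsis = create_plot_synopsis(book_1)  # hypothetical function and input

    # Structural sanity: non-empty and roughly paragraph-sized.
    assert 200 < len(synopsis) < 2000

    # Topical sanity: the protagonist should at least be mentioned.
    assert "Elizabeth Bennet" in synopsis  # hypothetical expected keyword

    # Nothing here says whether the synopsis is accurate or well written.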

I believe this represents a new type of testing problem.

Solutions around LLM testing and debugging are starting to arrive, such as LangChain’s LangSmith and LangFuse.

I’m putting both of these in the bucket of tools that help debug a problem that you know exists, similar to a traditional debugger or the role that logging plays in production applications.

These are super helpful, even necessary, but they are insufficient on their own.

How do you even know to look for a problem if nothing has alerted you that one exists?

Something else is needed that will track the CONTENT of the responses over time.
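
To make that concrete, here is a rough sketch of the kind of harness I mean: run a fixed set of prompts through your app, score each response against a reference answer an expert has already approved (embedding similarity via sentence-transformers is used here as a cheap proxy for “still saying the same thing”), and append the scores to a history file so today’s run can be compared to last week’s. The eval set, the scoring proxy, and the generate hook are all assumptions for illustration.

import json
from datetime import datetime, timezone
from sentence_transformers import SentenceTransformer, util

# Hypothetical eval set: prompts paired with reference answers an expert already approved.
EVAL_SET = [
    {"prompt": "Summarize the plot of Pride and Prejudice in three sentences.",
     "reference": "Elizabeth Bennet and Mr. Darcy overcome pride and prejudice to marry."},
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def score(response: str, reference: str) -> float:
    # Cosine similarity between response and reference embeddings.
    emb = embedder.encode([response, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def run_eval(generate):
    # `generate` is your app's LLM call; record one score per prompt.
    results = []
    for case in EVAL_SET:
        response = generate(case["prompt"])
        results.append({
            "prompt": case["prompt"],
            "score": score(response, case["reference"]),
            "ran_at": datetime.now(timezone.utc).isoformat(),
        })
    return results

# Append each run to a history file; diffing today's scores against an earlier
# run is what tells you the content changed, before a user (or a lawyer) does.
with open("eval_history.jsonl", "a") as f:
    for row in run_eval(generate=lambda prompt: "..."):  # plug in your real call here
        f.write(json.dumps(row) + "\n")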

Is this really a problem?

The term people working with LLMs use to describe their testing process is to “test by vibes,” meaning they play with the app a little bit and see if it feels better or worse.

Too many things can change the quality of results:

  • Are you switching or upgrading LLMs?
  • Are you doing some prompt engineering and making changes there?
  • If you’re doing RAG, are you changing your document storage strategy? How you’re chunking?
  • Are you changing your embedding model?
  • Your vector store?
  • If you’re using an Agent framework like LangChain, are you introducing new tools or APIs?

Maybe you’re not even changing anything major, but you’re contemplating an upgrade like the move from GPT-3.5 to GPT-4.

Is the upgrade better? Is it worth the cost?
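
One way to answer that with data instead of vibes is to run the same eval set through both models and keep both the quality scores and the token counts, so cost shows up in the comparison. A rough sketch, assuming the OpenAI Python SDK; the prompts are placeholders, and the scoring hook is whatever measure of “better” you trust (the similarity score above, or an expert rubric).

from openai import OpenAI

client = OpenAI()

PROMPTS = ["Summarize the plot of Pride and Prejudice in three sentences."]  # your eval set
MODELS = ["gpt-3.5-turbo", "gpt-4"]

for model in MODELS:
    total_tokens = 0
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        text = resp.choices[0].message.content
        total_tokens += resp.usage.total_tokens
        # score(text, reference) would go here -- however you choose to measure quality
    print(f"{model}: {total_tokens} total tokens across {len(PROMPTS)} prompts")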

Let’s have a chat.

I’d love to hear what others are doing about this problem.

Are LLM apps just so new that there isn’t an industry solution available yet? Or maybe I just haven’t seen one that fits my needs.

Since no solution exists, my company is building our own LLM testing tool. Feel free to drop me a message to learn more or sign up for our product updates here.

Read more of Byron’s articles about Leadership and AI.

Development articles here.

Follow on Medium, Dev.to, Twitter, or LinkedIn.
