The “trend” of Large Language Models (LLMs) has spread across many companies in the market. The vast majority of them are using pre-trained models and focusing their work on prompt engineering and data.
The data part holds significant importance in techniques like Retrieval-Augmented Generation (RAG), which enables generating responses based on a data source we trust or create rather than the model’s training sources. This data needs to be collected and aggregated so that it is easily retrievable when the model needs it. Implementing such a process demands a lot of work.
Not only data, but the “simple” act of crafting a prompt can also be extremely complicated when working with an LLM. It’s no wonder that along with the popularization of this technology, the term “prompt engineering” emerged. Various tactics and strategies can be employed in constructing a prompt to achieve better results from an LLM. A project using pre-trained LLMs involves constant experimentation with different prompts in the quest for better outcomes. And that entails a lot of work too.
Pre-trained LLMs
We can probably agree that working with pre-trained LLMs requires far fewer resources than creating one from scratch. But that doesn’t mean the work is nonexistent in this approach. On the contrary, there is plenty of work to be done in projects that decide to use pre-trained LLMs, and not only the prompt and data work I mentioned above, by the way.
And do you know what one of the most complicated steps in this kind of work is? Validating what we are doing. Yes: validating changes to a prompt, checking that responses actually leverage our data, and confirming the absence of hallucinations, biases, and inconsistencies in the responses of our “ready” LLM is a lot of work.
But why does validation require so much work? Well, this is partially caused by the randomness in the responses generated by these models. You’ve probably noticed that ChatGPT, for example, can respond to the same question in different ways. These differing responses make traditional automated validation impractical for this kind of scenario. We can’t simply write an automated test that expects the string `"Paris is a great city to travel"`, because in the next test run we might receive a string containing `"Paris is an excellent destination for travelers"`.
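To make that concrete, here is a hypothetical example of such a brittle test; `ask_my_llm` is a placeholder for whatever function calls your application’s LLM:

```
# Hypothetical, brittle exact-match test.
def test_paris_recommendation():
    # ask_my_llm is a placeholder for our application's LLM call.
    response = ask_my_llm("Is Paris a good travel destination?")
    # This assertion breaks as soon as the model rephrases a perfectly good answer.
    assert response == "Paris is a great city to travel"
```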
In addition to the randomness factor in responses, we also have a very large number of “test cases” in this type of project. In theory, a user can submit anything to an LLM. They can ask for the same thing in different forms, as in the example below:
```
user A: “What is the weather forecast in Paris tomorrow?”
user B: “Do you have any information about the weather conditions in Paris for tomorrow?”
user C: “Can you tell me the weather outlook for Paris tomorrow?”
…
user Z: “Tell me about tomorrow’s weather in Paris, please.”
```
So, we need to validate a scenario with an immense number of possible inputs, where the obtained responses are “kind of random”. A good solution for this is to use a tool like DeepEval.
DeepEval
DeepEval is an open-source framework that aims to solve the problem of testing and evaluating LLMs and their responses. It integrates with the Confident AI platform, which can receive and store our test runs, letting us review the results through a more user-friendly UI than the terminal.
DeepEval’s proposal is simple: use a second LLM to evaluate the responses of your application’s LLM. Problems like the “randomness” of an LLM’s responses are handled by this second LLM, which has the linguistic capacity to judge whether a response is in line with what was expected.
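As a rough sketch of the idea, assuming a recent version of DeepEval (the threshold and texts below are illustrative, not taken from a real project), the judge LLM can score a single response like this:

```
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# The test case holds the input sent to our application's LLM
# and the (possibly differently worded) output it produced.
test_case = LLMTestCase(
    input="What is the weather forecast in Paris tomorrow?",
    actual_output="Tomorrow in Paris you can expect mild temperatures with a chance of light rain.",
)

# The metric uses a second LLM as the judge: it reads the input and the
# actual output and scores how relevant the answer is, regardless of wording.
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)

print(metric.score)   # score between 0 and 1
print(metric.reason)  # the judge LLM's explanation
```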
Using DeepEval we can automate a good part of the continuous improvement cycle of an LLM application. For instance, here’s the traditional cycle:
Notice that there is a lot of manual work in this traditional cycle: not only are the answers gathered manually, but the inspection is manual as well. Now, when we add DeepEval, we have the following cycle:
Notice that the cycle looks very similar to the traditional one; however, the highlighted part, where DeepEval automates the evaluation, is far faster than the manual process.
Taking that speed difference into account, the continuous improvement cycle using DeepEval is much shorter, faster, and easier to iterate on than the traditional one.
That’s why I found it to be a great tool. So did the engineering team at Bitboundaire, so much so that we are already using DeepEval in two different projects.
Of course, it is important to note that the framework is still at an early stage of development, but it is hard to find much more mature alternatives aimed at solving such a recent problem.
So, how exactly can we use DeepEval to evaluate our LLM application? Well, there are several ways to do this, and it all depends on your objective. At the time of writing this article, DeepEval contains more than 10 different metrics for evaluating responses, such as:
- G-Eval
- Summarization
- Faithfulness
- Answer Relevance
- Contextual Relevance
- Contextual Precision
- Contextual Recall
- Ragas
- Hallucination
- Toxicity
- Bias
Among these metrics, it is worth highlighting G-Eval, which allows us to evaluate responses based on criteria we define ourselves, as shown in the example below:
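Here is a minimal sketch of a G-Eval metric, assuming a recent DeepEval version; the criterion text, threshold, and example strings are illustrative:

```
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# G-Eval lets us describe, in plain language, the criterion the judge LLM
# should apply when scoring a response.
politeness_metric = GEval(
    name="Politeness",
    criteria="Determine whether the actual output answers the input in a polite, helpful tone.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)

test_case = LLMTestCase(
    input="Can you tell me the weather outlook for Paris tomorrow?",
    actual_output="Of course! Tomorrow in Paris you can expect clear skies and mild temperatures.",
)

assert_test(test_case, [politeness_metric])
```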
And if we need to, we can also create a custom metric to evaluate any other aspect of our model’s responses. Custom metrics are very useful for projects whose LLM responses are in JSON format, for example, since they make it easy to verify the properties of a JSON response, as in the sketch below.
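As an illustration, a custom metric for the JSON case could look roughly like this; it subclasses DeepEval’s `BaseMetric`, and the exact hooks required may differ slightly between versions:

```
import json

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class JsonPropertiesMetric(BaseMetric):
    """Hypothetical custom metric: checks that the response is valid JSON
    and contains the properties we expect."""

    def __init__(self, required_keys: list[str], threshold: float = 1.0):
        self.required_keys = required_keys
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        try:
            payload = json.loads(test_case.actual_output)
            present = [key for key in self.required_keys if key in payload]
            self.score = len(present) / len(self.required_keys)
        except (json.JSONDecodeError, TypeError):
            self.score = 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Async variant expected by newer DeepEval versions; reuses the sync logic.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "JSON Properties"
```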
The ease of implementation is another strength of this framework. Since it integrates with Pytest, writing and executing a test case is simple and intuitive, as we can see in the example below:
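Below is a sketch of what such a test file could look like (named `your_test_file.py` to match the command that follows); in a real project the `actual_output` would come from our application’s LLM rather than a hard-coded string:

```
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_weather_answer_relevancy():
    # Hard-coded here for illustration; normally this output comes from our LLM.
    test_case = LLMTestCase(
        input="What is the weather forecast in Paris tomorrow?",
        actual_output="Tomorrow in Paris expect mild temperatures and scattered clouds.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7)

    # assert_test asks the judge LLM to score the response and fails the
    # Pytest test if the score falls below the threshold.
    assert_test(test_case, [metric])
```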
Which we can execute with the following command:
```
deepeval test run your_test_file.py
```
And so, by executing this test, we are employing one LLM to evaluate the response provided, in theory, by another LLM. Pretty cool, right?
It is important to emphasize that DeepEval is not limited only to providing metrics. It is possible to use the framework for test case aggregation, real-time monitoring of production responses, and so on.
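For instance, the many phrasings of the same question shown earlier can be aggregated into an evaluation dataset and run in bulk. This is a sketch assuming DeepEval’s dataset API, whose details may vary between versions:

```
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Aggregate several test cases (e.g. different phrasings of the weather question).
dataset = EvaluationDataset(
    test_cases=[
        LLMTestCase(
            input="What is the weather forecast in Paris tomorrow?",
            actual_output="Expect mild temperatures and scattered clouds in Paris tomorrow.",
        ),
        LLMTestCase(
            input="Tell me about tomorrow's weather in Paris, please.",
            actual_output="Tomorrow Paris should be mild, with a chance of light rain.",
        ),
    ]
)

# Run every test case in the dataset against the chosen metric in one go.
evaluate(dataset.test_cases, [AnswerRelevancyMetric(threshold=0.7)])
```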
The use of a tool like DeepEval (or any preferred alternative) facilitates many processes within an LLM project. You and your team will be able to:
- Validate changes in an LLM prompt more easily;
- Find new bugs in an LLM application before they reach the production environment;
- Evaluate different LLMs in search of better alternatives (based on the metric that matters to your team);
- Validate LLM responses based on various metrics;
- And so on.
Furthermore, one of the biggest benefits of DeepEval is that it enforces a validation process, helping products scale without quality degradation and improving their quality control procedures.
Currently, it’s hard to think of reasons not to try automated assessment in your LLM project. But since I’ve tried, I can give you one: cost. Remember that the bill for running assessments through a second LLM is on you. So, be careful with the number of tests and executions you’re going to perform, especially if your budget is tight.
However, barring excess, I think the cost of these assessments is well worth it. Many hours of manual validation have already been saved here at Bitboundaire by running these automated assessments.
Well, that’s it for today. What’s your opinion on the subject? Have you ever worked with automated LLM assessments? Comment below!
Thanks for making it this far!