Notebook for generating evals using synthetic data #937
Evals using synthetic data
LLM-As-A-Judge methodology using Llama
Overall idea
Let's assume we have a use case for generating a summarization report from a given context, which is a common use case for LLMs. Both the context and the report contain a lot of factual information, and we want to make sure the generated report is not hallucinating.
Since it's not trivial to find an open-source dataset for this, the idea is to take synthetic tabular data and use Llama to generate a story (context) for every row of the tabular data using prompt engineering. We then ask Llama to summarize the generated context as a report in a specific format, again using prompt engineering. Finally, we check the factual accuracy of the generated report with Llama by converting this into a QA task, using the tabular data as the ground truth.
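A minimal sketch of those two prompting steps (story generation, then report generation), assuming each synthetic row is available as a dict and that `call_llama(prompt)` is a hypothetical helper wrapping whatever Llama inference endpoint is used; the actual prompts in the notebook will differ:

```python
def generate_context(row: dict) -> str:
    """Ask Llama to turn one synthetic tabular row into a narrative story (the context)."""
    facts = "\n".join(f"- {column}: {value}" for column, value in row.items())
    prompt = (
        "Write a short narrative that naturally mentions every fact below, "
        "without adding new factual details:\n" + facts
    )
    return call_llama(prompt)  # hypothetical inference helper


def generate_report(context: str) -> str:
    """Ask Llama to summarize the generated context as a report with tagged sections."""
    prompt = (
        "Summarize the following text as a report. Wrap each factual field in "
        "a tag named after it, e.g. <room_type>...</room_type>:\n\n" + context
    )
    return call_llama(prompt)  # hypothetical inference helper
```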
To generate synthetic data for this approach, we use an open-source tool such as Synthetic Data Vault (SDV).
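A rough sketch of that step, assuming SDV 1.x's single-table API and a made-up hotel-style seed table; the schema and synthesizer settings in the notebook may differ:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Made-up seed table; the notebook may use a different schema entirely.
seed = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago"],
    "room_type": ["suite", "standard", "standard"],
    "nights": [2, 5, 1],
    "total_bill": [740.0, 1125.5, 210.0],
})

# Learn the column types, fit a synthesizer, and sample new rows.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=seed)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(seed)

# Each synthetic row later becomes one story/report pair.
synthetic_rows = synthesizer.sample(num_rows=100)
```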
Thanks to @adamloving-meta for suggesting this tool.
Measuring Hallucinations
The usual method to measure hallucinations is the LLM-As-A-Judge methodology; an example is the hallucination metric in DeepEval.
This would use a powerful LLM as the ground truth to measure hallucinations.
This notebook shows a simple way to measure hallucinations using the ground truth data we already have (the tabular data). The methodology is to make use of the tags we have added to the report and ask Llama simple questions about the corresponding sections. Llama compares the answers with the ground truth and produces a list of boolean values, which is then used to measure the accuracy of the factual information in the report. If your report has a well-defined structure, using QA to measure hallucinations can be highly effective and cost-efficient.
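A minimal sketch of that QA loop, assuming the report wraps each fact in a tag named after the tabular column and reusing the hypothetical `call_llama` helper from the earlier sketch; the notebook's actual prompts and parsing will differ:

```python
import re

def extract_section(report: str, tag: str) -> str:
    """Pull the text between <tag>...</tag> out of the generated report."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, report, re.DOTALL)
    return match.group(1).strip() if match else ""

def check_report(report: str, ground_truth_row: dict) -> list:
    """Return one True/False verdict per factual field in the ground-truth row."""
    verdicts = []
    for column, expected in ground_truth_row.items():
        section = extract_section(report, column)
        prompt = (
            f"Section: {section}\n"
            f"Question: does this section state that {column} is {expected}? "
            "Answer only True or False."
        )
        answer = call_llama(prompt)  # hypothetical inference helper
        verdicts.append(answer.strip().lower().startswith("true"))
    return verdicts
```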
Example
Below is an example of the expected output while checking for hallucinations in the generated report. Since we converted the hallucination measurement task into a list of QA with `True`/`False` outcomes, we can use `accuracy_score` from `sklearn` to measure the accuracy as a single number.
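For example, with illustrative (not actual) verdicts:

```python
from sklearn.metrics import accuracy_score

# Illustrative verdicts, not actual notebook output.
verdicts = [True, True, False, True]   # one answer per factual question
expected = [True] * len(verdicts)      # every fact should appear correctly in the report

print(accuracy_score(expected, verdicts))  # 0.75
```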
Fixes # (issue)
Feature/Issue validation/testing
Please describe the tests you ran to verify your changes and summarize the relevant results. Provide instructions so the tests can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Thanks for contributing 🎉!