Notebook for generating evals using synthetic data #937
Evals using synthetic data
LLM-As-A-Judge methodology using Llama
Overall idea
Let's assume we have a use case for generating a summarization report from a given context, which is a common use case for LLMs. Both the context and the report contain a lot of factual information, and we want to make sure the generated report is not hallucinating.
Since it's not trivial to find an open-source dataset for this, the idea is to take synthetic tabular data and use Llama to generate a story (context) for every row of the tabular data using prompt engineering. We then ask Llama to summarize the generated context as a report in a specific format, again using prompt engineering. Finally, we check the factual accuracy of the generated report with Llama by converting this into a QA task, using the tabular data as the ground truth.
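A minimal sketch of those two prompting steps (story generation, then report generation), assuming each synthetic row is available as a dict and that `call_llama(prompt)` is a hypothetical helper wrapping whatever Llama inference endpoint is used; the actual prompts in the notebook will differ:

```python
def generate_context(row: dict) -> str:
    """Ask Llama to turn one synthetic tabular row into a narrative story (the context)."""
    facts = "\n".join(f"- {column}: {value}" for column, value in row.items())
    prompt = (
        "Write a short narrative that naturally mentions every fact below, "
        "without adding new factual details:\n" + facts
    )
    return call_llama(prompt)  # hypothetical inference helper


def generate_report(context: str) -> str:
    """Ask Llama to summarize the generated context as a report with tagged sections."""
    prompt = (
        "Summarize the following text as a report. Wrap each factual field in "
        "a tag named after it, e.g. <room_type>...</room_type>:\n\n" + context
    )
    return call_llama(prompt)  # hypothetical inference helper
```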
To generate synthetic data for this approach, we use an open-source tool such as Synthetic Data Vault (SDV).
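A rough sketch of that step, assuming SDV 1.x's single-table API and a made-up hotel-style seed table; the schema and synthesizer settings in the notebook may differ:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Made-up seed table; the notebook may use a different schema entirely.
seed = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago"],
    "room_type": ["suite", "standard", "standard"],
    "nights": [2, 5, 1],
    "total_bill": [740.0, 1125.5, 210.0],
})

# Learn the column types, fit a synthesizer, and sample new rows.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=seed)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(seed)

# Each synthetic row later becomes one story/report pair.
synthetic_rows = synthesizer.sample(num_rows=100)
```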
Thanks to @adamloving-meta for suggesting this tool.
Measuring Hallucinations
The usual method to measure hallucinations is the LLM-As-A-Judge methodology; an example is the hallucination metric in DeepEval.
This would use a powerful LLM as the ground truth to measure hallucinations.
This notebook shows a simple way to measure hallucinations using the ground truth data we already have (the tabular data). The methodology is to make use of the tags we have added to the report and ask Llama simple questions about the corresponding sections. Llama compares the answers with the ground truth and produces a list of boolean values, which is then used to measure the accuracy of the factual information in the report. If your report has a well-defined structure, using QA to measure hallucinations can be highly effective and cost-efficient.
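A minimal sketch of that QA loop, assuming the report wraps each fact in a tag named after the tabular column and reusing the hypothetical `call_llama` helper from the earlier sketch; the notebook's actual prompts and parsing will differ:

```python
import re

def extract_section(report: str, tag: str) -> str:
    """Pull the text between <tag>...</tag> out of the generated report."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, report, re.DOTALL)
    return match.group(1).strip() if match else ""

def check_report(report: str, ground_truth_row: dict) -> list:
    """Return one True/False verdict per factual field in the ground-truth row."""
    verdicts = []
    for column, expected in ground_truth_row.items():
        section = extract_section(report, column)
        prompt = (
            f"Section: {section}\n"
            f"Question: does this section state that {column} is {expected}? "
            "Answer only True or False."
        )
        answer = call_llama(prompt)  # hypothetical inference helper
        verdicts.append(answer.strip().lower().startswith("true"))
    return verdicts
```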
Example
Below is an example of the expected output while checking for hallucinations in the generated report. Since we converted the hallucination measurement task into a list of QA with `True`/`False` outcomes, we can use `accuracy_score` from `sklearn` to measure the accuracy as a single number.
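For example, with illustrative (not actual) verdicts:

```python
from sklearn.metrics import accuracy_score

# Illustrative verdicts, not actual notebook output.
verdicts = [True, True, False, True]   # one answer per factual question
expected = [True] * len(verdicts)      # every fact should appear correctly in the report

print(accuracy_score(expected, verdicts))  # 0.75
```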
Fixes # (issue)
Feature/Issue validation/testing
Please describe the tests you ran to verify your changes and summarize the relevant results. Provide instructions so the tests can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Thanks for contributing 🎉!