
Notebook for generating evals using synthetic data #937


Merged
merged 6 commits into meta-llama:main on May 21, 2025

Conversation

@agunapal (Contributor) commented May 5, 2025

Evals using synthetic data

  • When you deploy Llama for your use case, it is good practice to have Evals for that use case. In an ideal world, you want human-annotated Evals. If, for some reason, that is not possible, this notebook shows a strategy for how one might go about creating Evals using synthetic data. However, the generated Evals still require validation by a human to make sure that your revenue-generating production use case can rely on them.
  • The notebook also shows how one can accurately measure hallucinations with Llama, without using the LLM-As-A-Judge methodology.

Overall idea

Let's assume we have a use case for generating a summarization report based on a given context, which is a pretty common use case for LLMs. Both the context and the report contain a lot of factual information, and we want to make sure the generated report is not hallucinating.

Since it's not trivial to find an open source dataset for this, the idea is to take synthetic tabular data and then use Llama to generate a story (context) for every row of the tabular data using Prompt Engineering. Then we ask Llama to summarize the generated context as a report in a specific format, again using Prompt Engineering. Finally, we check the factual accuracy of the generated report with Llama by converting this into a QA task that uses the tabular data as the ground truth.
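As a rough sketch of the two prompting steps, assuming illustrative field names and wording (not the notebook's actual prompts or schema), the idea looks something like this:

```python
# Minimal sketch of the two prompting steps; field names are illustrative
# placeholders, not the real dataset schema.

def story_prompt(row: dict) -> str:
    """Prompt Llama to write a story (context) grounded in one tabular row."""
    facts = "\n".join(f"- {key}: {value}" for key, value in row.items())
    return (
        "Write a short narrative about a student using ONLY the facts below. "
        "Do not invent any additional facts.\n"
        f"Facts:\n{facts}"
    )

def report_prompt(context: str, fields: list) -> str:
    """Prompt Llama to summarize the context into a report with one tag per field."""
    tags = "\n".join(f"<{field}>...</{field}>" for field in fields)
    return (
        "Summarize the context below as a report. Use exactly this structure, "
        "filling each tag from the context:\n"
        f"{tags}\n\n"
        f"Context:\n{context}"
    )

# Example usage with a hypothetical synthetic row
row = {"student_id": 6180804, "degree_type": "MBA", "salary": 270000}
print(story_prompt(row))
print(report_prompt("<generated story>", list(row.keys())))
```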

To generate synthetic tabular data for this approach, we use an open source tool, Synthetic Data Vault (SDV).
Thanks to @adamloving-meta for suggesting this tool.
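As a minimal sketch, generating synthetic rows with SDV's single-table API might look like the following (the notebook may use a different synthesizer or schema; `real_df` stands in for whatever seed table you start from):

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Any small seed table; the notebook's actual schema will differ.
real_df = pd.DataFrame({
    "student_id": [1, 2, 3],
    "degree_type": ["Sci&Tech", "Comm&Mgmt", "Others"],
    "salary": [270000, 200000, 250000],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)          # infer column types from the data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)                         # learn the table's distributions
synthetic_df = synthesizer.sample(num_rows=10)   # rows that feed the story prompts
print(synthetic_df.head())
```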

[Workflow diagram]

Measuring Hallucinations

The usual method for measuring hallucinations uses the LLM-As-A-Judge methodology; an example is the hallucination metric in DeepEval.
That approach uses a powerful LLM as the ground truth to measure hallucinations.

This notebook shows a simple way to measure hallucinations using the ground truth data that we already have (the tabular data). The methodology is to make use of the tags that we added to the report and use Llama to answer simple questions by looking at the corresponding sections. Llama compares the answers with the ground truth and generates a list of boolean values, which is then used to measure the accuracy of the factual information in the report. If your report has a well-defined structure, using QA to measure hallucinations can be highly effective and cost-efficient.
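A minimal sketch of such a QA-style check, assuming a hypothetical prompt format rather than the notebook's exact one, could look like this:

```python
def fact_check_prompt(report: str, ground_truth: dict) -> str:
    """Ask Llama to verify each ground-truth field against its tagged report section."""
    questions = "\n".join(
        f"- {field}: does the <{field}> section of the report state the value '{value}'?"
        for field, value in ground_truth.items()
    )
    return (
        "Check the report against the ground truth. For each question, answer on one "
        "line as 'field: [True/False, reason or None]', and wrap all the answers in "
        "<answer></answer> tags.\n\n"
        f"Report:\n{report}\n\n"
        f"Questions:\n{questions}"
    )
```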

Example

Below is an example of the expected output when checking for hallucinations in the generated report. Since we converted the hallucination measurement task into a list of QA checks with True/False outcomes, we can use accuracy_score from sklearn to report the accuracy as a single number (a sketch of that step follows the example output below).

Checking accuracy of generated report in generated_data/data_6.json

<answer>
student_id: [False, report shows Student ID is not mentioned in the data and ground truth says 6180804]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
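As a rough sketch of that final scoring step, assuming the model's reply follows the format shown above, the verdicts can be parsed and scored with sklearn like this:

```python
import re
from sklearn.metrics import accuracy_score

# A reply in the format shown above (truncated to three fields for brevity)
llama_reply = """<answer>
student_id: [False, report shows Student ID is not mentioned in the data and ground truth says 6180804]
degree_type: [True, None]
salary: [True, None]
</answer>"""

def parse_verdicts(reply: str) -> list:
    """Extract the True/False verdicts from the <answer>...</answer> block."""
    block = re.search(r"<answer>(.*?)</answer>", reply, re.DOTALL).group(1)
    return [match == "True" for match in re.findall(r":\s*\[(True|False)", block)]

predictions = parse_verdicts(llama_reply)         # [False, True, True]
expected = [True] * len(predictions)              # every fact should be correct
print(accuracy_score(expected, predictions))      # fraction of factual fields, here ~0.67
```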

Fixes # (issue)

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and summarize the relevant results. Provide instructions so they can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

@IgorKasianenko IgorKasianenko requested a review from init27 May 12, 2025 10:38
@IgorKasianenko IgorKasianenko self-assigned this May 21, 2025
@IgorKasianenko (Contributor) left a comment


Thanks for the PR! Approved with typo fixes

@IgorKasianenko IgorKasianenko merged commit 4c32e76 into meta-llama:main May 21, 2025
3 of 4 checks passed