
Endpoints created with a custom image using create_inference_endpoint from huggingface_hub return inconsistent and hallucinatory outputs #3184

@Sanahm

Issue Description

Context:
I am using the Hugging Face Inference API to deploy models for a leaderboard, via the huggingface_hub library with a custom image for text generation. Everything was working fine until recently. Now the model either hallucinates or the Inference API returns a sequence of random words in the "generated_text" field.

Problem Description

  1. Hallucination or Random Words:

    • When using a custom image, the model returns hallucinatory or random text.
    • Without the custom image, the output is more reasonable but still inconsistent.
  2. Inconsistency Between requests and InferenceClient:

    • There is an inconsistency in the output when using the requests library and the InferenceClient.

Code to Reproduce the Error

Setting Up the Endpoint

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "test-qwenz",
    repository="Qwen/Qwen2.5-0.5B-Instruct",
    framework="pytorch",
    namespace="<xxxxxxxx>",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_INPUT_LENGTH": "1024",
            "MAX_TOTAL_TOKENS": "1512",
            "MODEL_ID": "/repository"
        },
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
    },
)
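A note on the setup above: the object returned by create_inference_endpoint must finish provisioning before it can serve requests, and the env values bound the generation budget. A minimal sketch, assuming the wait()/url API of huggingface_hub's InferenceEndpoint:

```python
# Sketch (assumes huggingface_hub's InferenceEndpoint API): the endpoint
# is not usable until it reports "running".
#   endpoint.wait()              # block until provisioning completes
#   endpoint_url = endpoint.url  # base URL used by the tests below

# Sanity check on the env values above: with MAX_INPUT_LENGTH=1024 and
# MAX_TOTAL_TOKENS=1512, at most 1512 - 1024 = 488 tokens can be
# generated per request before TGI truncates or rejects the input.
max_input_length = 1024
max_total_tokens = 1512
generation_budget = max_total_tokens - max_input_length
print(generation_budget)  # 488
```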

Test Code Using requests

import requests

# URL of the endpoint created above (InferenceEndpoint exposes .url)
endpoint_url = endpoint.url

# Define the input data
data = {
    "inputs": "What is 4*2 ?"
}

# Define the headers, including the Authorization token
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json"
}

# Send the POST request
response = requests.post(endpoint_url, json=data, headers=headers)

# Print the response
print(response.json())

Output with Custom Image

[
    {
        'generated_text': 'What is 4*2 ? The integral is not 0 for ν ≤ 3/2. Any such τ will allow us to express (7b) in a form of a quadratic function of the twists τ x : Z z → Z y , which turns out to be a constant ...'
    }
]

Output without Custom Image

[
    {
        'generated_text': 'What is 4*2 ? To solve the problem of multiplying two numbers, you simply multiply them together. In this case, we...'
    }
]
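One thing worth ruling out is sampling: with no "parameters" field in the payload, TGI falls back to its own defaults. A hedged sketch of a payload that pins the generation parameters so repeated runs are comparable (parameter names follow the text-generation-inference /generate schema; the specific values are illustrative assumptions, not from the report):

```python
# Illustrative payload with explicit TGI generation parameters.
# The values here are assumptions chosen for repeatability.
data = {
    "inputs": "What is 4*2 ?",
    "parameters": {
        "do_sample": False,        # greedy decoding -> repeatable output
        "max_new_tokens": 64,      # cap the completion length
        "return_full_text": False  # do not echo the prompt back
    },
}
print(sorted(data["parameters"]))
```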

Inconsistency Between requests and InferenceClient

from huggingface_hub import InferenceClient

client = InferenceClient(endpoint_url, token="<token>")

output = client.text_generation(
    prompt="What is 4*2 ?", details=True
)

print(output)

Output:

' - Brainly.com\\nprofile\\njessie1078\\njessie10'
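A useful debugging step is to compare what the two code paths actually send. The raw requests call above posts only "inputs", while InferenceClient.text_generation builds a richer body. The client_payload below is an approximation based on the TGI /generate schema, not a dump of the real request:

```python
prompt = "What is 4*2 ?"

# What the raw requests call sends (as in the report).
raw_payload = {"inputs": prompt}

# Approximation of what InferenceClient.text_generation(details=True)
# sends (assumption based on the TGI /generate schema).
client_payload = {
    "inputs": prompt,
    "parameters": {"details": True},
}

# Keys present in one request but not the other are the first place to
# look when the two paths return different text.
extra_keys = set(client_payload) - set(raw_payload)
print(sorted(extra_keys))  # ['parameters']
```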

Additional Information

  • Models Tested:

    • Phi4
    • Qwen7B-Instruct
  • Custom Image URLs Tested:

    • ghcr.io/huggingface/text-generation-inference:1.1.1
    • ghcr.io/huggingface/text-generation-inference:3.3.1
    • ghcr.io/huggingface/text-generation-inference:latest

Expected Behavior

  • The model should return a consistent and accurate response to the input prompt.
  • There should be no significant difference in the output when using requests and InferenceClient.

Environment

  • Python Version: 3.10.16
  • huggingface_hub Version: 0.27.1 (tested also on the 0.33.1)
  • requests Version: 2.32.3

Steps to Reproduce

  1. Set up the inference endpoint with the provided code.
  2. Use the requests library to send a POST request to the endpoint.
  3. Use the InferenceClient to send a request to the endpoint.
  4. Observe the inconsistent and hallucinatory outputs.

Additional Notes

  • The issue persists across different models and custom image versions (Qwen-1.5B-Instruct, Qwen-7B-Instruct, Phi4).
  • The problem started recently, suggesting a potential change in the inference API or custom image configuration.

Request

Could you please investigate this issue and provide a solution or workaround to ensure consistent and accurate model outputs?

Thank you for your attention and support.

