Handle image content in FileReadObservation #1003

@xingyaoww

Overview

This issue tracks image content handling in FileReadObservation, which was previously implemented in OpenHands PR #10207 (OpenHands/OpenHands#10207).

Background

The OpenHands repository previously had functionality to detect and process image data URLs in FileReadObservation, wrapping them in ImageContent instead of TextContent. This ensures proper handling of image content in agent conversations.

Changes Required

The implementation should include:

1. ConversationMemory Updates

Update the conversation memory to:

  • Detect image data URLs (base64-encoded images with format data:image/[type];base64,[data]) in FileReadObservation content
  • Wrap image data in ImageContent instead of TextContent for proper LLM processing
  • Handle images in tool messages by adding them as separate user messages (workaround for LLM API limitations)

2. Runtime File Reading

Update the action execution server to:

  • Read image files (.png, .jpg, .jpeg, .bmp, .gif) as base64-encoded data URLs
  • Return image content in the format: data:image/[type];base64,[encoded_data]
  • Prioritize image handling over binary file error handling
  • Skip OH-ACI file editor for image files

3. Tool Description Updates

Update the str_replace_editor tool description to:

  • Clearly indicate support for viewing image file types
  • Update the binary file handling documentation

Original Implementation Details

ConversationMemory Changes

import re

# In _process_observation method for FileReadObservation:
if isinstance(obs, FileReadObservation):
    content = []
    if re.match(r'^data:image/[^;]+;base64,', obs.content.strip()):
        content.append(ImageContent(image_urls=[obs.content.strip()]))
    else:
        content.append(TextContent(text=obs.content))
    message = Message(role='user', content=content)
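To make it easier to see what the prefix check accepts and rejects, here is a minimal standalone sketch. The helper name `is_image_data_url` is hypothetical and not part of OpenHands; it only wraps the same regular expression used in the snippet above.

```python
import re

# Same pattern as in the snippet above: "data:image/<type>;base64," at the start.
IMAGE_DATA_URL_PREFIX = re.compile(r'^data:image/[^;]+;base64,')


def is_image_data_url(text: str) -> bool:
    """Return True if text, ignoring surrounding whitespace, starts like an image data URL.

    Hypothetical helper for illustration. Only the prefix is checked;
    the base64 payload itself is not validated.
    """
    return IMAGE_DATA_URL_PREFIX.match(text.strip()) is not None


if __name__ == '__main__':
    samples = [
        'data:image/png;base64,iVBORw0KGgo=',
        '  data:image/jpeg;base64,/9j/4AAQ\n',
        'plain text mentioning data:image/png;base64,',
        'data:text/plain;base64,aGVsbG8=',
    ]
    for sample in samples:
        print(repr(sample), is_image_data_url(sample))
```

Because the pattern is anchored at the start, a text file that merely mentions a data URL somewhere in its body would still be wrapped in TextContent.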

Image Content in Tool Messages

When a tool message contains images, add them as a separate user message:

# After creating tool message, check for images
if message.contains_image:
    return [
        Message(
            role='user',
            content=[
                content
                for content in message.content
                if isinstance(content, ImageContent)
            ],
            vision_enabled=vision_is_active,
        )
    ]
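The workaround can be tried in isolation with stand-in types. In the sketch below, `TextContent`, `ImageContent`, and `Message` are simplified placeholders rather than the real OpenHands classes, `split_images_from_tool_message` is a hypothetical name, and the `vision_enabled` flag is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextContent:  # stand-in, not the real OpenHands class
    text: str


@dataclass
class ImageContent:  # stand-in, not the real OpenHands class
    image_urls: List[str]


@dataclass
class Message:  # stand-in with only the fields this sketch needs
    role: str
    content: List[Union[TextContent, ImageContent]] = field(default_factory=list)

    @property
    def contains_image(self) -> bool:
        return any(isinstance(c, ImageContent) for c in self.content)


def split_images_from_tool_message(tool_message: Message) -> List[Message]:
    """Return one user message holding the tool message's images, or an empty list."""
    if not tool_message.contains_image:
        return []
    images = [c for c in tool_message.content if isinstance(c, ImageContent)]
    return [Message(role='user', content=images)]


if __name__ == '__main__':
    tool = Message(
        role='tool',
        content=[TextContent('read ok'), ImageContent(['data:image/png;base64,AAAA'])],
    )
    for extra in split_images_from_tool_message(tool):
        print(extra.role, len(extra.content))
```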

Action Execution Server Changes

import base64

# Check if file is an image
is_image_file = action.path.lower().endswith(
    ('.png', '.jpg', '.jpeg', '.bmp', '.gif')
)

# Skip OH-ACI for images
if action.impl_source == FileReadSource.OH_ACI and not is_image_file:
    # Use file editor...
    pass

# Read image files as base64
if is_image_file:
    with open(filepath, 'rb') as file:
        image_data = file.read()
        encoded_image = base64.b64encode(image_data).decode('utf-8')
        # Determine image type from extension
        ext = filepath.split('.')[-1].lower()
        if ext == 'jpg':
            ext = 'jpeg'
        content = f'data:image/{ext};base64,{encoded_image}'
    return FileReadObservation(path=filepath, content=content)

# Check for binary files AFTER image handling
if is_binary(action.path):
    return ErrorObservation('ERROR_BINARY_FILE')
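The image-reading branch above depends on server state (`action`, `filepath`). A self-contained variant that can be run outside the server might look like the following sketch; `read_image_as_data_url` and `IMAGE_EXTENSIONS` are hypothetical names, and returning None for non-image paths is an assumption made here so the caller can fall through to the binary check.

```python
import base64
from pathlib import Path
from typing import Optional

# Extensions listed in this issue as image files.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.bmp', '.gif')


def read_image_as_data_url(filepath: str) -> Optional[str]:
    """Return a data URL for a supported image file, or None for any other path.

    Hypothetical helper: the image type is taken from the file extension,
    not from the file contents.
    """
    path = Path(filepath)
    suffix = path.suffix.lower()
    if suffix not in IMAGE_EXTENSIONS:
        return None
    image_type = suffix.lstrip('.')
    if image_type == 'jpg':
        # Map the .jpg extension to the "jpeg" type name, as in the snippet above.
        image_type = 'jpeg'
    encoded = base64.b64encode(path.read_bytes()).decode('utf-8')
    return f'data:image/{image_type};base64,{encoded}'
```

A caller would try this helper first and only run the binary-file check when it returns None, which keeps the ordering described above.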

Testing

The implementation should be tested with:

  • Reading various image file formats (PNG, JPG, JPEG, BMP, GIF)
  • Verifying base64 encoding and data URL format
  • Confirming proper ImageContent wrapping in conversation memory
  • Testing image handling with both file read sources (FileReadSource.DEFAULT vs FileReadSource.OH_ACI)
  • Ensuring images in tool messages are properly moved to user messages
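For the data URL format check, one possible approach is to parse the returned content and decode it back to bytes. The sketch below is an assumption about how such a check could be written, not an existing OpenHands test utility; `parse_image_data_url` is a hypothetical name.

```python
import base64
import binascii
import re
from typing import Tuple

# Captures the image type and the base64 payload of a data URL.
DATA_URL_RE = re.compile(r'^data:image/([^;]+);base64,(.*)$', re.DOTALL)


def parse_image_data_url(url: str) -> Tuple[str, bytes]:
    """Split an image data URL into (image type, decoded bytes).

    Raises ValueError if the prefix is wrong or the payload is not valid base64.
    """
    match = DATA_URL_RE.match(url.strip())
    if match is None:
        raise ValueError('not an image data URL')
    image_type, payload = match.groups()
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError('payload is not valid base64') from exc
    return image_type, raw


if __name__ == '__main__':
    png_magic = b'\x89PNG\r\n\x1a\n'
    url = 'data:image/png;base64,' + base64.b64encode(png_magic).decode('utf-8')
    image_type, raw = parse_image_data_url(url)
    print(image_type, raw == png_magic)
```

A test could then compare the decoded bytes with the bytes of the original file to confirm that the encoding round-trips.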
