Complete Guide to Bug Reporting and Testing in ChatGPT Agent Mode

Denis Morozov edited this page Oct 17, 2025 · 22 revisions

Introduction

ChatGPT Agent Mode is a powerful tool for automated bug reproduction and verification in real environments. It reduces manual effort, but requires well-prepared bug reports, structured prompts, and awareness of technical limitations.

This guide provides end-to-end best practices for working with Agent Mode, including:

  • Writing effective bug reports and rules for making them agent-compliant.
  • Examples of bug reports, both archived and up-to-date.
  • A step-by-step script for running bug reproduction with the agent.
  • A reusable report template for documenting agent runs.
  • Guidelines for writing effective prompts (for single and multiple bugs).
  • Known limitations of Agent Mode and its specific behavior patterns.

The content is based on multiple phases of research, including the study documented in #7758, where around 20 bugs were analyzed and reproduced. Insights from that research shaped the recommendations, examples, and templates in this guide. Future improvements and next steps can also be found in the discussion of #7758.


1. How to Write Bug Reports for Reproduction by ChatGPT Agent

A well-written bug report is critical for reproducibility, especially when executed by an automated agent.

💡 Tip: Make sure each bug report covers only one clearly defined issue.

💡 Tip: If in doubt, iterate with ChatGPT in regular chat mode to refine the bug description into agent-friendly wording.

🔹 Essential Sections

  1. Title

    • Short and descriptive.
    • Focus on what is wrong, not how to reproduce it.
    • Example: “Line between Paste option and Create RNA antisense strand option is missing in the context menu”.
  2. Environment (not required when the URL is defined in the prompt)

    • URL of the test environment.
    • Browser and version.
    • OS.
    • App version (if available).
  3. Preconditions (not required when the URL and the new-browser-tab instruction are defined in the prompt)

    • Always start with a new browser tab.
    • Specify which layout (Sequence or Flex) is required → attach a screenshot.
    • Specify which file format (HELM/IDT/Ket) is selected → attach a screenshot.
  4. Steps to Reproduce

    • Simple, atomic steps (one action = one step).
    • Simplify the wording and explicitly document any implicit or hidden steps (every switch, mode change, format selection, and so on).
    • For each step, include focused screenshots of relevant UI elements with highlighted controls (not entire screens).
    • Use green highlights: they are easier for the agent to detect and less likely to be confused with error states.
    • Highlight tool and button locations explicitly (toolbar, popup, hotkey, top/left panel). Where possible, specify relative placement, e.g., “top toolbar, first button from the left,” or “left panel, second button right after X”.
    • Use Open → Paste from Clipboard → Select format → Add to Canvas instead of Ctrl+V for pasting structures.
    • Simplify chemical structures as much as possible — use the minimal chemical structure that still reproduces the defect.
    • When selecting bonds and atoms, use zooming beforehand to make the selection more accurate. Clearly describe which exact atoms and bonds should be selected.
    • Prefer right-click menus to left panel tools where possible (e.g.: S-Group, Delete).
  5. Actual Result

    • What happens in reality.
    • If applicable, prefer red highlights, since red naturally indicates errors or incorrect behavior.
  6. Expected Result

    • What should happen in the correct scenario.
    • If applicable, use green highlights, as green clearly communicates the correct outcome and contrasts with the red “actual” state.
    • Clarify the expected result and ensure that each bug covers only one clearly defined issue.
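
Some of the rules above can be checked mechanically before spending an agent request. The following is a minimal sketch, not part of any existing tooling: the `BugReport` class and the `lint_bug_report` function are hypothetical names, and the "more than one action" check is only a rough heuristic based on the word "then".

```python
import re
from dataclasses import dataclass


@dataclass
class BugReport:
    """Hypothetical container for the essential sections listed above."""
    title: str
    steps: list[str]
    actual_result: str
    expected_result: str


def lint_bug_report(report: BugReport) -> list[str]:
    """Return warnings; an empty list means none of the checked rules was tripped."""
    warnings = []
    if not report.title.strip():
        warnings.append("Title is empty.")
    if not report.actual_result.strip() or not report.expected_result.strip():
        warnings.append("Actual Result and Expected Result must both be filled in.")
    for number, step in enumerate(report.steps, start=1):
        # Guide rule: paste via Open -> Paste from Clipboard, not the hotkey.
        if re.search(r"ctrl\s*\+\s*v", step, flags=re.IGNORECASE):
            warnings.append(
                f"Step {number}: replace Ctrl+V with Open -> Paste from Clipboard."
            )
        # Heuristic for "one action = one step": "then" usually joins two actions.
        if re.search(r"\bthen\b", step, flags=re.IGNORECASE):
            warnings.append(f"Step {number}: may contain more than one action.")
    return warnings


report = BugReport(
    title="Line between Paste option and Create RNA antisense strand option is missing",
    steps=[
        "Press Ctrl+V to paste the structure",
        "Click Open, then click Paste from Clipboard",
        "Click the Add to Canvas button",
    ],
    actual_result="The separator line is missing.",
    expected_result="The separator line is in place.",
)
for warning in lint_bug_report(report):
    print(warning)
```

A check like this cannot judge screenshots or highlight colors, so it complements the manual review rather than replacing it.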

2. Agent-Compliant Bug Report Examples

Bug Report for Molecules mode, example based on #7746

Title: System replace few spaces in monomer name to one space on monomer preview

Steps to Reproduce

  1. Switch to Molecules mode using the switcher in the top right corner of the top panel:
Image
  2. Click on "100%" zoom control in the top right corner.
Image
  3. Change the zoom value in the input field to "350%" to better interact with the structure.
Image
  4. Click the folder icon "Open..." in the top left corner of the top panel:
Image
  5. Click on Paste from Clipboard:
Image
  6. Copy this SMILES string: BrBr
  7. Paste it into the Open Structure pop-up.
  8. Click on "Add to Canvas" button:
Image
  9. Click on the Canvas to place the chemical structure:
Image
  10. Click the Rectangle Selection tool (second icon in the left panel):
Image
  11. With the Rectangle Selection tool, place the cursor above and to the left of the Br atom. Hold the left mouse button and drag to select the Br atom and the bond. Leave the second Br atom unselected:
Image
  12. Click the "Create a monomer" button in the middle of the left panel.
    Note: This button becomes available only after a bond and an atom are selected. This button is located right after the "R-Group Label" tool.
Image
  13. In the Attributes pop-up, select "CHEM" in the Type dropdown.
  14. In the Symbol field, enter: LongName.
  15. In the Name field, copy and paste the following text exactly as shown: 1 2 3 4 5 6 End
  16. Click the "Submit" button:
Image
  17. Switch to Macromolecules mode using the switcher in the top right corner of the top panel:
Image
  18. Hover the mouse over the monomer on the canvas.

Actual behavior

  • Preview tooltip title is without spacing: 1 2 3 4 5 6 End
Image

Expected behavior

  • Preview tooltip title is with spacing: 1 2 3 4 5 6 End

Bug Report for Macromolecules mode, example based on #7395

Title: Line between Paste option and Create RNA antisense strand option is missing in the context menu

Steps to Reproduce

  1. Switch to Macromolecules mode using the switcher in the top right corner of the top panel:
Image
  2. Click on the “A” button in the top panel:
Image
  3. Select the third option "Switch to flex layout mode":
Image
  4. Click on Folder icon "Open...":
Image
  5. Click on the first option "PASTE FROM CLIPBOARD":
Image
  6. Click on "Ket" in the bottom-left area of the "Open Structure" modal window:
Image
  7. Select "HELM" option in dropdown:
Image
  8. Copy this exact HELM string: RNA1{r(A)p}$$$$V2.0
  9. Paste the copied text into the "Open Structure" modal window.
  10. Click the "Add to Canvas" button:
Image
  11. Click the Rectangle Selection tool (second icon in the left panel):
Image
  12. Left-click and drag to select the entire chemical structure placed on the Canvas in Step 10:
Image
  13. Right-click on any monomer on the Canvas to open the context menu.

Actual behavior

  • Line between Paste option and Create RNA antisense strand option is missing
Image

Expected behavior

  • Line between Paste option and Create RNA antisense strand option is in place
Image

3. Bug Reproduction, Data Collection, and ChatGPT Agent Run Analysis Script

  1. Reproduce the bug manually

  2. Make the bug report agent-compliant according to this guide (see sections 1 and 2).

  3. Start a new ChatGPT chat and select Agent mode:

image

💡 Tip: Here you can see the number of available requests. The limit is 40 requests per month, and it refreshes monthly.
Because requests are limited, make sure to use them effectively — one prompt in chat equals 1 agent request.

⚠️ Note: You can also take over control manually during the session. Keep in mind that connecting directly to the agent’s desktop also consumes requests and reduces the number of runs available for testing:

image
💡 Tip: To keep previous results from affecting new agent runs, use the following memory settings:
image
  4. Add a prompt (e.g., to reproduce one bug; for more examples, see section 4 below):
1. Open a new tab in Chrome and go to: https://github.com/epam/ketcher/issues/7395
2. Read the bug report and note the Steps to Reproduce and the Expected behavior.
3. Open another new tab in Chrome and go to: https://rc.test.lifescience.opensource.epam.com/KetcherDemoSA/index.html
4. Follow the bug’s Steps to Reproduce exactly in this environment.
5. Describe what you observed in the environment.
6. Compare your observation with the Expected behavior from the bug report, and clearly state the result as one of:
   - Reproducible — if the bug is visible,
   - Not reproducible — if the bug does not appear and you are completely sure,
   - Not reproduced, but not sure — if the result is uncertain.
7. Always take one final screenshot of the browser showing the outcome before giving your answer. 
  5. Run the agent and wait until it completes.
    Do not stop the agent or send other requests in the same chat during execution.
    Click on “Share” and copy the link to get a shared execution URL.
    Full execution details are available in the shared session:
image

⚠️ Note: Do not delete the chat. Once a chat is deleted, the share link becomes unavailable.

  6. Review the shared agent execution and verify whether the defect is reproducible.
    Make sure it is not a false positive (the agent reported reproducible, but review did not confirm) or false negative (the agent reported not reproducible, but that’s incorrect). During the review of the agent’s session, it is useful to check the playback mode (video-style rewind) to follow the flow of actions:
image

⚠️ Note: Sometimes the display may freeze: the agent’s “thoughts” are shown, but screenshots are missing or outdated — just refresh the page.

It is also helpful to use the “activity mode”, where the entire session is split into a sequence of screenshots for easier step-by-step analysis:

image
  7. Add a GitHub label to the bug based on the reproduction result:
  • GPT Ready — if the issue is successfully reproduced by the agent.

⚠️ Note: It makes sense to apply this label to the bug in GitHub after at least two consecutive successful runs.

  • GPT not Ready — if the reproduction is unsuccessful or results in a false positive/negative.
    In both cases, include a comment with the result, duration of run and the link to the agent’s execution.
  8. Fill in the following fields in the agent-run report on Google Sheets (check the table for examples):
  • Bug link: URL to GitHub issue (e.g., =HYPERLINK("https://github.com/epam/ketcher/issues/7746", "#7746"))
  • Run: Number of the run for this bug
  • Reproduction Result: Label indicating reproduction status (GPT Ready / GPT not Ready).

  • Duration: Total execution time
  • Researcher: Name of the person who performed the run
  • Date: Date of the run (dd.mm.yy)
  • Link to execution: URL to the shared agent run (e.g., https://chatgpt.com/share/68cd370c-6338-8008-a987-f1616e37d189)
  • Changes before reproduction: Describe any modifications made to the bug steps, if applicable
  • Steps number: Total number of steps in the bug report
  • Prompt Version: Version number used (if new, add a new version on the second tab of the document "Prompts")
  • Agent’s problems: List the steps and exact issues encountered (e.g., Rectangle Selection, Bond Tool)
  • Agent's Result: Outcome or conclusion reached by the agent
  • Details of reproduction:
    • Brief description of whether the bug was reproduced, as confirmed by reviewing the shared agent run session.
    • Note if the result was a false positive, if execution was stopped manually, or if an execution error occurred.
    • Analyze the total execution time and any changes if it’s not the first run.
    • Describe any difficulties the agent encountered and any other relevant observations.
  • What can be improved in the bug: Hypotheses or suggestions for the next run
  • What can be improved in the prompt: Notes for the next iteration
  • What can be improved in the report: Meta-notes on reporting quality or format
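
The bookkeeping in steps 6 to 8 can be sketched in a few lines of Python. This is an illustrative sketch only: all function names are hypothetical, the "pending second run" placeholder is an assumption for the case where a single successful run is not yet enough to apply a label, and only a subset of the report fields is shown.

```python
from datetime import date
from typing import Optional


def issue_hyperlink(issue_number: int) -> str:
    """Build the Google Sheets HYPERLINK formula used in the 'Bug link' field."""
    url = f"https://github.com/epam/ketcher/issues/{issue_number}"
    return f'=HYPERLINK("{url}", "#{issue_number}")'


def review_outcome(agent_says_reproducible: bool, reviewer_sees_bug: bool) -> str:
    """Compare the agent's verdict with the human review of the shared session."""
    if agent_says_reproducible and not reviewer_sees_bug:
        return "false positive"
    if not agent_says_reproducible and reviewer_sees_bug:
        return "false negative"
    return "confirmed"


def suggested_label(run_successes: list[bool]) -> Optional[str]:
    """Suggest a label from the run history, oldest run first.

    Returns None when there is exactly one success at the end of the history,
    because the guide recommends two consecutive successful runs before labeling.
    """
    if not run_successes:
        return None
    if not run_successes[-1]:
        return "GPT not Ready"
    if len(run_successes) >= 2 and run_successes[-2]:
        return "GPT Ready"
    return None


def report_row(issue_number: int, run: int, run_successes: list[bool],
               duration: str, researcher: str, run_date: date,
               execution_url: str) -> dict:
    """Assemble a subset of the report fields as a dictionary."""
    return {
        "Bug link": issue_hyperlink(issue_number),
        "Run": run,
        # "pending second run" is a hypothetical placeholder, not a real label.
        "Reproduction Result": suggested_label(run_successes) or "pending second run",
        "Duration": duration,
        "Researcher": researcher,
        "Date": run_date.strftime("%d.%m.%y"),
        "Link to execution": execution_url,
    }
```

Keeping the false positive and false negative check separate from the label decision mirrors the script: the review in step 6 determines whether a run counts as successful, and only then is the label in step 7 chosen.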

4. How to Write Effective Agent Mode Requests

Key Rules to Reproduce 1 Bug

  • Always include the correct links to both the bug and the environment in the prompt.
  • The example prompt below should be treated as a draft version. It is not final and requires more successful runs to confirm its stability.
  • This draft prompt already includes all critical elements: strict step order, explicit link usage, comparison with expected behavior, clear result categories (Reproducible / Not reproducible / Not reproduced, but not sure), and a mandatory final screenshot.
  • After several runs, you may simplify the prompt for efficiency or extend it with additional clarifications if the agent still struggles.

⚠️ Note: Based on the current analysis, prompt improvements mainly help with structure, clarity of the result, and avoiding false positives/negatives. They cannot fully compensate for weak selection handling or unclear bug steps, and they do not reduce execution time.

Example of Request to Reproduce Only One Bug

1. Open a new tab in Chrome and go to: https://github.com/epam/ketcher/issues/7395
2. Read the bug report and note the Steps to Reproduce and the Expected behavior.
3. Open another new tab in Chrome and go to: https://rc.test.lifescience.opensource.epam.com/KetcherDemoSA/index.html
4. Follow the bug’s Steps to Reproduce exactly in this environment.
5. Describe what you observed in the environment.
6. Compare your observation with the Expected behavior from the bug report, and clearly state the result as one of:
   - Reproducible — if the bug is visible,
   - Not reproducible — if the bug does not appear and you are completely sure,
   - Not reproduced, but not sure — if the result is uncertain.
7. Always take one final screenshot of the browser showing the outcome before giving your answer. 
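
Because only the two links change between runs, the prompt above can be generated from a template. The sketch below is an assumption about how a team might automate this; `SINGLE_BUG_PROMPT` and `build_single_bug_prompt` are hypothetical names, and the template follows the draft prompt with simplified punctuation.

```python
SINGLE_BUG_PROMPT = """\
1. Open a new tab in Chrome and go to: {issue_url}
2. Read the bug report and note the Steps to Reproduce and the Expected behavior.
3. Open another new tab in Chrome and go to: {environment_url}
4. Follow the bug's Steps to Reproduce exactly in this environment.
5. Describe what you observed in the environment.
6. Compare your observation with the Expected behavior from the bug report, and clearly state the result as one of:
   - Reproducible: if the bug is visible,
   - Not reproducible: if the bug does not appear and you are completely sure,
   - Not reproduced, but not sure: if the result is uncertain.
7. Always take one final screenshot of the browser showing the outcome before giving your answer.
"""


def build_single_bug_prompt(issue_url: str, environment_url: str) -> str:
    """Fill the template; reject missing or non-HTTPS links early."""
    for name, url in (("issue_url", issue_url), ("environment_url", environment_url)):
        if not url.startswith("https://"):
            raise ValueError(f"{name} must be a full https:// link, got {url!r}")
    return SINGLE_BUG_PROMPT.format(issue_url=issue_url, environment_url=environment_url)


prompt = build_single_bug_prompt(
    "https://github.com/epam/ketcher/issues/7395",
    "https://rc.test.lifescience.opensource.epam.com/KetcherDemoSA/index.html",
)
print(prompt)
```

Validating the links before sending matters because each prompt consumes one agent request, so a run with a missing link wastes part of the monthly budget.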

Key Rules to Reproduce Several Bugs

  1. List bugs with direct GitHub links → Ensures the agent retrieves reproduction steps and details directly.

  2. Always specify the environment → Exact URL of the testing stand.

  3. Fix the browser explicitly → Example: “on Google Chrome browser”.

  4. Force clean runs with new tabs → Add “Before each test, open a new browser tab with Ketcher”.

  5. Limit bug count per request → Maximum 2–3 bugs for stability.

⚠️ Note: After more than three bugs in a single run, the agent may:

  • Lose track of steps.
  • Stop mid-execution with an error.
  • Get stuck in a thinking state without finishing.

Example of Request to Reproduce Several Bugs

Check that the bugs: 
https://github.com/epam/ketcher/issues/7697 
https://github.com/epam/ketcher/issues/7574 
https://github.com/epam/ketcher/issues/5225 
on the environment https://rc.test.lifescience.opensource.epam.com/KetcherDemoSA/index.html 
on Google Chrome browser. 
Before each test, open a new browser tab with Ketcher.
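
The limit of three bugs per request can be enforced when prompts are generated. The following is a minimal sketch under that assumption; `batch_bugs` and `build_multi_bug_prompt` are hypothetical helper names, and the prompt wording follows the example above.

```python
def batch_bugs(issue_urls: list[str], max_per_request: int = 3) -> list[list[str]]:
    """Split the bug list into chunks of at most max_per_request links."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [
        issue_urls[start:start + max_per_request]
        for start in range(0, len(issue_urls), max_per_request)
    ]


def build_multi_bug_prompt(issue_urls: list[str], environment_url: str,
                           browser: str = "Google Chrome") -> str:
    """Build one request that lists the bugs, the environment, and the browser."""
    lines = [
        "Check that the bugs:",
        *issue_urls,
        f"on the environment {environment_url}",
        f"on {browser} browser.",
        "Before each test, open a new browser tab with Ketcher.",
    ]
    return "\n".join(lines)


environment = "https://rc.test.lifescience.opensource.epam.com/KetcherDemoSA/index.html"
# Placeholder issue numbers used only to demonstrate the batching.
bugs = [f"https://github.com/epam/ketcher/issues/{number}" for number in range(1, 8)]
batches = batch_bugs(bugs)
prompts = [build_multi_bug_prompt(batch, environment) for batch in batches]
print(f"{len(bugs)} bugs need {len(prompts)} agent requests")
```

Since one prompt equals one agent request, the number of batches is also the number of requests this list would consume from the monthly limit.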

5. Agent Mode Limitations and Specific Behavior

Keep the following in mind when using Agent Mode for bug reproduction:

  • DevTools are unavailable → console errors and network requests cannot be inspected.

  • Zip files and other attachments are not supported → the agent cannot download or open files attached to a GitHub issue. Use text formats (HELM, IDT strings) instead. The agent has difficulty copying text from quotes, so put the structure into a code block. If the structure is too big (GitHub issue bodies are limited to 65,536 characters), put it into a GitHub gist and add the link to the description.

  • Context loss across multiple bugs → when testing several defects in sequence, the agent may lose context and fail.

  • Agent screenshot verification limitations → issues that depend on cursor position or subtle rendering details on the Canvas cannot be reliably verified through screenshots. For example, ghost images under the cursor are not visible (#7421), and slight changes in bond thickness cannot be captured (#7726). Such cases should not be tested with the agent.

  • Reliable interactions → the agent consistently handles clicks that open dropdowns, pop-ups, or context menus. When the bug report includes clear screenshots of where to click, the agent executes these interactions correctly.

  • ⚠️ Tool-related struggles → the agent has difficulties with tools such as Erase, Selection, and Create Monomer, which require selecting the tool first and then applying it to the structure. Frequent issues include mis-selecting atoms or bonds, choosing the wrong tool (e.g., Hand or Erase instead of Selection), drawing rectangles incorrectly, or dragging atoms instead of selecting them.

  • ⚠️ Limited prompt impact → based on the current analysis, prompt improvements mainly help with structure, clarity of the result, and avoiding false positives/negatives. They cannot fully compensate for weak selection handling or unclear bug steps, and they do not reduce execution time.

  • ⚠️ Unpredictable execution times → even for stable bugs, run time may vary significantly (for example, 5 minutes in one run and 15 minutes in the next for the same bug).

  • ⚠️ Execution freezes → sometimes the agent “thinks” but screenshots are missing or outdated; refreshing the page (F5) is required.

  • ⚠️ Direct desktop takeover consumes requests → connecting directly to the agent’s desktop also reduces the number of available runs for testing.
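
The attachment limitation above implies a simple decision: embed the structure in the issue body as a code block if it fits, otherwise move it to a gist. The sketch below illustrates that decision; `embed_structure` is a hypothetical helper, and the 65,536-character limit is taken from the note above rather than verified against GitHub.

```python
from typing import Optional

# Limit quoted in this guide for GitHub issue bodies, in characters.
GITHUB_ISSUE_BODY_LIMIT = 65_536

# Built programmatically so this example does not contain a literal fence.
FENCE = "`" * 3


def embed_structure(issue_body: str, structure: str) -> Optional[str]:
    """Append the structure as a code block, or return None if it will not fit.

    A None result means the structure should go into a GitHub gist and only
    the gist link should be added to the issue description.
    """
    new_body = f"{issue_body}\n\n{FENCE}\n{structure}\n{FENCE}\n"
    if len(new_body) > GITHUB_ISSUE_BODY_LIMIT:
        return None
    return new_body


small = embed_structure("Steps to Reproduce", "RNA1{r(A)p}$$$$V2.0")
too_big = embed_structure("Steps to Reproduce", "A" * 70_000)
print("embedded" if small is not None else "use a gist")
print("embedded" if too_big is not None else "use a gist")
```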


Conclusion

ChatGPT Agent Mode has proven to be a useful assistant for QA testing when applied correctly. Reliable results depend on:

  • Well-structured bug reports with clear steps, screenshots, and environment details.
  • Following a disciplined script for reproductions and documenting outcomes.
  • Using consistent templates for agent-run reports.
  • Writing precise, reproducible prompts that limit ambiguity.
  • Respecting environment limitations (no DevTools, no zip files, issues with complex selection tools).
  • Avoiding overloading a single run — keep to 2–3 bugs per request.

The practices in this guide are grounded in real research (#7758) and reflect lessons learned from multiple reproduction sessions. They are not static: new limitations, prompt improvements, and testing strategies will emerge with further experiments. Testers are encouraged to contribute findings back to the shared research thread so the methodology continues to evolve.

By following these practices, QA teams can ensure more accurate, repeatable results and reduce the cost of manual verification with the help of AI.
