[Bug] UnicodeDecodeError in gitea_provider.py when parsing binary files before extension filtering #2380

@ChrisTJie

Description

Git provider

Other

System Info

  • Deployment Type: Self-hosted Docker App
  • Docker Image: codiumai/pr-agent:0.34-gitea_app
  • Git Provider: Gitea v1.25.5 (Self-hosted)
  • Trigger Method: Gitea Webhook (handle_gitea_webhooks)
  • LLM Model: gpt-5.4-nano
  • Relevant Config: patch_extension_skip_types includes [".webp", ".mp3", ".png"], but the crash appears to occur before this filter is applied.

Bug details

Describe the bug
When PR-Agent processes a Gitea webhook for a PR that replaces old media files with new binary formats (e.g., deleting existing .svg/.png and adding new .webp/.mp3 files), it crashes with a UnicodeDecodeError.

The crash occurs during the file content retrieval and diff generation phases (get_file_content and __add_file_diff). It appears the system attempts to decode the binary payload as utf-8 before the patch_extension_skip_types filter can effectively exclude these files.

Additionally, the Gitea API returns 404 Not Found for some of these files (likely because the agent tries to fetch the content of files that were just deleted, or because of binary asset routing). The 404 is not handled gracefully and leads to cascading errors.
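
One way to handle the 404 case, as a minimal sketch: treat a missing file as empty content instead of letting the error propagate. NotFoundError, fetch_raw, and get_file_content_safe are hypothetical names used for illustration only, not PR-Agent or Gitea client APIs; the real provider would catch whatever exception its HTTP client raises for a 404.

```python
from typing import Callable


class NotFoundError(Exception):
    """Hypothetical stand-in for the HTTP client's 404 exception."""


def get_file_content_safe(fetch_raw: Callable[[str], bytes], path: str) -> bytes:
    """Return the file's raw bytes, or b"" if the file does not exist at this ref."""
    try:
        return fetch_raw(path)
    except NotFoundError:
        # A file deleted in the PR has no content at the head ref,
        # so a 404 is an expected outcome rather than a failure.
        return b""
```

Returning an empty value keeps the caller's loop over changed files going; whether empty bytes or None is the better sentinel depends on how the caller treats deleted files.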

To Reproduce

  1. Set up PR-Agent using the codiumai/pr-agent:0.34-gitea_app image.
  2. Create a Pull Request in Gitea (v1.25.5) that deletes old media files and adds new binary files (e.g., removing q1bg.png and adding q1bg.webp).
  3. Trigger the PR-Agent via Gitea webhook.
  4. The agent receives 404 responses for the assets, hits UTF-8 decoding errors, and returns an empty prediction.

Expected behavior
gitea_provider.py should filter out ignored extensions before attempting to decode raw content, and should handle undecodable bytes without raising UnicodeDecodeError (e.g., by decoding with errors='replace' or by checking MIME types first). API 404 errors, especially for deleted files, should also be handled without crashing the main process.
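
A rough sketch of the ordering described above, assuming the skip list mirrors the patch_extension_skip_types value from my config. SKIP_EXTENSIONS, should_skip, and decode_file_content are hypothetical names, not existing PR-Agent functions, and the NUL-byte check is only a common heuristic for spotting binary data.

```python
import os

# Assumption: mirrors the reporter's patch_extension_skip_types setting.
SKIP_EXTENSIONS = {".webp", ".mp3", ".png"}


def should_skip(path: str, skip_extensions=SKIP_EXTENSIONS) -> bool:
    """Return True if the file extension is in the skip list (case-insensitive)."""
    return os.path.splitext(path)[1].lower() in skip_extensions


def decode_file_content(path: str, raw: bytes) -> str:
    """Decode raw bytes to text, returning "" for skipped or binary files."""
    if should_skip(path):
        return ""  # filtered out before any decoding is attempted
    if b"\x00" in raw[:8000]:
        return ""  # heuristic: a NUL byte near the start suggests binary data
    # errors="replace" substitutes U+FFFD for invalid bytes instead of raising.
    return raw.decode("utf-8", errors="replace")
```

With this ordering, decode_file_content("assets/images/screens/q1bg.webp", raw) returns an empty string without ever calling decode, and a text file containing a stray invalid byte still yields usable text.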

Relevant PR Diff (Example highlighting deletions and additions)

diff --git a/assets/images/screens/q1bg.png b/assets/images/screens/q1bg.png
deleted file mode 100644
index a35d62f..0000000
Binary files a/assets/images/screens/q1bg.png and /dev/null differ

diff --git a/assets/images/screens/q1bg.webp b/assets/images/screens/q1bg.webp
new file mode 100644
index 0000000..dd7b99b
Binary files /dev/null and b/assets/images/screens/q1bg.webp differ

Logs

# 1. 404 Error when fetching assets (Likely related to deleted or binary files)
file: /app/pr_agent/git_providers/gitea_provider.py
function: get_file_content (line 947)
ERROR: Error getting file: assets/images/screens/q1bg.webp, content: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'max-age=0, private, must-revalidate, no-transform', 'Content-Type': 'application/json;charset=utf-8', ...})
HTTP response body: b'{"errors":null,"message":"not found","url":"https://<gitea-host>/api/swagger"}\n'

# 2. UTF-8 Decoding Crash in get_file_content
file: /app/pr_agent/git_providers/gitea_provider.py
function: get_file_content (line 950)
ERROR: Unexpected error: 'utf-8' codec can't decode byte 0x86 in position 4: invalid start byte

# 3. UTF-8 Decoding Crash in __add_file_diff
file: /app/pr_agent/git_providers/gitea_provider.py
function: __add_file_diff (line 152)
ERROR: Error getting diff content: 'utf-8' codec can't decode byte 0x84 in position 2007: invalid start byte

# 4. Final failure
file: /app/pr_agent/tools/pr_description.py
WARNING: Empty prediction, PR: <repo>/<pr_id>
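
The message in log entry 2 is what Python produces when strict UTF-8 decoding meets a byte that cannot start a character. A minimal reproduction, assuming a payload that begins with a RIFF container header (as WebP files do); the bytes below are made up for illustration and are not taken from my actual file, and try_decode is a hypothetical helper.

```python
# Hypothetical bytes: "RIFF", a 4-byte little-endian size, then "WEBP".
payload = b"RIFF\x86\x12\x00\x00WEBPVP8 "


def try_decode(data: bytes) -> str:
    """Return "ok" if data is valid UTF-8, otherwise the decoder's error message."""
    try:
        data.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as exc:
        return str(exc)


message = try_decode(payload)
print(message)
# 'utf-8' codec can't decode byte 0x86 in position 4: invalid start byte
```

The first four bytes are plain ASCII, so decoding only fails at position 4, where the binary size field starts.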
