fix: byte/character unit mismatch in BigQuery analytics plugin GCS text offload #5561

@caohy1988

Description

Bug: Mixed byte/character units in GCS text offload threshold

Validated against: main at 2d61cb69

Problem

HybridContentParser._parse_content_object (CASE C, text handling) measures text size in bytes for the GCS offload decision but compares it against thresholds derived from max_content_length, which is a character count used everywhere else in the plugin.

GCS offload uses bytes (bigquery_agent_analytics_plugin.py:1433):

text_len = len(part.text.encode("utf-8"))  # BYTES

Offload threshold derived from character-based config (bigquery_agent_analytics_plugin.py:1436-1438):

offload_threshold = self.inline_text_limit  # 32KB — bytes-intended
if self.max_length != -1 and self.max_length < offload_threshold:
    offload_threshold = self.max_length  # max_content_length — CHARACTERS

_truncate uses characters (bigquery_agent_analytics_plugin.py:1370):

if self.max_length != -1 and len(text) > self.max_length:  # CHARACTERS
    return text[:self.max_length] + "...[TRUNCATED]"

_recursive_smart_truncate uses characters (bigquery_agent_analytics_plugin.py:309):

if max_len != -1 and len(obj) > max_len:  # CHARACTERS
    return obj[:max_len] + "...[TRUNCATED]"

Impact

For multi-byte content (CJK, emoji, Arabic, etc.), a string can be under the character limit but over the byte limit:

| Example | Characters | UTF-8 bytes | vs. max_content_length (500K) |
| --- | --- | --- | --- |
| 400K CJK characters | 400,000 | ~1,200,000 | Under char limit, over byte limit |
| 500K ASCII characters | 500,000 | 500,000 | At both limits |

Scenario 1: False offload trigger

400K CJK characters → text_len = 1.2M bytes > offload_threshold (500K) → triggers GCS upload, even though the text is under the 500K character limit. The same text in ASCII would stay inline.
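The divergence is easy to reproduce in isolation. A minimal standalone sketch (constants chosen to mirror the example above, not taken from the plugin's code):

```python
# Standalone demonstration of the char/byte mismatch.
MAX_CONTENT_LENGTH = 500_000          # characters, per the config's semantics

text = "漢" * 400_000                 # 400K CJK characters, 3 UTF-8 bytes each

char_len = len(text)                  # what _truncate measures
byte_len = len(text.encode("utf-8"))  # what the offload check measures

assert char_len == 400_000            # under the 500K character limit
assert byte_len == 1_200_000          # over a 500K byte-measured threshold
```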

Scenario 2: Failed offload falls back incorrectly

If the GCS upload fails (lines 1460-1466), the fallback calls _truncate(part.text), which uses len(text) (characters). The 400K-character text passes the character check and stays inline untruncated — but it was supposed to be offloaded because of its byte size.
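The fallback mismatch can also be shown in isolation. In this hedged sketch, truncate is a simplified stand-in for the plugin's _truncate and the constants mirror the scenario above:

```python
# Simplified stand-in for the plugin's character-based _truncate;
# names and constants mirror the scenario, not the actual source.
def truncate(text, max_length=500_000):
    if max_length != -1 and len(text) > max_length:
        return text[:max_length] + "...[TRUNCATED]"
    return text

cjk = "漢" * 400_000       # 1.2M bytes: selected for offload by the byte check
fallback = truncate(cjk)   # char-based fallback after a GCS upload failure

assert fallback == cjk     # stays inline, untruncated
```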

Scenario 3: Threshold comparison mixes units

offload_threshold = min(inline_text_limit, max_length) at lines 1436-1438 compares inline_text_limit (32KB, bytes-intended) with max_length (character count). For ASCII these are equivalent; for multi-byte content they diverge.

Double truncation path

Some callbacks pre-truncate raw_content before passing to parser.parse():

  • _format_content_safely (bigquery_agent_analytics_plugin.py:2082) truncates content into a string
  • parser.parse() in _log_event (bigquery_agent_analytics_plugin.py:2876) truncates again via _recursive_smart_truncate

For ASCII this is idempotent (truncating at the same character position twice). For multi-byte content, the two truncation passes could cut at different positions if one measures characters and the other's threshold was derived from a byte comparison.
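To illustrate, a sketch using a simplified character-based truncator (a hypothetical helper mirroring _truncate's shape, not the plugin's code): truncating twice at the same character position is a no-op, while a byte-derived cut lands at a different position for multi-byte text.

```python
def truncate_chars(text, max_len):
    # Simplified character-based truncation, mirroring _truncate's shape.
    if max_len != -1 and len(text) > max_len:
        return text[:max_len] + "...[TRUNCATED]"
    return text

once = truncate_chars("a" * 1_000, 100)
twice = truncate_chars(once, 100)
assert once == twice                       # second pass is a no-op

# A byte-derived cut position disagrees for multi-byte text:
cjk = "漢" * 1_000
by_chars = truncate_chars(cjk, 100)                             # keeps 100 chars
by_bytes = cjk.encode("utf-8")[:100].decode("utf-8", "ignore")  # keeps 33 chars
assert len(by_bytes) == 33
```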

Proposed fix

Normalize to one unit. Two options:

Option A: Use characters throughout (simpler, matches Python semantics)

Change the GCS offload check to use len(part.text) instead of len(part.text.encode("utf-8")):

# Line 1433:
text_len = len(part.text)  # characters, consistent with _truncate and max_content_length

Pro: consistent with _truncate, _recursive_smart_truncate, and max_content_length semantics. Simple one-line fix.
Con: doesn't account for actual GCS upload size (multi-byte text uploads more bytes than the character count suggests).

Option B: Use bytes throughout (controls actual storage size)

Change _truncate, _recursive_smart_truncate, and max_content_length documentation to specify bytes. This is a larger change that affects the entire plugin.

Recommendation: Option A — it's a one-line fix, consistent with the rest of the plugin, and max_content_length is already documented/used as a character count. The GCS upload size difference for multi-byte content is a minor overhead, not a correctness issue.
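Under Option A, the CASE C decision could be sketched as below. Function and parameter names here are assumptions for illustration, not the plugin's actual API, and inline_text_limit is chosen large so that max_length governs, as in Scenario 1:

```python
def should_offload(text, inline_text_limit, max_length):
    """Character-based offload decision (Option A sketch)."""
    text_len = len(text)  # characters, consistent with _truncate
    offload_threshold = inline_text_limit
    if max_length != -1 and max_length < offload_threshold:
        offload_threshold = max_length
    return text_len > offload_threshold

cjk = "漢" * 400_000
# Character-based check: 400K CJK chars stay inline under a 500K limit:
assert not should_offload(cjk, inline_text_limit=2_000_000, max_length=500_000)
# The old byte-based measurement would have crossed the same threshold:
assert len(cjk.encode("utf-8")) > 500_000
```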

Affected code paths

| Location | Unit | Used for |
| --- | --- | --- |
| `_parse_content_object:1433` | Bytes | GCS offload threshold check |
| `_parse_content_object:1436-1438` | Mixed | `min(inline_text_limit, max_length)` — bytes vs. characters |
| `_truncate:1370` | Characters | Inline text truncation |
| `_recursive_smart_truncate:309` | Characters | Dict/list string value truncation |
| `BigQueryLoggerConfig.max_content_length:570` | Characters | Config value (500K default) |
| `HybridContentParser.inline_text_limit:1367` | Bytes-intended | 32KB hardcoded threshold |
