Bug: Mixed byte/character units in GCS text offload threshold
Validated against: main at 2d61cb69
Problem
`HybridContentParser._parse_content_object` (CASE C, text handling) measures text size in bytes for the GCS offload decision but compares it against thresholds derived from `max_content_length`, which is a character count used everywhere else in the plugin.
GCS offload uses bytes (`bigquery_agent_analytics_plugin.py:1433`):

```python
text_len = len(part.text.encode("utf-8"))  # BYTES
```

Offload threshold derived from character-based config (`bigquery_agent_analytics_plugin.py:1436-1438`):

```python
offload_threshold = self.inline_text_limit  # 32KB — bytes-intended
if self.max_length != -1 and self.max_length < offload_threshold:
    offload_threshold = self.max_length  # max_content_length — CHARACTERS
```

`_truncate` uses characters (`bigquery_agent_analytics_plugin.py:1370`):

```python
if self.max_length != -1 and len(text) > self.max_length:  # CHARACTERS
    return text[:self.max_length] + "...[TRUNCATED]"
```

`_recursive_smart_truncate` uses characters (`bigquery_agent_analytics_plugin.py:309`):

```python
if max_len != -1 and len(obj) > max_len:  # CHARACTERS
    return obj[:max_len] + "...[TRUNCATED]"
```
Impact
For multi-byte content (CJK, emoji, Arabic, etc.), a string can be under the character limit but over the byte limit:
| Example | Characters | UTF-8 Bytes | `max_content_length` (500K) |
|---|---|---|---|
| 400K CJK characters | 400,000 | ~1,200,000 | Under char limit, over byte limit |
| 500K ASCII characters | 500,000 | 500,000 | At both limits |
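The divergence in the table is easy to reproduce with a standalone check (not plugin code; U+6F22 is a representative 3-byte CJK character):

```python
# Character count vs UTF-8 byte count for the two table rows above.
cjk = "\u6f22" * 400_000    # a common CJK character, 3 bytes in UTF-8
ascii_text = "a" * 500_000

assert len(cjk) == 400_000                    # under the 500K character limit
assert len(cjk.encode("utf-8")) == 1_200_000  # well over 500K bytes
# For ASCII, characters and bytes coincide:
assert len(ascii_text) == len(ascii_text.encode("utf-8")) == 500_000
```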
Scenario 1: False offload trigger
400K CJK characters → `text_len` = 1.2M bytes > `offload_threshold` (500K) → triggers GCS upload, even though the text is under the 500K character limit. The same text in ASCII would stay inline.
Scenario 2: Failed offload falls back incorrectly
If the GCS upload fails (lines 1460-1466), the fallback calls `_truncate(part.text)`, which uses `len(text)` (characters). The 400K-character text passes the character check and stays inline untruncated — but it was supposed to be offloaded because of its byte size.
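A sketch of that fallback behavior (the function name and default are illustrative, not the plugin's actual code; the body mirrors the character-based `_truncate` check quoted above):

```python
def truncate_fallback(text: str, max_length: int = 500_000) -> str:
    # Mirrors the character-based _truncate check (line 1370, per the report).
    if max_length != -1 and len(text) > max_length:
        return text[:max_length] + "...[TRUNCATED]"
    return text

# 400K CJK chars: the ~1.2MB byte size triggered the offload attempt, but
# after a failed upload the character check lets the text through untouched.
cjk = "\u6f22" * 400_000
assert truncate_fallback(cjk) == cjk  # stays inline, untruncated
```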
Scenario 3: Threshold comparison mixes units
`offload_threshold = min(inline_text_limit, max_length)` at lines 1436-1438 compares `inline_text_limit` (32KB, bytes-intended) with `max_length` (character count). For ASCII these are equivalent; for multi-byte content they diverge.
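Illustrating the unit mismatch with the values from the report (assuming 32KB means 32 * 1024; this is a sketch, not plugin code):

```python
inline_text_limit = 32 * 1024   # 32768, bytes-intended (hardcoded threshold)
max_length = 500_000            # characters (max_content_length)
offload_threshold = min(inline_text_limit, max_length)  # 32768 — of which unit?

# Read as bytes (as the line-1433 check does), the same threshold admits
# roughly 3x fewer CJK characters than ASCII characters before offload:
ascii_chars_before_offload = offload_threshold       # 1 byte per char
cjk_chars_before_offload = offload_threshold // 3    # 3 bytes per char
assert cjk_chars_before_offload < ascii_chars_before_offload
```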
Double truncation path
Some callbacks pre-truncate `raw_content` before passing it to `parser.parse()`:

- `_format_content_safely` (`bigquery_agent_analytics_plugin.py:2082`) truncates content into a string
- `parser.parse()` in `_log_event` (`bigquery_agent_analytics_plugin.py:2876`) truncates again via `_recursive_smart_truncate`
For ASCII this is idempotent (truncating at the same character position twice). For multi-byte content, the two truncation passes could cut at different positions if one measures characters and the other's threshold was derived from a byte comparison.
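The idempotency of the character-based path can be checked directly. This sketch uses the `_truncate` logic quoted above; applying it twice at the same character limit re-cuts at the same position, so the output is unchanged:

```python
def truncate(text: str, max_len: int) -> str:
    # Same shape as _truncate (line 1370): character-based cut plus marker.
    if max_len != -1 and len(text) > max_len:
        return text[:max_len] + "...[TRUNCATED]"
    return text

s = "a" * 100
once = truncate(s, 10)
twice = truncate(once, 10)
# The second pass cuts once[:10], which equals s[:10], then re-appends
# the marker — so the result is identical.
assert once == twice == "a" * 10 + "...[TRUNCATED]"
```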
Proposed fix
Normalize to one unit. Two options:
Option A: Use characters throughout (simpler, matches Python semantics)
Change the GCS offload check to use `len(part.text)` instead of `len(part.text.encode("utf-8"))`:

```python
# Line 1433:
text_len = len(part.text)  # characters, consistent with _truncate and max_content_length
```
Pro: consistent with `_truncate`, `_recursive_smart_truncate`, and `max_content_length` semantics. Simple one-line fix.
Con: doesn't account for actual GCS upload size (multi-byte text uploads more bytes than the character count suggests).
Option B: Use bytes throughout (controls actual storage size)
Change _truncate, _recursive_smart_truncate, and max_content_length documentation to specify bytes. This is a larger change that affects the entire plugin.
Recommendation: Option A — it's a one-line fix, consistent with the rest of the plugin, and `max_content_length` is already documented/used as a character count. The GCS upload size difference for multi-byte content is a minor overhead, not a correctness issue.
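With Option A applied, both the offload check and the truncation path measure characters, so equal-length ASCII and CJK strings get the same decision. A sketch (function name and threshold are illustrative):

```python
def should_offload_chars(text: str, threshold: int) -> bool:
    # Option A: measure in characters, matching _truncate and max_content_length.
    return len(text) > threshold

threshold = 500_000
cjk = "\u6f22" * 400_000
ascii_text = "a" * 400_000
# Same character count -> same decision, regardless of script.
assert not should_offload_chars(cjk, threshold)
assert not should_offload_chars(ascii_text, threshold)
```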
Affected code paths
| Location | Unit | Used for |
|---|---|---|
| `_parse_content_object:1433` | Bytes | GCS offload threshold check |
| `_parse_content_object:1436-1438` | Mixed | `min(inline_text_limit, max_length)` — bytes vs characters |
| `_truncate:1370` | Characters | Inline text truncation |
| `_recursive_smart_truncate:309` | Characters | Dict/list string value truncation |
| `BigQueryLoggerConfig.max_content_length:570` | Characters | Config value (500KB default) |
| `HybridContentParser.inline_text_limit:1367` | Bytes-intended | 32KB hardcoded threshold |