Bug: Mixed byte/character units in GCS text offload threshold
Validated against: main at 2d61cb69
Problem
`HybridContentParser._parse_content_object` (CASE C, text handling) measures text size in bytes for the GCS offload decision but compares it against thresholds derived from `max_content_length`, which is a character count used everywhere else in the plugin.
GCS offload uses bytes (`bigquery_agent_analytics_plugin.py:1433`):

```python
text_len = len(part.text.encode("utf-8"))  # BYTES
```

Offload threshold derived from character-based config (`bigquery_agent_analytics_plugin.py:1436-1438`):

```python
offload_threshold = self.inline_text_limit  # 32KB — bytes-intended
if self.max_length != -1 and self.max_length < offload_threshold:
    offload_threshold = self.max_length  # max_content_length — CHARACTERS
```

`_truncate` uses characters (`bigquery_agent_analytics_plugin.py:1370`):

```python
if self.max_length != -1 and len(text) > self.max_length:  # CHARACTERS
    return text[:self.max_length] + "...[TRUNCATED]"
```

`_recursive_smart_truncate` uses characters (`bigquery_agent_analytics_plugin.py:309`):

```python
if max_len != -1 and len(obj) > max_len:  # CHARACTERS
    return obj[:max_len] + "...[TRUNCATED]"
```
Impact
For multi-byte content (CJK, emoji, Arabic, etc.), a string can be under the character limit but over the byte limit:
| Example | Characters | UTF-8 Bytes | `max_content_length` (500K) |
|---|---|---|---|
| 400K CJK characters | 400,000 | ~1,200,000 | Under char limit, over byte limit |
| 500K ASCII characters | 500,000 | 500,000 | At both limits |
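The divergence in the table is easy to reproduce with a standalone check (not plugin code; U+6F22 is a representative 3-byte CJK character):

```python
# Character count vs UTF-8 byte count for the two table rows above.
cjk = "\u6f22" * 400_000    # a common CJK character, 3 bytes in UTF-8
ascii_text = "a" * 500_000

assert len(cjk) == 400_000                    # under the 500K character limit
assert len(cjk.encode("utf-8")) == 1_200_000  # well over 500K bytes
# For ASCII, characters and bytes coincide:
assert len(ascii_text) == len(ascii_text.encode("utf-8")) == 500_000
```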
Scenario 1: False offload trigger
400K CJK characters → `text_len` = 1.2M bytes > `offload_threshold` (500K) → triggers GCS upload, even though the text is under the 500K character limit. The same text in ASCII would stay inline.
Scenario 2: Failed offload falls back incorrectly
If the GCS upload fails (lines 1460-1466), the fallback calls `_truncate(part.text)`, which uses `len(text)` (characters). The 400K-character text passes the character check and stays inline untruncated — but it was supposed to be offloaded because of its byte size.
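A sketch of that fallback behavior (the function name and default are illustrative, not the plugin's actual code; the body mirrors the character-based `_truncate` check quoted above):

```python
def truncate_fallback(text: str, max_length: int = 500_000) -> str:
    # Mirrors the character-based _truncate check (line 1370, per the report).
    if max_length != -1 and len(text) > max_length:
        return text[:max_length] + "...[TRUNCATED]"
    return text

# 400K CJK chars: the ~1.2MB byte size triggered the offload attempt, but
# after a failed upload the character check lets the text through untouched.
cjk = "\u6f22" * 400_000
assert truncate_fallback(cjk) == cjk  # stays inline, untruncated
```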
Scenario 3: Threshold comparison mixes units
`offload_threshold = min(inline_text_limit, max_length)` at lines 1436-1438 compares `inline_text_limit` (32KB, bytes-intended) with `max_length` (character count). For ASCII these are equivalent; for multi-byte content they diverge.
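Illustrating the unit mismatch with the values from the report (assuming 32KB means 32 * 1024; this is a sketch, not plugin code):

```python
inline_text_limit = 32 * 1024   # 32768, bytes-intended (hardcoded threshold)
max_length = 500_000            # characters (max_content_length)
offload_threshold = min(inline_text_limit, max_length)  # 32768 — of which unit?

# Read as bytes (as the line-1433 check does), the same threshold admits
# roughly 3x fewer CJK characters than ASCII characters before offload:
ascii_chars_before_offload = offload_threshold       # 1 byte per char
cjk_chars_before_offload = offload_threshold // 3    # 3 bytes per char
assert cjk_chars_before_offload < ascii_chars_before_offload
```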
Double truncation path
Some callbacks pre-truncate `raw_content` before passing it to `parser.parse()`:

- `_format_content_safely` (`bigquery_agent_analytics_plugin.py:2082`) truncates content into a string
- `parser.parse()` in `_log_event` (`bigquery_agent_analytics_plugin.py:2876`) truncates again via `_recursive_smart_truncate`
For ASCII this is idempotent (truncating at the same character position twice). For multi-byte content, the two truncation passes could cut at different positions if one measures characters and the other's threshold was derived from a byte comparison.
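The idempotency of the character-based path can be checked directly. This sketch uses the `_truncate` logic quoted above; applying it twice at the same character limit re-cuts at the same position, so the output is unchanged:

```python
def truncate(text: str, max_len: int) -> str:
    # Same shape as _truncate (line 1370): character-based cut plus marker.
    if max_len != -1 and len(text) > max_len:
        return text[:max_len] + "...[TRUNCATED]"
    return text

s = "a" * 100
once = truncate(s, 10)
twice = truncate(once, 10)
# The second pass cuts once[:10], which equals s[:10], then re-appends
# the marker — so the result is identical.
assert once == twice == "a" * 10 + "...[TRUNCATED]"
```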
Proposed fix
Normalize to one unit. Two options:
Option A: Use characters throughout (simpler, matches Python semantics)
Change the GCS offload check to use `len(part.text)` instead of `len(part.text.encode("utf-8"))`:

```python
# Line 1433:
text_len = len(part.text)  # characters, consistent with _truncate and max_content_length
```
Pro: consistent with `_truncate`, `_recursive_smart_truncate`, and `max_content_length` semantics. Simple one-line fix.
Con: doesn't account for actual GCS upload size (multi-byte text uploads more bytes than the character count suggests).
Option B: Use bytes throughout (controls actual storage size)
Change _truncate, _recursive_smart_truncate, and max_content_length documentation to specify bytes. This is a larger change that affects the entire plugin.
Recommendation: Option A — it's a one-line fix, consistent with the rest of the plugin, and `max_content_length` is already documented/used as a character count. The GCS upload size difference for multi-byte content is a minor overhead, not a correctness issue.
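With Option A applied, both the offload check and the truncation path measure characters, so equal-length ASCII and CJK strings get the same decision. A sketch (function name and threshold are illustrative):

```python
def should_offload_chars(text: str, threshold: int) -> bool:
    # Option A: measure in characters, matching _truncate and max_content_length.
    return len(text) > threshold

threshold = 500_000
cjk = "\u6f22" * 400_000
ascii_text = "a" * 400_000
# Same character count -> same decision, regardless of script.
assert not should_offload_chars(cjk, threshold)
assert not should_offload_chars(ascii_text, threshold)
```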
Affected code paths
| Location | Unit | Used for |
|---|---|---|
| `_parse_content_object:1433` | Bytes | GCS offload threshold check |
| `_parse_content_object:1436-1438` | Mixed | `min(inline_text_limit, max_length)` — bytes vs characters |
| `_truncate:1370` | Characters | Inline text truncation |
| `_recursive_smart_truncate:309` | Characters | Dict/list string value truncation |
| `BigQueryLoggerConfig.max_content_length:570` | Characters | Config value (500KB default) |
| `HybridContentParser.inline_text_limit:1367` | Bytes-intended | 32KB hardcoded threshold |