feat: echo STT transcripts to thread before agent reply#571

Open
dogzzdogzz wants to merge 1 commit into openabdev:main from
dogzzdogzz:feat/stt-transcript-echo

Conversation

@dogzzdogzz
Contributor

Summary

When STT transcribes a voice message, post the transcript back to the thread (no mentions) before the agent reply so users can verify what was heard. Discord and Slack today; platform-agnostic helper means future adapters get it for free.

  • One thread message per user message, containing one `> 🎤 <transcript>` line per clip.
  • Failure → > 🎤 (transcription failed) line + ⚠️ reaction on the user's original message.
  • Opt-out via [stt] echo_transcript = false (default true, mirrored as stt.echoTranscript in Helm values).
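The message format described above can be sketched roughly as follows. This is an illustration only: the variant names on `EchoEntry` and the `Option<String>` return type are assumptions, and the PR's actual `format_echo_message` may differ.

```rust
/// Outcome of transcribing one audio clip.
/// Hypothetical shape: the PR's `EchoEntry` enum may use other variants or fields.
#[derive(Debug, Clone, PartialEq)]
pub enum EchoEntry {
    /// STT succeeded and produced this text.
    Transcript(String),
    /// STT failed for this clip.
    Failed,
}

/// Build the single thread message: one blockquote line per clip, in the
/// order the entries were collected. Returns `None` when there is nothing
/// to echo, so the caller can skip posting entirely.
pub fn format_echo_message(entries: &[EchoEntry]) -> Option<String> {
    if entries.is_empty() {
        return None;
    }
    let lines: Vec<String> = entries
        .iter()
        .map(|entry| match entry {
            EchoEntry::Transcript(text) => format!("> 🎤 {}", text.trim()),
            EchoEntry::Failed => "> 🎤 (transcription failed)".to_string(),
        })
        .collect();
    Some(lines.join("\n"))
}
```

With two clips where the second one fails, this sketch yields a two-line message: the transcript line followed by the failure line.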

Closes #570.

Originally requested in Discord: https://discord.com/channels/1491295327620169908/1491365150664560881/1497784772230123560

Architecture

`stt::post_echo(&Arc<dyn ChatAdapter>, &ChannelRef, &MessageRef, &[EchoEntry], &SttConfig)` is the platform-agnostic helper. Discord (`src/discord.rs`) and Slack (`src/slack.rs`) collect a `Vec<EchoEntry>` while iterating audio attachments and call the helper before the agent dispatch. Gateway-based platforms (LINE / Telegram / future Teams) are intentionally not wired today because their protocol carries text only. The helper signature will not need to change when audio plumbing lands there later.
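A simplified sketch of how such a helper could be shaped is shown below. It is synchronous to stay dependency-free, whereas the real helper is presumably async. The trait methods `send_message` and `add_reaction`, the tuple-struct reference types, the error type, and this `MockAdapter` are all assumptions made for illustration, not the project's actual API.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the project's real types.
pub struct ChannelRef(pub String);
pub struct MessageRef(pub String);
pub struct SttConfig {
    pub echo_transcript: bool,
}
pub enum EchoEntry {
    Transcript(String),
    Failed,
}

/// Assumed minimal adapter surface; the real `ChatAdapter` trait will differ.
pub trait ChatAdapter: Send + Sync {
    fn send_message(&self, channel: &ChannelRef, text: &str) -> Result<(), String>;
    fn add_reaction(&self, message: &MessageRef, emoji: &str) -> Result<(), String>;
}

/// Post one echo message for all clips, then flag the original message if any clip failed.
pub fn post_echo(
    adapter: &Arc<dyn ChatAdapter>,
    channel: &ChannelRef,
    original: &MessageRef,
    entries: &[EchoEntry],
    config: &SttConfig,
) -> Result<(), String> {
    // Nothing to do when the feature is disabled or no audio was processed.
    if !config.echo_transcript || entries.is_empty() {
        return Ok(());
    }
    let any_failed = entries.iter().any(|e| matches!(e, EchoEntry::Failed));
    let lines: Vec<String> = entries
        .iter()
        .map(|e| match e {
            EchoEntry::Transcript(t) => format!("> 🎤 {}", t),
            EchoEntry::Failed => "> 🎤 (transcription failed)".to_string(),
        })
        .collect();
    adapter.send_message(channel, &lines.join("\n"))?;
    if any_failed {
        adapter.add_reaction(original, "⚠️")?;
    }
    Ok(())
}

/// Recording adapter for tests (illustrative, not the PR's `MockAdapter`).
#[derive(Default)]
pub struct MockAdapter {
    pub sent: Mutex<Vec<String>>,
    pub reactions: Mutex<Vec<String>>,
}

impl ChatAdapter for MockAdapter {
    fn send_message(&self, _channel: &ChannelRef, text: &str) -> Result<(), String> {
        self.sent.lock().unwrap().push(text.to_string());
        Ok(())
    }
    fn add_reaction(&self, _message: &MessageRef, emoji: &str) -> Result<(), String> {
        self.reactions.lock().unwrap().push(emoji.to_string());
        Ok(())
    }
}
```

Because the helper only depends on the trait object, a new adapter would get the echo behavior by implementing the two methods and passing its own channel and message references.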

Files changed

  • `src/config.rs` — `SttConfig.echo_transcript: bool` (default `true`).
  • `src/stt.rs` — `EchoEntry` enum, `format_echo_message`, `post_echo` with `MockAdapter`-driven tests.
  • `src/discord.rs`, `src/slack.rs` — wire echo into the audio attachment loop, call `post_echo` before `router.handle_message`.
  • `charts/openab/values.yaml`, `charts/openab/templates/configmap.yaml` — expose `echoTranscript` (default `true`, `hasKey` guard preserves the default while distinguishing unset vs. explicit `false`).
  • `docs/stt.md`, `docs/config-reference.md` — document `echo_transcript`.
  • `docs/superpowers/specs/` and `docs/superpowers/plans/` — design spec + TDD-style implementation plan that drove this work.
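The reason for distinguishing an unset value from an explicit `false` can be illustrated in Rust terms. A "use the default when the value is falsy" rule would silently turn an explicit `false` back into `true`, so the setting needs three states. The sketch below assumes the default of `true` stated in the PR description; `SttValues`, `resolve_echo_transcript`, and `render_config_line` are hypothetical names, not part of the chart or the codebase.

```rust
/// Hypothetical mirror of the Helm value: `None` means the key was never set.
pub struct SttValues {
    pub echo_transcript: Option<bool>,
}

/// Apply the default only when the key is absent, so an explicit `false` survives.
pub fn resolve_echo_transcript(values: &SttValues) -> bool {
    values.echo_transcript.unwrap_or(true)
}

/// Render the line that would end up in the `[stt]` section of the config.
pub fn render_config_line(values: &SttValues) -> String {
    format!("echo_transcript = {}", resolve_echo_transcript(values))
}
```

An unset value renders as `echo_transcript = true`, while `Some(false)` renders as `echo_transcript = false`, which matches the `helm template` behavior listed in the test plan.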

Test plan

  • `cargo test --bin openab` — 133/133 pass (10 in `stt::tests` cover format, post_echo success, failure, mixed, disabled config, empty entries).
  • `cargo clippy --all-targets -- -D warnings` — clean.
  • `helm lint charts/openab` — clean.
  • `helm template ...` with default values renders `echo_transcript = true`; with `--set agents.kiro.stt.echoTranscript=false` renders `echo_transcript = false`.
  • Manual smoke test: send a voice message in Discord, then verify the bot posts `> 🎤 <transcript>` before the agent's reply.
  • Manual smoke test: same in Slack.
  • Manual smoke test: simulate STT failure (e.g. revoke API key briefly or attach an unsupported file) — verify the failure line + ⚠️ reaction.

Out of scope / follow-ups

  • LINE / Telegram / Teams via gateway — those need audio plumbing in the gateway protocol first. The helper signature accommodates them when that work lands.
  • Multi-clip ordering: `extra_blocks.insert(0, …)` reverses transcript order in the agent prompt while `echo_entries.push(…)` preserves upload order. Pre-existing in the agent-prompt path; out of scope for this PR.
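The ordering difference noted above comes down to `Vec::insert(0, …)` versus `Vec::push(…)`. A minimal sketch, with hypothetical function names and plain strings standing in for the real block and entry types:

```rust
/// Mimics the agent-prompt path: each new transcript is inserted at the front,
/// so the last clip processed ends up first.
pub fn prompt_order(clips: &[&str]) -> Vec<String> {
    let mut extra_blocks = Vec::new();
    for clip in clips {
        extra_blocks.insert(0, clip.to_string());
    }
    extra_blocks
}

/// Mimics the echo path: each new entry is appended, so upload order is kept.
pub fn echo_order(clips: &[&str]) -> Vec<String> {
    let mut echo_entries = Vec::new();
    for clip in clips {
        echo_entries.push(clip.to_string());
    }
    echo_entries
}
```

For clips uploaded as first, second, third, the prompt path sees third, second, first, while the echo shows them in upload order.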

🤖 Generated with Claude Code

@dogzzdogzz dogzzdogzz requested a review from thepagent as a code owner April 26, 2026 03:18
@github-actions github-actions Bot added the pending-screening (PR awaiting automated screening) label Apr 26, 2026
@dogzzdogzz dogzzdogzz force-pushed the feat/stt-transcript-echo branch 2 times, most recently from ee10184 to 7f74166 Compare April 26, 2026 03:34
When STT transcribes a voice message, optionally post the transcript back
to the thread (no mentions) before the agent reply so users can verify what
was heard. Default is OFF — opt in via [stt] echo_transcript = true.

- New config: [stt] echo_transcript (default false, opt-in)
- New helper: stt::post_echo with platform-agnostic ChatAdapter handle —
  future LINE/Telegram/Teams adapters get echo for free
- Format: > 🎤 <transcript> per clip, all in one thread message
- Failure: > 🎤 (transcription failed) line + ⚠️ reaction on the user msg
- Helm: agents.<name>.stt.echoTranscript (camelCase) wired through configmap
- Docs: docs/stt.md and docs/config-reference.md updated

Rebased on top of openabdev#567 (gateway config rendering).

Tests: 133/133 cargo. helm-unittest: 28/28. Clippy --all-targets -D warnings clean.
@shaun-agent
Contributor

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report

Intent

PR #571 makes voice-message STT behavior visible to users by posting the transcript back into the same Discord or Slack thread before the agent replies. The operator-visible problem is that users currently have no quick way to confirm what the bot heard before it acts on the transcription.

Feat

Feature. It adds configurable STT transcript echoing for Discord and Slack:

  • Successful clip: > 🎤 <transcript>
  • Failed clip: > 🎤 (transcription failed) plus a warning reaction on the original message
  • Config opt-out: [stt] echo_transcript = false
  • Helm exposure: stt.echoTranscript

Who It Serves

Primary beneficiaries: Discord and Slack end users.

Secondary beneficiaries: maintainers and deployers, because the behavior is centralized behind a platform-agnostic STT helper and configurable through normal config and Helm paths.

Rewritten Prompt

Implement configurable STT transcript echoing for voice-message workflows.

When Discord or Slack audio attachments are transcribed, post one transcript echo message into the same thread before dispatching the agent reply. Preserve upload order. Use the format > 🎤 <transcript> for successful clips and > 🎤 (transcription failed) for failed clips. On failure, also add a warning reaction to the original user message where the adapter supports it.

Add stt.echo_transcript, defaulting to true, with Helm value support as stt.echoTranscript. Keep the echo logic platform-agnostic so future adapters can reuse it when audio support exists. Add unit tests for formatting, disabled config, success, failure, mixed results, and empty input. Update STT and config documentation.

Merge Pitch

This is a useful UX improvement with a modest implementation surface. It makes STT behavior auditable in the conversation itself and reduces confusion when the agent responds to a misheard voice message.

Risk is moderate-low. The likely reviewer concerns are message ordering, accidental mentions or formatting injection, adapter-specific failure behavior, and whether the echo happens exactly once per user message before the agent response.
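To make the mention and formatting concern concrete, here is one possible way to neutralize a transcript before it is placed in the blockquote. This is purely a sketch of the idea and not something the PR is known to do: `sanitize_transcript` is a hypothetical helper, escaping rules differ per platform, and adapters may instead rely on platform features such as Discord's `allowed_mentions` setting.

```rust
/// Hypothetical helper: make a transcript safe to embed in a single
/// `> 🎤 ...` blockquote line.
pub fn sanitize_transcript(raw: &str) -> String {
    // Collapse all whitespace, including newlines, into single spaces so the
    // text cannot break out of the blockquote onto a new line.
    let single_line = raw.split_whitespace().collect::<Vec<_>>().join(" ");
    // Insert a zero-width space after every '@' so strings such as
    // "@everyone" are no longer contiguous mention tokens.
    single_line.replace('@', "@\u{200B}")
}
```

A transcript containing a newline followed by `# heading` would stay on one quoted line instead of starting a new, unquoted block.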

Best-Practice Comparison

OpenClaw principles that fit:

  • Explicit delivery routing is relevant. The echo must go to the same channel/thread as the original user message, not a guessed destination.
  • Run visibility is relevant in spirit. Echoing transcripts gives users a lightweight conversational audit trail of what STT produced.
  • Retry/backoff is only partly relevant. Echo failure should probably not block agent dispatch unless the product explicitly wants that.

OpenClaw principles that do not strongly fit:

  • Gateway-owned scheduling, durable job persistence, and isolated executions are not central here because this is synchronous chat-message handling, not scheduled work.

Hermes Agent principles that fit:

  • Self-contained prompts are indirectly relevant. The agent should still receive a clear transcript block independent of whether the user-facing echo succeeds.
  • Atomic persisted state is not directly applicable unless transcript echo state becomes durable later.

Hermes Agent principles that do not strongly fit:

  • Gateway daemon ticks, file locking, fresh scheduled sessions, and schedule overlap prevention are not relevant to this PR’s core behavior.

Implementation Options

Option 1: Conservative adapter-local echo
Wire transcript echo directly in Discord and Slack handlers with minimal shared code. Keep config support but avoid a broad helper abstraction.

Option 2: Balanced shared STT helper
Use the current proposed shape: EchoEntry, shared formatting, post_echo, config default-on behavior, Discord/Slack wiring, Helm/docs/tests. Keep gateway platforms out of scope until they have audio plumbing.

Option 3: Ambitious cross-platform transcript event model
Add a first-class transcript event path across adapters/gateway, with durable echo state, retry/backoff, delivery logs, and future support for LINE, Telegram, and Teams when audio transport exists.

Comparison Table

| Option | Speed to ship | Complexity | Reliability | Maintainability | User impact | Fit for OpenAB right now |
| --- | --- | --- | --- | --- | --- | --- |
| Conservative adapter-local echo | High | Low | Medium | Medium-low | Medium | Good for quick patch, weaker long-term shape |
| Balanced shared STT helper | Medium-high | Medium | High | High | High | Best fit |
| Ambitious transcript event model | Low | High | Highest if completed | Medium-high | Highest long term | Too large for this PR |

Recommendation

Advance the balanced shared-helper approach.

It gives users the visible transcript behavior now, keeps the implementation reviewable, and leaves a clean extension point for future adapters without forcing gateway/media architecture work into this PR. For merge discussion, focus review on ordering, no-mention formatting, failure behavior, and whether echo-post failures should be non-blocking.
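The non-blocking question can be sketched as follows: if the echo post fails, the failure is logged and the agent dispatch still runs. The function name and the closure-based shape are assumptions chosen to keep the example self-contained; the real handlers call `post_echo` and `router.handle_message` directly.

```rust
/// Run the echo step, then always dispatch to the agent.
/// `post_echo` and `dispatch` are passed as closures only for illustration.
pub fn handle_voice_message<E, D>(post_echo: E, dispatch: D) -> String
where
    E: FnOnce() -> Result<(), String>,
    D: FnOnce() -> String,
{
    // An echo failure is reported but deliberately not propagated.
    if let Err(err) = post_echo() {
        eprintln!("transcript echo failed, continuing with dispatch: {}", err);
    }
    dispatch()
}
```

The alternative, returning early on an echo error, would mean a transient posting failure suppresses the agent reply altogether, which is the trade-off reviewers are asked to weigh.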

Follow-up work should be split separately for gateway audio support, durable delivery tracking, and the pre-existing multi-clip prompt-ordering issue.

Labels

feature, needs-rebase, p2 (Medium: planned work), pending-contributor, pending-screening (PR awaiting automated screening)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

STT: echo transcript to thread before agent reply (Discord + Slack)

3 participants