[aw-failures] Failure Investigator (6h) — 14 failures, 4 clusters (2026-05-26 08:00–14:00 UTC) #34938

@github-actions

Executive Summary

14 agentic-workflow runs failed in github/gh-aw between 2026-05-26 09:53 UTC and 13:27 UTC. Failures fall into 4 clusters: two are transient infrastructure incidents (recovered), one is a P0 token/identity issue that warrants follow-up, and one is a pair of unrelated workflow-specific errors. One of the infrastructure clusters (squid healthcheck) is already covered by tracking issue #34920.

| Severity | Cluster | Runs | Window (UTC) | Cause |
|---|---|---|---|---|
| P0 | Git access 403: "Your account was suspended" | 3 | 10:28–10:39 | App/token suspension during checkout |
| P1 | codeload.github.com action archive 403 | 8 | 12:21–12:47 | Transient GitHub Actions CDN outage |
| P1 | awf-squid container unhealthy | 1 | 13:27 | Already tracked in #34920 |
| P2 | Workflow-specific agent-step errors | 2 | 09:53–09:56 | Per-workflow bugs (Daily News, Daily AW Cross-Repo) |

Failure Clusters (Detail)

P0 — Cluster B: Git 403 / "Your account was suspended"

Three consecutive runs across three workflows failed at checkout/setup with HTTP 403 and the warning "Repository permission check failed: Sorry. Your account was suspended."

| Run | Workflow | Started |
|---|---|---|
| §26446905995 | AI Moderator | 10:28:55Z |
| §26447363137 | Package Specification Extractor | 10:39:07Z |
| §26447374892 | Functional Pragmatist | 10:39:24Z |

Tracked separately in sub-issue #34940.

P1 — Cluster A: codeload.github.com archive 403

Eight scheduled workflows failed within ~26 minutes at the Actions activation stage, all with the same two error lines:

##[error]An action could not be found at the URI 'https://codeload.github.com/actions/github-script/tar.gz/3a2844b7e9c422d3c10d287c895573f7108da1b3'
##[error]Failed to download archive 'https://codeload.github.com/actions/github-script/tar.gz/3a2844b7e9c422d3c10d287c895573f7108da1b3' after 1 attempts.

This is a GitHub Actions infrastructure failure (codeload CDN), not a gh-aw defect. All eight workflows requested the same pinned SHA of actions/github-script. Subsequent runs of the same workflows succeeded, so the outage was transient.

Affected runs

| Run | Workflow | Started |
|---|---|---|
| §26447754048 | Typist - Go Type Analysis | 12:21:22Z |
| §26447848291 | Constraint Solving — Problem of the Day | 12:23:29Z |
| §26447851438 | Package Specification Enforcer | 12:23:33Z |
| §26448069508 | Daily Agentic Workflow Token Usage Audit | 12:28:24Z |
| §26448145685 | PR Sous Chef | 12:30:05Z |
| §26448217737 | Daily Go Function Namer | 12:31:35Z |
| §26448750987 | Daily Token Consumption Report (Sentry OTel) | 12:44:12Z |
| §26448892944 | GitHub MCP Structural Analysis | 12:47:35Z |

No gh-aw code change is required. The runner logs report "after 1 attempts", but the action download is internal to the GitHub runner, so gh-aw cannot directly add retries there. Recommendation: monitor only; no sub-issue created.
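
For monitoring, the claim that every affected run requested the same pinned SHA can be checked mechanically. The following is a minimal sketch, not part of gh-aw: the function name `count_failed_archives` is hypothetical, and the regular expression assumes the URL layout seen in the error line quoted above.

```python
import re
from collections import Counter

# Matches the runner error line quoted above. The URL layout
# (owner/repo/tar.gz/sha) is an assumption taken from that line.
CODELOAD_ERROR = re.compile(
    r"Failed to download archive "
    r"'https://codeload\.github\.com/"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+)/tar\.gz/(?P<sha>[0-9a-f]{40})'"
)


def count_failed_archives(log_lines):
    """Count archive download failures per (action, sha) in runner log lines."""
    counts = Counter()
    for line in log_lines:
        match = CODELOAD_ERROR.search(line)
        if match:
            action = f"{match['owner']}/{match['repo']}"
            counts[(action, match["sha"])] += 1
    return counts
```

If the logs of all eight runs yield a single key, the failures share one archive, which is consistent with a codeload-side outage rather than a per-workflow problem.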

P1 — Cluster C: awf-squid healthcheck failure

Single run §26450931415 (Auto-Triage Issues). awf-squid started but went unhealthy after 10s:

Container awf-squid  Starting
Container awf-squid  Started
Container awf-squid  Waiting
Container awf-squid  Error
dependency failed to start: container awf-squid is unhealthy
[ERROR] Failed to start containers: Error: Command failed with exit code 1: docker compose up -d --pull never

Firewall version 0.25.55. Already tracked in #34920. Single occurrence — not creating a new sub-issue. If recurrence is observed within 24h, escalate to P0.
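
The escalation rule above can be stated precisely. This is an illustrative sketch only: `squid_severity` is a hypothetical helper, and treating a gap of exactly 24h as a recurrence is an assumption.

```python
from datetime import timedelta

# Recurrence window from the rule above.
RECURRENCE_WINDOW = timedelta(hours=24)


def squid_severity(failure_times):
    """Return "P0" if two failures occur within 24h of each other, else "P1"."""
    ordered = sorted(failure_times)
    # Only adjacent failures need comparing once the times are sorted.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier <= RECURRENCE_WINDOW:
            return "P0"
    return "P1"
```

With only the single 13:27 failure recorded, the function returns "P1", matching the current classification.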

P2 — Cluster D: Workflow-specific agent errors

Run Workflow Engine Cause
§26445297038 Daily News copilot Exit 127 (command not found) in agent step
§26445441719 Daily AW Cross-Repo Compile Check claude Exit 128 (fatal: not in a git directory) in post-agent step

Isolated workflow-specific failures. Not creating sub-issues — each workflow's owner should investigate when the next failure occurs. Recording here for visibility.

Existing Issue Correlation

Existing issue State Action
#34920 "[aw] Auto-Triage Issues failed" Open Leave open — fresh tracking issue (created 13:30Z, same hour as the squid failure)
#34919 "[aw] No-Op Runs" Open Permanent tracker — no action

No prior aw-failure-investigator-tagged tracking issues are currently open (all previous ones have been closed). No issues need closing in this run.

Proposed Fix Roadmap

  • P0 Investigate the 11-minute window (10:28–10:39 UTC) of "Your account was suspended" 403s. Determine whether a GitHub App installation token, a PAT, or a bot account was temporarily suspended, and check whether retry-on-403 logic in download_docker_images.sh / checkout would have helped. See sub-issue #34940, "[aw-failures] 3 runs failed with Your account was suspended 403 (2026-05-26 10:28–10:39 UTC)".
  • P1 Monitor for re-occurrence of awf-squid healthcheck failures on firewall 0.25.55. If it recurs, escalate via #34920.
  • P1 (Cluster A) No action — transient codeload outage, workflows recovered. Optional: add a comment in the gh-aw compile docs noting that GitHub-Actions-side codeload outages will hit all pinned-SHA workflows simultaneously.
  • P2 Daily News + Daily AW Cross-Repo Compile Check failures — owners to investigate independently when next failure occurs.
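
To help evaluate the retry-on-403 question in the P0 item, here is a minimal sketch of retry with exponential backoff. It does not reflect the actual contents of download_docker_images.sh or the checkout step: `Forbidden`, `retry_on_403`, and the default attempt count and delays are all hypothetical.

```python
import time


class Forbidden(Exception):
    """Hypothetical error type standing in for an HTTP 403 response."""


def retry_on_403(operation, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call operation(); on Forbidden, wait and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Forbidden:
            if attempt == attempts:
                raise  # out of attempts: surface the 403 to the caller
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With these defaults the total wait is 6 seconds. Since the suspension window here lasted roughly 11 minutes, a short retry budget like this one would likely not have rescued the affected runs; it would only help if the suspension cleared within the backoff period.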

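Sub-Issues Created

  • #34940 "[aw-failures] 3 runs failed with Your account was suspended 403 (2026-05-26 10:28–10:39 UTC)"
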