fix: add retry for notification opened confirmation#2606

Open
nan-li wants to merge 1 commit into main from fix/add_retry_to_notification_opened_request

Conversation

Contributor

@nan-li nan-li commented Apr 7, 2026

Description

One Line Summary

Add in-memory retry logic for the notification opened confirmation PUT request.

Details

Motivation

Customer reports show the PUT /notifications/<id> request (confirming a notification was opened) can fail silently due to stale HTTP keep-alive connections, resulting in SocketException, ConnectException, or SocketTimeoutException with statusCode: -1 and no retry. This means opened events are lost with no recovery path.

Other SDK flows already have retry mechanisms — InAppBackendService.attemptFetchWithRetries for IAM fetches and PushRegistratorAbstractGoogle.registerInBackground for push registration — but the notification opened confirmation was fire-and-forget with no retries.

Scope

  • Adds a confirmNotificationOpened method to NotificationLifecycleService that wraps the existing updateNotificationAsOpened call with retry logic.
  • Retries up to NetworkUtils.maxNetworkRequestAttemptCount (3) times for retryable errors (network failures, 429, 5xx).
  • Non-retryable errors (400, 401, 403, 404, 409, 410) are thrown immediately without retry.
  • Delay between retries uses maxOf(retryAfterSeconds, linear backoff), consistent with OperationRepo.delayBeforeNextExecution.
  • Linear backoff: attempt * 15s, i.e. 15s after the first failed attempt and 30s after the second.
  • The retry is in-memory only (does not persist across app kills). The call remains fire-and-forget via suspendifyWithErrorHandling so it does not block the rest of the notification opened flow.
  • No changes to the public API or other notification flows.
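To make the retry policy above concrete, here is a minimal Kotlin sketch of the backoff and retry decision. This is illustrative only: `backoffSeconds` and `shouldRetry` are hypothetical helpers, not the actual SDK code, which lives in `NotificationLifecycleService.confirmNotificationOpened`.

```kotlin
// Illustrative sketch of the retry policy described in the PR; not the SDK source.

// Mirrors NetworkUtils.maxNetworkRequestAttemptCount.
const val maxAttempts = 3

// Status codes the PR treats as non-retryable.
val nonRetryable = setOf(400, 401, 403, 404, 409, 410)

// Delay before the next attempt: maxOf(server Retry-After, linear backoff),
// consistent with OperationRepo.delayBeforeNextExecution.
fun backoffSeconds(attempt: Int, retryAfterSeconds: Int?): Int =
    maxOf(retryAfterSeconds ?: 0, attempt * 15) // 15s, 30s for attempts 1, 2

// Retry only while attempts remain and the error is retryable
// (network failures surface with no meaningful status; 429/5xx are retryable).
fun shouldRetry(statusCode: Int, attempt: Int): Boolean =
    attempt < maxAttempts && statusCode !in nonRetryable
```

With these helpers, a failure on attempt 1 with no `Retry-After` header waits 15s, a 429 with `Retry-After: 60` waits 60s, and a 404 stops immediately.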

Alternatives considered

  1. Disabling HTTP keep-alive (Connection: close header) — Adding this header to HttpClient would force a fresh TCP connection per request, avoiding stale pooled connections entirely. This was not chosen because it would affect all SDK HTTP requests globally, adding latency from repeated TCP/TLS handshakes, when the issue only manifests for requests made after an idle gap (e.g. the delay between receiving and opening a notification).
  2. Refactoring into OperationRepo — Enqueuing the opened confirmation as a persistent Operation would give us disk-backed retry that survives app kills, matching the architecture used for user identity and subscription operations. This was not chosen because (a) it's a larger change with more risk for a targeted fix, and (b) other similar fire-and-forget network calls — receive receipts, IAM fetches, push registration — also don't go through OperationRepo and use their own in-memory retry strategies. An in-memory retry here is consistent with those patterns. This can be considered as a larger refactor in the future with other requests.

Testing

Unit testing

Existing NotificationLifecycleServiceTests continue to pass. The retry path exercises the same updateNotificationAsOpened backend call.

Manual testing

Tested by simulating network errors on the notification opened PUT request on an emulator and confirming retries occur with the expected backoff delays in logs.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible.
  • I have reviewed this PR myself, ensuring it meets each checklist item

The PUT request confirming a notification was opened can fail due to
stale HTTP keep-alive connections (SocketException, ConnectException,
SocketTimeoutException), resulting in statusCode: -1 with no retry.
Add retry logic (up to 3 attempts with linear backoff) for retryable
errors so transient network failures don't silently drop opened events.
@nan-li nan-li force-pushed the fix/add_retry_to_notification_opened_request branch from 9ec5c5d to 9e8232a on April 7, 2026 at 21:28

@claude claude bot left a comment



Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 OneSignalSDK/onesignal/notifications/src/main/java/com/onesignal/notifications/internal/lifecycle/impl/NotificationLifecycleService.kt:206-232 — The new confirmNotificationOpened retry loop cannot honor a server-provided OneSignal-Retry-Limit response header because BackendException has no retryLimit field — NotificationBackendService.updateNotificationAsOpened throws BackendException(response.statusCode, response.payload, response.retryAfterSeconds), silently discarding HttpResponse.retryLimit. If the notification-opened endpoint ever starts sending OneSignal-Retry-Limit: 0 or OneSignal-Retry-Limit: 1, the client will still retry up to maxNetworkRequestAttemptCount (3) times in violation of the server's instruction.

    Extended reasoning...

    What the bug is and how it manifests

    HttpClient parses the OneSignal-Retry-Limit response header into HttpResponse.retryLimit. The new confirmNotificationOpened loop catches BackendException to decide whether to retry, but BackendException only carries statusCode, response (payload), and retryAfterSeconds — there is no retryLimit field. The retry cap the server communicated is therefore unreachable inside the loop, and the code always retries up to NetworkUtils.maxNetworkRequestAttemptCount (3).

    The specific code path

    NotificationBackendService.updateNotificationAsOpened (line 46 of the original file) throws BackendException(response.statusCode, response.payload, response.retryAfterSeconds). confirmNotificationOpened (lines 206-232 of the modified file) catches this exception but has no way to read the server-provided retry limit from it, so it always drives through all three attempts regardless of any OneSignal-Retry-Limit header in the response.

    Why existing code doesn't prevent it

    InAppBackendService respects retryLimit by owning its own internal retry loop and reading response.retryLimit directly from HttpResponse (lines 231, 244). It avoids the problem entirely by never throwing BackendException for retryable status codes. The notification-opened path uses the opposite pattern: it delegates to a backend service that throws BackendException, then wraps that exception in a new retry loop — but the exception type loses the field needed to honor the server's cap.

    Addressing the refutation

    The refutation correctly notes that every other backend service (SubscriptionBackendService, UserBackendService, IdentityBackendService, etc.) also drops retryLimit via BackendException, and that there is currently no known evidence the notification-opened PUT endpoint sends OneSignal-Retry-Limit. These are valid points that lower the practical severity. However, this PR introduces a retry loop — none of those other services have one. The inconsistency is new: before this PR there was no retry at all for notification opens, so the missing field was irrelevant. Now that a loop exists, the gap becomes observable if the server ever sends the header for this endpoint.

    Impact

    If the server sends OneSignal-Retry-Limit: 0 (do not retry) or OneSignal-Retry-Limit: 1 (at most one retry) for a notification-opened PUT response, the client will ignore that instruction and retry up to 3 times. This is a protocol violation that could result in unnecessary load on the server and duplicate confirmation attempts during rate-limit scenarios (HTTP 429).

    How to fix it

    Add a retryLimit: Int? field to BackendException and populate it in NotificationBackendService.updateNotificationAsOpened: throw BackendException(response.statusCode, response.payload, response.retryAfterSeconds, retryLimit = response.retryLimit). Then update confirmNotificationOpened to track a mutable effectiveMaxAttempts initialized to maxNetworkRequestAttemptCount and updated on each caught exception: ex.retryLimit?.let { effectiveMaxAttempts = minOf(effectiveMaxAttempts, attempt + it) }.
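A minimal sketch of that suggested fix (hypothetical shapes; the names follow the suggestion above, not the actual SDK source):

```kotlin
// Hypothetical: BackendException extended with the server's retry cap,
// populated from HttpResponse.retryLimit (the OneSignal-Retry-Limit header).
class BackendException(
    val statusCode: Int,
    val payload: String? = null,
    val retryAfterSeconds: Int? = null,
    val retryLimit: Int? = null,
) : Exception()

// Inside the retry loop: shrink the remaining attempt budget whenever the
// server communicates a cap; otherwise keep the current budget unchanged.
fun effectiveMaxAttempts(current: Int, attempt: Int, retryLimit: Int?): Int =
    retryLimit?.let { minOf(current, attempt + it) } ?: current
```

For example, if the first attempt (attempt = 1) fails with `retryLimit = 0`, the budget shrinks from 3 to 1 and the loop stops immediately, honoring the server's instruction.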

    Step-by-step proof

    1. Server receives PUT /notifications/{id}/opened and responds with HTTP 429 and headers OneSignal-Retry-Limit: 0, Retry-After: 60.
    2. HttpClient parses this into HttpResponse(statusCode=429, retryLimit=0, retryAfterSeconds=60, ...).
    3. NotificationBackendService.updateNotificationAsOpened calls throw BackendException(429, payload, 60); retryLimit=0 is dropped.
    4. confirmNotificationOpened catches the exception; attempt=1 < maxNetworkRequestAttemptCount=3, so it delays and loops.
    5. Attempt 2 fires despite the server having said retryLimit=0. Server responds 429 again.
    6. Attempt 3 fires. The client has now made 3 total requests after a server instruction to make 0 additional requests.

@nan-li nan-li requested a review from a team April 7, 2026 21:56
Contributor

@abdulraqeeb33 abdulraqeeb33 left a comment


Blocking the PR to get some more insight.

Is this an API that we are hitting directly?
Should we consider adding this to the OperationRepo instead, which takes care of making the call/retry on failure?

@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 8, 2026

Blocking the PR to get some more insight.

Is this an API that we are hitting directly? Should we consider adding this to the OperationRepo instead, which takes care of making the call/retry on failure?

Yeah, that was one of the alternatives considered mentioned above; there are a few requests currently made outside of the OperationRepo, and we can opt to do that refactor instead.
