fix: add retry for notification opened confirmation#2606

Open
nan-li wants to merge 1 commit into main from fix/add_retry_to_notification_opened_request

Conversation

Contributor

@nan-li nan-li commented Apr 7, 2026

Description

One Line Summary

Add in-memory retry logic for the notification opened confirmation PUT request.

Details

Motivation

Customer reports show the PUT /notifications/<id> request (confirming a notification was opened) can fail silently due to stale HTTP keep-alive connections, resulting in SocketException, ConnectException, or SocketTimeoutException with statusCode: -1 and no retry. This means opened events are lost with no recovery path.

Other SDK flows already have retry mechanisms — InAppBackendService.attemptFetchWithRetries for IAM fetches and PushRegistratorAbstractGoogle.registerInBackground for push registration — but the notification opened confirmation was fire-and-forget with no retries.

Scope

  • Adds a confirmNotificationOpened method to NotificationLifecycleService that wraps the existing updateNotificationAsOpened call with retry logic.
  • Retries up to NetworkUtils.maxNetworkRequestAttemptCount (3) times for retryable errors (network failures, 429, 5xx).
  • Non-retryable errors (400, 401, 403, 404, 409, 410) are thrown immediately without retry.
  • Delay between retries uses maxOf(retryAfterSeconds, linear backoff), consistent with OperationRepo.delayBeforeNextExecution.
  • Linear backoff: attempt * 15s, i.e. 15s after the first failed attempt and 30s after the second.
  • The retry is in-memory only (does not persist across app kills). The call remains fire-and-forget via suspendifyWithErrorHandling so it does not block the rest of the notification opened flow.
  • No changes to the public API or other notification flows.
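To make the retry policy above concrete, here is a minimal Kotlin sketch of the backoff and retry decision. This is illustrative only: `backoffSeconds` and `shouldRetry` are hypothetical helpers, not the actual SDK code, which lives in `NotificationLifecycleService.confirmNotificationOpened`.

```kotlin
// Illustrative sketch of the retry policy described in the PR; not the SDK source.

// Mirrors NetworkUtils.maxNetworkRequestAttemptCount.
const val maxAttempts = 3

// Status codes the PR treats as non-retryable.
val nonRetryable = setOf(400, 401, 403, 404, 409, 410)

// Delay before the next attempt: maxOf(server Retry-After, linear backoff),
// consistent with OperationRepo.delayBeforeNextExecution.
fun backoffSeconds(attempt: Int, retryAfterSeconds: Int?): Int =
    maxOf(retryAfterSeconds ?: 0, attempt * 15) // 15s, 30s for attempts 1, 2

// Retry only while attempts remain and the error is retryable
// (network failures surface with no meaningful status; 429/5xx are retryable).
fun shouldRetry(statusCode: Int, attempt: Int): Boolean =
    attempt < maxAttempts && statusCode !in nonRetryable
```

With these helpers, a failure on attempt 1 with no `Retry-After` header waits 15s, a 429 with `Retry-After: 60` waits 60s, and a 404 stops immediately.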

Alternatives considered

  1. Disabling HTTP keep-alive (Connection: close header) — Adding this header to HttpClient would force a fresh TCP connection per request, avoiding stale pooled connections entirely. This was not chosen because it would affect all SDK HTTP requests globally, adding latency from repeated TCP/TLS handshakes, when the issue only manifests for requests made after an idle gap (e.g. the delay between receiving and opening a notification).
  2. Refactoring into OperationRepo — Enqueuing the opened confirmation as a persistent Operation would give us disk-backed retry that survives app kills, matching the architecture used for user identity and subscription operations. This was not chosen because (a) it's a larger change with more risk for a targeted fix, and (b) other similar fire-and-forget network calls — receive receipts, IAM fetches, push registration — also don't go through OperationRepo and use their own in-memory retry strategies. An in-memory retry here is consistent with those patterns. This can be considered as a larger refactor in the future with other requests.

Testing

Unit testing

Existing NotificationLifecycleServiceTests continue to pass. The retry path exercises the same updateNotificationAsOpened backend call.

Manual testing

Tested by simulating network errors on the notification opened PUT request on an emulator and confirming retries occur with the expected backoff delays in logs.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible.
  • I have reviewed this PR myself, ensuring it meets each checklist item

The PUT request confirming a notification was opened can fail due to
stale HTTP keep-alive connections (SocketException, ConnectException,
SocketTimeoutException), resulting in statusCode: -1 with no retry.
Add retry logic (up to 3 attempts with linear backoff) for retryable
errors so transient network failures don't silently drop opened events.
@nan-li nan-li force-pushed the fix/add_retry_to_notification_opened_request branch from 9ec5c5d to 9e8232a on April 7, 2026 at 21:28

@claude claude bot left a comment



Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 OneSignalSDK/onesignal/notifications/src/main/java/com/onesignal/notifications/internal/lifecycle/impl/NotificationLifecycleService.kt:206-232 — The new confirmNotificationOpened retry loop cannot honor a server-provided OneSignal-Retry-Limit response header because BackendException has no retryLimit field — NotificationBackendService.updateNotificationAsOpened throws BackendException(response.statusCode, response.payload, response.retryAfterSeconds), silently discarding HttpResponse.retryLimit. If the notification-opened endpoint ever starts sending OneSignal-Retry-Limit: 0 or OneSignal-Retry-Limit: 1, the client will still retry up to maxNetworkRequestAttemptCount (3) times in violation of the server's instruction.

    Extended reasoning...

    What the bug is and how it manifests

    HttpClient parses the OneSignal-Retry-Limit response header into HttpResponse.retryLimit. The new confirmNotificationOpened loop catches BackendException to decide whether to retry, but BackendException only carries statusCode, response (payload), and retryAfterSeconds — there is no retryLimit field. The retry cap the server communicated is therefore unreachable inside the loop, and the code always retries up to NetworkUtils.maxNetworkRequestAttemptCount (3).

    The specific code path

    NotificationBackendService.updateNotificationAsOpened (line 46 of the original file) throws BackendException(response.statusCode, response.payload, response.retryAfterSeconds). confirmNotificationOpened (lines 206-232 of the modified file) catches this exception but has no way to read the server-provided retry limit from it, so it always drives through all three attempts regardless of any OneSignal-Retry-Limit header in the response.

    Why existing code doesn't prevent it

    InAppBackendService respects retryLimit by owning its own internal retry loop and reading response.retryLimit directly from HttpResponse (lines 231, 244). It avoids the problem entirely by never throwing BackendException for retryable status codes. The notification-opened path uses the opposite pattern: it delegates to a backend service that throws BackendException, then wraps that exception in a new retry loop — but the exception type loses the field needed to honor the server's cap.

    Addressing the refutation

    The refutation correctly notes that every other backend service (SubscriptionBackendService, UserBackendService, IdentityBackendService, etc.) also drops retryLimit via BackendException, and that there is currently no known evidence the notification-opened PUT endpoint sends OneSignal-Retry-Limit. These are valid points that lower the practical severity. However, this PR introduces a retry loop — none of those other services have one. The inconsistency is new: before this PR there was no retry at all for notification opens, so the missing field was irrelevant. Now that a loop exists, the gap becomes observable if the server ever sends the header for this endpoint.

    Impact

    If the server sends OneSignal-Retry-Limit: 0 (do not retry) or OneSignal-Retry-Limit: 1 (at most one retry) for a notification-opened PUT response, the client will ignore that instruction and retry up to 3 times. This is a protocol violation that could result in unnecessary load on the server and duplicate confirmation attempts during rate-limit scenarios (HTTP 429).

    How to fix it

    Add a retryLimit: Int? field to BackendException and populate it in NotificationBackendService.updateNotificationAsOpened: throw BackendException(response.statusCode, response.payload, response.retryAfterSeconds, retryLimit = response.retryLimit). Then update confirmNotificationOpened to track a mutable effectiveMaxAttempts initialized to maxNetworkRequestAttemptCount and updated on each caught exception: ex.retryLimit?.let { effectiveMaxAttempts = minOf(effectiveMaxAttempts, attempt + it) }.
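A minimal sketch of that suggested fix (hypothetical shapes; the names follow the suggestion above, not the actual SDK source):

```kotlin
// Hypothetical: BackendException extended with the server's retry cap,
// populated from HttpResponse.retryLimit (the OneSignal-Retry-Limit header).
class BackendException(
    val statusCode: Int,
    val payload: String? = null,
    val retryAfterSeconds: Int? = null,
    val retryLimit: Int? = null,
) : Exception()

// Inside the retry loop: shrink the remaining attempt budget whenever the
// server communicates a cap; otherwise keep the current budget unchanged.
fun effectiveMaxAttempts(current: Int, attempt: Int, retryLimit: Int?): Int =
    retryLimit?.let { minOf(current, attempt + it) } ?: current
```

For example, if the first attempt (attempt = 1) fails with `retryLimit = 0`, the budget shrinks from 3 to 1 and the loop stops immediately, honoring the server's instruction.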

    Step-by-step proof

    1. Server receives PUT /notifications/{id}/opened and responds with HTTP 429 and headers OneSignal-Retry-Limit: 0, Retry-After: 60.
    2. HttpClient parses this into HttpResponse(statusCode=429, retryLimit=0, retryAfterSeconds=60, ...).
    3. NotificationBackendService.updateNotificationAsOpened calls throw BackendException(429, payload, 60); retryLimit=0 is dropped.
    4. confirmNotificationOpened catches the exception; attempt=1 < maxNetworkRequestAttemptCount=3, so it delays and loops.
    5. Attempt 2 fires despite the server having said retryLimit=0. Server responds 429 again.
    6. Attempt 3 fires. The client has now made 3 total requests after a server instruction to make 0 additional requests.

@nan-li nan-li requested a review from a team April 7, 2026 21:56
Contributor

@abdulraqeeb33 abdulraqeeb33 left a comment


Blocking the PR to get some more insight.

Is this an API that we are hitting directly?
Should we consider adding this to the OperationRepo instead, which takes care of making the call/retry on failure?

@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 8, 2026

Blocking the PR to get some more insight.

Is this an API that we are hitting directly? Should we consider adding this to the OperationRepo instead, which takes care of making the call/retry on failure?

Yeah, that was one of the alternatives considered mentioned above; there are a few requests currently made outside of the OperationRepo, and we can opt to do that refactor instead.
