fix: add retry for notification opened confirmation #2606
Conversation
The PUT request confirming a notification was opened can fail due to stale HTTP keep-alive connections (SocketException, ConnectException, SocketTimeoutException), resulting in statusCode: -1 with no retry. Add retry logic (up to 3 attempts with linear backoff) for retryable errors so transient network failures don't silently drop opened events.
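As a rough sketch of the classification described above (the function name `isRetryableError` is illustrative, not necessarily the SDK's actual API), a retryable failure is a connection-level error surfaced as `statusCode: -1`, a 429 rate-limit response, or a 5xx server error:

```kotlin
// Illustrative sketch only; the real SDK helper may be named and shaped differently.
// statusCode -1 models SocketException / ConnectException / SocketTimeoutException
// surfaced as a failed response with no HTTP status.
fun isRetryableError(statusCode: Int): Boolean =
    statusCode == -1 ||        // transport failure (e.g. stale keep-alive connection)
        statusCode == 429 ||   // rate limited; Retry-After should be honored if present
        statusCode in 500..599 // transient server-side error
```

Non-retryable client errors (e.g. 400 or 404) fall through and fail immediately, so only transient failures consume retry attempts.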
9ec5c5d to 9e8232a
Additional findings (outside current diff — PR may have been updated during review):
- 🟡 `OneSignalSDK/onesignal/notifications/src/main/java/com/onesignal/notifications/internal/lifecycle/impl/NotificationLifecycleService.kt:206-232` — The new `confirmNotificationOpened` retry loop cannot honor a server-provided `OneSignal-Retry-Limit` response header because `BackendException` has no `retryLimit` field — `NotificationBackendService.updateNotificationAsOpened` throws `BackendException(response.statusCode, response.payload, response.retryAfterSeconds)`, silently discarding `HttpResponse.retryLimit`. If the notification-opened endpoint ever starts sending `OneSignal-Retry-Limit: 0` or `OneSignal-Retry-Limit: 1`, the client will still retry up to `maxNetworkRequestAttemptCount` (3) times in violation of the server's instruction.
What the bug is and how it manifests
`HttpClient` parses the `OneSignal-Retry-Limit` response header into `HttpResponse.retryLimit`. The new `confirmNotificationOpened` loop catches `BackendException` to decide whether to retry, but `BackendException` only carries `statusCode`, `response` (payload), and `retryAfterSeconds` — there is no `retryLimit` field. The retry cap the server communicated is therefore unreachable inside the loop, and the code always retries up to `NetworkUtils.maxNetworkRequestAttemptCount` (3).

The specific code path
`NotificationBackendService.updateNotificationAsOpened` (line 46 of the original file) throws `BackendException(response.statusCode, response.payload, response.retryAfterSeconds)`. `confirmNotificationOpened` (lines 206-232 of the modified file) catches this exception but has no way to read the server-provided retry limit from it, so it always drives through all three attempts regardless of any `OneSignal-Retry-Limit` header in the response.

Why existing code doesn't prevent it
`InAppBackendService` respects `retryLimit` by owning its own internal retry loop and reading `response.retryLimit` directly from `HttpResponse` (lines 231, 244). It avoids the problem entirely by never throwing `BackendException` for retryable status codes. The notification-opened path uses the opposite pattern: it delegates to a backend service that throws `BackendException`, then wraps that exception in a new retry loop — but the exception type loses the field needed to honor the server's cap.

Addressing the refutation
The refutation correctly notes that every other backend service (`SubscriptionBackendService`, `UserBackendService`, `IdentityBackendService`, etc.) also drops `retryLimit` via `BackendException`, and that there is currently no known evidence the notification-opened PUT endpoint sends `OneSignal-Retry-Limit`. These are valid points that lower the practical severity. However, this PR introduces a retry loop — none of those other services have one. The inconsistency is new: before this PR there was no retry at all for notification opens, so the missing field was irrelevant. Now that a loop exists, the gap becomes observable if the server ever sends the header for this endpoint.

Impact
If the server sends `OneSignal-Retry-Limit: 0` (do not retry) or `OneSignal-Retry-Limit: 1` (at most one retry) for a notification-opened PUT response, the client will ignore that instruction and retry up to 3 times. This is a protocol violation that could result in unnecessary load on the server and duplicate confirmation attempts during rate-limit scenarios (HTTP 429).

How to fix it
Add a `retryLimit: Int?` field to `BackendException` and populate it in `NotificationBackendService.updateNotificationAsOpened`: `throw BackendException(response.statusCode, response.payload, response.retryAfterSeconds, retryLimit = response.retryLimit)`. Then update `confirmNotificationOpened` to track a mutable `effectiveMaxAttempts` initialized to `maxNetworkRequestAttemptCount` and updated on each caught exception: `ex.retryLimit?.let { effectiveMaxAttempts = minOf(effectiveMaxAttempts, attempt + it) }`.

Step-by-step proof
- Server receives PUT /notifications/{id}/opened and responds with HTTP 429 and headers `OneSignal-Retry-Limit: 0`, `Retry-After: 60`.
- `HttpClient` parses this into `HttpResponse(statusCode=429, retryLimit=0, retryAfterSeconds=60, ...)`.
- `NotificationBackendService.updateNotificationAsOpened` calls `throw BackendException(429, payload, 60)` — `retryLimit=0` is dropped.
- `confirmNotificationOpened` catches the exception; `attempt=1 < maxNetworkRequestAttemptCount=3`, so it delays and loops.
- Attempt 2 fires despite the server having said `retryLimit=0`. Server responds 429 again.
- Attempt 3 fires. The client has now made 3 total requests after a server instruction to make 0 additional requests.
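The suggested fix could look roughly like the sketch below. This is not the PR's actual code: the loop shape, parameter names, and the `send` callback are assumptions made for illustration, and only the `retryLimit` field and the `minOf` clamp come from the finding above.

```kotlin
// Sketch: carry the server's OneSignal-Retry-Limit through BackendException
// so the retry loop can honor it. Field names mirror the existing exception.
class BackendException(
    val statusCode: Int,
    val response: String? = null,
    val retryAfterSeconds: Int? = null,
    val retryLimit: Int? = null, // new: populated from HttpResponse.retryLimit
) : Exception()

// Hypothetical loop shape for confirmNotificationOpened; `send` stands in
// for the updateNotificationAsOpened backend call.
suspend fun confirmOpenedWithServerCap(send: suspend () -> Unit) {
    var effectiveMaxAttempts = 3 // NetworkUtils.maxNetworkRequestAttemptCount
    var attempt = 1
    while (attempt <= effectiveMaxAttempts) {
        try {
            send()
            return
        } catch (ex: BackendException) {
            // Clamp remaining attempts to the server-provided cap, if any.
            // Retry-Limit: 0 at attempt 1 => effectiveMaxAttempts = 1 => stop now.
            ex.retryLimit?.let { effectiveMaxAttempts = minOf(effectiveMaxAttempts, attempt + it) }
            if (attempt >= effectiveMaxAttempts) throw ex
            // backoff delay (maxOf(retryAfterSeconds, linear backoff)) would go here
            attempt++
        }
    }
}
```

With this clamp, the 429 + `OneSignal-Retry-Limit: 0` scenario in the proof above stops after the first attempt instead of making two extra requests.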
...ain/java/com/onesignal/notifications/internal/lifecycle/impl/NotificationLifecycleService.kt
abdulraqeeb33 left a comment:
Blocking the PR to get some more insight.
Is this an API that we are hitting directly?
Should we consider adding this to the `OperationRepo` instead, which takes care of making the call and retrying on failure?
Yeah, that was one of the alternatives considered mentioned in the description; there are a few requests that are made outside of the op repo currently, and we can opt to do that refactor instead.
Description
One Line Summary
Add in-memory retry logic for the notification opened confirmation PUT request.
Details
Motivation
Customer reports show the `PUT /notifications/<id>` request (confirming a notification was opened) can fail silently due to stale HTTP keep-alive connections, resulting in `SocketException`, `ConnectException`, or `SocketTimeoutException` with `statusCode: -1` and no retry. This means opened events are lost with no recovery path.

Other SDK flows already have retry mechanisms — `InAppBackendService.attemptFetchWithRetries` for IAM fetches and `PushRegistratorAbstractGoogle.registerInBackground` for push registration — but the notification opened confirmation was fire-and-forget with no retries.

Scope
- Add a `confirmNotificationOpened` method to `NotificationLifecycleService` that wraps the existing `updateNotificationAsOpened` call with retry logic.
- Retry up to `NetworkUtils.maxNetworkRequestAttemptCount` (3) times for retryable errors (network failures, 429, 5xx).
- Delay between attempts is `maxOf(retryAfterSeconds, linear backoff)`, consistent with `OperationRepo.delayBeforeNextExecution`. Linear backoff is `attempt * 15s` → 15s, 30s.
- The retry runs inside `suspendifyWithErrorHandling` so it does not block the rest of the notification opened flow.

Alternatives considered
- Forcing fresh connections (`Connection: close` header) — Adding this header to `HttpClient` would force a fresh TCP connection per request, avoiding stale pooled connections entirely. This was not chosen because it would affect all SDK HTTP requests globally, adding latency from repeated TCP/TLS handshakes, when the issue only manifests for requests made after an idle gap (e.g. the delay between receiving and opening a notification).
- `OperationRepo` — Enqueuing the opened confirmation as a persistent `Operation` would give us disk-backed retry that survives app kills, matching the architecture used for user identity and subscription operations. This was not chosen because (a) it's a larger change with more risk for a targeted fix, and (b) other similar fire-and-forget network calls — receive receipts, IAM fetches, push registration — also don't go through `OperationRepo` and use their own in-memory retry strategies. An in-memory retry here is consistent with those patterns. This can be considered as a larger refactor in the future along with other requests.

Testing
Unit testing
Existing `NotificationLifecycleServiceTests` continue to pass. The retry path exercises the same `updateNotificationAsOpened` backend call.

Manual testing
Tested by simulating network errors on the notification opened PUT request on an emulator and confirming retries occur with the expected backoff delays in logs.
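Putting the Scope items above together, the retry-with-backoff flow might be sketched as follows. This is a sketch under stated assumptions, not the PR's literal code: the `updateAsOpened` callback, the retryability check, and the exact field types are placeholders based on the description in this PR.

```kotlin
import kotlinx.coroutines.delay

// Sketch of the in-memory retry described in Scope; not the SDK's actual code.
// In the SDK this would run inside suspendifyWithErrorHandling so the
// notification opened flow is not blocked while retries back off.
suspend fun confirmNotificationOpenedWithRetry(
    maxAttempts: Int = 3,               // NetworkUtils.maxNetworkRequestAttemptCount
    updateAsOpened: suspend () -> Unit, // the existing backend PUT call
) {
    for (attempt in 1..maxAttempts) {
        try {
            updateAsOpened()
            return // success: opened event confirmed
        } catch (ex: BackendException) {
            val retryable =
                ex.statusCode == -1 ||        // transport failure (stale keep-alive)
                    ex.statusCode == 429 ||   // rate limited
                    ex.statusCode in 500..599 // transient server error
            if (!retryable || attempt == maxAttempts) throw ex
            // maxOf(server Retry-After, linear backoff): attempt * 15s -> 15s, 30s
            val backoffSeconds = maxOf(ex.retryAfterSeconds ?: 0, attempt * 15)
            delay(backoffSeconds * 1000L)
        }
    }
}
```

A failed first attempt with no `Retry-After` waits 15s, a failed second waits 30s, and a third failure propagates the exception to the error-handling wrapper.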