fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2609
fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2609nan-li wants to merge 4 commits intofeat/identity_verification_5.8from
Conversation
917e157 to
c61f42e
Compare
2b6e2cf to
60c8d9a
Compare
|
@claude review |
650e96f to
cfc3e6e
Compare
|
@claude review |
|
@claude review |
|
@claude review |
89393ca to
071ecef
Compare
|
@claude review |
4c0feb5 to
86b6c36
Compare
|
@claude review |
|
@claude review |
6f91de9 to
5d860c9
Compare
| private var pendingJwtRetryExternalId: String? = null | ||
|
|
||
| @Volatile | ||
| private var pendingJwtRetryRywData: RywData? = null | ||
|
|
||
| private val identityModelChangeHandler = | ||
| object : ISingletonModelStoreChangeHandler<IdentityModel> { | ||
| override fun onModelReplaced( | ||
| model: IdentityModel, | ||
| tag: String, | ||
| ) { } | ||
| ) { | ||
| pendingJwtRetryExternalId = null | ||
| pendingJwtRetryRywData = null | ||
| } |
There was a problem hiding this comment.
🔴 In onJwtUpdated, pendingJwtRetryExternalId (R1) and pendingJwtRetryRywData (R2) are two separate @Volatile reads with no synchronization between them, while identityModelChangeHandler.onModelReplaced zeroes both fields without any lock. A concurrent user switch between R1 and R2 leaves onJwtUpdated with a non-null retryExternalId from the old user but a null retryRywData, causing fetchMessages(null) to fire under the new user's identity context. To fix, add a shared lock (e.g. synchronized(pendingJwtRetryLock)) spanning the read-check-clear block in onJwtUpdated and the two null writes in onModelReplaced, mirroring the synchronized(jwtInvalidatedLock) pattern already applied to UserManager in this same PR.
Extended reasoning...
What the bug is and how it manifests
InAppMessagesManager stores two @Volatile fields — pendingJwtRetryExternalId and pendingJwtRetryRywData — as the retry state for IAM fetches that received a 401 response. In onJwtUpdated (lines 1039–1059), the method first reads pendingJwtRetryExternalId (R1) and uses it as a guard: if null, it returns early; if mismatched, it returns. Then, as a second independent volatile read, it reads pendingJwtRetryRywData (R2) into retryRywData. Concurrently, identityModelChangeHandler.onModelReplaced (lines 145–148) writes pendingJwtRetryExternalId = null and then pendingJwtRetryRywData = null without holding any lock.
The specific code path that triggers it
Race sequence:
- Thread A (
onJwtUpdated("alice")): R1 readspendingJwtRetryExternalId= "alice" — passes the?: returnguard. - Thread A: passes the
externalId != retryExternalIdcheck. - Thread B (
onModelReplaced, user switch to "bob"): writespendingJwtRetryExternalId = null, thenpendingJwtRetryRywData = null. - Thread A: R2 reads
pendingJwtRetryRywData— getsnull(Thread B already cleared it). - Thread A: calls
suspendifyOnIO { fetchMessages(null) }with null rywData, under the new user's identity context (since_identityModelStore.modelis now bob's model).
Why existing code does not prevent it
@Volatile on both fields guarantees cross-thread memory visibility for each individual read/write, but provides no atomicity across the compound sequence (R1 → guard check → R2). The JVM memory model allows Thread B's writes to interpose between Thread A's R1 and R2. Neither field uses synchronized, AtomicReference, nor any other construct that would make the two-field check-and-read atomic with respect to the two-field clear in onModelReplaced.
What the impact would be
The spurious fetchMessages(null) fires under bob's identity context (bob's externalId, bob's aliasLabel/aliasValue, bob's subscriptionId). Inside fetchMessages, the fetchIAMMutex rate-limiter sets lastTimeFetchedIAMs = now. When the legitimate RYW-aware fetch for bob subsequently arrives (via onModelUpdated), the rate-limit check (now - lastTimeFetchedIAMs!!) < fetchIAMMinInterval (default 30 s) returns early, permanently dropping the RYW-consistent fetch for the rest of the session. Alternatively, the spurious fetch populates IAMs for bob without the RYW token, bypassing the read-your-write consistency mechanism this entire PR is designed to provide.
Step-by-step proof
- JWT expires for alice;
fetchMessagescatches the 401 and storespendingJwtRetryExternalId = "alice",pendingJwtRetryRywData = rywData. - App calls
OneSignal.login("bob");identityModelChangeHandler.onModelReplacedis invoked on Thread B. - Simultaneously,
OneSignal.updateUserJwt("alice", newJwt)firesonJwtUpdated("alice")on Thread A. - Thread A completes R1 (
pendingJwtRetryExternalId = "alice", non-null, passes guard). - Thread B executes:
pendingJwtRetryExternalId = null,pendingJwtRetryRywData = null. - Thread A executes R2:
val retryRywData = pendingJwtRetryRywData→ null. - Thread A clears pending state (already null) and calls
fetchMessages(null)under bob's identity. fetchMessages(null)acquiresfetchIAMMutex, passes the rate-limit check (or doesn't, depending on timing), and setslastTimeFetchedIAMs = now.- Bob's legitimate RYW-consistent fetch arrives and is blocked by the 30-second rate limiter.
How to fix it
Introduce a plain Any() lock object (e.g. private val pendingJwtRetryLock = Any()) and use synchronized(pendingJwtRetryLock) to protect the compound read-check-clear in onJwtUpdated and the two null writes in onModelReplaced. This is the same pattern the PR already applies to UserManager with jwtInvalidatedLock, confirming the intent — the analogous guard was simply missed in InAppMessagesManager.
| <ID>LongMethod:InAppRepository.kt$InAppRepository$override suspend fun cleanCachedInAppMessages()</ID> | ||
| <ID>LongParameterList:IInAppBackendService.kt$IInAppBackendService$( appId: String, subscriptionId: String, variantId: String?, messageId: String, clickId: String?, isFirstClick: Boolean, )</ID> | ||
| <ID>LongParameterList:InAppDisplayer.kt$InAppDisplayer$( private val _applicationService: IApplicationService, private val _lifecycle: IInAppLifecycleService, private val _promptFactory: IInAppMessagePromptFactory, private val _backend: IInAppBackendService, private val _influenceManager: IInfluenceManager, private val _configModelStore: ConfigModelStore, private val _languageContext: ILanguageContext, private val _time: ITime, )</ID> | ||
| <ID>LongParameterList:InAppMessagesManager.kt$InAppMessagesManager$( private val _applicationService: IApplicationService, private val _sessionService: ISessionService, private val _influenceManager: IInfluenceManager, private val _configModelStore: ConfigModelStore, private val _userManager: IUserManager, private val _identityModelStore: IdentityModelStore, private val _subscriptionManager: ISubscriptionManager, private val _outcomeEventsController: IOutcomeEventsController, private val _state: InAppStateService, private val _prefs: IInAppPreferencesController, private val _repository: IInAppRepository, private val _backend: IInAppBackendService, private val _triggerController: ITriggerController, private val _triggerModelStore: TriggerModelStore, private val _displayer: IInAppDisplayer, private val _lifecycle: IInAppLifecycleService, private val _languageContext: ILanguageContext, private val _time: ITime, private val _consistencyManager: IConsistencyManager, )</ID> | ||
| <ID>LongParameterList:InAppMessagesManager.kt$InAppMessagesManager$( private val _applicationService: IApplicationService, private val _sessionService: ISessionService, private val _influenceManager: IInfluenceManager, private val _configModelStore: ConfigModelStore, private val _userManager: IUserManager, private val _identityModelStore: IdentityModelStore, private val _subscriptionManager: ISubscriptionManager, private val _outcomeEventsController: IOutcomeEventsController, private val _state: InAppStateService, private val _prefs: IInAppPreferencesController, private val _repository: IInAppRepository, private val _backend: IInAppBackendService, private val _triggerController: ITriggerController, private val _triggerModelStore: TriggerModelStore, private val _displayer: IInAppDisplayer, private val _lifecycle: IInAppLifecycleService, private val _languageContext: ILanguageContext, private val _time: ITime, private val _consistencyManager: IConsistencyManager, private val _jwtTokenStore: JwtTokenStore, )</ID> | ||
| <ID>LongParameterList:OneSignalAnimate.kt$OneSignalAnimate$( view: View, deltaFromY: Float, deltaToY: Float, duration: Int, interpolator: Interpolator?, animCallback: Animation.AnimationListener?, )</ID> | ||
| <ID>MagicNumber:DraggableRelativeLayout.kt$DraggableRelativeLayout$3</ID> | ||
| <ID>MagicNumber:DraggableRelativeLayout.kt$DraggableRelativeLayout$3000</ID> |
There was a problem hiding this comment.
🟡 The PR adds _jwtTokenStore: JwtTokenStore as a new constructor parameter to InAppMessagesManager and correctly updates the LongParameterList baseline entry, but omits the corresponding ConstructorParameterNaming baseline entry. All 19 other underscore-prefixed constructor parameters have their own ConstructorParameterNaming suppression entries; _jwtTokenStore is the sole omission. Add the missing entry: ConstructorParameterNaming:InAppMessagesManager.kt$InAppMessagesManager$private val _jwtTokenStore: JwtTokenStore.
Extended reasoning...
What the bug is
The PR introduces private val _jwtTokenStore: JwtTokenStore as a new constructor parameter to InAppMessagesManager. Detekt enforces a ConstructorParameterNaming rule that flags underscore-prefixed constructor parameters. All 19 existing underscore-prefixed constructor parameters in InAppMessagesManager have their own ConstructorParameterNaming suppression entries in detekt-baseline-in-app-messages.xml (from _applicationService through _userManager), but the new _jwtTokenStore parameter has no corresponding entry.
The specific code path that triggers it
When detekt runs in CI, it scans InAppMessagesManager.kt and finds private val _jwtTokenStore: JwtTokenStore in the constructor. It raises a ConstructorParameterNaming violation. Because no baseline entry suppresses this violation, it is treated as a new unbaselined finding. Detekt baseline-check mode fails the build on any new unbaselined violations.
Why existing code does not prevent it
The PR author correctly updated the LongParameterList baseline entry (line 71) to include _jwtTokenStore. However, ConstructorParameterNaming is a separate rule requiring its own per-parameter baseline entry. The two rules are tracked independently in the XML. The omission is easy to miss because LongParameterList and ConstructorParameterNaming entries appear in different sections of the file.
What the impact would be
The detekt CI check will fail when the PR is merged, blocking the release build. No functional behavior is affected; this is purely a build-tooling issue.
How to fix it
Add the following entry to the <CurrentIssues> section of detekt-baseline-in-app-messages.xml, alphabetically between the _influenceManager and _languageContext entries:
<ID>ConstructorParameterNaming:InAppMessagesManager.kt$InAppMessagesManager$private val _jwtTokenStore: JwtTokenStore</ID>
Step-by-step proof
- PR adds
private val _jwtTokenStore: JwtTokenStoretoInAppMessagesManagerconstructor. - Detekt scans the constructor; all 19 existing underscore-prefixed parameters are suppressed via baseline entries.
- Detekt finds
_jwtTokenStorewith no baseline entry. - Detekt reports one new
ConstructorParameterNamingviolation. - CI baseline check: new violation not in baseline → build fails.
| val newMessages = | ||
| try { | ||
| _backend.listInAppMessages(appId, aliasLabel, aliasValue, subscriptionId, rywData, sessionDurationProvider, jwt) | ||
| } catch (ex: BackendException) { | ||
| if (NetworkUtils.getResponseStatusType(ex.statusCode) == NetworkUtils.ResponseStatusType.UNAUTHORIZED) { | ||
| Logging.debug("InAppMessagesManager.fetchMessages: ${ex.statusCode} response. Will retry after JWT refresh for externalId=$externalId") | ||
| lastTimeFetchedIAMs = null | ||
| pendingJwtRetryRywData = rywData | ||
| pendingJwtRetryExternalId = externalId | ||
| } | ||
| null | ||
| } |
There was a problem hiding this comment.
🔴 The JWT 401 retry mechanism has a temporal ordering flaw: onJwtUpdated is the sole trigger for retrying an IAM fetch after a 401, but the developer's putJwt call (which fires onJwtUpdated) can arrive before the in-flight IAM request processes its 401 response and sets pendingJwtRetryExternalId. When this happens, onJwtUpdated reads pendingJwtRetryExternalId == null and returns as a no-op; by the time the 401 catch block finally sets the sentinel, no future onJwtUpdated event will ever fire, permanently losing the IAM retry for the entire session. The fix is to check in the 401 catch block whether _jwtTokenStore.getJwt(externalId) already differs from the jwt value used in the failed request; if so, the JWT was rotated during the network call, so skip the sentinel and launch the retry immediately.
Extended reasoning...
What the bug is and how it manifests
InAppMessagesManager.fetchMessages() reads the current JWT at the start (line ~336 of InAppMessagesManager.kt), then makes an HTTP call that can stay in-flight for 100–2000 ms. The PR's 401 recovery design relies on a two-step handshake: (1) the catch block sets pendingJwtRetryExternalId = externalId after the 401 lands, and (2) onJwtUpdated later fires when the developer provides a fresh JWT, reads the sentinel, and launches the retry. If the developer's JWT provision races ahead of step (1), the entire mechanism silently fails.
The specific code path that triggers it
Concrete timeline:
- T=0 ms:
fetchMessages()readsjwt = _jwtTokenStore.getJwt("alice")(expired). HTTP call starts. - T=100 ms: An OperationRepo operation also uses the same expired JWT, receives a 401, and calls
_userManager.fireJwtInvalidated("alice"). - T=150 ms: Developer's
onUserJwtInvalidatedfires. Developer callsOneSignal.updateUserJwt("alice", freshJwt)→JwtTokenStore.putJwt()→jwtUpdateNotifier.fire { it.onJwtUpdated("alice") }. - T=150 ms:
InAppMessagesManager.onJwtUpdated("alice")runs. ReadspendingJwtRetryExternalId→ null (IAM HTTP call still in-flight). Guard?: returnfires — no-op. - T=300 ms: IAM HTTP response (401) arrives. Catch block executes:
lastTimeFetchedIAMs = null,pendingJwtRetryRywData = rywData,pendingJwtRetryExternalId = "alice". - Sentinel is now set, but
onJwtUpdated("alice")will never fire again — the developer already provided the JWT once in response to the one JWT-invalidated event.
The IAM retry is permanently lost for the session.
Why existing code does not prevent it
@Volatile on pendingJwtRetryExternalId and pendingJwtRetryRywData ensures cross-thread memory visibility but does nothing to close the temporal gap between the sentinel write (in the catch block, after the network round-trip) and the notification from onJwtUpdated (which already ran and returned). There is no second chance: the SDK fires fireJwtInvalidated once per 401 event, the developer calls updateUserJwt once in response, and putJwt fires its notifier exactly once.
What the impact would be
Any session where (a) IV is enabled, (b) the JWT expires mid-session, (c) an OperationRepo operation and an in-flight IAM request both encounter the expired JWT at the same time (a near-certainty on session start), and (d) the developer responds promptly to onUserJwtInvalidated — all conditions that are simultaneously true in the common case — will silently receive no in-app messages for the remainder of the session. Recovery only happens if the app goes to background/foreground (triggering onSessionStarted/onModelReplaced) before the user would normally see the IAM.
How to fix it
In the 401 catch block, after setting the sentinel, immediately check whether the JWT in _jwtTokenStore has already been rotated since the request started:
lastTimeFetchedIAMs = null
pendingJwtRetryRywData = rywData
pendingJwtRetryExternalId = externalId
// If the JWT was already refreshed while the request was in-flight, fire the retry now
// rather than waiting for an onJwtUpdated that will never come.
if (_jwtTokenStore.getJwt(externalId) \!= jwt) {
pendingJwtRetryExternalId = null
pendingJwtRetryRywData = null
suspendifyOnIO { fetchMessages(rywData) }
}This closes the race window by checking after the fact whether the notification was missed.
Step-by-step proof
fetchMessages()capturesval jwt = _jwtTokenStore.getJwt("alice")= expiredToken.- HTTP request is in-flight for 300 ms.
- OperationRepo 401 →
fireJwtInvalidated("alice")→ developer callsputJwt("alice", freshToken). jwtUpdateNotifier.fire { onJwtUpdated("alice") }runs synchronously on the caller's thread.onJwtUpdated:pendingJwtRetryExternalId ?: return— field is still null → returns.- 300 ms later: IAM 401 response → catch block:
pendingJwtRetryExternalId = "alice". _jwtTokenStore.getJwt("alice")now returns freshToken ≠ expiredToken, but the check does not exist in the current code.- No further
onJwtUpdatedfires. IAM messages are never shown this session.
| private val identityModelChangeHandler = | ||
| object : ISingletonModelStoreChangeHandler<IdentityModel> { | ||
| override fun onModelReplaced( | ||
| model: IdentityModel, | ||
| tag: String, | ||
| ) { } | ||
| ) { | ||
| pendingJwtRetryExternalId = null | ||
| pendingJwtRetryRywData = null | ||
| } |
There was a problem hiding this comment.
🔴 When OneSignal.login("userB") is called within 30 seconds of User A's last IAM fetch, identityModelChangeHandler.onModelReplaced clears pendingJwtRetryExternalId and pendingJwtRetryRywData but omits resetting lastTimeFetchedIAMs, causing the rate-limiter to silently block User B's first IAM fetch for the remainder of the 30-second window. Fix: add lastTimeFetchedIAMs = null in onModelReplaced, alongside the existing clears for the JWT retry fields.
Extended reasoning...
What the bug is and how it manifests
identityModelChangeHandler.onModelReplaced (lines 142–148) clears pendingJwtRetryExternalId and pendingJwtRetryRywData on user switch, but omits resetting lastTimeFetchedIAMs. This field records when IAMs were last fetched and is per-user state. Omitting the reset causes the rate-limiter in fetchMessages to fire for the new user's first fetch.
The specific code path that triggers it
The rate-limiter check (line 323) is:
if (rywData == null && lastTimeFetchedIAMs != null && (now - lastTimeFetchedIAMs!!) < fetchIAMMinInterval) { return }
This guard is only bypassed when rywData != null. When createUser for the new user returns no ryw_token, resolveConditionsWithID resolves the IamFetchReadyCondition deferred with null, so fetchMessages(null) is called. With lastTimeFetchedIAMs still holding User A's fetch timestamp T and (now - T) < 30s, the condition evaluates to true and fetchMessages returns early without fetching.
Why existing code does not prevent it
onModelReplaced was explicitly designed to clear per-user fetch state on user switch — it already clears pendingJwtRetryExternalId and pendingJwtRetryRywData. The omission of lastTimeFetchedIAMs is a clear oversight. The rate-limiter was designed to prevent redundant fetches for the same user; it incorrectly fires here for a different, newly-logged-in user whose IAMs have never been fetched.
What the impact would be
In environments where the backend does not return ryw_token from createUser (the common non-RYW case), User B's IAMs are silently never fetched for up to 30 seconds after login. There is no error logged, no exception thrown — in-app messages are simply not shown. The only recovery is if another event (trigger change, subscription change) incidentally triggers fetchMessagesWhenConditionIsMet after the 30-second window expires.
How to fix it
Add lastTimeFetchedIAMs = null inside identityModelChangeHandler.onModelReplaced, alongside the existing clears:
override fun onModelReplaced(model: IdentityModel, tag: String) {
pendingJwtRetryExternalId = null
pendingJwtRetryRywData = null
lastTimeFetchedIAMs = null // add this line
}
This is consistent with the intent of the function and mirrors how UserManager.onModelReplaced clears pendingJwtInvalidatedExternalId on user switch in this same PR.
Step-by-step proof
- User A is logged in. A session starts; IAMs are fetched at time T. lastTimeFetchedIAMs = T.
- Within 30 seconds, OneSignal.login("userB") is called.
- identityModelChangeHandler.onModelReplaced fires: pendingJwtRetryExternalId = null, pendingJwtRetryRywData = null. lastTimeFetchedIAMs still equals T.
- createUser for User B succeeds with no ryw_token. resolveConditionsWithID resolves the IamFetchReadyCondition deferred with null. fetchMessages(null) is called.
- Rate-limiter check: rywData==null AND lastTimeFetchedIAMs!=null AND (now - T) < 30000ms → true → return early.
- User B's IAMs are never fetched. No error is surfaced; in-app messages are silently suppressed for the new user.
| backendSubscriptions.remove(backendSubscription) | ||
| } | ||
|
|
||
| if (response.rywData != null) { | ||
| _consistencyManager.setRywData(backendOneSignalId, IamFetchRywTokenKey.USER, response.rywData) | ||
| } | ||
| _consistencyManager.resolveConditionsWithID(IamFetchReadyCondition.ID) |
There was a problem hiding this comment.
🔴 The unconditional resolveConditionsWithID(IamFetchReadyCondition.ID) call at LoginUserOperationExecutor.kt:234 fires two concurrent IAM backend calls on every login when createUser returns rywData: Path A resolves the pre-login IamFetchReadyCondition("") with null and calls fetchMessages(null), while Path B's coroutine (scheduled by onModelUpdated) later finds the condition already stored via setRywData and calls fetchMessages(rywData). Path A wins the rate limiter first (no ryw delay, lastTimeFetchedIAMs==null), displaying potentially stale IAMs before Path B overwrites them after the ryw delay; fix by removing the unconditional resolveConditionsWithID call when rywData is present and relying solely on setRywData + checkConditionsAndComplete to resolve Path B's condition.
Extended reasoning...
What the bug is and how it manifests
The PR adds two lines at LoginUserOperationExecutor.kt:228-234 after a successful createUser:
if (response.rywData != null) {
_consistencyManager.setRywData(backendOneSignalId, IamFetchRywTokenKey.USER, response.rywData)
}
_consistencyManager.resolveConditionsWithID(IamFetchReadyCondition.ID) // unconditionalThe resolveConditionsWithID call uses the static ID string shared by ALL IamFetchReadyCondition instances regardless of their constructor key. When login returns rywData != null, this produces two concurrent IAM backend fetches.
The specific code path that triggers it
Execution trace when createUser succeeds with rywData != null:
-
Before login:
onSessionStarted()callsfetchMessagesWhenConditionIsMet(). At that point_userManager.onesignalId == ""(local ID maps to empty string), so it registersIamFetchReadyCondition("")in ConsistencyManager and suspends at.await(). -
On the OperationRepo thread after createUser:
identityModel.setStringProperty(backendOneSignalId)firesonModelUpdatedsynchronously.onModelUpdatedcallssuspendifyOnIO { ... }, which schedules Path B's coroutine onDispatchers.IObut returns immediately — the coroutine has not started yet.setRywData(backendOneSignalId, USER, rywData)stores the token under keybackendOneSignalId.checkConditionsAndCompleteevaluates existing conditions: onlyIamFetchReadyCondition("")exists, andisMet("")returns false (no entry for key ""). Nothing resolves.resolveConditionsWithID("IamFetchReadyCondition")runs. It findsIamFetchReadyCondition("")(its.idmatches the static constant), completes its deferred with null, and removes it. Path A's coroutine resumes withrywData = null.
-
Path A calls
fetchMessages(null). The rate-limiter check isrywData == null && lastTimeFetchedIAMs != null && elapsed < interval. SincelastTimeFetchedIAMs == null(first call), the check is false → rate limiter passes. Path A acquiresfetchIAMMutex, setslastTimeFetchedIAMs = now, and makes backend call gradle? #1 (no RYW token, stale). -
Path B starts on the IO pool:
getRywDataFromAwaitableCondition(IamFetchReadyCondition(backendOneSignalId))adds a new condition.checkConditionsAndCompletefindsindexedTokens[backendOneSignalId]already populated from step 2 → immediately resolves the deferred with the actualrywData. Path B callsfetchMessages(rywData). The rate-limiter check isrywData == null && ...— sincerywData != null, the entire condition short-circuits to false → rate limiter unconditionally bypassed. Path B makes backend call Added possibility to send status 'opened' for a message back to onesignal backend from client source code #2 (with RYW token, RYW-consistent).
Why existing code does not prevent it
The rate-limiter fix introduced in this PR (rywData == null && at InAppMessagesManager.kt:323) was designed to let RYW-aware fetches bypass the 30-second interval. But it simultaneously allows Path A (which has null rywData because it was resolved with null by the unconditional resolveConditionsWithID) to pass the rate limiter on the first call, since lastTimeFetchedIAMs is null then. Both conditions for blocking a fetch are absent for each path.
What the impact would be
- Wasted backend call on every new user login when identity verification is enabled and the server returns
ryw_token(the common case). - Stale IAMs briefly displayed: Path A has no ryw delay and completes first, setting
this.messagesto stale results and callingevaluateInAppMessages(). Path B completes after the ryw delay and overwrites with the RYW-consistent response. - Concurrent unsynchronized writes to
this.messages(a plainvarfield, not@Volatile) from two IO-dispatcher coroutines — no memory-visibility guarantee between them. - trigger-changed flag loss: If Path A's
earlySessionTriggersprocessing setsisTriggerChanged=trueon stale message objects, Path B then replacesthis.messageswith fresh message objects that lack the flag.
How to fix it
Do not call resolveConditionsWithID when rywData is present. The condition registered by Path B will be self-completed by checkConditionsAndComplete inside the already-called setRywData. The corrected block:
if (response.rywData != null) {
_consistencyManager.setRywData(backendOneSignalId, IamFetchRywTokenKey.USER, response.rywData)
// Do NOT call resolveConditionsWithID here — setRywData already triggers checkConditionsAndComplete
// which will immediately satisfy IamFetchReadyCondition(backendOneSignalId) once Path B registers it.
} else {
_consistencyManager.resolveConditionsWithID(IamFetchReadyCondition.ID)
}Step-by-step proof
- App starts;
onesignalIdis local →_userManager.onesignalId == "".fetchMessagesWhenConditionIsMetregistersIamFetchReadyCondition(""), suspends at.await(). LoginUserOperationexecutes;createUserreturns withrywData = RywData("tok123", 500).- On OperationRepo thread:
identityModel.setStringProperty("backend-abc")→ schedules Path B on IO, returns immediately. setRywData("backend-abc", USER, rywData)→indexedTokens["backend-abc"] = rywData;checkConditionsAndCompletesees onlyIamFetchReadyCondition("")→isMet("")= false, nothing resolves.resolveConditionsWithID("IamFetchReadyCondition")→ findsIamFetchReadyCondition("")by static ID match, completes with null.- Path A resumes:
rywData = null.fetchMessages(null). Rate limiter:null == null && null != null→ false → passes.fetchIAMMutexacquired,lastTimeFetchedIAMs = T1, HTTP GET without RYW token → stale IAMs shown. - Path B starts on IO: registers
IamFetchReadyCondition("backend-abc");checkConditionsAndComplete→isMet("backend-abc")= true (stored in step 4) → deferred immediately resolves withrywData. - Path B:
fetchMessages(rywData). Rate limiter:rywData != null→ bypassed unconditionally.fetchIAMMutexacquired (Path A already released it), HTTP GET with RYW token → RYW-consistent IAMs overwrite stale result.
| val useIV = model.useIdentityVerification | ||
|
|
||
| var jwtInvalidatedExternalId: String? = null | ||
| if (useIV == true) { | ||
| Logging.debug("IdentityVerificationService: IV enabled, purging anonymous operations") | ||
| _operationRepo.removeOperationsWithoutExternalId() | ||
| suspendifyOnIO { | ||
| _operationRepo.awaitInitialized() | ||
|
|
||
| val externalId = _identityModelStore.model.externalId | ||
| if (externalId != null && _jwtTokenStore.getJwt(externalId) == null) { | ||
| Logging.debug("IdentityVerificationService: IV enabled but no JWT for $externalId, will fire invalidated event after queue wake") | ||
| jwtInvalidatedExternalId = externalId | ||
| var jwtInvalidatedExternalId: String? = null | ||
| if (useIV == true) { | ||
| Logging.debug("IdentityVerificationService: IV enabled, purging anonymous operations") | ||
| _operationRepo.removeOperationsWithoutExternalId() | ||
|
|
||
| val externalId = _identityModelStore.model.externalId | ||
| if (externalId != null && _jwtTokenStore.getJwt(externalId) == null) { | ||
| Logging.debug("IdentityVerificationService: IV enabled but no JWT for $externalId, will fire invalidated event after queue wake") | ||
| jwtInvalidatedExternalId = externalId | ||
| } | ||
| } | ||
| } | ||
|
|
||
| _operationRepo.forceExecuteOperations() | ||
| _operationRepo.forceExecuteOperations() | ||
|
|
||
| jwtInvalidatedExternalId?.let { _userManager.fireJwtInvalidated(it) } | ||
| jwtInvalidatedExternalId?.let { _userManager.fireJwtInvalidated(it) } | ||
| } |
There was a problem hiding this comment.
🔴 In the PR's refactored IdentityVerificationService.onModelReplaced(), useIV is correctly captured before the suspendifyOnIO block but val externalId = _identityModelStore.model.externalId is read inside the coroutine after awaitInitialized() returns, creating a race window where a concurrent OneSignal.login("otherUser") replaces the identity model and causes fireJwtInvalidated to fire for the wrong user. Fix: capture externalId alongside useIV before the suspendifyOnIO block, matching how useIV is already captured from the model parameter.
Extended reasoning...
What the bug is and how it manifests
The PR wraps the entire HYDRATE handler body in suspendifyOnIO { _operationRepo.awaitInitialized() ... } to fix a purge-before-initialization race. While useIV = model.useIdentityVerification is correctly captured from the model parameter before the coroutine (line 44), the read val externalId = _identityModelStore.model.externalId (line 54) was moved inside the coroutine, after awaitInitialized() suspends. The _identityModelStore.model reference is a live reference to the current model, not the model at HYDRATE time, so it can change underneath the coroutine while it is suspended.
The specific code path that triggers it
awaitInitialized() suspends the coroutine while OperationRepo.loadSavedOperations() loads persisted operations from disk — a window that can be tens to hundreds of milliseconds on slower devices. During this window, if the developer calls OneSignal.login("userB"), the SDK replaces the IdentityModel in _identityModelStore. When awaitInitialized() resumes, _identityModelStore.model.externalId returns userB's externalId rather than the one present at HYDRATE time.
Why existing code does not prevent it
Capturing useIV from the model constructor parameter (which is a snapshot passed at HYDRATE time) is correct and safe. However, _identityModelStore.model is a live property getter that always returns the current model. There is no snapshot or copy made of the externalId before the suspension point; the read is simply deferred into the coroutine body, past the only concurrency boundary introduced by this PR.
What the impact would be
Two failure modes: (1) If userB has a valid JWT, jwtInvalidatedExternalId remains null — the original user who triggered the beta migration path never receives fireJwtInvalidated, their operations remain gated on a JWT event that will never arrive. (2) If userB has no JWT, fireJwtInvalidated("userB") fires; the developer calls updateUserJwt for userB while the original user's operations remain blocked indefinitely. This affects the beta migration path (first app open after IV is enabled) with a concurrent login.
How to fix it
Capture externalId at the same point useIV is captured, before the suspendifyOnIO block:
val useIV = model.useIdentityVerification
val externalId = _identityModelStore.model.externalId // capture before coroutine
suspendifyOnIO {
_operationRepo.awaitInitialized()
if (useIV == true) {
_operationRepo.removeOperationsWithoutExternalId()
if (externalId \!= null && _jwtTokenStore.getJwt(externalId) == null) {
jwtInvalidatedExternalId = externalId
}
}
...
}Step-by-step proof
- App starts with IV=true; config HYDRATE arrives.
onModelReplacedfires withuseIV=true. The coroutine is launched and suspends atawaitInitialized(). awaitInitialized()is pending (loading saved ops from disk). During this window, the developer callsOneSignal.login("userB").- The SDK replaces
_identityModelStore.modelwith userB'sIdentityModel;_identityModelStore.model.externalIdnow returns"userB". awaitInitialized()completes. The coroutine resumes and readsval externalId = _identityModelStore.model.externalId→"userB".- If userB has no JWT:
fireJwtInvalidated("userB")fires. Developer provides JWT for userB. The original user's ops are never unblocked. - If userB has a valid JWT:
jwtInvalidatedExternalIdstays null. The original user who triggered the IV migration never gets the event. Operations for the original user remain blocked forever.
| <ID>LongParameterList:IOutcomeEventsBackendService.kt$IOutcomeEventsBackendService$( appId: String, userId: String, subscriptionId: String, deviceType: String, direct: Boolean?, event: OutcomeEvent, )</ID> | ||
| <ID>LongParameterList:IParamsBackendService.kt$ParamsObject$( var googleProjectNumber: String? = null, var enterprise: Boolean? = null, var useIdentityVerification: Boolean? = null, var notificationChannels: JSONArray? = null, var firebaseAnalytics: Boolean? = null, var restoreTTLFilter: Boolean? = null, var clearGroupOnSummaryClick: Boolean? = null, var receiveReceiptEnabled: Boolean? = null, var disableGMSMissingPrompt: Boolean? = null, var unsubscribeWhenNotificationsDisabled: Boolean? = null, var locationShared: Boolean? = null, var requiresUserPrivacyConsent: Boolean? = null, var opRepoExecutionInterval: Long? = null, var influenceParams: InfluenceParamsObject, var fcmParams: FCMParamsObject, )</ID> | ||
| <ID>LongParameterList:IUserBackendService.kt$IUserBackendService$( appId: String, aliasLabel: String, aliasValue: String, properties: PropertiesObject, refreshDeviceMetadata: Boolean, propertyiesDelta: PropertiesDeltasObject, jwt: String? = null, )</ID> | ||
| <ID>LongParameterList:LoginUserOperationExecutor.kt$LoginUserOperationExecutor$( private val _identityOperationExecutor: IdentityOperationExecutor, private val _application: IApplicationService, private val _deviceService: IDeviceService, private val _userBackend: IUserBackendService, private val _identityModelStore: IdentityModelStore, private val _propertiesModelStore: PropertiesModelStore, private val _subscriptionsModelStore: SubscriptionModelStore, private val _configModelStore: ConfigModelStore, private val _languageContext: ILanguageContext, private val _jwtTokenStore: JwtTokenStore, )</ID> | ||
| <ID>LongParameterList:LoginUserOperationExecutor.kt$LoginUserOperationExecutor$( private val _identityOperationExecutor: IdentityOperationExecutor, private val _application: IApplicationService, private val _deviceService: IDeviceService, private val _userBackend: IUserBackendService, private val _identityModelStore: IdentityModelStore, private val _propertiesModelStore: PropertiesModelStore, private val _subscriptionsModelStore: SubscriptionModelStore, private val _configModelStore: ConfigModelStore, private val _languageContext: ILanguageContext, private val _jwtTokenStore: JwtTokenStore, private val _consistencyManager: IConsistencyManager, )</ID> | ||
| <ID>LongParameterList:OutcomeEventsController.kt$OutcomeEventsController$( private val _session: ISessionService, private val _influenceManager: IInfluenceManager, private val _outcomeEventsCache: IOutcomeEventsRepository, private val _outcomeEventsPreferences: IOutcomeEventsPreferences, private val _outcomeEventsBackend: IOutcomeEventsBackendService, private val _configModelStore: ConfigModelStore, private val _identityModelStore: IdentityModelStore, private val _subscriptionManager: ISubscriptionManager, private val _deviceService: IDeviceService, private val _time: ITime, )</ID> | ||
| <ID>LongParameterList:RefreshUserOperationExecutor.kt$RefreshUserOperationExecutor$( private val _userBackend: IUserBackendService, private val _identityModelStore: IdentityModelStore, private val _propertiesModelStore: PropertiesModelStore, private val _subscriptionsModelStore: SubscriptionModelStore, private val _configModelStore: ConfigModelStore, private val _buildUserService: IRebuildUserService, private val _newRecordState: NewRecordsState, private val _jwtTokenStore: JwtTokenStore, )</ID> | ||
| <ID>LongParameterList:SubscriptionObject.kt$SubscriptionObject$( val id: String? = null, val type: SubscriptionObjectType? = null, val token: String? = null, val enabled: Boolean? = null, val notificationTypes: Int? = null, val sdk: String? = null, val deviceModel: String? = null, val deviceOS: String? = null, val rooted: Boolean? = null, val netType: Int? = null, val carrier: String? = null, val appVersion: String? = null, )</ID> |
There was a problem hiding this comment.
🟡 The PR correctly updates the LongParameterList baseline entry for LoginUserOperationExecutor to include _consistencyManager: IConsistencyManager, but omits the required ConstructorParameterNaming suppression entry for that parameter. All 10 other underscore-prefixed constructor parameters in LoginUserOperationExecutor have individual ConstructorParameterNaming entries, and both peer executors (SubscriptionOperationExecutor, UpdateUserOperationExecutor) have the analogous entry for their own _consistencyManager parameters. Add the missing baseline entry to prevent the detekt CI check from failing.
Extended reasoning...
What the bug is and how it manifests
The detekt static-analysis tool enforces a ConstructorParameterNaming rule that flags underscore-prefixed constructor parameters (e.g. private val _consistencyManager). These violations are suppressed by adding per-parameter entries to the detekt baseline XML file. This PR adds private val _consistencyManager: IConsistencyManager as a new constructor parameter to LoginUserOperationExecutor and correctly updates the LongParameterList baseline entry to reflect the new parameter, but does not add the corresponding ConstructorParameterNaming entry. When detekt runs in CI baseline-check mode, any violation not present in the baseline XML is treated as a new unbaselined finding and fails the build.
The specific code path that triggers it
When the CI pipeline runs ./gradlew detekt (or the equivalent baseline-check task) against LoginUserOperationExecutor.kt, detekt scans the constructor and finds private val _consistencyManager: IConsistencyManager. It raises a ConstructorParameterNaming violation. The baseline checker compares this against detekt-baseline-core.xml and finds no matching suppression entry, so it reports one new unbaselined violation and returns a non-zero exit code.
Why existing code does not prevent it
The LongParameterList and ConstructorParameterNaming rules are tracked independently in the XML, under separate <ID> entries. Updating the LongParameterList entry (which lists all parameters in a single string) has no effect on the ConstructorParameterNaming rule, which requires one entry per parameter. The two-rule pattern is well-established in this codebase: every other underscore-prefixed parameter in LoginUserOperationExecutor (_application, _configModelStore, _deviceService, _identityModelStore, _identityOperationExecutor, _jwtTokenStore, _languageContext, _propertiesModelStore, _subscriptionsModelStore, _userBackend) has its own ConstructorParameterNaming entry. The omission of _consistencyManager is an oversight.
What the impact would be
The detekt CI check will fail after this PR is merged, blocking the release build pipeline. There is no functional impact on SDK behavior; this is purely a build-tooling issue. The build failure is deterministic and will reproduce on every CI run until the entry is added.
How to fix it
Add the following entry to the <CurrentIssues> section of OneSignalSDK/detekt/detekt-baseline-core.xml, in alphabetical order among the other ConstructorParameterNaming:LoginUserOperationExecutor entries:
<ID>ConstructorParameterNaming:LoginUserOperationExecutor.kt$LoginUserOperationExecutor$private val _consistencyManager: IConsistencyManager</ID>Step-by-step proof
- PR adds
private val _consistencyManager: IConsistencyManagertoLoginUserOperationExecutor's constructor. - The
LongParameterListbaseline entry is updated to include_consistencyManager— this is correct. - Detekt runs
ConstructorParameterNamingon the constructor; all 10 existing underscore-prefixed parameters are suppressed via baseline entries. - Detekt finds
_consistencyManagerwith no matchingConstructorParameterNamingbaseline entry. - Detekt reports one new unbaselined violation; CI baseline-check mode fails the build.
- For comparison,
SubscriptionOperationExecutorandUpdateUserOperationExecutorboth already haveConstructorParameterNamingentries for their_consistencyManagerparameters, confirming the expected pattern.
fbc53f2 to
fb2353f
Compare
IdentityVerificationService.onModelReplaced (config HYDRATE) could fire before OperationRepo.loadSavedOperations() finished, causing removeOperationsWithoutExternalId() to run against an empty queue. Wrap the HYDRATE handler in suspendifyOnIO + awaitInitialized() so the purge waits for the queue to be fully populated, following the same pattern as RecoverFromDroppedLoginBug.
fireJwtInvalidated now buffers the externalId under a synchronized lock when no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the first IUserJwtInvalidatedListener is added. Clears pending event on user switch (onModelReplaced) to prevent stale replay.
LoginUserOperationExecutor.createUser() was not passing the RYW token from the backend response to the ConsistencyManager, causing the IamFetchRywTokenKey condition to never resolve after login. This meant InAppMessagesManager coroutines awaited forever and IAMs were never fetched for the logged-in user. - Add rywData field to CreateUserResponse - Parse ryw_token/ryw_delay in JSONConverter.convertToCreateUserResponse - Set RYW data in ConsistencyManager after successful createUser - Fix resolveConditionsWithID to only remove matching conditions - Wrap resolveConditionsWithID in mutex for thread safety
When an IAM fetch returns an unauthorized response (401 or 403), the SDK now saves the pending fetch state and automatically retries once the JWT is refreshed for the same user via updateUserJwt. Switching users clears any stale retry. - Add IJwtUpdateListener to JwtTokenStore for post-putJwt notification - InAppMessagesManager subscribes and retries on JWT update - Reset rate-limiter on 401 so retry is not throttled - Use @volatile on cross-thread fields (lastTimeFetchedIAMs, pendingJwtRetryExternalId, pendingJwtRetryRywData) - RYW-aware fetches bypass rate limiter to avoid stale data - Handle 401 in fetchInAppMessagesWithoutRywToken fallback path
5d860c9 to
2784adf
Compare
|
Closing in favor of a clean PR — the review history here is noisy from iterative bot feedback cycles. Reopening with the same changes squashed into 4 logical commits. |
|
|
||
| fetchIAMMutex.withLock { | ||
| val now = _time.currentTimeMillis | ||
| if (lastTimeFetchedIAMs != null && (now - lastTimeFetchedIAMs!!) < _configModelStore.model.fetchIAMMinInterval) { | ||
| if (rywData == null && lastTimeFetchedIAMs != null && (now - lastTimeFetchedIAMs!!) < _configModelStore.model.fetchIAMMinInterval) { | ||
| return | ||
| } | ||
|
|
There was a problem hiding this comment.
🔴 Inside fetchIAMMutex.withLock, the @volatile field lastTimeFetchedIAMs is read twice in one &&-expression: once for the null check (lastTimeFetchedIAMs != null) and once for the arithmetic (now - lastTimeFetchedIAMs!!). The 401 catch block at line 352 writes lastTimeFetchedIAMs = null outside the mutex, creating a TOCTOU where an OS-level preemption between the two JVM GETFIELD instructions can turn a passing null-check into a NullPointerException. Fix: read into a local val before the condition: val lastFetched = lastTimeFetchedIAMs; if (rywData == null && lastFetched != null && (now - lastFetched) < threshold).
Extended reasoning...
What the bug is and how it manifests
The PR adds @volatile to lastTimeFetchedIAMs (line 121) and adds a write of lastTimeFetchedIAMs = null in the 401 catch block (line 352) outside fetchIAMMutex. Inside the mutex withLock body (lines 326-332), the expression reads the same @volatile field twice: first for lastTimeFetchedIAMs != null and then for lastTimeFetchedIAMs!!. These are separate JVM GETFIELD bytecode operations. Because @volatile guarantees each read goes directly to main memory, a write on another thread between the two reads is immediately visible, making the null check and the dereference non-atomic.
The specific code path that triggers it
On Dispatchers.IO (a thread pool), multiple coroutines can execute concurrently on separate OS threads. Coroutine B sets lastTimeFetchedIAMs = now_B inside the mutex, then exits the lock and begins its HTTP call. The 401 response eventually arrives and Coroutine B's catch block (which runs outside the mutex) writes lastTimeFetchedIAMs = null. If Thread 2 is running Coroutine A's withLock body and the OS preempts Thread 2 between the two GETFIELD instructions—right after Coroutine A reads non-null for the null check, but before it reads the field again for !!—Thread 1 can write null. Thread 2 then resumes, executes lastTimeFetchedIAMs!! on a null Long?, and crashes with NullPointerException.
Addressing the refutation
The refutation argues the race requires an HTTP round-trip (~100ms) to complete within the nanosecond window between two bytecodes, which is 'essentially impossible.' This argument conflates two separate events. The HTTP call does not need to complete during the nanosecond window—it only needs to have completed before Coroutine A enters withLock. The race scenario is: Coroutine B already received its 401 response and its catch-block coroutine is scheduled on the IO thread pool but has not yet been dispatched; simultaneously Coroutine A acquires the mutex and begins the non-suspending withLock body. The OS can preempt Thread 2 between any two machine instructions; no 100ms event needs to happen within nanoseconds. On Android (ARM multi-core), OS thread preemption between two ldar instructions (volatile reads) is a legitimate JVM memory model hazard.
That said, the practical probability is low. The window is narrow, and in common test scenarios the race does not manifest. However: (1) @volatile was added in this very PR—without fixing the double-read pattern it creates a false sense of correctness; (2) the fix is a single-line change with zero downside; (3) the two GETFIELDs on a nullable boxed Long? are distinct memory operations that the JVM compiler cannot merge into one, making the race structurally real regardless of probability.
Why existing synchronization does not prevent it
fetchIAMMutex.withLock blocks other coroutines from entering the locked block, but does not prevent OS threads from executing code outside the mutex concurrently. The 401 catch block is intentionally outside the mutex. @volatile ensures the null write is immediately visible to all threads—which is precisely what makes the race dangerous rather than masked by caching.
What the impact would be
A crash (NullPointerException) in fetchMessages on a production device. The stack trace would point into the rate-limiter check inside fetchIAMMutex.withLock, which may be difficult to diagnose without knowing the race.
How to fix it
Read the volatile field exactly once into a local val before the compound boolean expression:
val lastFetched = lastTimeFetchedIAMs
if (rywData == null && lastFetched != null && (now - lastFetched) < _configModelStore.model.fetchIAMMinInterval) {
return
}
A local val is never subject to concurrent writes, so the null check and arithmetic operate on the same captured value. This is the standard JMM safe-access pattern for nullable volatile fields.
| override fun onJwtUpdated(externalId: String) { | ||
| val retryExternalId = pendingJwtRetryExternalId ?: return | ||
| if (externalId != retryExternalId) return | ||
|
|
||
| val retryRywData = pendingJwtRetryRywData | ||
|
|
||
| Logging.debug("InAppMessagesManager.onJwtUpdated: JWT refreshed for $externalId, retrying IAM fetch") | ||
| pendingJwtRetryExternalId = null | ||
| pendingJwtRetryRywData = null | ||
|
|
||
| suspendifyOnIO { | ||
| fetchMessages(retryRywData) | ||
| } | ||
| } |
There was a problem hiding this comment.
🔴 In onJwtUpdated(), the pending retry state is cleared (pendingJwtRetryExternalId = null, pendingJwtRetryRywData = null) before launching suspendifyOnIO { fetchMessages(retryRywData) }. If the app is backgrounded when onJwtUpdated fires—a realistic scenario when the developer calls updateUserJwt() in response to a push notification—fetchMessages() hits the early-return guard and exits without fetching. Since the pending state was already cleared and onFocus() is a no-op, there is no recovery path for the rest of the session. Fix: defer clearing the pending state until after the foreground check passes inside the coroutine, or re-arm the retry in onFocus().
Extended reasoning...
What the bug is and how it manifests
In InAppMessagesManager.onJwtUpdated() (lines 1055-1061), the method clears pendingJwtRetryExternalId = null and pendingJwtRetryRywData = null synchronously on the calling thread, then launches suspendifyOnIO { fetchMessages(retryRywData) }. Inside fetchMessages(), the first guard is if (!_applicationService.isInForeground) { return } (line ~304). If the app is backgrounded at the moment onJwtUpdated fires, the pending state has already been wiped but the actual fetch exits immediately—leaving the IAM retry permanently lost.
The specific code path that triggers it
- fetchMessages() is called while app is in foreground. A 401 response arrives and the catch block stores pendingJwtRetryExternalId = externalId and pendingJwtRetryRywData = rywData.
- The app moves to the background (user switches apps, or a push notification triggers background processing).
- The developer calls OneSignal.updateUserJwt(externalId, newJwt) from a background service or receiver—the expected response to onUserJwtInvalidated.
- JwtTokenStore.putJwt fires jwtUpdateNotifier.fire { it.onJwtUpdated(externalId) }.
- InAppMessagesManager.onJwtUpdated executes: reads retryExternalId (non-null, matches), then unconditionally sets pendingJwtRetryExternalId = null and pendingJwtRetryRywData = null, then calls suspendifyOnIO { fetchMessages(retryRywData) }.
- The scheduled coroutine runs on Dispatchers.IO; fetchMessages evaluates !_applicationService.isInForeground → true → return.
- The retry is permanently dropped.
Why existing code does not prevent it
onFocus(firedOnSubscribe: Boolean) at line 1063 is an explicit no-op override. It performs no fetch and has no retry logic. There is no other foreground-reentry hook in InAppMessagesManager that checks the pending retry fields. The only other recovery path is onSessionStarted(), which requires the app to go to background long enough to trigger a new session (typically 30 seconds), which may be a significant delay. The @volatile annotations on the pending fields ensure memory visibility but provide no protection against the ordering problem: the clear happens before the coroutine is dispatched, and the coroutine sees isInForeground = false with no way to know there was a pending retry.
What the impact would be
Users in apps with identity verification enabled who experience a JWT expiry while backgrounded—and whose developers correctly call updateUserJwt from a background context (push notification handler, WorkManager job, etc.)—will silently receive no in-app messages for the rest of the session. This is a realistic, common Android scenario where short-lived JWTs and background processing are normal. No error is surfaced to the developer or end user.
How to fix it
Option A (minimal): Only clear the pending state after the foreground check succeeds inside the coroutine:
suspendifyOnIO {
if (_applicationService.isInForeground) {
pendingJwtRetryExternalId = null
pendingJwtRetryRywData = null
fetchMessages(retryRywData)
}
// else: pending state remains set; will retry when app resumes
}
Option B: Override onFocus() to check the pending fields and call suspendifyOnIO { fetchMessages(retryRywData) } if they are set, re-arming the retry on foreground.
Step-by-step proof
- App in foreground: onSessionStarted() → fetchMessages() → HTTP 401 → catch block sets pendingJwtRetryExternalId = "alice", pendingJwtRetryRywData = rywData.
- User presses home; _applicationService.isInForeground becomes false.
- Push notification arrives; developer receiver calls OneSignal.updateUserJwt("alice", newJwt).
- JwtTokenStore.putJwt → onJwtUpdated("alice") runs on the binder thread.
- onJwtUpdated: reads retryExternalId = "alice" (passes guard), then sets pendingJwtRetryExternalId = null and pendingJwtRetryRywData = null, then launches coroutine.
- Coroutine on IO pool: fetchMessages(rywData) → !isInForeground is true → return.
- User returns to the app within session timeout. onFocus() fires → no-op.
- pendingJwtRetryExternalId is null; no retry is ever triggered. IAMs are never shown for this session.
Description
One Line Summary
Fix multiple issues with identity verification (IV) enabled: startup race conditions, missing JWT invalidated callback, stuck IAM fetch after login, and IAM retry on expired JWT.
Details
Motivation
With identity verification enabled, several issues surfaced during testing:
onUserJwtInvalidatedlistener may not be called with JWT was missing at SDK init timeScope
All changes are gated behind identity verification being enabled (
useIdentityVerification == true). Normal (non-IV) SDK behavior is unaffected.Commit 1 -- Fix race condition: purge anonymous ops AFTER queue is loaded
IdentityVerificationService.onModelReplaced(config HYDRATE) could fire beforeOperationRepo.loadSavedOperations()finished, causingremoveOperationsWithoutExternalId()to run against an empty queuesuspendifyOnIO+awaitInitialized(), following the same pattern asRecoverFromDroppedLoginBugCommit 2 -- Replay JWT invalidated event to late-registered listeners
fireJwtInvalidatednow buffers theexternalIdwhen no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the firstIUserJwtInvalidatedListeneris addedCommit 3 -- Fix IAM fetch stuck after login with IV enabled
LoginUserOperationExecutor.createUser()was not passing the RYW token from the backend response to theConsistencyManager, causingIamFetchReadyConditionto not resolverywDatafield toCreateUserResponse, parsedryw_token/ryw_delayinJSONConverter, and set RYW data inConsistencyManagerafter successfulcreateUserCommit 4 -- Retry IAM fetch after JWT refresh on 401/403 response
InAppBackendServicenow throwsBackendExceptionon unauthorized (401/403) responses instead of returningnullInAppMessagesManagercatches the exception, stores a pending retry state, and retries the fetch whenJwtTokenStorenotifies that the JWT has been refreshed for the same userTesting
Unit testing
LoginUserOperationExecutorTeststo include the newIConsistencyManagerdependency (all 16 constructor calls)InAppBackendServiceTests: 401 response throwsBackendExceptionInAppBackendServiceTests: 401 from fallback (no RYW token) path throwsBackendExceptionInAppMessagesManagerTestscovering the JWT 401 retry flow:Manual testing
updateUserJwtis calledAffected code checklist
Checklist
Overview
Testing
Final pass