Skip to content

fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2609

Closed
nan-li wants to merge 4 commits intofeat/identity_verification_5.8from
fix/jwt_misc_race_conditions
Closed

fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2609
nan-li wants to merge 4 commits intofeat/identity_verification_5.8from
fix/jwt_misc_race_conditions

Conversation

@nan-li
Copy link
Copy Markdown
Contributor

@nan-li nan-li commented Apr 9, 2026

Description

One Line Summary

Fix multiple issues with identity verification (IV) enabled: startup race conditions, missing JWT invalidated callback, stuck IAM fetch after login, and IAM retry on expired JWT.

Details

Motivation

With identity verification enabled, several issues surfaced during testing:

  • Anonymous operations were sometimes not purged on cold start due to timing
  • The onUserJwtInvalidated listener may not be called with JWT was missing at SDK init time
  • In-App Messages may not be fetched after the first login
  • If a JWT expired mid-session, the IAM fetch could silently failed with no recovery

Scope

All changes are gated behind identity verification being enabled (useIdentityVerification == true). Normal (non-IV) SDK behavior is unaffected.

Commit 1 -- Fix race condition: purge anonymous ops AFTER queue is loaded

  • IdentityVerificationService.onModelReplaced (config HYDRATE) could fire before OperationRepo.loadSavedOperations() finished, causing removeOperationsWithoutExternalId() to run against an empty queue
  • Fixed by wrapping the HYDRATE handler in suspendifyOnIO + awaitInitialized(), following the same pattern as RecoverFromDroppedLoginBug

Commit 2 -- Replay JWT invalidated event to late-registered listeners

  • fireJwtInvalidated now buffers the externalId when no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the first IUserJwtInvalidatedListener is added

Commit 3 -- Fix IAM fetch stuck after login with IV enabled

  • LoginUserOperationExecutor.createUser() was not passing the RYW token from the backend response to the ConsistencyManager, causing IamFetchReadyCondition to not resolve
  • Added rywData field to CreateUserResponse, parsed ryw_token/ryw_delay in JSONConverter, and set RYW data in ConsistencyManager after successful createUser

Commit 4 -- Retry IAM fetch after JWT refresh on 401/403 response

  • InAppBackendService now throws BackendException on unauthorized (401/403) responses instead of returning null
  • InAppMessagesManager catches the exception, stores a pending retry state, and retries the fetch when JwtTokenStore notifies that the JWT has been refreshed for the same user
  • Pending retry is cleared on user switch to avoid stale retries

Testing

Unit testing

  • Updated LoginUserOperationExecutorTests to include the new IConsistencyManager dependency (all 16 constructor calls)
  • Added test in InAppBackendServiceTests: 401 response throws BackendException
  • Added test in InAppBackendServiceTests: 401 from fallback (no RYW token) path throws BackendException
  • Added 5 tests in InAppMessagesManagerTests covering the JWT 401 retry flow:
    • 401 stores pending retry state
    • JWT refresh for matching user triggers retry
    • JWT refresh for different user does not retry
    • No-op when no pending retry exists
    • User switch clears pending retry state

Manual testing

  • Tested with IV enabled on a fresh install: verified no IAM fetch for anonymous user, IAM fetched after login
  • Tested JWT expiry mid-session: verified IAM fetch retries after updateUserJwt is called
  • Tested login/logout cycle: verified pending retry is cleared on user switch

Affected code checklist

  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing (fix IV-related issues for 5.8 release)
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible
  • I have reviewed this PR myself, ensuring it meets each checklist item

@nan-li nan-li changed the title Fix/jwt misc race conditions fix(jwt): misc race conditions Apr 9, 2026
@nan-li nan-li changed the title fix(jwt): misc race conditions fix(jwt): Fix race conditions and reliability issues with identity verification enabled Apr 9, 2026
@nan-li nan-li changed the title fix(jwt): Fix race conditions and reliability issues with identity verification enabled fix(jwt): Fix race conditions and IAM reliability issues with identity verification enabled Apr 9, 2026
@nan-li nan-li changed the title fix(jwt): Fix race conditions and IAM reliability issues with identity verification enabled fix(jwt): race conditions and IAM reliability issues with identity verification enabled Apr 9, 2026
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch 2 times, most recently from 917e157 to c61f42e Compare April 9, 2026 20:01
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch 2 times, most recently from 2b6e2cf to 60c8d9a Compare April 10, 2026 03:01
@nan-li nan-li requested review from a team and abdulraqeeb33 April 10, 2026 04:22
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude review

@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 650e96f to cfc3e6e Compare April 10, 2026 18:01
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude review

@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude review

@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 13, 2026

@claude review

@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 89393ca to 071ecef Compare April 13, 2026 15:03
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 13, 2026

@claude review

@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 4c0feb5 to 86b6c36 Compare April 13, 2026 20:01
@OneSignal OneSignal deleted a comment from Claude AI Apr 13, 2026
@OneSignal OneSignal deleted a comment from Claude AI Apr 13, 2026
@OneSignal OneSignal deleted a comment from Claude AI Apr 13, 2026
@OneSignal OneSignal deleted a comment from Claude AI Apr 13, 2026
@OneSignal OneSignal deleted a comment from Claude AI Apr 13, 2026
@OneSignal OneSignal deleted a comment from Claude AI Apr 13, 2026
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 13, 2026

@claude review

@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 13, 2026

@claude review

@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 6f91de9 to 5d860c9 Compare April 13, 2026 21:16
Comment on lines +135 to +148
private var pendingJwtRetryExternalId: String? = null

@Volatile
private var pendingJwtRetryRywData: RywData? = null

private val identityModelChangeHandler =
object : ISingletonModelStoreChangeHandler<IdentityModel> {
override fun onModelReplaced(
model: IdentityModel,
tag: String,
) { }
) {
pendingJwtRetryExternalId = null
pendingJwtRetryRywData = null
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 In onJwtUpdated, pendingJwtRetryExternalId (R1) and pendingJwtRetryRywData (R2) are two separate @Volatile reads with no synchronization between them, while identityModelChangeHandler.onModelReplaced zeroes both fields without any lock. A concurrent user switch between R1 and R2 leaves onJwtUpdated with a non-null retryExternalId from the old user but a null retryRywData, causing fetchMessages(null) to fire under the new user's identity context. To fix, add a shared lock (e.g. synchronized(pendingJwtRetryLock)) spanning the read-check-clear block in onJwtUpdated and the two null writes in onModelReplaced, mirroring the synchronized(jwtInvalidatedLock) pattern already applied to UserManager in this same PR.

Extended reasoning...

What the bug is and how it manifests

InAppMessagesManager stores two @Volatile fields — pendingJwtRetryExternalId and pendingJwtRetryRywData — as the retry state for IAM fetches that received a 401 response. In onJwtUpdated (lines 1039–1059), the method first reads pendingJwtRetryExternalId (R1) and uses it as a guard: if null, it returns early; if mismatched, it returns. Then, as a second independent volatile read, it reads pendingJwtRetryRywData (R2) into retryRywData. Concurrently, identityModelChangeHandler.onModelReplaced (lines 145–148) writes pendingJwtRetryExternalId = null and then pendingJwtRetryRywData = null without holding any lock.

The specific code path that triggers it

Race sequence:

  1. Thread A (onJwtUpdated("alice")): R1 reads pendingJwtRetryExternalId = "alice" — passes the ?: return guard.
  2. Thread A: passes the externalId != retryExternalId check.
  3. Thread B (onModelReplaced, user switch to "bob"): writes pendingJwtRetryExternalId = null, then pendingJwtRetryRywData = null.
  4. Thread A: R2 reads pendingJwtRetryRywData — gets null (Thread B already cleared it).
  5. Thread A: calls suspendifyOnIO { fetchMessages(null) } with null rywData, under the new user's identity context (since _identityModelStore.model is now bob's model).

Why existing code does not prevent it

@Volatile on both fields guarantees cross-thread memory visibility for each individual read/write, but provides no atomicity across the compound sequence (R1 → guard check → R2). The JVM memory model allows Thread B's writes to interpose between Thread A's R1 and R2. Neither field uses synchronized, AtomicReference, nor any other construct that would make the two-field check-and-read atomic with respect to the two-field clear in onModelReplaced.

What the impact would be

The spurious fetchMessages(null) fires under bob's identity context (bob's externalId, bob's aliasLabel/aliasValue, bob's subscriptionId). Inside fetchMessages, the fetchIAMMutex rate-limiter sets lastTimeFetchedIAMs = now. When the legitimate RYW-aware fetch for bob subsequently arrives (via onModelUpdated), the rate-limit check (now - lastTimeFetchedIAMs!!) < fetchIAMMinInterval (default 30 s) returns early, permanently dropping the RYW-consistent fetch for the rest of the session. Alternatively, the spurious fetch populates IAMs for bob without the RYW token, bypassing the read-your-write consistency mechanism this entire PR is designed to provide.

Step-by-step proof

  1. JWT expires for alice; fetchMessages catches the 401 and stores pendingJwtRetryExternalId = "alice", pendingJwtRetryRywData = rywData.
  2. App calls OneSignal.login("bob"); identityModelChangeHandler.onModelReplaced is invoked on Thread B.
  3. Simultaneously, OneSignal.updateUserJwt("alice", newJwt) fires onJwtUpdated("alice") on Thread A.
  4. Thread A completes R1 (pendingJwtRetryExternalId = "alice", non-null, passes guard).
  5. Thread B executes: pendingJwtRetryExternalId = null, pendingJwtRetryRywData = null.
  6. Thread A executes R2: val retryRywData = pendingJwtRetryRywData → null.
  7. Thread A clears pending state (already null) and calls fetchMessages(null) under bob's identity.
  8. fetchMessages(null) acquires fetchIAMMutex, passes the rate-limit check (or doesn't, depending on timing), and sets lastTimeFetchedIAMs = now.
  9. Bob's legitimate RYW-consistent fetch arrives and is blocked by the 30-second rate limiter.

How to fix it

Introduce a plain Any() lock object (e.g. private val pendingJwtRetryLock = Any()) and use synchronized(pendingJwtRetryLock) to protect the compound read-check-clear in onJwtUpdated and the two null writes in onModelReplaced. This is the same pattern the PR already applies to UserManager with jwtInvalidatedLock, confirming the intent — the analogous guard was simply missed in InAppMessagesManager.

Comment on lines 68 to 74
<ID>LongMethod:InAppRepository.kt$InAppRepository$override suspend fun cleanCachedInAppMessages()</ID>
<ID>LongParameterList:IInAppBackendService.kt$IInAppBackendService$( appId: String, subscriptionId: String, variantId: String?, messageId: String, clickId: String?, isFirstClick: Boolean, )</ID>
<ID>LongParameterList:InAppDisplayer.kt$InAppDisplayer$( private val _applicationService: IApplicationService, private val _lifecycle: IInAppLifecycleService, private val _promptFactory: IInAppMessagePromptFactory, private val _backend: IInAppBackendService, private val _influenceManager: IInfluenceManager, private val _configModelStore: ConfigModelStore, private val _languageContext: ILanguageContext, private val _time: ITime, )</ID>
<ID>LongParameterList:InAppMessagesManager.kt$InAppMessagesManager$( private val _applicationService: IApplicationService, private val _sessionService: ISessionService, private val _influenceManager: IInfluenceManager, private val _configModelStore: ConfigModelStore, private val _userManager: IUserManager, private val _identityModelStore: IdentityModelStore, private val _subscriptionManager: ISubscriptionManager, private val _outcomeEventsController: IOutcomeEventsController, private val _state: InAppStateService, private val _prefs: IInAppPreferencesController, private val _repository: IInAppRepository, private val _backend: IInAppBackendService, private val _triggerController: ITriggerController, private val _triggerModelStore: TriggerModelStore, private val _displayer: IInAppDisplayer, private val _lifecycle: IInAppLifecycleService, private val _languageContext: ILanguageContext, private val _time: ITime, private val _consistencyManager: IConsistencyManager, )</ID>
<ID>LongParameterList:InAppMessagesManager.kt$InAppMessagesManager$( private val _applicationService: IApplicationService, private val _sessionService: ISessionService, private val _influenceManager: IInfluenceManager, private val _configModelStore: ConfigModelStore, private val _userManager: IUserManager, private val _identityModelStore: IdentityModelStore, private val _subscriptionManager: ISubscriptionManager, private val _outcomeEventsController: IOutcomeEventsController, private val _state: InAppStateService, private val _prefs: IInAppPreferencesController, private val _repository: IInAppRepository, private val _backend: IInAppBackendService, private val _triggerController: ITriggerController, private val _triggerModelStore: TriggerModelStore, private val _displayer: IInAppDisplayer, private val _lifecycle: IInAppLifecycleService, private val _languageContext: ILanguageContext, private val _time: ITime, private val _consistencyManager: IConsistencyManager, private val _jwtTokenStore: JwtTokenStore, )</ID>
<ID>LongParameterList:OneSignalAnimate.kt$OneSignalAnimate$( view: View, deltaFromY: Float, deltaToY: Float, duration: Int, interpolator: Interpolator?, animCallback: Animation.AnimationListener?, )</ID>
<ID>MagicNumber:DraggableRelativeLayout.kt$DraggableRelativeLayout$3</ID>
<ID>MagicNumber:DraggableRelativeLayout.kt$DraggableRelativeLayout$3000</ID>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The PR adds _jwtTokenStore: JwtTokenStore as a new constructor parameter to InAppMessagesManager and correctly updates the LongParameterList baseline entry, but omits the corresponding ConstructorParameterNaming baseline entry. All 19 other underscore-prefixed constructor parameters have their own ConstructorParameterNaming suppression entries; _jwtTokenStore is the sole omission. Add the missing entry: ConstructorParameterNaming:InAppMessagesManager.kt$InAppMessagesManager$private val _jwtTokenStore: JwtTokenStore.

Extended reasoning...

What the bug is

The PR introduces private val _jwtTokenStore: JwtTokenStore as a new constructor parameter to InAppMessagesManager. Detekt enforces a ConstructorParameterNaming rule that flags underscore-prefixed constructor parameters. All 19 existing underscore-prefixed constructor parameters in InAppMessagesManager have their own ConstructorParameterNaming suppression entries in detekt-baseline-in-app-messages.xml (from _applicationService through _userManager), but the new _jwtTokenStore parameter has no corresponding entry.

The specific code path that triggers it

When detekt runs in CI, it scans InAppMessagesManager.kt and finds private val _jwtTokenStore: JwtTokenStore in the constructor. It raises a ConstructorParameterNaming violation. Because no baseline entry suppresses this violation, it is treated as a new unbaselined finding. Detekt baseline-check mode fails the build on any new unbaselined violations.

Why existing code does not prevent it

The PR author correctly updated the LongParameterList baseline entry (line 71) to include _jwtTokenStore. However, ConstructorParameterNaming is a separate rule requiring its own per-parameter baseline entry. The two rules are tracked independently in the XML. The omission is easy to miss because LongParameterList and ConstructorParameterNaming entries appear in different sections of the file.

What the impact would be

The detekt CI check will fail when the PR is merged, blocking the release build. No functional behavior is affected; this is purely a build-tooling issue.

How to fix it

Add the following entry to the <CurrentIssues> section of detekt-baseline-in-app-messages.xml, alphabetically between the _influenceManager and _languageContext entries:

<ID>ConstructorParameterNaming:InAppMessagesManager.kt$InAppMessagesManager$private val _jwtTokenStore: JwtTokenStore</ID>

Step-by-step proof

  1. PR adds private val _jwtTokenStore: JwtTokenStore to InAppMessagesManager constructor.
  2. Detekt scans the constructor; all 19 existing underscore-prefixed parameters are suppressed via baseline entries.
  3. Detekt finds _jwtTokenStore with no baseline entry.
  4. Detekt reports one new ConstructorParameterNaming violation.
  5. CI baseline check: new violation not in baseline → build fails.

Comment on lines +340 to +351
val newMessages =
try {
_backend.listInAppMessages(appId, aliasLabel, aliasValue, subscriptionId, rywData, sessionDurationProvider, jwt)
} catch (ex: BackendException) {
if (NetworkUtils.getResponseStatusType(ex.statusCode) == NetworkUtils.ResponseStatusType.UNAUTHORIZED) {
Logging.debug("InAppMessagesManager.fetchMessages: ${ex.statusCode} response. Will retry after JWT refresh for externalId=$externalId")
lastTimeFetchedIAMs = null
pendingJwtRetryRywData = rywData
pendingJwtRetryExternalId = externalId
}
null
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The JWT 401 retry mechanism has a temporal ordering flaw: onJwtUpdated is the sole trigger for retrying an IAM fetch after a 401, but the developer's putJwt call (which fires onJwtUpdated) can arrive before the in-flight IAM request processes its 401 response and sets pendingJwtRetryExternalId. When this happens, onJwtUpdated reads pendingJwtRetryExternalId == null and returns as a no-op; by the time the 401 catch block finally sets the sentinel, no future onJwtUpdated event will ever fire, permanently losing the IAM retry for the entire session. The fix is to check in the 401 catch block whether _jwtTokenStore.getJwt(externalId) already differs from the jwt value used in the failed request; if so, the JWT was rotated during the network call, so skip the sentinel and launch the retry immediately.

Extended reasoning...

What the bug is and how it manifests

InAppMessagesManager.fetchMessages() reads the current JWT at the start (line ~336 of InAppMessagesManager.kt), then makes an HTTP call that can stay in-flight for 100–2000 ms. The PR's 401 recovery design relies on a two-step handshake: (1) the catch block sets pendingJwtRetryExternalId = externalId after the 401 lands, and (2) onJwtUpdated later fires when the developer provides a fresh JWT, reads the sentinel, and launches the retry. If the developer's JWT provision races ahead of step (1), the entire mechanism silently fails.

The specific code path that triggers it

Concrete timeline:

  1. T=0 ms: fetchMessages() reads jwt = _jwtTokenStore.getJwt("alice") (expired). HTTP call starts.
  2. T=100 ms: An OperationRepo operation also uses the same expired JWT, receives a 401, and calls _userManager.fireJwtInvalidated("alice").
  3. T=150 ms: Developer's onUserJwtInvalidated fires. Developer calls OneSignal.updateUserJwt("alice", freshJwt)JwtTokenStore.putJwt()jwtUpdateNotifier.fire { it.onJwtUpdated("alice") }.
  4. T=150 ms: InAppMessagesManager.onJwtUpdated("alice") runs. Reads pendingJwtRetryExternalIdnull (IAM HTTP call still in-flight). Guard ?: return fires — no-op.
  5. T=300 ms: IAM HTTP response (401) arrives. Catch block executes: lastTimeFetchedIAMs = null, pendingJwtRetryRywData = rywData, pendingJwtRetryExternalId = "alice".
  6. Sentinel is now set, but onJwtUpdated("alice") will never fire again — the developer already provided the JWT once in response to the one JWT-invalidated event.

The IAM retry is permanently lost for the session.

Why existing code does not prevent it

@Volatile on pendingJwtRetryExternalId and pendingJwtRetryRywData ensures cross-thread memory visibility but does nothing to close the temporal gap between the sentinel write (in the catch block, after the network round-trip) and the notification from onJwtUpdated (which already ran and returned). There is no second chance: the SDK fires fireJwtInvalidated once per 401 event, the developer calls updateUserJwt once in response, and putJwt fires its notifier exactly once.

What the impact would be

Any session where (a) IV is enabled, (b) the JWT expires mid-session, (c) an OperationRepo operation and an in-flight IAM request both encounter the expired JWT at the same time (a near-certainty on session start), and (d) the developer responds promptly to onUserJwtInvalidated — all conditions that are simultaneously true in the common case — will silently receive no in-app messages for the remainder of the session. Recovery only happens if the app goes to background/foreground (triggering onSessionStarted/onModelReplaced) before the user would normally see the IAM.

How to fix it

In the 401 catch block, after setting the sentinel, immediately check whether the JWT in _jwtTokenStore has already been rotated since the request started:

lastTimeFetchedIAMs = null
pendingJwtRetryRywData = rywData
pendingJwtRetryExternalId = externalId
// If the JWT was already refreshed while the request was in-flight, fire the retry now
// rather than waiting for an onJwtUpdated that will never come.
if (_jwtTokenStore.getJwt(externalId) \!= jwt) {
    pendingJwtRetryExternalId = null
    pendingJwtRetryRywData = null
    suspendifyOnIO { fetchMessages(rywData) }
}

This closes the race window by checking after the fact whether the notification was missed.

Step-by-step proof

  1. fetchMessages() captures val jwt = _jwtTokenStore.getJwt("alice") = expiredToken.
  2. HTTP request is in-flight for 300 ms.
  3. OperationRepo 401 → fireJwtInvalidated("alice") → developer calls putJwt("alice", freshToken).
  4. jwtUpdateNotifier.fire { onJwtUpdated("alice") } runs synchronously on the caller's thread.
  5. onJwtUpdated: pendingJwtRetryExternalId ?: return — field is still null → returns.
  6. 300 ms later: IAM 401 response → catch block: pendingJwtRetryExternalId = "alice".
  7. _jwtTokenStore.getJwt("alice") now returns freshToken ≠ expiredToken, but the check does not exist in the current code.
  8. No further onJwtUpdated fires. IAM messages are never shown this session.

Comment on lines 140 to +148
private val identityModelChangeHandler =
object : ISingletonModelStoreChangeHandler<IdentityModel> {
override fun onModelReplaced(
model: IdentityModel,
tag: String,
) { }
) {
pendingJwtRetryExternalId = null
pendingJwtRetryRywData = null
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 When OneSignal.login("userB") is called within 30 seconds of User A's last IAM fetch, identityModelChangeHandler.onModelReplaced clears pendingJwtRetryExternalId and pendingJwtRetryRywData but omits resetting lastTimeFetchedIAMs, causing the rate-limiter to silently block User B's first IAM fetch for the remainder of the 30-second window. Fix: add lastTimeFetchedIAMs = null in onModelReplaced, alongside the existing clears for the JWT retry fields.

Extended reasoning...

What the bug is and how it manifests

identityModelChangeHandler.onModelReplaced (lines 142–148) clears pendingJwtRetryExternalId and pendingJwtRetryRywData on user switch, but omits resetting lastTimeFetchedIAMs. This field records when IAMs were last fetched and is per-user state. Omitting the reset causes the rate-limiter in fetchMessages to fire for the new user's first fetch.

The specific code path that triggers it

The rate-limiter check (line 323) is:
if (rywData == null && lastTimeFetchedIAMs != null && (now - lastTimeFetchedIAMs!!) < fetchIAMMinInterval) { return }

This guard is only bypassed when rywData != null. When createUser for the new user returns no ryw_token, resolveConditionsWithID resolves the IamFetchReadyCondition deferred with null, so fetchMessages(null) is called. With lastTimeFetchedIAMs still holding User A's fetch timestamp T and (now - T) < 30s, the condition evaluates to true and fetchMessages returns early without fetching.

Why existing code does not prevent it

onModelReplaced was explicitly designed to clear per-user fetch state on user switch — it already clears pendingJwtRetryExternalId and pendingJwtRetryRywData. The omission of lastTimeFetchedIAMs is a clear oversight. The rate-limiter was designed to prevent redundant fetches for the same user; it incorrectly fires here for a different, newly-logged-in user whose IAMs have never been fetched.

What the impact would be

In environments where the backend does not return ryw_token from createUser (the common non-RYW case), User B's IAMs are silently never fetched for up to 30 seconds after login. There is no error logged, no exception thrown — in-app messages are simply not shown. The only recovery is if another event (trigger change, subscription change) incidentally triggers fetchMessagesWhenConditionIsMet after the 30-second window expires.

How to fix it

Add lastTimeFetchedIAMs = null inside identityModelChangeHandler.onModelReplaced, alongside the existing clears:
override fun onModelReplaced(model: IdentityModel, tag: String) {
pendingJwtRetryExternalId = null
pendingJwtRetryRywData = null
lastTimeFetchedIAMs = null // add this line
}
This is consistent with the intent of the function and mirrors how UserManager.onModelReplaced clears pendingJwtInvalidatedExternalId on user switch in this same PR.

Step-by-step proof

  1. User A is logged in. A session starts; IAMs are fetched at time T. lastTimeFetchedIAMs = T.
  2. Within 30 seconds, OneSignal.login("userB") is called.
  3. identityModelChangeHandler.onModelReplaced fires: pendingJwtRetryExternalId = null, pendingJwtRetryRywData = null. lastTimeFetchedIAMs still equals T.
  4. createUser for User B succeeds with no ryw_token. resolveConditionsWithID resolves the IamFetchReadyCondition deferred with null. fetchMessages(null) is called.
  5. Rate-limiter check: rywData==null AND lastTimeFetchedIAMs!=null AND (now - T) < 30000ms → true → return early.
  6. User B's IAMs are never fetched. No error is surfaced; in-app messages are silently suppressed for the new user.

Comment on lines 228 to +234
backendSubscriptions.remove(backendSubscription)
}

if (response.rywData != null) {
_consistencyManager.setRywData(backendOneSignalId, IamFetchRywTokenKey.USER, response.rywData)
}
_consistencyManager.resolveConditionsWithID(IamFetchReadyCondition.ID)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The unconditional resolveConditionsWithID(IamFetchReadyCondition.ID) call at LoginUserOperationExecutor.kt:234 fires two concurrent IAM backend calls on every login when createUser returns rywData: Path A resolves the pre-login IamFetchReadyCondition("") with null and calls fetchMessages(null), while Path B's coroutine (scheduled by onModelUpdated) later finds the condition already stored via setRywData and calls fetchMessages(rywData). Path A wins the rate limiter first (no ryw delay, lastTimeFetchedIAMs==null), displaying potentially stale IAMs before Path B overwrites them after the ryw delay; fix by removing the unconditional resolveConditionsWithID call when rywData is present and relying solely on setRywData + checkConditionsAndComplete to resolve Path B's condition.

Extended reasoning...

What the bug is and how it manifests

The PR adds two lines at LoginUserOperationExecutor.kt:228-234 after a successful createUser:

if (response.rywData != null) {
    _consistencyManager.setRywData(backendOneSignalId, IamFetchRywTokenKey.USER, response.rywData)
}
_consistencyManager.resolveConditionsWithID(IamFetchReadyCondition.ID)  // unconditional

The resolveConditionsWithID call uses the static ID string shared by ALL IamFetchReadyCondition instances regardless of their constructor key. When login returns rywData != null, this produces two concurrent IAM backend fetches.

The specific code path that triggers it

Execution trace when createUser succeeds with rywData != null:

  1. Before login: onSessionStarted() calls fetchMessagesWhenConditionIsMet(). At that point _userManager.onesignalId == "" (local ID maps to empty string), so it registers IamFetchReadyCondition("") in ConsistencyManager and suspends at .await().

  2. On the OperationRepo thread after createUser:

    • identityModel.setStringProperty(backendOneSignalId) fires onModelUpdated synchronously. onModelUpdated calls suspendifyOnIO { ... }, which schedules Path B's coroutine on Dispatchers.IO but returns immediately — the coroutine has not started yet.
    • setRywData(backendOneSignalId, USER, rywData) stores the token under key backendOneSignalId. checkConditionsAndComplete evaluates existing conditions: only IamFetchReadyCondition("") exists, and isMet("") returns false (no entry for key ""). Nothing resolves.
    • resolveConditionsWithID("IamFetchReadyCondition") runs. It finds IamFetchReadyCondition("") (its .id matches the static constant), completes its deferred with null, and removes it. Path A's coroutine resumes with rywData = null.
  3. Path A calls fetchMessages(null). The rate-limiter check is rywData == null && lastTimeFetchedIAMs != null && elapsed < interval. Since lastTimeFetchedIAMs == null (first call), the check is false → rate limiter passes. Path A acquires fetchIAMMutex, sets lastTimeFetchedIAMs = now, and makes backend call gradle? #1 (no RYW token, stale).

  4. Path B starts on the IO pool: getRywDataFromAwaitableCondition(IamFetchReadyCondition(backendOneSignalId)) adds a new condition. checkConditionsAndComplete finds indexedTokens[backendOneSignalId] already populated from step 2 → immediately resolves the deferred with the actual rywData. Path B calls fetchMessages(rywData). The rate-limiter check is rywData == null && ... — since rywData != null, the entire condition short-circuits to false → rate limiter unconditionally bypassed. Path B makes backend call Added possibility to send status 'opened' for a message back to onesignal backend from client source code #2 (with RYW token, RYW-consistent).

Why existing code does not prevent it

The rate-limiter fix introduced in this PR (rywData == null && at InAppMessagesManager.kt:323) was designed to let RYW-aware fetches bypass the 30-second interval. But it simultaneously allows Path A (which has null rywData because it was resolved with null by the unconditional resolveConditionsWithID) to pass the rate limiter on the first call, since lastTimeFetchedIAMs is null then. Both conditions for blocking a fetch are absent for each path.

What the impact would be

  1. Wasted backend call on every new user login when identity verification is enabled and the server returns ryw_token (the common case).
  2. Stale IAMs briefly displayed: Path A has no ryw delay and completes first, setting this.messages to stale results and calling evaluateInAppMessages(). Path B completes after the ryw delay and overwrites with the RYW-consistent response.
  3. Concurrent unsynchronized writes to this.messages (a plain var field, not @Volatile) from two IO-dispatcher coroutines — no memory-visibility guarantee between them.
  4. trigger-changed flag loss: If Path A's earlySessionTriggers processing sets isTriggerChanged=true on stale message objects, Path B then replaces this.messages with fresh message objects that lack the flag.

How to fix it

Do not call resolveConditionsWithID when rywData is present. The condition registered by Path B will be self-completed by checkConditionsAndComplete inside the already-called setRywData. The corrected block:

if (response.rywData != null) {
    _consistencyManager.setRywData(backendOneSignalId, IamFetchRywTokenKey.USER, response.rywData)
    // Do NOT call resolveConditionsWithID here — setRywData already triggers checkConditionsAndComplete
    // which will immediately satisfy IamFetchReadyCondition(backendOneSignalId) once Path B registers it.
} else {
    _consistencyManager.resolveConditionsWithID(IamFetchReadyCondition.ID)
}

Step-by-step proof

  1. App starts; onesignalId is local → _userManager.onesignalId == "". fetchMessagesWhenConditionIsMet registers IamFetchReadyCondition(""), suspends at .await().
  2. LoginUserOperation executes; createUser returns with rywData = RywData("tok123", 500).
  3. On OperationRepo thread: identityModel.setStringProperty("backend-abc") → schedules Path B on IO, returns immediately.
  4. setRywData("backend-abc", USER, rywData)indexedTokens["backend-abc"] = rywData; checkConditionsAndComplete sees only IamFetchReadyCondition("")isMet("") = false, nothing resolves.
  5. resolveConditionsWithID("IamFetchReadyCondition") → finds IamFetchReadyCondition("") by static ID match, completes with null.
  6. Path A resumes: rywData = null. fetchMessages(null). Rate limiter: null == null && null != null → false → passes. fetchIAMMutex acquired, lastTimeFetchedIAMs = T1, HTTP GET without RYW token → stale IAMs shown.
  7. Path B starts on IO: registers IamFetchReadyCondition("backend-abc"); checkConditionsAndCompleteisMet("backend-abc") = true (stored in step 4) → deferred immediately resolves with rywData.
  8. Path B: fetchMessages(rywData). Rate limiter: rywData != null → bypassed unconditionally. fetchIAMMutex acquired (Path A already released it), HTTP GET with RYW token → RYW-consistent IAMs overwrite stale result.

Comment on lines 44 to +64
val useIV = model.useIdentityVerification

var jwtInvalidatedExternalId: String? = null
if (useIV == true) {
Logging.debug("IdentityVerificationService: IV enabled, purging anonymous operations")
_operationRepo.removeOperationsWithoutExternalId()
suspendifyOnIO {
_operationRepo.awaitInitialized()

val externalId = _identityModelStore.model.externalId
if (externalId != null && _jwtTokenStore.getJwt(externalId) == null) {
Logging.debug("IdentityVerificationService: IV enabled but no JWT for $externalId, will fire invalidated event after queue wake")
jwtInvalidatedExternalId = externalId
var jwtInvalidatedExternalId: String? = null
if (useIV == true) {
Logging.debug("IdentityVerificationService: IV enabled, purging anonymous operations")
_operationRepo.removeOperationsWithoutExternalId()

val externalId = _identityModelStore.model.externalId
if (externalId != null && _jwtTokenStore.getJwt(externalId) == null) {
Logging.debug("IdentityVerificationService: IV enabled but no JWT for $externalId, will fire invalidated event after queue wake")
jwtInvalidatedExternalId = externalId
}
}
}

_operationRepo.forceExecuteOperations()
_operationRepo.forceExecuteOperations()

jwtInvalidatedExternalId?.let { _userManager.fireJwtInvalidated(it) }
jwtInvalidatedExternalId?.let { _userManager.fireJwtInvalidated(it) }
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 In the PR's refactored IdentityVerificationService.onModelReplaced(), useIV is correctly captured before the suspendifyOnIO block but val externalId = _identityModelStore.model.externalId is read inside the coroutine after awaitInitialized() returns, creating a race window where a concurrent OneSignal.login("otherUser") replaces the identity model and causes fireJwtInvalidated to fire for the wrong user. Fix: capture externalId alongside useIV before the suspendifyOnIO block, matching how useIV is already captured from the model parameter.

Extended reasoning...

What the bug is and how it manifests

The PR wraps the entire HYDRATE handler body in suspendifyOnIO { _operationRepo.awaitInitialized() ... } to fix a purge-before-initialization race. While useIV = model.useIdentityVerification is correctly captured from the model parameter before the coroutine (line 44), the read val externalId = _identityModelStore.model.externalId (line 54) was moved inside the coroutine, after awaitInitialized() suspends. The _identityModelStore.model reference is a live reference to the current model, not the model at HYDRATE time, so it can change underneath the coroutine while it is suspended.

The specific code path that triggers it

awaitInitialized() suspends the coroutine while OperationRepo.loadSavedOperations() loads persisted operations from disk — a window that can be tens to hundreds of milliseconds on slower devices. During this window, if the developer calls OneSignal.login("userB"), the SDK replaces the IdentityModel in _identityModelStore. When awaitInitialized() resumes, _identityModelStore.model.externalId returns userB's externalId rather than the one present at HYDRATE time.

Why existing code does not prevent it

Capturing useIV from the model constructor parameter (which is a snapshot passed at HYDRATE time) is correct and safe. However, _identityModelStore.model is a live property getter that always returns the current model. There is no snapshot or copy made of the externalId before the suspension point; the read is simply deferred into the coroutine body, past the only concurrency boundary introduced by this PR.

What the impact would be

Two failure modes: (1) If userB has a valid JWT, jwtInvalidatedExternalId remains null — the original user who triggered the beta migration path never receives fireJwtInvalidated, their operations remain gated on a JWT event that will never arrive. (2) If userB has no JWT, fireJwtInvalidated("userB") fires; the developer calls updateUserJwt for userB while the original user's operations remain blocked indefinitely. This affects the beta migration path (first app open after IV is enabled) with a concurrent login.

How to fix it

Capture externalId at the same point useIV is captured, before the suspendifyOnIO block:

val useIV = model.useIdentityVerification
val externalId = _identityModelStore.model.externalId  // capture before coroutine
suspendifyOnIO {
    _operationRepo.awaitInitialized()
    if (useIV == true) {
        _operationRepo.removeOperationsWithoutExternalId()
        if (externalId \!= null && _jwtTokenStore.getJwt(externalId) == null) {
            jwtInvalidatedExternalId = externalId
        }
    }
    ...
}

Step-by-step proof

  1. App starts with IV=true; config HYDRATE arrives. onModelReplaced fires with useIV=true. The coroutine is launched and suspends at awaitInitialized().
  2. awaitInitialized() is pending (loading saved ops from disk). During this window, the developer calls OneSignal.login("userB").
  3. The SDK replaces _identityModelStore.model with userB's IdentityModel; _identityModelStore.model.externalId now returns "userB".
  4. awaitInitialized() completes. The coroutine resumes and reads val externalId = _identityModelStore.model.externalId"userB".
  5. If userB has no JWT: fireJwtInvalidated("userB") fires. Developer provides JWT for userB. The original user's ops are never unblocked.
  6. If userB has a valid JWT: jwtInvalidatedExternalId stays null. The original user who triggered the IV migration never gets the event. Operations for the original user remain blocked forever.

Comment on lines 213 to 219
<ID>LongParameterList:IOutcomeEventsBackendService.kt$IOutcomeEventsBackendService$( appId: String, userId: String, subscriptionId: String, deviceType: String, direct: Boolean?, event: OutcomeEvent, )</ID>
<ID>LongParameterList:IParamsBackendService.kt$ParamsObject$( var googleProjectNumber: String? = null, var enterprise: Boolean? = null, var useIdentityVerification: Boolean? = null, var notificationChannels: JSONArray? = null, var firebaseAnalytics: Boolean? = null, var restoreTTLFilter: Boolean? = null, var clearGroupOnSummaryClick: Boolean? = null, var receiveReceiptEnabled: Boolean? = null, var disableGMSMissingPrompt: Boolean? = null, var unsubscribeWhenNotificationsDisabled: Boolean? = null, var locationShared: Boolean? = null, var requiresUserPrivacyConsent: Boolean? = null, var opRepoExecutionInterval: Long? = null, var influenceParams: InfluenceParamsObject, var fcmParams: FCMParamsObject, )</ID>
<ID>LongParameterList:IUserBackendService.kt$IUserBackendService$( appId: String, aliasLabel: String, aliasValue: String, properties: PropertiesObject, refreshDeviceMetadata: Boolean, propertyiesDelta: PropertiesDeltasObject, jwt: String? = null, )</ID>
<ID>LongParameterList:LoginUserOperationExecutor.kt$LoginUserOperationExecutor$( private val _identityOperationExecutor: IdentityOperationExecutor, private val _application: IApplicationService, private val _deviceService: IDeviceService, private val _userBackend: IUserBackendService, private val _identityModelStore: IdentityModelStore, private val _propertiesModelStore: PropertiesModelStore, private val _subscriptionsModelStore: SubscriptionModelStore, private val _configModelStore: ConfigModelStore, private val _languageContext: ILanguageContext, private val _jwtTokenStore: JwtTokenStore, )</ID>
<ID>LongParameterList:LoginUserOperationExecutor.kt$LoginUserOperationExecutor$( private val _identityOperationExecutor: IdentityOperationExecutor, private val _application: IApplicationService, private val _deviceService: IDeviceService, private val _userBackend: IUserBackendService, private val _identityModelStore: IdentityModelStore, private val _propertiesModelStore: PropertiesModelStore, private val _subscriptionsModelStore: SubscriptionModelStore, private val _configModelStore: ConfigModelStore, private val _languageContext: ILanguageContext, private val _jwtTokenStore: JwtTokenStore, private val _consistencyManager: IConsistencyManager, )</ID>
<ID>LongParameterList:OutcomeEventsController.kt$OutcomeEventsController$( private val _session: ISessionService, private val _influenceManager: IInfluenceManager, private val _outcomeEventsCache: IOutcomeEventsRepository, private val _outcomeEventsPreferences: IOutcomeEventsPreferences, private val _outcomeEventsBackend: IOutcomeEventsBackendService, private val _configModelStore: ConfigModelStore, private val _identityModelStore: IdentityModelStore, private val _subscriptionManager: ISubscriptionManager, private val _deviceService: IDeviceService, private val _time: ITime, )</ID>
<ID>LongParameterList:RefreshUserOperationExecutor.kt$RefreshUserOperationExecutor$( private val _userBackend: IUserBackendService, private val _identityModelStore: IdentityModelStore, private val _propertiesModelStore: PropertiesModelStore, private val _subscriptionsModelStore: SubscriptionModelStore, private val _configModelStore: ConfigModelStore, private val _buildUserService: IRebuildUserService, private val _newRecordState: NewRecordsState, private val _jwtTokenStore: JwtTokenStore, )</ID>
<ID>LongParameterList:SubscriptionObject.kt$SubscriptionObject$( val id: String? = null, val type: SubscriptionObjectType? = null, val token: String? = null, val enabled: Boolean? = null, val notificationTypes: Int? = null, val sdk: String? = null, val deviceModel: String? = null, val deviceOS: String? = null, val rooted: Boolean? = null, val netType: Int? = null, val carrier: String? = null, val appVersion: String? = null, )</ID>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The PR correctly updates the LongParameterList baseline entry for LoginUserOperationExecutor to include _consistencyManager: IConsistencyManager, but omits the required ConstructorParameterNaming suppression entry for that parameter. All 10 other underscore-prefixed constructor parameters in LoginUserOperationExecutor have individual ConstructorParameterNaming entries, and both peer executors (SubscriptionOperationExecutor, UpdateUserOperationExecutor) have the analogous entry for their own _consistencyManager parameters. Add the missing baseline entry to prevent the detekt CI check from failing.

Extended reasoning...

What the bug is and how it manifests

The detekt static-analysis tool enforces a ConstructorParameterNaming rule that flags underscore-prefixed constructor parameters (e.g. private val _consistencyManager). These violations are suppressed by adding per-parameter entries to the detekt baseline XML file. This PR adds private val _consistencyManager: IConsistencyManager as a new constructor parameter to LoginUserOperationExecutor and correctly updates the LongParameterList baseline entry to reflect the new parameter, but does not add the corresponding ConstructorParameterNaming entry. When detekt runs in CI baseline-check mode, any violation not present in the baseline XML is treated as a new unbaselined finding and fails the build.

The specific code path that triggers it

When the CI pipeline runs ./gradlew detekt (or the equivalent baseline-check task) against LoginUserOperationExecutor.kt, detekt scans the constructor and finds private val _consistencyManager: IConsistencyManager. It raises a ConstructorParameterNaming violation. The baseline checker compares this against detekt-baseline-core.xml and finds no matching suppression entry, so it reports one new unbaselined violation and returns a non-zero exit code.

Why existing code does not prevent it

The LongParameterList and ConstructorParameterNaming rules are tracked independently in the XML, under separate <ID> entries. Updating the LongParameterList entry (which lists all parameters in a single string) has no effect on the ConstructorParameterNaming rule, which requires one entry per parameter. The two-rule pattern is well-established in this codebase: every other underscore-prefixed parameter in LoginUserOperationExecutor (_application, _configModelStore, _deviceService, _identityModelStore, _identityOperationExecutor, _jwtTokenStore, _languageContext, _propertiesModelStore, _subscriptionsModelStore, _userBackend) has its own ConstructorParameterNaming entry. The omission of _consistencyManager is an oversight.

What the impact would be

The detekt CI check will fail after this PR is merged, blocking the release build pipeline. There is no functional impact on SDK behavior; this is purely a build-tooling issue. The build failure is deterministic and will reproduce on every CI run until the entry is added.

How to fix it

Add the following entry to the <CurrentIssues> section of OneSignalSDK/detekt/detekt-baseline-core.xml, in alphabetical order among the other ConstructorParameterNaming:LoginUserOperationExecutor entries:

<ID>ConstructorParameterNaming:LoginUserOperationExecutor.kt$LoginUserOperationExecutor$private val _consistencyManager: IConsistencyManager</ID>

Step-by-step proof

  1. PR adds private val _consistencyManager: IConsistencyManager to LoginUserOperationExecutor's constructor.
  2. The LongParameterList baseline entry is updated to include _consistencyManager — this is correct.
  3. Detekt runs ConstructorParameterNaming on the constructor; all 10 existing underscore-prefixed parameters are suppressed via baseline entries.
  4. Detekt finds _consistencyManager with no matching ConstructorParameterNaming baseline entry.
  5. Detekt reports one new unbaselined violation; CI baseline-check mode fails the build.
  6. For comparison, SubscriptionOperationExecutor and UpdateUserOperationExecutor both already have ConstructorParameterNaming entries for their _consistencyManager parameters, confirming the expected pattern.

@nan-li nan-li force-pushed the feat/identity_verification_5.8 branch from fbc53f2 to fb2353f Compare April 15, 2026 03:04
nan-li added 4 commits April 14, 2026 20:10
IdentityVerificationService.onModelReplaced (config HYDRATE) could
fire before OperationRepo.loadSavedOperations() finished, causing
removeOperationsWithoutExternalId() to run against an empty queue.

Wrap the HYDRATE handler in suspendifyOnIO + awaitInitialized() so
the purge waits for the queue to be fully populated, following the
same pattern as RecoverFromDroppedLoginBug.
fireJwtInvalidated now buffers the externalId under a synchronized
lock when no listeners are subscribed (e.g. during SDK init HYDRATE)
and replays it when the first IUserJwtInvalidatedListener is added.
Clears pending event on user switch (onModelReplaced) to prevent
stale replay.
LoginUserOperationExecutor.createUser() was not passing the RYW token
from the backend response to the ConsistencyManager, causing the
IamFetchRywTokenKey condition to never resolve after login. This meant
InAppMessagesManager coroutines awaited forever and IAMs were never
fetched for the logged-in user.

- Add rywData field to CreateUserResponse
- Parse ryw_token/ryw_delay in JSONConverter.convertToCreateUserResponse
- Set RYW data in ConsistencyManager after successful createUser
- Fix resolveConditionsWithID to only remove matching conditions
- Wrap resolveConditionsWithID in mutex for thread safety
When an IAM fetch returns an unauthorized response (401 or 403), the
SDK now saves the pending fetch state and automatically retries once
the JWT is refreshed for the same user via updateUserJwt. Switching
users clears any stale retry.

- Add IJwtUpdateListener to JwtTokenStore for post-putJwt notification
- InAppMessagesManager subscribes and retries on JWT update
- Reset rate-limiter on 401 so retry is not throttled
- Use @volatile on cross-thread fields (lastTimeFetchedIAMs,
  pendingJwtRetryExternalId, pendingJwtRetryRywData)
- RYW-aware fetches bypass rate limiter to avoid stale data
- Handle 401 in fetchInAppMessagesWithoutRywToken fallback path
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 5d860c9 to 2784adf Compare April 15, 2026 03:17
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 15, 2026

Closing in favor of a clean PR — the review history here is noisy from iterative bot feedback cycles. Reopening with the same changes squashed into 4 logical commits.

Comment on lines 326 to 332

fetchIAMMutex.withLock {
val now = _time.currentTimeMillis
if (lastTimeFetchedIAMs != null && (now - lastTimeFetchedIAMs!!) < _configModelStore.model.fetchIAMMinInterval) {
if (rywData == null && lastTimeFetchedIAMs != null && (now - lastTimeFetchedIAMs!!) < _configModelStore.model.fetchIAMMinInterval) {
return
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Inside fetchIAMMutex.withLock, the @volatile field lastTimeFetchedIAMs is read twice in one &&-expression: once for the null check (lastTimeFetchedIAMs != null) and once for the arithmetic (now - lastTimeFetchedIAMs!!). The 401 catch block at line 352 writes lastTimeFetchedIAMs = null outside the mutex, creating a TOCTOU where an OS-level preemption between the two JVM GETFIELD instructions can turn a passing null-check into a NullPointerException. Fix: read into a local val before the condition: val lastFetched = lastTimeFetchedIAMs; if (rywData == null && lastFetched != null && (now - lastFetched) < threshold).

Extended reasoning...

What the bug is and how it manifests

The PR adds @volatile to lastTimeFetchedIAMs (line 121) and adds a write of lastTimeFetchedIAMs = null in the 401 catch block (line 352) outside fetchIAMMutex. Inside the mutex withLock body (lines 326-332), the expression reads the same @volatile field twice: first for lastTimeFetchedIAMs != null and then for lastTimeFetchedIAMs!!. These are separate JVM GETFIELD bytecode operations. Because @volatile guarantees each read goes directly to main memory, a write on another thread between the two reads is immediately visible, making the null check and the dereference non-atomic.

The specific code path that triggers it

On Dispatchers.IO (a thread pool), multiple coroutines can execute concurrently on separate OS threads. Coroutine B sets lastTimeFetchedIAMs = now_B inside the mutex, then exits the lock and begins its HTTP call. The 401 response eventually arrives and Coroutine B's catch block (which runs outside the mutex) writes lastTimeFetchedIAMs = null. If Thread 2 is running Coroutine A's withLock body and the OS preempts Thread 2 between the two GETFIELD instructions—right after Coroutine A reads non-null for the null check, but before it reads the field again for !!—Thread 1 can write null. Thread 2 then resumes, executes lastTimeFetchedIAMs!! on a null Long?, and crashes with NullPointerException.

Addressing the refutation

The refutation argues the race requires an HTTP round-trip (~100ms) to complete within the nanosecond window between two bytecodes, which is 'essentially impossible.' This argument conflates two separate events. The HTTP call does not need to complete during the nanosecond window—it only needs to have completed before Coroutine A enters withLock. The race scenario is: Coroutine B already received its 401 response and its catch-block coroutine is scheduled on the IO thread pool but has not yet been dispatched; simultaneously Coroutine A acquires the mutex and begins the non-suspending withLock body. The OS can preempt Thread 2 between any two machine instructions; no 100ms event needs to happen within nanoseconds. On Android (ARM multi-core), OS thread preemption between two ldar instructions (volatile reads) is a legitimate JVM memory model hazard.

That said, the practical probability is low. The window is narrow, and in common test scenarios the race does not manifest. However: (1) @volatile was added in this very PR—without fixing the double-read pattern it creates a false sense of correctness; (2) the fix is a single-line change with zero downside; (3) the two GETFIELDs on a nullable boxed Long? are distinct memory operations that the JVM compiler cannot merge into one, making the race structurally real regardless of probability.

Why existing synchronization does not prevent it

fetchIAMMutex.withLock blocks other coroutines from entering the locked block, but does not prevent OS threads from executing code outside the mutex concurrently. The 401 catch block is intentionally outside the mutex. @volatile ensures the null write is immediately visible to all threads—which is precisely what makes the race dangerous rather than masked by caching.

What the impact would be

A crash (NullPointerException) in fetchMessages on a production device. The stack trace would point into the rate-limiter check inside fetchIAMMutex.withLock, which may be difficult to diagnose without knowing the race.

How to fix it

Read the volatile field exactly once into a local val before the compound boolean expression:

val lastFetched = lastTimeFetchedIAMs
if (rywData == null && lastFetched != null && (now - lastFetched) < _configModelStore.model.fetchIAMMinInterval) {
    return
}

A local val is never subject to concurrent writes, so the null check and arithmetic operate on the same captured value. This is the standard JMM safe-access pattern for nullable volatile fields.

Comment on lines +1048 to +1061
override fun onJwtUpdated(externalId: String) {
val retryExternalId = pendingJwtRetryExternalId ?: return
if (externalId != retryExternalId) return

val retryRywData = pendingJwtRetryRywData

Logging.debug("InAppMessagesManager.onJwtUpdated: JWT refreshed for $externalId, retrying IAM fetch")
pendingJwtRetryExternalId = null
pendingJwtRetryRywData = null

suspendifyOnIO {
fetchMessages(retryRywData)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 In onJwtUpdated(), the pending retry state is cleared (pendingJwtRetryExternalId = null, pendingJwtRetryRywData = null) before launching suspendifyOnIO { fetchMessages(retryRywData) }. If the app is backgrounded when onJwtUpdated fires—a realistic scenario when the developer calls updateUserJwt() in response to a push notification—fetchMessages() hits the early-return guard and exits without fetching. Since the pending state was already cleared and onFocus() is a no-op, there is no recovery path for the rest of the session. Fix: defer clearing the pending state until after the foreground check passes inside the coroutine, or re-arm the retry in onFocus().

Extended reasoning...

What the bug is and how it manifests

In InAppMessagesManager.onJwtUpdated() (lines 1055-1061), the method clears pendingJwtRetryExternalId = null and pendingJwtRetryRywData = null synchronously on the calling thread, then launches suspendifyOnIO { fetchMessages(retryRywData) }. Inside fetchMessages(), the first guard is if (!_applicationService.isInForeground) { return } (line ~304). If the app is backgrounded at the moment onJwtUpdated fires, the pending state has already been wiped but the actual fetch exits immediately—leaving the IAM retry permanently lost.

The specific code path that triggers it

  1. fetchMessages() is called while app is in foreground. A 401 response arrives and the catch block stores pendingJwtRetryExternalId = externalId and pendingJwtRetryRywData = rywData.
  2. The app moves to the background (user switches apps, or a push notification triggers background processing).
  3. The developer calls OneSignal.updateUserJwt(externalId, newJwt) from a background service or receiver—the expected response to onUserJwtInvalidated.
  4. JwtTokenStore.putJwt fires jwtUpdateNotifier.fire { it.onJwtUpdated(externalId) }.
  5. InAppMessagesManager.onJwtUpdated executes: reads retryExternalId (non-null, matches), then unconditionally sets pendingJwtRetryExternalId = null and pendingJwtRetryRywData = null, then calls suspendifyOnIO { fetchMessages(retryRywData) }.
  6. The scheduled coroutine runs on Dispatchers.IO; fetchMessages evaluates !_applicationService.isInForeground → true → return.
  7. The retry is permanently dropped.

Why existing code does not prevent it

onFocus(firedOnSubscribe: Boolean) at line 1063 is an explicit no-op override. It performs no fetch and has no retry logic. There is no other foreground-reentry hook in InAppMessagesManager that checks the pending retry fields. The only other recovery path is onSessionStarted(), which requires the app to go to background long enough to trigger a new session (typically 30 seconds), which may be a significant delay. The @volatile annotations on the pending fields ensure memory visibility but provide no protection against the ordering problem: the clear happens before the coroutine is dispatched, and the coroutine sees isInForeground = false with no way to know there was a pending retry.

What the impact would be

Users in apps with identity verification enabled who experience a JWT expiry while backgrounded—and whose developers correctly call updateUserJwt from a background context (push notification handler, WorkManager job, etc.)—will silently receive no in-app messages for the rest of the session. This is a realistic, common Android scenario where short-lived JWTs and background processing are normal. No error is surfaced to the developer or end user.

How to fix it

Option A (minimal): Only clear the pending state after the foreground check succeeds inside the coroutine:

suspendifyOnIO {
    if (_applicationService.isInForeground) {
        pendingJwtRetryExternalId = null
        pendingJwtRetryRywData = null
        fetchMessages(retryRywData)
    }
    // else: pending state remains set; will retry when app resumes
}

Option B: Override onFocus() to check the pending fields and call suspendifyOnIO { fetchMessages(retryRywData) } if they are set, re-arming the retry on foreground.

Step-by-step proof

  1. App in foreground: onSessionStarted() → fetchMessages() → HTTP 401 → catch block sets pendingJwtRetryExternalId = "alice", pendingJwtRetryRywData = rywData.
  2. User presses home; _applicationService.isInForeground becomes false.
  3. Push notification arrives; developer receiver calls OneSignal.updateUserJwt("alice", newJwt).
  4. JwtTokenStore.putJwt → onJwtUpdated("alice") runs on the binder thread.
  5. onJwtUpdated: reads retryExternalId = "alice" (passes guard), then sets pendingJwtRetryExternalId = null and pendingJwtRetryRywData = null, then launches coroutine.
  6. Coroutine on IO pool: fetchMessages(rywData) → !isInForeground is true → return.
  7. User returns to the app within session timeout. onFocus() fires → no-op.
  8. pendingJwtRetryExternalId is null; no retry is ever triggered. IAMs are never shown for this session.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant