fix: login race condition and JWT FAIL_UNAUTHORIZED blocking callers#2600
Conversation
There was a problem hiding this comment.
This PR makes / blocking (via ) and changes failure semantics in the operation repo — these are non-trivial concurrency changes that warrant a human look, particularly around main-thread ANR exposure for the non-suspend API.
Extended reasoning...
Overview\nThe PR touches two files: (login/logout threading model) and (waiter wake-up on FAIL_UNAUTHORIZED and FAIL_PAUSE_OPREPO). Both changes are small in line count but meaningful in behavioral impact.\n\n### Security risks\nNo security-sensitive paths (auth, crypto, permissions) are directly altered. The JWT invalidation path in OperationRepo is adjusted, but the fix only changes when waiters are notified — the actual JWT invalidation logic is untouched.\n\n### Level of scrutiny\nHigh. The / change from fire-and-forget / to converts these methods from non-blocking to fully blocking. If a developer calls on the Android main thread (which is possible with the deprecated API), will block that thread for the full duration of , which involves operation repo enqueueing and potentially network waits — a classic ANR vector. The OperationRepo changes alter the contract of : callers now receive immediately on FAIL_UNAUTHORIZED/FAIL_PAUSE_OPREPO rather than waiting for the eventual retry result. This is intentional per the PR description, but changes observable behavior for any caller that inspects the return value.\n\n### Other factors\nNo automated bug reports were filed. The PR description is clear and the intent is sound (fix a real hang/race condition), but the threading changes are in production-critical paths where subtle mistakes can cause ANRs or silent operation drops. Human review is appropriate here.
4b98693 to
ac70e95
Compare
ac70e95 to
f816aa7
Compare
nan-li
left a comment
There was a problem hiding this comment.
small nit
tested login flows with JWT and without, and login flows where JWT=OFF can fail with
OneSignal.loginSuspend("alice")
OneSignal.loginSuspend("1") // will fail
OneSignal.loginSuspend("chris")
| * Full login: switches user models then waits for the backend operation. | ||
| * Used by [loginSuspend] where the caller can afford to suspend. | ||
| */ | ||
| suspend fun login( |
There was a problem hiding this comment.
nit: do we need this method? Since in OneSignalImp.login(), we call switchUser() and enqueueLogin() manually, could loginsuspend() also call these 2 manually - as there's added confusion that LoginHelper.login()is only called byloginSuspend()but notlogin()`
1. Split LoginHelper into switchUser() + enqueueLogin() so the non-suspend login() can synchronously swap user models (instant, no network) while the backend create-user call fires in the background. This prevents ANRs on main thread while ensuring tags target the correct user after each login() call. loginSuspend() also calls switchUser() + enqueueLogin() directly (no wrapper method) for clarity. 2. Wake enqueueAndWait waiters on FAIL_UNAUTHORIZED and FAIL_PAUSE_OPREPO before re-queuing operations. Re-queued items use waiter=null so future retries don't reference stale waiters. This prevents loginSuspend from hanging forever when a JWT is invalid or the op repo pauses. Fixes SDK-4338 Made-with: Cursor
f816aa7 to
f47c974
Compare
|
Broken tests are unrelated. will be fixed in the next PR #2605 |
a292935
into
feat/identity_verification_5.8
Summary
login()race condition: SplitLoginHelperintoswitchUser()(synchronous local model swap) +enqueueLogin()(async network call). The non-suspendlogin()now callsswitchUser()synchronously on the calling thread, then fires offenqueueLogin()in the background viasuspendifyOnIO(FF ON) orThread { runBlocking }(FF OFF). This ensures tags target the correct user without blocking the main thread or causing ANRs.loginSuspend()still uses the fulllogin()which awaits the backend operation.FAIL_UNAUTHORIZED/FAIL_PAUSE_OPREPOhanging callers: WakeenqueueAndWaitwaiters withfalsebefore re-queuing operations. Re-queued items usewaiter = nullso future retries (afterupdateUserJwt) don't reference stale waiters. This preventsloginSuspendfrom hanging forever when a JWT is invalid or the op repo pauses.Files Changed
LoginHelper.kt— Split intoswitchUser()(instant, synchronous model swap under lock) +enqueueLogin()(suspend, network call) +login()(both, forloginSuspend)OneSignalImp.kt—login()callsswitchUser()synchronously then dispatchesenqueueLogin()respecting the FF flag;logout()unchangedOperationRepo.kt—FAIL_UNAUTHORIZEDandFAIL_PAUSE_OPREPObranches wake waiters before re-queuing withwaiter = nullTest Results
All tests run on emulator (Medium Phone AVD - API 16) with fresh app data cleared between each run.
Test code
Non-suspend (called from
onCreateon main thread):Suspend (called from
CoroutineScope(Dispatchers.IO)):Full test matrix
login()alice1→alice, bob2→bob, chris3→chrislogin()loginSuspend()loginSuspend()login()alice1→alice, bob2→bob, chris3→chrislogin()loginSuspend()loginSuspend()Example logcat output (non-suspend, FF OFF)
Example logcat output (suspend, FF ON)
Notes
switchUser()is synchronous but instant (local model swap only), network calls are dispatched to background.FAIL_UNAUTHORIZED/ waiter-wake path was not exercised because IV was not enforced by the backend on the test app. To fully validate the 401 waiter fix, run against an app with identity verification enabled on the dashboard.Test plan
login()+addTag()in immediate succession — each tag lands on correct user (FF OFF)login()+addTag()in immediate succession — each tag lands on correct user (FF ON)login()with VALID/INVALID/VALID JWT — tags correct, no blocking (FF OFF)login()with VALID/INVALID/VALID JWT — tags correct, no blocking (FF ON)loginSuspend()+addTag()sequential — each call waits for backend, tags correct (FF OFF)loginSuspend()+addTag()sequential — each call waits for backend, tags correct (FF ON)loginSuspend()with VALID/INVALID/VALID JWT — bob does not block chris (FF OFF)loginSuspend()with VALID/INVALID/VALID JWT — bob does not block chris (FF ON)login()from main threadFAIL_UNAUTHORIZEDwaiter-wake path with IV enabled on backend (401 triggers wake + re-queue)updateUserJwt()causes re-queued operations to execute after JWT refreshFixes SDK-4338