Race condition in push monitors #937

Closed
opened 2026-02-28 02:04:29 -05:00 by deekerman · 3 comments
Owner

Originally created by @kaysond on GitHub (Mar 27, 2022).

⚠️ Please verify that this bug has NOT been raised before.

  • I checked and didn't find similar issue

🛡️ Security Policy

Description

There is a race condition in the code for push monitors that can cause repeat notifications within the same second or shortly thereafter. (see some background in #922)
image

http logs show that the GET calls are generally 60s apart, but because of discrepancies in system time down to the ms level, it can appear that they happen a second apart, even if they're very close together (ie.e 17:58:05.999 to 17:59:06.000)

[25/Mar/2022:17:54:05 -0700]  "GET /api/push/xxxx?msg=OK&ping=92.177 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
[25/Mar/2022:17:55:05 -0700] "GET /api/push/xxxx?msg=OK&ping=92.196 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
-----------------------------
[25/Mar/2022:17:58:05 -0700] "GET /api/push/xxx?msg=OK&ping=90.552 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
[25/Mar/2022:17:59:06 -0700]  "GET /api/push/xxx?msg=OK&ping=91.308 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567

👟 Reproduction steps

Repro is tricky, but it could be possible by monitoring uptime kuma debug logs and synchronizing to them. Basically, the push URL calls have to happen just shortly after the monitor's calls to beat() or safe_beat()

👀 Expected behavior

Uptime kuma should be robust against very small clock discrepancies

😓 Actual Behavior

Get double notifications

🐻 Uptime-Kuma Version

1.12.1

💻 Operating System and Arch

Docker on Ubuntu 20.04

🌐 Browser

Firefox

🐋 Docker Version

Latest

🟩 NodeJS Version

No response

📝 Relevant log output

No response

Originally created by @kaysond on GitHub (Mar 27, 2022). ### ⚠️ Please verify that this bug has NOT been raised before. - [X] I checked and didn't find similar issue ### 🛡️ Security Policy - [X] I agree to have read this project [Security Policy](https://github.com/louislam/uptime-kuma/security/policy) ### Description There is a race condition in the code for push monitors that can cause repeat notifications within the same second or shortly thereafter. (see some background in #922) ![image](https://user-images.githubusercontent.com/1147328/160219491-5d077488-49b8-4a77-99b4-7397873384f3.png) http logs show that the GET calls are generally 60s apart, but because of discrepancies in system time down to the ms level, it can appear that they happen a second apart, even if they're very close together (ie.e 17:58:05.999 to 17:59:06.000) ``` [25/Mar/2022:17:54:05 -0700] "GET /api/push/xxxx?msg=OK&ping=92.177 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567 [25/Mar/2022:17:55:05 -0700] "GET /api/push/xxxx?msg=OK&ping=92.196 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567 ----------------------------- [25/Mar/2022:17:58:05 -0700] "GET /api/push/xxx?msg=OK&ping=90.552 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567 [25/Mar/2022:17:59:06 -0700] "GET /api/push/xxx?msg=OK&ping=91.308 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567 ``` ### 👟 Reproduction steps Repro is tricky, but it could be possible by monitoring uptime kuma debug logs and synchronizing to them. Basically, the push URL calls have to happen just shortly after the monitor's calls to `beat()` or `safe_beat()` ### 👀 Expected behavior Uptime kuma should be robust against very small clock discrepancies ### 😓 Actual Behavior Get double notifications ### 🐻 Uptime-Kuma Version 1.12.1 ### 💻 Operating System and Arch Docker on Ubuntu 20.04 ### 🌐 Browser Firefox ### 🐋 Docker Version Latest ### 🟩 NodeJS Version _No response_ ### 📝 Relevant log output _No response_
deekerman 2026-02-28 02:04:29 -05:00
  • closed this issue
  • added the
    bug
    label
Author
Owner

@kaysond commented on GitHub (Mar 27, 2022):

The race condition is caused because the monitor is running on a setTimeout that is totally independent of the call of the push URL. That means that beat() and router.get("/api/push/:pushToken") can both get called at the same time in seconds (e.g. both at 17:15:02), but if beat() happens first, it will mark the monitor as down, then it will immediately mark it as up when router.get() is run.

This is partly an issue because setTimeout is not a guaranteed execution time. Node will run the function at least as late as the timeout, but it could be later. This means that the time at which beat()/safeBeat() are run will drift! Suppose we start with beat() being called at 16:15:30, 16:16:30, and so on, and router.get() being called at 16:15:50, 16:16:50, and so on. This won't cause any issues because every call to beat() will see the previous heartbeat.

Because of the behavior of setTimeout, beat() will start being called later and later (or if you just have bad luck the calls could get synchronized somehow on startup). At some point, beat() will also start getting called at xx:xx:50, causing the problem.

github.com/louislam/uptime-kuma@12237dec6e/server/model/monitor.js (L309-L329)

github.com/louislam/uptime-kuma@12237dec6e/server/model/monitor.js (L468-L475)

github.com/louislam/uptime-kuma@12237dec6e/server/routers/api-router.js (L20-L84)

I would recommend following the approach of healthchecks. Healthchecks waits for the first "ping" (aka push URL call), then 60s after that ping, checks again to make sure another ping has happened. There is also a grace period to accommodate clock differences. For example, if you have a 5s grace time, the monitor won't be marked as down unless 65s has elapsed since the last ping.

I think the right thing to do, though it may be a little difficult with the way you have the architecture set up, is to do the first setTimeout on monitor.start(), but every other setTimeout should be run from router.get("/api/push:pushToken"). The timeout would be set to interval + grace period, like setTimeout(safebeat, (interval + gracePeriod)*1000).

@kaysond commented on GitHub (Mar 27, 2022): The race condition is caused because the monitor is running on a `setTimeout` that is totally independent of the call of the push URL. That means that `beat()` and `router.get("/api/push/:pushToken")` can both get called at the same time in seconds (e.g. both at 17:15:02), but if `beat()` happens first, it will mark the monitor as down, then it will immediately mark it as up when `router.get()` is run. This is partly an issue because `setTimeout` is not a guaranteed execution time. Node will run the function *at least* as late as the timeout, but it could be later. This means that the time at which `beat()`/`safeBeat()` are run will drift! Suppose we start with `beat()` being called at 16:15:30, 16:16:30, and so on, and `router.get()` being called at 16:15:50, 16:16:50, and so on. This won't cause any issues because every call to `beat()` will see the previous `heartbeat`. Because of the behavior of `setTimeout`, `beat()` will start being called later and later (or if you just have bad luck the calls could get synchronized somehow on startup). At some point, `beat()` will also start getting called at xx:xx:50, causing the problem. https://github.com/louislam/uptime-kuma/blob/12237dec6e3809404a1c84df394b232431b0f5f0/server/model/monitor.js#L309-L329 https://github.com/louislam/uptime-kuma/blob/12237dec6e3809404a1c84df394b232431b0f5f0/server/model/monitor.js#L468-L475 https://github.com/louislam/uptime-kuma/blob/12237dec6e3809404a1c84df394b232431b0f5f0/server/routers/api-router.js#L20-L84 I would recommend following the approach of [healthchecks](https://github.com/healthchecks/healthchecks). Healthchecks waits for the first "ping" (aka push URL call), then 60s after that ping, checks again to make sure another ping has happened. There is also a grace period to accommodate clock differences. For example, if you have a 5s grace time, the monitor won't be marked as down unless 65s has elapsed since the last ping. I think the right thing to do, though it may be a little difficult with the way you have the architecture set up, is to do the first `setTimeout` on `monitor.start()`, but every other `setTimeout` should be run from `router.get("/api/push:pushToken")`. The timeout would be set to interval + grace period, like `setTimeout(safebeat, (interval + gracePeriod)*1000)`.
Author
Owner

@kaysond commented on GitHub (Apr 11, 2022):

technically this is still not finished yet until #1428 is finished and gets merged!

@kaysond commented on GitHub (Apr 11, 2022): technically this is still not finished yet until #1428 is finished and gets merged!
Author
Owner

@louislam commented on GitHub (Apr 11, 2022):

technically this is still not finished yet until #1428 is finished and gets merged!

It was closed automatically. Sounds horrible because it could be cross repo.

@louislam commented on GitHub (Apr 11, 2022): > technically this is still not finished yet until #1428 is finished and gets merged! It was closed automatically. Sounds horrible because it could be cross repo.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/uptime-kuma#937
No description provided.