mirror of https://github.com/louislam/uptime-kuma.git synced 2026-03-02 22:57:00 -05:00

Services falsely reported as offline during a system overload #425

Open

opened 2026-02-28 01:46:06 -05:00 by deekerman · 13 comments

deekerman commented

2026-02-28 01:46:06 -05:00

Owner

Originally created by @MAXOUXAX on GitHub (Oct 16, 2021).

Description of the bug
When my server is overloaded, Uptime Kuma can't communicate with my services, so it considers them offline.
My services are not hosted on the same server, so they work fine, but my status page shows a reduced uptime.

(I want to specify that I voluntarily overloaded my server in order to fine-tune my Anti-DDoS protection)

To Reproduce
Steps to reproduce the behavior:

1. Overload the system and/or network that hosts your status page.
2. Wait a few minutes.
3. Notice that your services are considered offline and have lost uptime.

Expected behavior
The uptime shouldn't be affected at all.

Info
Uptime Kuma Version: 1.8.0
Using Docker?: Yes
Docker Version: 20.10.8
OS: Debian 10
Browser: Brave V1.30.89

Possible fix
When the service has been queried, and an error has been retrieved, execute an action that is supposed to run quickly and check its execution time. If this execution time is greater than a certain limit, ignore the error.
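
The possible fix above could be sketched roughly as follows. This is only an illustration of the idea, not Uptime Kuma code: the function names `local_probe_seconds` and `classify_failure`, the 50 ms limit, and the choice of probe action are all assumptions made for the example.

```python
import time
from typing import Callable


def local_probe_seconds(action: Callable[[], object]) -> float:
    """Time a cheap local action (hypothetical probe).

    A slow result hints that the monitoring host itself is overloaded.
    """
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def classify_failure(probe_seconds: float, limit_seconds: float = 0.05) -> str:
    """Decide what to do with a failed check.

    Returns "ignore" when the local probe exceeded the limit (the monitor
    looks overloaded), otherwise "down" (the failure is taken at face value).
    The 0.05 s default is an arbitrary placeholder.
    """
    return "ignore" if probe_seconds > limit_seconds else "down"


# Usage after a failed check: time something that should be near-instant.
elapsed = local_probe_seconds(lambda: sum(range(1000)))
print(classify_failure(elapsed))
```

A local timing probe only catches CPU or scheduler pressure on the monitoring host. It would not notice a saturated network link, so a real implementation would likely need a network-level probe as well.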


deekerman added the discussion and A:core labels

2026-02-28 01:46:06 -05:00

deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@gaby commented on GitHub (Oct 16, 2021):

So you are DDoSing the uptime-kuma server, and want the server to keep up?

How is this related to uptime-kuma?


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@louislam commented on GitHub (Oct 16, 2021):

I think a good network connection is a hidden requirement here.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@PopcornPanda commented on GitHub (Oct 16, 2021):

I think that a cross-check could be handy for such a case. Sometimes the host running uptime-kuma has problems, not the monitored service. There is already a feature request for such a solution: #84
Cross-checking is quite handy and would be a nice addition to kuma. Tag a service as unavailable only if 2 of 3 checkers (it's just an example, but it has to be a quorum) detect a problem with the service.
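
A minimal sketch of the quorum rule described above, assuming each independent checker reports a simple boolean. The function name `is_down_by_quorum` and the strict-majority threshold are assumptions for illustration and are not taken from #84 or from Uptime Kuma.

```python
from typing import Sequence


def is_down_by_quorum(reports: Sequence[bool]) -> bool:
    """Return True only when a strict majority of checkers saw the service down.

    reports[i] is True when checker i failed to reach the service.
    """
    if not reports:
        raise ValueError("at least one checker report is required")
    down_votes = sum(1 for report in reports if report)
    return down_votes > len(reports) // 2


# One overloaded checker out of three does not flip the status.
print(is_down_by_quorum([True, False, False]))  # False
print(is_down_by_quorum([True, True, False]))   # True
```

With three checkers, a single overloaded monitoring host is outvoted by the other two, which is the scenario this issue describes.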


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@MAXOUXAX commented on GitHub (Oct 16, 2021):

> So you are DDoSing the uptime-kuma server, and want the server to keep up?
>
> How is this related to uptime-kuma?

Well, essentially, there's always a way to take a website down, and I don't want attackers DDoS'ing my status page AND causing my services to report offline. Even though my status page would be down during the attack, I don't want my services to be shown as degraded and my uptime as really low after the attack, because, well, my services were just fine.
That's an edge case, but still.

> I think a good network connection is a hidden requirement here.

Good network connection doesn't mean invulnerable ^^


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@gaby commented on GitHub (Oct 16, 2021):

> > So you are DDoSing the uptime-kuma server, and want the server to keep up?
> > How is this related to uptime-kuma?
>
> Well, essentially, there's always a way to take a website down, and I don't want attackers DDoS'ing my status page AND causing my services to report offline. Even though my status page would be down during the attack, I don't want my services to be shown as degraded and my uptime as really low after the attack, because, well, my services were just fine. That's an edge case, but still.
>
> > I think a good network connection is a hidden requirement here.
>
> Good network connection doesn't mean invulnerable ^^

Yes, but it has nothing to do with uptime-kuma. These are networking/firewall concerns. You can use `ufw`, `fail2ban`, Cloudflare, and a properly configured NGINX to mitigate DDoS attacks.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@MAXOUXAX commented on GitHub (Oct 16, 2021):

> > > So you are DDoSing the uptime-kuma server, and want the server to keep up?
> > > How is this related to uptime-kuma?
> >
> > Well, essentially, there's always a way to take a website down, and I don't want attackers DDoS'ing my status page AND causing my services to report offline. Even though my status page would be down during the attack, I don't want my services to be shown as degraded and my uptime as really low after the attack, because, well, my services were just fine. That's an edge case, but still.
> >
> > > I think a good network connection is a hidden requirement here.
> >
> > Good network connection doesn't mean invulnerable ^^
>
> Yes, but it has nothing to do with uptime-kuma. These are networking/firewall concerns. You can use `ufw`, `fail2ban`, Cloudflare, and a properly configured NGINX to mitigate DDoS attacks.

Yes it does? Having a good firewall is one thing. Being invulnerable is another.
I have protections such as Cloudflare and fail2ban. As I said, I was fine-tuning them when I noticed the issue, but they will never make me invulnerable to other types of attacks I did not think of, botnets, and other potential issues.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@deefdragon commented on GitHub (Oct 16, 2021):

I think that this is at least partially a Kuma problem. Fundamentally, the service is up, but Kuma is failing to detect it as such.

That doesn't mean that it is an easy problem to solve, or one that should be tackled right now, however. I believe that @NixNotCastey is on to a potential solution, as separating the reporters/collectors from the display would mean that the collectors would be unaffected by a DoS. Something to explore in the future with #84.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@markdesilva commented on GitHub (Oct 17, 2021):

@MAXOUXAX ah so what you want is like what GSA has, an "override" feature so you can tell UK that "hey, this isn't a server outage, it's actually UK that was having connection issues so please put my % back to 100%".

So when they click the "DOWN" pill in the dashboard, a pop-up shows up with an on/off button for "override" and a text box where you can fill in the reason for the override. When you submit, the reason for the override replaces the "No heartbeat in the time window", "connect ECONNREFUSED <IP ADDR>", "timeout of 48000ms exceeded", etc. messages.

Yeah I think it's a good thing to have, especially when optics are important to upper management. They won't look at the production servers directly, they will look at your stats which UK provides. So it would be good for them to be able to see that the service has been 100% up rather than down just because UK couldn't connect to the services and not because the services were actually down. Doesn't have to be a DDoS on the UK, it could be something innocent like "tripped over UK server network cable and it came out" or "UK NIC faulty, had to replace".
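
One way the override idea above might affect the uptime figure is sketched below. The `Heartbeat` record, its `override_reason` field, and `uptime_percent` are hypothetical names for this example and do not reflect Uptime Kuma's actual data model. The sketch assumes an overridden "down" heartbeat simply counts as up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Heartbeat:
    up: bool
    # Hypothetical field: set by an operator when the failure was the
    # monitor's fault, e.g. "monitor NIC faulty, had to replace".
    override_reason: Optional[str] = None


def uptime_percent(beats: Iterable[Heartbeat]) -> float:
    """Uptime in percent, counting overridden failures as up."""
    beat_list = list(beats)
    if not beat_list:
        raise ValueError("no heartbeats to evaluate")
    counted_up = sum(
        1 for beat in beat_list if beat.up or beat.override_reason is not None
    )
    return 100.0 * counted_up / len(beat_list)


history = [
    Heartbeat(up=True),
    Heartbeat(up=False, override_reason="monitor NIC faulty, had to replace"),
    Heartbeat(up=False),  # a genuine outage, not overridden
    Heartbeat(up=True),
]
print(uptime_percent(history))  # 75.0
```

Storing the reason next to the heartbeat, rather than rewriting the heartbeat as "up", keeps an audit trail of what was overridden and why.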

In fact in this situation, it would be good then to suggest a "select reports range" (display reports within certain date and/or time range) and then "download reports" (to pdf) function.

My 2 cents worth.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@louislam commented on GitHub (Oct 17, 2021):

Ultimately, I think one possible solution is completely separating the core and the status page into two different projects.

- Host the core inside a private network and don't expose it.
- Host the status page on another server and expose the page to the Internet. Sync data through something like a private tunnel.

So if someone attacks your status page, it won't take down the core too.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@gaby commented on GitHub (Oct 17, 2021):

> Ultimately, I think one possible solution is completely separating the core and the status page into two different projects.
>
> * Host the core inside a private network and don't expose it.
> * Host the status page on another server and expose the page to the Internet. Sync data through something like a private tunnel.
>
> So if someone attacks your status page, it won't take down the core too.

Status page should be internal to your network. Not exposed to the internet.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@louislam commented on GitHub (Oct 17, 2021):

However, since this would take a large amount of effort, it won't happen soon.

If you are using Cloudflare, setting a `Page Rule` with `Cache Everything` for 5 minutes and disabling WebSocket is a way to go too.

Use your internal address to access the dashboard.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@deefdragon commented on GitHub (Oct 17, 2021):

> Status page should be internal to your network. Not exposed to the internet.

That depends on what you are using the status page for. I use my status page to show off the current state of the different APIs that my site uses. Similar to www.cloudflarestatus.com for Cloudflare.


deekerman commented

2026-02-28 01:46:10 -05:00

Author

Owner

@louislam commented on GitHub (Oct 17, 2021):

> > Status page should be internal to your network. Not exposed to the internet.
>
> That depends on what you are using the status page for. I use my status page to show off the current state of the different APIs that my site uses. Similar to www.cloudflarestatus.com for Cloudflare.

Agreed. If you don't expose it to the Internet, OP's problem is not a problem.
