mirror of
https://github.com/louislam/uptime-kuma.git
synced 2026-03-02 22:57:00 -05:00
Services falsely reported as offline during a system overload #425
Originally created by @MAXOUXAX on GitHub (Oct 16, 2021).
Description of the bug
When my server is overloaded, Uptime Kuma can't communicate with my services, so it considers them offline.
My services are not hosted on the same server, so they work fine, but my status page shows a reduced uptime.
(I want to specify that I voluntarily overloaded my server in order to fine-tune my Anti-DDoS protection)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The uptime shouldn't be affected at all.
Info
Uptime Kuma Version: 1.8.0
Using Docker?: Yes
Docker Version: 20.10.8
OS: Debian 10
Browser: Brave V1.30.89
Possible fix
When the service has been queried, and an error has been retrieved, execute an action that is supposed to run quickly and check its execution time. If this execution time is greater than a certain limit, ignore the error.
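A minimal sketch of the idea above, not something Uptime Kuma implements. It assumes the monitor can run a cheap local action after a failed check and time it. The names `probe_is_trustworthy`, `classify_check`, and the 50 ms default limit are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable


def probe_is_trustworthy(quick_action: Callable[[], object], limit_seconds: float) -> bool:
    """Time a cheap local action; a slow run suggests the monitoring host is overloaded."""
    start = time.perf_counter()
    quick_action()
    elapsed = time.perf_counter() - start
    return elapsed <= limit_seconds


def classify_check(service_ok: bool,
                   quick_action: Callable[[], object],
                   limit_seconds: float = 0.05) -> str:
    """Return "up", "down", or "ignored" for a single check result.

    limit_seconds is a hypothetical threshold; a real value would need tuning.
    """
    if service_ok:
        return "up"
    # The check failed: only count it if the monitoring host itself looks healthy.
    if probe_is_trustworthy(quick_action, limit_seconds):
        return "down"
    return "ignored"
```

A failed check would then only lower the uptime when `classify_check` returns `"down"`; an `"ignored"` result would be skipped in the statistics. Note that a local timing test only detects CPU or scheduler pressure, so it would not catch a saturated network link on its own.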
@gaby commented on GitHub (Oct 16, 2021):
So you are DDoSing the uptime-kuma server, and want the server to keep up?
How is this related to uptime-kuma?
@louislam commented on GitHub (Oct 16, 2021):
I think a good network connection is a hidden requirement here.
@PopcornPanda commented on GitHub (Oct 16, 2021):
I think a cross-check could be handy for such a case. Sometimes the host running uptime-kuma has problems, not the monitored service. There is already a feature request for such a solution: #84
Cross-checking is quite handy and would be a nice addition to kuma. Tag a service as unavailable only if 2 of 3 checkers (it's just an example, but it has to be a quorum) detect a problem with the service.
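The 2-of-3 rule could be sketched roughly as follows. This is an illustration only and assumes several independent checkers each report a boolean result for the same service. `quorum_status` and `majority` are hypothetical names, not part of Uptime Kuma.

```python
from typing import Sequence


def majority(checker_count: int) -> int:
    """Smallest number of checkers that forms a strict majority."""
    return checker_count // 2 + 1


def quorum_status(probe_results: Sequence[bool], required_failures: int) -> str:
    """Report "down" only when enough independent checkers saw a failure.

    probe_results: one entry per checker, True meaning that checker reached the service.
    """
    failures = sum(1 for ok in probe_results if not ok)
    return "down" if failures >= required_failures else "up"


# One overloaded checker out of three does not mark the service as down.
results = [True, True, False]
status = quorum_status(results, majority(len(results)))
```

With three checkers, `majority(3)` is 2, so a single checker that cannot reach the service is outvoted by the other two.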
@MAXOUXAX commented on GitHub (Oct 16, 2021):
Well, essentially, there's always a way to take a website down, and I don't want attackers DDoS'ing my status page AND causing my services to report offline. Even though my status page would be down during the attack, I don't want my services to be shown as degraded and my uptime as really low after the attack, because, well, my services were just fine.
That's an edge case, but still.
Good network connection doesn't mean invulnerable ^^
@gaby commented on GitHub (Oct 16, 2021):
Yes, but it has nothing to do with uptime-kuma. These are networking/firewall concerns. You can use ufw, fail2ban, cloudflare, and a properly configured NGINX to mitigate ddos.
@MAXOUXAX commented on GitHub (Oct 16, 2021):
Yes it does? Having a good firewall is one thing. Being invulnerable is another.
I have protections such as Cloudflare and fail2ban. As I said, I was fine-tuning my protections when I noticed the issue, but they'll never make me invulnerable to other types of attacks I did not think of, botnets, and potential other issues.
@deefdragon commented on GitHub (Oct 16, 2021):
I think that this is at least partially a Kuma problem. Fundamentally, the service is up, but Kuma is failing to detect it as such.
That doesn't mean that it is an easy problem to solve, however, or one that should be tackled right now. I believe that @NixNotCastey is on to a potential solution, as separation of the reporters/collectors and the display would mean that the collectors would be unaffected by a DoS. Something to explore in the future with #84.
@markdesilva commented on GitHub (Oct 17, 2021):
@MAXOUXAX ah so what you want is like what GSA has, an "override" feature so you can tell UK that "hey, this isn't a server outage, it's actually UK that was having connection issues so please put my % back to 100%".
So like when they click the "DOWN" pill in the dashboard, a pop up shows up with an on/off button for "override" and a text box so you can fill in the reason for the override and when you submit, the reason for the override replaces the "No heartbeat in the time window" or "connect ECONNREFUSED " or "timeout of 48000ms exceeded", etc messages.
Yeah I think it's a good thing to have, especially when optics are important to upper management. They won't look at the production servers directly, they will look at your stats which UK provides. So it would be good for them to be able to see that the service has been 100% up rather than down just because UK couldn't connect to the services and not because the services were actually down. Doesn't have to be a DDoS on the UK, it could be something innocent like "tripped over UK server network cable and it came out" or "UK NIC faulty, had to replace".
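To make the override idea concrete, here is a rough sketch under stated assumptions. It is not Uptime Kuma's data model: the `Heartbeat` class, its `override_reason` field, and both helper functions are hypothetical. An overridden heartbeat keeps its original status but is counted as up, and its message is replaced by the operator's reason.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class Heartbeat:
    up: bool
    message: str = ""
    override_reason: Optional[str] = None  # hypothetical field set by the operator


def apply_override(beat: Heartbeat, reason: str) -> Heartbeat:
    """Return a copy whose displayed message is the operator's reason."""
    return replace(beat, message=reason, override_reason=reason)


def uptime_percent(beats: Sequence[Heartbeat]) -> float:
    """Uptime over the given heartbeats, counting overridden ones as up."""
    if not beats:
        raise ValueError("no heartbeats in the selected range")
    counted_up = sum(1 for b in beats if b.up or b.override_reason is not None)
    return 100.0 * counted_up / len(beats)
```

Keeping the original `up` value alongside the override reason would leave an audit trail, so a reader of the report can still see that the monitor recorded a failure and why it was discounted.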
In fact in this situation, it would be good then to suggest a "select reports range" (display reports within certain date and/or time range) and then "download reports" (to pdf) function.
My 2 cents worth.
@louislam commented on GitHub (Oct 17, 2021):
Ultimately, I think one possible solution is completely separating the core and the status page into two different projects.
So if someone attacks your status page, it won't take down the core too.
@gaby commented on GitHub (Oct 17, 2021):
Status page should be internal to your network. Not exposed to the internet.
@louislam commented on GitHub (Oct 17, 2021):
However, because that would take a large amount of effort, it won't happen anytime soon.
If you are using Cloudflare, setting Page Rule with Cache Everything for 5 mins and disabling WebSocket is a way to go too.
Use your internal address to access the dashboard.
@deefdragon commented on GitHub (Oct 17, 2021):
That depends on what you are using the status page for. I use my status page to show off the current state of the different APIs that my site uses. Similar to www.cloudflarestatus.com for Cloudflare.
@louislam commented on GitHub (Oct 17, 2021):
Agreed. If the status page isn't exposed to the Internet, OP's problem is not a problem.