Mirror of https://github.com/louislam/uptime-kuma.git (synced 2026-03-02 22:57:00 -05:00)
getaddrinfo ENOTFOUND occasionally #3368
Originally created by @sudoexec on GitHub (May 29, 2024).
📑 I have found these related issues/pull requests
🛡️ Security Policy
Description
There are some `getaddrinfo ENOTFOUND` errors occasionally (0-3 errors per day). Uptime Kuma is running in k8s. The upstream DNS is k8s's coredns, and coredns doesn't have any error logs.
I used `while true; do nslookup example.com && sleep 1; done` to test DNS resolution, and there were no errors. The error occurs randomly and I can't reproduce it.
Is there any method to find details about this error?
👟 Reproduction steps
Can't reproduce.
👀 Expected behavior
No getaddrinfo ENOTFOUND errors.
😓 Actual Behavior
getaddrinfo ENOTFOUND
🐻 Uptime-Kuma Version
1.23.11
💻 Operating System and Arch
k8s
🌐 Browser
125.0.6422.112 (Official Build) Arch Linux (64-bit)
🖥️ Deployment Environment
📝 Relevant log output
@CommanderStorm commented on GitHub (May 29, 2024):
Same steps as in https://github.com/louislam/uptime-kuma/issues/4765
Most commonly, this issue is caused by using a DNS resolver which does not like the volume of DNS requests it is getting.
=> your DNS server is dropping SOME requests
=> have you enabled NSCD in the settings to lower the number of DNS requests to once per TTL (instead of one on every request)?
@sudoexec commented on GitHub (May 29, 2024):
@CommanderStorm commented on GitHub (May 29, 2024):
I have no clue what could be causing this.
Let's rule out the simple causes first:
could you look in the log whether NSCD has been started successfully? (possible cause: using a custom UID/GID)
have you verified that the TTL is actually 600?
Just to make sure: you have activated https://coredns.io/plugins/errors/ and/or https://coredns.io/plugins/log/?
What are the logs?
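For reference, enabling both plugins in a coredns Corefile looks roughly like this (a sketch only; the real Corefile in the cluster will contain other plugins such as kubernetes, forward, and cache, which stay as-is):

```
.:53 {
    errors   # report failures (SERVFAIL etc.) to stdout
    log      # log every query/response; noisy, but useful while debugging
    # ...existing plugins (kubernetes, forward, cache, ...) unchanged
}
```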
@sudoexec commented on GitHub (May 29, 2024):
`ps aux` shows NSCD is running. I'm sure the TTL is 600.
I enabled the errors plugin but not the log plugin. I'll enable the log plugin to find more details.
@thielj commented on GitHub (May 30, 2024):
@sudoexec Alpine or other musl based Linux? Can you post a copy of your host's and the running container's /etc/resolv.conf?
I have seen similar issues in the past, including with Kubernetes, usually involving multiple DNS servers or related to search domains. The musl resolver would send out multiple parallel queries and ignore all replies but the first one. If that response was an error, this is what you would get. If the "good" lookup would usually win the race, you wouldn't see this error often.
Also, a regular `nslookup` or `dig` (or the DNS monitors in Kuma) does name service lookups differently than, for example, `curl` or HTTP requests in Node, which use the resolver (`getaddrinfo`) provided by the C library. Just had a quick google and these might give some background:
https://jvns.ca/blog/2022/02/23/getaddrinfo-is-kind-of-weird/
https://medium.com/@hsahu24/understanding-dns-resolution-and-resolv-conf-d17d1d64471c
(this is just a personal opinion, but I wouldn't touch `nscd` with a barge pole)
@sudoexec commented on GitHub (May 30, 2024):
@thielj Host machine is ubuntu 18.04.
Here is my resolv.conf:
Thanks for the info you provided, I've learned more about DNS internals from it.
Additionally, I've added another nameserver to the Uptime Kuma pod, and there have been no errors in the past 2 days.
@thielj commented on GitHub (May 31, 2024):
If you get more getaddrinfo-related errors: those resolv.conf settings and the internal DNS they lead to are the rabbit hole you need to dig into, all the way from the container/pod down your stack.
https://coredns.io/2017/06/08/how-queries-are-processed-in-coredns/
@CommanderStorm commented on GitHub (May 31, 2024):
We should likely document this here
https://github.com/louislam/uptime-kuma/wiki/Troubleshooting
What is your second nameserver? (How did you find its IP? Do you have multiple coredns instances running?)
(Not a kubernetes/dns wizard 😅)
@sudoexec commented on GitHub (May 31, 2024):
@thielj Thanks again for your help. I'll try it
@CommanderStorm
In fact, the "another nameserver" is 1.1.1.1, in case the issue is caused by coredns.
@thielj commented on GitHub (May 31, 2024):
@sudoexec This probably doesn't do what you expect, and if it does, you're relying on specific implementation behaviour of POSIX getaddrinfo. There are at least four different major implementations, and most of them can be further configured, see nsswitch.conf for an example.
The two most common, and their default behaviour with regards to the DNS resolver are:
glibc, which will query the first server, and if it replies saying that it can't resolve your name, that's the final result. Only if the first server doesn't reply at all within the timeout, glibc would move on. For the purpose of monitoring, this can effectively mask problems in your Kubernetes DNS setup. Unless you monitor to show off "all green" to your boss or a client, it's probably not what you want.
musl, which will query both servers in parallel, and the first to reply wins. If 1.1.1.1 is faster than coredns and says the name is unresolvable, then that's the final result. This usually ends in your internal DNS winning the race 99.99% of the time. Instead of logging that your coredns is sometimes slow, you will log lookup failures (without knowing that they actually came from 1.1.1.1).
So: If you specify more than one server in resolv.conf, BOTH should be able to resolve ALL your hosts. If you want to implement fallbacks, query routing and such, configure a coredns or dnsmasq instance appropriately and point your resolv.conf to that. If you still want two DNS entries in your resolv.conf, configure two identically redundant instances.
Also, if you run frequent probes, you will eventually see failures. That's pretty normal. With a 99.99% reliability, a < 0.01% failure rate would be acceptable. Configure your probes to allow for one retry maybe?
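Concretely, the "one resolver that does the routing" setup suggested above could look roughly like this (a sketch with placeholder addresses; `10.96.0.10` stands in for whatever the cluster DNS service IP actually is):

```
# /etc/resolv.conf inside the pod — a single resolver, so nothing races:
nameserver 10.96.0.10

# Corefile on that resolver — internal zones answered locally,
# everything else forwarded to one upstream set:
cluster.local {
    kubernetes
    errors
}
. {
    forward . 1.1.1.1 1.0.0.1
    cache 600
    errors
}
```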
@skrue commented on GitHub (Jun 11, 2024):
I started seeing this behavior after setting up AdGuard Home. In my previous setup I only had Unbound DNS running on my OPNsense router/firewall. Now, AdGuard will relay all requests that it doesn't decide to block to Unbound, so AdGuard is the primary DNS. My entire home network is whitelisted in AdGuard as is the Uptime Kuma IP, so no blocking should be happening there. I am running Uptime Kuma as an LXC container on my Proxmox host.
`getaddrinfo ENOTFOUND` errors pop up roughly once a day for each monitor that I have configured. I have now increased the retry value from 0 to 2, let's see if that helps.
@sudoexec commented on GitHub (Jun 23, 2024):
Weeks ago, I changed my upstream DNS (which is provided by a cloud service and managed by systemd-resolved) to two other public DNS servers. There's no `getaddrinfo ENOTFOUND` error anymore.
@benoitjpnet commented on GitHub (Feb 17, 2025):
Having the same setup and same issue. I wonder what's wrong with AdGuard... Nothing in the logs.