getaddrinfo ENOTFOUND occasionally #3368

Closed
opened 2026-02-28 03:27:08 -05:00 by deekerman · 13 comments
Owner

Originally created by @sudoexec on GitHub (May 29, 2024).

📑 I have found these related issues/pull requests

  • https://github.com/louislam/uptime-kuma/issues/220
  • https://github.com/louislam/uptime-kuma/issues/3992
  • https://github.com/louislam/uptime-kuma/issues/4765

🛡️ Security Policy

  • [x] I agree to have read this project's Security Policy (https://github.com/louislam/uptime-kuma/security/policy)

Description

There are occasional getaddrinfo ENOTFOUND errors (0–3 errors per day).

Uptime Kuma is running in k8s. The upstream DNS is k8s's CoreDNS, and CoreDNS doesn't have any error logs.
I used while true; do nslookup example.com && sleep 1; done to test DNS resolution and saw no errors.

The error occurs randomly and I can't reproduce it.
Are there any methods to find details about this error?

👟 Reproduction steps

Can't reproduce.

👀 Expected behavior

No getaddrinfo ENOTFOUND errors.

😓 Actual Behavior

getaddrinfo ENOTFOUND

🐻 Uptime-Kuma Version

1.23.11

💻 Operating System and Arch

k8s

🌐 Browser

125.0.6422.112 (Official Build) Arch Linux (64-bit)

🖥️ Deployment Environment

  • Runtime: k8s v1.18.1
  • Database: sqlite
  • Filesystem used to store the database on: local storage via hostpath
  • number of monitors: 52

📝 Relevant log output

Failing: getaddrinfo ENOTFOUND

@CommanderStorm commented on GitHub (May 29, 2024):

Same steps as in https://github.com/louislam/uptime-kuma/issues/4765

getaddrinfo ENOTFOUND test.xyz

  • What is the TTL of the domains you are using?
  • Do you have DNS caching enabled in the settings?

Most commonly, this issue is caused by using a DNS resolver which does not like the level of DNS requests it is getting.
=> your DNS server is dropping SOME requests
=> have you enabled NSCD in the settings to lower the number of DNS requests to one per TTL (instead of one on every request)?


@sudoexec commented on GitHub (May 29, 2024):

Same steps as in #4765

getaddrinfo ENOTFOUND test.xyz

  • What is the TTL of the domains you are using?
  • Do you have DNS caching enabled in the settings?

Most commonly, this issue is caused by you using a DNS resolver which does not like the level of DNS requests it is getting. => your DNS Server is dropping SOME requests => have you enabled NSCD in the settings to lowered the amount of DNS requests to your TTL (instead of on every request)

  • TTL is 600
  • DNS caching is enabled
    (screenshot: https://github.com/louislam/uptime-kuma/assets/27770920/048e7366-2f7f-447b-8139-bfa3fe7c726d)

@CommanderStorm commented on GitHub (May 29, 2024):

I have no clue what could be causing this.

Let's rule out the obvious causes first:

  • Could you look in the log whether NSCD has been started successfully? (possible cause: using a custom UID/GID)
  • Have you verified that the TTL is actually 600?
  • Regarding "coredns don't have any error logs": just to make sure, have you activated https://coredns.io/plugins/errors/ and/or https://coredns.io/plugins/log/? What are the logs?

@sudoexec commented on GitHub (May 29, 2024):

could you look in the log whether NSCD has been started successfully? (possible cause: using a custom UID/GID)

ps aux shows NSCD is running.

have you verified that the TTL is actually 600?

I'm sure the TTL is 600.

coredns don't have any error logs

Just to make sure: have you activated https://coredns.io/plugins/errors/ and/or https://coredns.io/plugins/log/?
What are the logs?

I enabled the errors plugin but not the log plugin. I'll try enabling the log plugin to find more details.


@thielj commented on GitHub (May 30, 2024):

@sudoexec Alpine or other musl based Linux? Can you post a copy of your host's and the running container's /etc/resolv.conf?

I have seen similar issues in the past, including with Kubernetes, usually involving multiple DNS servers or related to search domains. The musl resolver would send out multiple parallel queries and ignore all replies but the first one. If that response was an error, this is what you would get. If the "good" lookup would usually win the race, you wouldn't see this error often.

Also, a regular nslookup or dig (or the DNS monitors in Kuma) does name-service lookups differently than, for example, curl or HTTP requests in Node, which use the resolver (getaddrinfo) provided by the C library. I just had a quick google and these might give some background:

https://jvns.ca/blog/2022/02/23/getaddrinfo-is-kind-of-weird/
https://medium.com/@hsahu24/understanding-dns-resolution-and-resolv-conf-d17d1d64471c

(this is just a personal opinion, but I wouldn't touch nscd with a barge pole)


@sudoexec commented on GitHub (May 30, 2024):

@thielj The host machine is Ubuntu 18.04.
Here are the resolv.conf files:

# Host
nameserver 119.29.29.29

# Container
nameserver 10.96.0.10                 # k8s coredns
search namespace.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Thanks for the info you provided; I've learned more about DNS internals from it.

Additionally, I've added another nameserver to the Uptime Kuma pod, and there have been no errors in the past 2 days.


@thielj commented on GitHub (May 31, 2024):

If you get more getaddrinfo-related errors: those resolv.conf settings, and the internal DNS they lead to, are the rabbit hole you need to dig into, all the way from the container/pod down your stack.

https://coredns.io/2017/06/08/how-queries-are-processed-in-coredns/


@CommanderStorm commented on GitHub (May 31, 2024):

We should likely document this here
https://github.com/louislam/uptime-kuma/wiki/Troubleshooting

What is your second nameserver? (How did you find its IP? Do you have multiple coredns instances running?)

(Not a kubernetes/dns wizard 😅)


@sudoexec commented on GitHub (May 31, 2024):

@thielj Thanks again for your help. I'll try it.

@CommanderStorm

Additionally, I've added another nameserver to uptime kuma pod, and there're no errors in the past 2 days.

In fact, "another nameserver" is 1.1.1.1, in case the issue is caused by coredns.


@thielj commented on GitHub (May 31, 2024):

@sudoexec This probably doesn't do what you expect, and if it does, you're relying on implementation-specific behaviour of POSIX getaddrinfo (https://pubs.opengroup.org/onlinepubs/9699919799/functions/getaddrinfo.html). There are at least four different major implementations, and most of them can be further configured; see nsswitch.conf (https://www.gnu.org/software/libc/manual/html_node/NSS-Configuration-File.html) for an example.

The two most common, and their default behaviour with regard to the DNS resolver, are:

  • glibc, which will query the first server, and if it replies saying that it can't resolve your name, that's the final result. Only if the first server doesn't reply at all within the timeout, glibc would move on. For the purpose of monitoring, this can effectively mask problems in your Kubernetes DNS setup. Unless you monitor to show off "all green" to your boss or a client, it's probably not what you want.

  • musl, which will query both servers in parallel, and the first to reply wins. If 1.1.1.1 is faster than coredns and says the name is unresolvable, then that's the final result. This usually ends in your internal DNS winning the race 99.99% of the time. Instead of logging that your coredns is sometimes slow, you will log lookup failures (without knowing that they actually came from 1.1.1.1).

So: If you specify more than one server in resolv.conf, BOTH should be able to resolve ALL your hosts. If you want to implement fallbacks, query routing and such, configure a coredns or dnsmasq instance appropriately and point your resolv.conf to that. If you still want two DNS entries in your resolv.conf, configure two identically redundant instances.

Also, if you run frequent probes, you will eventually see failures. That's pretty normal: with 99.99% reliability, a < 0.01% failure rate would be acceptable. Configure your probes to allow for one retry, maybe?

Alpine/Musl name resolver differences: https://wiki.musl-libc.org/functional-differences-from-glibc.html


@skrue commented on GitHub (Jun 11, 2024):

I started seeing this behavior after setting up AdGuard Home. In my previous setup I only had Unbound DNS running on my OPNsense router/firewall. Now, AdGuard will relay all requests that it doesn't decide to block to Unbound, so AdGuard is the primary DNS. My entire home network is whitelisted in AdGuard as is the Uptime Kuma IP, so no blocking should be happening there. I am running Uptime Kuma as an LXC container on my Proxmox host. getaddrinfo ENOTFOUND errors pop up roughly once a day for each monitor that I have configured. I have now increased the retry value from 0 to 2, let's see if that helps.


@sudoexec commented on GitHub (Jun 23, 2024):

Weeks ago, I changed my upstream DNS (which is provided by the cloud service and managed by systemd-resolved) to 2 other public DNS servers. There have been no getaddrinfo ENOTFOUND errors since.


@benoitjpnet commented on GitHub (Feb 17, 2025):

I started seeing this behavior after setting up AdGuard Home. In my previous setup I only had Unbound DNS running on my OPNsense router/firewall. Now, AdGuard will relay all requests that it doesn't decide to block to Unbound, so AdGuard is the primary DNS. My entire home network is whitelisted in AdGuard as is the Uptime Kuma IP, so no blocking should be happening there. I am running Uptime Kuma as an LXC container on my Proxmox host. getaddrinfo ENOTFOUND errors pop up roughly once a day for each monitor that I have configured. I have now increased the retry value from 0 to 2, let's see if that helps.

Having the same setup and the same issue. I wonder what's wrong with AdGuard... Nothing in the logs.
