mirror of
https://github.com/louislam/uptime-kuma.git
synced 2026-03-02 22:57:00 -05:00
Looping Crash After Deleting Monitor #2627
Originally created by @jjschwarz on GitHub (Sep 24, 2023).
Description
I removed a single monitor from the system via the Delete button in the UI. After a few minutes of waiting, I started getting WebSocket timeouts. I checked and saw that the Kubernetes pods were restarting in a loop.
👟 Reproduction steps
Delete monitor and wait
👀 Expected behavior
Monitor disappears and system continues to function
😓 Actual Behavior
System crashes in repeated loop upon start
🐻 Uptime-Kuma Version
1.22.1
💻 Operating System and Arch
Docker image from docker.io/louislam/uptime-kuma
🌐 Browser
Firefox 117.0.1
🐋 Docker Version
K8S v1.26.7
🟩 NodeJS Version
16
📝 Relevant log output
@jjschwarz commented on GitHub (Sep 24, 2023):
Updated to v1.23.2 image and essentially the same results:
==> Performing startup jobs and maintenance tasks
==> Starting application with user 0 group 0
Welcome to Uptime Kuma
Your Node.js version: 16.20.2
2023-09-24T08:49:55Z [SERVER] INFO: Welcome to Uptime Kuma
2023-09-24T08:49:55Z [SERVER] INFO: Node Env: production
2023-09-24T08:49:55Z [SERVER] INFO: Inside Container: true
2023-09-24T08:49:55Z [SERVER] INFO: Importing Node libraries
2023-09-24T08:49:55Z [SERVER] INFO: Importing 3rd-party libraries
2023-09-24T08:49:56Z [SERVER] INFO: Creating express and socket.io instance
2023-09-24T08:49:56Z [SERVER] INFO: Server Type: HTTP
2023-09-24T08:49:56Z [SERVER] INFO: Importing this project modules
2023-09-24T08:49:56Z [NOTIFICATION] INFO: Prepare Notification Providers
2023-09-24T08:49:56Z [SERVER] INFO: Version: 1.23.2
2023-09-24T08:49:56Z [DB] INFO: Data Dir: ./data/
2023-09-24T08:49:56Z [SERVER] INFO: Connecting to the Database
2023-09-24T08:50:00Z [DB] INFO: SQLite config:
[ { journal_mode: 'wal' } ]
[ { cache_size: -12000 } ]
2023-09-24T08:50:00Z [DB] INFO: SQLite Version: 3.41.1
2023-09-24T08:50:00Z [SERVER] INFO: Connected
2023-09-24T08:50:00Z [DB] INFO: Your database version: 10
2023-09-24T08:50:00Z [DB] INFO: Latest database version: 10
2023-09-24T08:50:00Z [DB] INFO: Database patch not needed
2023-09-24T08:50:00Z [DB] INFO: Database Patch 2.0 Process
2023-09-24T08:50:00Z [DB] INFO: patch-add-invert-keyword.sql is not patched
2023-09-24T08:50:00Z [DB] INFO: patch-add-invert-keyword.sql is patching
2023-09-24T08:51:54Z [SERVER] INFO: Shutdown requested
2023-09-24T08:51:54Z [SERVER] INFO: Called signal: SIGTERM
Trace: Error [ERR_SERVER_NOT_RUNNING]: Server is not running.
at new NodeError (node:internal/errors:387:5)
at Server.close (node:net:1804:12)
at Object.onceWrapper (node:events:627:28)
at Server.emit (node:events:525:35)
at emitCloseNT (node:net:1857:8)
at processTicksAndRejections (node:internal/process/task_queues:82:21) {
code: 'ERR_SERVER_NOT_RUNNING'
}
at process. (/app/server/server.js:1886:13)
at process.emit (node:events:513:28)
at emit (node:internal/process/promises:140:20)
at processPromiseRejections (node:internal/process/promises:274:27)
at processTicksAndRejections (node:internal/process/task_queues:97:32)
If you keep encountering errors, please report to https://github.com/louislam/uptime-kuma/issues
@jjschwarz commented on GitHub (Sep 24, 2023):
I restored the database from the nightly backup and went into the working system. You can see in the attached logs where I delete the same monitor (25)...and then the system falls apart. After a period of time the pod fails. This is all of the log from the pod that first fails after the delete. Once that happens we are back in the loop.
How can I fix this so I can delete a monitor? For now I will revert to the last good backup and wait on suggestions.
==> Performing startup jobs and maintenance tasks
==> Starting application with user 0 group 0
Welcome to Uptime Kuma
Your Node.js version: 16.20.2
2023-09-24T09:06:49Z [SERVER] INFO: Welcome to Uptime Kuma
2023-09-24T09:06:49Z [SERVER] INFO: Node Env: production
2023-09-24T09:06:49Z [SERVER] INFO: Inside Container: true
2023-09-24T09:06:49Z [SERVER] INFO: Importing Node libraries
2023-09-24T09:06:49Z [SERVER] INFO: Importing 3rd-party libraries
2023-09-24T09:06:50Z [SERVER] INFO: Creating express and socket.io instance
2023-09-24T09:06:50Z [SERVER] INFO: Server Type: HTTP
2023-09-24T09:06:50Z [SERVER] INFO: Importing this project modules
2023-09-24T09:06:50Z [NOTIFICATION] INFO: Prepare Notification Providers
2023-09-24T09:06:50Z [SERVER] INFO: Version: 1.23.2
2023-09-24T09:06:50Z [DB] INFO: Data Dir: ./data/
2023-09-24T09:06:50Z [SERVER] INFO: Connecting to the Database
2023-09-24T09:06:50Z [DB] INFO: SQLite config:
[ { journal_mode: 'wal' } ]
[ { cache_size: -12000 } ]
2023-09-24T09:06:50Z [DB] INFO: SQLite Version: 3.41.1
2023-09-24T09:06:50Z [SERVER] INFO: Connected
2023-09-24T09:06:50Z [DB] INFO: Your database version: 10
2023-09-24T09:06:50Z [DB] INFO: Latest database version: 10
2023-09-24T09:06:50Z [DB] INFO: Database patch not needed
2023-09-24T09:06:50Z [DB] INFO: Database Patch 2.0 Process
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: Database Patched Successfully
2023-09-24T09:06:50Z [SERVER] INFO: Load JWT secret from database.
2023-09-24T05:06:50-04:00 [SERVER] INFO: Adding route
2023-09-24T05:06:50-04:00 [SERVER] INFO: Adding socket handler
2023-09-24T05:06:50-04:00 [SERVER] INFO: Init the server
2023-09-24T05:06:50-04:00 [SERVER] INFO: Listening on 3001
2023-09-24T05:06:50-04:00 [SERVICES] INFO: Starting nscd
2023-09-24T05:07:06-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:06-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:06-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:07:06-04:00 [MONITOR] WARN: Monitor #25 'Expense System K8S Production': Failing: connect ECONNREFUSED 192.168.1.112:30981 | Interval: 60 seconds | Type: http | Down Count: 0 | Resend Interval: 0
2023-09-24T05:07:32-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:32-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:32-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:07:46-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:46-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:46-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:08:01-04:00 [MANAGE] INFO: Delete Monitor: 25 User ID: 1
2023-09-24T05:08:32-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:08:32-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:10:04-04:00 [MONITOR] ERROR: Caught error
2023-09-24T05:10:04-04:00 [MONITOR] ERROR: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
2023-09-24T05:10:32-04:00 [AUTH] ERROR: Invalid token. IP=172.16.121.65
2023-09-24T05:10:32-04:00 [MONITOR] WARN: Monitor #73 'WebDocs K8S Production': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:32-04:00 [MONITOR] WARN: Monitor #74 'MobileIron Sentry Admin Portal': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:33-04:00 [MONITOR] WARN: Monitor #75 'MobileIron VSP': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:33-04:00 [MONITOR] WARN: Monitor #76 'Avaya CM1 Management SSH': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: port
2023-09-24T05:10:34-04:00 [MONITOR] WARN: Monitor #77 'Avaya CM2 Management SSH': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: port
2023-09-24T05:10:35-04:00 [MONITOR] WARN: Monitor #78 'Workwave API Server': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
@chakflying commented on GitHub (Sep 24, 2023):
Deleting a monitor can take a long time if the database has been in use for a while and/or the retention period is long, and the server may be unresponsive while the delete runs. I'm guessing your K8s config has a periodic health check, and when the health check failed, K8s sent SIGTERM to shut down the server.
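The crash loop described above follows directly from how a liveness probe works: while the cascading DELETE blocks the single database connection, the HTTP endpoint stops answering, the probe fails repeatedly, and the kubelet terminates the pod with SIGTERM. A hypothetical probe block with relaxed timings might look like this (all values and the probe path are illustrative assumptions, not taken from any official uptime-kuma chart):

```yaml
# Illustrative liveness probe only; tune values for your workload.
livenessProbe:
  httpGet:
    path: /            # assumed endpoint; uptime-kuma serves the UI on 3001
    port: 3001
  initialDelaySeconds: 30
  periodSeconds: 90    # relaxed from a typical 30s, per the discussion below
  timeoutSeconds: 10
  failureThreshold: 3  # ~3 consecutive failures before the kubelet sends SIGTERM
```

With `periodSeconds: 90` and `failureThreshold: 3`, the server gets several minutes of unresponsiveness before a restart, instead of roughly 90 seconds with a 30-second probe.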
I think your options are to disable the health check, delete the monitor in the UI, and hope that it completes after a while;
or alternatively, shut down the server, open the SQLite database with an external tool, and delete the monitor manually from there. Remember to check that foreign_keys are turned on if you do so.
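The manual route could be sketched as follows (a minimal illustration in Python rather than the `sqlite3` CLI; the `monitor` table name matches what appears in the logs above, but treat this as an assumption about the schema, not a supported procedure):

```python
import sqlite3

def delete_monitor(db_path: str, monitor_id: int) -> None:
    """Delete one monitor row from an offline uptime-kuma SQLite database.

    Run only while the server is stopped. SQLite disables foreign-key
    enforcement per-connection by default; without the pragma below,
    ON DELETE CASCADE rules for child rows (heartbeats etc.) would
    silently not fire, leaving orphaned rows behind.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("PRAGMA foreign_keys = ON")
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM monitor WHERE id = ?", (monitor_id,))
    finally:
        conn.close()
```

The same two statements (`PRAGMA foreign_keys = ON;` then the `DELETE`) work verbatim in the `sqlite3` command-line shell.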
@jjschwarz commented on GitHub (Sep 24, 2023):
I will try to remove the health check and test the delete. Any suggested health check limits?
@chakflying commented on GitHub (Sep 24, 2023):
I don't think anyone has done any serious benchmarking on how long it takes, so do report back if you succeed since it would be a great data point!
Of course it would be best to restore the original health check once you are done deleting monitors.
@louislam commented on GitHub (Sep 24, 2023):
Duplicate of #2781
And unfortunately, I think it could only be solved by using MariaDB/MySQL in upcoming 2.0.0.
@jjschwarz commented on GitHub (Sep 25, 2023):
I removed the health check, but it never seems to have recovered. I waited about 15 minutes and then restarted it. This time the restart was clean, so my guess is that the delete completed but the app crashed in the meantime. I had 30-second health checks, so we expanded that to 90 seconds. Interestingly, I stopped getting errors in the logs after 2 minutes or so, but the system was still up (non-responsive).
uptime-kuma.log
@chakflying commented on GitHub (Sep 25, 2023):
Looking at the logs made me realize that all reads are blocked while writing. This contradicts my understanding that in SQLite WAL mode, writers should not block readers. I'm assuming it's because we are only using a single connection across the entire application.
@louislam Do you remember what the issue was with using multiple connections in the sqlite connection pool config? If there was some memory leak, is it worth investigating whether we can work around it?
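The WAL claim above is easy to check in isolation: with separate connections, a reader is not blocked by an open write transaction, which supports the single-shared-connection explanation for the blocking seen in the logs. A small experiment (illustration only; uptime-kuma's real pool config lives in its knex setup):

```python
import sqlite3, tempfile, os

# WAL mode requires a file-backed database, not :memory:.
path = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name

# isolation_level=None = autocommit; we manage transactions explicitly.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode=wal")
writer.execute("CREATE TABLE heartbeat (id INTEGER PRIMARY KEY, msg TEXT)")
writer.execute("INSERT INTO heartbeat (msg) VALUES ('up')")

# Start a long write transaction and leave it open...
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO heartbeat (msg) VALUES ('down')")

# ...a second connection can still read the last committed snapshot
# without waiting on the writer.
reader = sqlite3.connect(path, timeout=1)
rows = reader.execute("SELECT COUNT(*) FROM heartbeat").fetchone()[0]
print(rows)  # 1: only the committed row is visible; the read did not block

writer.execute("ROLLBACK")
writer.close(); reader.close(); os.unlink(path)
```

Running the same `SELECT` on the *writer's* connection mid-transaction would of course wait its turn, which is exactly the situation a single shared connection forces on every query in the application.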
Alternatively, I think I can modify the deleteMonitor function to delete old heartbeats first in fixed-size batches until all heartbeats are cleared, then delete the monitor. This should prevent a huge cascading delete from blocking for too long.
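The batched approach could be sketched as follows (Python against the raw SQLite schema, purely illustrative; the real change would live in the server's deleteMonitor code, and the table names and batch size are assumptions):

```python
import sqlite3

def delete_monitor_batched(conn: sqlite3.Connection, monitor_id: int,
                           batch_size: int = 1000) -> None:
    """Delete a monitor's heartbeats in short transactions, then the monitor.

    Each batch commits separately, so no single write transaction holds
    the database long enough to starve readers (or fail a health check).
    The IN (SELECT ... LIMIT ...) form avoids needing SQLite builds
    compiled with DELETE ... LIMIT support.
    """
    while True:
        with conn:  # one short transaction per batch
            cur = conn.execute(
                "DELETE FROM heartbeat WHERE id IN "
                "(SELECT id FROM heartbeat WHERE monitor_id = ? LIMIT ?)",
                (monitor_id, batch_size),
            )
        if cur.rowcount < batch_size:
            break  # short final batch: all heartbeats are cleared
    with conn:  # the monitor row itself is now cheap to delete
        conn.execute("DELETE FROM monitor WHERE id = ?", (monitor_id,))
```

The trade-off is total elapsed time: many small transactions take longer overall than one cascade, but the server stays responsive between batches.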