
Federated queue message store gets corrupted and doesn't recover #1862

@DarthLowen

Description


Describe the bug
After an unexpected power-down, I sometimes get an IO::EOFError, similar to the bug described in #1693 (I assume; I can't really compare logs).
For now it only appears to happen on the federated queue. It is the queue with the most traffic, so the fact that it is federated might be a coincidence. Please find my full log below.
Both federated and shoveled queues stop working after this error occurs, even though the lavinmq logs show that reconnects are successful. Even RabbitMQ thinks all is well, but in reality no data is transferred over the queues anymore. RabbitMQ log at the time of the error:

2026-04-16 19:07:45.429888+00:00 [info] <0.51729.0> Federation queue 'federated' in vhost '/' received a 'basic.cancel'
2026-04-16 19:07:50.453907+00:00 [info] <0.51839.0> Federation queue 'federated' in vhost '/' connected to queue 'federated' in vhost '/' on amqp://10.11.21.1:5672/%2F

Note: I don't need to remove the lavinmq data folder; just restarting the container is enough to get the federation and shovel links working again (but with potential loss of whatever was persisted in the queue).


    ██╗      █████╗ ██╗   ██╗██╗███╗   ██╗███╗   ███╗ ██████╗
    ██║     ██╔══██╗██║   ██║██║████╗  ██║████╗ ████║██╔═══██╗
    ██║     ███████║██║   ██║██║██╔██╗ ██║██╔████╔██║██║   ██║
    ██║     ██╔══██║╚██╗ ██╔╝██║██║╚██╗██║██║╚██╔╝██║██║▄▄ ██║
    ███████╗██║  ██║ ╚████╔╝ ██║██║ ╚████║██║ ╚═╝ ██║╚██████╔╝
    ╚══════╝╚═╝  ╚═╝  ╚═══╝  ╚═╝╚═╝  ╚═══╝╚═╝     ╚═╝ ╚══▀▀═╝

             The message broker built for peaks
2026-04-16T19:07:37.570927Z   WARN - lmq: Use --default-user-only-loopback instead.
2026-04-16T19:07:37.571139Z  INFO lmq[level: "Info", target: "stdout"] Logger settings
2026-04-16T19:07:37.579000Z  INFO lmq.launcher LavinMQ 2.7.0-rc.2
2026-04-16T19:07:37.579007Z  INFO lmq.launcher Crystal 1.19.1 (2026-01-20)
2026-04-16T19:07:37.579008Z  INFO lmq.launcher LLVM: 18.1.3
2026-04-16T19:07:37.579009Z  INFO lmq.launcher Default target: aarch64-unknown-linux-gnu
2026-04-16T19:07:37.579020Z  INFO lmq.launcher Build flags: --release --debug
2026-04-16T19:07:37.579024Z  INFO lmq.launcher Multithreading: 4 threads
2026-04-16T19:07:37.579079Z  INFO lmq.launcher PID: 1
2026-04-16T19:07:37.579081Z  INFO lmq.launcher Config file: /etc/lavinmq/lavinmq.ini
2026-04-16T19:07:37.579088Z  INFO lmq.launcher Data directory: /var/lib/lavinmq
2026-04-16T19:07:37.579660Z  INFO lmq.launcher Max mmap count: 65530
2026-04-16T19:07:37.579666Z  WARN lmq.launcher The max mmap count limit is very low, consider raising it.
2026-04-16T19:07:37.579667Z  WARN lmq.launcher The limits should be higher than the maximum of # connections * 2 + # consumer * 2 + # queues * 4
2026-04-16T19:07:37.579668Z  WARN lmq.launcher sysctl -w vm.max_map_count=1000000
2026-04-16T19:07:37.580792Z  INFO lmq.launcher FD limit: 524288
2026-04-16T19:07:37.656335Z  INFO lmq.vhost[vhost: "/"] Loading definitions
2026-04-16T19:07:37.664458Z  INFO lmq.vhost[vhost: "/"] Applying 6 exchanges
2026-04-16T19:07:37.669583Z  INFO lmq.vhost[vhost: "/"] Applying 4 queues
2026-04-16T19:07:37.681412Z  INFO lmq.message_store[queue: "shoveled", vhost: "/"] Loaded 1 segments, 0 messages
2026-04-16T19:07:37.696041Z  INFO lmq.message_store[queue: "vocachick", vhost: "/"] Loaded 1 segments, 0 messages
2026-04-16T19:07:37.698147Z  INFO lmq.message_store[queue: "MachineMgmtMgmt", vhost: "/"] Loaded 1 segments, 0 messages
2026-04-16T19:07:37.699988Z  INFO lmq The queue type classic is not supported by LavinMQ and will be changed to the default queue type
2026-04-16T19:07:37.707505Z  INFO lmq.message_store[queue: "federated", vhost: "/"] Loaded 1 segments, 0 messages
2026-04-16T19:07:37.707720Z  INFO lmq.vhost[vhost: "/"] Applying 0 exchange bindings
2026-04-16T19:07:37.707726Z  INFO lmq.vhost[vhost: "/"] Applying 0 queue bindings
2026-04-16T19:07:37.707727Z  INFO lmq.vhost[vhost: "/"] Definitions loaded
2026-04-16T19:07:37.731481Z  INFO lmq.http.server Bound to 0.0.0.0:15672
2026-04-16T19:07:37.747184Z  INFO lmq.http.server Bound to /tmp/lavinmqctl.sock
2026-04-16T19:07:37.747643Z  INFO lmq.metrics.server Bound to 127.0.0.1:15692
2026-04-16T19:07:37.752538Z  INFO lmq.server Listening for AMQP on 0.0.0.0:5672
2026-04-16T19:07:37.753119Z  INFO lmq.server Listening for MQTT on 127.0.0.1:1883
2026-04-16T19:07:37.755387Z  INFO lmq.launcher Finished startup in 0.173009018s
2026-04-16T19:07:38.732525Z  INFO lmq.amqp.client[vhost: "/", address: "172.18.0.2:56258", name: "xxx"] Connection established for user=xxx
2026-04-16T19:07:38.740035Z  INFO lmq.message_store[queue: "HMIMw_7c8a58d6-d0ee-45eb-a2d7-2edf7f1f1863", vhost: "/"] Loaded 1 segments, 0 messages
2026-04-16T19:07:38.743117Z  INFO lmq.message_store[queue: "HMIMgmt_6740000a-2ba8-46bd-bce9-4adb394c62fe", vhost: "/"] Loaded 1 segments, 0 messages
2026-04-16T19:07:40.240957Z  INFO lmq.amqp.client[vhost: "/", address: "10.11.0.110:34902", name: "Federation link (upstream: federated_f03f755c-fe55-4b6b-8c88-f954529da55d, policy: federated)"] Connection established for user=xxx
2026-04-16T19:07:40.273244Z  INFO lmq.amqp.client[vhost: "/", address: "10.11.0.110:34914", name: "Shovel shoveled_f03f755c-fe55-4b6b-8c88-f954529da55d"] Connection established for user=xxx
2026-04-16T19:07:43.813227Z  INFO lmq.amqp.client[vhost: "/", address: "172.18.0.1:51794"] Connection established for user=xxx
2026-04-16T19:07:44.318698Z  INFO lmq.message_store[queue: "MachineMgmtMw", vhost: "/"] Loaded 1 segments, 0 messages
2026-04-16T19:07:45.433902Z ERROR lmq.queue[queue: "federated", vhost: "/"] Queue closed due to error
path=/var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8/8c7667b3c233e22b4afac020a6088bf9250a7168/msgs.0000000606 pos=7563293 size=7563293 (LavinMQ::MessageStore::Error)
  from /usr/src/lavinmq/src/lavinmq/message_store.cr:146:19 in 'shift?'
  from /usr/src/lavinmq/src/lavinmq/amqp/queue/queue.cr:728:45 in 'deliver_loop'
  from /usr/share/crystal/src/wait_group.cr:68:13 in '->'
  from /usr/share/crystal/src/fiber.cr:170:11 in 'run'
  from ???
Caused by: EOF but @size=1 (IO::EOFError)
  from /usr/src/lavinmq/src/lavinmq/message_store.cr:129:11 in 'shift?'
  from /usr/src/lavinmq/src/lavinmq/amqp/queue/queue.cr:728:45 in 'deliver_loop'
  from /usr/share/crystal/src/wait_group.cr:68:13 in '->'
  from /usr/share/crystal/src/fiber.cr:170:11 in 'run'
  from ???

2026-04-16T19:07:45.453527Z  INFO lmq.amqp.client[vhost: "/", address: "10.11.0.110:34902", name: "Federation link (upstream: federated_f03f755c-fe55-4b6b-8c88-f954529da55d, policy: federated)"] Connection disconnected for user=xxx duration=00:00:05
2026-04-16T19:07:50.451903Z  INFO lmq.amqp.client[vhost: "/", address: "10.11.0.110:33548", name: "Federation link (upstream: federated_f03f755c-fe55-4b6b-8c88-f954529da55d, policy: federated)"] Connection established for user=xxx
2026-04-16T19:07:50.606940Z  INFO lmq.amqp.client[vhost: "/", address: "172.18.0.3:43476"] Connection established for user=xxx
2026-04-16T19:07:50.815172Z  INFO lmq.message_store[queue: "amq_c5a18072a7474723bbcdb9642850374d", vhost: "/"] Loaded 1 segments, 0 messages

Describe your setup
LavinMQ 2.7.0-rc.2 running in Docker on a custom-built imx8-based platform. Two clients on the same platform connect to it: one is a RabbitMQ C# client, the other an amqp-cpp client.
Federated and shoveled queues point to a RabbitMQ upstream server (4.2.4-alpine).

root@localhost:~# docker exec -it broker lavinmq --build-info           
LavinMQ 2.7.0-rc.2
Crystal 1.19.1 (2026-01-20)
LLVM: 18.1.3
Default target: aarch64-unknown-linux-gnu
Build flags: --release --debug

How to reproduce
I'm not exactly sure how to reproduce it. Perhaps: create a federated queue, write to it, and kill the app mid-write?

Expected behavior
In an ideal world, only the packet being written at the time of shutdown would be corrupted and lost. Automatic recovery after the error occurs would also be nice; I can provide RabbitMQ logs if that proves useful. Local connections do seem to keep working. Our federated queue can potentially hold quite some data, which should be flushed to the upstream after connection issues. If this mechanism is not guaranteed, I would love to know how to deal with it properly instead.
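To illustrate the recovery behavior I'd expect: the `EOF but @size=1` error suggests the store hits a record whose body is shorter than its declared length, i.e. a write cut off by the power loss. A truncation-tolerant reader could keep every complete record and truncate the segment at the last good offset. The sketch below is a minimal, hypothetical model (a simple 4-byte length-prefixed layout, NOT LavinMQ's actual segment format) showing that idea:

```python
import struct

def read_records(buf: bytes):
    """Parse length-prefixed records from a segment; return (records, good_offset).

    Hypothetical layout for illustration only: each record is a 4-byte
    big-endian length followed by that many payload bytes. A truncated
    trailing record is detected instead of raising EOF, and everything
    before it is kept; the file could then be truncated to good_offset.
    """
    records, pos = [], 0
    while pos < len(buf):
        if pos + 4 > len(buf):
            break  # partial length prefix: truncated tail
        (n,) = struct.unpack_from(">I", buf, pos)
        if pos + 4 + n > len(buf):
            break  # body shorter than declared length (the EOF-mid-record case)
        records.append(buf[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return records, pos

# A segment holding two complete records plus one cut off mid-body,
# as a power loss during a write could leave it.
segment = (
    struct.pack(">I", 3) + b"abc"
    + struct.pack(">I", 2) + b"xy"
    + struct.pack(">I", 10) + b"trunc"  # claims 10 bytes, only 5 present
)
recs, good = read_records(segment)
print(len(recs), good)  # → 2 13: two records survive, truncate file at offset 13
```

With this approach, only the message being written at shutdown is lost, and the queue can resume delivering instead of closing.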
