fix: Allow for ValidationErrors to lead to failed events by effron · Pull Request #73 · Betterment/journaled

effron · 2026-05-28T18:17:05Z

Summary

We had an issue where a single journaled event exceeded Kinesis's per-record limit. Because the resulting error wasn't a BatchTooLarge error, the batch was never processed; the exception was raised instead, and the journaled worker entered a death loop (it kept emitting the same batch and hitting the same error).

This allows us to capture more types of ValidationExceptions, leading to failed events but not a totally broken job worker. It also adds additional validation to Event creation to prevent us from creating events that are doomed to fail by being too large.

effron · 2026-05-28T18:18:33Z

+          payload_bytesize = event_data.merge(id: SecureRandom.uuid).to_json.bytesize
+          if payload_bytesize > KinesisBatchSender::KINESIS_MAX_RECORD_BYTES
+            raise RecordTooLargeError, "Journaled event '#{event.journaled_attributes[:event_type]}' " \
+              "exceeds Kinesis #{KinesisBatchSender::KINESIS_MAX_RECORD_BYTES}-byte per-record limit " \
+              "(#{payload_bytesize} bytes); refusing to enqueue."
+          end


we need this here instead of as a validation on Journaled::Outbox::Event because we use insert_all, bypassing model level validations.

Do you think it's worth adding as a db constraint (albeit slightly less precise) instead?

UUIDs are fixed-length, so it seems like we could maintain a reasonable length estimate (perhaps with some padding) in either place. (It all kinda comes down to the way the string serialization of the JSON works over in the outbox worker, right?)

> Do you think it's worth adding as a db constraint (albeit slightly less precise) instead?

I would say in addition to rather than instead -- the DB constraint can be a backstop that maintains the invariant, and the raise call should be at least as strict in terms of padding estimates but is also where you get the developer-friendly error message and exception type.

And we don't strictly need the DB constraint as part of this PR, but if it's trivial to add then happy to review that too.

But mainly I think it's useful to both validate the data in motion and then express the hard requirements in the schema, and both are valuable.
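To make the "fixed-length UUID plus padding" idea concrete, here is a minimal sketch of a pre-enqueue size check. It is not the gem's implementation: `MAX_RECORD_BYTES`, `ESTIMATE_PADDING_BYTES`, `estimated_payload_bytesize`, and `check_record_size!` are hypothetical names, the limit value is assumed for illustration, and `RecordTooLargeError` is redefined locally so the snippet runs on its own.

```ruby
require "json"
require "securerandom"

# Hypothetical stand-ins; the real constant and error class live in the gem.
MAX_RECORD_BYTES = 1_048_576     # assumed per-record limit, for illustration only
ESTIMATE_PADDING_BYTES = 64      # assumed slack for anything added downstream

class RecordTooLargeError < StandardError; end

# Estimate the serialized size with a placeholder UUID. The real id is
# assigned later, but every UUID string has the same length (36 characters),
# so the estimate does not depend on which UUID is used.
def estimated_payload_bytesize(event_data)
  event_data.merge(id: SecureRandom.uuid).to_json.bytesize
end

# Returns the estimated size, or raises if the padded estimate exceeds the limit.
def check_record_size!(event_type, event_data,
                       limit: MAX_RECORD_BYTES, padding: ESTIMATE_PADDING_BYTES)
  size = estimated_payload_bytesize(event_data)
  return size if size + padding <= limit

  raise RecordTooLargeError,
        "Journaled event '#{event_type}' is #{size} bytes " \
        "(limit #{limit}, padding #{padding}); refusing to enqueue."
end
```

Keeping the padding on the application side at least as strict as any DB-level backstop means the developer-friendly exception fires first, and the constraint only catches writes that bypass this path.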

samandmoore · 2026-05-28T19:13:36Z

i believe this fixes a known issue:

fixes #33

effron · 2026-05-28T19:15:24Z

@samandmoore good to know! I went with the strategy of erroring at event creation time, rather than create-and-fail the event. Do you think I should change the approach?

samandmoore · 2026-05-28T22:26:27Z

hm. my thinking is that I'd rather not lose an event nor error for the user.

so if we can accept the event and figure out how to fix the event to get it to flow through the pipeline, that seems better in some sense than breaking for the user.

it's definitely a trade off that accepts more pain for us operationally though.

smudge · 2026-05-29T15:42:22Z

-        raise unless e.message.match?(BATCH_TOO_LARGE_PATTERN)
-
-        handle_batch_too_large(stream_name, stream_events)
+        handle_validation_error(stream_name, stream_events, e.message)


Yup, and just flagging this for myself (in case I ever need to reconstruct my understanding of the root cause): the `raise unless` is what essentially creates the poison pill jobs, as nothing upstream of this will mark these jobs as failed.
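For anyone reconstructing this later, the difference between the two shapes can be sketched in isolation. Everything below is a hypothetical stand-in: `ValidationException` is a local class rather than the AWS SDK error, the pattern is made up, and `send_batch_old` / `send_batch_new` only mimic the control flow of the diff. What the real `handle_validation_error` does with the events is not shown in this thread; the sketch simply marks them all as failed.

```ruby
# Hypothetical stand-in for the SDK's validation error.
class ValidationException < StandardError; end

# Assumed pattern; the real BATCH_TOO_LARGE_PATTERN is not shown in this thread.
BATCH_TOO_LARGE_PATTERN = /too large/i

# Old shape: a validation error that does not match the pattern is re-raised.
# Nothing marks the events as failed, so the same batch is picked up again.
def send_batch_old(events, client)
  client.call(events)
  { succeeded: events, failed: [] }
rescue ValidationException => e
  raise unless e.message.match?(BATCH_TOO_LARGE_PATTERN)

  { succeeded: [], failed: events }
end

# New shape: every validation error ends with the events recorded as failed,
# so the worker can move on to the next batch.
def send_batch_new(events, client)
  client.call(events)
  { succeeded: events, failed: [] }
rescue ValidationException => e
  { succeeded: [], failed: events.map { |event| { event: event, error: e.message } } }
end
```

With a client that always raises an unmatched validation error, `send_batch_old` propagates the exception on every attempt, while `send_batch_new` returns the whole batch under `:failed`.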

smudge

Non-blocking feedback about the DB constraint!

smudge · 2026-05-29T17:25:03Z

+          # placeholder id keeps this estimate honest.
+          payload_bytesize = event_data.merge(id: SecureRandom.uuid).to_json.bytesize
+          if payload_bytesize > KinesisBatchSender::KINESIS_MAX_RECORD_BYTES
+            raise RecordTooLargeError, "Journaled event '#{event.journaled_attributes[:event_type]}' " \


Also, re: @samandmoore's point:

> hm. my thinking is that I'd rather not lose an event nor error for the user.
>
> so if we can accept the event and figure out how to fix the event to get it to flow through the pipeline, that seems better in some sense than breaking for the user.

I guess I'm sort of 🤞 that we would detect most cases before they get to a production request. But even then, I guess we have two choices:

1. Disallow an operation from proceeding if it is fundamentally not loggable. (This has been where I gravitate, kind of for simplicity until we have a better sense for what we need to solve beyond that.)

2. Allow it to proceed but fire the events directly into the dead letter queue in the hopes that they can be manually cleaned up later.

I'm less a fan of option 2 because I think we ultimately want to do away with the DLQ -- it's a bit of a crutch that makes it easier for us to put off tightening the ratchet on guaranteeing deliverability from the moment we construct the event payload.
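The two choices can be written down as a policy switch, purely to compare their behavior. This is an illustrative sketch and not code from the PR: `enqueue_event`, the `:reject` / `:dead_letter` policy names, and the in-memory `outbox` and `dead_letters` arrays are all hypothetical.

```ruby
class RecordTooLargeError < StandardError; end

# Hypothetical enqueue step. `payload` is the already-serialized event string.
def enqueue_event(payload, limit:, policy:, outbox:, dead_letters:)
  if payload.bytesize <= limit
    outbox << payload
    return :enqueued
  end

  case policy
  when :reject
    # Choice 1: stop the operation; the caller sees the error immediately.
    raise RecordTooLargeError, "payload is #{payload.bytesize} bytes (limit #{limit})"
  when :dead_letter
    # Choice 2: let the operation proceed and park the event for manual cleanup.
    dead_letters << payload
    :dead_lettered
  else
    raise ArgumentError, "unknown policy: #{policy.inspect}"
  end
end
```

Under `:reject` nothing undeliverable is ever stored, at the cost of failing the caller; under `:dead_letter` the caller succeeds but someone has to drain `dead_letters` later.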

effron added 2 commits May 28, 2026 14:03

rescue all validation exceptions, prevent too-large jobs from being enqueued

cd735e9

bump patch version

03e3db1

effron requested a review from a team as a code owner May 28, 2026 18:17

effron commented May 28, 2026

effron requested a review from smudge May 28, 2026 18:18

effron changed the title ~~fix: Allow for ValidationErrors to lead to failed jobs~~ fix: Allow for ValidationErrors to lead to failed events May 28, 2026

smudge reviewed May 29, 2026

smudge approved these changes May 29, 2026

smudge reviewed May 29, 2026

effron merged commit e4fe3a4 into master Jun 2, 2026
30 checks passed

effron deleted the effron/main/more-robust-validation-error-handling branch June 2, 2026 15:39
