Skip to content

fix: Allow for ValidationErrors to lead to failed events#73

Merged
effron merged 2 commits into
masterfrom
effron/main/more-robust-validation-error-handling
Jun 2, 2026
Merged

fix: Allow for ValidationErrors to lead to failed events#73
effron merged 2 commits into
masterfrom
effron/main/more-robust-validation-error-handling

Conversation

@effron

@effron effron commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

We had an issue where a single journaled event exceeded kinesis's limit, but because it didn't trigger a BatchTooLarge error, it did not process the batch, led to an exception, and caused the journaled worker to enter a death loop (it kept emitting the same batch, and running into the same error, on a loop).

This allows us to capture more types of ValidationExceptions, leading to failed events but not a totally broken job worker. It also adds additional validation to Event creation to prevent us from creating events that are doomed to fail by being too large.

@effron effron requested a review from a team as a code owner May 28, 2026 18:17
Comment on lines +35 to +40
payload_bytesize = event_data.merge(id: SecureRandom.uuid).to_json.bytesize
if payload_bytesize > KinesisBatchSender::KINESIS_MAX_RECORD_BYTES
raise RecordTooLargeError, "Journaled event '#{event.journaled_attributes[:event_type]}' " \
"exceeds Kinesis #{KinesisBatchSender::KINESIS_MAX_RECORD_BYTES}-byte per-record limit " \
"(#{payload_bytesize} bytes); refusing to enqueue."
end

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need this here instead of as a validation on Journaled::Outbox::Event because we use insert_all, bypassing model level validations.

Do you think it's worth adding as a db constraint (albeit slightly less precise) instead?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUIDs are fixed-length, so it seems like we could maintain a reasonable length estimate (perhaps with some padding) in either place. (It all kinda comes down to the way the string serialization of the JSON works over in the outbox worker, right?)

Do you think it's worth adding as a db constraint (albeit slightly less precise) instead?

I would say in addition to rather than instead -- the DB constraint can be a backstop that maintains the invariant, and the raise call should be at least as strict in terms of padding estimates but is also where you get the developer-friendly error message and exception type.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we don't strictly need the DB constraint as part of this PR, but if it's trivial to add then happy to review that too.

But mainly I think it's useful to both validate the data in motion and then express the hard requirements in the schema, and both are valuable.

@effron effron requested a review from smudge May 28, 2026 18:18
@effron effron changed the title fix: Allow for ValidationErrors to lead to failed jobs fix: Allow for ValidationErrors to lead to failed events May 28, 2026
@samandmoore

Copy link
Copy Markdown
Member

i believe this fixes a known issue:

fixes #33

@effron

effron commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

@samandmoore good to know! I went with the strategy of erroring at event creation time, rather than create-and-fail the event. Do you think I should change the approach?

@samandmoore

Copy link
Copy Markdown
Member

hm. my thinking is that id rather not lose an event nor error for the user.

so if we can accept the event and figure out how to fix the event to get it to flow through the pipeline, that seems better in some sense than breaking for the user.

it's definitely a trade off that accepts more pain for us operationally though.

Comment on lines -46 to +47
raise unless e.message.match?(BATCH_TOO_LARGE_PATTERN)

handle_batch_too_large(stream_name, stream_events)
handle_validation_error(stream_name, stream_events, e.message)

@smudge smudge May 29, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, and just flagging this for myself (if I ever try to reconstruct my understanding of the root cause), the raise unless is what essentially creates the poison pill jobs, as nothing upstream of this will mark these jobs as failed.

@smudge smudge left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking feedback about the DB constraint!

# placeholder id keeps this estimate honest.
payload_bytesize = event_data.merge(id: SecureRandom.uuid).to_json.bytesize
if payload_bytesize > KinesisBatchSender::KINESIS_MAX_RECORD_BYTES
raise RecordTooLargeError, "Journaled event '#{event.journaled_attributes[:event_type]}' " \

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, re: @samandmoore's point:

hm. my thinking is that id rather not lose an event nor error for the user.

so if we can accept the event and figure out how to fix the event to get it to flow through the pipeline, that seems better in some sense than breaking for the user.

I guess I'm sort of 🤞 that we would detect most case before it gets to a production request. But even then, I guess we have two choices:

  1. Disallow an operation from proceeding if it is fundamentally not loggable. (This has been where I gravitate, kind of for simplicity until we have a better sense for what we need to solve beyond that.)
  2. Allow it to proceed but fire the events directly into the dead letter queue in the hopes that they can be manually cleaned up later.

I'm less a fan of option 2 because I think we ultimately want to do away with the DLQ -- it's a bit of a crutch that makes it easier for us to ignore / tighten the ratchet on guaranteeing deliverability from the moment we construct the event payload.

@effron effron merged commit e4fe3a4 into master Jun 2, 2026
30 checks passed
@effron effron deleted the effron/main/more-robust-validation-error-handling branch June 2, 2026 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants