fix: Allow for ValidationErrors to lead to failed events#73
Conversation
| payload_bytesize = event_data.merge(id: SecureRandom.uuid).to_json.bytesize | ||
| if payload_bytesize > KinesisBatchSender::KINESIS_MAX_RECORD_BYTES | ||
| raise RecordTooLargeError, "Journaled event '#{event.journaled_attributes[:event_type]}' " \ | ||
| "exceeds Kinesis #{KinesisBatchSender::KINESIS_MAX_RECORD_BYTES}-byte per-record limit " \ | ||
| "(#{payload_bytesize} bytes); refusing to enqueue." | ||
| end |
There was a problem hiding this comment.
we need this here instead of as a validation on Journaled::Outbox::Event because we use insert_all, bypassing model level validations.
Do you think it's worth adding as a db constraint (albeit slightly less precise) instead?
There was a problem hiding this comment.
UUIDs are fixed-length, so it seems like we could maintain a reasonable length estimate (perhaps with some padding) in either place. (It all kinda comes down to the way the string serialization of the JSON works over in the outbox worker, right?)
Do you think it's worth adding as a db constraint (albeit slightly less precise) instead?
I would say in addition to rather than instead -- the DB constraint can be a backstop that maintains the invariant, and the raise call should be at least as strict in terms of padding estimates but is also where you get the developer-friendly error message and exception type.
There was a problem hiding this comment.
And we don't strictly need the DB constraint as part of this PR, but if it's trivial to add then happy to review that too.
But mainly I think it's useful to both validate the data in motion and then express the hard requirements in the schema, and both are valuable.
|
i believe this fixes a known issue: fixes #33 |
|
@samandmoore good to know! I went with the strategy of erroring at event creation time, rather than create-and-fail the event. Do you think I should change the approach? |
|
hm. my thinking is that id rather not lose an event nor error for the user. so if we can accept the event and figure out how to fix the event to get it to flow through the pipeline, that seems better in some sense than breaking for the user. it's definitely a trade off that accepts more pain for us operationally though. |
| raise unless e.message.match?(BATCH_TOO_LARGE_PATTERN) | ||
|
|
||
| handle_batch_too_large(stream_name, stream_events) | ||
| handle_validation_error(stream_name, stream_events, e.message) |
There was a problem hiding this comment.
Yup, and just flagging this for myself (if I ever try to reconstruct my understanding of the root cause), the raise unless is what essentially creates the poison pill jobs, as nothing upstream of this will mark these jobs as failed.
smudge
left a comment
There was a problem hiding this comment.
Non-blocking feedback about the DB constraint!
| # placeholder id keeps this estimate honest. | ||
| payload_bytesize = event_data.merge(id: SecureRandom.uuid).to_json.bytesize | ||
| if payload_bytesize > KinesisBatchSender::KINESIS_MAX_RECORD_BYTES | ||
| raise RecordTooLargeError, "Journaled event '#{event.journaled_attributes[:event_type]}' " \ |
There was a problem hiding this comment.
Also, re: @samandmoore's point:
hm. my thinking is that id rather not lose an event nor error for the user.
so if we can accept the event and figure out how to fix the event to get it to flow through the pipeline, that seems better in some sense than breaking for the user.
I guess I'm sort of 🤞 that we would detect most case before it gets to a production request. But even then, I guess we have two choices:
- Disallow an operation from proceeding if it is fundamentally not loggable. (This has been where I gravitate, kind of for simplicity until we have a better sense for what we need to solve beyond that.)
- Allow it to proceed but fire the events directly into the dead letter queue in the hopes that they can be manually cleaned up later.
I'm less a fan of option 2 because I think we ultimately want to do away with the DLQ -- it's a bit of a crutch that makes it easier for us to ignore / tighten the ratchet on guaranteeing deliverability from the moment we construct the event payload.
Summary
We had an issue where a single journaled event exceeded kinesis's limit, but because it didn't trigger a BatchTooLarge error, it did not process the batch, led to an exception, and caused the journaled worker to enter a death loop (it kept emitting the same batch, and running into the same error, on a loop).
This allows us to capture more types of
ValidationExceptions, leading to failed events but not a totally broken job worker. It also adds additional validation to Event creation to prevent us from creating events that are doomed to fail by being too large.