Use YEAR + ERA for XSD 1.0 calendar year numbering#1685
Conversation
| case class DFDLDate( | ||
| calendar: Calendar, | ||
| override val hasTimeZone: Boolean, | ||
| prolepticBCE: Boolean = false |
There was a problem hiding this comment.
I'm not sure I understand the need for this proleptic flag. Doesn't the Calendar contain all the necessary era information in the ERA field?
We also still have a number of references in our codebase to EXTENDED_YEAR and patterns with uuuu--that feels like a bug to me. Should we instead be setting/getting the YEAR and ERA components everywhere and using yyyy in all of our patterns? And removing all references to EXTENDED_YEAR and uuuu?
So the idea is we let ICU parse things according to dfdl:calendarPattern. When we want to create a string for the infoset or in one of the path fn:year-from-date*() functions we get the YEAR and ERA fields, and if the ERA field is BC we negate the year. Whether or not the pattern has a G in it shouldn't matter I think?. And for unparse, we read a date from the infoset and set the YEAR to the absolute value of the year and set the ERA to BC if the year was negative. It is a little extra work (which it looks like you already have logic for in a few places), but it avoids the potential of confusing EXTENDED_YEAR with YEAR. Seems we don't really want EXTENDED_YEAR or uuuu appearing anywhere in our codebase, except for tests making sure 'uuuu' patterns work correct.
Note I think this change would mean that it's impossible to use dfdl:calendarPattern to model BC years using yyyy without a G indicator (which is maybe the original reason for switching to uuuu and EXTENDED_YEAR many years ago?). But I think that's correct. Many DFDL semantics are based on ICU, and that is just a limitation of the yyyy pattern. If you want a pattern that is of the form yyyy-MM-dd without a G era indicator, and you want negative years to indicate the era, then you must use uuuu with the assumption that your years are astronomical. If your years aren't astronomical, then DFDL can't currently model them, unless it's added a new feature to support negative yyyy patterns.
There was a problem hiding this comment.
So the prolepticBCE flag is not recording the ERA. It looks at which pattern is used, which is why it checks that the pattern does not contain a G. No G means there is no explicit era designator, so for a BCE value the era was carried by the sign and the year was parsed astronomically (the uuuu form). If it is BCE but the pattern contains G, the era is named explicitly and ICU parses it correctly, so the flag stays false and nothing is adjusted. But if it is BCE and the pattern has no G, the date is represented astronomically which has a year zero and so is off by one from XSD 1.0 for BCE, and the flag is set true to mark that the read-out needs an adjustment to match XSD 1.0. Perhaps a better name for that flag would have been astronomicalYear?
The need for that flag stemmed from the fact that I changed EXTENDED_YEAR to YEAR + ERA (for the most part, other than where special logic was needed to help correct the astronomical representation of BCE dates), but still kept the uuuu pattern as the default/implicit pattern. So if we change the default pattern to yyyy, it eliminates the need for that. The only issue, as you already mentioned, is that we won't be able to represent BCE years without the G marker, which seems to be the correct thing to do. So I'll update any BCE tests that previously used the implicit uuuu pattern to explicitly use a yyyy pattern with a G marker and era-worded input (e.g. BC), since negative years won't parse under yyyy on their own. This should simplify things — it removes the flag, the read-side and unparse adjustments, and the EXTENDED_YEAR/uuuu references.
b1e550a to
4ff33bc
Compare
stevedlawrence
left a comment
There was a problem hiding this comment.
Looks like, this feels liek the write approach. But I think we need to take it a bit farther and remove remaining patterns.
| // Use YEAR + ERA rather than EXTENDED_YEAR to match XSD 1.0 numbering: | ||
| // a negative lexical year is BCE, stored as ERA=BC with positive YEAR. | ||
| val yInt = Integer.parseInt(y) | ||
| if (yInt < 0) { |
There was a problem hiding this comment.
Do we throw an error if yInt is 0? These strings are XSD 1.0 date/times which do not support year 0.
There was a problem hiding this comment.
In practice, yInt can never be 0 unless the value somehow bypasses ICU entirely.
From my investigation and tracing, the raw document string first comes in through ConvertTextCalendarParser.parse. For example, consider the date 0000-01-01T23:00:00 with the explicit pattern:uuuu-MM-dd'T'HH:mm:ss
When ICU parses that text via df.parse(txt, cal, pos), it sets: YEAR = 1 and ERA = BC on the calendar instance. Later, DFDLCalendarConversion.datePartToXMLString renders that calendar value as: -0001-01-01
So by the time the string reaches datePartFromXMLString, the date has already been transformed from:
0000-01-01T23:00:00 to -0001-01-01T23:00:00. So ICU has already normalized year zero to -0001 (1 BC). As a result, yInt can never be 0 on this code path.
It is possible to intentionally specify year 0000 in an expected infoset, but that follows a different code path. In that case, the value is validated by dateTimeIsSame, which uses the new XMLGregorianCalendar for the comparison. Since year 0000 is not valid under XSD 1.0, the comparison ultimately throws a TDMLException.
There was a problem hiding this comment.
This isn't correct. The fromXMLString functions are used to read date/time values from the infoset when unparsing. Even though parse will never create an infoset with year 0000, someone could modify/create a year 0000, which would hit this code path.
To verify, you could create a unparserTestCase that has a year 0000 in the infoset.
Note that you're right that the TDML runner will not use this code path, but that logic only happens after a parser. In unparser tests we do not hit the XMLGregorian code path because that's only used for comparing infosets, which we don't do on unparse.
There was a problem hiding this comment.
Ah, so the reason why those debug statements were showing up on that year zero test when I tried tracing it was because that test roundtrips, so I drew the wrong conclusion there. it also makes sense why when i intentionally set the infoset to 0000 for the date it routed to the dateTimeIsSame function for validation and never hit the code path.
Do we want to follow teh XMLGregorianCalendar behavior and simply throw an ILLEgalArgumentException here if yInt is 0? I think that should then be casted to a TDMLException during unparse? The issue with that is that we cannot really unit test these using the standard TDML test cases unless we convert those to parse/unparse errors and expect those errors from the tdml?
There was a problem hiding this comment.
Yeah, I think IllegalArgumentException is the right thing, but we should do it via the invalidCalendar function. It looks like that's that standard thing our DFDLCalendarConversion code does when it determines a calendar string isn't valid.
When unparsing, the IllegalArugmentException should be caught by the NodeInfo Date/DateTime.fromXmlString stuff and eventually get converted to an UnparseError. The XMLGregorianCalendar code path only happens on parse, so if you create an unparserTestCase with a 0000 infoset it should create an UnparserError that the TDML test can expect.
There was a problem hiding this comment.
ok great! the last question I have about this for the parse side that gets routed to the XMLGregorianCalendar and that exception does not become a parseError, the TDMLRunner simply throws a TDMLException. Are we ok with that behavior? we don't have a test case for that in the form of TDML because of that reason. we could unit test the function itself to show that it throws an exception for year zero assuming the exception being thrown here is an acceptable behavior.
There was a problem hiding this comment.
Yeah, I think that behavior is fine. The XMLGregorianCalendar exception becoming a TDML error is reasonable because a failure there indicates an expected infoset in a parserTestCase that is invalid, so it's more of an incorrectly written TDML file than a parse/unparse error, so TDMLException makes sense. We often use that for indicating there is something wrong with a TDML file.
Yeah, we would need to write a unit test since tdml tests can't expect TDMLExceptions. We have a number of unit tests that do intercept[TDMLException] so there is precedent for this kind of thing.
Change date/dateTime year handling from astronomical numbering (uuuu/EXTENDED_YEAR) to year-of-era semantics (yyyy with YEAR/ERA), so years conform to XSD 1.0 (no year zero; 1 BCE = -0001). This covers the implicit/default calendar pattern, lexical infoset conversion, facet parsing/narrowing, and the binaryCalendarEpoch. Parsing, formatting, comparison, and XPath year extraction now read years via YEAR/ERA instead of the astronomical EXTENDED_YEAR. Facet conversion and the binaryCalendarEpoch route through DFDLCalendarConversion rather than hardcoded calendar patterns; the latter gains negative-year support. BCE years require an explicit era designator (G) with yyyy-based patterns. Explicit uuuu patterns still parse astronomically, but extracted years and infoset output are always rendered in XSD 1.0 numbering. DAFFODIL-3084
4ff33bc to
40cbb3a
Compare
| } else { | ||
| // value-space facets (min/max Inclusive/Exclusive, enumeration): | ||
| // parse in the element's own value space | ||
| calendarStringToBigDecimal(facet, primType, this) |
There was a problem hiding this comment.
This needs to take into account the prim type. For example, if the prim Type was xs:int then this is would try to convert the value to a calendar which will fail.
| } | ||
| } catch { | ||
| case e: IllegalArgumentException => | ||
| SDE("invalid facet restriction: %s", e.getMessage) |
There was a problem hiding this comment.
We should include the facet name in this SDE
| case u: UnsuppressableException => throw u | ||
| case _: Exception => | ||
| try { | ||
| new JBigDecimal(value) |
There was a problem hiding this comment.
I'm not sure it's a good idea to to fall back to converting to a JBigDeicmal. This function is just about calendar strings, so if something passes in an integer string then it should not succeed, which is what this does. We did that before, but that probably wasn't the right thing.
I would also suggestw e dont'want to catch unsuppressable exception or exception. The only exceptino that should be thrown here that we don't want to bubble up is the InvalidPrimitiveDataException thrown by fromXMLString. I don't think getCalendar or toJBigDecimal can throw exceptions.
I would also suggest we remove this function and just move calendar conversion logic into convertFacetToBigDecimal, that way that is the one function that handles converting a facet to big decimal
| case (_, _) => { | ||
| // Perform conversions once | ||
| val theValue = convertStringToBigDecimal(value, primType, theContext) | ||
| val theValue = calendarStringToBigDecimal(value, primType, theContext) |
There was a problem hiding this comment.
This probably wants to call convertFacetToBigDecimal, this part of the code can get more than just calendars types.
There was a problem hiding this comment.
This was the reason why I reverted that changed where that function you mentioned to remove was put back and renamed. I tried the inlining idea both cases then some of the tests started to fail when it tries to convert an already integer string. so, I tried to follow what the previously removed function was doing and because of the fallback idea. I thought it made more sense to keep that logic in one place instead of duplicating it since the inlining idea using the primtype to convert to JBigDecimal was not as simple as we originally thought it would
There was a problem hiding this comment.
That's fair. I think combining all the logic into convertFacetToBigDecimal might sovle that problem? Then we wouldn't need a falback since that function knows about facet types and primtypes and shouldn't need a fallback since it knows the exact format to expect.
| try { | ||
| DFDLDateTimeConversion.fromXMLString(e.binaryCalendarEpoch) | ||
| } catch { | ||
| case exception: IllegalArgumentException => |
There was a problem hiding this comment.
I think we should probably avoid directly calling DFDL*Conversion functions. Instead, we can use NodeInfo.DateTime.fromXMLString(...).getCalendar and catch InvalidPrimitiveDataException. That way NodeInfo.fromXMLString becomes the one source of truth for how to convert between XSD primitives and our internal primities.
Then our DFDLConversion functions just become an implementation detail of our NodeInfos, and we can easily change how NodeInfo does conversions in the future without affecting other code.
| * @param calendar the ICU calendar instance | ||
| * @return signed year (e.g. 1 BCE -> -1, 1 CE -> 1) | ||
| */ | ||
| def signedYear(calendar: Calendar): Int = { |
There was a problem hiding this comment.
Thoughts on making this a function on DFDLCalendar, similar to hasTimeZone and toJBigDecimal, so we could do something like
val xsdYear = dfdlDate.signedYearIt would have the same logic, but we don't have to pass in a calendar and it makes signedYear a bit more visible in case we need it for other things.
It looks like everywhere we call signedYear we are calling it on a DFDLCalendar so it shoudl be a straightforward change.
There was a problem hiding this comment.
I think that's a good idea
Change date/dateTime year handling from astronomical numbering
(uuuu/EXTENDED_YEAR) to year-of-era semantics (yyyy with YEAR/ERA), so
years conform to XSD 1.0 (no year zero; 1 BCE = -0001). This covers the
implicit/default calendar pattern, lexical infoset conversion, facet
parsing/narrowing, and the binaryCalendarEpoch.
Parsing, formatting, comparison, and XPath year extraction now read years
via YEAR/ERA instead of the astronomical EXTENDED_YEAR. Facet conversion and
the binaryCalendarEpoch route through DFDLCalendarConversion rather than
hardcoded calendar patterns; the latter gains negative-year support.
BCE years require an explicit era designator (G) with yyyy-based patterns.
Explicit uuuu patterns still parse astronomically, but extracted years and
infoset output are always rendered in XSD 1.0 numbering.
DAFFODIL-3084