refactor: align internal terminology with ubiquitous language #127

Merged
stevehansen merged 1 commit into master from refactor/align-internal-terminology
Jun 8, 2026

Conversation

@stevehansen

Owner

What

Aligns the codebase to a single documented vocabulary (record / physical line / field / value / quoting), captured in a new UBIQUITOUS_LANGUAGE.md. This came out of a glossary pass that surfaced six pervasive terminology conflicts; this PR fixes the ones that are safe to fix now.

Why

Csv is a public package with millions of downloads, so renaming a public member is a SemVer-major break. The changes here are scoped by blast radius so nothing in the public contract moves:

| Bucket | Action |
| --- | --- |
| 🟢 Internal / private identifiers | Renamed (visible only to Csv.Tests via InternalsVisibleTo → zero ecosystem impact) |
| 🟡 Public XML-doc text | Reworded to canonical terms (not part of the binary/source contract) |
| 🔴 Public member names | Untouched; deferred to a future vNext behind [Obsolete] forwarders |

Changes

Internal renames

  • Reader record classes: rawSplitLine → rawFields, RawSplitLine → RawFields, parsedLine → parsedValues, and the private property literally named Line (which returned the parsed field array) → ParsedValues.
  • Writer escape-vs-quote fix: FixedEscapeChars → QuoteTriggerChars, escapeChars → quoteTriggerChars, needsGeneralEscape → needsQuoting, and the wrap-the-field escape flag → mustQuote. needsQuoteEscape is kept (it genuinely means quote-doubling). cell/WriteCell/WriteRow/WriteLine are kept for consistency with the public writer API.

Doc rewording (non-breaking)

  • ColumnCount now documents "number of fields in this record"; ValidateColumnCount matches "field count per row"; Read* summaries say "Reads the records"; int indexers and ICsvLineSpan.GetSpan/GetMemory/TryGet* document a "field index"; CsvBufferWriter.WriteCell documents "quoting and escaping".

New file

  • UBIQUITOUS_LANGUAGE.md — the glossary, the blast-radius analysis, what changed in this pass, and a vNext rename-target table for the frozen public names (ColumnCount → FieldCount, ValidateColumnCount → ValidateFieldCount, LineHasColumn → RecordHasValue, ICsvLine → ICsvRecord).
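
As a rough illustration of the "[Obsolete] forwarders" idea that the deferred public renames rely on, the sketch below keeps a legacy member callable while pointing callers at the canonical name. The `ICsvRecord` shape and the `DemoRecord` class are hypothetical stand-ins written for this example; they are not the library's actual declarations, and the real vNext design may differ.

```csharp
using System;

// Hypothetical interface for illustration only; not the library's real ICsvRecord.
public interface ICsvRecord
{
    /// <summary>Gets the number of fields in this record.</summary>
    int FieldCount { get; }
}

// Hypothetical record type used only to demonstrate the forwarding pattern.
public sealed class DemoRecord : ICsvRecord
{
    private readonly string[] parsedValues;

    public DemoRecord(string[] parsedValues) => this.parsedValues = parsedValues;

    // Canonical name going forward.
    public int FieldCount => parsedValues.Length;

    // Legacy name stays binary- and source-compatible, but the compiler
    // warns (CS0618) and steers callers to the new member.
    [Obsolete("Use FieldCount instead.")]
    public int ColumnCount => FieldCount;
}
```

Existing callers of `ColumnCount` keep compiling with a warning, so the rename can ship in one major version and the forwarder can be removed in a later one.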

Verification

  • Builds on netstandard2.0 / net8.0 / net9.0, 0 errors.
  • All 179 tests pass.
  • No public API surface changed (renames are internal/private; only XML-doc text and internal identifiers were touched).

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request aligns the codebase and XML documentation with a newly defined 'Ubiquitous Language' glossary, renaming internal fields and updating comments to consistently use terms like 'record', 'field', and 'quoting' instead of 'line', 'cell', and 'escaping'. Feedback points out a critical inconsistency where \r is omitted from the quote-trigger characters in CsvWriter (unlike CsvBufferWriter), which could result in malformed CSVs. Additionally, the reviewer recommends caching the quote-trigger characters in a static array to avoid unnecessary per-row allocations in the writer's serialization methods.

Comment thread Csv/CsvWriter.cs
Comment on lines 24 to 30
 #if NET8_0_OR_GREATER
 // The separator is per-call so it can't be baked into a single cached SearchValues.
-// Keep the fixed escape chars cached and check the separator with a separate Contains.
+// Keep the fixed quote-trigger chars cached and check the separator with a separate Contains.
 // Without this caching, MemoryExtensions.IndexOfAny(ReadOnlySpan, ReadOnlySpan)/char[]
 // builds a fresh SearchValues<char> on the heap every call.
-private static readonly SearchValues<char> FixedEscapeChars = SearchValues.Create("'\n");
+private static readonly SearchValues<char> QuoteTriggerChars = SearchValues.Create("'\n");
 #endif

high

There is an inconsistency in the quote-trigger characters between CsvWriter and CsvBufferWriter. CsvBufferWriter correctly includes \r (carriage return) in its QuoteTriggerChars ("'\n\r"), but CsvWriter only includes '\n in its QuoteTriggerChars and quoteTriggerChars arrays.

Omitting \r can cause fields containing carriage returns to be written without quotes, resulting in malformed CSV files when read by standard parsers.

Additionally, we can define a static FixedQuoteTriggerChars array here to avoid allocating a new char[] array on every row write in WriteLine and WriteLineAsync.

#if NET8_0_OR_GREATER
        // The separator is per-call so it can't be baked into a single cached SearchValues.
        // Keep the fixed quote-trigger chars cached and check the separator with a separate Contains.
        // Without this caching, MemoryExtensions.IndexOfAny(ReadOnlySpan, ReadOnlySpan)/char[]
        // builds a fresh SearchValues<char> on the heap every call.
        private static readonly SearchValues<char> QuoteTriggerChars = SearchValues.Create("'\n\r");
#endif
        private static readonly char[] FixedQuoteTriggerChars = { '\'', '\n', '\r' };

Comment thread Csv/CsvWriter.cs Outdated
Comment on lines +471 to +491
+var quoteTriggerChars = new[] { separator, '\'', '\n' };
 for (var i = 0; i < columnCount; i++)
 {
     if (i > 0)
         writer.Write(separator);

     if (i < data.Length)
     {
-        var escape = false;
+        var mustQuote = false;
         var cell = data[i] ?? string.Empty;
 #if NET8_0_OR_GREATER
         if (cell.Contains('"'))
 #else
         if (cell.Contains("\""))
 #endif
         {
-            escape = true;
+            mustQuote = true;
             cell = cell.Replace("\"", "\"\"");
         }
-        else if (cell.IndexOfAny(escapeChars) >= 0)
-            escape = true;
+        else if (cell.IndexOfAny(quoteTriggerChars) >= 0)
+            mustQuote = true;

medium

Allocating a new char[] array (quoteTriggerChars) on every single call to WriteLine introduces unnecessary garbage collection pressure, especially when writing large datasets.

By using the static FixedQuoteTriggerChars array and checking the dynamic separator character separately via IndexOf, we can completely eliminate this per-row allocation.

            for (var i = 0; i < columnCount; i++)
            {
                if (i > 0)
                    writer.Write(separator);

                if (i < data.Length)
                {
                    var mustQuote = false;
                    var cell = data[i] ?? string.Empty;
#if NET8_0_OR_GREATER
                    if (cell.Contains('"'))
#else
                    if (cell.Contains("\""))
#endif
                    {
                        mustQuote = true;
                        cell = cell.Replace("\"", "\"\"");
                    }
                    else if (cell.IndexOf(separator) >= 0 || cell.IndexOfAny(FixedQuoteTriggerChars) >= 0)
                        mustQuote = true;

Comment thread Csv/CsvWriter.cs Outdated
Comment on lines 506 to 508
var quoteTriggerChars = new[] { separator, '\'', '\n' };
for (var i = 0; i < columnCount; i++)
{

medium

Similar to WriteLine, we can avoid allocating the quoteTriggerChars array on every call to WriteLineAsync by removing it and using FixedQuoteTriggerChars instead.

            for (var i = 0; i < columnCount; i++)
            {

Comment thread Csv/CsvWriter.cs Outdated
     await writer.WriteAsync('"').ConfigureAwait(false);
 }
-else if (cell.IndexOfAny(escapeChars) >= 0)
+else if (cell.IndexOfAny(quoteTriggerChars) >= 0)

medium

Update the check in WriteLineAsync to use FixedQuoteTriggerChars and check the separator separately, avoiding the allocated array.

                    else if (cell.IndexOf(separator) >= 0 || cell.IndexOfAny(FixedQuoteTriggerChars) >= 0)

@stevehansen

Owner Author

Thanks @gemini-code-assist — both points are valid, and the \r one is a genuine pre-existing bug:

  • \r missing from CsvWriter's quote triggers (HIGH): confirmed. CsvWriter triggered on ", the separator, ', and \n but not \r, while CsvBufferWriter already included it — so a field like a\rb was written unquoted and is malformed for strict RFC-4180 parsers.
  • Per-row char[] allocation (MEDIUM): agreed, worth hoisting to a static array + a separate separator check.

This PR is a deliberately behavior-neutral terminology/rename pass, so to keep it clean I've addressed both in a focused follow-up rather than here: #128 (adds \r across all CsvWriter paths — sync, async, and the ReadOnlyMemory<char> paths — removes the allocation, and adds regression tests + a CHANGELOG entry). This PR stays terminology-only.
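
To make the a\rb failure mode above concrete, here is a small sketch of a quote-aware record counter. It is a toy reader written for this example (the `RecordCountSketch` class and `CountRecords` method are hypothetical, not part of Csv), and it assumes a reader that treats a bare CR, a bare LF, or CRLF outside quotes as a record terminator.

```csharp
using System;

// Toy reader for illustration only; not the library's parser.
public static class RecordCountSketch
{
    // Counts non-empty records. Line breaks inside double quotes do not end a record.
    public static int CountRecords(string csv)
    {
        var records = 0;
        var inQuotes = false;
        var hasContent = false;

        for (var i = 0; i < csv.Length; i++)
        {
            var c = csv[i];
            if (c == '"')
            {
                // A doubled quote ("") toggles twice, so the quoted state is preserved.
                inQuotes = !inQuotes;
                hasContent = true;
            }
            else if (!inQuotes && (c == '\r' || c == '\n'))
            {
                // Treat CRLF as a single terminator.
                if (c == '\r' && i + 1 < csv.Length && csv[i + 1] == '\n')
                    i++;
                if (hasContent)
                    records++;
                hasContent = false;
            }
            else
            {
                hasContent = true;
            }
        }

        // Count a trailing record that has no final line break.
        return hasContent ? records + 1 : records;
    }
}
```

With this counter, the unquoted text `a\rb,c\n` yields two records, while the quoted form `"a\rb",c\n` yields one, which is the difference the missing `\r` trigger makes.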

@gemini-code-assist


Thanks for the update, @stevehansen. That sounds like a solid plan. Keeping the terminology refactor focused while addressing the \r bug and allocation improvements in a separate, dedicated PR (#128) is definitely the right approach to ensure both are handled cleanly and effectively. I look forward to reviewing the follow-up.

stevehansen added a commit that referenced this pull request Jun 7, 2026
* fix: quote fields containing a carriage return in CsvWriter

Per RFC 4180 a field that contains CR, LF, the separator, or a quote must be
quoted. CsvWriter only triggered on '\n', the separator, the single quote, and
'"', so a value like `a\rb` was written unquoted — mis-parsed by strict readers
and split into two records when re-read. CsvBufferWriter already included '\r';
all CsvWriter paths (sync, async, and the ReadOnlyMemory<char> paths) now match it.

Also removes the per-row char[] allocation in WriteLine/WriteLineAsync by caching
the fixed quote-trigger characters in a static array and checking the variable
separator separately.

Surfaced by Gemini's review of #127.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

* perf: use cached SearchValues on the NET8+ writer hot path

Address PR review: on NET8+, WriteLine/WriteLineAsync now reuse the cached
SearchValues<char> via cell.AsSpan().IndexOfAny + string.Contains(separator)
instead of char[] IndexOfAny, matching the existing memory write paths. The
char[] fallback is retained for netstandard2.0 and scoped under #if so it is
not flagged unused on NET8+.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
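
The two commits above describe a cached SearchValues<char> on NET8+ with a static char[] fallback for netstandard2.0, plus a separate check for the per-call separator. A minimal sketch of that pattern might look like the following; the `TriggerScan` class and `MustQuote` method are hypothetical names chosen for this example, and the actual CsvWriter code may be organised differently.

```csharp
using System;
#if NET8_0_OR_GREATER
using System.Buffers;
#endif

// Illustrative helper only; not the library's CsvWriter.
public static class TriggerScan
{
#if NET8_0_OR_GREATER
    // Built once; the separator varies per call, so it cannot be baked in here.
    private static readonly SearchValues<char> QuoteTriggerChars = SearchValues.Create("'\n\r");
#else
    // Fallback for targets without SearchValues; scoped under #if so it is not unused on NET8+.
    private static readonly char[] FixedQuoteTriggerChars = { '\'', '\n', '\r' };
#endif

    // Returns true when the cell has to be wrapped in double quotes.
    public static bool MustQuote(string cell, char separator)
    {
#if NET8_0_OR_GREATER
        return cell.Contains('"')
            || cell.Contains(separator)
            || cell.AsSpan().IndexOfAny(QuoteTriggerChars) >= 0;
#else
        return cell.IndexOf('"') >= 0
            || cell.IndexOf(separator) >= 0
            || cell.IndexOfAny(FixedQuoteTriggerChars) >= 0;
#endif
    }
}
```

Neither branch allocates per call: the fixed trigger set lives in a static field, and the separator is tested with a single-character search.
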
Internal/private identifiers and XML-doc comments now follow a single
documented vocabulary (record / physical line / field / value / quoting).
No public API changes — every rename is internal or doc-only.

- Reader record classes: rawSplitLine->rawFields, parsedLine->parsedValues,
  and the private `Line` property (which returned the parsed field array)
  ->ParsedValues.
- Writer: fix the escape-vs-quote naming (FixedEscapeChars->QuoteTriggerChars,
  needsGeneralEscape->needsQuoting, the wrap-the-field `escape` flag->mustQuote).
  Kept cell/WriteCell/WriteRow for consistency with the public writer API.
- Reword misleading public XML docs (ColumnCount counts fields, Read* yields
  records, int indexers take a field index, WriteCell does quoting and escaping).
- Add UBIQUITOUS_LANGUAGE.md: the glossary, blast-radius analysis, and a vNext
  rename-target list for the frozen public names.

Builds on netstandard2.0/net8.0/net9.0; all 179 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@stevehansen stevehansen force-pushed the refactor/align-internal-terminology branch from c0c4563 to 3197217 Compare June 7, 2026 19:56
@stevehansen stevehansen merged commit f7b6001 into master Jun 8, 2026
3 checks passed
@stevehansen stevehansen deleted the refactor/align-internal-terminology branch June 8, 2026 07:19