|
| 1 | +# CSV Parser Philosophy |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +This document outlines the philosophy and design principles of the CSV parser library. |
| 6 | +It is intended to provide a clear understanding of the design decisions and the reasoning behind them. |
| 7 | + |
| 8 | +## Key Principles |
| 9 | + |
| 10 | +All design decisions should be guided by the following principles: |
| 11 | + |
| 12 | +- **Integrity**: Data integrity is paramount. Data should only be modified in ways the user would expect. |
| 13 | +- **Fail-fast**: Ambiguous or malformed input should result in immediate failure to avoid cascading errors. |
| 14 | +- **Low-level**: The library should provide low-level access for flexibility. |
| 15 | +- **Real-world compatible**: Subject to the principles above, the library should be compatible with real-world data, not just idealized data. |
| 16 | + |
| 17 | +## Escaped Fields |
| 18 | + |
| 19 | +Escaped fields... |
| 20 | + |
| 21 | +1. ...are surrounded by _escape characters_ (usually double quotes). |
| 22 | +2. ...can contain any character, including the delimiter and newline characters. |
| 23 | +3. ...can contain _escape characters_ themselves, which are preserved by doubling them. |
| 24 | +4. ...must be used for all fields that contain the delimiter, newline characters, or _escape characters_ themselves. |
| 25 | + |
| 26 | +The following table shows how escaped fields are parsed, using a double quote as the escape character: |
| 27 | + |
| 28 | +| Example | Result | Additional Note | |
| 29 | +| -------------------------------------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------------------------- | |
| 30 | +| `a,"example string, with delimiter",c` | `["a", "example string, with delimiter", "c"]` | Fields are escaped by surrounding them with _escape characters_. | |
| 31 | +| `a,"example string, with ""escape character""",c`  | `["a", "example string, with \"escape character\"", "c"]` | _Escape characters_ are preserved by doubling them, as specified by RFC 4180.               |
| 32 | +| `a,"example string,\nwith newline",c` | `["a", "example string,\nwith newline", "c"]` | Newlines are preserved in escaped fields. | |
| 33 | +| `a,example string",c` | Invalid syntax | Escaped fields must start with an escape character. | |
| 34 | +| `a,"example string,c` | Invalid syntax | Escaped fields must end with an escape character. | |
| 35 | +| `a, "example string",c` or `a,"example string" ,c` | Invalid syntax | Whitespace around quotes is disallowed, since it can lead to ambiguity. | |
| 36 | +| `a,example "str"ing,c`                             | Invalid syntax                                            | _Escape characters_ within unescaped fields are not allowed, since they can lead to ambiguity. |
| 37 | + |
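The rules in the table above can be illustrated with a minimal Python sketch. It is not the library's implementation: the function name `parse_record` and its parameters are hypothetical, and it assumes a single-character delimiter and escape character and an input that holds exactly one record.

```python
def parse_record(text: str, delimiter: str = ",", quote: str = '"') -> list[str]:
    """Strictly parse one record; raise ValueError on ambiguous input.

    Hypothetical helper: assumes single-character delimiter and quote.
    """
    fields: list[str] = []
    i, n = 0, len(text)
    while True:
        if i < n and text[i] == quote:
            # Escaped field: read up to the closing escape character.
            i += 1
            chars: list[str] = []
            while True:
                if i >= n:
                    raise ValueError("escaped field is never closed")
                if text[i] == quote:
                    if i + 1 < n and text[i + 1] == quote:
                        chars.append(quote)  # doubled quote is a literal quote
                        i += 2
                        continue
                    i += 1  # closing escape character
                    break
                chars.append(text[i])  # delimiters and newlines are kept as-is
                i += 1
            # Only a delimiter or the end of input may follow the closing quote.
            if i < n and text[i] != delimiter:
                raise ValueError("unexpected character after closing quote")
            fields.append("".join(chars))
        else:
            # Unescaped field: read up to the next delimiter or end of input.
            end = text.find(delimiter, i)
            if end == -1:
                end = n
            field = text[i:end]
            if quote in field:
                raise ValueError("escape character inside unescaped field")
            fields.append(field)
            i = end
        if i >= n:
            return fields
        i += 1  # skip the delimiter and parse the next field
```

Under these assumptions, `parse_record('a,"x, y",c')` returns `["a", "x, y", "c"]`, while each of the invalid rows in the table raises `ValueError` instead of being repaired silently.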
| 38 | +## Edge Cases |
| 39 | + |
| 40 | +Most of the CSV parser design adheres to the [RFC 4180 standard](https://www.ietf.org/rfc/rfc4180.txt). |
| 41 | +However, there are some edge cases that are not covered by the standard. |
| 42 | +These edge cases are handled in a way that is consistent with the principles above. |
| 43 | + |
| 44 | +| Supported Cases | Example | Reasoning | |
| 45 | +| ------------------------------- | ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | |
| 46 | +| Trailing delimiter | `a,b,` → `["a", "b", ""]` | Each field is separated by a delimiter, so a trailing delimiter should result in an empty field. | |
| 47 | +| Unquoted empty | `a,,c` → `["a", "", "c"]` | Each field is separated by a delimiter, so an unescaped empty field should result in an empty string. | |
| 48 | +| Multiple consecutive delimiters | `a,,,b` → `["a", "", "", "b"]` | Each field is separated by a delimiter, so multiple consecutive delimiters should result in empty fields. | |
| 49 | +| Empty escaped field             | `"",x` → `["", "x"]`                        | Escaped fields may be empty, since an empty value is valid data whether or not it is escaped. |
| 50 | +| Custom newline                  | rows separated by e.g. `\n`, `\r\n`, or `\r` | Configurable line endings are supported (default is `\n`). The parser does not analyze the data to detect line endings; the configured value is applied consistently. |
| 51 | +| Custom delimiter                | columns separated by e.g. `,` or `;`        | Configurable delimiters are supported (default is `,`). The parser does not analyze the data to detect delimiters; the configured value is applied consistently. |
| 52 | +| Custom escape character         | fields escaped by e.g. `"` or `'`           | Configurable escape characters are supported (default is `"`). The parser does not analyze the data to detect escape characters; the configured value is applied consistently. |
| 53 | +| Newline as last character | `a,b\n` → `["a", "b"]` | Newline at the end of the file is ignored, as it is not considered part of the data as per the RFC 4180 standard. It is an expected case in many CSV files. | |
| 54 | +| Missing final newline | `a,b,c` EOF | The parser should not require a final newline character at the end of the file, as per the RFC 4180 standard. | |
| 55 | +| Missing header row | - | The header row is optional, as per the RFC 4180 standard. The parser should be able to handle files without a header row. | |
| 56 | + |
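As an illustration of the newline-related rows above, the following Python sketch splits raw input into records using a configured line ending. The name `split_records` and its parameters are hypothetical, it assumes a single-character escape character, and it assumes escaped fields are well-formed (validation is left to a later step).

```python
def split_records(data: str, newline: str = "\n", quote: str = '"') -> list[str]:
    """Split raw CSV text into records using the configured line ending.

    Hypothetical helper: the line ending is never guessed from the data.
    """
    records: list[str] = []
    start = 0
    i = 0
    in_quotes = False
    while i < len(data):
        if data[i] == quote:
            # A doubled quote toggles twice, so the state ends up unchanged.
            in_quotes = not in_quotes
            i += 1
        elif not in_quotes and data.startswith(newline, i):
            records.append(data[start:i])
            i += len(newline)
            start = i
        else:
            i += 1
    # A final record without a trailing newline is still a record;
    # a trailing newline at the end of the input adds nothing.
    if start < len(data):
        records.append(data[start:])
    return records
```

With these assumptions, `split_records("a,b\n")` and `split_records("a,b")` both return `["a,b"]`, and a newline inside an escaped field does not end the record.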
| 57 | +| Unsupported Cases | Example | Reasoning | |
| 58 | +| --------------------------------------- | -------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | |
| 59 | +| Empty lines                             | `a,b\n\nc`                                         | Empty lines may be a result of a misconfiguration or an error in the data. The parser should fail fast to avoid cascading errors. |
| 60 | +| Inconsistent row lengths                | `a,b\nc,d,e`                                       | Fail fast on inconsistent row lengths to ensure data integrity. Each row should have the same number of fields.                   |
| 61 | +| Mixed styles                            | `'a',"b"` or `a,b\n1;2`                            | Fail fast on mixed character styles to ensure consistency in parsing. The parser should not attempt to guess the style.           |
| 62 | +| Backslash escaping                      | `a\,b` → `["a,b"]` or `a,"\"",c` → `["a", "\"", "c"]` | Backslash escaping is currently unsupported as it is not part of the RFC 4180 standard and could lead to ambiguity.               |
| 63 | +| Fields with comment-style trailing text | `a,b # note`                                       | Comments are not recognized: trailing comment text is parsed verbatim as part of the field. Avoid it to prevent ambiguity.        |
| 64 | + |
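The first two unsupported cases can be sketched as a fail-fast check over already-split records. This is only an illustration: `checked_rows` and its `parse` argument are hypothetical names, and `parse` stands in for whatever function turns one raw record into a list of fields.

```python
from typing import Callable, Iterable, Iterator, Optional


def checked_rows(
    records: Iterable[str],
    parse: Callable[[str], list[str]],
) -> Iterator[list[str]]:
    """Yield parsed rows, raising ValueError at the first structural problem.

    Hypothetical helper: `parse` converts one raw record into its fields.
    """
    expected: Optional[int] = None
    for number, record in enumerate(records, start=1):
        if record == "":
            # An empty line is rejected rather than skipped.
            raise ValueError(f"record {number}: empty line")
        row = parse(record)
        if expected is None:
            expected = len(row)  # the first row fixes the field count
        elif len(row) != expected:
            raise ValueError(
                f"record {number}: expected {expected} fields, got {len(row)}"
            )
        yield row
```

Because the check is a generator, the error surfaces at the offending record, and rows before it have already been yielded to the caller.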
| 65 | +## References |
| 66 | + |
| 67 | +- [RFC 4180](https://www.ietf.org/rfc/rfc4180.txt) - The standard for CSV files. |