Today, we will be looking for patterns in some documents.
Specifically, CSV-formatted files, AKA Comma-Separated Values. This is a well-known and simple file format. Additionally, the ABNF grammar for CSV specified in RFC 4180 is simple enough to comprehend that implementing a lexer and parser for it makes a great entry-level practice problem for anyone interested in topics like parsers, interpreters, and compilers.
At first you might think it’s as simple as splitting the CSV document string on newlines, then splitting each resulting line string on commas to find the fields.
That is perhaps better called the poor man’s CSV file format. It forces you to avoid commas and newlines in your fields. So you think: to support such characters in fields, let’s just come up with a scheme for escaping them. For instance, only split on newlines not preceded by an unescaped escape character, or some similar scheme…
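To see why that falls apart, here’s a minimal sketch of the naive approach; naiveParse and the sample strings are my own illustrations, not anything from the RFC:

// A naive "poor man's" CSV reader: split on newlines, then split each line on commas.
const naiveParse = text => text.split("\n").map(line => line.split(","))

console.log(naiveParse("a,b\nc,d"))
// [["a", "b"], ["c", "d"]]   (fine so far)

console.log(naiveParse('"hello, world",b\nc,d'))
// [['"hello', ' world"', "b"], ["c", "d"]]   (the comma inside the quoted field wrongly splits it)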
But these problems have already been solved in an elegant and easy to comprehend way in the published grammar for CSV in RFC 4180, so better to use that rather than reinvent the wheel. In this case, we are looking for practice writing a parser based on an established grammar (and a simple one at that), not practice with architecting our own grammar from scratch.
I’m told there is a not-uncommon CSV variant that does not follow the RFC standard’s approach of escaping a double quote with another double quote, and instead escapes it with a backslash. Interesting, but that’s an aside.
Here’s the grammar as specified in the RFC, only slightly edited by me cosmetically.
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
DQUOTE = %x22 ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]
CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
TEXTDATA = %x20-21
TEXTDATA /= %x23-2B
TEXTDATA /= %x2E-7E
TEXTDATA /= %x2D
You can read the Augmented Backus-Naur form (ABNF) wikipedia article for a more formal introduction to such grammars.
But on a higher level, it’s important to understand that the grammar is really only describing the shape of a kind of tree. I’ll give a natural language explanation of the meaning of the grammar below.
The root of the tree is the file. The part to the right of the equals sign is like an abstract definition of what constitutes a CSV file. According to this grammar, a CSV file is an optional header terminated by a CRLF, followed by one record, then zero or more further records each preceded by a CRLF, and finally an optional trailing CRLF.
The terms used in that definition of a file are further clarified in the rest of the grammar.
- header has one or more names, separated by commas if there are multiple. A synonym of name is field, and a record is likewise made of one or more comma-separated fields.
- field comes in one of two variants: escaped and non-escaped.
- non-escaped is easy: a sequence of zero or more textdata. That includes an empty string.
- escaped is more involved. At minimum, it consists of an empty pair of double quotes, though the double quotes may contain zero or more of any of the following in any order: textdata, comma, cr, lf, or 2dquote (a sequence of two double quotes).
- crlf is cr followed by lf.
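To make those rules concrete, here is a tiny sample document of my own (not from the RFC), with the grammar rules each line exercises noted in comments:

const sample =
  "name,comment\r\n" +          // header: two names separated by a COMMA, then the header's CRLF
  "alice,hello\r\n" +           // a record of two non-escaped fields, then the CRLF before the next record
  'bob,"said ""hi"", left"\r\n' // a record whose second field is escaped: it contains a COMMA and
                                // 2DQUOTE pairs; the final CRLF is the optional trailing one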
The rest of the grammar entries are “terminals”, kind of like leaves in a tree. Basically this means each one represents one or more characters. Characters in this grammar are specified by their hexadecimal numeric values, either a single value or a range.
- cr is a carriage return character
- lf is a line feed character
- comma is as you expect
- dquote is a single double quotation mark character
- textdata is any character within one of the defined character ranges. Later I’ll show how to derive the characters from the listed hex values, after which you’ll be able to figure out which characters the author of the grammar was trying to avoid by specifying multiple character ranges for textdata.

Let’s make a Map of the terminal names to their hex values. In the case of ranges, we will split them into multiple names, with a separate name for each range’s lower and upper bounds.
const terminals = new Map([
  ["COMMA", 0x2C],
  ["CR", 0x0D],
  ["LF", 0x0A],
  ["DQUOTE", 0x22],
  ["TEXTDATA_1_LB", 0x20],
  ["TEXTDATA_1_UB", 0x21],
  ["TEXTDATA_2_LB", 0x23],
  ["TEXTDATA_2_UB", 0x2B],
  ["TEXTDATA_3_LB", 0x2E],
  ["TEXTDATA_3_UB", 0x7E],
  ["TEXTDATA_4", 0x2D]
])
Next, let’s transform the hex values we stored in the Map into their ordinary character forms.
[...terminals.keys()].forEach(v => {
  terminals.set(v, String.fromCharCode(terminals.get(v)))
})
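A quick sanity check of the transformation (purely illustrative, not part of the parser):

// After the transformation, the Map's values are single-character strings.
console.log(terminals.get("COMMA"))        // ","
console.log(terminals.get("DQUOTE"))       // '"'
console.log(terminals.get("CR") === "\r")  // true
console.log(terminals.get("LF") === "\n")  // true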
And this next part is optional I guess. But, looking ahead at the grammar, I can already tell the kinds of matches we will need to do with regex so I’m going to stitch together a few strings to represent character ranges and individual characters along with their quantifiers.
const characterSets = new Map([
  ["CRLF", [
    `${terminals.get("CR")}${terminals.get("LF")}`,
    "1,"
  ]],
  ["TEXTDATA", [
    `${terminals.get("TEXTDATA_1_LB")}-${terminals.get("TEXTDATA_1_UB")}${terminals.get("TEXTDATA_2_LB")}-${terminals.get("TEXTDATA_2_UB")}${terminals.get("TEXTDATA_3_LB")}-${terminals.get("TEXTDATA_3_UB")}${terminals.get("TEXTDATA_4")}`,
    "1,"
  ]],
  ["COMMA", [
    `${terminals.get("COMMA")}`,
    "1"
  ]],
  ["DQUOTE", [
    `${terminals.get("DQUOTE")}`,
    "1"
  ]]
])
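If you log the entries at this point, the stitched-together pieces should look roughly like this (the exact strings follow from the hex values above):

for(const [name, [pattern, count]] of characterSets.entries()){
  console.log(name, JSON.stringify(pattern), count)
}
// CRLF "\r\n" 1,
// TEXTDATA " -!#-+.-~-" 1,
// COMMA "," 1
// DQUOTE "\"" 1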
Next stop: the Tokenization Station.
In this part we will make a generator function which will step through the whole CSV-formatted string and spit out a series of tokens, where tokens are basically just sequences of substrings from the input string, each with a label attached.
Here’s the function:
function * tokenizationStation(document, terminals){
  const gex = new RegExp(
    [...terminals.entries()].map(
      ([name, [pattern, count]]) => `(?<${name}>[${pattern}]{${count}})`
    ).join("|"),
    "g"
  )
  let result
  while((result = gex.exec(document))){
    for(const k of terminals.keys()){
      if(
        typeof result.groups !== "undefined" &&
        typeof result.groups[k] === "string"
      ) yield [k, result.groups[k]]
    }
  }
}
This function takes the CSV string as first argument, and second is
the characterSets
Map we made earlier. It uses the Map to
create a regex pattern which is like a series of named capturing groups
separated by pipe |
characters. The idea is to match one of
those capturing groups. The regex also needs the g
flag as
we will use it with exec
. Then we just use a while loop to
keep yielding matches made with exec
until it stops
matching (ideally at the end of the document).
How I derive the regex pattern itself in code is, based on my own experience, probably easier for most to understand than the meaning of the resulting regex pattern itself. Hence, since we already have all the parts we need to calculate the regex pattern, I’ll give it to you next along with a natural language explanation of the pattern:
(?<CRLF>[\r\n]{1,})|(?<TEXTDATA>[ -!#-+.-~-]{1,})|(?<COMMA>[,]{1})|(?<DQUOTE>["]{1})
The | characters form a disjunction. It’s like listing the members of a set, or giving a list of possible alternatives:

- (?<CRLF>[\r\n]{1,}) - match one or more sequential characters where each character is a member of the set of characters given in square brackets (a \r or a \n). If matched, then assign the matched sequence to the capturing group called CRLF.
- (?<TEXTDATA>[ -!#-+.-~-]{1,}) - match one or more characters in a sequence where each character is a member of the set in square brackets. The brackets give a list of three character ranges: space-to-!, #-+, and .-~. Also, a final - is listed between the brackets to give the group a chance to also match a literal hyphen. The matched sequence is assigned to the capturing group called TEXTDATA.
- (?<COMMA>[,]{1}) - match and capture exactly one comma.
- (?<DQUOTE>["]{1}) - match and capture exactly one double quote character.

In the while loop, we interrogate the result of each exec call to see whether it matched anything. Since the groups property of the result object is an object with null as its prototype, it doesn’t inherit the usual methods from Object.prototype, so here we simply perform a linear search over the terminal names to find the capturing group it contains. When we find it, we attach that name as a label to the yielded token, which is just an array whose first element is the label and whose second element is the matched text. The labels will be used later by the parser to understand what part of the grammar the text it is considering belongs to.
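Before we get to the parser, it helps to see what the tokenizer emits. Here’s a run against a small sample string of my own; the token stream in the comments is what I’d expect from the regex above:

const sample = 'a,"x, y"\r\nb,2\r\n'
for(const token of tokenizationStation(sample, characterSets)){
  console.log(token)
}
// ["TEXTDATA", "a"]
// ["COMMA", ","]
// ["DQUOTE", "\""]
// ["TEXTDATA", "x"]
// ["COMMA", ","]
// ["TEXTDATA", " y"]
// ["DQUOTE", "\""]
// ["CRLF", "\r\n"]
// ["TEXTDATA", "b"]
// ["COMMA", ","]
// ["TEXTDATA", "2"]
// ["CRLF", "\r\n"]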
Now we will write the parsing logic. It will consume the sequence of tokens spit out by the tokenizer function. It will scan that sequence of tokens and, if the sequence fits the definition of CSV format per the grammar, transform the tokens into a readily consumable JavaScript data structure.
Before we compose our master parsing function, let’s make a few helper functions. These functions will follow similar conventions: they take as arguments a reference to the tokens array along with a cursor integer indicating the current position in the array. They return an array where first entry is the new cursor position, and the second is the “thing” (a particular kind of string) the function is meant to recognize.
The first function is the easier one, for recognizing non-escaped sequences:
function getNonEscaped(tokens, cursor){
  const finalTokensIdx = tokens.length - 1
  const r = []
  while(
    cursor <= finalTokensIdx &&
    tokens[cursor][0] === "TEXTDATA"
  ){
    r.push(tokens[cursor][1])
    cursor++
  }
  if(
    cursor <= finalTokensIdx &&
    tokens[cursor][0] === "DQUOTE"
  ) throw "PARSE ERROR: Unescaped fields can't contain double quotes."
  return [cursor, r.join("")]
}
We keep looping as long as each token is a TEXTDATA
token and we’re not at the end. We then return the new cursor position
and the resulting string.
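For example, with a hand-built token array (just for illustration):

const tokens = [["TEXTDATA", "hello"], ["COMMA", ","], ["TEXTDATA", "world"]]
console.log(getNonEscaped(tokens, 0)) // [1, "hello"]
console.log(getNonEscaped(tokens, 2)) // [3, "world"]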
A similar function must be written for escaped
, but it’s
a more involved one:
function getEscaped(tokens, cursor){
  const finalTokensIdx = tokens.length - 1
  if(
    cursor <= finalTokensIdx &&
    tokens[cursor][0] === "DQUOTE"
  ){
    const r = [tokens[cursor][1]]
    cursor++
    const innerNonDquoteTypes = new Set([
      "TEXTDATA",
      "COMMA",
      "CRLF"
    ])
    let unterminated = false
    while(true){
      if(
        innerNonDquoteTypes.has(tokens[cursor][0])
      ){
        r.push(tokens[cursor][1])
        cursor++
        continue
      } else if(
        cursor === finalTokensIdx &&
        tokens[cursor][0] === "DQUOTE"
      ){
        r.push(tokens[cursor][1])
        cursor++
        return [cursor, r.join("")]
      } else if(
        cursor === finalTokensIdx &&
        unterminated
      ){
        throw `\nPARSE ERROR: Expected an escaped field. Expected to find a terminating double quote character.\n\nToken Type: ${tokens[cursor][0]}\n\nToken Value: ${tokens[cursor][1]}\n`
      } else if(
        tokens[cursor][0] === "DQUOTE" &&
        tokens[cursor + 1][0] === "DQUOTE"
      ){
        unterminated = true
        r.push(tokens[cursor][1])
        cursor++
        r.push(tokens[cursor][1])
        cursor++
        continue
      } else if(
        tokens[cursor][0] === "DQUOTE"
      ){
        unterminated = false
        r.push(tokens[cursor][1])
        cursor++
        return [cursor, r.join("")]
      } else {
        throw "\nUNEXPECTED PARSE ERROR\n\n"
      }
    }
  } else {
    return [-1, ""]
  }
}
First, we have a chance to quit early if the cursor has been moved
beyond the end of the tokens array or if the current token (the
start of the escaped
sequence) is not a double quote. This
is because we know per the grammar an escaped
field has at
least two characters, a pair of double quotes, with potentially even
more stuff between them. My convention is that we return -1 as new
cursor position if we quit early.
In the case of the potentially happy path where we have an initial
double quote, we add that to the result array and start a
while(true)
loop to continue our way through an unknown
number of tokens after the opening double quote.
If we recognize something, we add it to the result array and continue.
We have some logic to carefully handle double quotes. Within each
loop, if we encounter one, first we check if the next character (if any)
is also a double quote. If it is, per the grammar, we have not reached
the end of the escaped
field, as that is an escaped double
quote. This is not too different from a positive lookahead in regex.
This is the brilliant and simple logic of the CSV grammar, and it’s a great example of escaping logic. Rather than have separate schemes for escaping carriage returns, line feeds, and commas in fields, we allow for wrapping of the whole field in double quotes. Then, we only need to escape the double quote characters found in the field.
Finally, we have to ensure that we actually reach the end of a field and the final double quote character if we encounter double quotes.
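Here’s the function run against a hand-built token array for an escaped field that contains a comma and an escaped double quote. Note that the returned string is the raw field, quotes and all; we’ll deal with unescaping later:

const tokens = [
  ["DQUOTE", '"'], ["TEXTDATA", "x"], ["COMMA", ","],
  ["DQUOTE", '"'], ["DQUOTE", '"'], ["TEXTDATA", "y"], ["DQUOTE", '"']
]
console.log(getEscaped(tokens, 0)) // [7, "\"x,\"\"y\""]
console.log(getEscaped([["TEXTDATA", "plain"]], 0)) // [-1, ""] (quit early: not an escaped field)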
Now let’s compose the two functions we just wrote into a single function for recognizing fields!
function getField(tokens, cursor){
  const [newCursor, field] = getEscaped(tokens, cursor)
  if(newCursor < 0){
    return getNonEscaped(tokens, cursor)
  } else {
    return [newCursor, field]
  }
}
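A quick check (hand-built tokens again) that the dispatch works for both variants; the expected results in the comments follow from the two functions above:

console.log(getField([["TEXTDATA", "plain"]], 0))            // [1, "plain"]
console.log(getField([["DQUOTE", '"'], ["DQUOTE", '"']], 0)) // [2, "\"\""]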
Finally, we write our master parsing function. I dub it
parsicles
, to rhyme with Hercules 😤. On a high-level, it
is a hierarchical state machine, implemented with nested switches,
ultimately producing a simplistic row major table data structure for
consuming the CSV data.
If you were trying to solve this problem yourself, your first hint that you might need a proper parsing solution, rather than merely a quality regex pattern on its own, is that the kind of “state machine” functionality offered by regex is too “flat” to support accurate and complete recognition of the kind of language specified by the grammar. If a regex is a state machine, it is a finite one: terse and declarative, but less capable than a proper hierarchical state machine.
The function is a long one:
function parsicles(document, header = true){
  let cursor = 0
  const tokens = []
  for(const arr of tokenizationStation(document, characterSets)){
    tokens.push(arr)
  }
  let fileState = header ? "headerCRLF" : "firstRecord"
  let headerCRLFState = "header"
  let headerState = "firstField"
  let recordState = "firstField"
  const AST = new Map()
  let quit = false
  while(cursor < tokens.length){
    switch(fileState){
      case "headerCRLF": {
        switch(headerCRLFState){
          case "header": {
            switch(headerState){
              case "firstField": {
                const [newCursor, field] = getField(tokens, cursor)
                if(newCursor < 0) {
                  throw `\nPARSE ERROR: Expected to find first header field but didn't.\n\nToken Type: ${tokens[cursor][0]}\n\nToken Value: ${tokens[cursor][1]}\n`
                } else {
                  cursor = newCursor
                  AST.set("header", [field])
                  headerState = "nthFieldOpt"
                  continue
                }
              }
              case "nthFieldOpt": {
                if(tokens[cursor][0] === "COMMA"){
                  cursor++
                  const [newCursor, field] = getField(tokens, cursor)
                  if(newCursor < 0) {
                    throw `\nPARSE ERROR: last token was a comma so expected to find another header field but didn't.\n\nToken Type: ${tokens[cursor][0]}\n\nToken Value: ${tokens[cursor][1]}\n`
                  } else {
                    cursor = newCursor
                    AST.set("header", [...[...AST.get("header")], field])
                    continue
                  }
                } else {
                  headerCRLFState = "CRLF"
                  continue
                }
              }
            }
          }
          case "CRLF": {
            if(tokens[cursor][0] === "CRLF"){
              cursor++
              fileState = "firstRecord"
              continue
            } else {
              throw `\nPARSE ERROR: expected to find a line termination sequence to terminate header line but didn't.\n\nToken Type: ${tokens[cursor][0]}\n\nToken Value: ${tokens[cursor][1]}\n`
            }
          }
        }
      }
      case "firstRecord": {
        switch(recordState){
          case "firstField": {
            const [newCursor, field] = getField(tokens, cursor)
            if(newCursor < 0) {
              throw `\nPARSE ERROR: Expected to find first field of a record but didn't.\n\nToken Type: ${tokens[cursor][0]}\n\nToken Value: ${tokens[cursor][1]}\n`
            } else {
              cursor = newCursor
              AST.set("records", [[field]])
              recordState = "nthFieldOpt"
              continue
            }
          }
          case "nthFieldOpt": {
            if(tokens[cursor][0] === "COMMA"){
              cursor++
              const [newCursor, field] = getField(tokens, cursor)
              if(newCursor < 0) {
                throw `\nPARSE ERROR: last token was a comma so expected to find another record field but didn't.\n\nToken Type: ${tokens[cursor][0]}\n\nToken Value: ${tokens[cursor][1]}\n`
              } else {
                cursor = newCursor
                const records = [...AST.get("records")]
                const latestRecord = records[records.length - 1]
                latestRecord.push(field)
                continue
              }
            } else {
              fileState = "nthRecordOpt"
              recordState = "CRLF"
              continue
            }
          }
        }
      }
      case "nthRecordOpt": {
        switch(recordState){
          case "CRLF": {
            if(tokens[cursor][0] === "CRLF"){
              recordState = "firstField"
              cursor++
              continue
            } else {
              fileState = "CRLF"
              continue
            }
          }
          case "firstField": {
            const [newCursor, field] = getField(tokens, cursor)
            if(newCursor < 0) {
              throw `\nPARSE ERROR: Expected to find first field of a record but didn't.\n\nToken Type: ${tokens[cursor][0]}\n\nToken Value: ${tokens[cursor][1]}\n`
            } else {
              cursor = newCursor
              AST.set("records", [...AST.get("records"), [field]])
              recordState = "nthFieldOpt"
              continue
            }
          }
          case "nthFieldOpt": {
            if(tokens[cursor][0] === "COMMA"){
              cursor++
              const [newCursor, field] = getField(tokens, cursor)
              if(newCursor < 0) {
                throw `\nPARSE ERROR: last token was a comma so expected to find another record field but didn't.\n\nToken Type: ${tokens[cursor][0]}\n\nToken Value: ${tokens[cursor][1]}\n`
              } else {
                cursor = newCursor
                const records = [...AST.get("records")]
                const latestRecord = records[records.length - 1]
                latestRecord.push(field)
                continue
              }
            } else {
              recordState = "CRLF"
              continue
            }
          }
        }
      }
      case "CRLF": {
        console.log("YAY. All done parsing CSV.")
        quit = true
        continue
      }
    }
    if(quit) break
  }
  return AST
}
The nested switches are located within a while loop. On any given pass through the loop, the switches determine which state we’re in by way of the preceding state variables. It’s important to remember to use continue when you are ready to go to the next iteration of the while loop, because using break might just incorrectly drop you into another case statement within the same loop iteration.
Really what this state machine does is allow us to safely consume the
get-
functions we wrote earlier without losing track of
where we are in the higher levels of the grammar.
Per the RFC, the function has a header parameter that lets you tell the procedure to separate out the header record from the rest. I’ll also confess that at this point I’m playing a little fast and loose with what I allow for CRLF sequences, but it’s a wide world and the RFC does say something about being liberal in what we accept.
You’ll note a few throw statements. These are just recognizing and appropriately reporting the pathological cases the parser may encounter.
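Putting it all together, here’s how I’d expect the parser to behave on a small document of my own, given the code above:

const doc = 'name,age\r\nalice,30\r\n"bob ""b"" smith",25\r\n'
const table = parsicles(doc)
console.log(table.get("header"))
// ["name", "age"]
console.log(table.get("records"))
// [["alice", "30"], ["\"bob \"\"b\"\" smith\"", "25"]]
// Note the escaped field is kept verbatim, quotes and all,
// which is where the unescape helper below comes in.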
A few final bonus functions. Let’s say the user of this framework doesn’t always want to remember to manually type two double quotes inside an escaped field for every literal double quote they need. We can easily form two helper functions for escaping and unescaping fields:
function unescape(escaped){
  if(escaped.startsWith("\"")){
    return escaped.slice(1, escaped.length - 1).replaceAll("\"\"", "\"")
  } else {
    return escaped
  }
}
function escape(unescaped){
  return `"${unescaped.replaceAll("\"", "\"\"")}"`
}
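And a quick round trip to show how they relate:

console.log(escape('she said "hi"'))           // "\"she said \"\"hi\"\"\""
console.log(unescape('"she said ""hi"""'))     // "she said \"hi\""
console.log(unescape(escape('she said "hi"'))) // back to the original string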
This is not the most performant CSV parser in the world. But it makes a good-faith effort to adhere to the grammar, it was fun to write, and I leave any performance optimizations required as easter eggs for the reader.
Cheers, Dave