Parsing CSV Format in JavaScript per RFC 4180, No Frameworks, No Libraries

Today, we will be looking for patterns in some documents.

Specifically, CSV-formatted files, AKA Comma-Separated Values. This is a well-known and simple file format. Additionally, the CSV ABNF grammar specified in RFC 4180 is simple enough to comprehend that implementing a lexer and parser for it makes a great entry-level practice problem for anyone interested in topics like parsers, interpreters, and compilers.

At first you might think it’s as simple as splitting the CSV document string on newlines, then for each resulting line string you split on commas to find the fields.

That approach is perhaps better called the poor man's CSV file format, since it forces you to keep commas and newlines out of your fields entirely. So you think: to support such characters in fields, let's just come up with a scheme for escaping them. For instance, only split on newlines not preceded by an unescaped escape character, or some similar trick…
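
Here's a quick sketch of the naive splitting falling over (the sample data is made up):

const naive = "name,note\r\n\"Smith, John\",hello"
  .split("\r\n")
  .map(line => line.split(","))
// naive[1] is ["\"Smith", " John\"", "hello"]: three fields
// instead of the intended two, because the comma inside the
// quoted field got treated as a delimiter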

But these problems have already been solved in an elegant and easy to comprehend way in the published grammar for CSV in RFC 4180, so better to use that rather than reinvent the wheel. In this case, we are looking for practice writing a parser based on an established grammar (and a simple one at that), not practice with architecting our own grammar from scratch.

I’m told there is a not-uncommon CSV variant that does not follow the RFC standard’s approach of escaping a double quote with another double quote, and instead escapes it with a backslash. Interesting, anyway.
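
For illustration, here's the same field under both conventions (this post only deals with the RFC style):

"say ""cheese"", please"     (RFC 4180 style: double the double quote)
"say \"cheese\", please"     (backslash style, seen in the wild)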

Humanizing the Grammar

Here’s the grammar as specified in the RFC, only slightly edited by me cosmetically.

file = [header CRLF] record *(CRLF record) [CRLF]  

header = name *(COMMA name)  

record = field *(COMMA field)  

name = field  

field = (escaped / non-escaped)  

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE  

non-escaped = *TEXTDATA  

COMMA = %x2C  

CR = %x0D ;as per section 6.1 of RFC 2234 [2]

DQUOTE = %x22 ;as per section 6.1 of RFC 2234 [2]  

LF = %x0A ;as per section 6.1 of RFC 2234 [2]  

CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]  

TEXTDATA = %x20-21 

TEXTDATA /= %x23-2B 

TEXTDATA /= %x2E-7E 

TEXTDATA /= %x2D

You can read the Augmented Backus-Naur Form (ABNF) Wikipedia article for a more formal introduction to such grammars.

But on a higher level, it’s important to understand that the grammar is really only describing the shape of a kind of tree. I’ll give a natural language explanation of the meaning of the grammar below.

The root of the tree is the file. The part to the right of the equals sign is like an abstract definition of what constitutes a CSV file. According to this grammar, a CSV file is as follows:

  1. An optional (indicated by brackets) header line. If present, a well-formed header line consists of a header followed by a CRLF (in other words, the character(s) indicating the end of a line).
  2. At least one record is required.
  3. Subsequent records are optional, but each must be preceded by a CRLF (to separate it from the preceding record).
  4. Finally, an optional trailing CRLF at the end of the document.
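
For instance, here is a small document fitting that definition, written as a JavaScript string (a header, two records, and the optional trailing CRLF):

const example = "name,age\r\nalice,30\r\nbob,25\r\n"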

The terms used in that definition of a file are further clarified in the rest of the grammar.

The rest of the grammar entries are “terminals”, kind of like leaves in a tree. Basically this means each one represents one or more characters. Characters in this grammar are specified by their hexadecimal numeric values, either a single value or a range.

The Script’s Prologue

Let’s make a Map of the terminal names to their hex values. In the case of ranges, we will split them into multiple names, one each for the range's lower and upper bounds.

const terminals = new Map(
  [
    [
      "COMMA",
      0x2C
    ], [
      "CR",
      0x0D
    ], [
      "LF",
      0x0A
    ], [
      "DQUOTE",
      0x22
    ], [
      "TEXTDATA_1_LB",
      0x20
    ], [
      "TEXTDATA_1_UB",
      0x21
    ], [
      "TEXTDATA_2_LB",
      0x23
    ], [
      "TEXTDATA_2_UB",
      0x2B
    ], [
      "TEXTDATA_3_LB",
      0x2E
    ], [
      "TEXTDATA_3_UB",
      0x7E
    ], [
      "TEXTDATA_4",
      0x2D
    ]
  ]
);

Next, let’s transform the hex values we stored in the Map into their ordinary character forms.

[
  ...terminals.keys()
].forEach(
  v => {
    terminals.set(
      v,
      String.fromCharCode(
        terminals.get(
          v
        )
      )
    )
  }
);
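
A quick sanity check of what the Map holds after that transformation:

console.log(terminals.get("COMMA"))         // ","
console.log(terminals.get("DQUOTE"))        // "\""
console.log(terminals.get("TEXTDATA_1_LB")) // " " (a space, 0x20)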

And this next part is optional I guess. But, looking ahead at the grammar, I can already tell the kinds of matches we will need to do with regex so I’m going to stitch together a few strings to represent character ranges and individual characters along with their quantifiers.

const characterSets = new Map(
  [
    [
      "CRLF", 
      [
        `${
          terminals.get(
            "CR"
          )
        }${
          terminals.get(
            "LF"
          )
        }`,
        "1,"
      ]
    ], [
      "TEXTDATA",
      [
        `${
          terminals.get(
            "TEXTDATA_1_LB"
          )
        }-${
          terminals.get(
            "TEXTDATA_1_UB"
          )
        }${
          terminals.get(
            "TEXTDATA_2_LB"
          )
        }-${
          terminals.get(
            "TEXTDATA_2_UB"
          )
        }${
          terminals.get(
            "TEXTDATA_3_LB"
          )
        }-${
          terminals.get(
            "TEXTDATA_3_UB"
          )
        }${
          terminals.get(
            "TEXTDATA_4"
          )
        }`,
        "1,"
      ]
    ], [
      "COMMA",
      [
        `${
          terminals.get(
            "COMMA"
          )
        }`,
        "1"
      ]
    ], [
      "DQUOTE",
      [
        `${
          terminals.get(
            "DQUOTE"
          )
        }`,
        "1"
      ]
    ]
  ]
);
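
Each entry now pairs a string of characters (ranges included) with a regex quantifier. For example:

console.log(characterSets.get("COMMA")) // [",", "1"]
console.log(characterSets.get("CRLF"))  // a literal CR+LF pair and the quantifier "1,"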

Lexical Analysis

Next stop: the Tokenization Station.

In this part we will make a generator function which will step through the whole CSV-formatted string and spit out a series of tokens, where tokens are basically just sequences of substrings from the input string, each with a label attached.

Here’s the function:

function * tokenizationStation(
  document,
  characterSets
){
  // Build a single alternation of named capturing groups,
  // one group per entry in the characterSets Map.
  const gex = new RegExp(
    [
      ...characterSets.entries()
    ].map(
      ([
        name,
        [
          pattern,
          count
        ]
      ]) => `(?<${
        name
      }>[${
        pattern
      }]{${
        count
      }})`
    ).join("|"),
    "g"
  )
  let result
  while(
    (result = gex.exec(
      document
    ))
  ){
    // Find which named group matched and yield it as a token.
    for(
      const k of characterSets.keys()
    ){
      if(
        typeof result.groups !== "undefined" && 
        typeof result.groups[k] === "string"
      ) yield [
        k,
        result.groups[k]
      ]
    }
  }
}

This function takes the CSV string as its first argument and the characterSets Map we made earlier as its second. It uses the Map to create a regex pattern that is a series of named capturing groups separated by pipe | characters; the idea is to match exactly one of those groups at a time. The regex also needs the g flag since we use it with exec. Then we use a while loop to keep yielding matches made with exec until it stops matching (ideally at the end of the document).

How I derive the regex pattern in code is, based on my own experience, probably easier for most people to understand than the resulting regex pattern itself. Since we already have all the parts we need to calculate it, I'll give you the pattern next along with a natural-language explanation:

(?<CRLF>[\r\n]{1,})|(?<TEXTDATA>[ -!#-+.-~-]{1,})|(?<COMMA>[,]{1})|(?<DQUOTE>["]{1})

The | characters form a disjunction. It's like listing the members of a set, or giving a list of possible alternatives: on each call, exec matches whichever alternative fits at the current position.

In the while loop, we interrogate the result of the regex exec operation to see what it matched. Since the groups property of the result object is an object with null as its prototype, it carries no inherited methods, so we simply do a linear search over our known group names for the one that actually captured something. When we find it, we attach that name as a label to the returned token, which is just an array whose first element is the label and whose second is the matched text. The parser will use these labels later to understand which part of the grammar the text it is considering belongs to.
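
To see the token stream for yourself, here's the tokenizer run over a tiny two-line document:

const sample = "a,b\r\n1,2"
for(const token of tokenizationStation(sample, characterSets)){
  console.log(token)
}
// ["TEXTDATA", "a"]
// ["COMMA", ","]
// ["TEXTDATA", "b"]
// ["CRLF", "\r\n"]
// ["TEXTDATA", "1"]
// ["COMMA", ","]
// ["TEXTDATA", "2"]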

Parsing

Now we will write the parsing logic. It will consume the sequence of tokens spit out by the tokenizer function. It will scan that sequence of tokens and, if the sequence fits the definition of CSV format per the grammar, transform the tokens into a readily consumable JavaScript data structure.

Before we compose our master parsing function, let's make a few helper functions. These functions follow similar conventions: they take a reference to the tokens array along with a cursor integer indicating the current position in that array, and they return an array whose first entry is the new cursor position and whose second is the "thing" (a particular kind of string) the function is meant to recognize.

The first function is the easier one; it recognizes non-escaped sequences:

function getNonEscaped(
  tokens,
  cursor
){
  const finalTokensIdx = tokens.length - 1
  const r = []
  while(
    cursor <= finalTokensIdx &&
    tokens[cursor][0] === "TEXTDATA"
  ){
    r.push(
      tokens[cursor][1]
    )
    cursor++
  }
  if(
    cursor <= finalTokensIdx &&
    tokens[cursor][0] === "DQUOTE"
  ) throw "PARSE ERROR: Unescaped fields can't contain double quotes."
  return [
    cursor,
    r.join("")
  ]
}

We keep looping as long as each token is a TEXTDATA token and we haven't run past the end, then return the new cursor position and the resulting string. If a double quote shows up immediately after the run of text, we throw, because unescaped fields can't contain double quotes.
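
For example, fed hand-written tokens for plain text followed by a comma, it stops at the comma and leaves it for the caller:

const tokens = [["TEXTDATA", "abc"], ["COMMA", ","], ["TEXTDATA", "def"]]
console.log(getNonEscaped(tokens, 0)) // [1, "abc"]: cursor now points at the comma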

A similar function is needed for escaped sequences, but it's a more involved one:

function getEscaped(
  tokens,
  cursor
){
  const finalTokensIdx = tokens.length - 1
  if(
    cursor <= finalTokensIdx && 
    tokens[cursor][0] === "DQUOTE"
  ){
    const r = [
      tokens[cursor][1]
    ]
    cursor++
    const innerNonDquoteTypes = new Set(
      [
        "TEXTDATA",
        "COMMA",
        "CRLF"
      ]
    )
    let unterminated = false
  while(true){
    // Guard against running off the end of the tokens array:
    // an unterminated escaped field would otherwise crash below.
    if(cursor > finalTokensIdx){
      throw "\nPARSE ERROR: Expected an escaped field. Ran out of tokens before finding a terminating double quote.\n"
    }
      if(
        innerNonDquoteTypes.has(
          tokens[cursor][0]
        )
      ){
        r.push(
          tokens[cursor][1]
        )
        cursor++
        continue
      } else if(
        cursor === finalTokensIdx && 
        tokens[cursor][0] === "DQUOTE"
      ){
        r.push(
          tokens[cursor][1]
        )
        cursor++
        return [
          cursor,
          r.join("")
        ]
      } else if(
        cursor === finalTokensIdx && 
        unterminated
      ){
        throw `\nPARSE ERROR: Expected an escaped field. Expected to find a terminating double quote character.\n\nToken Type: ${
          tokens[cursor][0]
        }\n\nToken Value: ${
          tokens[cursor][1]
        }\n`
      } else if(
        tokens[cursor][0] === "DQUOTE" &&
        tokens[cursor + 1][0] === "DQUOTE"
      ){
        unterminated = true
        r.push(
          tokens[cursor][1]
        )
        cursor++
        r.push(
          tokens[cursor][1]
        )
        cursor++
        continue
      } else if(
        tokens[cursor][0] === "DQUOTE"
      ){
        unterminated = false
        r.push(
          tokens[cursor][1]
        )
        cursor++
        return [
          cursor,
          r.join("")
        ]
      } else {
        throw "\nUNEXPECTED PARSE ERROR\n\n"
      }
    }
  } else {
    return [
      -1,
      ""
    ]
  }
}

First, we have a chance to quit early if the cursor has moved beyond the end of the tokens array, or if the current token (the would-be start of the escaped sequence) is not a double quote. This is because we know, per the grammar, that an escaped field has at least two characters, a pair of double quotes, with potentially more stuff between them. My convention is to return -1 as the new cursor position when we quit early.

On the potentially happy path, where we do have an initial double quote, we add it to the result array and start a while(true) loop to work our way through an unknown number of tokens after the opening double quote.

If we recognize something, we add it to the result array and continue.

We have some logic to carefully handle double quotes. On each iteration, if we encounter one, we first check whether the next token (if any) is also a double quote. If it is then, per the grammar, we have not reached the end of the escaped field: the pair is an escaped double quote. This is not too different from a positive lookahead in regex.

This is the simple, brilliant trick of the CSV grammar, and a great example of escaping done well. Rather than having separate schemes for escaping carriage returns, line feeds, and commas in fields, we allow the whole field to be wrapped in double quotes. Then we only need to escape the double quote characters found inside the field.

Finally, we have to ensure we actually reach the end of the field, the terminating double quote; if the tokens run out before we find it, we throw rather than silently accept an unterminated field.
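
Note that getEscaped returns the field with its wrapping quotes still in place (the unescape helper at the end of this post strips them). For example, given hand-written tokens for the single escaped field "a,b":

const fieldTokens = [
  ["DQUOTE", "\""], ["TEXTDATA", "a"], ["COMMA", ","],
  ["TEXTDATA", "b"], ["DQUOTE", "\""]
]
console.log(getEscaped(fieldTokens, 0)) // [5, "\"a,b\""]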

Now let’s compose the two functions we just wrote into a single function for recognizing fields!

function getField(
  tokens,
  cursor
){
  const [
    newCursor,
    field
  ] = getEscaped(
    tokens,
    cursor
  )
  if(newCursor < 0){
    return getNonEscaped(
      tokens,
      cursor
    )
  } else {
    return [
      newCursor,
      field
    ]
  }
}

Finally, we write our master parsing function. I dub it parsicles, to rhyme with Hercules 😤. At a high level, it is a hierarchical state machine, implemented with nested switches, ultimately producing a simple row-major table data structure for consuming the CSV data.

If you were trying to solve this problem yourself, your first hint that you might need a proper parsing solution, rather than a single quality regex pattern, is that the kind of "state machine" functionality one regex offers is too "flat" to accurately and completely recognize the language this grammar specifies. A lone regex, though terse and declarative, gives you no good way to keep track of which level of the grammar you are currently in; the hierarchical state machine below does exactly that.

The function is a long one:

function parsicles(
  document,
  header = true
){
  let cursor = 0
  const tokens = []
  for(const arr of tokenizationStation(
    document,
    characterSets
  )){
    tokens.push(arr)
  }
  let fileState = header ? "headerCRLF" : "firstRecord"
  let headerCRLFState = "header"
  let headerState = "firstField"
  let recordState = "firstField"
  const AST = new Map()
  let quit = false
  while(cursor < tokens.length){
    switch(fileState){
      case "headerCRLF": {
        switch(headerCRLFState){
          case "header": {
            switch(headerState){
              case "firstField": {
                const [
                  newCursor,
                  field
                ] = getField(
                  tokens,
                  cursor
                )
                if(newCursor < 0) {
                  throw `\nPARSE ERROR: Expected to find first header field but didn't.\n\nToken Type: ${
                    tokens[cursor][0]
                  }\n\nToken Value: ${
                    tokens[cursor][1]
                  }\n`
                } else {
                  cursor = newCursor
                  AST.set(
                    "header",
                    [
                      field
                    ]
                  )
                  headerState = "nthFieldOpt"
                  continue
                }
              }
              case "nthFieldOpt": {
                if(
                  tokens[cursor][0] === "COMMA"
                ){
                  cursor++
                  const [
                    newCursor,
                    field
                  ] = getField(
                    tokens,
                    cursor
                  )
                  if(newCursor < 0) {
                    throw `\nPARSE ERROR: last token was a comma so expected to find another header field but didn't.\n\nToken Type: ${
                      tokens[cursor][0]
                    }\n\nToken Value: ${
                      tokens[cursor][1]
                    }\n`
                  } else {
                    cursor = newCursor
                    AST.set(
                      "header",
                      [
                        ...[
                          ...AST.get(
                            "header"
                          )
                        ],
                        field
                      ]
                    )
                    continue
                  }
                } else {
                  headerCRLFState = "CRLF"
                  continue
                }
              }
            }
          }
          case "CRLF": {
            if(
              tokens[cursor][0] === "CRLF"
            ){
              cursor++
              fileState = "firstRecord"
              continue
            } else {
              throw `\nPARSE ERROR: expected to find a line termination sequence to terminate header line but didn't.\n\nToken Type: ${
                tokens[cursor][0]
              }\n\nToken Value: ${
                tokens[cursor][1]
              }\n`
            }
          }
        }
      }
      case "firstRecord": {
        switch(recordState){
          case "firstField": {
            const [
              newCursor,
              field
            ] = getField(
              tokens,
              cursor
            )
            if(newCursor < 0) {
              throw `\nPARSE ERROR: Expected to find first field of a record but didn't.\n\nToken Type: ${
                tokens[cursor][0]
              }\n\nToken Value: ${
                tokens[cursor][1]
              }\n`
            } else {
              cursor = newCursor
              AST.set(
                "records",
                [
                  [
                    field
                  ]
                ]
              )
              recordState = "nthFieldOpt"
              continue
            }
          }
          case "nthFieldOpt": {
            if(
              tokens[cursor][0] === "COMMA"
            ){
              cursor++
              const [
                newCursor,
                field
              ] = getField(
                tokens,
                cursor
              )
              if(newCursor < 0) {
                throw `\nPARSE ERROR: last token was a comma so expected to find another record field but didn't.\n\nToken Type: ${
                  tokens[cursor][0]
                }\n\nToken Value: ${
                  tokens[cursor][1]
                }\n`
              } else {
                cursor = newCursor
                const records = [...AST.get(
                  "records"
                )]
                const latestRecord = records[
                  records.length - 1
                ]
                latestRecord.push(
                  field
                )
                continue
              }
            } else {
              fileState = "nthRecordOpt"
              recordState = "CRLF"
              continue
            }
          }
        }
      }
      case "nthRecordOpt": {
        switch(recordState){
          case "CRLF": {
            if(
              tokens[cursor][0] === "CRLF"
            ){
              recordState = "firstField"
              cursor++
              continue
            } else {
              fileState = "CRLF"
              continue
            }
          }
          case "firstField": {
            const [
              newCursor,
              field
            ] = getField(
              tokens,
              cursor
            )
            if(newCursor < 0) {
              throw `\nPARSE ERROR: Expected to find first field of a record but didn't.\n\nToken Type: ${
                tokens[cursor][0]
              }\n\nToken Value: ${
                tokens[cursor][1]
              }\n`
            } else {
              cursor = newCursor
              AST.set(
                "records",
                [
                  ...AST.get(
                    "records"
                  ),
                  [
                    field
                  ]
                ]
              )
              recordState = "nthFieldOpt"
              continue
            }
          }
          case "nthFieldOpt": {
            if(
              tokens[cursor][0] === "COMMA"
            ){
              cursor++
              const [
                newCursor,
                field
              ] = getField(
                tokens,
                cursor
              )
              if(newCursor < 0) {
                throw `\nPARSE ERROR: last token was a comma so expected to find another record field but didn't.\n\nToken Type: ${
                  tokens[cursor][0]
                }\n\nToken Value: ${
                  tokens[cursor][1]
                }\n`
              } else {
                cursor = newCursor
                const records = [...AST.get(
                  "records"
                )]
                const latestRecord = records[
                  records.length - 1
                ]
                latestRecord.push(
                  field
                )
                continue
              }
            } else {
              recordState = "CRLF"
              continue
            }
          }
        }
      }
      case "CRLF": {
        console.log(
          "YAY. All done parsing CSV."
        )
        quit = true
        continue
      }
    }
    if(quit) break
  }
  return AST
}

The nested switches sit inside the while loop; the state variables determine which cases run on any given pass. It's important to use continue when you're ready for the next iteration of the while loop: a break only exits the innermost switch, and execution would then fall through into the next case of the enclosing switch within the same iteration. The one deliberate exception is the final "CRLF" state, which breaks out of its switch so the if(quit) check after it can terminate the loop.

Really, what this state machine does is let us safely consume the get- functions we wrote earlier without losing track of where we are in the higher levels of the grammar.

Per the RFC, the function has a header parameter that lets you tell the procedure to separate out the header record from the rest. Also, at this point I confess I am playing a little fast and loose with what I allow for CRLF sequences: the CRLF token actually matches any run of CR and LF characters, so lone \n line endings sneak through too. But it's a wide world, and the old robustness principle does say something about being liberal in what we accept.

You’ll note a few throw statements. These are just recognizing and appropriately reporting the pathological cases the parser may encounter.
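
Putting it all together on a small document:

const table = parsicles("name,age\r\nalice,30\r\nbob,25")
console.log(table.get("header"))  // ["name", "age"]
console.log(table.get("records")) // [["alice", "30"], ["bob", "25"]]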

A few final bonus functions. Let's say the user of this parser doesn't always want to remember to manually type two double quotes inside an escaped field for every literal double quote they need. We can easily write two helper functions for escaping and unescaping fields:

function unescape(
  escaped
){
  if(
    escaped.startsWith(
      "\""
    )
  ){
    return escaped.slice(
      1,
      escaped.length - 1
    ).replaceAll(
      "\"\"",
      "\""
    )
  } else {
    return escaped
  }
}

function escape(
  unescaped
){
  return `"${
    unescaped.replaceAll(
      "\"",
      "\"\""
    )
  }"`
}
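
A round trip to show them in action:

const raw = "say \"cheese\", please"
const escaped = escape(raw)            // "\"say \"\"cheese\"\", please\""
console.log(unescape(escaped) === raw) // true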

This is not the most performant CSV parser in the world. But it makes a good-faith effort to adhere to the grammar, it was fun to write, and I leave any performance optimizations as easter eggs for the reader to find.

Cheers, Dave