Handling double-quoted CSVs in Azure Data Factory Pipelines

Jun 12, 2019
azure data factory azure
2 min read

Azure Data Factory by default uses a backslash as the escape character for CSVs, but we ran into an issue with this today processing one of the CSV files from data.gov.au. As with most CSVs they use quotes around values as normal and with double-quotes for empty values, but they also use double-quotes to escape quotes within non-empty values. This probably sounds confusing, so here’s an example:

"column 1”,“column 2”,"“,“column 4 value is "“sort of"“ like this"

The ADF pipeline failed because the double-quotes don’t get escaped correctly:
ErrorCode=UserErrorSourceDataContainsMoreColumnsThanDefined, found more columns than expected column count.

The solution was to change the “Escape character” property on the dataset, by clicking the “Edit” button beneath it and manually entering “", since "” isn’t in the list of escape characters. I didn’t think this would work at first but it turns out that escape characters don’t have to be a single character, and it looks like the double-quotes used for empty values are processed separately from double-quotes used as escape characters. Handy!

Edit:
Unfortunately you can’t just set "” as an escape character when creating the dataset, because even though it can process the CSV correctly when set this way, ADF will give you an error when importing the schema for the CSV:

"CSV serilization setting escapeChar cannot be more than one character"

So the trick is to leave it as backslash, just to import the schema, and then change it to double-quotes afterwards, since this seems to be the only step that complains about this escape character.