Detect and remove UTF-8 BOM
Created by: doekman
The bug is illustrated with files from this zip
example.csv
is a CSV file, saved with UTF-8 with BOM marker and LF linefeeds (the webapp highrise uses this format).
When running ./setup.sh
the file is imported into the user-database, and everything seems fine:
csvsql 1.0.2
Python 2.7.15
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G3025
NOTICE: table "example" does not exist, skipping
DROP TABLE
However, when running ./test.sh
, the second command fails (and the first and third succeed):
----[ select * ]--------------------------------------------
FullName | Email
--------------+---------------------
Test User | test@example.com
Another User | another@example.com
(2 rows)
----[ select FullName, Email ]------------------------------
ERROR: column "fullname" does not exist
LINE 1: SELECT FullName, Email from example;
^
HINT: Perhaps you meant to reference the column "example.FullName".
----[ select [BOM-marker]FullName, Email ]------------------
FullName | Email
--------------+---------------------
Test User | test@example.com
Another User | another@example.com
(2 rows)
This is because the first column name is prefixed with the BOM marker, and is demonstrated by the third command. When you copy/paste the output into an editor that shows control characters (TextMate), you see the name of the first column is actually `U&"\FEFFFullName"
When the header columns in the CSV file are double-quoted ("FullName","Email"), the behaviour is different but also undesirable, but I haven't checked that.
The error message could be nicer (non-printable characters could be shown escaped as I did in the third example), but I'm not sure which software part that would have to do...
From csvkit, I expect the UTF-BOM marker to be detected and removed.