Improve dialect sniffing
Created by: akkartik
Back in June 2017, @petersonjr pointed out some test cases that cause csvkit/agate to misinterpret the dialect of files: https://github.com/wireservice/csvkit/issues/751#issuecomment-310803282
The reasonable workaround @jpmckinney pointed out was to pass --snifflimit 0 on the command line. While this helps, I'd like to point out a few things:
a) Those test cases work fine without the override to the Sniffer class introduced in https://github.com/wireservice/agate/commit/3b9ceea131ba143cc72b6d1a9f7871d059188b52
b) I dug into the default Sniffer in the Python standard library. It's not very robust; it uses regular expressions to parse each line, which means it gets confused by, say, commas inside quoted strings.
c) While agate has a reasonable default sniff_limit of 0, the command-line argument parsing in csvkit overrides it to None, which has the effect of sniffing the entire file. As the input file grows, it becomes increasingly likely that the Python Sniffer will encounter something it can't handle.
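To make the three sniff_limit cases in (c) concrete, here is a minimal sketch built only on the standard library's csv.Sniffer. The helper name sniff_dialect and its 1024 default are hypothetical, not agate's or csvkit's actual code, and it slices characters where agate's limit is documented in bytes.

```python
import csv


def sniff_dialect(text, sniff_limit=1024):
    """Hypothetical helper: guess a CSV dialect from at most sniff_limit characters.

    Assumed semantics, modelled on the description above:
      sniff_limit == 0    -> skip sniffing, return None
      sniff_limit is None -> sniff the entire text
      otherwise           -> sniff only the first sniff_limit characters
    """
    if sniff_limit == 0:
        return None
    sample = text if sniff_limit is None else text[:sniff_limit]
    # csv.Sniffer.sniff raises csv.Error when it cannot settle on a dialect.
    return csv.Sniffer().sniff(sample)


sample = "a;b;c\n1;2;3\n4;5;6\n"
dialect = sniff_dialect(sample, sniff_limit=1024)
print(dialect.delimiter if dialect else "sniffing disabled")
```

With a bounded sample the Sniffer only sees the head of the file, so unusual rows further down can't influence the guess; the trade-off is that a file whose first kilobyte is unrepresentative may still be misdetected.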
These observations indicate a few places in the stack where a change may be beneficial. I'm not sure if any of them is a good idea -- I may well be missing context as an outsider -- but thought I'd try to start a discussion. Is there a test suite for the custom Sniffer in agate? Is it an option to default --snifflimit to something smaller, say 1024? (Obviously changing the Python standard library is a bigger task. The maintainer of the csv module is also no longer active.)