GitHub - ws-garcia/CSVsniffer: The absolute pinnacle in CSV dialect detection. The methodology is research backed and implemented in Python, outperforming existing state of the art solutions by 8.35% in terms of their F1 scores, using only built-in modules.

CSVsniffer

Companion repository for the paper:

Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference (PDF)

An application of the Table Uniformity method outlined in the paper can be found in the CSV interface repository. Also, a Python version using CleverCSV source code has been implemented and stored in the python folder, demonstrating the reliability of the research presented.

Introduction

The results from the research can be reproduced by running the RunTests method from the macro-enabled Excel workbook CSVsniffer.xlsm. To review the results for CleverCSV it is necessary to run the scripts from the clevercsv_test.py file. The text files with the results output are stored in the Current research and cleverCSV folders. Additional results can be reproduced by running scripts in the python/src/run_tests.py file, the results can be read from the python/tests results/ folder.

Data

The CSV folder contains the files copied from the Pollock framework and other collected test files. Also the dataset used for the CSV wrangling research is available in the CSV_Wranglin folder. Note that only link to the files can be provided, in this last case,due to the authors holds the copyright. A dataset from w3c, CSVW project, is available in the W3C-CSVW folder.

The expect configuration for each tested CSV is saved in the Dialect_annotations.txt, Manual_dialect_annotation.txt and W3C-CSVW-Dialect_annotations.txt files.

Results

In this section, the results after running tests with the Beta Python implementation of the Table Uniformity method are presented. In order to obtain more representative performance metrics, it was decided to measure scripts running time on a Linux Mint system. All other metrics were obtained from a Windows 10 system. We have introduced a variant of the Table Uniformity Method that uses new metrics, including Mean Absolute Deviation and Shannon Entropy. These metrics will appear under the heading CSVsniffer MADSE.

The table below shows the dialect detection success ratio for CSVsniffer, CleverCSV and the built-in Python csv.Sniffer class module. Note that the accuracy has been measured using only those files that do not produce a failure when attempting to infer CSV dialects

Data set	`CSVsniffer MADSE`	`CSVsniffer`	`CleverCSV`	`csv.Sniffer`	DuckDB `sniff_csv`
POLLOCK	95.2703%	96.5517%	95.1724%	96.3504%	84.1379%
CSV Wrangling	90.3226%	90.5660%	84.3137%	80.5556%	76.8212%
CSV Wrangling filtered CODEC	91.5493%	89.4737%	84.2520%	80.0000%	78.0303%
CSV Wrangling MESSY	80.2817%	78.1955%	71.6535%	66.6667%	65.9091%
W3C-CSVW	94.5205%	95.3917%	61.1111%	97.6923%	99.0783%

The table below shows the failure ratio for each tool.

Data set	`CSVsniffer MADSE`	`CSVsniffer`	`CleverCSV`	`csv.Sniffer`	DuckDB `sniff_csv`
POLLOCK [148 files]	0.0%	2.0270%	2.0270%	7.4324%	2.027%
CSV Wrangling [179 files]	13.4078%	11.1732%	14.5251%	19.5531%	15.6425%
CSV Wrangling filtered CODEC [142 files]	0.0%	6.3380%	10.5634%	15.4930%	7.0423%
CSV Wrangling MESSY [126 files]	0.0%	6.3380%	10.5634%	15.4930%	7.0423%
W3C-CSVW [221 files]	0.905%	1.8100%	2.2624%	41.1765%	1.81%

The following table shows the average success and failure ratio for selected tools. The higher the number of errors obtained, the lower the reliability for detection.

Tool	Success ratio (SR)	Failure ratio (FR)
`CSVsniffer MADSE`	90.39%	2.86%
`CSVsniffer`	90.04%	5.54%
DuckDB `sniff_csv`	80.80%	6.71%
`CleverCSV`	79.30%	7.99%
`csv.Sniffer`	84.25%	19.83%

As a complementary metric, the table below shows the average reliability factor for CSV dialect detection. This value is computed as: $$RF=SR\times (1-FR)$$.

Tool	Reliability factor (RF)
`CSVsniffer MADSE`	87.80%
`CSVsniffer`	85.05%
DuckDB `sniff_csv`	75.38%
`CleverCSV`	72.96%
`csv.Sniffer`	67.54%

The below table shows the execution times obtained. In this one we can see that the Python module, reading 6144 characters from the CSV files, is incredibly efficient, easily outperforming the other tools.

Tool	Run-time
`csv.Sniffer`	0.98 sec.
DuckDB `sniff_csv`	4.40 sec.
`CleverCSV`	6.54 sec.
`CSVsniffer`	17.00 sec.

Accuracy analysis

For dialect detection, we have defined True Positive (TP) as the number of CSV files where the dialect was correctly detected. By its way, False Positive (FP) is defined as the number of CSV files where the dialect was incorrectly identified as a specific dialect when it was actually a different dialect. False Negatives (FN) is defined as the number of CSV files where the specific dialect was present but not detected.

The next table shows the precision (P), which measures the accuracy of dialect detection when predicting a specific dialect. The metric is calculated as follows

$$P=\frac{TP}{TP+FP}$$

Data set	`CSVsniffer MADSE`	`CSVsniffer`	`CleverCSV`	`csv.Sniffer`	DuckDB `sniff_csv`
POLLOCK	0.9527	0.9655	0.9517	0.9635	0.8414
CSV Wrangling	0.9032	0.9057	0.8431	0.8056	0.7682
CSV Wrangling filtered CODEC	0.9155	0.8947	0.8425	0.8000	0.7803
CSV Wrangling MESSY	0.8028	0.7820	0.7165	0.6667	0.6591
W3C-CSVW	0.9452	0.9539	0.6111	0.9769	0.9908

The following table shows the recall (R), which measures the ability of the method to detect the specific dialect when it is actually present. The metric is calculated as follows

$$R=\frac{TP}{TP+FN}$$

Data set	`CSVsniffer MADSE`	`CSVsniffer`	`CleverCSV`	`csv.Sniffer`	DuckDB `sniff_csv`
POLLOCK	1.0	0.9790	0.9787	0.9231	0.9760
CSV Wrangling	0.8537	0.8780	0.8323	0.7682	0.8056
CSV Wrangling filtered CODEC	1.0	0.9297	0.8770	0.8136	0.9115
CSV Wrangling MESSY	1.0	0.9204	0.8585	0.7843	0.8969
W3C-CSVW	0.9904	0.9810	0.9635	0.5826	0.9817

The below table shows the F1 score, which is the most polished measure of dialect detection accuracy. The metric is calculated as follows

$$F1=2 \times \frac{P \times R}{P+R}$$

Data set	`CSVsniffer MADSE`	`CSVsniffer`	`CleverCSV`	`csv.Sniffer`	DuckDB `sniff_csv`
POLLOCK	0.9758	0.9722	0.9650	0.9429	0.9037
CSV Wrangling	0.8778	0.8916	0.8377	0.7865	0.7865
CSV Wrangling filtered CODEC	0.9559	0.9119	0.8594	0.8067	0.8408
CSV Wrangling MESSY	0.8906	0.8456	0.7811	0.7207	0.7598
W3C-CSVW	0.9673	0.9673	0.7479	0.7299	0.9862

Thus, the True Positive (TP) weighted F1 score for each tool is computed as

$$F1_{Weighted} Score = \frac{\sum_{i=1}^{n} TP_i \times F1_{Score}i}{\sum_{i=1}^{n} TP_i}$$

where

$\text{TP}_i$: The number of True Positive instances of dataset $i$.
$F1_{Score}i$: The F1 score for dataset $i$.

The computations are given in the below table.

Tool	F1 score
`CSVsniffer MADSE`	0.9378
`CSVsniffer`	0.9260
DuckDB `sniff_csv`	0.8806
`CleverCSV`	0.8425
`csv.Sniffer`	0.8049

Conclusions

By studying the last table it is concluded that the Table Uniformity method is able to predict and determine the dialects of CSV files with an accuracy of 92.60% using a sample of 10 records, while the others tools can reach 88.06% of accuracy. Surprisingly, the DuckDB sniff_csv function, loading 100 records, outperforms CleverCSV by 3.81%, perhaps due to its high performance on the W3C-CSVW database. By increasing the sample to 50 and using Median Absolute Deviation and Shannon Entropy (MADSE), the library is able to predict CSV dialects with 93.78% accuracy.

The proposed methodology shows an improvement of up to 5.72% over the DuckDB sniff_csv function and of 9.53% over CleverCSV using the same source code for data type detection as CleverCSV. At this point, the library has successfully achieved a substantial improvement resulting from the implementation of stricter metrics on tabular structures, reducing the false positives detected in cells. On the other hand, CleverCSV doesn't shows significant accuracy improvements when reading all the data from the CSV files. This unexpected result helps to reaffirm that dialect detection does not always require reading all the information from the CSV files.

The DuckDB sniff_csv function was the big revelation, proving to be superior to CleverCSV in accuracy and performance. So we can conclude that the Table Uniformity approach represents the state-of-the-art methodology in the dialect detection field, despite the fact that the increased accuracy and robustness leads to an increase in execution time. An alternative would be to use Pandas to avoid the overhead of using the file iterator to pre-filter the lines to be used in the creation of the tables to be evaluated.

Requirements

Below are the requirements for reproducing the experiments.

Microsoft Office Excel.
Python v3
CleverCSV and all its dependencies.
DuckDB and all its dependencies.

Credits

Many of the CSV files used in this research were recovered from different repositories. Below you can review the list.

franciscom/testlink-code-playground/
boryn/yii_demo/
dengkeaway/kunagi/
austinrfnd/arethedodgersplayingtonight.com/
okfn/messytables/
kmugglet/Reporting/
jankvak/Schedule-of-pain/
kamilklw/onlineDR/
mavcunha/dojosp/
evenwestvang/skoolgate/
shalomb/dotfiles/
ockam/php-csv/
code34/war-in-takistan/
thejesusbr/GPU-Color2Gray/
zardosht/clustering/
bryanburgers/personal-site/
godlessendeavor/MultiDB/
vofp/Relation-Browser-Ruby/
javierfdr/Endrov-collaboration/
nmklong/limesurvey-cdio3/
jkamenik/Rails-Profiler/
philogb/philogb.github.com/
jaredcohe/sec_scrape/
andrewxhill/MOL/
abelhegedus/Magic-Collection-Builder/
Momus/Multipass/
christhorpe/gacscraper/
ikharlampenkov/gkh/
jekozyra/ngo-project/
dynamicpacket-public/opensips/
ryzhov/ATS0/
pejo751/sube751/
markkimsal/agent-ohm/
ericpaulbishop/gargoyle/
Dopi/JetPlatform/
jburks/OpenDidj/
matsubo/emoji-sprite/
19maps/getWeather/
athomason/perl-Memcached-libmemcached/
nadavoid/ReadySiteBase/
lagenar/rivadavia/
DAISUKEICHIKAWA/pred/
mhausenblas/schema-org-rdf/
mhausenblas/omnidator/
shinichinomura/zipcode_jp-python/
bashofmann/opensocial_demo_game/
rajkosto/mxoemu/
pino1068/riskman/
rocksolidwebdesign/AweCMS/
duckduckgo/zeroclickinfo-goodies/
jeffreyhorner/coloring/
ivanistheone/Latent-Dirichlet-Allocation/
mvilrokx/BPAD/
navid/Nav/
lboaretto/Stratos/
robertogds/ReallyLateBooking/
regdog/goodlife/
danux/django-holding-page/
popdevelop/snapplr/
MichaelMarner/Half-Real-Scenes/
bjpop/website/
dvydra/coupon-rails/
aaronzhang/kunagi/
tbielawa/py-prtstat/
tricycle/electrodrive-market-analysis/
darpified/kunagi/
jgarciagarrido/SegoviaOpenTeam/
datenspiel/is_csv_importable/
cgueret/eRDF/
zikula-modules/Eternizer/
azizmb/TWSS/
kms/ds1052e-measurements/
pphetra/fresh-odhd/
levivya/investmarketkz/
TeamImba/MDAS/
saghul/sipp-scenarios/
wliao008/mysteryleague/
mgius/cpe458/
ecell/ecell3-spatiocyte/
gnorsilva/frontlinesms-core/
tonytian33/shoppinglist/
Airead/excise/
UncleCJ/alexastuff/
colinmollenhour/magento-lite/
barbie/test-xhtml/
kyr0/ExtZF/
petewarden/dstkdata/
lacimol/kunagi/
apslab/ap-manager/
noosamad/mxp-consolidate/
practicalparticipation/IKMLinkedResearch/
cwegrzyn/RHIPE/
nadineproject/nadine/
samsonjs/samhuri.net/
kcampos/Kuali-Sakai-Functional-Test-Automation-Framework/
max7255/FPGA-Analyzer/
LATC/sandbox/
jeffkit/autoforms/
d8agroup/metaLayer-Gateway/
hmgaspar/eesddp/
fadzril/freelovr/
fguillen/simplecov-csv/
aminin/google-api-adwords-php/
michaeltyson/potionstore/9
srveit/distance/
zarma/baghdad01/
hippiefahrzeug/jeannie/
digitalfox/py10n/
JanHoralik/jh-prototypes/
alexmajy/flashcards/
onlytiancai/codesnip/
vofp/relation-browser/
Error-331/Axis-modules/
CarnegieLearning/MathFluency/
i-dcc/allele_image/
hogi/kunagi/
BarbaraEMac/TrivBot/
Wisser/Jailer/
jasonlong/benfords-law/
soccermetrics/marcotti-sql/
kennym/itools/
rkoeppl/skeinforge_settings/
aalzola/prueba/
cmheisel/kardboard/
gfalav/newcar/
PabloCastellano/pablog-scripts/
elle/music-library/
oppian/xerobis/
twidi/satchmo/
34/T/
apache/bigtop/
ciriarte/laundry/
romanchyla/montysolr/
alibezz/SNRails_Research/
SQLServerIO/IometerParser/
renduples/alibtob/
evanleonard/kunagi/
yuvadm/ayalon-dat/
marplatense/erlang/
spazm/Iron-Munger/
crschmidt/haitibrowser/
eLBirador/AllAboutCity/
Coalas/apsilesia/
emelleme/All-Drag-Racing/
despesapublica/site/
apkennedy/Emmaus-Silverstripe/
Coalas/kolakowski/
notioncollective/
emelleme/G8-life/
mgc544/sscustom-tra/
andyinabox/Beyond-the-Debt-Ceiling/
emelleme/PhillyOpen/
sponsoredlinx/Total-Care-Asphalting/
davidmontgomery/Wireframe/
frappe/frappe/
davedash/SUMO-issues/
Habrok/HelloWorld/
gaubert/java-balivernes/
befair/gasistafelice/
komagata/cloister/
ryannscy/Unemployment-Chart/
guylhem/PerlMSI/
lightspeedretail/webstore/
naderman/symfony/
orchestra-io/sample-symfony2/
biow0lf/prometheus2.0/
sctape/GrinnellPlans/
joshuabenuck/eshell/
cmcginn/Common/
ruby-rdf/sparql-client/
doubi/Text-CSV_XS/
cpf-se/citrus/
abhishekkr/
NeoGeographyToolkit/kraken/
denisoid/homebank_import_scripts/
Jazzinghen/spamOSEK/
DiUS/java-api-bindings/
ikko/hazaitop/
blancavg/ggplot2-basics/
learnstream/ls_atomic/
pdawczak/onko/
peterkrenn/spit-generator/
TasteeWheat/mxoemu1/
digitarald/redracer/
Ezku/xi-framework/
item/sugarcrm_dev/
candidasa/silverstripe-phpunit-3.4/
EHER/phpunit-all-in-one/
gsdevme-archive/CURL/
vivid-planet/library/
psawaya/CS34-Bus-Routing/
headius/jruby-cdc/
sbourdeauducq/milkymist-ruby/
hamhei/hamcolor/
johnl/deb-ruby1.9.1/
darealcaffeine/first_app/
srirammca53/update_status/
nikuuchi/oreore_ruby/
Pettrov/Asynchronous-Operations-Module--PHP-/
TeamRocketScience/Claroline-TRS-Edition/
khjgbkbk/CRMS/
milandobrota/WPSocialNetwork/
cciechad/brlcad/
mattmccray/gumdrop/
irace/irace.github.com/
ronny/kodepos/
Juuro/Dreamapp-Website/
Belarus/Windows.OmegaT/
tuxnani/pyrhmn/
cesine/ToolsForFieldLinguistics/
mathics/Mathics/
jwiegley/CSV2Ledger/
jmlegr/GestComp/
mchelen/data_gov_catalog_data/
jasherai/maatkit/
cwarden/kettle/
purpleKarrot/wowmapview/
egonz/old_sql/
johnantoni/old.tutorials/
code-mangler/my-emacs-package/
calle/fn/
matthewfarrell/gargoyle/
e2thex/hackunteers.org/
andrewjpage/freezer_tracking/
fbacall/simple-spreadsheet-extractor/
simonvh/gimmemotifs/
disnet/contracts.js/
eckes/wandern/
zxvf/--2/
tyage/town-cake/
whiteshark/kunagi/
Surgeon/Watir/
cawka/DSMS_NBC/
dptww1/WW1AirMap2/
jonibo/ledgersmb/
dlc/ttcsbrandon/
akariv/obudget2/
meloun/py_ewitis/
semantic-im/sim-rf/
sankroh/django-form-manager/
JudoWill/pyMutF/
jou4/parsure/
shanx/djangocon.eu/
paulgessinger/fermi/
ajpalkovic/b2010/
OhTu-IDDQD/scheduler3000/
OlliD/DtwSequenceCompare/
lukeorland/phonography/
r0ck3y3/AAI/
glenbot/beerapp/
StephanHoyer/magento-lucene/