CSVsniffer
Companion repository for the paper:
Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference (PDF)
by W. García.
An application of the Table Uniformity method outlined in the paper can be found in the CSV interface repository. Also, a Python version using CleverCSV source code has been implemented and stored in the python folder, demonstrating the reliability of the research presented.
Introduction
The results from the research can be reproduced by running the RunTests method from the macro-enabled Excel workbook CSVsniffer.xlsm. To review the results for CleverCSV it is necessary to run the scripts from the clevercsv_test.py file. The text files with the results output are stored in the Current research and cleverCSV folders. Additional results can be reproduced by running scripts in the python/src/run_tests.py file, the results can be read from the python/tests results/ folder.
Data
The CSV folder contains the files copied from the Pollock framework and other collected test files. Also the dataset used for the CSV wrangling research is available in the CSV_Wranglin folder. Note that only link to the files can be provided, in this last case,due to the authors holds the copyright. A dataset from w3c, CSVW project, is available in the W3C-CSVW folder.
The expect configuration for each tested CSV is saved in the Dialect_annotations.txt, Manual_dialect_annotation.txt and W3C-CSVW-Dialect_annotations.txt files.
Results
In this section, the results after running tests with the Beta Python implementation of the Table Uniformity method are presented. In order to obtain more representative performance metrics, it was decided to measure scripts running time on a Linux Mint system. All other metrics were obtained from a Windows 10 system. We have introduced a variant of the Table Uniformity Method that uses new metrics, including Mean Absolute Deviation and Shannon Entropy. These metrics will appear under the heading CSVsniffer MADSE.
The table below shows the dialect detection success ratio for CSVsniffer, CleverCSV and the built-in Python csv.Sniffer class module. Note that the accuracy has been measured using only those files that do not produce a failure when attempting to infer CSV dialects
| Data set | CSVsniffer MADSE |
CSVsniffer |
CleverCSV |
csv.Sniffer |
DuckDB sniff_csv |
|---|---|---|---|---|---|
| POLLOCK | 95.2703% | 96.5517% | 95.1724% | 96.3504% | 84.1379% |
| CSV Wrangling | 90.3226% | 90.5660% | 84.3137% | 80.5556% | 76.8212% |
| CSV Wrangling filtered CODEC | 91.5493% | 89.4737% | 84.2520% | 80.0000% | 78.0303% |
| CSV Wrangling MESSY | 80.2817% | 78.1955% | 71.6535% | 66.6667% | 65.9091% |
| W3C-CSVW | 94.5205% | 95.3917% | 61.1111% | 97.6923% | 99.0783% |
The table below shows the failure ratio for each tool.
| Data set | CSVsniffer MADSE |
CSVsniffer |
CleverCSV |
csv.Sniffer |
DuckDB sniff_csv |
|---|---|---|---|---|---|
| POLLOCK [148 files] | 0.0% | 2.0270% | 2.0270% | 7.4324% | 2.027% |
| CSV Wrangling [179 files] | 13.4078% | 11.1732% | 14.5251% | 19.5531% | 15.6425% |
| CSV Wrangling filtered CODEC [142 files] | 0.0% | 6.3380% | 10.5634% | 15.4930% | 7.0423% |
| CSV Wrangling MESSY [126 files] | 0.0% | 6.3380% | 10.5634% | 15.4930% | 7.0423% |
| W3C-CSVW [221 files] | 0.905% | 1.8100% | 2.2624% | 41.1765% | 1.81% |
The following table shows the average success and failure ratio for selected tools. The higher the number of errors obtained, the lower the reliability for detection.
| Tool | Success ratio (SR) | Failure ratio (FR) |
|---|---|---|
CSVsniffer MADSE |
90.39% | 2.86% |
CSVsniffer |
90.04% | 5.54% |
DuckDB sniff_csv |
80.80% | 6.71% |
CleverCSV |
79.30% | 7.99% |
csv.Sniffer |
84.25% | 19.83% |
As a complementary metric, the table below shows the average reliability factor for CSV dialect detection. This value is computed as:
| Tool | Reliability factor (RF) |
|---|---|
CSVsniffer MADSE |
87.80% |
CSVsniffer |
85.05% |
DuckDB sniff_csv |
75.38% |
CleverCSV |
72.96% |
csv.Sniffer |
67.54% |
The below table shows the execution times obtained. In this one we can see that the Python module, reading 6144 characters from the CSV files, is incredibly efficient, easily outperforming the other tools.
| Tool | Run-time |
|---|---|
csv.Sniffer |
0.98 sec. |
DuckDB sniff_csv |
4.40 sec. |
CleverCSV |
6.54 sec. |
CSVsniffer |
17.00 sec. |
Accuracy analysis
For dialect detection, we have defined True Positive (TP) as the number of CSV files where the dialect was correctly detected. By its way, False Positive (FP) is defined as the number of CSV files where the dialect was incorrectly identified as a specific dialect when it was actually a different dialect. False Negatives (FN) is defined as the number of CSV files where the specific dialect was present but not detected.
The next table shows the precision (P), which measures the accuracy of dialect detection when predicting a specific dialect. The metric is calculated as follows
| Data set | CSVsniffer MADSE |
CSVsniffer |
CleverCSV |
csv.Sniffer |
DuckDB sniff_csv |
|---|---|---|---|---|---|
| POLLOCK | 0.9527 | 0.9655 | 0.9517 | 0.9635 | 0.8414 |
| CSV Wrangling | 0.9032 | 0.9057 | 0.8431 | 0.8056 | 0.7682 |
| CSV Wrangling filtered CODEC | 0.9155 | 0.8947 | 0.8425 | 0.8000 | 0.7803 |
| CSV Wrangling MESSY | 0.8028 | 0.7820 | 0.7165 | 0.6667 | 0.6591 |
| W3C-CSVW | 0.9452 | 0.9539 | 0.6111 | 0.9769 | 0.9908 |
The following table shows the recall (R), which measures the ability of the method to detect the specific dialect when it is actually present. The metric is calculated as follows
| Data set | CSVsniffer MADSE |
CSVsniffer |
CleverCSV |
csv.Sniffer |
DuckDB sniff_csv |
|---|---|---|---|---|---|
| POLLOCK | 1.0 | 0.9790 | 0.9787 | 0.9231 | 0.9760 |
| CSV Wrangling | 0.8537 | 0.8780 | 0.8323 | 0.7682 | 0.8056 |
| CSV Wrangling filtered CODEC | 1.0 | 0.9297 | 0.8770 | 0.8136 | 0.9115 |
| CSV Wrangling MESSY | 1.0 | 0.9204 | 0.8585 | 0.7843 | 0.8969 |
| W3C-CSVW | 0.9904 | 0.9810 | 0.9635 | 0.5826 | 0.9817 |
The below table shows the F1 score, which is the most polished measure of dialect detection accuracy. The metric is calculated as follows
| Data set | CSVsniffer MADSE |
CSVsniffer |
CleverCSV |
csv.Sniffer |
DuckDB sniff_csv |
|---|---|---|---|---|---|
| POLLOCK | 0.9758 | 0.9722 | 0.9650 | 0.9429 | 0.9037 |
| CSV Wrangling | 0.8778 | 0.8916 | 0.8377 | 0.7865 | 0.7865 |
| CSV Wrangling filtered CODEC | 0.9559 | 0.9119 | 0.8594 | 0.8067 | 0.8408 |
| CSV Wrangling MESSY | 0.8906 | 0.8456 | 0.7811 | 0.7207 | 0.7598 |
| W3C-CSVW | 0.9673 | 0.9673 | 0.7479 | 0.7299 | 0.9862 |
Thus, the True Positive (TP) weighted F1 score for each tool is computed as
where
-
$\text{TP}_i$ : The number of True Positive instances of dataset$i$ . -
$F1_{Score}i$ : The F1 score for dataset$i$ .
The computations are given in the below table.
| Tool | F1 score |
|---|---|
CSVsniffer MADSE |
0.9378 |
CSVsniffer |
0.9260 |
DuckDB sniff_csv |
0.8806 |
CleverCSV |
0.8425 |
csv.Sniffer |
0.8049 |
Conclusions
By studying the last table it is concluded that the Table Uniformity method is able to predict and determine the dialects of CSV files with an accuracy of 92.60% using a sample of 10 records, while the others tools can reach 88.06% of accuracy. Surprisingly, the DuckDB sniff_csv function, loading 100 records, outperforms CleverCSV by 3.81%, perhaps due to its high performance on the W3C-CSVW database. By increasing the sample to 50 and using Median Absolute Deviation and Shannon Entropy (MADSE), the library is able to predict CSV dialects with 93.78% accuracy.
The proposed methodology shows an improvement of up to 5.72% over the DuckDB sniff_csv function and of 9.53% over CleverCSV using the same source code for data type detection as CleverCSV. At this point, the library has successfully achieved a substantial improvement resulting from the implementation of stricter metrics on tabular structures, reducing the false positives detected in cells. On the other hand, CleverCSV doesn't shows significant accuracy improvements when reading all the data from the CSV files. This unexpected result helps to reaffirm that dialect detection does not always require reading all the information from the CSV files.
The DuckDB sniff_csv function was the big revelation, proving to be superior to CleverCSV in accuracy and performance. So we can conclude that the Table Uniformity approach represents the state-of-the-art methodology in the dialect detection field, despite the fact that the increased accuracy and robustness leads to an increase in execution time. An alternative would be to use Pandas to avoid the overhead of using the file iterator to pre-filter the lines to be used in the creation of the tables to be evaluated.
Requirements
Below are the requirements for reproducing the experiments.
- Microsoft Office Excel.
- Python v3
- CleverCSV and all its dependencies.
- DuckDB and all its dependencies.
Credits
Many of the CSV files used in this research were recovered from different repositories. Below you can review the list.
- franciscom/testlink-code-playground/
- boryn/yii_demo/
- dengkeaway/kunagi/
- austinrfnd/arethedodgersplayingtonight.com/
- okfn/messytables/
- kmugglet/Reporting/
- jankvak/Schedule-of-pain/
- kamilklw/onlineDR/
- mavcunha/dojosp/
- evenwestvang/skoolgate/
- shalomb/dotfiles/
- ockam/php-csv/
- code34/war-in-takistan/
- thejesusbr/GPU-Color2Gray/
- zardosht/clustering/
- bryanburgers/personal-site/
- godlessendeavor/MultiDB/
- vofp/Relation-Browser-Ruby/
- javierfdr/Endrov-collaboration/
- nmklong/limesurvey-cdio3/
- jkamenik/Rails-Profiler/
- philogb/philogb.github.com/
- jaredcohe/sec_scrape/
- andrewxhill/MOL/
- abelhegedus/Magic-Collection-Builder/
- Momus/Multipass/
- christhorpe/gacscraper/
- ikharlampenkov/gkh/
- jekozyra/ngo-project/
- dynamicpacket-public/opensips/
- ryzhov/ATS0/
- pejo751/sube751/
- markkimsal/agent-ohm/
- ericpaulbishop/gargoyle/
- Dopi/JetPlatform/
- jburks/OpenDidj/
- matsubo/emoji-sprite/
- 19maps/getWeather/
- athomason/perl-Memcached-libmemcached/
- nadavoid/ReadySiteBase/
- lagenar/rivadavia/
- DAISUKEICHIKAWA/pred/
- mhausenblas/schema-org-rdf/
- mhausenblas/omnidator/
- shinichinomura/zipcode_jp-python/
- bashofmann/opensocial_demo_game/
- rajkosto/mxoemu/
- pino1068/riskman/
- rocksolidwebdesign/AweCMS/
- duckduckgo/zeroclickinfo-goodies/
- jeffreyhorner/coloring/
- ivanistheone/Latent-Dirichlet-Allocation/
- mvilrokx/BPAD/
- navid/Nav/
- lboaretto/Stratos/
- robertogds/ReallyLateBooking/
- regdog/goodlife/
- danux/django-holding-page/
- popdevelop/snapplr/
- MichaelMarner/Half-Real-Scenes/
- bjpop/website/
- dvydra/coupon-rails/
- aaronzhang/kunagi/
- tbielawa/py-prtstat/
- tricycle/electrodrive-market-analysis/
- darpified/kunagi/
- jgarciagarrido/SegoviaOpenTeam/
- datenspiel/is_csv_importable/
- cgueret/eRDF/
- zikula-modules/Eternizer/
- azizmb/TWSS/
- kms/ds1052e-measurements/
- pphetra/fresh-odhd/
- levivya/investmarketkz/
- TeamImba/MDAS/
- saghul/sipp-scenarios/
- wliao008/mysteryleague/
- mgius/cpe458/
- ecell/ecell3-spatiocyte/
- gnorsilva/frontlinesms-core/
- tonytian33/shoppinglist/
- Airead/excise/
- UncleCJ/alexastuff/
- colinmollenhour/magento-lite/
- barbie/test-xhtml/
- kyr0/ExtZF/
- petewarden/dstkdata/
- lacimol/kunagi/
- apslab/ap-manager/
- noosamad/mxp-consolidate/
- practicalparticipation/IKMLinkedResearch/
- cwegrzyn/RHIPE/
- nadineproject/nadine/
- samsonjs/samhuri.net/
- kcampos/Kuali-Sakai-Functional-Test-Automation-Framework/
- max7255/FPGA-Analyzer/
- LATC/sandbox/
- jeffkit/autoforms/
- d8agroup/metaLayer-Gateway/
- hmgaspar/eesddp/
- fadzril/freelovr/
- fguillen/simplecov-csv/
- aminin/google-api-adwords-php/
- michaeltyson/potionstore/9
- srveit/distance/
- zarma/baghdad01/
- hippiefahrzeug/jeannie/
- digitalfox/py10n/
- JanHoralik/jh-prototypes/
- alexmajy/flashcards/
- onlytiancai/codesnip/
- vofp/relation-browser/
- Error-331/Axis-modules/
- CarnegieLearning/MathFluency/
- i-dcc/allele_image/
- hogi/kunagi/
- BarbaraEMac/TrivBot/
- Wisser/Jailer/
- jasonlong/benfords-law/
- soccermetrics/marcotti-sql/
- kennym/itools/
- rkoeppl/skeinforge_settings/
- aalzola/prueba/
- cmheisel/kardboard/
- gfalav/newcar/
- PabloCastellano/pablog-scripts/
- elle/music-library/
- oppian/xerobis/
- twidi/satchmo/
- 34/T/
- apache/bigtop/
- ciriarte/laundry/
- romanchyla/montysolr/
- alibezz/SNRails_Research/
- SQLServerIO/IometerParser/
- renduples/alibtob/
- evanleonard/kunagi/
- yuvadm/ayalon-dat/
- marplatense/erlang/
- spazm/Iron-Munger/
- crschmidt/haitibrowser/
- eLBirador/AllAboutCity/
- Coalas/apsilesia/
- emelleme/All-Drag-Racing/
- despesapublica/site/
- apkennedy/Emmaus-Silverstripe/
- Coalas/kolakowski/
- notioncollective/
- emelleme/G8-life/
- mgc544/sscustom-tra/
- andyinabox/Beyond-the-Debt-Ceiling/
- emelleme/PhillyOpen/
- sponsoredlinx/Total-Care-Asphalting/
- davidmontgomery/Wireframe/
- frappe/frappe/
- davedash/SUMO-issues/
- Habrok/HelloWorld/
- gaubert/java-balivernes/
- befair/gasistafelice/
- komagata/cloister/
- ryannscy/Unemployment-Chart/
- guylhem/PerlMSI/
- lightspeedretail/webstore/
- naderman/symfony/
- orchestra-io/sample-symfony2/
- biow0lf/prometheus2.0/
- sctape/GrinnellPlans/
- joshuabenuck/eshell/
- cmcginn/Common/
- ruby-rdf/sparql-client/
- doubi/Text-CSV_XS/
- cpf-se/citrus/
- abhishekkr/
- NeoGeographyToolkit/kraken/
- denisoid/homebank_import_scripts/
- Jazzinghen/spamOSEK/
- DiUS/java-api-bindings/
- ikko/hazaitop/
- blancavg/ggplot2-basics/
- learnstream/ls_atomic/
- pdawczak/onko/
- peterkrenn/spit-generator/
- TasteeWheat/mxoemu1/
- digitarald/redracer/
- Ezku/xi-framework/
- item/sugarcrm_dev/
- candidasa/silverstripe-phpunit-3.4/
- EHER/phpunit-all-in-one/
- gsdevme-archive/CURL/
- vivid-planet/library/
- psawaya/CS34-Bus-Routing/
- headius/jruby-cdc/
- sbourdeauducq/milkymist-ruby/
- hamhei/hamcolor/
- johnl/deb-ruby1.9.1/
- darealcaffeine/first_app/
- srirammca53/update_status/
- nikuuchi/oreore_ruby/
- Pettrov/Asynchronous-Operations-Module--PHP-/
- TeamRocketScience/Claroline-TRS-Edition/
- khjgbkbk/CRMS/
- milandobrota/WPSocialNetwork/
- cciechad/brlcad/
- mattmccray/gumdrop/
- irace/irace.github.com/
- ronny/kodepos/
- Juuro/Dreamapp-Website/
- Belarus/Windows.OmegaT/
- tuxnani/pyrhmn/
- cesine/ToolsForFieldLinguistics/
- mathics/Mathics/
- jwiegley/CSV2Ledger/
- jmlegr/GestComp/
- mchelen/data_gov_catalog_data/
- jasherai/maatkit/
- cwarden/kettle/
- purpleKarrot/wowmapview/
- egonz/old_sql/
- johnantoni/old.tutorials/
- code-mangler/my-emacs-package/
- calle/fn/
- matthewfarrell/gargoyle/
- e2thex/hackunteers.org/
- andrewjpage/freezer_tracking/
- fbacall/simple-spreadsheet-extractor/
- simonvh/gimmemotifs/
- disnet/contracts.js/
- eckes/wandern/
- zxvf/--2/
- tyage/town-cake/
- whiteshark/kunagi/
- Surgeon/Watir/
- cawka/DSMS_NBC/
- dptww1/WW1AirMap2/
- jonibo/ledgersmb/
- dlc/ttcsbrandon/
- akariv/obudget2/
- meloun/py_ewitis/
- semantic-im/sim-rf/
- sankroh/django-form-manager/
- JudoWill/pyMutF/
- jou4/parsure/
- shanx/djangocon.eu/
- paulgessinger/fermi/
- ajpalkovic/b2010/
- OhTu-IDDQD/scheduler3000/
- OlliD/DtwSequenceCompare/
- lukeorland/phonography/
- r0ck3y3/AAI/
- glenbot/beerapp/
- StephanHoyer/magento-lucene/