Show HN: Turn CSV Files into SQL Statements for Quick Database Transfers
This package lets you unload a CSV from a warehouse, turn that CSV into SQL statements that create a temp table and insert the data with a single CLI command, and copy those SQL statements into your query editor so you can start using the data in a different warehouse. It's useful when you need to join data that's stored in two separate data warehouses (e.g. finance and product data).

For small datasets this is fine, but for anything significant, using the database's native bulk load will be more efficient than INSERT statements.

I recently had to transfer a relatively small amount of data, ~2 GB, between MS SQL databases hosted on the same server. It was a consolidation of the data from 10 DBs in total, merging the data from 5 of them into a different schema on the other 5. Think 5 DBs for current data and 5 for legacy data, where the outcome was that the original 5 current-data DBs contained both the current and the legacy data. There was a requirement that the migration didn't blow up the transaction logs on the target DBs. I thought it would be a simple affair, but after much testing I had to write a tool to generate a list of all the tables in each source DB, generate bcp commands to dump a format file for each table, then bcp commands to dump the data for each table, then SQL statements to bulk import the data using both files. The MS docs were woefully unhelpful.
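Roughly the shape those generated pieces take per table, as a sketch only (the server, database, schema, table and path names below are hypothetical; TABLOCK is what helps keep transaction-log growth down, provided the target DB is in simple or bulk-logged recovery):

1) Generate a format file describing the table layout:

bcp SourceDb.dbo.MyTable format nul -c -f C:\dump\MyTable.fmt -S MYSERVER -T

2) Dump the table's data:

bcp SourceDb.dbo.MyTable out C:\dump\MyTable.dat -c -S MYSERVER -T

3) Bulk import into the target schema using both files:

BULK INSERT TargetDb.legacy.MyTable
FROM 'C:\dump\MyTable.dat'
WITH (FORMATFILE = 'C:\dump\MyTable.fmt', TABLOCK);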
Isn't there a query option to disable/configure log writes for cases like this?

Why were folks concerned about the transaction logs?

Yup, that's exactly it. This is an ease-of-use option for when you need to ingest somewhat small amounts of data into another database for joining. For large amounts of data, copying the CSV via a copy command should be your best bet.

Yeah. The INSERT-statement approach fails when dealing with larger datasets, and the parser incurs significant overhead. The most efficient method for bulk data loading is the COPY INTO command, as in Databend[1] or Snowflake[2].

[1] Databend: https://docs.databend.com/guides/load-data/load-semistructur...
[2] Snowflake: https://docs.snowflake.com/en/sql-reference/sql/copy-into-ta...
[3] Data Ingestion Benchmark: https://docs.databend.com/guides/benchmark/data-ingest
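For reference, a minimal Snowflake COPY INTO looks something like the following (table, stage and file names are placeholders; Databend's COPY INTO is broadly similar):

COPY INTO my_table
FROM @my_stage/unloaded_data.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);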
This is neat! Correct me if I'm wrong, but don't a lot of systems already have the ability to ingest CSV data? Why the intermediate step?

It's helpful for the ones that don't have this yet, like Redshift (for full disclosure, I'm a PM there). If your system does have this already, though, you probably just want to use that.

Perhaps ease of troubleshooting? I know dealing with line endings, mystery encodings, etc. can make CSV ingestion tedious.

This, as far as I can tell, completely ignores all of that. You're supposed to export the data yourself as CSV before beginning, so it's up to you what character encoding, delimiter, escape characters, etc. you will be using. The CSV is fed directly into pd.read_csv with no option to pass any arguments along in case you have a slightly non-standard CSV format.
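To make the non-standard-CSV point concrete, these are the kinds of read_csv arguments you'd want to be able to pass through (a sketch; the file name and option values below are made up):

import pandas as pd

# e.g. a semicolon-delimited, Latin-1 encoded export with backslash escapes
df = pd.read_csv(
    "export.csv",
    sep=";",
    encoding="latin-1",
    escapechar="\\",
    quotechar='"',
    na_values=["NULL", ""],
)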
This is a good tool if it solves this guy's problems, but as a data engineer I would suggest he learn the tooling a little more. He makes reference to Snowflake and BigQuery, and I know they both allow you to upload a CSV into a table and auto-detect or infer the schema very easily. That would be a better way than using pandas and copy-paste.

Fair callouts. For the first version I was looking to solve the basic use case of "I've downloaded a standard CSV from one database and I want to ingest it into another". I'll look at adding the ability to handle non-standard CSVs as well. I also appreciate you pointing out what Snowflake and BigQuery offer here. I'm wondering if there's something Redshift-specific that may be useful for larger file volumes. It is more performant to do a CSV copy, but it can be a PITA when it comes to having a role with the right permissions to copy data from the right location. Anything else you'd recommend I look into?

Yeah, I'm not sure about Redshift, but BigQuery uses autodetect, so something like:

bq load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

And Snowflake uses INFER_SCHEMA. I believe you can do this:

select * from table(
  infer_schema(
    location=>'@stage/myfile.csv',
    file_format=>'my_csv_format'
  )
);

Although tbh I'm not sure if that's what you're looking for.

You might enjoy looking at duckdb for stuff like this. My policy when starting data engineering was to bung everything into pandas dataframes, and now my policy is to try to avoid them at all costs because they're slow and memory hungry!
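A duckdb equivalent, for comparison (a sketch; file and table names are placeholders) — run inside the duckdb CLI, it sniffs the header, delimiter and column types straight from the file:

CREATE TABLE example AS SELECT * FROM read_csv_auto('example.csv');
SELECT * FROM example;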
In ClickHouse it is just `INSERT INTO t FROM INFILE 'data.csv.gz'`. Any supported format, any compression, autodetected from the file name and a sample piece of the data to get column types, delimiters, etc.

Separate tools to convert CSV are not necessary if you can just import into the db and export as SQL statements:

echo "name,age,city
John,30,New York
Jane,25,Los Angeles" > example.csv

clickhouse local -q "SELECT * FROM file('example.csv') FORMAT SQLInsert"
INSERT INTO table (`name`, `age`, `city`) VALUES ('John', 30, 'New York'), ('Jane', 25, 'Los Angeles');

There are many browser-based solutions for quick CSV-to-SQL as well, e.g. www.convertcsv.com/csv-to-sql.htm

One can do this with the sqlite3 CLI tool.
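For example, something along these lines (a sketch reusing the example.csv from the ClickHouse comment above):

sqlite3 :memory: <<'EOF'
.mode csv
.import example.csv example
.mode insert example
SELECT * FROM example;
EOF

This prints ready-to-paste INSERT INTO example VALUES(...) statements.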