Show HN: Turn CSV Files into SQL Statements for Quick Database Transfers (github.com)

13 points by ryanwaldorf 2 years ago · 15 comments

This package lets you unload a CSV from a warehouse, turn that CSV into SQL statements (a temp-table CREATE plus INSERTs) with a single CLI command, and copy those statements into your query editor so you can start using the data in a different warehouse. It's useful when you need to join data that's stored in two separate data warehouses (e.g. finance and product data).
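
For a rough idea of the kind of statements it generates (a sketch, not the tool's exact output; the table and column names here are invented):

  CREATE TEMPORARY TABLE finance_transactions (
      transaction_id INTEGER,
      amount         DECIMAL(12, 2),
      created_at     DATE
  );

  INSERT INTO finance_transactions (transaction_id, amount, created_at) VALUES
      (1001, 49.99, '2023-01-15'),
      (1002, 12.50, '2023-01-16');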

bdcravens 2 years ago

For small datasets, this is good, but for anything significant, using the database's native bulk load will be more efficient than INSERT statements.
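
For example, with PostgreSQL the native route would be a single COPY rather than row-by-row INSERTs (table name and path are placeholders):

  COPY my_table FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER true);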

  • retox 2 years ago

    I recently had to transfer a relatively small amount of data, ~2GB, between MS SQL databases hosted on the same server. It was a consolidation of 10 DBs in total, merging the data from 5 of them into a different schema on the other 5. Think 5 DBs for current data and 5 for legacy data, where the outcome was that the original 5 current-data DBs contained both the current and legacy data.

    There was a requirement that the migration didn't blow up the transaction logs on the target DBs. I thought it would be a simple affair, but after much testing I had to write a tool to generate a list of all the tables in each source DB, generate bcp commands to produce format files for all the tables, then bcp commands to dump the data for each table, and finally SQL statements to bulk import the data using both files. The MS docs were woefully unhelpful.
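
    Roughly the shape of it per table, for anyone curious (a sketch; server, database, table, and path names are placeholders): one bcp call to generate the format file, one to dump the data, then a BULK INSERT on the target, with TABLOCK to keep logging minimal under the bulk-logged or simple recovery model.

      bcp SourceDb.dbo.Orders format nul -n -f Orders.fmt -S myserver -T
      bcp SourceDb.dbo.Orders out Orders.dat -n -S myserver -T

      BULK INSERT TargetDb.dbo.Orders
      FROM 'C:\dump\Orders.dat'
      WITH (FORMATFILE = 'C:\dump\Orders.fmt', TABLOCK);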

  • ryanwaldorfOP 2 years ago

    Yup that's exactly it. This is an ease-of-use option for when you need to ingest somewhat small amounts of data into another database for joining. For large amounts of data, copying the CSV via a copy command should be your best bet.

  • BohuTANG 2 years ago

    Yeah. The insert-statement approach breaks down on larger datasets, and the parser incurs significant overhead. The most efficient method is a COPY INTO command, as in Databend[1] or Snowflake[2], for bulk data loading (rough example after the links).

    [1] Databend: https://docs.databend.com/guides/load-data/load-semistructur...

    [2] Snowflake: https://docs.snowflake.com/en/sql-reference/sql/copy-into-ta...

    [3] Data Ingestion Benchmark: https://docs.databend.com/guides/benchmark/data-ingest
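
    For reference, the Snowflake form of [2] looks roughly like this (stage, table, and file names are placeholders):

      COPY INTO my_table
      FROM @my_stage/data.csv
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);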

nbbaier 2 years ago

This is neat! Correct me if I'm wrong, but don't a lot of systems already have the ability to ingest CSV data? Why this intermediate step?

  • ryanwaldorfOP 2 years ago

    It's helpful for the ones that don't have this yet, like Redshift (full disclosure: I'm a PM there). If your system already has it, though, you probably just want to use that.

  • bdcravens 2 years ago

    Perhaps ease in troubleshooting? I know dealing with line endings, mystery encodings, etc. can make CSV ingestion tedious.

    • adammarples 2 years ago

      This, as far as I can tell, completely ignores all of that. You're supposed to export the data yourself as CSV before beginning, so it's up to you what character encoding, delimiter, escape characters, etc. you will be using. The CSV is fed directly into "pd.read_csv" with no option to pass any arguments along in case you have a slightly non-standard CSV format. This is a good tool if it solves this guy's problems, but as a data engineer I'd suggest he learn the tooling a little more. He references Snowflake and BigQuery, and I know they both let you upload a CSV into a table and auto-detect or infer the schema very easily. That would be a better way than using pandas and copy-paste.

      • ryanwaldorfOP 2 years ago

        Fair callouts. For the first version I was looking to solve the basic use case of "I've downloaded a standard CSV from one database and I want to ingest it into another". I'll look at adding the ability to handle non-standard CSVs as well.

        Also, I appreciate you pointing out what Snowflake and BigQuery offer here. I'm wondering if there's something Redshift-specific that may be useful for larger file volumes. A CSV COPY is more performant, but it can be a PITA to get a role with the right permissions to copy data from the right location.
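
        For comparison, the COPY path that permissions pain comes from looks roughly like this (bucket, file, and role ARN are placeholders):

          COPY my_table
          FROM 's3://my-bucket/data.csv'
          IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
          FORMAT AS CSV
          IGNOREHEADER 1;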

        Anything else you'd recommend I look into?

        • adammarples 2 years ago

          Yeah, I'm not sure about Redshift, but BigQuery uses "autodetect", so something like:

          bq load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

          And Snowflake uses INFER_SCHEMA; I believe you can do this:

          select * from table( infer_schema( location=>'@stage/myfile.csv', file_format=>'my_csv_format' ) );

          Although tbh I'm not sure if that's what you're looking for. You might enjoy looking at DuckDB for stuff like this. My policy when I started data engineering was to bung everything into pandas dataframes; now my policy is to avoid them at all costs because they're slow and memory-hungry!
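
          For instance, a DuckDB one-liner for the same job would be something along these lines (file and table names assumed):

            duckdb -c "CREATE TABLE mytable AS SELECT * FROM read_csv_auto('myfile.csv')"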

          • qoega 2 years ago

            In ClickHouse it's just `INSERT INTO t FROM INFILE 'data.csv.gz'`. Any supported format and compression, autodetected from the file name and a sample of the data to get column types, delimiters, etc. Separate tools to convert CSV aren't necessary if you can just import into the DB and export as SQL statements.

            echo "name,age,city John,30,New York Jane,25,Los Angeles" > example.csv

            clickhouse local -q "SELECT * FROM file('example.csv') FORMAT SQLInsert" INSERT INTO table (`name`, `age`, `city`) VALUES ('John', 30, 'New York'), ('Jane', 25, 'Los Angeles');

ddgflorida 2 years ago

There are also many browser-based tools for quick CSV-to-SQL conversion, e.g. www.convertcsv.com/csv-to-sql.htm

mharig 2 years ago

One can do this with the sqlite3 CLI tool.
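
For instance, something along these lines (file and table names assumed): .import loads the CSV into a table, and .dump prints the CREATE TABLE and INSERT statements you can paste elsewhere.

  sqlite3 temp.db ".import --csv data.csv mytable" ".dump mytable"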
