Show HN: Turn CSV Files into SQL Statements for Quick Database Transfers
This package lets you unload a CSV from a warehouse, turn that CSV into SQL statements that create a temp table and insert the data with a single CLI command, and copy those SQL statements into your query editor so you can start using the data in a different warehouse. It's useful when you need to join data that's stored in two separate data warehouses (e.g. finance and product data).

For small datasets this is fine, but for anything significant, using the database's native bulk load will be more efficient than INSERT statements.

I recently had to transfer a relatively small amount of data, ~2 GB, between MS SQL databases hosted on the same server. It was a consolidation of the data from 10 DBs in total, merging the data from 5 of them into a different schema on the other 5. Think 5 DBs for current data and 5 for legacy data, where the outcome was that the original 5 current-data DBs contained both the current and the legacy data. There was a requirement that the migration didn't blow up the transaction logs on the target DBs. I thought it would be a simple affair, but after much testing I had to write a tool to generate a list of all the tables in each source DB, generate bcp commands to dump a format file for each table, then bcp commands to dump the data for each table, then SQL statements to bulk import the data using both files. The MS docs were woefully unhelpful.
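Roughly the shape those generated pieces take per table, as a sketch only (the server, database, schema, table and path names below are hypothetical; TABLOCK is what helps keep transaction-log growth down, provided the target DB is in simple or bulk-logged recovery):

1) Generate a format file describing the table layout:

bcp SourceDb.dbo.MyTable format nul -c -f C:\dump\MyTable.fmt -S MYSERVER -T

2) Dump the table's data:

bcp SourceDb.dbo.MyTable out C:\dump\MyTable.dat -c -S MYSERVER -T

3) Bulk import into the target schema using both files:

BULK INSERT TargetDb.legacy.MyTable
FROM 'C:\dump\MyTable.dat'
WITH (FORMATFILE = 'C:\dump\MyTable.fmt', TABLOCK);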
Isn't there a query option to disable/configure log writes for cases like this?

Why were folks concerned about the transaction logs?

Yup, that's exactly it. This is an ease-of-use option for when you need to ingest somewhat small amounts of data into another database for joining. For large amounts of data, copying the CSV via a copy command should be your best bet.

Yeah. The INSERT-statement approach fails when dealing with larger datasets, and the parser incurs significant overhead. The most efficient method for bulk data loading is the COPY INTO command, as in Databend[1] or Snowflake[2].

[1] Databend: https://docs.databend.com/guides/load-data/load-semistructur...
[2] Snowflake: https://docs.snowflake.com/en/sql-reference/sql/copy-into-ta...
[3] Data Ingestion Benchmark: https://docs.databend.com/guides/benchmark/data-ingest
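For reference, a minimal Snowflake COPY INTO looks something like the following (table, stage and file names are placeholders; Databend's COPY INTO is broadly similar):

COPY INTO my_table
FROM @my_stage/unloaded_data.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);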
This is neat! Correct me if I'm wrong, but don't a lot of systems already have the ability to ingest CSV data? Why the intermediate step?

It's helpful for the ones that don't have this yet, like Redshift (for full disclosure, I'm a PM there). If your system does have this already, though, you probably just want to use that.

Perhaps ease of troubleshooting? I know dealing with line endings, mystery encodings, etc. can make CSV ingestion tedious.

This, as far as I can tell, completely ignores all of that. You're supposed to export the data yourself as CSV before beginning, so it's up to you what character encoding, delimiter, escape characters, etc. you will be using. The CSV is fed directly into pd.read_csv with no option to pass any arguments along in case you have a slightly non-standard CSV format.
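To make the non-standard-CSV point concrete, these are the kinds of read_csv arguments you'd want to be able to pass through (a sketch; the file name and option values below are made up):

import pandas as pd

# e.g. a semicolon-delimited, Latin-1 encoded export with backslash escapes
df = pd.read_csv(
    "export.csv",
    sep=";",
    encoding="latin-1",
    escapechar="\\",
    quotechar='"',
    na_values=["NULL", ""],
)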
This is a good tool if it solves this guy's problems, but as a data engineer I would suggest he learn the tooling a little more. He makes reference to Snowflake and BigQuery, and I know they both allow you to upload a CSV into a table and auto-detect or infer the schema very easily. That would be a better way than using pandas and copy-paste.

Fair callouts. For the first version I was looking to solve the basic use case of "I've downloaded a standard CSV from one database and I want to ingest it into another". I'll look at adding the ability to handle non-standard CSVs as well. I also appreciate you pointing out what Snowflake and BigQuery offer here. I'm wondering if there's something Redshift-specific that may be useful for larger file volumes. It is more performant to do a CSV copy, but it can be a PITA when it comes to having a role with the right permissions to copy data from the right location. Anything else you'd recommend I look into?

Yeah, I'm not sure about Redshift, but BigQuery uses autodetect, so something like:

bq load --autodetect --source_format=CSV mydataset.mytable ./myfile.csv

And Snowflake uses INFER_SCHEMA. I believe you can do this:

select * from table(
  infer_schema(
    location=>'@stage/myfile.csv',
    file_format=>'my_csv_format'
  )
);

Although tbh I'm not sure if that's what you're looking for.

You might enjoy looking at duckdb for stuff like this. My policy when starting data engineering was to bung everything into pandas dataframes, and now my policy is to try to avoid them at all costs because they're slow and memory hungry!
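A duckdb equivalent, for comparison (a sketch; file and table names are placeholders) — run inside the duckdb CLI, it sniffs the header, delimiter and column types straight from the file:

CREATE TABLE example AS SELECT * FROM read_csv_auto('example.csv');
SELECT * FROM example;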
In ClickHouse it is just `INSERT INTO t FROM INFILE 'data.csv.gz'`. Any supported format, any compression, autodetected from the file name and a sample piece of the data to get column types, delimiters, etc.

Separate tools to convert CSV are not necessary if you can just import into the db and export as SQL statements:

echo "name,age,city
John,30,New York
Jane,25,Los Angeles" > example.csv

clickhouse local -q "SELECT * FROM file('example.csv') FORMAT SQLInsert"
INSERT INTO table (`name`, `age`, `city`) VALUES ('John', 30, 'New York'), ('Jane', 25, 'Los Angeles');

There are many browser-based solutions for quick CSV-to-SQL as well, e.g. www.convertcsv.com/csv-to-sql.htm

One can do this with the sqlite3 CLI tool.
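For example, something along these lines (a sketch reusing the example.csv from the ClickHouse comment above):

sqlite3 :memory: <<'EOF'
.mode csv
.import example.csv example
.mode insert example
SELECT * FROM example;
EOF

This prints ready-to-paste INSERT INTO example VALUES(...) statements.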