Migrating 1200 databases from MySQL to Postgres

psyenix.blogspot.com

250 points by yupyup 8 years ago · 84 comments

seanharr11 8 years ago

I wrote an open source tool that migrates between any 2 relational databases.

https://github.com/seanharr11/etlalchemy

This includes automatic migration of the schema, constraints, indexes and obviously data. It leverages fast bulk loading tools like mysqlimport, and PostgreSQL's COPY FROM. It doesn't need any config. Just plug and play.

The tool uses SQLAlchemy to abstract column types, and handles various column type conversions between any flavor of SQL.

The main things it still needs are tests, and scalability support for tables bigger than a few GB!

willcodeforfoo 8 years ago

I'm working on migrating an app right now (just one MySQL database) but can highly recommend [pgloader](http://pgloader.io/). It has a bunch of built-in and configurable rules to cast different column types to PostgreSQL and the developer has been very responsive in helping me make weird column types work, too.

  • erjiang 8 years ago

    Pgloader can be very powerful. Despite never having used Common Lisp, I found it easy to add custom column conversion functions without modifying pgloader itself. You can write additional Lisp files with custom procedures and load them at runtime.

  • ganeshkrishnan 8 years ago

    If I may ask, what are your reasons to migrate from MySQL to Postgres?

    We run MySQL in production and I'm in the process of de-normalizing it. If I had compelling reasons I might push for a migration to Postgres, but I never found any genuine ones. Uber even migrated the opposite way, from Postgres to MySQL (although I think Reddit engineers disagreed with their reasons for the move).

  • lbotos 8 years ago

    +1 for PGLoader. Still wrestling with getting it to work for a conversion, but the experience has been a pleasure so far and surprisingly easy.

sigi45 8 years ago

I like the approach and think something like this is fun, but I don't understand the business case at all.

I would have migrated MySQL 5.5 to 5.6 first, then cleaned the shit out of it, normalized everything, added a few indexes, and suddenly realized that there is not much left to do.

I like working with PostgreSQL, but if you have 1.2k DBs which are more or less not cleaned up and crappy, what are they used for anyway?

  • richdougherty 8 years ago

    I agree. It sounds like they decided to do a cleanup ahead of an investment, without much of a real business case. Your migration path seems much, much simpler - and that's assuming the cleanup is even needed in the first place.

SOLAR_FIELDS 8 years ago

One thing author did not touch on: since the tables were denormalized-ish in the original MySQL DB, did his application lose significant performance by having to perform the joins for every single query in the renormalized PostGres instance? Or were the DB's small enough and indexed properly enough so it didn't really matter?

Might have been worth testing this to see whether it was worth materializing certain views and refreshing them every 15-30 minutes or so.

  • koolba 8 years ago

    Alternatively did the application speed up after it was normalized? A normalized data set can be substantially smaller which may allow the entire data set to fit in memory.

    Also, it's officially "PostgreSQL" or informally "Postgres". Not "PostGres".

  • Amezarak 8 years ago

    At least in my experience, joins in 100s of GB databases are not a performance problem. I've always been somewhat baffled by the claims I see pretty regularly that joins are a major performance concern. Was this a big deal in the past?
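As a toy illustration of the point above — that joins over properly indexed keys are cheap — here is a minimal sketch using SQLite via Python's stdlib (standing in for Postgres purely to keep the example self-contained; all table and column names are invented):

```python
import sqlite3

# Toy normalized schema: orders reference customers through an
# indexed foreign key, so the join is resolved by index lookups
# rather than full scans.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
conn.execute("INSERT INTO customers VALUES (1, 'alice'), (2, 'bob')")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('alice', 15.0), ('bob', 7.5)]
```

The same shape scales to the 100s-of-GB case the commenter mentions: as long as the join columns are indexed, the per-row cost is an index probe, not a scan.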

berns 8 years ago

One thing that Postgres lacks is accent insensitive collations. Having clients with databases in Spanish, this is one of the reasons I wouldn't consider migrating to Postgres. I know I can use the unaccent extension, but I consider it a poor substitute to proper collations. I guess this isn't a problem for most people because it's never mentioned.

  • claar 8 years ago

    It appears work is active (May 2017) in this area, but accent-insensitive collations are still a ways off: https://blog.2ndquadrant.com/icu-support-postgresql-10/

    edit: fulafel corrects me below that you can get accent-insensitive collations working using PostgreSQL 10's new ICU collation support. Note that PostgreSQL 10 is currently pre-release.

    • fulafel 8 years ago

      That post says that they currently work, but there is a new os-independent collation method coming.

      Edit re your edit: the already working collations are in PG 9.x - the OS independent collations are coming in 10.x. (Hence the post title, "More robust collations with ICU support in PostgreSQL 10")

      • fauigerzigerk 8 years ago

        But do these "already working" collations include accent-insensitive variants? I didn't think so, and I don't think they are coming in PostgreSQL 10 either. Do you have other information on this?

        • petergeoghegan 8 years ago

          They do. The problem is that equality is still strict binary equality for text, for obscure technical reasons. So, while you can get extremely flexible customizations to collation with icu [1], you can't have case insensitive equality, or equality that ignores accents. This is recognized as a problem, and is probably going to be fixed in v11.

          [1] https://www.postgresql.org/message-id/CAH2-Wz=bcgmk97YaZ3rs4...

          • anarazel 8 years ago

            > The problem is that equality is still strict binary equality for text, for obscure technical reasons

            AKA performance.

        • fulafel 8 years ago

          My misunderstanding seemed to be that "accent-insensitive" turns out to be a precise technical term from MySQL that says "just sort UTF-8 using the American locale and collation, but pretend accented characters are equal to their non-accented versions" which is not a standard collation order in any locale.

          • berns 8 years ago

            I think this is incorrect. In Mysql utf8_spanish_ci and utf8_swedish_ci have different accent rules, for example.

            • fulafel 8 years ago

              But then what would be the case for using accent-ignoring spanish collation rules, rather than the default spanish collation rules that sort accented chars according to normal spanish sorting? This sounds strictly inferior in any human facing application.

  • fulafel 8 years ago

    The docs say that locales and LC_COLLATE work, are you sure?

  • bmn__ 8 years ago

    > One thing that Postgres lacks is accent insensitive collations.

    So what if the functionality is not in core, but in an extension? You make it sound as if it does not exist at all and all we can do is gnash our teeth in frustration.

    http://www.public-software-group.org/pg_collkey

    first release in 2006 for PostgreSQL 8

  • karmakaze 8 years ago

    Why not just use a functional index? I'd expect it to have much better performance.
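For a sense of what the expression behind such a functional index would do, here is a rough Python approximation of unaccent-style folding (decompose to NFD, drop combining marks). As fulafel notes above, this is crude ASCII-folding, not a locale-aware collation:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Crude accent folding: NFD-decompose, then drop combining marks.

    Approximates Postgres's unaccent extension / MySQL's
    accent-insensitive *_ci collations. It is NOT locale-aware and
    will mis-handle languages where accented letters are distinct
    letters in their own right (e.g. Swedish).
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café"))          # cafe
print(strip_accents("Señor Müller"))  # Senor Muller
```

On the Postgres side, the suggestion would pair a function like this (or the unaccent extension itself) with an expression index, so accent-insensitive lookups stay fast without needing a special collation.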

medius 8 years ago

If you are migrating to AWS RDS, I recommend AWS Data Migration service. I migrated my live database (~50GB) from Mysql to Postgres (both RDS) with zero downtime.

I used AWS Schema Conversion Tool for initial PG schema. I customized the generated schema for my specific needs.

seanharr11 8 years ago

A few of these steps could be solved with this tool, including schema/index migrations, and even initial (fast) data transfer: https://github.com/seanharr11/etlalchemy

tkyjonathan 8 years ago

I'm not against the idea of using Postgres, but I have a sneaking suspicion that you could have gotten that 30% speed benefit from MySQL if you did some performance work on it. In other words, I don't feel that migrating from one relational database to another similar relational database was worth that much trouble.

  • collyw 8 years ago

    That's interesting, as my dev experience would allow me to tune queries better in Postgres (the visual query planner makes it easier and there are more join types), while tuning through configuration on either database would involve a lot more trial and error. I would be more likely to know what to do after migrating.

aidos 8 years ago

I went through a similar process a few years ago (4k dbs into a single db, all mysql). You'll get something working fairly quickly and then spend days debugging data anomalies that have crept in over the years.

We ended up with a frankenstein mixture of bash scripts, a sequence of sql transformations, some python and some php to decode some of the data that had been stored in pickled / php encoded packages.

It's not something I want to repeat any time soon :-)

  • fermuch 8 years ago

    It truly sounds like a nightmare. How long did it take to move everything to psql, and how long until it was stable enough that you didn't have to keep an eye on it?

    • aidos 8 years ago

      To clarify, it all stayed on mysql — just pulled everything into a single db.

      I can't remember exactly but the whole project was done in a real hurry, about 4 weeks I think. It included a rewrite of the application / website (about 20k lines of backend code, html and js / css), migration of all the data, creation of new transcoders, migration of a lot of data asset/video data from another server.

      No idea how we got through it all in retrospect, but I had to. A few days after we finished I got married and went off on my honeymoon, so I guess I decided it was stable enough the day I released it...! :-)

JelteF 8 years ago

One of the main things I ran into when migrating from MySQL to postgres was that the default text columns (TEXT, VARCHAR, etc) behave differently when searching. MySQL matches text case insensitive, while postgres matches it case sensitive. This resulted in searches for content suddenly not returning data. Luckily there's an official CITEXT extension for postgres [1], which matches text case insensitive and adds the correct indices for doing so efficiently.

[1] https://www.postgresql.org/docs/9.1/static/citext.html

  • pasta 8 years ago

    "MySQL matches text case insensitive"

    This is of course only when you set a case insensitive collation (_ci).

    But I agree this can be a problem because case insensitive collations are used a lot in MySQL.

    Other collations can be trouble too. For example, the collation that returns results for both 'ß' and 'ss' (German)
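The matching difference described above is easy to reproduce in miniature. In this sketch, SQLite's NOCASE collation plays the role of MySQL's *_ci collations / Postgres's CITEXT (the table name is invented, and NOCASE only folds ASCII, but the behavioral split is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two columns holding the same value: one with the default binary
# (case-sensitive, Postgres-like) collation, one declared NOCASE
# (roughly MySQL's *_ci behavior, or Postgres CITEXT).
conn.execute("""
    CREATE TABLE t (
        plain  TEXT,
        folded TEXT COLLATE NOCASE
    )
""")
conn.execute("INSERT INTO t VALUES ('Hello', 'Hello')")

# Case-sensitive column: a lowercase search term finds nothing.
sensitive = conn.execute(
    "SELECT COUNT(*) FROM t WHERE plain = 'hello'").fetchone()[0]
# Case-insensitive column: the same search term matches.
insensitive = conn.execute(
    "SELECT COUNT(*) FROM t WHERE folded = 'hello'").fetchone()[0]
print(sensitive, insensitive)  # 0 1
```

This is exactly the trap after a MySQL-to-Postgres migration: queries that silently relied on *_ci matching start returning nothing until the columns are switched to CITEXT (or the queries lower() both sides).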

ehxcaet 8 years ago

I'm not entirely sure what the technical reason for this change was. Of course, cleaning up your DBs and whatnot is a good thing. But why did they move to Postgres instead of simply cleaning up?

noir_lord 8 years ago

Would love to do this to the monstrosity I inherited at my new job, but it's a Sisyphean task.

One day.

  • cntlzw 8 years ago

    One time I saw a database, with a single table, with a single text column that contained the old database.

    • shrikant 8 years ago

      This sounds similar to what I saw at a client a few years back. All XML responses from a REST API were inserted into a single VARCHAR(MAX) column (one row per response, but a response could have multiple "records") in a SQL Server database.

      All interaction with this was through a byzantine web of XPath expressions using the severely neutered dialect of XQuery that shipped with SQL Server 2008.

      Apologies for the mini-rant, your comment appears to have triggered some painful memories.

    • ishi 8 years ago

      It's databases all the way down...

  • overcast 8 years ago

    You should see the one here. An old flat file database monstrosity from the 80s, imported into oracle, complete with all of the duplicated data. It's the most non relational, relational database ever.

    • zimpenfish 8 years ago

      Would a database stored in an XML file shared between 60+ clients over NFS beat that for monstrositiness?

      • overcast 8 years ago

        Well, this is used now by about 4000 clients. So you be the judge. There are so many indexes and so much spaghetti, just to make sure every possible "relationship" is accounted for. It's amazing it works at all.

        Just using an XML file is pretty crazy though.

      • cntlzw 8 years ago

        How concurrent is that? Can't imagine how writing to the file is even achieved.

        • zimpenfish 8 years ago

          "Badly". At least once a week someone managed to overwrite a change another client had just made. When I left, they were running a cron to copy it every minute as a "quick restore" when it happened.

        • shoo 8 years ago

          maybe there is an email thread shared between all clients for acquiring / releasing the write lock on the xml file

          • K-Wall 8 years ago

            Semaphore as a service Slack Chat bot.

            • noir_lord 8 years ago

              Seen that.

              Shared drive with excel files, people would shout at each other to 'release the lock'

              • zimpenfish 8 years ago

                Yeah, first real job had the customer database as a VB frontend onto an Access file shared over {whatever Win95 was using in 1996-97}. After a few snafus, the CS / Sales people learned that they had to single-task with the database.

      • abraae 8 years ago

        That made me snort my coffee.

      • jboy55 8 years ago

        Lol, if it sent emails, I might know of such a system

      • toomuchtodo 8 years ago

        This is a special level of hell.

    • noir_lord 8 years ago

      I have 177 tables, 764 stored procedures and 44 joins.

      Every single table has either a uniqid() or a UUID() which isn't the primary key (they all have incrementing integer keys) but is 'joined' on in code.

      When he needs to 'join', he uses coalesce() on selects.

      Referential integrity doesn't exist.

      Oh, and the cherry on the shit sandwich: we have four separate systems that talk to this database, written in three different languages (Java (Android scanners), C# (factory scanners), and PHP (the main system)).

      So there is no way to accurately know which SP's are used by what without grepping the entire codebase looking for call <foo>.

      The PHP is written badly in the old PHP pure procedural style running on an outdated version of PHP/Debian/MySQL.

      Basically if you took the absolute worst approach to everything this would be the end result.

      It's good fun!

TekMol 8 years ago

Is it a universally accepted truth now that it's generally better to use Postgres than MySQL?

Would anybody here use MySQL for a new project? If so, why?

  • junkilo 8 years ago

    Having used both from the prototyping stage to very large scale my answer: It doesn't matter.

    They are both plenty feature complete to work well for relational data type problems. Grab the one you have more exposure to and you'll be fine.

    That said, I'm still using mysql-innodb (latest percona) because the built-in replication has had more time to bake and its concurrency model (think 100k qps+) is plain better. For further reading see:

    https://eng.uber.com/mysql-migration/

  • falcolas 8 years ago

    Yes, I would use MySQL. I'm very familiar with it for one, and the extra storage engine options turn a sure thing into a no-brainer decision.

    Ultimately, I think familiarity is and will always be the biggest driver for experienced devs who choose a DB.

    • jeffdavis 8 years ago

      What storage engines and options do you use?

      • falcolas 8 years ago

        InnoDB is my default goto, TokuDB when on-disk size is a consideration, Archive DB for the exceptional cases where I need even more, rarely accessed, data on disk, and MyISAM for its full text searching capabilities (usually de-normalized from a backing InnoDB table). Those are the big ones.

        There's also NDB and Galera Cluster, but you want a specific use-case to justify the added complexity and overhead.

        I've seen Blackhole used in production, but that was an odd one involving a... unique... replication setup. The CSV engine is pretty awesome for huge data imports, but again, pretty unusual for actual use.

      • forgot-my-pw 8 years ago

        XtraDB if you can, InnoDB if you can't.

  • leesalminen 8 years ago

    I chose MySQL (well, MariaDB) for a project that started in 2013. I chose it because I have the most experience maintaining it and have a pretty good handle on how to tune.

    • porker 8 years ago

      Ditto. I still choose it for small to medium sized webapps because - let's face it - the data storage is not their pain point.

      MySQL is fast enough, usually has enough features, I know how to tune it and queries, and I normally only bump into 2-3 things in a project which would've been easier with PostgreSQL. And can work around them.

      The backwards-incompatible SQL_MODE changes between MySQL 5.6 and 5.7 made me reconsider, as legacy apps needed updating.

      tl;dr: yes. But PostgreSQL would be easier for some items

      • mseebach 8 years ago

        > The backwards-incompatible SQL_MODE changes between MySQL 5.6 and 5.7 made me reconsider, as legacy apps needed updating.

        Uhm, err, that change is to stop the database from silently throwing away your data. It's not about "fast enough" or "easy". It's about your database actually saving your data.

        If you're happy for your "database" to keep quietly throwing away data in order to keep legacy apps "working", you can just disable strict mode.

        • porker 8 years ago

          It's not just changes to stop data being thrown away. This issue has been painful: https://github.com/doctrine/doctrine2/pull/6143

          Across multiple businesses, they've upgraded servers and got MySQL 5.7 installed, only to wonder why the app has stopped working. People didn't expect that from a point release/the default new operating system package. We could argue the point, but it's the expectation of businesses vs software engineers.

  • brightball 8 years ago

    Nope, because using MySQL instead of Postgres essentially cripples your dev tools once you know about everything PG opens up for you.

    http://www.brightball.com/articles/why-should-you-learn-post...

  • sandGorgon 8 years ago

    Most e-commerce systems start off on Magento, which is MySQL-only. So a LOT of MySQL projects are simply the answer to:

    1. Did you use magento/Drupal/WordPress/etc PHP framework

    2. Did you host on a 20$ per month server with unlimited bandwidth and free database ?

  • e12e 8 years ago

    Like other comments here state, in general, use what you know. That said, postgresql has ~always been the better rdbms. Some would argue mysql at times have been a better ~key-value store with ~sql interface.

    Apparently postgresql is difficult to run in a properly restricted bsd jail. Other than that, I'd be hard pressed to come up with reasons to prefer mysql on technical merit over either postgresql, or, say redis or something.

    • Avernar 8 years ago

      I'm running postgresql in a FreeBSD vnet jail without issue. The network stack is not shared as it's a vnet jail. I have it set for per jail sysvipc, new in FreeBSD 11, which was the last thing I needed to fully isolate it from other jails.

      Is there anything else that's needed to properly restrict the jail?

      • e12e 8 years ago

        Mostly that afaik a jail with sysvipc set isn't really isolated any more - it works, but defeats some of the purpose of using a jail in the first place?

        https://forums.freebsd.org/threads/59371/

        Apparently this can(should?) be tweaked a bit since 11-release:

        https://www.freebsd.org/cgi/man.cgi?query=jail&sektion=&n=1

        > allow.sysvipc A process within the jail has access to System V IPC primitives. This is deprecated in favor of the per-mod- ule parameters (see below). When this parameter is set, it is equivalent to setting sysvmsg, sysvsem, and sysvshm all to ``inherit''.

        I interpret that as meaning that by setting this to (the equivalent of) the deprecated setting, one loses much of the protection a jail normally gives.

  • deathanatos 8 years ago

    https://grimoire.ca/mysql/choose-something-else

    That's essentially my reasons for preferring PostgreSQL.

nrmitchi 8 years ago

> Now is when one of the main reasons to take Postgres as our new database server. Postgres allows the use of Views(include link) supporting INSERT, UPDATE and DELETE meanwhile the relation between views and tables are one to one, ....isn’t it awesome?

Just a heads up, I think you missed a link in there

ibejoeb 8 years ago

When talking about replaying live activity on the test system:

>We took the third option because the other two options were quite intrusive and they might interfere at the normal use of our app because in both cases they were going to be in between our users request and our app

One of those two rejected options was goreplay. It is passive, like tcpdump. Right from the docs:

"Gor is not a proxy: you do not need to put 3-rd party tool to your critical path. Instead Gor just silently analyzes the traffic of your application and does not affect it anyhow."

StreamBright 8 years ago

Now that is what I call a proper migration. Great content! I was wondering how somebody could end up with corrupt data; I guess it was probably due to MySQL not strictly enforcing types.

ris 8 years ago

https://github.com/the4thdoctor/pg_chameleon

adrianlmm 8 years ago

I tried to migrate from Firebird to Postgres once and failed; Postgres needs more administration than Firebird.

ZeusNuts 8 years ago

What was the reason to migrate first and then repair the data? The repair scripts could have been run against the old setup too, and you would have had less stuff to migrate.
