Moving Half a Million Database Tables to AWS Aurora (Part 2)

Quick recap: Migrate half a million database tables from a single bare metal server with 1 database to 101 database slices on AWS Aurora.

Wait, half a million database tables?! Answered in Part 1.

Plan

1. Stop the server, take an LVM snapshot (a rough sketch of this step follows the list).
2. Use mydumper to dump the snapshot to SQL files.
3. rsync these to the new server.
4. Use myloader to load the SQL files into the new databases.
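
For the curious, step 1 looked roughly like this. A minimal sketch only, assuming the MariaDB data directory lives on an LVM logical volume; the volume group, logical volume, and snapshot size are hypothetical, and pointing the dump at the snapshot (e.g. a throwaway MariaDB instance using the frozen datadir) is left out:

# Stop writes so the snapshot is consistent, then snapshot the data volume
sudo systemctl stop mariadb
sudo lvcreate --size 50G --snapshot --name mysql-snap /dev/vg0/mysql  # size = expected churn while the snapshot exists

# Mount the snapshot read-only; the frozen datadir is what gets dumped
sudo mkdir -p /mnt/mysql-snap
sudo mount -o ro /dev/vg0/mysql-snap /mnt/mysql-snap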

A more detailed view of our plan as seen in our migration kanban

Captain Not So Obvious

Why not set up Aurora as a replica and then switch over?

Because our MariaDB server was a bare metal box outside of AWS. The read replica docs imply that MySQL has to already be in AWS for that to work. If that's not enough, this doc says to use mysqldump to start, then sync after. This doc also says to use mysqldump. All signs point to nope.

Why not DMS?

Answered in Part 1.

Mostly, at the end of the day, because our hosted networks are already on AWS, it was simply more cost-effective to shut down our freemium site and migrate in one fell swoop than to have our whole team keep at this for weeks, possibly months.

Epilogue: What About Uploads?

Each book has media library files (GIF, PNG, JPG, EPUB, PDF, etc.). A few days before the migration, we copied all files from the production server’s uploads/ directory using rsync:

rsync -avz -e "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" someuser@oldpressbooksdotcom:/path/to/uploads/ /path/to/uploads/ --progress

This process took about 10 hours.

Then, on migration day, we ran the same command again with the --delete option to update the new server with the latest files from the old server and remove any files that had been deleted on the old server:

rsync -avz --delete -e "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" someuser@oldpressbooksdotcom:/path/to/uploads/ /path/to/uploads/ --progress

Much quicker! (around 7 minutes)

Launch it!

“If we get into the trees it could be rather disastrous, so we’ve got to hit the roses.” – Ken Carter

Scripts from Part 1 (read it already!) were modified to include Slack notifications:

notify() {
  # Build the JSON payload for Slack's incoming webhook
  read -r -d '' payLoad << EOF
  {
    "channel": "#operations",
    "username": "Pressbot",
    "icon_emoji": ":closed_book:",
    "text": "Slice \`${1}\` has been imported on AWS."
  }
EOF

  # POST the payload; print only the HTTP status code
  curl \
    --write-out '%{http_code}' \
    --silent \
    --output /dev/null \
    -X POST \
    -H 'Content-type: application/json' \
    --data "${payLoad}" "https://hooks.slack.com/services/<SLACK_WEBHOOK_ID>"
}

# Usage

notify "$slice"
Slack-enhanced scripts.

To reduce downtime, we imported each slice as soon as it finished transferring; dumping ran faster than importing.
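
In practice that was a small wrapper loop around myloader and the notify() function above. This is only a sketch, assuming the dump host drops a .done marker file in each slice directory once its rsync completes (the marker convention is hypothetical):

#!/bin/bash

db='new_database_name'

for ((i=0; i<=99; i++)); do
  ii=$(printf %02d $i)
  # Wait until this slice has finished transferring
  while [ ! -f "${ii}/.done" ]; do sleep 30; done
  sudo myloader --directory="${ii}" --database="${db}_${ii}" --overwrite-tables
  notify "${ii}"  # ping #operations via Slack
done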

Pressbot
Still slacking!

Ned working hard:

Ned
screen -r, control+a control+d, repeat

All while coding sprint tasks in between.

Things That Went Wrong

We noticed an embarrassing typo in the first few database slices we imported. We had to redo them because renaming a database with tens of thousands of tables in it is not straightforward (see the sketch below).
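
For context: MySQL/MariaDB has no supported RENAME DATABASE statement, so the usual workaround is to create a correctly named database and move every table into it one by one. A rough sketch of that workaround, with hypothetical database names and credentials left out:

#!/bin/bash

old='db_typo'   # hypothetical: the misnamed slice
new='db_fixed'  # hypothetical: the corrected name

mysql -e "CREATE DATABASE IF NOT EXISTS \`${new}\`;"

# Generate one RENAME TABLE statement per table and feed them back to mysql
mysql -N -e "SELECT CONCAT('RENAME TABLE \`${old}\`.\`', table_name, '\` TO \`${new}\`.\`', table_name, '\`;')
  FROM information_schema.tables WHERE table_schema = '${old}';" | mysql

With tens of thousands of tables per slice, re-running the import was the saner option.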

I ordered takeout from the wrong fish & chips shop. I had to take the subway 30 minutes downtown to get it. (Psst, Foodora: your geolocation feature sucks!)

Otherwise, nothing. We landed in the roses.

Timeline

  • 8:00: Migration started.
  • 10:40: Database migration started.
  • 19:10: Database migration completed!
  • 19:30: Migration completed.
  • Total time: 11 hours 30 minutes.

DONE. Exciting to turn the page on this. Thanks for reading.

Moving Half a Million Database Tables to AWS Aurora (Part 1)

This post is about migrating Pressbooks.com to AWS.

Does It Scale?

At Pressbooks we use WordPress Multisite as a development platform. Pressbooks changes WordPress and makes every blog a book.

The prevailing wisdom of the day is that a relational database should have a manageable set of tables with lots of rows of data in them …and then there’s WordPress.

WP Multisite creates 10 database tables for every blog (or in our case: for every book). Tables share IDs, but the relationships are not enforced at the database level. There are no foreign key constraints, no triggers, and no routines. It's simple, no-frills MySQL.
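
Concretely, each book gets its own copy of the standard per-site tables. A quick way to eyeball one book's set (the blog ID 42 and the database name are placeholders, the table names are stock WordPress, and the backslashes stop MySQL from treating the underscores as wildcards):

# Expect roughly ten tables: wp_42_posts, wp_42_postmeta, wp_42_comments,
# wp_42_commentmeta, wp_42_links, wp_42_options, wp_42_terms, wp_42_termmeta,
# wp_42_term_taxonomy, wp_42_term_relationships
mysql -N -e "SHOW TABLES LIKE 'wp\_42\_%';" old_database_name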

Pressbooks dot com is currently running on a single bare metal server. (Our hosted instances are already on AWS.) This server has a single MariaDB database with 60,000 books in it. When we do the math, that's over 600,000 tables in one database. Are you nuts?! Unusual? Horrible? Yet entirely possible, even plausible.

I’ve met people who worked at Automattic, and from the stories I heard, WordPress dot com uses the same WP Multisite technology, but instead of half a million tables it’s over a billion tables (probably not all in the same database though, more on that later).

I call this the Schrödinger’s Cat of database design, because as long as we don’t look, it’s alive?

Prerequisites

As I said, we already host hundreds of WP Multisite networks on AWS. We build, manage, and deploy to our infrastructure using Terraform, Ansible, wp-cli and all things competent. We simply just, sort of, well, neglected to move our freemium site over because we were too busy.

The time has come.

Research

We tried mysqldump. It was too slow. Our tests showed that a dump would take days.

Some colleagues recommended AWS DMS. It did not work for us. Some reasons:

  • The AUTO_INCREMENT attribute is not migrated.
  • WP Multisite tables are prefixed wp_1_, wp_2_, wp_3_, ... MySQL treats the underscore as a single-character wildcard in LIKE patterns, and DMS provided no way to escape it in table filters (see the example after this list).
  • Unexplained crashing on anything smaller than a c4.large (the task failed with: Last Error: The task stopped abnormally. Stop Reason: RECOVERABLE_ERROR. Error Level: RECOVERABLE).
  • Mostly, the fastest migration we could get going, running 4 tasks in parallel, had an ETA of 10 days.
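
To see the wildcard problem concretely, here is a quick check you can run against any MySQL server (the table name in the example is made up; only the pattern matching matters):

# A filter meant for blog 1's prefix also matches blog 11, 21, 31, ...
# unless the underscores are escaped. Returns 1 and 0 respectively.
mysql -e "SELECT 'wp_11_posts' LIKE 'wp_1_%'   AS unwanted_match,
                 'wp_11_posts' LIKE 'wp\_1\_%' AS escaped_match;"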

We asked for help. Ned had a conference call with a reputable consulting firm and they gave us a quote: $34K USD + travel & on-site expenses.

Coffee spitting

Next, our research led us to mydumper. From the README:

== What is mydumper? Why? ==

  • Parallelism (hence, speed) and performance (avoids expensive character set conversion routines, efficient code overall)
  • Easier to manage output (separate files for tables, dump metadata, etc, easy to view/parse data)
  • Consistency – maintains snapshot across all threads, provides accurate master and slave log positions, etc
  • Manageability – supports PCRE for specifying database and tables inclusions and exclusions

So far so good…

== How to build it? ==

Jerry Seinfeld leaving

Just kidding. It turns out we don’t have to build mydumper. On Ubuntu, sudo apt install mydumper works fine; there’s a similar command using yum on CentOS.
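
For reference, the package-manager one-liners (assuming mydumper is available in an enabled repository on your distro, e.g. universe on Ubuntu or EPEL on CentOS):

sudo apt install mydumper   # Ubuntu / Debian
sudo yum install mydumper   # CentOS / RHEL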

Our tests concluded that mydumper finishes in hours instead of days.

Plan

It is our opinion that this kind of problem is better suited for a document-oriented database. Given that this is the database design we inherited, there’s not much we can do about it, so we’ll try our best with what we’ve got. ¯\_(ツ)_/¯

At a billion tables, Automattic has already established its own internal best practices with plugins like HyperDB. Unfortunately, HyperDB doesn’t have Composer support and doesn’t look maintained. LudicrousDB, a Composer-compatible drop-in that works with our existing tech stack, to the rescue.

LudicrousDB is an advanced database interface for WordPress that supports replication, fail-over, load balancing, and partitioning, based on Automattic’s HyperDB drop-in.

With LudicrousDB tested and working, we are moving towards a 101-slice approach: 1 slice for core tables and 100 slices for books.

The idea is to use the last two digits of a book ID to pick one of 100 slices. If this becomes unmanageable in the future (important to remember that we already have over half a million tables in 1 database and things are fine), we can change the splitting algorithm by adding a condition to use the last X digits on books with IDs bigger than Y.

Code

For informational purposes only. Read the snippets and reason about them. Copy/paste at your own peril.

LudicrousDB Callback

/**
 * Slices
 *
 * We can predict what slice a blog is in by looking 
 * at the last two digits of the id. Examples:
 *
 * + blog_id: 9, in db09
 * + blog_id: 74, in db74
 * + blog_id: 999989, in db89
 * + blog_id: 9200, in db00
 *
 * @param $query
 * @param \LudicrousDB $wpdb
 *
 * @return string
 */
function pb_db_callback( $query, $wpdb ) {
  if ( preg_match( "/^{$wpdb->base_prefix}\d+_/i", $wpdb->table ) ) {
    $last_two_digits = (int) substr( $wpdb->blogid, -2 );
    $db = sprintf( 'db%02d', $last_two_digits ); // db00, db01, db02, ..., db99
    return $db;
  } else {
    return 'global';
  }
}
$wpdb->add_callback( 'pb_db_callback' );

Export DB Into 101 Slices:

#!/bin/bash

# This script will CREATE 101 directories in current 
# working directory, you have been warned!

db='old_database_name'

sudo mydumper --regex="^${db}\.wp_[a-zA-Z]+.*" --database="${db}" --outputdir="core" --build-empty-files
for ((i=0; i<=99; i++)); do
  ii=`printf %02d $i`
  sudo mydumper --regex="^${db}\.(wp_${i}_|wp_\d+${ii}_).*" --database="${db}" --outputdir="${ii}" --build-empty-files
done

Import 101 Slices:

#!/bin/bash

# This script will READ 101 directories in current 
# working directory

db='new_database_name'

sudo myloader --directory="core" --database="${db}" --overwrite-tables
for ((i=0; i<=99; i++)); do
  ii=`printf %02d $i`
  sudo myloader --directory="${ii}" --database="${db}_${ii}" --overwrite-tables
done

Did it work?

To be continued in Part 2…

Bonus tips:

Because Pressbooks has so many MySQL tables, the clients I use are always getting stuck or freezing. Here are some tricks I use to keep sane:

  • Don’t let MySQL Workbench load the table schemas. Set up your GUI so that schemas are in a separate tab, disable autoloading, autocomplete, etc. (Edit ⇨ Preferences ⇨ SQL Editor)
  • Disable MySQL CLI auto-completion with --disable-auto-rehash
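
For example, a typical invocation might look like this (the host and user are placeholders; the -A flag has the same effect):

mysql --disable-auto-rehash -h aurora-cluster.example.com -u someuser -p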