Moving Half a Million Database Tables to AWS Aurora (Part 1)

This post is about migrating Pressbooks.com to AWS.

Does It Scale?

At Pressbooks we use WordPress Multisite as a development platform. Pressbooks changes WordPress and makes every blog a book.

The prevailing wisdom of the day is that a relational database should have a manageable set of tables with lots of rows of data in them …and then there’s WordPress.

WP Multisite creates 10 database tables for every blog (or in our case: for every book.) Tables share IDs but they are not enforced at the database level. There are no foreign key constraints, no triggers, nor routines. It’s simple no frills MySQL.

Pressbooks dot com is currently running on a single bare metal server. (Our hosted instances are already on AWS) This server has a single MariaDB database with 60,000 books in it. When we do the math that’s over 600,000 tables in one database. Are you nuts?! Unusual? Horrible? Yet entirely possible, even plausible.

I’ve met people who worked at Automattic and from the stories, I heard WordPress dot com uses the same WP Multisite technology but instead of half a million tables it’s over a billion tables (probably not all in the same database though, more on that later).

I call this the Schrödinger’s Cat of database design because so long as we don’t look it’s alive?

Prerequisites

As I said, we already host hundreds of WP Multisite networks on AWS. We build, manage, and deploy to our infrastructure using Terraform, Ansible, wp-cli and all things competent. We simply just, sort of, well, neglected to move our freemium site over because we were too busy.

The time has come.

Research

We tried mysqldump. It was too slow. Our tests showed that a dump would take days.

Some colleagues recommended AWS DMS. It did not work.  Some reasons:

  • The AUTO_INCREMENT attribute is not migrated.
  • WP Multisite tables are prefixed wp_1_, wp_2_, wp_3_, ... MySQL considers the underscore a one character wildcard when used in LIKE queries, DMS provided no way to escape it for table filters.
  • Unexplained crashing on anything less than a c4.large (failed last Error The task stopped abnormally Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE)
  • Mostly, the fastest migration we could get going, running 4 tasks in parallel, was an ETA of 10 days

We asked for help. Ned had a conference call with a reputable consulting firm and they gave us a quote: $34K USD + travel & on-site expenses.

Coffee spitting

Next, our research led us to mydumper. From the README:

== What is mydumper? Why? ==

  • Parallelism (hence, speed) and performance (avoids expensive character set conversion routines, efficient code overall)
  • Easier to manage output (separate files for tables, dump metadata, etc, easy to view/parse data)
  • Consistency – maintains snapshot across all threads, provides accurate master and slave log positions, etc
  • Manageability – supports PCRE for specifying database and tables inclusions and exclusions

So far so good…

== How to build it? ==

Jerry Seinfeld leaving

Just kidding. It turns out we don’t have to build mydumper. On Ubuntu sudo apt install mydumper works fine. Similar command using yum on CentOS.

Our tests conclude that mydumper finishes in hours instead of days.

Plan

It is our opinion that this kind of problem is better suited for a document-oriented database. Given that this is the database design we inherited, there’s not much we can do about it, so we’ll try our best with what we’ve got. ¯_(ツ)_/¯

At a billion tables, Automattic has already established its own internal best practices with plugins like HypderDB. Unfortunately HyperDB doesn’t have Composer support and doesn’t look maintained. LudicrousDB, a Composer compatible drop-in that works with our existing tech stack, to the rescue.

LudicrousDB is an advanced database interface for WordPress that supports replication, fail-over, load balancing, and partitioning, based on Automattic’s HyperDB drop-in.

With LudicrousDB tested and working, we are moving towards a 101 slice approach. 1 slice for core tables and 100 slices for books.

The idea is to use the last two digits of a book ID to pick one of 100 slices. If this becomes unmanageable in the future (important to remember that we already have over half a million tables in 1 database and things are fine), we can change the splitting algorithm by adding a condition to use the last X digits on books with IDs bigger than Y.

Code

For informational purposes only. Read the snippets and reason about them. Copy/paste at your own peril.

LudicrousDB  Callback

/**
 * Slices
 *
 * We can predict what slice a blog is in by looking 
 * at the last two digits of the id. Examples:
 *
 * + blog_id: 9, in db09
 * + blog_id: 74, in db74
 * + blog_id: 999989, in db89
 * + blog_id: 9200, in db00
 *
 * @param $query
 * @param \LudicrousDB $wpdb
 *
 * @return string
 */
function pb_db_callback( $query, $wpdb ) {
  if ( preg_match( "/^{$wpdb->base_prefix}\d+_/i", $wpdb->table ) ) {
    $last_two_digits = (int) substr( $wpdb->blogid, -2 );
    $db = sprintf( 'db%02d', $last_two_digits ); // db00, db01, db02, ..., db99
    return $db;
  } else {
    return 'global';
  }
}
$wpdb->add_callback( 'pb_db_callback' );

Export DB Into 101 Slices:

#!/bin/bash

# This script will CREATE 101 directories in current 
# working directory, you have been warned!

db='old_database_name'

sudo mydumper --regex="^${db}\.wp_[a-zA-Z]+.*" --database="${db}" --outputdir="core" --build-empty-files
for ((i=0; i<=99; i++)); do
  ii=`printf %02d $i`
  sudo mydumper --regex="^${db}\.(wp_${i}_|wp_\d+${ii}_).*" --database="${db}" --outputdir="${ii}" --build-empty-files
done

Import 101 Slices:

#!/bin/bash

# This script will READ 101 directories in current 
# working directory

db='new_database_name'

sudo myloader --directory="core" --database="${db}" --overwrite-tables
for ((i=0; i<=99; i++)); do
  ii=`printf %02d $i`
  sudo myloader --directory="${ii}" --database="${db}_${ii}" --overwrite-tables
done

Did it work?

To be continued in Part 2…

Bonus tips:

Because Pressbooks has so many MySQL tables, the Clients I use are always getting stuck or freezing. Here are some tricks I use to keep sane:

  • Don’t let MySQL Workbench load the table schemas. Set up your GUI so that schemas are in a separate tab, disable autoloading, autocomplete, etc. (Edit ⇨ Preferences ⇨ SQL Editor)
  • Disable MySQL CLI auto-completion with --disable-auto-rehash

Pressbooks 4.3.4 and Pressbooks Book 1.10.4

We tagged Pressbooks 4.3.4 and Pressbooks Book 1.10.4 on GitHub today and are now deploying them across our hosted networks. Here’s what’s changed:

Pressbooks 4.3.4

NOTICE: Pressbooks >= 4.3.3 requires WordPress 4.8.2.
NOTICE: Users of the Pressbooks Custom CSS theme must upgrade to Pressbooks Custom CSS 1.0 for compatibility with Pressbooks >= 4.3.0.

  • [CORE ENHANCEMENT] The user catalog title can now be changed via the pb_catalog_titlefilter (props to @monkecheese; see #961).
  • [CORE ENHANCEMENT] SCSS variables from theme options will now be passed to the SCSS compiler as key/value pairs rather than by building SCSS in PHP (see #782 and #963).
  • [FIX] Fixed an issue where the PDF margins theme option was not being applied properly.
  • [FIX] Fixed a conflict between the updated Pressbooks LaTeX module and third-party renderers (props to @monkecheese; see #958 and #959).
  • [FIX] The publication date should now save properly, regardless of book language (thanks to @thomasdumm for the bug report; see #965 and #966).

Pressbooks Book 1.10.4

  • [FIX] Fixed an issue where part numbering would not reset properly in Prince if the part was the book’s first content (see #45).

Pressbooks 4.3.3, Pressbooks Publisher 3.1.3 and DocRaptor for Pressbooks 2.1.0

We tagged Pressbooks 4.3.3, Pressbooks Publisher 3.1.3, and DocRaptor for Pressbooks 2.1.0 and deployed them across our hosted networks today. Here’s what’s changed:

Pressbooks 4.3.3

NOTICE: Pressbooks 4.3.3 requires WordPress 4.8.2.

  • [CORE ENHANCEMENT] The Pressbooks plugin is now self-updating — GitHub Updater is no longer required (see #897 and #954).
  • [CORE ENHANCEMENT] Error logs from export routines can be emailed to an array of email addresses supplied via the pb_error_log_emails filter (see #956).
  • [CORE ENHANCEMENT] Images in cloned or imported books can now be properly edited using the WordPress image editor (see #920 and #949).
  • [FIX] We’ve implemented a better solution for the PDF profile bug (see #951, #952).
  • [FIX] URLs like /catalog/page/1 will no longer attempt to load user catalogs (see #953).

Pressbooks Publisher 3.1.3

  • [FIX] Removed duplicate footer markup (thanks to @jeremyfelt; see #11).

DocRaptor for Pressbooks 2.1.0

  • [CORE ENHANCEMENT] The DocRaptor for Pressbooks plugin is now self-updating — GitHub Updater is no longer required (see #19, #20, and #21).

 

Our new roadmap

At the start of 2017, we established a development roadmap for Pressbooks to guide our work through the coming year. This past week we had the opportunity to review and reflect on that roadmap during our retreat at Pressbooks HQ, and you can see our new roadmap here.

2017: Year of Core

I want to highlight a few of our accomplishments from the past nine months (you’ll see them crossed out on the old roadmap). In retrospect, our clear focus over the past nine months was improving Pressbooks’ core technology. When Dac came back, having a second developer let us expand upon my efforts in recent years to standardize our development processes under the hood. We now use consistent coding standards across all of our open source projects, and we have adopted a standardized build process for admin and front end assets. We have expanded our code coverage on the core Pressbooks plugin, and continue to do so with every release. These improvements let us work more efficiently and give open source contributors a clear framework within which to contribute to the Pressbooks ecosystem.

Our most significant core enhancement is our new REST API, built on the WordPress Core REST API infrastructure. This is the engine that powers our new cloning tool, and we will be making use of it in other areas over the next year. We’re also extremely excited to see what the Pressbooks Open Source community does with an API for books. If you are building something with it, let us know.

2018: Year of the Author

The roadmap for our next year has a new focus: improving Pressbooks for authors. There are a number of editing features that we included on our last roadmap that didn’t make the cut, and our goal for the next year is to fill as many of these gaps in the Pressbooks authoring toolset as we can. This includes:

  • Broadening our support for math, interactive content, video and audio across all formats (with graceful fallbacks in static formats like PDF)
  • Adding support for multiple contributors: authors, editors, translators and more
  • Adding indexing and glossary support

And more! Take a look at our new roadmap to see our comprehensive plans for improving Pressbooks’ authorship and editing tools, as well as all other aspects of the project.

I’m very excited about our accomplishments since January 2017, and I’m looking forward to building on them over the next year. Thanks to Apurva, Dac, Hugh, Liz, and Zoe for making Pressbooks such a productive team!

Pressbooks 4.3.1 and Pressbooks Book 1.10.3

We tagged Pressbooks 4.3.1 and Pressbooks Book 1.10.3 on GitHub today and deployed them across our hosted networks. Here’s what’s changed:

Pressbooks 4.3.1

NOTICE: Pressbooks 4.3.1 requires WordPress 4.8.1.
NOTICE: Users of the Pressbooks Custom CSS theme must upgrade to Pressbooks Custom CSS 1.0 for compatibility with Pressbooks 4.3.1.

  • [CORE ENHANCEMENT] Added a debugging switch to Custom Styles (see #946).
  • [FIX] Resolved an issue where some fonts would not be loaded properly during the PDF export routine (see #944 and #945).
  • [FIX] Updated routines that use XPath for compatibility with HTML5, resolving some issues with multi-level TOC and EpubCheck validation (see #947).

Pressbooks Book 1.10.3

  • [FIX] Fix some issues with Biblical Hebrew, Devanagari, and Turkish fonts.

Pressbooks 4.3, Pressbooks Book 1.10.2, Pressbooks Custom CSS 1.0, and Pressbooks Publisher 3.1.2

We tagged Pressbooks 4.3.0, Pressbooks Book 1.10.2, Pressbooks Custom CSS 1.0.0, and Pressbooks Publisher 3.1.2 on GitHub yesterday, and we’re deploying them across our hosted networks today. Here’s what’s changed:

Pressbooks 4.3

NOTICE: Pressbooks 4.3.0 requires WordPress 4.8.1.
NOTICE: Users of the Pressbooks Custom CSS theme must upgrade to Pressbooks Custom CSS 1.0 for compatibility with Pressbooks 4.3.

  • [FEATURE] Custom Styles: Navigate to AppearanceCustom Styles on your book’s dashboard to add custom CSS or SCSS to any book theme (see #658, #912, #925, #937, #938, #940, #941, and #942).
  • [ENHANCEMENT] Expanded the license property of the /metadata endpoint to include a human-readable license name and custom license text (if present) (see #934 and #936).
  • [ENHANCEMENT] Added the book’s short description to the /metadata endpoint as a disambiguatingDescription (see #930 and #932).
  • [ENHANCEMENT] Clarified errors when trying to clone a book from Pressbooks < 4.1 (see #914and #931).
  • [ENHANCEMENT] Renamed several action and filter hooks and deprecated the old versions (see #926).
  • [FIX] Fixed an issue which would prevent super administrators without any books on a network from accessing the cloning page (see #913 and #933).
  • [FIX] Fixed a regression which blocked the use of custom LaTeX renderers (props to @monkecheese; see #928).

Pressbooks Book 1.10.2

  • [ENHANCEMENT] Updated to version 2.1 of pressbooks/mix.
  • [FIX] The cover page now displays the subtitle from Book Information as the book’s subtitle, rather than the tagline.

Pressbooks Custom CSS 1.0.0

NOTICE: Pressbooks Custom CSS 1.0.0 requires Pressbooks 4.3.0.

  • [ENHANCEMENT] Custom CSS functionality is now included in this theme (see #2).

Pressbooks Publisher 3.1.2

  • [FIX] Prevented Pressbooks Publisher’s wrapper from being added to the user catalog page.

Pressbooks 4.2 and Pressbooks Book 1.10.1

We tagged Pressbooks 4.2.0 and Pressbooks Book 1.10.1 on GitHub today and we’re deploying them across our hosted networks tomorrow. Here’s what’s changed:

Pressbooks 4.2

NOTICE: Pressbooks 4.2 requires WordPress 4.8.1.

  • Feature: Full-sized images will be used where possible in Print PDF exports to ensure that exported PDFs meet image resolution requirements (see #894, #898 and #900).
  • Feature: WXR import and clone operations will now attempt to fetch original images from the source book in addition to the scaled/cropped version in the book content (see #895 and #902).
  • Feature: Content on the organize page now has a View link as will as Edit and Trash (see #840and #893).
  • Enhancement: The Masterminds HTML5 parser is now used instead of \DOMDocument where possible for improved error handling and compatibility with HTML5 elements (see #889 and #896).
  • Enhancement: Unnecessary HTTP calls have been removed from export routines (see #899).
  • Enhancement: Installation instructions are now linked from the readme file instead of being included (see #891 and #892).
  • Fix: Resolved some inconsistencies with custom copyright notice and copyright year display (see #922).
  • Fix: Clone operations now have a 5-minute time limit which should reduce the occurrence of timeouts (props to @bdolor for the bug report; see #903 and #904).
  • Fix: Visiting /catalog on the root site no longer causes an error (see #905).
  • Fix: Pressbooks LaTeX settings no longer appear on the root site’s dashboard (see #910 and #911).
  • Fix: The Organize page now supports all post statuses (see #915).
  • Fix: Fixed an issue where the Pressbooks News dashboard widget would be cached in the wrong language (see #918 and #921).
  • Fix: Removed some unused code from the PB LaTeX symbiont (props to @jeremyfelt; see #923).

Pressbooks Book 1.10.1

  • Fix: Consistent display of custom copyright notice (see #38).

Can you “clone” a Pressbooks book? You can now*!

Here at Pressbooks HQ we’ve been doing a whole lot more development work for the Open Textbook world, in our opinion the most exciting space in the world of publishing. For the uninitiated, an Open Textbook is an openly licensed (i.e. free) book that supports the “5Rs,” defined by David Wiley as the rights to: remix, revise, reuse, retain, and redistribute.

Open Textbooks are powerful not just because they are free for students, but also because teachers and profs (or even students) can easily improve them and modify them for the particular needs of their students.

Theory vs. Practice

In theory, at least.

In practice, all that 5R-y stuff can be difficult: How do you revise a PDF? How do you remix an EPUB? How do you redistribute a print book?

Clone me, please!

The new answer — at least for Open Textbooks built on Pressbooks as of now is: You clone them!

That is, you can now, with the click of a button, clone/copy a complete Pressbooks book (including all metadata, image and media, and content) from one Pressbooks account or instance to another, as long as the original book is:

  • Openly licensed (i.e. licensed with a Creative Commons license)
  • Publicly available on the web

And this means, once you’ve cloned that book, you can 5R it to your heart’s content!

Wait, does this mean anyone can just copy my book?

No. No. No! … No, cloning is only possible in the case that:

  • Your book is openly licensed (with a Creative Commons license)
  • You book privacy setting is: public on the web

So for any books that have standard copyright, or are not available on the web — this doesn’t apply.

Why would you clone a book?

This is, we think, a very exciting development for the Open Textbook ecosystem.

Here are just some of the ways we expect the new feature to be utilized:

  • A community college wants to make changes to the level of subject matter in an open textbook that was originally created for upper-division undergraduates.
  • A faculty member wants to adapt an open textbook to reflect the way they personally teach the subject matter.
  • A university department wants to copy the books contained in a catalogue at a similar department in another university.
  • An instructor wants to make a copy in order to have their class expand an existing open textbook as part of a classroom project.

Cloning ultimately allows books built in Pressbooks to become more modular and easily adaptable for more courses.

Pressbooks, Ryerson University & eCampus Ontario

Have you heard about the exciting Open Textbook work happening in Ontario?

This cloning feature was developed as part of a project Pressbooks is doing with Ryerson University, funded by a grant from eCampusOntario, developing infrastructure for Open Resource Publishing in Ontario.

Also under this project, Pressbooks is getting a full design refresh, including redesigns of the book home page, the webbook reading interface, and, for Pressbooks systems, updates to the landing page and Pressbooks’ built-in catalog page.

So, How Do I Start Cloning?

The bad news is: This feature is not available on Pressbooks.com. Cloning is an educational feature only available in standalone Pressbooks systems (Pressbooks EDU client systems and Pressbooks open source). (Contact us if you’re interested in us hosting a Pressbooks EDU system for you.)

Pressbooks.com also supports replicating books. However, the process of copying a book is more labour intensive, and requires users to reach out to original creators for the book’s XML files. This new cloning feature omits these steps for enterprise users, making duplication possible with only a few clicks of a button.

Learn more about how to use the new cloning feature.

Pressbooks 4.1 and Pressbooks Book 1.10

We tagged Pressbooks 4.1.0 and Pressbooks Book 1.10.0 on GitHub on Friday and we’re deploying them across our hosted networks today. Here’s what’s changed:

Pressbooks 4.1

NOTICE: Pressbooks 4.1 requires WordPress 4.8.1.

  • Feature: Cloning! Clone any public, properly-licensed book from any Pressbooks 4.1 network including your own (super admins can clone any book from their own network, regardless of license) (see #841, #857, #881, #885).
  • Feature: Granular display controls for content licenses at the book and section level (see #805, #867, #873, #883, #884).
  • Feature: Word count for the entire book and for content marked for export is now displayed on the Organize page (see #842, #878, #880).
  • Feature: Users can now delete their own books from the book menu (see #845, #864).
  • Feature: Custom taxonomies are now available in the Pressbooks REST API v2 (see #851, #853).
  • Feature: The Schema.org isBasedOn property is now saved in metadata and displayed in the Pressbooks REST API v2 (see #850, #852).
  • Feature: Search & Replace with RegEx is now available for super admins without additional configuration (see #870, #871, #879).
  • Feature: Punjabi Gurmukhi support (props to @alexpflores) (see #877).
  • Enhancement: Book editors can now modify theme options and custom CSS (see #862, #863).
  • Enhancement: The Pressbooks News feed is now cached across all sites to reduce unnecessary network access (see #882).
  • Fix: The TOC endpoint in the Pressbooks REST API v2 now uses chapters and parts for consistency with the endpoints for these post types.
  • Fix: The EPUB importer now properly detects and handles optional whitespaces (see #554, #874).
  • Fix: Users without super admin privileges can no longer access the trash when they shouldn’t be able to do so (see #865).
  • Fix: Image URLs with #fixme on the end will now be properly copied into EPUBs during export (see #887).

Pressbooks Book 1.10

  • Feature: Add support for Pressbooks 4.1.0’s content licensing feature (see #25, #26, #31, #32, #37, pressbooks/pressbooks#805).
  • Feature: Add support for Punjabi Gurmukhi script (props to @alexpflores) (see #27).
  • Fix: Fix an issue where enabling social media would break the mobile webbook layout (see #28, #30).
  • Fix: Fix an issue where numberless chapters in a book would cause the MOBI export to fail with certain book themes (see #33).
  • Fix: Add webbook theme support for WordPress’ built-in <!--nextpage--> tag, which splits a single post into multiple web pages (see #35).

Pressbooks 4.0.1

We released Pressbooks 4.0.1 on GitHub and deployed it across our hosted networks today. Here’s what’s changed:

  • Fix: Fixed an issue where the template root for book themes was not properly updated (see #854, #859).
  • Fix: Fixed an issue where ampersands were not being sanitized in XML outputs (see #860).
  • Fix: Fixed an issue where the Disable Comments setting was not being saved properly (see #861).
  • Fix: Fixed an incorrect link in upgrade notices (see #848, #849).