How WP_Importer_Cron works

@migration #importer #cron

Initial observations from reading the source:

  1. The cron-based importer hooks into the WordPress cron scheduler to fire events every 60 seconds.
  2. Each cron job is fired off by WP_Importer_Cron::importer_callback() and is limited to 30 seconds of work.
  3. The sequence of import steps is defined in code, and the importer tracks its own progress in a variable (in the case of the Tumblr Importer, that would be $this->blog[$url]). Depending on how far it has progressed, the importer hooks a different action to the cron job using WP_Importer_Cron::schedule_import_job().
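
The progress-driven sequencing described in point 3 might be sketched roughly as follows. This is a standalone illustration, not the Tumblr Importer's actual code: the class Sketch_Importer, the step names, and the $scheduled array (which stands in for a call to WP_Importer_Cron::schedule_import_job()) are all hypothetical.

```php
<?php
// Hypothetical stand-in for an importer built on WP_Importer_Cron.
class Sketch_Importer {
    // Per-blog state keyed by URL, mirroring $this->blog[$url] in the notes.
    public $blog = array();
    // Records which callbacks would be handed to the scheduler. The real
    // WP_Importer_Cron::schedule_import_job() is NOT called in this sketch.
    public $scheduled = array();
    // Ordered list of steps; the names are made up for illustration.
    private $_steps = array( 'start', 'posts', 'drafts', 'finish' );

    // Read the stored progress marker, defaulting to the first step.
    private function progress( $url ) {
        return isset( $this->blog[ $url ]['progress'] )
            ? $this->blog[ $url ]['progress']
            : $this->_steps[0];
    }

    // Pick the next callback based on how far the import has progressed.
    public function next_job( $url ) {
        $progress = $this->progress( $url );
        if ( 'finish' === $progress ) {
            return null; // nothing left to schedule
        }
        $callback = 'do_' . $progress;
        $this->scheduled[] = $callback;
        return $callback;
    }

    // Mark the current step as done and move on to the following one.
    public function advance( $url ) {
        $i    = array_search( $this->progress( $url ), $this->_steps, true );
        $last = count( $this->_steps ) - 1;
        $this->blog[ $url ]['progress'] = $this->_steps[ min( $i + 1, $last ) ];
    }
}

$importer = new Sketch_Importer();
$url      = 'example.tumblr.com';
$first    = $importer->next_job( $url ); // no progress stored yet
$importer->advance( $url );
$second   = $importer->next_job( $url ); // progress is now 'posts'
```

The point of the sketch is only that the decision about what to schedule next is derived from stored state, so each cron run can resume where the previous one stopped.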

The progress marker, accessed via the key $this->blog[$url]['progress'], and other indicators (e.g. the number of posts fetched) are stored together in a single option, which is updated on each run.

/* from WP_Importer_Cron::__construct() */
//load the variables
$options = get_option( get_class($this) );

if ( is_array($options) )
    foreach ( $options as $key => $value )
        $this->$key = $value;

/* from WP_Importer_Cron::save_vars() */
// function to let the class save its own variables
function save_vars() {
    $vars = get_object_vars($this);
    foreach($vars as $var=>$val) {
        // variables whose names start with an underscore are not saved
        if ($var[0] == "_") unset ($vars[$var]);
    }
    update_option( get_class($this), $vars );

    return !empty($vars);
}

As we can see here, this option is stored as follows: the option name is the class name of the importer; the option value is an array of the instance's variables (serialized by update_option()).
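
To make the round trip concrete, here is a minimal sketch of one run saving its state and a later run restoring it. It assumes nothing about the real importer beyond the two snippets above: the class Sketch_State_Importer and its properties are hypothetical, and get_option() / update_option() are replaced by simple in-memory stand-ins when WordPress is not loaded, so the example can run on its own.

```php
<?php
// In-memory stand-ins for WordPress's get_option() / update_option(),
// defined only when WordPress itself is not loaded.
if ( ! function_exists( 'get_option' ) ) {
    $GLOBALS['sketch_options'] = array();
    function get_option( $name, $default = false ) {
        return isset( $GLOBALS['sketch_options'][ $name ] )
            ? $GLOBALS['sketch_options'][ $name ]
            : $default;
    }
    function update_option( $name, $value ) {
        $GLOBALS['sketch_options'][ $name ] = $value;
        return true;
    }
}

// Hypothetical importer that persists itself the way the notes describe.
class Sketch_State_Importer {
    public $blog     = array();
    public $_scratch = 'not saved'; // leading underscore: skipped on save

    public function __construct() {
        // Restore every saved variable onto the new instance.
        $options = get_option( get_class( $this ) );
        if ( is_array( $options ) ) {
            foreach ( $options as $key => $value ) {
                $this->$key = $value;
            }
        }
    }

    public function save_vars() {
        $vars = get_object_vars( $this );
        foreach ( $vars as $var => $val ) {
            if ( '_' === $var[0] ) {
                unset( $vars[ $var ] );
            }
        }
        update_option( get_class( $this ), $vars );
        return ! empty( $vars );
    }
}

// First run: record some progress and save it.
$first = new Sketch_State_Importer();
$first->blog['example.tumblr.com'] = array( 'progress' => 'posts', 'posts_fetched' => 40 );
$first->save_vars();

// A later run constructs a fresh instance and picks up the saved state.
$second = new Sketch_State_Importer();
$stored = get_option( 'Sketch_State_Importer' );
```

In this sketch, $stored holds only the blog array, because the underscore-prefixed property is filtered out before saving.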

What to do to the WXR importer

  1. First transform the existing top-level tasks (WP_Import::process_categories(), WP_Import::process_tags(), WP_Import::process_terms(), WP_Import::process_posts(), WP_Import::backfill_parents(), WP_Import::backfill_attachment_urls(), WP_Import::remap_featured_images()) into progress-stateful functions.
  2. Hook them into WP_Importer_Cron at 60-second intervals, with an execution time of approximately 30 seconds per run.
  3. Work in a do-while loop to get as much done as possible while there's time left in the current execution.
  4. Divide up post tasks into chunks by a set number of posts per execution (something reasonable for 30 seconds).
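
A progress-stateful, time-bounded task along the lines of points 1, 3 and 4 could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class Sketch_Chunked_Task, its have_time() helper, the chunk size, and the plain array of posts are hypothetical and do not come from the WXR importer. In real use the time budget would be around 30 seconds; the usage lines pass 0 so that exactly one chunk is processed per run and the behaviour is predictable.

```php
<?php
// Hypothetical progress-stateful task: processes posts in chunks until the
// time budget runs out, remembering its offset between runs.
class Sketch_Chunked_Task {
    public $offset    = 0;       // progress marker that would be persisted between runs
    public $processed = array(); // stands in for the real per-post import work
    private $_posts;
    private $_chunk_size;
    private $_budget;
    private $_started;

    public function __construct( array $posts, $chunk_size, $budget_seconds ) {
        $this->_posts      = $posts;
        $this->_chunk_size = $chunk_size;
        $this->_budget     = $budget_seconds;
    }

    // True while the current execution is still within its time budget.
    private function have_time() {
        return ( microtime( true ) - $this->_started ) < $this->_budget;
    }

    // One cron execution. Returns true once every post has been handled.
    public function run() {
        $this->_started = microtime( true );
        $total          = count( $this->_posts );
        if ( $this->offset >= $total ) {
            return true;
        }
        // do-while: always finish at least one chunk, then continue only
        // while posts remain and there is time left.
        do {
            $chunk = array_slice( $this->_posts, $this->offset, $this->_chunk_size );
            foreach ( $chunk as $post ) {
                $this->processed[] = $post;
            }
            $this->offset += count( $chunk );
        } while ( $this->offset < $total && $this->have_time() );

        return $this->offset >= $total;
    }
}

// 25 posts, 10 per chunk, zero time budget: one chunk per run.
$task        = new Sketch_Chunked_Task( range( 1, 25 ), 10, 0 );
$done        = $task->run();
$after_first = $task->offset;
$runs        = 1;
while ( ! $done ) {
    $done = $task->run();
    $runs++;
}
```

Using a do-while rather than a while loop guarantees that each execution makes some progress even if the time check would fail immediately, which avoids a run that is scheduled but never advances the offset.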