Better WDC Performance with Promises

image courtesy mariadelajuana

Web Data Connectors give analysts access to worlds of data previously locked away behind APIs. The WDC SDK provides developers a platform to standardize data extraction in a way that's friendly to Tableau's suite of analytics and business intelligence software.

Unlike traditional databases and other data stores that are often able to return a complete set of data with a single query, most APIs only expose small chunks of data that must be requested separately and stitched together. Typically, this is done using some form of pagination: a client asks the API for some number of records, then the API responds with these records, as well as some metadata about how many more records remain. Based on this metadata, the client may choose to perform additional queries.

This process constitutes the majority of a WDC's responsibility.

The standard WDC way

To this end, the WDC SDK provides a helpful interface for managing the pagination process. Take for example, the following:

myConnector.getTableData = function(lastRecordToken) {
  // If Tableau gave a page, use it. But default to 1.
  var pageNumber = lastRecordToken ? lastRecordToken + 1 : 1;

  // Load this page of data from the API.
  $.getJSON('https://api.example.com?page=' + pageNumber, function (response) {
    // Check if the current page is not the last page.
    var hasMoreData = pageNumber < response.metadata.lastPage;

    // Pass data and current page back to Tableau. If
    // hasMoreData evaluated to TRUE, Tableau will call
    // getTableData again using this pageNumber value.
    tableau.dataCallback(response.records, pageNumber, hasMoreData);
  });
}

In the contrived example above, we've implemented thegetTableData method that Tableau calls on our connector. It uses the jQuery getJSON method to pull data from an API, defaulting to the first page of data.

When the API returns a response, we pass the records to Tableau, along with some state (the page number we just loaded), and a boolean value indicating whether or not additional data should be returned (which we determine based on the current page and a hypothetical "lastPage" metadata value returned by the API).

If more data should be returned, Tableau knows to call thegetTableData method again, this time passing in the page number / state value we provided, which the code above will use to pull in the next page of data. This process repeats itself until we've run through all available pages and Tableau returns control to the end-user.

The problem with the standard flow

We're able to accomplish a lot with a small amount of code, but this isn't the most efficient way to pull down all of our data. A significant amount of time is wasted on every page, waiting for the API to receive the request, process it, and return a response. A waterfall chart representing resource usage for the above code might look something like this:

As you can imagine, the performance penalty grows considerably as data sets grow larger or if APIs enforce low limits on number of records retrieved per query relative to the number of records desired.

Although JavaScript execution is single-threaded (meaning only one piece of code can be executed at any given time), the asynchronous nature of jQuery's getJSON (and its underlyingajax method) can be exploited to effectively pull down data in parallel.

Consider the following:

$.getJSON('https://api.example.com', function (response) {
  // Executed later, once api.example.com responds with data.
  console.log('Second: Just got data!');
});

// Executed immediately, even before the API responds.
console.log('First: No data yet!');

Although we call $.getJSON with the intention of doing something with its response data in a callback, the thread of execution first continues to the instructions immediately following the call. This becomes useful for us when we realize the same principal holds true if we were to make a series of calls to $.getJSON in a loop:

var pageNumbers = [1, 2, 3],
    allResults = [];

// Loop through all pages.
pageNumbers.forEach(function (page) {
  $.getJSON('https://api.example.com?page=' + page, function (response) {
    console.log('Later: Got data for page ' + page);
    allResults.push(response.records);
  });
});

console.log('First: No data yet!');

All three API calls are initiated in rapid succession in a non-blocking way, immediately after which "First: No data yet!" is printed to the console. When each request is eventually fulfilled, additional messages are logged. Note that rather than sending data to Tableau with each response, we append the data to an array store with the intention of passing the complete data set back to Tableau later. A waterfall chart for the code above might look something like this:

We dramatically improve performance (by just under 3x), but we also create complications for ourselves. In the above example, how do we know when all API calls have been fulfilled? If the thread of execution immediately returns to us after the loop, how do we wait to pass control back to Tableau with the complete set of data?

Enter: Promises

Promises are not a new concept. Devised as a way to simplify synchronicity in concurrent or asynchronous programming languages, the idea is straightforward: a Promise is a placeholder for a value that hasn't yet been computed. It can be fulfilled when the value becomes known, or rejected in case of error. You can "then" specify code that gets executed when the Promise moves from pending to fulfilled.

Here's a naive example to illustrate how you might use Promises in JavaScript in a Web Data Connector:

// Return a Promise for the given URL.
var returnApiDataFor = function(url) {
  return new Promise(function(fulfill, reject) {
    // When the promise is invoked, send the request.
    $.getJSON(url, function (response) {
      // Fulfill the promise with data returned.
      fulfill(response.records);
    });
  });
};

// Invoke the promise for a given URL.
returnApiDataFor('https://api.example.com')
  // Once the promise is fulfilled...
  .then(function (records) {
    // Pass data to Tableau.
    tableau.dataCallback(records);
  });

If we want to wait for a series of API calls to be fulfilled before taking action, we can use a feature in the JavaScript specification for Promises called Promise.all. With it, we can specify an iterable object (like an array) of Promises, wait for them all to be fulfilled, and then react based on the aggregation of all returned values.

// Use the same Promise return helper from above.
var returnApiDataFor = function(url) {
  return new Promise(function(fulfill, reject) {
    $.getJSON(url, function (response) {
      fulfill(response.records);
    });
  });
};

// Define a helper to map a URL and page list to Promises.
var allPromisesFor = function(baseUrl, pages) {
  // Map each page to a URL wrapped in a Promise to retrieve its data.
  return pages.map(function(pageNumber) {
    return returnApiDataFor(baseUrl + '?page=' + pageNumber);
  });
};

// Invoke all Promises for a given base URL and pages.
Promise.all(allPromisesFor('https://api.example.com', [1, 2, 3])
  // Once all promises are fulfilled...
  .then(function (recordSet) {
    var allRecords = [];

    // Concatenate / union all page responses to a single array.
    recordSet.forEach(function(records) {
      allRecords = allRecords.concat(records);
    });

    // Pass the complete data set to Tableau.
    tableau.dataCallback(allRecords);
  });

The API is simple. Just as a Promise maps to a future value,Promise.all takes an array of Promises and maps them to an array of future values: [Promise, Promise, Promise] =>[[], [], []]. After a simple union-like operation during post-processing, we can pass the data back to Tableau in the format it expects [, , ], with no need to instruct Tableau to loop back and ask our WDC for additional records.

Promises provide a way to strike a balance between application performance and complexity: we can take advantage of some of JavaScript's asynchronous features while maintaining code readability and maintainability.

Things to consider

The Promise specification is part of ECMAScript 6, the first significant update to the JavaScript language since 2009. Although many modern browsers support Promises natively, they are currently unavailable in the WDC connection window in Tableau. In order to use them, you will have to include a shim or polyfill before your code.ES6-Promise andbluebird are good options, but anything that is faithful to the spec should do. For example:

<script src="es6-promise.min.js"></script>
<script src="jquery.min.js"></script>
<script src="wdc-sdk.js"></script>
<script src="your_wdc.js"></script>

It's also worth noting that the number of persistent connections that can be open for a given domain at any given time is fixed. Browsers set this number anywhere between 4 and 8, but the WDC seems to allow up to 6 simultaneous connections to a single host. This ensures you can't overwhelm an API provider by opening up hundreds or thousands of connections simultaneously. No need to implement that safety mechanism yourself.

Finally, for the sake of brevity and clarity, all of the examples above omit error handling of any kind. Beyond the benefits outlined above, Promises can also help simplify fault tolerance in asynchronous applications. I encourage you to explore JavaScript Promises in more depth.