AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance


In this section, we examine the performance of the database and the engine. Where applicable, a comparison is drawn with earlier versions of AiiDA (ref. 7) to illustrate how the new design has improved performance.

The database schema of AiiDA 1.0 includes many improvements and optimisations with respect to the schema published in ref. 7. Here we highlight two changes that have had the greatest impact on storage and query efficiency. First, in section “EAV replaced by JSONB” we describe how the schema of node attributes has been changed from a custom entity-attribute-value solution to a native JSON binary (JSONB) format. Second, section “On-the-fly transitive closure” details how the transitive closure, originally stored as a statically generated table, has been replaced with one that is computed on the fly in memory.

EAV replaced by JSONB

As described in “The provenance model”, the attributes of a node are stored in a relational database. The exact schema for these attributes depends on the node type and cannot be statically defined, which is in direct conflict with the modus operandi of relational databases, where schemas are rigorously defined a priori. This limitation was originally overcome by implementing an extended entity-attribute-value (EAV) table that allowed storing arbitrarily nested serialisable attributes in a relational database (ref. 7). While a successful solution, it comes with an increased storage cost and significant overhead in the (de-)serialisation of data, reducing querying efficiency.

As storing semi-structured data is a common requirement for many applications, PostgreSQL added support for native JSON and JSONB datatypes as of v9.2 (postgresql.org/docs/current/static/release-9-2.html) and v9.4 (postgresql.org/docs/current/static/release-9-4.html), respectively; JSONB is an efficient storage and indexing format (postgresql.org/docs/current/static/datatype-json.html). In AiiDA 1.0, the custom EAV implementation for node attributes has been replaced with the native JSONB datatype provided by PostgreSQL, which yields significant improvements in both storage cost and query efficiency.
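To make the difference concrete, the following minimal sketch contrasts the two representations for the same set of attributes. The table layout and column names are illustrative only, not AiiDA's actual schema:

```python
import json

# The attributes of a hypothetical structure node, as a nested Python dict.
attributes = {
    "cell": [[4.2, 0.0, 0.0], [0.0, 4.2, 0.0], [0.0, 0.0, 4.2]],
    "pbc1": True,
}

# EAV: every (nested) value becomes its own row, with the datatype tracked
# explicitly so that values can be deserialised again on retrieval.
eav_rows = [
    # (node_id, key,      datatype, float_value, bool_value)
    (42, "cell",          "list",   None, None),
    (42, "cell.0",        "list",   None, None),
    (42, "cell.0.0",      "float",  4.2,  None),
    # ... one row for every nested element of the cell ...
    (42, "pbc1",          "bool",   None, True),
]

# JSONB: the whole dictionary is serialised into a single column of a single
# row, e.g. INSERT INTO db_dbnode (id, attributes) VALUES (42, %s::jsonb).
jsonb_row = (42, json.dumps(attributes))
```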

The replacement of EAV by JSONB significantly reduces storage cost in two ways: (a) the data itself is stored more compactly, as it is reduced from an entire table to a single column, and (b) database indexes can be removed while still providing superior query efficiency. Figure 6(a) shows the space occupied when storing 10000 crystal structures, comparing the size of the raw files on disk with that of their content stored in the EAV and JSONB schemas. In the case of raw files, the XSF format (xcrysden.org/doc/XSF.html) was used since it contains only the information that is absolutely necessary.

Fig. 6

Comparison, on a logarithmic scale, of the space requirements and time to solution when querying data with the two AiiDA ORM backends. (a) Space needed to store 10000 structure data objects as raw text files, using the existing EAV-based schema and the new JSONB-based schema. The reduced space requirements of the JSONB-based schema with respect to the raw text files are due to, among other things, white-space removal. The JSONB schema reduces the required space by a factor of 1.5 compared to the raw file size and a factor of 25 compared to the EAV-based schema. (b) Time for three different queries that return attributes of different size for the same set of nodes. The benchmarks are run on a cold database, meaning that the database caches are emptied before each query. We indicate separately the database query time (SQL time) and the total query time, which also includes the construction of the Python objects in memory. The total query time for the site attributes in the JSONB format is 75 times smaller than that of the equivalent query in the EAV format. The SQL time for the same query is roughly 6.5 times smaller for the JSONB version of the query than for the EAV version.

This benchmark was performed on a PostgreSQL 9.6.8 database using the ORM backends as implemented in AiiDA 1.0. Comparing the EAV format to the JSONB format, a decrease in storage space of almost two orders of magnitude is apparent. The space gains of the new format apply not only to the occupied space on disk, but also to the amount of data transferred when querying JSON fields, as shown in Table 2. This effect is, however, only part of the increase in query efficiency enabled by the JSONB schema.

Table 2 Result size and number of rows of attribute queries presented in Fig. 6(b) on a database table of 300000 crystal structures.

Using the JSONB-based format also carries significant speed benefits. These mainly come from the more compact JSONB-based schema compared with the EAV schema, as described in the previous section, which results in (a) less data transferred from the database to AiiDA and (b) a reduced cost of deserialising the raw query result into Python objects.

Figure 6(b) shows benchmarks carried out with PostgreSQL 10.10 on an AiiDA database generated for a research paper (ref. 14), which contains 7318371 nodes. The benchmarks were carried out on a subset of 300000 crystal-structure data nodes on a machine with an Intel i7-5960X CPU and 64GB of RAM. Three different kinds of attributes were queried: cell, kinds and sites. The cell is a 3 × 3 array of floats; the kinds contain information on the atomic species (typically one kind per chemical element contained in the structure); and there is one site per atom, explaining the increase in result sizes shown in Table 2.

Due to the specific format of the EAV schema, more rows need to be retrieved for every crystal-structure data node. The effect of the different result size is visible both in the SQL time (reflecting the time to perform the query and to get the result from the database) and in the total time spent, which includes the deserialisation of raw query results into Python objects. As shown in Fig. 6(b), the total query time for the site attributes in the JSONB format is 75 times smaller than for the equivalent query in the EAV format. The SQL time for the same query is roughly 6.5 times smaller for the JSONB version than for the EAV version. The further speedup at the Python level stems from the overhead, in the EAV-based schema, of deserialising the attributes in Python, which is largely avoided in the JSONB-based schema.
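For reference, queries of this type can be expressed through AiiDA's QueryBuilder by projecting individual attributes; the sketch below mirrors the three benchmark queries, though it is not the exact benchmark script. With the JSONB backend, each projection translates into a native PostgreSQL JSON operator:

```python
from aiida.orm import QueryBuilder, StructureData

# Project one attribute of every StructureData node in the database.
for attribute in ('attributes.cell', 'attributes.kinds', 'attributes.sites'):
    qb = QueryBuilder()
    qb.append(StructureData, project=[attribute])
    results = qb.all()  # one single-element list per structure node
```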

On-the-fly transitive closure

Very often, when querying the provenance graph, one is only interested in the neighbours directly adjacent to a certain node. However, some use cases require traversing the graph over multiple hops in the data provenance to find a specific ancestor or descendant of a given node. To make the queries for ancestors and descendants at arbitrary distance as efficient as possible, early versions of AiiDA computed and stored the transitive closure (TC) of the graph (i.e., the list of all available paths between any pair of nodes) in a separate database table. Storing these paths in a dedicated database table with appropriate indexes allowed AiiDA to query for ancestors and descendants with time complexity 𝒪(1), independent of the graph topology and the number of hops.

However, the typical size of the TC table is significant even for moderately sized provenance graphs, and quickly has an adverse effect on the general performance of the database. For example, a subset of just one million nodes from the database generated in ref. 14 has 226 million rows in the TC table, corresponding to 200 GB on disk. In addition to the raw disk storage cost, the time needed to store a new link also increases, as the TC is updated with automatic triggers at each update of the links table; this becomes more expensive as the table grows, because the table indexes need to be updated as well. AiiDA 1.0 replaces the TC explicitly stored in a table with one that is computed lazily, or on the fly (OTF), whenever the ancestors or descendants of a node are queried for. This is implemented in the query builder via SQL common table expressions that recursively traverse the DAG, as sketched below. The OTF method greatly reduces the time required to store new links and does not require any disk space for storing the TC, albeit at the cost of slightly slower queries. However, the impact on the efficiency of the recursive queries for AiiDA provenance graphs is minimal, since the typical graph topology is relatively shallow and often composed of (almost) disjoint components. This can be seen in Fig. 7(a,b), which shows frequency graphs capturing the topology of a subgraph of the database of ref. 14, composed of one million nodes. Indeed, the vast majority of nodes have only a handful of ancestors and descendants, and these can be reached in a relatively small number of hops.
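As an illustration, the recursive traversal can be expressed with a PostgreSQL recursive common table expression along the following lines. The sketch assumes AiiDA's link table layout (db_dblink with input_id and output_id columns) but should not be read as the exact SQL emitted by the query builder:

```python
import psycopg2  # assumed driver; AiiDA itself goes through its ORM backends

# Recursive CTE computing all descendants of a given node on the fly.
# The UNION (rather than UNION ALL) removes duplicates, so nodes reachable
# via multiple paths in the DAG are visited only once. With the old explicit
# TC table, the same question reduced to a single indexed lookup, e.g.
# SELECT child_id FROM tc_table WHERE parent_id = %(root)s (names illustrative).
DESCENDANTS_SQL = """
WITH RECURSIVE descendants(node_id) AS (
    SELECT output_id FROM db_dblink WHERE input_id = %(root)s
  UNION
    SELECT link.output_id
    FROM db_dblink AS link
    JOIN descendants ON link.input_id = descendants.node_id
)
SELECT node_id FROM descendants;
"""

with psycopg2.connect("dbname=aiida") as connection:
    with connection.cursor() as cursor:
        cursor.execute(DESCENDANTS_SQL, {"root": 42})  # 42: hypothetical node id
        descendant_ids = [row[0] for row in cursor.fetchall()]
```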

Fig. 7

Analysis of a sample of one million nodes of the AiiDA graph published in ref. 14. (a) Frequencies of the number of ancestors and descendants of all nodes. (b) Frequencies of the number of hops, i.e., the distance to reach the farthest ancestor/descendant. (c) Required CPU time when querying for all descendants of 50 top-level nodes in a graph that consists of a number of binary trees of breadth B and depth D using the transitive closure on-the-fly (TC-OTF, diamonds) or the explicitly tabulated transitive closure (TC-TAB, squares).

To compare the performance of the explicit and lazy implementations of the TC, we performed benchmarks on multiple graphs with topologies comparable to typical AiiDA provenance graphs. Each graph consists of N binary trees with a depth and breadth (the number of downward levels and the number of outgoing edges from each vertex, respectively) of 2 or 4. The benchmark records the total time it takes to query for all descendants of 50 top-level nodes using either the explicit TC table (TC-TAB) or the on-the-fly TC (TC-OTF).

Figure 7(c) clearly shows that in both cases the number of trees does not affect the query efficiency. Moreover, as the depth and breadth of the graph increase, the TC-TAB query time increases, whereas for the TC-OTF the topology of the graph has little impact on the query time. Note that this holds for these particular topologies, which match those of typical AiiDA provenance graphs, but is not necessarily the case for more complex graph topologies. Finally, as expected, the TC-TAB is faster than the TC-OTF, albeit by just a factor of two. We deem this increased cost more than acceptable, given the considerable savings in storage space provided by the TC-OTF, its performance independence from the graph topology, and the faster storage of new links. For these reasons, all recent versions of AiiDA implement only the TC-OTF.

Event versus polling-based engine

To evaluate the performance of the event-based engine of AiiDA 1.0 compared to the polling-based one of earlier versions, we consider an example work chain that performs simple and fast arithmetic operations. The work chain first computes the sum of two inputs by submitting a CalcJob and then performs another addition using a calcfunction. For the CalcJob, the ArithmeticAddCalculation implementation is used, which wraps a simple Bash script that sums two integers. Each work chain execution therefore corresponds to the execution of three processes (the top work chain, a calculation job and a calculation function) and is representative of typical use cases; a sketch of such a work chain is shown below. For each benchmark, 400 work chains are submitted to the daemon and the rates of submission and process completion are recorded, as shown in Fig. 8. These benchmarks were performed on a machine with an Intel Xeon E5-2623 v3 CPU and 64GB of RAM. Besides the results obtained for the old and new engine using optimised parameters (number of workers, transport intervals, …), for a fair comparison and to highlight the effect of the different engine types, in Fig. 8(a) we also show the results for the new engine with some artificial constraints. In particular, we ran the new engine with only four workers (roughly comparable with the old engine, which had four independent tasks for submitting, checking queued jobs, retrieving files, and processing work chain steps) rather than twelve. Additionally, we set a minimal interval between connections of 5 seconds in the new daemon to simulate the polling behaviour (with a default polling time of 5 seconds) of the old daemon, despite all calculation jobs being run on the local host, where an interval of zero is optimal.
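Conceptually, the benchmarked work chain resembles the following sketch. ArithmeticAddCalculation ships with aiida-core under the arithmetic.add entry point; the work chain itself and the final increment by one are illustrative simplifications, not the actual benchmark code:

```python
from aiida.engine import ToContext, WorkChain, calcfunction
from aiida.orm import Int
from aiida.plugins import CalculationFactory

ArithmeticAddCalculation = CalculationFactory('arithmetic.add')

@calcfunction
def add_one(total):
    """Perform the second addition, tracked in the provenance graph."""
    return Int(total.value + 1)

class AddBenchmarkWorkChain(WorkChain):
    """Submit a CalcJob addition, then run a calcfunction addition."""

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.expose_inputs(ArithmeticAddCalculation)
        spec.outline(cls.run_calcjob, cls.run_calcfunction)
        spec.output('sum', valid_type=Int)

    def run_calcjob(self):
        # Submit the calculation job and wait for it to finish.
        inputs = self.exposed_inputs(ArithmeticAddCalculation)
        return ToContext(addition=self.submit(ArithmeticAddCalculation, **inputs))

    def run_calcfunction(self):
        # Perform the second addition with a calculation function.
        self.out('sum', add_one(self.ctx.addition.outputs.sum))
```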

Fig. 8

Process submission and completion rates for the old and new engine. (a) Number of submitted (solid lines) and completed (dashed lines) processes over time for the new engine (both with optimised parameters and with artificial constraints, see text) and the old engine. The submission of the old engine is slightly faster, but despite this the completion rate of the new engine is clearly higher, even under constrained conditions. (b) Number of completed processes for the old (dashed lines) and new (solid lines) engine, decomposed into the separate (sub)processes. The polling-based nature of the old engine is clearly reflected in the stepwise behaviour of the completion rate, with processes being finalised in batches. In contrast, the curves for the new engine, owing to its event-based design, are smooth and closely packed together, indicating that processes are executed in a continuous fashion.

Figure 8(a) shows that the submission rate of the old engine is slightly faster than that of the new engine, because its submission procedure was significantly simpler, involving no communication with remote daemon workers. Nevertheless, the total completion time of the new engine (even in the constrained configuration) is shorter, with the optimised new engine completing all processes three times faster than the old one in this simple example. Additionally, with the old engine all work chains complete towards the end of the time window at roughly the same time, in a few discontinuous jumps, because of the polling-based architecture. In contrast, the completion rate of the event-based engine is much smoother (besides being faster overall) because of the continuously operating character of the new engine, with processes managed concurrently and executed immediately, without waiting for the next polling interval.

The concurrency of the new daemon is highlighted even more in Fig. 8(b), where we compare the completion times of the new (optimised) and old daemons, showing separately the completion of the top work chain and of each of the two subprocesses. The large delay in the old engine arises because, even though the workflow only runs two subprocesses, its internal logic consists of multiple steps, and only one step is processed per polling interval. In contrast, the new engine executes all workflow steps in quick succession without interruption.

We stress that the efficiency improvements of the new engine are even larger in real high-throughput situations, since the daemon is never idle between polling intervals. Most importantly, the new daemon is scalable: the number of daemon workers can be increased dynamically to distribute heavy workloads. This effect is visible in Fig. 8(a), where the optimised new engine (with 12 workers and without connection delay) completes all processes in half the time required by the constrained one. The effective throughput of the new engine for this experiment, which was run on a modest workstation, amounts to roughly 35000 processes per hour. Due to the scalable design of the new engine, this rate can easily be increased by running more daemon workers on a more powerful machine.

Caching

The storage of the complete data provenance as described in “The Provenance Model” does not only guarantee the reproducibility of results; it can also reduce the unnecessary repetition of calculations. When the engine is asked to launch a calculation, it can first check in the database whether a calculation with exactly the same inputs has already been performed and, if so, simply reuse the outputs of the completed calculation, saving computational resources. This mechanism is referred to as caching in AiiDA, and users can activate it for all calculation types or only for specific ones.

To rapidly find identical calculations in the database, one needs an efficient method to determine whether two nodes are equivalent. For this purpose, AiiDA uses the concept of hashing, where the whole content of a node is mapped onto a single short hexadecimal string. AiiDA employs the cryptographic BLAKE2b algorithm (blake2.net), which has a relatively low computational cost combined with a vanishingly small probability of hash collisions. The latter property means that any two nodes with the same hash can be assumed to be identical. The content of a node that is included in the computation of its hash consists of the immutable node attributes and the file repository contents. In addition, for a calculation node, the hashes of all its inputs are also included, such that looking for calculations with identical inputs can be done merely by comparing the hash of the calculation itself.
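The idea behind the hash computation can be sketched with Python's standard library implementation of BLAKE2b. The function below is a deliberate simplification: AiiDA's actual implementation also normalises data types and excludes volatile metadata from the hash:

```python
import hashlib
import json

def compute_node_hash(attributes, file_contents, input_hashes=()):
    """Sketch of content hashing with BLAKE2b.

    ``attributes`` are the immutable node attributes, ``file_contents`` the
    bytes of the repository files and, for calculation nodes only,
    ``input_hashes`` the hashes of all input nodes.
    """
    hasher = hashlib.blake2b(digest_size=32)
    hasher.update(json.dumps(attributes, sort_keys=True).encode('utf-8'))
    for content in file_contents:
        hasher.update(content)
    for input_hash in sorted(input_hashes):
        hasher.update(input_hash.encode('utf-8'))
    return hasher.hexdigest()  # short hexadecimal string identifying the node
```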

As soon as a node is stored, and thus becomes immutable, its hash is computed and stored as a node property, making it queryable. When the engine is asked to launch a new calculation, it first computes its hash and searches the database for an existing node with the same hash. If an identical calculation is found and caching is enabled, the engine simply clones the output nodes of the existing calculation and links them to the new calculation. This saves valuable computational resources and results in the same provenance graph as if the calculation had actually been run. Nevertheless, specific extra properties are added to indicate which calculation was used as the cache source, making it possible to identify cached calculations (mostly for debugging purposes).
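In practice, caching can be controlled per calculation type. The following usage sketch assumes the enable_caching context manager of aiida.manage.caching with the identifier keyword of recent 1.x releases and the arithmetic.add entry point; the code label is hypothetical:

```python
from aiida.engine import submit
from aiida.manage.caching import enable_caching
from aiida.orm import Int, load_code
from aiida.plugins import CalculationFactory

ArithmeticAddCalculation = CalculationFactory('arithmetic.add')

builder = ArithmeticAddCalculation.get_builder()
builder.x = Int(1)
builder.y = Int(2)
builder.code = load_code('add@localhost')  # hypothetical code label

# If a calculation with an identical hash already exists, its outputs are
# cloned and linked to the new calculation instead of rerunning the job.
with enable_caching(identifier='aiida.calculations:arithmetic.add'):
    node = submit(builder)
```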

The concept of caching is especially powerful when developing and running complex workflows consisting of many calculations. If any calculation fails, the workflow that launched it fails as well. If the error can be resolved (e.g., because it was due to a bug in the workflow), the workflow can be fixed and simply rerun from scratch: thanks to the caching mechanism, it will effectively continue from where it previously failed without repeating successful calculations.


