An Efficient Adaptive Partial Snapshot Implementation

The standard single-writer snapshot type allows processes to obtain a consistent snapshot of an array of n memory locations, each of which can be updated by one of n processes. In almost all algorithms, a Scan() operation returns a linearizable snapshot of the entire array. Under realistic assumptions, where hardware registers do not have the capacity to store many array entries, this inherently leads to a step complexity of Ω(n). In this paper, we consider an alternative version of the snapshot type, where a Scan() operation stores a consistent snapshot of all n memory locations, but does not return anything. Instead, a process can later observe the value of any component of that snapshot using a separate Observe() operation. This allows us to implement the type from fetch-and-increment and compare-and-swap objects, such that Scan() operations have constant step complexity and Update() and Observe() operations have step complexity O(log n).


INTRODUCTION
Taking linearizable snapshots of memory is one of the most fundamental and best studied problems in the area of concurrent shared memory algorithms. The problem can be described abstractly in terms of a snapshot type, which maintains an array of m memory locations, A[0 . . . m − 1]. It supports the operations Update(i, x), which writes value x to A[i], and Scan(), which returns a consistent view of the entire array. The first linearizable snapshot implementations have been known since the 1990's [1, 4].

PODC '21, July 26-30, 2021, Virtual Event, Italy.
Most research focuses on single-writer snapshot implementations, where m is equal to the number of processes, n, and each array index is associated with a unique process, which is the only one that is allowed to update the corresponding array entry. Attiya, Herlihy, and Rachman [9] showed that implementing single-writer snapshots reduces to solving lattice agreement. They presented a lattice agreement algorithm from registers that then yields a snapshot algorithm with subquadratic (in n) worst-case step complexity. By devising a faster lattice agreement algorithm, Inoue and Chen [14] obtained a single-writer snapshot algorithm with linear step complexity. It follows from a proof technique by Jayanti, Tan, and Toueg [17] that this is optimal. Aspnes, Attiya, Censor-Hillel, and Ellen [6] proved that it is possible to break through this barrier by restricting the number of operations. They gave a deterministic algorithm that has worst-case step complexity of O(log^3 n), as long as the number of operations on the object is polynomial in n. Subsequently, Aspnes and Censor-Hillel gave a randomized algorithm with poly-logarithmic expected step complexity [7], and Ahad Baig, Hendler, Milani, and Travers devised a deterministic algorithm with poly-logarithmic amortized worst-case step complexity [10].
Generally, snapshot implementations from registers are not practical: Almost all of them assume that a single hardware register can store the entire snapshot array. The only exceptions we are aware of either have exponential running time [4], or do not permit concurrent Scan() operations [18]. Therefore, snapshot implementations from stronger primitives have been considered: Using single-word compare-and-swap (CAS) and fetch-and-increment (FAI) objects, Riany, Shavit, and Touitou [19] devised a single-writer algorithm with step complexity O(1) for Update() and O(n) for Scan(). Jayanti [15] generalized this to the multi-writer case, achieving O(m) step complexity for Scan() operations, even without relying on FAI.
It is not hard to see that the Scan() operation of any snapshot object implemented from single-word objects must have step complexity Ω(m). Therefore, the standard snapshot specification leads to inherently inefficient Scan() operations for large values of m, and in particular in the common case of single-writer snapshots, where m = n. The specification does not allow for a faster performance, even if a process is only interested in the value of some but not all components of the snapshot object.

Result
To improve upon this inherent complexity barrier, we implement a natural extension of the single-writer snapshot: A Scan() operation takes a snapshot of the array A[0 . . . n − 1], but does not return anything. Instead, after taking a snapshot, a process can observe each individual component A[i] of that snapshot by calling Observe(i). (Hence, the entire snapshot can be obtained using n Observe() calls.) Using single-word registers, CAS, and FAI objects, we obtain an implementation where Scan() has step complexity O(1), and Update() and Observe() have step complexity O(log n).
Since our algorithm uses sequence numbers, only up to 2^(W−1) − 1 operations can be executed if each base object can store W bits. This is not a restriction that has any practical impact on current 64- or even 32-bit architectures.
Note that our snapshot algorithm is more flexible than a standard snapshot: It can simulate "full" single-writer standard snapshots (with a logarithmic performance penalty over the best known algorithms [15,19]), but can be much more efficient, if not all components of the snapshot need to be observed. Moreover, which components a process wants to observe can be decided adaptively after a process has taken a snapshot. For example, if a snapshot represents a data structure, the search path through the data structure can depend on the actual values found. The fact that Scan() operations have constant step complexity can be useful when a process would have to take a snapshot before it even knows for certain that it will need the snapshot later. This could, for example, be the case in applications that require backups or error recovery.

Related Solutions
Attiya, Guerraoui, and Ruppert [8] defined a partial snapshot type, where a process can choose to scan only some of the m array components. This can be considered an oblivious version of our specification, because processes must decide at the beginning of a Scan() operation which memory locations they are interested in. The authors provide a multi-writer algorithm from registers, CAS, and FAI objects, in which scanning r array components takes O(r^2) steps in the worst case. While Update() operations are not bounded wait-free, their amortized step complexity is bounded by the maximum interval contention, as well as the maximum number of components accessed by Scan() operations.
A more flexible multi-writer snapshot specification was proposed by Wei, Ben-David, Blelloch, Fatourou, Ruppert, and Sun [21]. Their object allows a process to take a snapshot of multiple CAS objects, and returns a handle to that snapshot. Using that handle, a process can later determine the value of any of those CAS objects at the point in time the snapshot was taken. They gave an implementation of that type from CAS objects, where each CAS() and snapshot operation takes a constant number of steps. However, the step complexity of reading a single snapshotted value of a memory location grows linearly with the number of updates that may have occurred on that location since the corresponding snapshot was taken. Hence, their algorithm is not bounded wait-free.
The authors use a version list for each CAS object, that stores the complete history of updates performed on the object. Each update is associated with a global sequence number, which is also stored in the version list. The sequence number is incremented with each Scan() operation and is returned as the handle. It can then be used to identify the latest update that was applied to a CAS object, using the version list.
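The version-list lookup described above can be sketched sequentially as follows. This is only a model of the semantics, with names of our choosing rather than those of [21]; it illustrates why reads of a snapshotted value are not bounded wait-free.

```python
class VersionedObject:
    """Each object keeps its full update history, tagged with global
    sequence numbers; a snapshot handle is just the clock value at scan
    time (a sequential sketch of the idea in [21], names are ours)."""

    def __init__(self, initial):
        self.versions = [(0, initial)]  # (seq_number, value), ascending

    def write(self, seq, value):
        self.versions.append((seq, value))

    def read_at(self, handle):
        # Walk backwards to the latest update with seq <= handle.  The
        # length of this walk grows with the number of updates since the
        # snapshot was taken, which is what makes reads unbounded.
        for seq, value in reversed(self.versions):
            if seq <= handle:
                return value

clock = 1
obj = VersionedObject("a")
clock += 1; obj.write(clock, "b")   # update at time 2
handle = clock; clock += 1          # scan: handle = 2
clock += 1; obj.write(clock, "c")   # later update at time 4
assert obj.read_at(handle) == "b"   # the snapshot still sees "b"
```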
Instead of version lists, our algorithm uses single-writer predecessor objects, which we implement from sequential persistent red-black trees [12]. This allows processes to find the latest value that was written to a component of the snapshot object in logarithmic time with respect to the total number of nodes stored in the red-black tree. We then devise a method for pruning outdated values from the data structure, in order to keep its size polynomial in n.
The idea of associating data structure modifications with timestamps has also been used in algorithms for software transactional memory [20] and multi-version databases [22]. These algorithms are either not wait-free or allow operations to fail. Moreover, the idea of using an FAI object as a global clock, linearizing operations at the time of their FAI() calls, is used in [5] to obtain a general method for adding range queries to data structures.
As far as we know, our algorithm is the only solution to the partial snapshot problem in which snapshots of o(n/log n) components can be obtained in o(n) time. A comparison of our implementation with the partial snapshot objects of [8, 21] is shown in Table 1.

PRELIMINARIES
We consider the standard asynchronous shared memory model with n processes with IDs 0, . . . , n − 1, which communicate using atomic (or linearizable) shared memory operations on base objects. We assume that each process's ID is stored in a process-local variable myID. Invocation and response of an operation op are denoted inv(op) and rsp(op), respectively.
The following base objects are relevant for our work: A read-write register supports two operations, Write(v), which changes its value to v and returns nothing, and Read(), which returns the value of the register. An LL/SC object provides two operations, LL() and SC(v). An LL() operation returns the object's value, and an SC(v) operation called by process p updates the value to v, if p has previously called LL() and no successful SC() operation has occurred since then. An SC() operation returns true if it succeeds in updating the object's value, and returns false otherwise. A FAI object stores an integer, initially 1, and provides an operation FAI(), which increments the object's value by 1 and returns the value before the increment. A CAS object provides an operation CAS(old, new). If the value of the object is old, this operation updates the value to new and returns true; otherwise, the object remains unchanged and the operation returns false.
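A sequential sketch of these primitives may help fix their interfaces. This only models the semantics of the base objects as specified above (the real objects are atomic hardware or linearizable implementations; the `links` bookkeeping for LL/SC is our own modelling device).

```python
class FAI:
    def __init__(self):
        self.value = 1            # initial value 1, as in the text
    def fai(self):
        old = self.value
        self.value += 1
        return old                # returns the value before the increment

class CAS:
    def __init__(self, v):
        self.value = v
    def cas(self, old, new):
        if self.value == old:     # update only if the current value is old
            self.value = new
            return True
        return False

class LLSC:
    def __init__(self, v):
        self.value = v
        self.links = set()        # process ids holding a valid link
    def ll(self, pid):
        self.links.add(pid)
        return self.value
    def sc(self, pid, v):
        if pid not in self.links: # no prior LL, or link broken by a success
            return False
        self.value = v
        self.links.clear()        # a successful SC invalidates all links
        return True

counter = FAI()
assert counter.fai() == 1         # first FAI returns the initial value 1
```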
Let D_val be the domain of values stored in a snapshot component. We assume that the system provides atomic registers, FAI, and CAS objects of word size W ≥ max{log |D_val|, 3 log n + c} bits for a sufficiently large constant c. In order to avoid having to deal with the ABA problem, we will use LL/SC objects instead of CAS objects. It is trivial to replace those LL/SC objects with CAS objects and sequence numbers. Alternatively, one can use the efficient construction of LL/SC from single-word CAS by Jayanti and Petrovic [16], which has O(1) step complexity. Due to [11] we may assume w.l.o.g. that LL/SC objects have arbitrarily large word size O(W).
Generally, we measure efficiency of wait-free shared memory algorithms in terms of the number of shared memory steps executed. Thus, step complexity ignores local computation. Time complexity, on the other hand, measures both local steps and shared memory steps of processes (assuming a standard word-RAM model).

Algorithm
Table 1: Comparison with the partial snapshot objects of [8, 21].

Algorithm   Reading r locations   Update      Base objects              Number of primitives
[8]         O(r^2)                Unbounded   n-word CAS and FAI        Unbounded
[21]        Unbounded             O(1)        single-word CAS           Unbounded
This work   O(r log n)            O(log n)    single-word CAS and FAI   O(n^3 log n)

All algorithms use sequence numbers that increment with update or scan operations, and store them in base objects. Therefore, these algorithms implicitly assume that the number of scan or update operations is bounded by 2^W, where W is the word size. It is safe to assume that this assumption will never be violated on current 64-bit architectures.

As a building block for our main algorithm, we use the destination array of Blelloch and Wei [11]. A destination array stores a sequence of n values, and supports the operations Read() and Copy(). Operation Read(i) takes as argument an integer i ∈ {1, ..., n}, and returns the value of the i-th component of the array. Operation Copy(R) takes as argument a reference R to a register; if process p calls that operation, it changes the value of the p-th component of the array to the value of register R. Blelloch and Wei [11] show that a linearizable and wait-free destination array can be implemented from O(n^2) single-word CAS objects and registers in such a way that each Read() and each Copy() operation can be executed in a constant number of steps.
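The destination-array interface can be modelled sequentially as follows. This only captures the semantics used later (Copy atomically snapshots a register's current value into the caller's slot); the actual wait-free O(1)-step construction of [11] is far more involved, and we pass the calling process's id explicitly for clarity.

```python
class Register:
    def __init__(self, v):
        self.value = v

class DestinationArray:
    """Sequential model of the destination array of Blelloch and Wei:
    n slots, Read(i) returns slot i, Copy(R) by process p atomically
    copies register R's current value into slot p."""

    def __init__(self, n):
        self.slots = [None] * n

    def read(self, i):
        return self.slots[i]

    def copy(self, p, R):
        self.slots[p] = R.value   # snapshot R's value into p's own slot

R = Register("root0")
A = DestinationArray(3)
A.copy(1, R)
R.value = "root1"                 # later writes to R do not affect the copy
assert A.read(1) == "root0"
```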
Specification of the New Algorithms. In this paper we present linearizable single-writer predecessor and adaptive partial snapshot objects. The former is used as a building block for the latter, but may be of independent interest. In the following we provide the sequential specifications of the underlying types.
The predecessor type maintains a set of pairs, each comprising a key and a value. The domain of keys must be totally ordered.
The predecessor type provides four operations. An Insert(k, v) operation inserts a pair with key k and value v, provided that the data structure does not contain a pair with key k. In that case, the Insert() operation succeeds and returns true; otherwise, it fails and returns false. Operations Remove(k), Pred(k), and Succ(k) each take a single argument, a key k. Remove(k) fails and returns false if the data structure contains no pair with key k; otherwise, it removes the pair with key k from the data structure. Pred(k) returns the pair with the greatest key smaller than k, and Succ(k) returns the pair with the smallest key larger than or equal to k; if no such pair exists, the operation fails and returns false. We call operations Insert() and Remove() update operations, and all other operations query operations.
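This sequential specification can be sketched directly in Python. The sorted-list representation below is only a stand-in for the balanced tree used later; following the specification, Pred()/Succ() return False when no qualifying pair exists.

```python
import bisect

class Predecessor:
    """Sequential sketch of the predecessor type: a set of (key, value)
    pairs over a totally ordered key domain."""

    def __init__(self):
        self.keys, self.vals = [], []        # parallel sorted lists

    def insert(self, k, v):
        i = bisect.bisect_left(self.keys, k)
        if i < len(self.keys) and self.keys[i] == k:
            return False                     # key already present: fail
        self.keys.insert(i, k)
        self.vals.insert(i, v)
        return True

    def remove(self, k):
        i = bisect.bisect_left(self.keys, k)
        if i == len(self.keys) or self.keys[i] != k:
            return False                     # no pair with key k: fail
        del self.keys[i]; del self.vals[i]
        return True

    def pred(self, k):
        i = bisect.bisect_left(self.keys, k)   # greatest key < k
        return (self.keys[i - 1], self.vals[i - 1]) if i > 0 else False

    def succ(self, k):
        i = bisect.bisect_left(self.keys, k)   # smallest key >= k
        return (self.keys[i], self.vals[i]) if i < len(self.keys) else False
```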
A predecessor object is single-writer, if there is only one dedicated process that is allowed to perform updates. We will consider single-writer predecessor objects with bounded capacity ∆, which informally means that at most ∆ elements can be stored in the data structure at any point. Since the object is single-writer, it is uniquely determined at the point of invocation of an update operation, whether that operation will be successful or not. (We call an incomplete update operation successful, if it must be successful in any extension of the execution in which it completes.) Bounded capacity ∆ formally means that at any point in time the number of invocations of successful Insert() operations minus the number of responses of successful Remove() operations is at most ∆.
The adaptive partial snapshots type stores an n-component array and supports three operations Update(v), Scan(), and Observe(k). An Update(v) operation called by process p changes the value of the p-th component to v, and returns nothing. Method Scan() does not return anything, and its behaviour is only defined in terms of method Observe(k). A process is only allowed to call an Observe(k) operation after it has performed at least one Scan() call, and an Observe(k) call by process p returns the value that the k-th component of the object had at the point of p's latest preceding Scan() operation.
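A sequential reference implementation of the adaptive partial snapshot specification is straightforward and may clarify the intended semantics: Scan() stores a private copy of the array for the calling process, and Observe(k) reads component k of that stored copy. (This sketch ignores concurrency entirely, which is exactly what the concurrent algorithm must resolve.)

```python
class AdaptiveSnapshot:
    """Sequential sketch of the adaptive partial snapshot type."""

    def __init__(self, n):
        self.array = [None] * n
        self.views = {}                       # per-process stored snapshot

    def update(self, p, v):
        self.array[p] = v                     # p writes its own component

    def scan(self, p):
        self.views[p] = list(self.array)      # store a snapshot; return nothing

    def observe(self, p, k):
        return self.views[p][k]               # requires a preceding scan by p

S = AdaptiveSnapshot(2)
S.update(0, "x")
S.scan(1)
S.update(0, "y")              # a later update is not visible ...
assert S.observe(1, 0) == "x" # ... to the earlier snapshot
```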

SINGLE-WRITER PREDECESSOR ALGORITHM
In this section we present our linearizable and wait-free single-writer implementation of the predecessor type from registers and CAS objects. First, we will show that the sequential balanced red-black tree of [12] can be used in a concurrent system (i.e., is linearizable), as long as there is only one process that performs update operations. This concurrent red-black tree almost immediately yields a single-writer predecessor object: it natively supports the operations Insert() and Remove(), and adding operations Pred() and Succ() is straightforward. However, in infinite executions, this algorithm may need an unbounded number of nodes. In Section 3.2 we will use memory reclamation to bound the space of our concurrent data structure.

The Basic Algorithm
Driscoll, Sarnak, Sleator, and Tarjan [12] present a technique called node-copying to make linked data structures persistent. We first describe a basic version of this technique that can be applied to any binary search tree (BST) implementation in which each node stores only pointers to its children (i.e., there are no parent pointers): A dedicated variable R stores a pointer to the root r of the tree. An update operation does not modify any nodes of the data structure. Instead, it adds copies of all nodes that need to be modified, as well as a new root r′, and finally changes the pointer R so that it points to r′ instead of r. To be more precise, suppose the set of nodes reachable from r forms a conventional BST T1. Let T2 be the BST obtained by applying a conventional update operation to T1 (e.g., an insertion). Let S be the set of nodes in T2 that are added or modified by this update operation, and S′ the parent-closure of S (i.e., if v is in S′, then the parent of v is also in S′). Instead of modifying the nodes in S, we create a copy v′ of each node v in S′. Each field of v′ has the same value as the corresponding field in v, except that a pointer to a node u in S′ is replaced with a pointer to the copy u′ of that node. Since S′ is parent-closed, the root r is in S′ and is copied into a node r′. It is easy to see that the nodes reachable from r′ now form a BST that is equivalent to T2. Hence, to complete the update operation, it suffices to replace the value of R with a pointer to r′.
To perform a query operation on the persistent data structure, a process simply reads the pointer R to obtain a root r , and then performs the same operations as it would in the conventional BST algorithm, using r as a root.
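The node-copying technique can be sketched for a plain (unbalanced) BST insert; the balanced red-black version of [12] copies nodes in exactly the same way, only along with rebalancing. Nodes are immutable, the nodes on the search path are copied, and the operation is published by replacing the root pointer R.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Return a new root; the old tree stays intact and readable.
    Only the nodes on the search path (the parent-closure of the
    modified set) are copied."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    else:
        return Node(root.key, root.left, insert(root.right, key))

def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

R = None                     # the shared root pointer
R = insert(R, 5)
old = R                      # a query that read R before the next update ...
R = insert(R, 3)
assert contains(R, 3)
assert not contains(old, 3)  # ... still sees the unchanged old tree
```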
For concreteness, we will now consider a red-black tree [13]. Driscoll, Sarnak, Sleator, and Tarjan have applied the node-copying technique described above to that data structure to obtain a persistent red-black tree, where each of the operations Insert(), Remove(), and Find() takes O(log m) steps [12], where m is the number of elements stored in the tree. It is straightforward to augment the data structure with query operations Pred() and Succ() so that all operations have time complexity O(log m). Thus, we obtain an implementation of a persistent sequential predecessor type with the same asymptotic time complexity.
We can now use that persistent predecessor implementation in a shared memory system, by storing R and each node of the data structure in an atomic register. We will allow only one process, p_w, to perform update operations, but all processes are allowed to execute query operations. Observe that if at some point t pointer R points to a root r in the data structure, all nodes reachable from r form the BST that was obtained as a result of the update operation that wrote the pointer to r into R. None of these nodes can change after point t. Hence, if a query operation reads the pointer to r from R, then it will visit exactly the same nodes that would be visited in the sequential case. Similarly, an update operation by process p_w initially makes no changes to any reachable nodes, and all tree modifications become visible to other processes only when p_w changes the root pointer, R, to point to the new root copy that p_w created. Hence, it is easy to see that each update operation can linearize with the write to R, and each query operation can linearize with the read of R. It follows that this object is linearizable, provided that only one process performs update operations (in fact, it is linearizable as long as no two update operations are concurrent).

Recycling Outdated Nodes
During each update operation on the concurrent red-black tree described in the previous section, p_w makes copies of up to Θ(log m) nodes [12], and needs to allocate space for the registers storing them. Thus, in unbounded executions, an infinite number of registers is needed. In the following, we apply a memory reclamation technique to bound the space. We add a Recycle() method to the object, whose purpose is to remove those nodes from the tree that cannot be accessed anymore by any process. The registers storing these nodes' information can then be reused for future nodes.
We will only consider single-writer predecessor objects with bounded capacity ∆. In our snapshot application, ∆ ≤ 3n.
For ease of discussion and readability of the pseudocode (see Algorithm 1), we assume that p_w has access to a method Allocate(), which allocates a new node and returns a reference to it, and a method Deallocate(x), which deallocates a node x that p_w previously allocated. We will show that at any point in time, there are at most λ = O(n∆ log ∆) nodes that p_w has allocated but not deallocated. Thus, implementing methods Allocate() and Deallocate() with bounded memory is straightforward, by having p_w maintain a local pool of λ registers, one for each node. Our algorithm guarantees that when a process accesses a node v, then at that point v has been allocated but not yet deallocated.
Consider an execution. Let r_0 be the initial root of the red-black tree pointed to by R, and let r_i be the root pointed to by R after the i-th update operation. Let reachable(r_i) denote the set of nodes reachable from r_i. A node v is outdated at point t, if at that point R points to a root r_k, and there exists i ∈ {0, . . . , k − 1} such that v ∈ reachable(r_i) \ reachable(r_k). Outdated nodes are candidates for deallocation. However, it is possible that an outdated node v may still be accessed by a process q, if v is reachable from a root r, and R pointed to r when q read that pointer at the beginning of its query operation. To prevent v from being deallocated in such a situation, process q initially protects a root r, when it reads the pointer to r from R at the beginning of its query operation. It does so by storing the address of r in a register protectedRoot[q], where protectedRoot is an array with one entry for each process. We say a node is protected, if it is reachable from a root stored in some array entry protectedRoot[j], j ∈ {0, . . . , n − 1}. The Recycle() method will then deallocate only nodes that are outdated but not protected.
Using only registers, it is difficult to protect nodes this way: Process q may fall asleep immediately after reading the address of the current root r from R, and then only wake up again when r has already been deallocated. It is then too late to protect r by writing it into protectedRoot[q]. Sophisticated techniques to deal with this in constant time using only registers have been described in [2, 3]. An easier way is to use a destination array for protectedRoot. (The main motivation for the definition of the destination array in [11] was to solve the same type of "protection" problem.) At the beginning of its query operation, process q calls protectedRoot[q].Copy(R). This copies the address of the root r pointed to by R into protectedRoot[q], and thus protects all nodes reachable from r. As mentioned above, in a Recycle() call, p_w now only needs to find all outdated nodes (i.e., those that are not reachable from the current root pointed to by R), and deallocate those that are not protected.
It remains to show that this can be done without increasing the worst-case time complexity of update and query operations. For query operations this is trivial, as only one Copy() operation is added.
We now show how to modify update operations, and how to implement the Recycle() method. The updater, process p_w, maintains a (local) list outdated of outdated nodes that have not yet been deallocated. Initially, outdated is an empty list. Suppose that at the end of an update operation by process p_w, the root pointed to by R changes from r to r′. Then p_w computes reachable(r) \ reachable(r′) and adds all nodes in that set to outdated. Note that each node in reachable(r) \ reachable(r′) is in the parent closure of the set of nodes that get modified by a standard red-black tree update operation. Since that red-black tree contains at most ∆ nodes, it is easy to see that

|reachable(r) \ reachable(r′)| ≤ c log ∆    (1)

for some constant c. Moreover, the set reachable(r) \ reachable(r′) can be computed in time O(log ∆).
Every n∆ update operations, process p_w begins a new Recycle() operation. The total number of steps required to complete such a Recycle() call will be O(n∆ log ∆), and with each update operation, p_w contributes O(log ∆) steps to an ongoing Recycle() call. In a Recycle() call, p_w first renames outdated to outdated′, and then sets outdated to an empty list. (This way, future update operations will fill the list outdated again, while the Recycle() call can process outdated′. And if, while that Recycle() call is ongoing, a process p starts protecting a new node v by copying its root to protectedRoot[p], then v will not be in outdated′, and thus will not get deallocated.) Then, p_w initializes a local empty list protected, reads the root stored in protectedRoot[i] for each i ∈ {0, . . . , n − 1}, and for each such root r adds reachable(r) to protected. Finally, p_w deallocates all nodes in outdated′ that are not in protected, and appends list protected to outdated, so that the protected nodes can be reconsidered in the next Recycle() operation.
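The writer-local bookkeeping of Recycle() can be sketched as follows. Node identities and reachability are abstracted into explicit sets (a stand-in for traversing the tree), and Deallocate() is passed in as a callback; this models the logic only, not the step-by-step interleaving with updates.

```python
def recycle(outdated, protected_roots, reachable, deallocate):
    """One Recycle() pass, as described in the text.
    outdated: nodes no longer reachable from the current root.
    protected_roots: contents of protectedRoot[0..n-1] (None for empty).
    reachable: maps a root to the set of nodes reachable from it."""
    outdated_old = outdated          # rename outdated to outdated'
    outdated = []                    # future updates fill the fresh list
    protected = set()
    for r in protected_roots:
        if r is not None:
            protected |= reachable[r]
    for x in outdated_old:
        if x not in protected:
            deallocate(x)            # safe: outdated and unprotected
    outdated.extend(protected)       # reconsider protected nodes next time
    return outdated

freed = []
out = recycle([1, 2, 3], ["r"], {"r": {2}}, freed.append)
assert sorted(freed) == [1, 3]       # node 2 is protected, so it survives
assert out == [2]
```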
Analysis. At any point in time the predecessor object stores a set of size at most ∆. Hence, from any root stored in protectedRoot[i], at most ∆ nodes can be reached, and thus

|protected| ≤ ∆ · n,    (2)

and protected can be computed in time O(|protected|) = O(∆n).
We will now argue that

|outdated| ≤ (c log ∆ + 1)n∆    (3)

at all times, where c is the constant from (1). Clearly, this is true initially and at the beginning of each Recycle() call, when outdated is set to an empty list. In the interval during which a Recycle() call completes (or until the first Recycle() call is invoked), n∆ update operations are executed. Hence, by (1), at most cn∆ log ∆ nodes become outdated. In addition, by (2), at the end of the Recycle() call at most ∆n nodes are added from protected to outdated. Thus, once the Recycle() call terminates, |outdated| ≤ ∆n + cn∆ log ∆ = (c log ∆ + 1)n∆, and thus (3) holds. By definition, each node in the data structure is either outdated or reachable from the root pointed to by R. Since at most ∆ nodes are reachable from that root, it follows from (3) that at any point p_w needs to have only λ = O(n∆ log ∆) nodes allocated. Thus, a pool P_λ of λ registers suffices to store all nodes.
Since p_w needs only λ nodes in its entire pool (of unallocated and allocated nodes), it is easy to see that p_w can compute the set difference S_1 \ S_2 of two sets S_1, S_2 ⊆ P_λ (given as linked lists) in time O(λ) = O(n∆ log ∆) using the standard lookup-table technique. In particular, at the end of its Recycle() method, process p_w can compute all nodes that are in outdated′ but not in protected in time O(n∆ log ∆).
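The lookup-table set difference is elementary; a sketch, with nodes represented by indices into the fixed pool:

```python
def set_difference(s1, s2, pool_size):
    """Nodes in s1 but not in s2; both given as lists of pool indices.
    One boolean lookup table over the pool replaces hashing, so the
    whole pass runs in O(pool_size + |s1| + |s2|) time."""
    in_s2 = [False] * pool_size
    for x in s2:
        in_s2[x] = True
    return [x for x in s1 if not in_s2[x]]

assert set_difference([0, 3, 4], [3, 7], 8) == [0, 4]
```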
To summarize, a complete execution of the Recycle() method takes O(n∆ log ∆) time. In each update operation, process p w computes the set of O(log ∆) nodes that have become unreachable (see (1)) in O(log ∆) time. Then it contributes sufficiently many steps towards a (new or ongoing) Recycle() operation, so that the Recycle() method completes during n∆ update operations by p w . Since the Recycle() method takes O(n∆ log ∆) time, there is a constant κ, such that p w needs to contribute at most κ log ∆ steps to the Recycle() method during each update. Hence, each update operation takes time O(log ∆), and thus it also comprises only O(log ∆) shared memory steps.
As discussed, the algorithm needs to store λ = O(n∆ log ∆) nodes in registers. In addition it requires a destination array of size n, which can be implemented from O(n^2) registers and CAS objects (see Section 2).

Theorem 1. A wait-free linearizable single-writer predecessor object with bounded capacity ∆ can be implemented from O(n∆ log ∆ + n^2) single-word CAS objects and registers, such that each update and query operation has time and step complexity O(log ∆).
A full proof is omitted due to space constraints.

Algorithm 1: Single-writer predecessor object with node recycling.

Shared:
    Register R = ⊥
    Destination Array protectedRoot
Local for Writer:
    List outdated

Function Update operation (Insert() or Remove()):
    Let r be the root pointed to by R
    Perform the corresponding update operation on R as in the sequential implementation in [12], allocating new nodes using method Allocate().
    Let r′ be the new root pointed to by R
    Add all nodes in reachable(r) \ reachable(r′) to outdated.

Function Query operation (Pred() or Succ()):
    protectedRoot.Copy(R)
    Perform the corresponding query operation on protectedRoot.Read(myID) as in the sequential implementation in [12]

Function Recycle():
    Rename outdated to outdated′
    Let outdated and protected be new empty lists
    for i ∈ {0, . . . , n − 1} do
        r ← protectedRoot.Read(i)
        if r ≠ ⊥ then
            protected ← protected ∪ reachable(r)
    For each node x in outdated′ such that x ∉ protected, perform Deallocate(x).
    Append protected to outdated

Update() by process p adds an element to versions[p], and the time complexity of Update() and Observe() operations is dominated by operations on the predecessor objects. This basic algorithm never removes elements from the predecessor objects. In order to obtain our desired space and step complexity bounds, we will later show how we can prune outdated elements from the predecessor objects. This will allow us to use predecessor objects of bounded capacity ∆ = 3n. Algorithm 2 uses a shared FAI object clk in addition to the predecessor objects versions[p] for each process p. The value of clk is incremented at least once for each Update() and Scan() operation. Each Update() and Scan() operation is associated with exactly one such increment (even though multiple increments can happen during such an operation execution), and the operation linearizes at the point of that increment. That is, each Update() and Scan() operation is associated with a unique value x, where clk is incremented from x to x + 1 at the linearization point of that operation.
We will first discuss a very simple high-level idea, which leads to an incorrect algorithm, and then show how to fix it in order to arrive at Algorithm 2. In an Update(val) operation, the calling process, p, increments the FAI object clk from x to x + 1, and then inserts the pair (x, val) into its predecessor object versions[p]. In a Scan() operation, a process q increments clk and records the fetched value in lastScan[q]. (For this simple algorithm, lastScan[q] can be a local variable. But for the advanced algorithm we will need it to be an LL/SC object.) In a later Observe() operation, q can then determine the last value val that was written to component p of the array, by calling versions[p].Pred(sTime), where sTime is the value of lastScan[q]. This call returns the pair (x, val) stored in the predecessor object, such that x is the largest key less than or equal to sTime.
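The intended sequential behaviour of this simple scheme can be sketched as follows. In this sketch the FAI and the insert into the version map happen atomically together, which is exactly what a concurrent execution cannot guarantee.

```python
import bisect

class SimpleSnapshot:
    """Sequential sketch of the simple (not yet linearizable) idea:
    updates are keyed by FAI timestamps, scans remember a timestamp,
    and Observe answers with a predecessor query."""

    def __init__(self, n):
        self.clk = 1
        self.keys = [[] for _ in range(n)]   # ascending FAI timestamps
        self.vals = [[] for _ in range(n)]
        self.last_scan = {}

    def update(self, p, val):
        x = self.clk; self.clk += 1          # the FAI() step
        self.keys[p].append(x)               # timestamps arrive in order
        self.vals[p].append(val)

    def scan(self, q):
        self.last_scan[q] = self.clk         # remember sTime
        self.clk += 1

    def observe(self, q, p):
        s_time = self.last_scan[q]
        i = bisect.bisect_right(self.keys[p], s_time)  # largest key <= sTime
        return self.vals[p][i - 1] if i > 0 else None

S = SimpleSnapshot(2)
S.update(0, "a")              # timestamp 1
S.scan(1)                     # sTime = 2
S.update(0, "b")              # timestamp 3, after the scan
assert S.observe(1, 0) == "a"
```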
However, this simple approach is not linearizable: Suppose that during its Update(val), process p executes FAI(), but then before it inserts the pair (x, val) into its predecessor object versions[p], some process q performs a Scan() operation, during which it fetches x ′ > x from clk. After that, q calls Observe(p), and determines the predecessor of x ′ in versions[p]. Whether or not q observes the value val depends on when p inserts the pair (x, val) into versions[p]. Thus, the relative order of p's Update(val) and q's Scan() would now have to be determined by the execution of q's Observe(p) call relative to p's insert of (x, val) into the predecessor object. In particular, q may perform two subsequent Observe(p) calls that return different values, which is incorrect.
To deal with that problem, we let any process calling Observe(p) help process p linearize a possibly pending Update() operation. This is facilitated by method HelpUpdate() and an auxiliary LL/SC object lastUpdate[p] for each process p. To indicate that process p has started performing Update(val), it writes the pair (⊥, val) into that variable (using a pair of LL() and SC() operations). A process q that wants to observe the value of p's array component first checks lastUpdate[p]. If q finds a pair (⊥, val) there, then p has a pending Update(val) operation that has not yet incremented clk. Therefore, during its Observe() operation, q increments clk itself from x to x + 1 and tries to communicate that to other processes by trying to store the pair (x, val) into lastUpdate[p], using an SC() operation. If successful, then p's Update(val) operation can linearize with q's increment of clk. And otherwise, some other process has already helped p's Update() linearize.

Pruning Predecessor Objects
In Algorithm 2, the number of pairs stored in the predecessor objects grows with the number of Update() operations, so the algorithm needs predecessor objects with unbounded capacity.
In order to be able to use predecessor objects with bounded capacity, we remove unnecessary elements. The corresponding pseudo-code is shown in Algorithm 3. We will need an auxiliary method HelpScan(q). Its implementation guarantees that if a Scan() call by some process q has linearized at the invocation of a HelpScan(q) call, then by the time the HelpScan(q) call responds, the value of clk that is associated with the Scan() operation will have been written to lastScan[q]. This helping mechanism is implemented in essentially the same way as that of HelpUpdate().
We now describe how a process p can remove pairs from its predecessor object, versions[p], that are no longer needed. To that end, p computes a set of required keys, which it will not remove from versions[p]. Consider an arbitrary point t in the execution. A pair (k, v) can be safely removed after point t if no Observe(p) operation that responds after t needs to return (k, v). Conversely, a key is required if it may, at any point after t, be the predecessor of some value that is then stored in lastScan[q] of some process q.
Process p can compute the set of required keys as follows: First, at some point t_0 it determines the largest key, maxKey, stored in versions[p]. Observe that clk > maxKey at that point, and also at any later point. Next, process p calls HelpScan(q) for each process q, and then reads a value v_q from lastScan[q]. Let V be the set of all values v_q ≠ ⊥. It is not hard to see that at any point after p has read v_q from lastScan[q], the value of that object is either still v_q, or ⊥, or an integer of value at least that of clk.
Since clk > maxKey at any point after t_0, it suffices if the set of required keys contains at least all predecessors of values larger than maxKey, and of the keys in V. Let S be the set of all predecessors of values in V that are in versions[p]. Since maxKey is in versions[p], the predecessors of values larger than maxKey have a value of at least maxKey. Hence, p determines the set of required keys as the set of keys k, where k ∈ S or k ≥ maxKey.
To prune its predecessor object, a process p can call method Prune(). In lines 62-67 of that method, p computes the value maxKey and the set S exactly as described above. Then, as indicated in the last line of the method, p removes all keys from versions[p] that are not in S and that are smaller than maxKey.
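The pruning step above can be sketched in Python as follows. This is a sequential illustration under simplifying assumptions: the sorted-list predecessor object is a stand-in, the line numbers in comments refer to the paper's Algorithm 3, and the HelpScan() calls are elided because in a single-threaded sketch lastScan already holds final values.

```python
import bisect, math

class PredecessorObject:
    """Sorted-list stand-in supporting insert, remove, and pred(k)."""
    def __init__(self):
        self.keys, self.vals = [], []

    def insert(self, key, val):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.vals.insert(i, val)

    def remove(self, key):
        i = bisect.bisect_left(self.keys, key)
        del self.keys[i], self.vals[i]

    def pred(self, k):
        i = bisect.bisect_right(self.keys, k)
        return (self.keys[i - 1], self.vals[i - 1]) if i > 0 else None

N = 2
clk = 0
versions = [PredecessorObject() for _ in range(N)]
lastScan = [None] * N

def fai():
    global clk
    x = clk
    clk += 1
    return x

def prune(p):
    """Prune() sketch: maxKey = Pred(infinity) (the paper's line 62),
    S = predecessors of all announced scan times (lines 63-67, with
    HelpScan() elided here), then remove every key that is neither
    in S nor >= maxKey."""
    max_key = versions[p].pred(math.inf)[0]
    S = set()
    for q in range(N):
        v = lastScan[q]
        if v is not None:
            res = versions[p].pred(v)
            if res is not None:
                S.add(res[0])
    for k in list(versions[p].keys):
        if k not in S and k < max_key:
            versions[p].remove(k)

for val in 'abc':                 # process 0 updates with keys 0, 1, 2
    versions[0].insert(fai(), val)
lastScan[1] = fai()               # process 1 scans at time 3
versions[0].insert(fai(), 'd')    # one more update, key 4 = maxKey

prune(0)
print(versions[0].keys)           # key 2 serves the scan; key 4 is maxKey
```

After pruning, an Observe(0) by process 1 still finds the correct predecessor of its scan time, while the keys that no pending or future observation can need are gone.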
Clearly, the step complexity of method Prune() is super-linear in n. But we can distribute the total work of a Prune() call over n Update() calls. We maintain the invariant that p's predecessor object contains at most 3n pairs at any point in time, and thus we can use a predecessor object with bounded capacity ∆ = 3n. Once the predecessor object contains 2n elements, process p begins distributing the total work of a single Prune() call over its next n Update() operations. This way, p can remove at least n elements from versions[p], while n new elements are added.
We will now argue that the total amount of work of each Prune() call is O(n log n). Thus, to preserve the O(log n) step complexity for all snapshot operations, it suffices if during each Update(), p devotes O(log n) steps to this recycling method.
Let K be the set of keys with value at most maxKey that are in versions[p] at any point after t_0, which is when p determines maxKey. Clearly, no such keys are added to this set after t_0, so |K| = O(n). Process p can find all keys in K by first determining the successor of -1, which is the smallest such key, and then following the chain of successors, until maxKey is reached. Since the step complexity of computing a successor in versions[p] is O(log n), the total work complexity of O(n log n) for Prune() follows.
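The successor-chain walk can be sketched as follows. The succ(k) operation (smallest key greater than k) is assumed to be supported by the predecessor object, as the complexity argument above requires; the sorted-list implementation is again only a stand-in.

```python
import bisect

class PredecessorObject:
    """Sorted-list stand-in supporting succ(k): the smallest key > k."""
    def __init__(self, keys=()):
        self.keys = sorted(keys)

    def succ(self, k):
        i = bisect.bisect_right(self.keys, k)
        return self.keys[i] if i < len(self.keys) else None

def keys_up_to(po, max_key):
    """Walk the successor chain starting from -1 until max_key is
    reached.  Each step costs O(log n) in the real predecessor object,
    so visiting the O(n) keys in K costs O(n log n) in total."""
    out, k = [], po.succ(-1)
    while k is not None and k <= max_key:
        out.append(k)
        k = po.succ(k)
    return out

po = PredecessorObject([2, 5, 9, 14])
print(keys_up_to(po, 9))   # [2, 5, 9]
```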

Linearizability Proof
In the following we will prove that any execution E is linearizable. Since our algorithm is wait-free, we assume w.l.o.g. that in E all operations complete.

Proof. We prove this claim for the case that op is an Update() operation. A symmetric proof applies when op is a Scan().
Let q be the process calling help(op). By definition, op linearizes when q increments clk.

Assume that process p executes a Scan() and later completes an Observe(q) operation ob that returns a value v. (Recall that p is not allowed to call Observe() until it has completed a Scan().) Then the sequential specification dictates that either no Update() by process q linearizes before ob and v = ⊥, or the latest Update() by process q that linearizes before ob uses parameter v. In fact, since neither Update() nor Scan() returns anything, execution E is linearizable if this is true for every Observe() operation. We will show in Section 4.3.4 that this is the case. To facilitate that proof, we will first show in Section 4.3.3 that if a process q executes an Update(u) operation, then once it has added u to versions[q], it does not remove u from that predecessor object until all Observe() operations that might have to return u (for linearizability) have responded.

The next claim shows that if a Scan() operation sc by p linearizes before the invocation of a HelpScan(p) operation hs by q, then val(sc) is written to lastScan[p] before rsp(hs). Therefore, if q reads lastScan[p] after it performed hs, then it either reads val(sc) or a value that is written to lastScan[p] by p during a Scan() operation that linearizes after inv(hs). This shows that the HelpScan() method prevents processes from reading outdated values from lastScan[0 . . . n − 1].

Claim 8. Suppose process p executes a Scan() operation sc, and process q executes a HelpScan(p) operation hs, such that lin(sc) < inv(hs). Then t′(sc) < rsp(hs).

Proof. For the purpose of a contradiction, assume t′(sc) ≥ rsp(hs). By definition and Lemma 3 (a), lastScan[p] = ⊥ throughout (t(sc), t′(sc)). Also, by Claim 4, t(sc) < lin(sc) < t′(sc). Therefore, lastScan[p] = ⊥ throughout the execution of hs.
So process q reads ⊥ from lastScan[p] in line 56 of hs, and changes lastScan[p] in line 59 of hs to a positive integer fetched from clk. This contradicts the fact that lastScan[p] = ⊥ throughout the execution of hs. □

Finally, we use the above to prove that elements added to versions[0 . . . n − 1] are not removed prematurely, i.e., our memory reclamation works correctly. Suppose a process p adds a pair (key, v) to versions[p] during an Update(v) operation. The memory reclamation scheme ensures that the pair remains in the predecessor object until every Observe() operation that may need to return v has completed. Note that an Observe() operation by process q needs to return v if q's latest preceding Scan() linearizes after p's Update(v) and no Update() by p linearizes in between.

We will prove by contradiction that no such pair exists. Suppose at point t* there is a pair E in versions[p], such that val(up) < E.key < val(sc). By Claim 7, this pair is inserted into versions[p] during an Update() operation up′ by p, such that E.key = val(up′). Therefore, val(up) < val(up′) < val(sc). By Claim 6, lin(up) < lin(up′) < lin(sc). This contradicts the fact that no Update() operation by p linearizes between lin(up) and lin(sc). □

Lemma 10. Suppose process p executes an Update() operation up and inserts a pair (key, v) into versions[p] in line 25. Let q be a process and sc a Scan() operation by q, such that lin(up) < lin(sc), and no Update() operation by p linearizes between lin(up) and lin(sc). Let t* > max(lin(sc), rsp(up)), such that q invokes no Scan() in [rsp(sc), t*]. Then (key, v) is in versions[p] at t*.
Proof. Let t_1 be the point in time at which p inserts (key, v) into versions[p]. We need to prove that this pair is not removed from versions[p] throughout (t_1, t*]. For the purpose of contradiction, assume p removes (key, v) from versions[p] in line 72 of a Prune() operation pr at a point t_rmv ≤ t*.
Let t_3 be the point when p executes its versions[p].Pred(∞) operation in line 62 of its Prune() call pr, and let key′ be the key that this operation returns, and which it assigns to maxKey. Process p removes pair (key, v) from versions[p] in line 72, and thus key < key′.
Let t_2 < t_3 be the point when p inserts the pair with key key′ into versions[p]. Then p executes a versions[p].Insert() operation in line 25 at point t_2, during an Update() operation up′. Since key < key′, it follows from Claims 6 and 7 that p executes up before up′. By Lemma 2 (b), inv(up′) < lin(up′) < t_2 < t_3. Since up is p's last update that linearizes before sc, we conclude lin(sc) < lin(up′) < t_3.
Let hs be the HelpScan(q) operation that p executes in line 64 of pr. Recall that t_3 is the point when p executes line 62 of pr, and thus t_3 < inv(hs). Since lin(sc) < t_3, we obtain lin(sc) < inv(hs). Therefore, by Claim 8, t′(sc) < rsp(hs).
By definition, lastScan[q] = val(sc) at point t′(sc). Since there is no Scan() by q throughout (rsp(sc), t*], by Lemma 3 (d), lastScan[q] = val(sc) throughout (t′(sc), t*]. By the assumption that t_rmv ≤ t*, and, as shown above, t′(sc) < rsp(hs), lastScan[q] = val(sc) throughout (rsp(hs), t_rmv). Note that p calls hs in line 64 of pr.

Proof. In line 33 of ob, process p reads a value x from lastUpdate[q].key at t_1, and in line 34 it reads a value y from lastScan[p]. For the purpose of contradiction, assume p executes line 36 of ob. Therefore, the if-statement in line 35 evaluates to true. Then x ≠ ⊥ and x < y. By Observation 11, y = val(sc). We will now show that x = val(up), and then arrive at a contradiction.
By Lemma 2, the value of lastUpdate[q].key is ⊥ throughout (t(up), t′(up)), and changes to val(up) at t′(up). Since x ≠ ⊥, t_1 ∉ (t(up), t′(up)). Because t(up) < t_1, it follows that t′(up) < t_1. Also, by Lemma 2 (d), lastUpdate[q].key remains val(up) until process q performs lastUpdate[q].SC(⊥) in line 22 of a later Update() operation up′ at point t(up′). Then, because up is the last Update() by q satisfying t(up) < t_1, we obtain x = val(up). Since lin(sc) < lin(up), it follows from Claim 6 that y = val(sc) < val(up) = x. This is a contradiction. □

For the proof of linearizability, we will use the following claim. The proof is omitted due to space restrictions.

Claim 13. Suppose process p executes a Scan() operation sc and later an Observe(q) operation ob, and no other Scan() between sc and ob. Let t_1 be the point at which p performs lastUpdate[q].LL() in line 33 during ob. Let up be the latest Update() operation by q, such that lin(up) < lin(sc). Then t′(up) < t_1.
The following lemma proves that the return values of Observe() operations are correct, and thus the execution is linearizable.

Lemma 14. Suppose process p performs a Scan() operation sc, and later an Observe(q) operation ob, and no other Scan() between sc and ob.
(a) If process q performs no Update() operation that linearizes before lin(sc), then ob returns ⊥; (b) Otherwise, if process q's latest Update() that linearizes before lin(sc) uses argument v, then ob returns v.
Proof. We will only prove part (b); the proof of part (a) is very similar, and omitted due to space restrictions. Let up be process q's latest Update() operation that linearizes before lin(sc); by the assumption of part (b), up is an Update(v) operation.
Let t_1 be the point in time at which process p performs lastUpdate[q].LL() in line 33 of ob. Process p reads a value (x, v′) from lastUpdate[q] at t_1 in line 33, and then reads a value y from lastScan[p] in line 34. By Claim 13, t′(up) < t_1. Also, by Observation 11, y = val(sc). We consider two cases.
Case 1: lastUpdate[q] does not change throughout (t′(up), t_1]. Then (x, v′) = (val(up), v). Since lin(up) < lin(sc), it follows from Claim 6 that x < y. So the if-statement in line 35 of ob evaluates to true and process p returns v in line 36.
Case 2: lastUpdate[q] changes during (t′(up), t_1]. Let up* be the last Update() operation by q, such that t(up*) < t_1. Then up ≠ up*, because otherwise, by Lemma 2 (d), lastUpdate[q] could not change throughout (t′(up), t_1]. Therefore, up happens before up*. Since up is the last Update() operation by q such that lin(up) < lin(sc), it follows that lin(sc) < lin(up*). So by Lemma 12, process p does not execute line 36 of ob, and hence executes line 38 of ob. Since process q performs lastUpdate[q].SC() in line 22 of up* before t_1, we have inv(up*) < t_1. Hence, rsp(up) < t_1. Therefore, by Observation 9 and Lemma 10, process p returns v. □

Our algorithm uses n single-writer predecessor objects of bounded capacity ∆ = 3n. By Theorem 1, these can be obtained from a total of O(n^3 log n) base objects. In addition, we use 2n O(1)-word LL/SC objects, which can be implemented from O(n^3) single-word CAS objects and registers [11]. Finally, we need a single FAI object. Thus, in total O(n^3 log n) base objects suffice. □

CONCLUSION
In this paper we presented a powerful variant of the single-writer snapshot type, which allows a process to adaptively read a consistent view of k memory components in O(k log n) steps. Contrary to most fast snapshot solutions, our algorithm does not make unrealistic assumptions about the size of base objects. Instead, it employs powerful synchronization primitives, which are readily available in most common hardware architectures. To achieve our result we implement as a building block a bounded memory single-writer predecessor object, which may be of independent interest. Our algorithm uses unbounded sequence numbers, which has no practical impact on modern 64-bit architectures. However, from a theoretical point of view, this is not satisfying. But we believe that our unbounded sequence numbers can be replaced with a bounded timestamp system. Unfortunately, no practical bounded timestamp systems are known that could replace the unbounded FAI in our algorithm, without significantly reducing efficiency.
An important open problem is to generalize our algorithm to a multi-writer version, or to allow snapshots of stronger primitives, such as CAS objects (similar to [21]).