Introduction

In this post, I will attempt to present the problem of proving set membership in zero-knowledge — proving that an element is part of a large public set without disclosing which element — while discussing the main existing solutions, with their pros and cons, and a recent paper on this topic.

The post is intended for researchers and practitioners interested in the topic of zero-knowledge, and it tries to convey high-level explanations while sometimes turning into the mathematical details of the solutions. For the latter, I assume familiarity with basic computer science concepts like binary trees and computational complexity, as well as with some discrete maths like the notions of finite rings and groups.

Set Membership

Set membership is the problem of checking whether an element $x$ belongs to some, possibly public, set $S$. It arises in a large variety of contexts, mostly in applications that have large lists of data, where checking if an element is in the list can be very costly, or in those applications that require some form of privacy assurance either on the set, $S$, or on the element, $x$. One example is with financial regulation, where a bank must prove to the regulator that a new client is a citizen of the given country. In order to do this, the bank will prove that the new client belongs to the set of all citizens of that country. In this setting, the list itself may be public (at least to the banks and regulators), but the bank may not want to reveal the specific client it is checking membership for. This implies that the bank would like to use some zero-knowledge version of the system, to prove to the regulator the set-membership without revealing the specific query.

Another example is where a company wants to prove to their investors that they belong to the list of certified ecological companies. In this case, however, there is seemingly no need for privacy. Indeed, there are plenty of other examples where set-membership appears in business and governamental-level applications.

More recently, this problem has also emerged in the context of blockchain, mainly in cryptocurrency designs. In Bitcoin for example, the blockchain is supposed to maintain the so-called set of “unspent transaction outputs” (UTXO, in short). In a nutshell, the UTXO set contains all the coins that have not been already spent and therefore are eligible to be spent in a new transaction. In this scenario, validating a transaction that wants to spend (or consume) a coin $x$ involves checking that $x$ is in the set UTXO. The same setting is given in the Zcash cryptocurrency, but in this case, the element $x$ must actually be hidden from the blockchain, making the set-membership proof zero knowledge.

When looking at applications of set membership there are at least two intriguing questions that arise:

  1. Scalability: Is it possible to check that an element $x$ is part of a set $S$ without having to store $S$ and by spending time significantly faster than its size $|S|$?
  2. Privacy: Is it possible to check that $x$ is in $S$ for an unknown element $x$?

If we leave them as they are, the questions above look impossible to solve, but if make a twist to the problem we can make the impossible possible. The twist is: let us assume an asymmetry of roles. There is one party, the prover, for which we are not concerned about achieving scalability and privacy. Namely, the prover knows $x$ and $S$ and can afford computational and storage costs proportional to $|S|$. Everybody else are instead the verifiers, who can invest only small computing and storage resources, and against which we want to achieve privacy (namely, they should not learn $x$).

Adopting this asymmetric setting, then the idea to solve the problem is to let the prover compute and send a short proof about $x \in S$ to the verifiers, who can then check the proof in short time.

In what follows I will first review some solutions that solve scalability, and then I will discuss techniques that enable extensions that achieve also privacy.

Verifying Set Membership, Efficiently

Solutions to this problem date back to 1980 and 1994 when Merkle trees, by Ralph Merkle and cryptographic accumulators respectively were proposed.

In its basic form, an accumulator can be seen as a triple of algorithms $(\textsf{Acc}, \textsf{Prove}, \textsf{Verify})$ with the following functionality:

  • $A \leftarrow \textsf{Acc}(S)$ compresses a set of values $S$ into a short value $A$, the accumulator.
  • $\pi_{x} \leftarrow \textsf{Prove}(S, x)$ generates a membership proof $\pi_x$ about $x \in S$.
  • $\textsf{Verify}(A, x, \pi_{x})$ accepts or rejects the proof by using only knowledge of the accumulated set $A$.

To assure verifiers against malicious provers, accumulators come with the guarantee that creating false proofs (i.e., a proof $\pi^{*}$ that is accepted by $\textsf{Verify}(\textsf{Acc}(S), x, \pi^{*})$ for an $x \notin S$) is impossible under certain computational assumptions.

If this is the functionality provided by accumulators, we are then left to discuss the main question of this section: why accumulators enable efficient verification of set membership?

This is achieved thanks to a key property of them: the size of accumulator values $A$, proofs $\pi_{x}$ and the running time of $\textsf{Verify}$ are much smaller than the length of $S$; for example they can be logarithmic $O(\log |S|)$ or even constant (note: in the literature, with the term “accumulators” one typically refers to schemes with constant-size proofs and verification.)

One detail that I kept hidden from the description above is that all algorithms actually take an additional input, which is a public set of parameters that are generated in a probabilistic and trusted way.

Nowadays, we know several realizations of accumulators. Nevertheless, here I want to mention two popular realizations: Merkle trees and RSA Accumulators. Interestingly enough, they are not only the oldest proposals but also the only ones that by now achieve all desired properties by using constant-size public parameters.
At this point it is worth mentioning that they offer different efficiency tradeoffs: proofs size and verification time are $O(\log |S|)$ in Merkle trees, and $O(1)$ in RSA accumulators.

Let us briefly review how these two accumulators constructions work.

Merkle Trees

With a Merkle tree one can accumulate sets made of arbitrary elements using a collision-resistant hash function. We refer to this resource or this one for a more detailed explanation, but the basic idea is the following.

The public parameters simply consist of an hash function (e.g., SHA-256).
In order to accumulate a set $S = \{x_1, \ldots, x_n\}$ (let us assume for simplicity that $n$ is a power of $2$), one builds a binary tree in which $ \{x_1, \ldots, x_n\}$ are the leaves, and every internal node is the hash of its two children. The accumulator value $A$ is then the value at the root of such tree.
To create a proof $\pi_j$ that a certain $x_j \in S$, one returns all the sibling nodes that are in the path from the leaf $x_j$ until the root.
From the binary tree structure one then gets that these proofs consist of $\log(n)$ strings of $\ell$ bits each (where $\ell$ is the hash’s output bit length), and can be verified by using the siblings to recompute the nodes in the path and then check if the final result is equal to the root $A$.

Merkle trees are a very powerful construct, with countless applications and a lot of nice properties. They are actually more powerful than accumulators; they realize a so-called vector commitment, since the position of each leaf in the tree is also encoded in a binding way, in the sense that it is computationally impossible to claim two distinct values at the same position.

RSA Accumulators

With RSA Accumulators one can accumulate sets made of prime numbers. The main ingredient of these constructions are groups of unknown order, of which RSA groups (from which the name) or class groups are candidate realizations. Let us show how this works in RSA groups.

The public parameters consist of an RSA modulus $N=p \cdot q$ product of two large prime numbers $p$ and $q$, and of a random generator $G \in \mathbb{Z}_{N}^{*}$. Here, $\mathbb{Z}_{N}$ denotes the ring of non-negative integers modulo $N$—the set $\{0, 1, \ldots, N-1\}$—while $\mathbb{Z}^{*}_{N}$ is the subset of elements in $\mathbb{Z}_{N}$ that are also coprime with $N$, which form a group under multiplication.

In order to accumulate a set $S$ of prime numbers $\{e_1,\ldots e_{n}\}$, one computes
$$A \leftarrow G^{\prod_{i=1}^{n} e_i} \bmod N.$$
To create a proof that a certain $e_j \in S$, one computes
$$\pi_j \leftarrow G^{\prod_{i=1,i\neq j}^{n} e_i} \bmod N = A^{1/e_j} \bmod N$$
which can be verified by checking if
$$\pi_j^{e_j} = A \bmod N$$

Overall, RSA accumulators are a quite simple and elegant construction. While the basic construction above has the limitation that elements must be prime numbers, this can be solved by using appropriately constructed hash functions that map arbitrary strings into primes. Compared to Merkle trees, they have the appealing property that everything is constant-size: both $A$ and the proof $\pi_j$ consist of one group element each, and verification requires one modular exponentiation only.
On top of this, they enjoy further properties. For example, it is possible to efficiently add elements to the accumulator value (i.e. it is dynamic), to create proofs of non-membership (it is universal), and to create short proofs for membership of many elements. See for example this recent work by Boneh, Bunz and Fisch to read about these and more properties.

Verifying Set Membership, Efficiently and Privately

We have seen how accumulators can provide a solution to the problem of proving and verifying set membership in an efficient manner.
Let us now move to the question of how to make these proofs also privacy-preserving, namely how to prove that $x \in S$ without revealing $x$.

Actually, in most applications, the privacy requirement about $x$ typically goes together with the need of proving more than just membership. One may wish to prove that a property $P(x)$ holds for some element $x \in S$, without revealing exactly for which element.
In other (more cryptographic) words, one is interested in making a zero-knowledge proof (ZKP) for the NP relation $R(S, x) := x\in S \wedge P(x)$, where $S$ is public and $x$ is secret.
A ZKP for $R$ however is not necessarily “efficient” in the same sense as discussed earlier since the verifier should read $S$, and the proof itself may not be succinct.

To make the proof also short and efficient to verify, a solution is to mix ZKPs with accumulators, that is to build a ZKP for the following relation
$$R(A, (x, \pi_x)) := \textsf{Verify}(A, x, \pi_{x}) \wedge P(x)$$
that, in words, proves existence of an element $x$ for which $P$ holds and for which there is a valid accumulator proof relative to $A$, which in turn implies membership in the set accumulated in $A$.

This blueprint can be applied to both Merkle trees and RSA accumulators.
In the rest of this post, we review the main existing solutions that follow these two approaches, and then we conclude by mentioning a recent work.

ZKPs for Set Membership via Merkle Trees

ZKPs for set membership via Merkle trees are a straightforward application of the ZKP & Accumulators mixing approach mentioned above.
Notably, this idea has been implemented in the Zcash protocol, which used the general-purpose power of zkSNARKs in order to prove in zero-knowledge existence of a valid Merkle path (in addition to other properties modeling the validity of a transaction). This in turn translates into proving correctness of about $\log |S|$ hash computations on secret inputs. Encoding hash computations in the zkSNARK is what makes this approach particularly expensive for the prover. Zcash’s Sapling provided significant speedups of this approach via an ingenious choice of a pairing-friendly elliptic curve and the Pedersen hash function. Nevertheless, proving set membership is by now the most expensive part of proof generation in Zcash.

ZKPs for Set Membership via RSA Accumulators

For the case of RSA Accumulators, a notable work is that of Camenisch and Lysyanskaya who designed a ZKP protocol for the knowledge of an integer $e$ that is in the accumulator and that is committed in a group $\mathbb{G}$ of prime order $q$. More technically, given an accumulator $A$ and a Pedersen commitment $C_{e}\in \mathbb{G}$ as public values, they show how to prove knowledge of $(e, W, r)$ such that
$$A = W^{e} \bmod N \wedge C_e = g^{e} h^{r} \in \mathbb{G}$$
Having the commitment $C_{e}$ comes in handy if the final goal is to prove some property $P$ about the set element $e$: one simply creates another commit-and-prove ZKP for proving knowledge of $(e, r)$ such that $C_e = g^{e} h^{r} \in \mathbb{G}$ and $P(e)$ holds.

The Camenisch-Lysyanskaya protocol is potentially more efficient than the one based on Merkle trees as it does not require expensive general-purpose ZKPs that encode hash computations. Nevertheless, its use can be cumbersome due to some technical details. Most notably, the accumulated sets must be prime numbers of a specific size (which also imposes a minimum size for the prime-order group), and the hash-to-prime trick to avoid this problem cannot be used straightforwardly to mitigate this issue.

ZKPs for Set Membership: Efficient, Succinct, Modular

In the last part of this post, I would like to mention a recent work [BCFK19] that investigates modular and efficient constructions of ZKPs for set membership.

Modularity in ZKP Design

One is often confronted with creating ZKPs for “composite statements”, e.g., statements of the form “I know a value $x$ that belongs to a set $S$ and for which properties $P_1(x)$ and $P_2(x)$ hold”. In such a case, it can be convenient to create a proof by using three different proof systems, one for each subtask, e.g., $\Pi_{mem}$ for set membership, $\Pi_1$ for $P_1$, $\Pi_2$ for $P_2$.
This modular approach can be beneficial in multiple ways: one can focus on designing efficiency-optimized ZKPs for specific tasks, the same scheme can be re-used and replaced in a plug-and-play fashion, and it is just a simple design paradigm.

Technically, this approach can be realized by using commit-and-prove ZKP systems.
In a nutshell, a proof system $\Pi$ for a property $P$ is commit-and-prove if it can prove statements of the form “I know $x$ such that $P(x)$ holds and $C_{x}$ is a commitment to $x$”. Since commitments are binding, generating two such proofs with respect to the same commitment immediately implies an AND composition of the two proven statement. Essentially, commitments act as a “secure glue” between different proof systems.
While commit-and-prove has been known and used extensively in cryptography constructions, the recent LegoSNARK paper studied this paradigm in the context of succinct ZKPs, aka zkSNARKs.

Modular in ZKP for Set-Membership

The [BCFK19] paper starts from the observation that an accumulator value can be seen as a commitment to a set, and that the whole accumulator primitive can be seen as a succinct commit-and-prove system for set membership relations. In this sense, accumulators with ZK proofs of knowledge (i.e., that can prove “I know a valid accumulator proof for an $x$ in the commitment”) can be seen as commit-and-prove zkSNARK for set membership relations involving two types of commitments, a commitment to a set—the accumulator—and a commitment to an element.

So, one theoretical contribution of the paper is to extend the model of commit-and-prove zkSNARKs (and their composability properties) to the setting of typed-commitments, namely commitments to messages of different types (e.g., strings and sets over strings).

More Flexible and Modular ZKPs for RSA Accumulators

On the more practical side, [BCFK19] proposes new commit-and-prove zkSNARKs for set-membership, notably two solutions based on RSA accumulators that can be combined modularly and efficiently with other popular ZKP systems, such as Bulletproofs or Groth16. The latter feature is useful since it allows one to prove statements of the form “I know a value $x$ that belongs to a set $S$ and for which property $P(x)$ holds'”, by using this specialized scheme for the set-membership part and a general-purpose one for any other property about $x$.

Going more into the detail, [BCFK19] proposes two solutions based on RSA.
Both of them provide a functionality similar to the ZKP of Camenisch and Lysyanskaya mentioned earlier, in the sense that the element $x$ object of the membership proof is committed in a Pedersen commitment $C_x$ over a group $\mathbb{G}$ of prime order $q$.
One key difference that helps efficient modularity however is that in [BCFK19] $\mathbb{G}$ can be of standard size (e.g., 256 bits for 128 bits of security). This, we recall, means that other commit-and-prove systems that want to use $C_x$ do not need to suffer efficiency slowdowns due to inflated sizes of $\mathbb{G}$.
In terms of supported sets, both solutions in [BCFK19] allow more flexible choices of sets than the Camenisch-Lysyanskaya protocol: the first scheme supports the accumulation of sets whose elements are arbitrary binary strings, while in the second scheme elements are prime numbers of exactly $\mu$ bits (for various flexible choices of $\mu$).

Deep Diving into Linking Pedersen Commitments in Different Groups

Let us go even deeper and see how [BCFK19] achieves these improvements. In a nutshell, this is due to a new way to link a proof of membership for RSA accumulators to a Pedersen commitment in a prime order group, together with a careful analysis showing how this can be secure under parameters not requiring a larger prime order group.
For this post we only summarize the main idea for the scheme supporting sets of primes; we refer the interested readers to the paper for further details.

Let us recall the setting. The statement known to the verifier consists of an accumulator $A = G^{\prod_{i=1}^{n} e_i} \bmod N$, which is a commitment to a set $S = \{e_1, \ldots, e_n\}$, and of a Pedersen commitment $C_{e}$ in a group $\mathbb{G}$ of prime order $q$.
The goal of the prover is to argue knowledge of $(e, r)$ that open $C_e$, i.e., $C_{e} = g^{e} h^{r}$, and such that $e \in S$ where $A = \textsf{Acc}(S)$. This is achieved through a combination of the following:

  • $C^{*}_e$, a commitment to $e$ created by the prover in the RSA group: $C^{*}_{e} = G^{e} H^{s} \bmod N$.
  • $\Pi_{root}$, a ZKP of a committed root for $A$, i.e., a proof of knowledge of $e, s$ and $W$ such that
    $$W^{e} = A \bmod N \quad \textrm{and} \quad C^{*}_{e} = G^{e} H^{s} \bmod N.$$
  • $\Pi_{modeq}$, a ZKP that $C^{*}_{e}$ and $C_{e}$ commit to the same value modulo $q$.
  • $\Pi_{range}$, a ZKP that $C_{e}$ commits to an integer in the range $(2^{\mu-1}, 2^{\mu})$.

Intuitively, $\Pi_{root}$ shows that $C^*_{e}$ commits to an integer that is accumulated in $A$ (at this point, however, such integer may be a trivial root, i.e., $1$). The goal of $\Pi_{modeq}$ and $\Pi_{range}$ is to rule out this corner case $e=1$ and to “securely link” the commitment $C^*_e$ in the RSA group created by the prover with the prime-order group commitment $C_e$ known to the verifier, namely to ensure that these two commitment open to the same integer $e$.
This is less straightforward than expected because $\Pi_{modeq}$ is only able to prove that the equality of the values committed in $C^*_e$ and $C_e$ holds modulo $q$ and not necessarily over the integers. Also, from $\Pi_{root}$ alone, one can only infer that $C^*_{e}$ commits to some integer $e^*$ that divides $\prod_{i=1}^{n} e_i$.

So, the bad case that malicious provers may trigger and that the protocol must exclude is that $e^*$ and $e^* \bmod q$ are different over the integers. This can be shown with a quite detailed analysis whose basic idea is to use the fact that $\Pi_{root}$ nevertheless guarantees that $e^*$ can be the product of only few primes in $S$ (say 1, 2, depending on the difference between $\mu$ and $|q|$), and that $\Pi_{range}$ ensures that $e^* \bmod q$ lies in the range $(2^{\mu-1}, 2^{\mu})$.

To summarize, this technique enables linking the commitments in the RSA and prime-order group in an efficient way, and this enables to use this ZKP for set membership in combination with other SNARKs that are instantiated over the same prime-order group.

Conclusion

To conclude, in this post we went on a journey about the set membership problem. We started from seminal works (Merkle trees and RSA accumulators) that provide a solution to the scalability problem of set membership. Next, we went on to discuss solutions that allow one to maintain the privacy of the element involved in the set membership statement — a problem that is key in a number of applications from different domains, including finance, and business- or governmental-level interaction. Given the emergence of ZKP applications in the real-world, we expect that new solutions to the problem of privacy-preserving set membership will come out.