a16z: Exploring the Path to Secure and Efficient zkVMs

Reprinted from Jinse Finance
03/12/2025 · Author: Justin Thaler · Source: a16z · Translation: Shan Oppa, Golden Finance
Zero-knowledge virtual machines (zkVMs) aim to "democratize SNARKs", letting people with no SNARK expertise prove that they ran a program correctly on a given input (or witness). Their core strength is developer experience, but zkVMs currently face major challenges in both security and performance. If zkVMs are to deliver on their promise, designers must overcome these obstacles. This article lays out the likely stages of zkVM development; the whole process may take years to complete, and don't believe anyone who tells you it will happen quickly.
Challenges
On security, zkVMs are highly complex software projects that remain riddled with vulnerabilities.
On performance, proving that a program executed correctly can be hundreds of thousands of times slower than running it natively, making most real-world applications infeasible to deploy for now.
Nevertheless, many voices in the blockchain industry promote the idea that zkVMs can be deployed immediately, and some projects are already paying high computational costs to generate zero-knowledge proofs of on-chain activity. But because zkVMs still contain many vulnerabilities, this practice is really just an expensive way to make a system look as if it is protected by SNARKs, when in fact it is either relying on permissioned controls or, worse, exposed to attack.
The reality is that we are still years away from a truly secure and efficient zkVM. This article proposes a series of concrete, phased goals to help track real progress in zkVMs, deflate the hype, and focus the community on genuine technical breakthroughs.
Stages of zkVM security
Background
SNARK-based zkVMs typically contain two core components:
1. Polynomial interactive oracle proof (PIOP): an interactive proof framework for proving statements about polynomials (or about constraints derived from them).
2. Polynomial commitment scheme (PCS): ensures that the prover cannot lie about polynomial evaluations without being detected.
The zkVM encodes valid execution traces as a constraint system, which enforces correct use of the virtual machine's registers and memory, and then uses a SNARK to prove that these constraints are satisfied.
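To make that division of labor concrete, here is a minimal sketch of these interfaces in Rust. The trait and method names are illustrative assumptions for this article, not the API of any particular zkVM.

```rust
// Illustrative interfaces only; real zkVMs differ in many details.

/// Polynomial commitment scheme: binds the prover to a polynomial so that
/// claimed evaluations can later be checked against the commitment.
pub trait PolynomialCommitmentScheme {
    type Field;
    type Polynomial;
    type Commitment;
    type OpeningProof;

    fn commit(&self, poly: &Self::Polynomial) -> Self::Commitment;
    fn open(
        &self,
        poly: &Self::Polynomial,
        point: &Self::Field,
    ) -> (Self::Field, Self::OpeningProof);
    fn verify_opening(
        &self,
        commitment: &Self::Commitment,
        point: &Self::Field,
        claimed_eval: &Self::Field,
        proof: &Self::OpeningProof,
    ) -> bool;
}

/// The zkVM at the highest level: the prover runs the bytecode, records the
/// execution trace (registers, memory, program counter at every cycle),
/// encodes the trace as polynomials satisfying the VM's constraint system,
/// and runs the PIOP over polynomial commitments; the verifier checks the
/// resulting succinct proof.
pub trait ZkVm {
    type Proof;

    fn prove(&self, bytecode: &[u8], input: &[u8]) -> Self::Proof;
    fn verify(&self, bytecode: &[u8], input: &[u8], proof: &Self::Proof) -> bool;
}
```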
In a system this complex, the only way to ensure the zkVM is free of bugs is formal verification. Below are the stages of zkVM security: Stage 1 concerns protocol correctness, while Stages 2 and 3 concern implementation correctness.
Security Stage 1: A correct protocol
- A formal proof of the soundness of the PIOP;
- A formal proof that the PCS is binding under some cryptographic assumption or idealized model;
- If Fiat-Shamir is used, a formal proof that the succinct argument obtained by combining the PIOP and the PCS is secure in the random oracle model (augmented with other cryptographic assumptions as needed);
- A formal proof that the constraint system the PIOP is applied to is equivalent to the semantics of the VM;
- A formal proof that all of these pieces "glue" together into a single, provably secure SNARK for running any program specified by the VM bytecode. If the protocol is meant to be zero-knowledge, that property must also be formally verified, ensuring that no sensitive information about the witness is revealed.
If the zkVM uses recursion, every PIOP, commitment scheme, and constraint system involved in the recursion must also be verified; otherwise this stage cannot be considered complete.
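For intuition, the Fiat-Shamir step mentioned above replaces the verifier's random challenges with hashes of the transcript so far. Below is a toy sketch of that idea using the `sha2` crate; the transcript layout is invented for illustration and is not how any production zkVM serializes its protocol messages.

```rust
use sha2::{Digest, Sha256};

/// Toy Fiat-Shamir transcript: every prover message is absorbed into a
/// running byte string, and each "random" verifier challenge is derived by
/// hashing everything absorbed so far. Security rests on modeling the hash
/// as a random oracle, exactly the idealization whose gap with real hash
/// functions this article flags as an open risk.
struct Transcript {
    state: Vec<u8>,
}

impl Transcript {
    fn new(domain_separator: &[u8]) -> Self {
        Transcript { state: domain_separator.to_vec() }
    }

    /// Absorb a prover message (e.g. a serialized polynomial commitment).
    fn absorb(&mut self, label: &[u8], message: &[u8]) {
        self.state.extend_from_slice(label);
        self.state.extend_from_slice(message);
    }

    /// Derive a challenge deterministically from the transcript so far.
    fn challenge(&mut self, label: &[u8]) -> [u8; 32] {
        let mut hasher = Sha256::new();
        hasher.update(&self.state);
        hasher.update(label);
        let digest = hasher.finalize();
        let mut out = [0u8; 32];
        out.copy_from_slice(&digest);
        // Feed the challenge back in so later challenges depend on it too.
        self.state.extend_from_slice(&out);
        out
    }
}

fn main() {
    let mut t = Transcript::new(b"example-zkvm-proof");
    t.absorb(b"trace-commitment", b"...bytes of a polynomial commitment...");
    let c = t.challenge(b"evaluation-point");
    println!("first challenge: {:02x?}", &c[..8]);
}
```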
Security Stage 2: A correct verifier implementation
This stage requires formally verifying that the actual implementation of the zkVM verifier (in Rust, Solidity, and so on) matches the protocol verified in Stage 1. Completing it means the zkVM implementation matches the theoretical design, rather than being only a secure protocol on paper or an inefficient specification written in a language such as Lean.
There are two reasons to focus only on the verifier and not the prover. First, getting the verifier right is enough for soundness of the zkVM proof system (that is, it ensures the verifier cannot be tricked into accepting a proof of a false statement). Second, a zkVM's verifier implementation is more than an order of magnitude simpler than its prover implementation, so the verifier's correctness is easier to guarantee in the near term.
Security Stage 3: A correct prover implementation
This stage requires formally verifying that the actual implementation of the zkVM prover correctly generates proofs for the proof system verified in Stages 1 and 2. The goal here is completeness: any system using the zkVM should never get stuck because it cannot prove a true statement. If the zkVM is meant to be zero-knowledge, there must also be a formal proof that the generated proofs reveal no information about the witness.
Estimated timeline
Stage 1 progress: We can expect some progress next year (ZKLib, for example, is one such effort), but no zkVM will fully satisfy Stage 1 for at least two years.
Stages 2 and 3: These can advance in parallel with certain aspects of Stage 1. For example, some teams have shown that a Plonk verifier implementation matches the protocol in its paper (even though the protocol itself may not yet be fully verified). Even so, I don't expect any zkVM to reach Stage 3 in less than four years, and possibly longer.
Key caveats: Fiat-Shamir security and verified bytecode
A major complication is that open research questions remain about the security of the Fiat-Shamir transform. All three security stages treat Fiat-Shamir and random oracles as perfectly secure, but in reality the whole paradigm may harbor vulnerabilities, owing to the gap between the idealized random oracle model and the hash functions actually used.
In the worst case, a system that has reached Security Stage 2 could later be found completely insecure because of Fiat-Shamir-related issues. This deserves serious attention and continued research, and we may need to modify the Fiat-Shamir transform itself to better defend against such vulnerabilities.
Systems that do not use recursion are in principle safer, because some known attacks involve circuits resembling those used in recursive proofs. But the risk remains a fundamental, unresolved problem.
Another caveat: even if a zkVM proves that a program (specified by its bytecode) executed correctly, the proof is of limited value if the bytecode itself is flawed. The practical value of zkVMs therefore depends heavily on how formally verified bytecode is produced, a huge challenge that is beyond the scope of this article.
On quantum security
Quantum computers will not pose a serious threat for at least five years (and likely longer), whereas software vulnerabilities are a matter of life and death today. The priority should therefore be meeting the security and performance goals set out in this article. If non-quantum-safe SNARKs can meet those goals faster, we should use them, and switch once quantum-resistant SNARKs catch up, or once there are credible signs that practically threatening quantum computers are imminent.
Concrete security levels
100 bits of classical security is the bare minimum for any SNARK protecting valuable assets (and some systems still fail to clear even this low bar). Even so, it should not be accepted as the norm: standard cryptographic practice calls for 128 bits of security or more. And if SNARK performance were truly up to par, we would not be trading away security to gain performance.
Stages of zkVM performance
Current situation
Today, the computational overhead of a zkVM prover is roughly one million times that of native execution. In other words, if a program takes X CPU cycles to run natively, generating a proof that it ran correctly takes roughly X × 1,000,000 CPU cycles. This was true a year ago, and it remains true today (despite some misconceptions).
Some popular claims circulating in the industry include:
1. "Generating proofs for all of Ethereum mainnet costs less than $1 million per year."
2. "We have nearly achieved real-time proof generation for Ethereum blocks, using only a few dozen GPUs."
3. "Our latest zkVM is 1,000 times faster than the previous generation."
Without context, however, these claims are misleading:
• Being 1,000× faster than an older zkVM can still mean being very slow; it says more about how bad things were than about how good they are now.
• Ethereum mainnet's computational throughput may grow 10× in the future, which would leave today's zkVM performance far behind demand.
• So-called "near real-time" proof generation is still too slow for many blockchain applications (Optimism's 2-second block time, for example, is much faster than Ethereum's 12 seconds).
• "Dozens of GPUs running reliably 24/7" does not provide adequate liveness guarantees.
• The quoted proof-generation times are usually for proofs larger than 1 MB, which is too big for many applications.
• "Less than $1 million per year" is only because an Ethereum full node performs only about $25 worth of computation per year.
For applications outside of blockchain, this overhead is simply too high; no amount of parallelism or engineering optimization can make up for computational overhead of this magnitude.
A reasonable first target is prover overhead of no more than 100,000× native execution, and even that is only a first step. Truly large-scale, mainstream adoption will likely require overhead of 10,000× native execution or less.
Performance measurement
SNARK performance has three main components:
1. The inherent efficiency of the underlying proof system.
2. Application-specific optimizations (such as precompiles).
3. Engineering and hardware acceleration (GPUs, FPGAs, or multi-core CPUs).
While (2) and (3) matter for real deployments, they apply to any proof system and therefore say little about fundamental overhead. For example, adding GPU acceleration and precompiles to a zkEVM can easily yield a 50× speedup over a CPU-only implementation without them, which can make an inherently less efficient system look better than one that simply hasn't received the same optimizations.
This article therefore focuses on measuring the fundamental performance of SNARKs without dedicated hardware or precompiles. That differs from current benchmarking practice, which typically rolls all three factors into a single headline number. It is like judging diamonds by how long they were polished rather than by their intrinsic clarity.
The goal is to isolate the inherent overhead of general-purpose proof systems, lower the barrier to entry for approaches that have not yet received that engineering attention, and help the community cut through distractions and focus on genuine progress in proof-system design.
The performance stages
Here are the milestones for the five performance stages I propose. Prover overhead on CPUs must first come down dramatically before we lean further on hardware to reduce it; memory usage must also improve.
At every stage, developers must not have to tailor their code to the zkVM's performance characteristics. Developer experience is the zkVM's core advantage; sacrificing DevEx to hit performance benchmarks defeats the purpose of benchmarking and betrays the point of zkVMs in the first place.
These metrics focus primarily on prover cost. However, if verifier cost is allowed to grow without bound (that is, unbounded proof size or verification time), any prover metric can be met trivially. So for the stages below to be meaningful, a maximum proof size and a maximum verification time must also be specified.
Stage 1 requirements: "reasonable, non-trivial verification costs"
• Proof size: must be smaller than the witness.
• Verification time: verifying a proof must be no slower than natively executing the program (i.e., no slower than just doing the computation).
These are minimal succinctness requirements, ensuring that proof size and verification time are no worse than sending the witness to the verifier and having it check the computation directly; a sketch of this check follows below.
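The check, restated as a simple predicate (the field names are hypothetical, purely for illustration):

```rust
/// Hypothetical record of one zkVM benchmark run.
struct BenchmarkRun {
    witness_size_bytes: u64,
    proof_size_bytes: u64,
    native_execution_time_ms: f64,
    verification_time_ms: f64,
}

/// "Reasonable, non-trivial verification costs": the proof must be smaller
/// than the witness, and verifying it must be no slower than natively
/// re-executing the computation.
fn has_nontrivial_verification_cost(run: &BenchmarkRun) -> bool {
    run.proof_size_bytes < run.witness_size_bytes
        && run.verification_time_ms <= run.native_execution_time_ms
}
```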
Stages 2 and beyond
• Maximum proof size: 256 KB.
• Maximum verification time: 16 milliseconds.
These caps are deliberately loose, to accommodate novel fast-proving techniques even if they come with higher verification costs. At the same time, they rule out proofs so expensive that few projects would be willing to use them on a blockchain.
Speed Stage 1
Single-threaded proving must be at most 100,000× slower than native execution, measured across a range of applications (not just Ethereum block proving), and must not rely on precompiles.
Concretely, if a RISC-V processor in a modern laptop runs at about 3 billion cycles per second, reaching Stage 1 means the laptop can prove (single-threaded) about 30,000 RISC-V cycles per second.
Verifier costs must meet the "reasonable, non-trivial verification costs" standard defined above.
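The arithmetic behind the 30,000-cycle figure, under the article's assumptions of a 3 GHz-equivalent RISC-V rate and a 100,000× overhead cap:

```rust
fn main() {
    // Assumptions stated in the article: a modern laptop executes RISC-V at
    // roughly 3 billion cycles per second, and Speed Stage 1 caps
    // single-threaded prover overhead at 100,000x native execution.
    let native_cycles_per_second: u64 = 3_000_000_000;
    let max_overhead: u64 = 100_000;

    // Guest cycles the prover can cover per second of (single-threaded) proving.
    let proved_cycles_per_second = native_cycles_per_second / max_overhead;
    assert_eq!(proved_cycles_per_second, 30_000);
    println!("Speed Stage 1 throughput: {proved_cycles_per_second} RISC-V cycles/s");
}
```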
Speed Stage 2
Single-threaded proving must be at most 10,000× slower than native execution.
Alternatively, because some promising SNARK approaches (binary-field SNARKs in particular) are handicapped by current CPUs and GPUs, this stage can instead be met using FPGAs (or even ASICs):
1. Count the number of RISC-V cores an FPGA can emulate at native speed.
2. Count the number of FPGAs needed to both emulate RISC-V and prove its execution in (near) real time.
3. If the number in (2) is at most 10,000 times the number in (1), Speed Stage 2 is satisfied (see the sketch below).
• Proof size: at most 256 KB.
• Verification time: at most 16 milliseconds on a standard CPU.
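A predicate form of that FPGA criterion; the parameter names are mine, not part of any standard benchmark:

```rust
/// Speed Stage 2 via FPGAs: `cores_emulated_per_fpga` is the number of
/// RISC-V cores one FPGA can emulate at native speed, and
/// `fpgas_to_emulate_and_prove` is the number of FPGAs needed to both
/// emulate that execution and prove it in near real time.
fn meets_speed_stage_2_fpga(
    cores_emulated_per_fpga: u64,
    fpgas_to_emulate_and_prove: u64,
) -> bool {
    // The stage is met if emulating-and-proving in near real time needs at
    // most 10,000x the quantity measured for native-speed emulation.
    fpgas_to_emulate_and_prove <= 10_000 * cores_emulated_per_fpga
}
```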
Speed Stage 3
Building on Speed Stage 2, achieve proving overhead below 1,000× (again across a range of applications), while allowing only automatically synthesized and formally verified precompiles. Essentially, each program's instruction set is customized on the fly to speed up proof generation, but ease of use and formal verification must be preserved. (See the next section on why precompiles are a double-edged sword and why hand-written precompiles are not a sustainable approach.)
Memory Stage 1
Achieve Speed Stage 1 with less than 2 GB of prover memory, while also being zero-knowledge. This stage is critical for mobile devices and browsers, and it opens the door to a wide range of client-side zkVM use cases, such as smartphones proving statements for location privacy or identity credentials. If proof generation requires more than 1-2 GB of memory, most mobile devices cannot run it.
Two important notes:
1. Even for large computations (requiring trillions of CPU cycles natively), the proof system must stay within the 2 GB memory cap, or its applicability is limited.
2. Staying under 2 GB of memory is easy if proving is allowed to be extremely slow. So for Memory Stage 1 to be meaningful, Speed Stage 1 must be reached within the 2 GB limit.
Memory Stage 2
Achieve Speed Stage 1 with less than 200 MB of prover memory (a 10× improvement over Memory Stage 1).
Why push down to 200 MB? Consider a non-blockchain scenario: when you visit an HTTPS website, certificates for authentication and encryption are downloaded. If websites instead sent zk proofs of the validity of those certificates, a large site might need to generate millions of proofs per second. If each proof requires 2 GB of memory, the total memory requirement reaches the petabyte level, which is clearly infeasible. Further reducing memory usage is therefore crucial for non-blockchain applications.
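A back-of-the-envelope version of that argument (the one-million-proofs-per-second figure is a hypothetical round number, not a measurement):

```rust
fn main() {
    // Hypothetical non-blockchain workload: a large HTTPS site attaching a
    // zk proof to every connection, with one million proofs in flight at once.
    let concurrent_proofs: u64 = 1_000_000;
    let gb: u64 = 1_000_000_000;

    // Total prover memory if every in-flight proof needs 2 GB vs. 200 MB.
    let at_2_gb = concurrent_proofs * 2 * gb; // 2 * 10^15 bytes = 2 PB
    let at_200_mb = concurrent_proofs * 200_000_000; // 2 * 10^14 bytes = 200 TB

    println!("2 GB per proof   -> {} PB of memory", at_2_gb / 1_000_000_000_000_000);
    println!("200 MB per proof -> {} TB of memory", at_200_mb / 1_000_000_000_000);
}
```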
Precompiles: the last mile, or a crutch?
Precompiles are SNARK constraint systems hand-optimized for specific functions (such as hashing or elliptic-curve signatures). In Ethereum, precompiles can cut the overhead of Merkle hashing and signature verification, but over-reliance on them does nothing to improve the core efficiency of SNARKs.
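One way to picture a precompile is as a dedicated constraint generator that the VM invokes for a single fixed function, instead of proving that function one instruction at a time. The trait below is a hypothetical illustration, not any zkVM's actual interface.

```rust
/// Placeholder for whatever constraint representation a zkVM uses internally.
pub struct ConstraintSystem;

/// Hypothetical precompile interface: a hand-optimized constraint system
/// for one fixed function (say, a SHA-256 compression), exposed to guest
/// programs as a single operation.
pub trait Precompile {
    /// Compute the function natively (used while generating the trace).
    fn evaluate(&self, input: &[u8]) -> Vec<u8>;

    /// Emit dedicated constraints enforcing `output == evaluate(input)`,
    /// typically far fewer than proving the same function one RISC-V
    /// instruction at a time.
    fn constrain(&self, input: &[u8], output: &[u8]) -> ConstraintSystem;
}
```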
Problems with precompiles
1. Still too slow: even with hashing and signature precompiles, zkVMs retain the core proof system's inefficiency, both inside and outside blockchains.
2. Security vulnerabilities: if hand-written precompiles are not formally verified, vulnerabilities are all but inevitable and can lead to catastrophic security failures.
3. Poor developer experience: today many zkVMs ask developers to hand-write constraint systems, a workflow reminiscent of 1960s programming that badly undermines the developer experience.
4. Misleading benchmarks: when benchmarks lean on specific precompiles, they steer people toward optimizing hand-written constraint systems rather than improving the SNARK design itself.
5. I/O overhead and no access to RAM: while precompiles can speed up heavy cryptographic tasks, they may not deliver meaningful acceleration for more diverse workloads, because they incur significant overhead in passing inputs and outputs and cannot use RAM.
Even within blockchains, as soon as you move beyond a single L1 like Ethereum (for example, to build a series of cross-chain bridges), you face different hash functions and signature schemes. Churning out new precompiles to cope does not scale and poses enormous security risks.
I do believe precompiles will remain crucial in the long run, but only once they are automatically synthesized and formally verified. That way we keep the developer-experience advantages of zkVMs while avoiding catastrophic security risks. This view is reflected in Speed Stage 3.
Expected timeline
I expect a handful of zkVMs to reach Speed Stage 1 and Memory Stage 1 later this year. I think Speed Stage 2 is achievable within the next two years, though it is unclear whether we can get there without new research ideas.
I expect the remaining stages (Speed Stage 3 and Memory Stage 2) to take several more years.
Although this article lays out zkVM security and performance stages separately, the two are not independent. As vulnerabilities in zkVMs continue to be discovered, I expect that fixing some of them will inevitably cost significant performance. Until a zkVM reaches Security Stage 2, its performance numbers should be treated as provisional.
zkVMs have enormous potential to make zero-knowledge proofs truly mainstream, but they are still in their early days, full of security challenges and severe performance bottlenecks. Market hype and marketing make it hard to measure real progress. By laying out clear security and performance milestones, I hope this roadmap cuts through the fog. We will get there eventually, but it will take time, along with sustained research and engineering effort.