August 21, 2025

Lagrange Engineering Update: July 2025

July marked a pivotal milestone in the evolution of DeepProve, Lagrange’s zero-knowledge machine learning framework. Building upon our initial support for provable MLPs and CNNs, we expanded DeepProve’s capabilities to support full inference proofs for transformer-based large language models, culminating in the successful proof of inference for OpenAI’s GPT-2. This made DeepProve the first production-ready cryptographic system to prove a full LLM inference, a foundational breakthrough that will enable us to validate more complex model architectures like Meta’s LLAMA in the near future.

Below, we detail Lagrange’s engineering and research progress for July 2025.

New Capabilities for DeepProve

Computational Graph Support

DeepProve previously modeled networks as linear sequences of layers. This simplified design limited the ability to express residual connections, branching computations, or multi-input/multi-output layers—common in real-world architectures. We now support generic computational graphs. DeepProve can parse arbitrary directed acyclic graphs and prove inference across any topological path. This change was driven by the need to support transformer residuals, where outputs from one block are added back to the original inputs—a critical step for LLM compatibility.
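As an illustration of the idea (not DeepProve's actual implementation), proving over a DAG reduces to visiting nodes in a topological order, so every layer is proven only after all of its inputs. The node names below are hypothetical; the residual pattern mirrors the transformer skip connection described above.

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: return nodes in an order where every node
    appears after all of its parents (i.e., its input layers)."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        indegree[dst] += 1
        children[src].append(dst)
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle; not a DAG")
    return order

# A residual block: "input" feeds both an MLP branch and a skip
# connection that is added back in at "add".
nodes = ["input", "linear", "activation", "add", "output"]
edges = [("input", "linear"), ("linear", "activation"),
         ("activation", "add"), ("input", "add"), ("add", "output")]
print(topological_order(nodes, edges))
# ['input', 'linear', 'activation', 'add', 'output']
```

A prover walking this order can prove the skip-connection Add only once both of its operands are available, which a linear layer list cannot express.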

Core Layer Additions

To support transformers, we expanded the layer library with both generic and transformer-specific operations:

  • Add: Quantized tensor addition with post-op requantization. Avoids table explosion by deferring quantization until after addition, enabling table reuse across multiple ops.
  • ConcatMatMul: Enables tensor concatenation across a given dimension followed by matrix multiplication—core to multi-head attention.
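To make the Add bullet concrete, here is a minimal sketch of deferred requantization, assuming symmetric fixed-point quantization with a shared input scale. All scales and bit widths are illustrative, not DeepProve's actual parameters. The key point is that requantizing once after the integer addition means a single (input scale → output scale) mapping can be shared across many Add ops, instead of one table per operand pair.

```python
def quantize(x, scale, bits=8):
    """Symmetric fixed-point quantization: real value -> bounded integer."""
    q = round(x / scale)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(lo, min(hi, q))

def dequantize(q, scale):
    return q * scale

def add_then_requantize(qa, qb, scale_in, scale_out, bits=8):
    """Add in the wider integer domain first, then requantize once.
    The sum of two b-bit values fits in b+1 bits, so no overflow."""
    wide = qa + qb
    real = wide * scale_in          # back to the real domain
    return quantize(real, scale_out, bits)

qa = quantize(0.5, 0.01)    # 50
qb = quantize(0.25, 0.01)   # 25
q_sum = add_then_requantize(qa, qb, scale_in=0.01, scale_out=0.025)
print(dequantize(q_sum, 0.025))
# 0.75
```

In a lookup-based prover, only the final `quantize` step needs a table, and that table depends only on the scale pair—hence the reuse the bullet describes.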

Transformer-Specific Layers

  • Softmax: One of the most sensitive operations in transformers. We implemented a custom softmax circuit based on zkLLM techniques, refined to handle quantized precision bounds that the paper’s authors didn’t fully address. We validate stability with bounds-checking to ensure cryptographic soundness.
  • Embeddings + Positionals: Implementation of input embedding vectors and positional encodings using sumcheck-based techniques, similar to the one-hot encoding techniques used in Jolt.
  • LayerNorm + GELU: Both operations are mapped to lookup tables and support range-bounded quantized arithmetic.
  • QKV Block (Attention): Proved using sumcheck-based matrix multiplication primitives, with caching to preserve autoregressive context reuse.
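To illustrate the lookup-table approach mentioned for LayerNorm and GELU (a sketch, not DeepProve's circuit): because quantized inputs live in a small bounded range, a nonlinear function like GELU can be fully tabulated ahead of time, so the prover only needs a lookup argument rather than arithmetizing erf or tanh. The scales below are arbitrary.

```python
import math

def gelu(x):
    """Exact GELU via the Gaussian CDF (erf form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def build_gelu_table(scale_in, scale_out, bits=8):
    """Tabulate quantized GELU for every representable input.
    With 8-bit inputs the table has only 256 entries, so it can be
    committed once and reused by every GELU in the model."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    table = {}
    for q in range(lo, hi + 1):
        y = gelu(q * scale_in)           # dequantize, apply the op
        table[q] = round(y / scale_out)  # requantize the result
    return table

table = build_gelu_table(scale_in=0.05, scale_out=0.05)
print(len(table))   # 256
print(table[40])    # gelu(2.0) ≈ 1.954 -> 39 at scale 0.05
```

The "range-bounded quantized arithmetic" in the bullet above is what keeps such tables small: the table size is set by the input bit width, not by the function's complexity.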

GGUF Support: Expanding Model Compatibility

We added support for the GGUF format—one of the most widely adopted formats for exporting and running large language models. GGUF is optimized for LLM deployment and is widely used by developers and researchers, with a growing ecosystem of compatible models hosted on platforms like Hugging Face. DeepProve can now ingest GGUF-exported models and convert them into our internal format for proving inference, as long as the model architecture falls within our supported layer set. This improves accessibility and interoperability by enabling users to work with community-adopted models without requiring custom export pipelines.
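For readers unfamiliar with the format, a GGUF file opens with a small fixed preamble—the 4-byte magic `GGUF`, a little-endian version number, a tensor count, and a metadata key/value count—before the metadata and tensor data. The sketch below parses just that preamble against a synthetic buffer; it is a format illustration, not DeepProve's ingestion code.

```python
import struct

def read_gguf_header(buf: bytes):
    """Parse the fixed GGUF preamble: 4-byte magic b"GGUF", then
    little-endian uint32 version, uint64 tensor count, and uint64
    metadata key/value count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header: version 3, 2 tensors, 5 metadata entries.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(header))
# {'version': 3, 'tensors': 2, 'metadata_kv': 5}
```

The metadata section is what lets a converter decide up front whether a model's architecture falls within a supported layer set, before touching any tensor data.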

LLM Inference Engine

Transformers are autoregressive, which means the inference process is sequential and token-dependent. We added a new LLM driver module for managing stateful inference, proof generation, and proof verification across multiple tokens. This engine extends the Model API and serves as the backbone for future autoregressive model support.
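The shape of such a driver can be sketched as follows. This is a hypothetical illustration of stateful, token-by-token proving—`prove_next` and the toy model are invented stand-ins, not DeepProve's Model API.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenStep:
    token: int
    proof: bytes

@dataclass
class LLMDriver:
    """Stateful driver: each generated token is appended to the
    context, and each forward pass yields a proof for that step."""
    context: list = field(default_factory=list)
    steps: list = field(default_factory=list)

    def step(self, model):
        token, proof = model.prove_next(self.context)  # hypothetical API
        self.context.append(token)
        self.steps.append(ProvenStep(token, proof))
        return token

    def generate(self, model, n_tokens, stop_token=None):
        for _ in range(n_tokens):
            if self.step(model) == stop_token:
                break
        return self.context

class EchoModel:
    """Toy stand-in: 'predicts' the current context length."""
    def prove_next(self, context):
        token = len(context)
        return token, b"proof-of-step-%d" % token

driver = LLMDriver(context=[101])       # 101 = hypothetical prompt token
driver.generate(EchoModel(), n_tokens=3)
print(driver.context)                   # [101, 1, 2, 3]
print(len(driver.steps))                # 3
```

Because each step's input includes all previously generated tokens, the per-step proofs are inherently sequential—which is exactly why the engine must manage state across tokens rather than proving a single stateless forward pass.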

Current Engineering Workstreams

With full GPT-2 inference now proven, our engineering focus has shifted toward performance—specifically reducing proving time, memory usage, and proof size.

1. Cryptographic Optimization

Polynomial Commitment Scheme Migration: We are evaluating alternatives to the current Basefold PCS implementation, which introduces proof bloat and limits verifier throughput. Benchmarks are ongoing for Groth16 variants and newer SNARK-friendly PCS constructions.

Reusable Lookup Tables: Quantization introduces heterogeneous lookup tables for each layer. We are developing a static table reuse mechanism to preserve table integrity while improving setup amortization.
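One way to picture static table reuse (an illustrative sketch, assuming table contents are fully determined by the op and its quantization scales): key each table by those parameters, so every layer that shares them also shares a single table—and a single commitment.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_table(op: str, scale_in: float, scale_out: float, bits: int = 8):
    """Build (or fetch) the lookup table for an op at given scales.
    Identical (op, scale_in, scale_out, bits) keys return the same
    object, so the table is built and committed only once."""
    fns = {"relu": lambda x: max(x, 0.0), "neg": lambda x: -x}
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return tuple(
        round(fns[op](q * scale_in) / scale_out) for q in range(lo, hi + 1)
    )

# Two layers with identical parameters hit the same cached table:
t1 = get_table("relu", 0.05, 0.05)
t2 = get_table("relu", 0.05, 0.05)
print(t1 is t2)                        # True: one table, built once
print(get_table.cache_info().misses)   # 1
```

The setup-amortization win is that the expensive step—committing the table—moves out of the per-layer path and happens once per distinct parameter set.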

2. Graph-Aware Parallelism

The new ONNX graph support allows us to represent model computation as a DAG of subtasks. Many of these subtasks (e.g., matrix multiplications, layernorms) are independent and eligible for parallel execution. Refactoring the prover to be concurrency-aware is now a top priority, as it lays the foundation for multi-core execution and streaming proofs.

This shift also requires:

  • Aggregate-proof generation
  • Task-level concurrency scheduling
  • Thread-safe data movement within the prover
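The scheduling problem above can be sketched by grouping DAG nodes into levels: nodes in the same level have no dependencies on each other and may be proven concurrently, while levels run in order. The node names and the trivial `prove_fn` below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def dag_levels(nodes, edges):
    """Group DAG nodes into levels: level k holds every node whose
    longest path from a source node has length k."""
    parents = {n: [] for n in nodes}
    for src, dst in edges:
        parents[dst].append(src)
    level = {}
    def depth(n):
        if n not in level:
            level[n] = 1 + max((depth(p) for p in parents[n]), default=-1)
        return level[n]
    for n in nodes:
        depth(n)
    out = [[] for _ in range(max(level.values()) + 1)]
    for n in nodes:
        out[level[n]].append(n)
    return out

def prove_parallel(levels, prove_fn, workers=4):
    """Prove each level's subtasks in parallel, levels in order."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for lvl in levels:
            for node, proof in zip(lvl, pool.map(prove_fn, lvl)):
                results[node] = proof
    return results

# The Q, K, and V projections all depend only on x, so they land in
# one level and can be proven concurrently.
nodes = ["x", "q", "k", "v", "attn"]
edges = [("x", "q"), ("x", "k"), ("x", "v"),
         ("q", "attn"), ("k", "attn"), ("v", "attn")]
print(dag_levels(nodes, edges))   # [['x'], ['q', 'k', 'v'], ['attn']]
proofs = prove_parallel(dag_levels(nodes, edges), lambda n: "proved-" + n)
print(sorted(proofs))
```

A real prover would replace the thread pool with a work-stealing scheduler and the per-node closure with actual subcircuit proving, but the level structure is what makes multi-core execution safe.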

3. Distributed Proving

Once graph-aware parallelism is established, we will begin distributing proving jobs across machines. Each node will handle one or more subcircuits derived from the inference DAG, returning intermediate commitments to be aggregated by a coordinating verifier.

This system is being architected with horizontal scalability in mind to meet the demands of LLM inference at scale—particularly for high-throughput environments like inference-as-a-service or low-latency blockchain verification.

What’s Next

We are now turning our attention to LLAMA. GPT-2 and LLAMA share striking similarities, which means that only a few additional neural network layers separate us from LLAMA compatibility. Integrating these layers into the DeepProve library—most notably LayerNorm vs. RMSNorm, GELU vs. SiLU, and grouped-query attention—will be a core focus of our engineering and research teams over the coming months. Performance improvements from PCS upgrades, table reuse, and distributed proving will also unlock support for larger context windows and multi-turn inference.
