A 5-cluster token architecture with scheduler-supervised LLM routing, cryptographic migration wallets, and on-chain audit provenance — designed to transform reactive GPU memory management into a proactive, portable, and self-improving system.
GPU inference infrastructure operates under a fundamental architectural contradiction: the hardware is stateful — memory tiers, thermal conditions, loaded models — but the workload management layer treats every request as stateless. The consequences are measurable and severe.
The root cause is not insufficient hardware. It is the absence of a persistent, portable, self-describing batch identity — a unit that carries its own memory state, health metrics, routing history, and privacy constraints, and can be intelligently placed and moved without rebuilding context from scratch. During the decode phase, data movement so dominates execution time that GPU tensor cores can sit at near-zero utilization while the system waits for weights and KV cache tensors to arrive from memory — a condition Nvidia's own inference documentation describes as memory-bandwidth bound, not compute-bound.[3,4] The scheduler problem and the memory-hierarchy problem are the same problem.
Every BATCHMIND batch unit contains 100,000 tokens divided into five functional clusters. Each cluster serves a distinct purpose and actively supports the others — forming a self-sustaining unit that can be tracked, moved, compressed, or evicted intelligently across any GPU in any datacenter.
| # | Cluster | Function | Key Fields |
|---|---|---|---|
| C1 | Identity | "What am I?" | Wallet address · Model ID · Precision format · Tenant ID · Origin region · Request IDs |
| C2 | Context | "What do I know?" | KV cache tensors (encrypted) · Semantic embedding · Token sequence · Reconstruction cost · Compression flag |
| C3 | Vitals | "Am I still needed?" | Re-engagement probability · TTL countdown · Memory tier · GPU utilization · Eviction risk score |
| C4 | Routing | "Where should I go?" | Candidate GPU list (ranked) · Migration cost · Latency SLA · Next scheduled action · Hop history |
| C5 | Supervision | "Is this correct?" | Hard constraint rules · Decision templates · Confidence gate · Audit log · Drift monitor · Fine-tune signal |
Cluster 5 is the architectural keystone. It is not served to the LLM — it governs the LLM. The scheduler reads Clusters 1, 3, and 4 to pre-filter decisions. The LLM sees a clean ~200-token summary only when the scheduler cannot resolve a case with its ruleset, and Cluster 5 validates the output before execution fires.
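The cluster layout above can be sketched as a plain data structure. This is an illustrative sketch only — field names, types, and the summary function are assumptions, not the BATCHMIND wire format; it shows how the scheduler-facing clusters (C1, C3, C4) reduce to the compact view described in the text.

```python
from dataclasses import dataclass

@dataclass
class Identity:                    # C1 — "What am I?"
    wallet_address: str
    model_id: str
    tenant_id: str
    origin_region: str

@dataclass
class Vitals:                      # C3 — "Am I still needed?"
    reengagement_prob: float       # 0.0–1.0
    ttl_seconds: int
    memory_tier: str               # e.g. "hbm", "cpu", "ssd"
    eviction_risk: float

@dataclass
class Routing:                     # C4 — "Where should I go?"
    candidate_gpus: list           # ranked candidate GPU ids
    migration_cost_ms: float
    latency_sla_ms: float

@dataclass
class BatchUnit:
    identity: Identity
    vitals: Vitals
    routing: Routing
    decode_in_progress: bool = False

def scheduler_summary(b: BatchUnit) -> dict:
    """Reduce the unit to the compact view the scheduler pre-filters on —
    and, when a case escalates, roughly the ~200-token summary the LLM sees."""
    return {
        "tenant": b.identity.tenant_id,
        "region": b.identity.origin_region,
        "reengagement": b.vitals.reengagement_prob,
        "ttl": b.vitals.ttl_seconds,
        "best_gpu": b.routing.candidate_gpus[0] if b.routing.candidate_gpus else None,
        "mid_decode": b.decode_in_progress,
    }
```

The C2 Context cluster (encrypted KV tensors) is deliberately absent from the summary: the routing layer never needs to see payload data to place a batch.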
Several active research threads address GPU memory efficiency for LLM inference. BATCHMIND does not replace them — it addresses a different layer of the problem. Understanding the gap each prior system leaves is essential to understanding what BATCHMIND proposes.
| System | What It Solves | Persistent Batch ID | Cross-GPU Migration | Predictive (Pre-eviction) | Jurisdiction Enforcement |
|---|---|---|---|---|---|
| vLLM / PagedAttention[5] | KV cache fragmentation within a single GPU — near-zero waste, 2–4× throughput | ✗ | ✗ | ✗ | ✗ |
| DistServe / Splitwise / TetriInfer[6] | Prefill–decode disaggregation — separate GPU pools for compute-bound vs memory-bound phases | ✗ | Partial | ✗ | ✗ |
| LMCache[7] | KV cache tiering across GPU/CPU/disk — reuse of cached context across queries | ✗ | Partial | ✗ | ✗ |
| NVIDIA Dynamo[8] | KV cache offload to SSD and CPU RAM via NIXL low-latency transfer library | ✗ | ✗ | ✗ | ✗ |
| BATCHMIND (proposed) | Cross-datacenter batch lifecycle — proactive routing, portable identity, verifiable audit | ✓ | ✓ | ✓ | ✓ |
Compatibility note. BATCHMIND is designed to sit above, not replace, vLLM-compatible inference engines. The C2 Context cluster's tensor fields map directly onto PagedAttention's block-table abstraction. The C3 Vitals cluster's phase flags (decode-in-progress, prefill-complete) are compatible with disaggregated P/D deployments. The scheduler's hard rules (R1: never interrupt mid-decode) are consistent with the scheduling constraints documented in Sarathi-Serve and DistServe. BATCHMIND adds the lifecycle management layer that none of these systems provide.
The open question this prior work leaves unanswered is: what happens to a batch after it leaves one GPU? Current systems dissolve it. BATCHMIND proposes that batches carry their identity, history, and constraints across every hardware boundary — and that this identity is cryptographically verifiable, not administratively asserted.
The scheduler is the brain of the system. It runs every 500 ms, distills the four supporting clusters (C1–C4) into a compact decision view, and resolves approximately 80% of routing decisions with deterministic rules — at zero LLM cost, zero added latency, and with full auditability.
The critical design principle: the LLM performs classification, not reasoning. It selects one of three pre-validated options from clean, structured input — removing free-form generation, and with it hallucination risk, from the routing path.
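The two-stage decision described above can be sketched as follows. Rule R1 comes from the compatibility note earlier in the text; the remaining rule thresholds, action names, and the `llm_classify` callback signature are illustrative assumptions.

```python
# Stage 1: deterministic rules resolve the bulk of cases at zero LLM cost.
# Stage 2: the LLM classifies among pre-validated options; Cluster 5's
# confidence gate rejects anything outside that set.

PRE_VALIDATED_ACTIONS = ("keep_warm", "migrate", "evict")

def apply_hard_rules(batch: dict):
    """Deterministic pre-filter; returns an action, or None to escalate."""
    if batch["mid_decode"]:
        return "keep_warm"                 # R1: never interrupt mid-decode
    if batch["ttl"] <= 0 and batch["reengagement"] < 0.05:
        return "evict"                     # expired and cold (illustrative threshold)
    if batch["reengagement"] > 0.9:
        return "keep_warm"                 # near-certain return (illustrative threshold)
    return None                            # ambiguous — escalate to the LLM

def route(batch: dict, llm_classify) -> str:
    action = apply_hard_rules(batch)
    if action is None:
        # LLM as classifier: pick one of three pre-validated options
        # from the clean structured summary.
        action = llm_classify(batch, PRE_VALIDATED_ACTIONS)
        if action not in PRE_VALIDATED_ACTIONS:
            action = "keep_warm"           # C5 confidence gate: safe default
    return action
```

The final `in PRE_VALIDATED_ACTIONS` check is the key property: even a malformed LLM output cannot inject an unvetted action into the execution path.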
Every batch is assigned a migration wallet at creation — a cryptographic identity container that travels with the batch across every GPU boundary. The wallet solves privacy, jurisdiction enforcement, and tamper detection simultaneously, without adding latency to the real-time decision path.
On every migration, the migration key rotates. The sending GPU's key is cryptographically invalidated the moment the receiving GPU confirms receipt. A compromised node gains zero access to historical context and zero access to future state. Jurisdiction rules embedded in the wallet are enforced at the key-generation layer — a migration key will not generate for a destination that violates the tenant's regional constraints. This is cryptographic enforcement, not policy enforcement.
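The rotation-and-jurisdiction mechanism above can be sketched with a one-way key ratchet. The HMAC-based derivation and the class shape are assumptions chosen to illustrate two invariants from the text: a key is never generated for a disallowed destination, and discarding the old key after confirmation makes prior hops unrecoverable from the current state.

```python
import hashlib
import hmac
import secrets

class MigrationWallet:
    """Illustrative sketch — not the BATCHMIND key schedule."""

    def __init__(self, allowed_regions):
        self.allowed_regions = set(allowed_regions)
        self._key = secrets.token_bytes(32)   # current migration key
        self.rotation_log = []                # key hashes only, for the audit trail

    def rotate_for(self, dest_gpu: str, dest_region: str) -> bytes:
        if dest_region not in self.allowed_regions:
            # Cryptographic enforcement: the key simply does not exist
            # for a destination outside the tenant's jurisdiction.
            raise PermissionError(f"no key derivable for region {dest_region!r}")
        new_key = hmac.new(self._key, dest_gpu.encode(), hashlib.sha256).digest()
        self.rotation_log.append(hashlib.sha256(new_key).hexdigest())
        self._key = new_key                   # old key discarded: invalidated on handoff
        return new_key
```

Because the ratchet is one-way (HMAC over the previous key), compromising the current node yields no access to keys from earlier hops — the "zero access to historical context" property described above.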
Why a ledger at all? In a single-operator datacenter, the migration wallet's cryptographic evidence (key rotation log, migration hashes) can be held and audited by that operator. No external ledger is required. The ledger becomes necessary in multi-operator environments: when a batch migrates across a datacenter boundary between two independent GPU providers, neither party has a trusted reason to accept the other's unilateral audit record. A tamper-evident, append-only shared log — where both parties can verify migration events without relying on the other's honesty — is the correct solution to this specific problem. Blockchain is not chosen because it is novel; it is chosen because it is the standard mechanism for establishing a shared record among mutually distrusting parties.
What kind of chain. This protocol does not require and does not propose a public proof-of-work or proof-of-stake chain (Ethereum, Solana, etc.). The correct architecture is a permissioned ledger — a consortium chain with a fixed, known set of validator nodes (the participating datacenter operators). Hyperledger Fabric and similar permissioned ledger systems provide the tamper-evidence and finality guarantees required here with far lower overhead, far higher throughput, and without exposing audit data to public networks. The "on-chain" records described in §5 (wallet creation, migration transfer events, key rotation log) are designed for a permissioned ledger, not a public one.
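The tamper-evidence property the permissioned ledger provides can be illustrated with a minimal hash chain: each migration record commits to the previous record's hash, so any retroactive edit breaks verification. This is a sketch of the property only — a real consortium ledger such as Hyperledger Fabric adds validator signatures, endorsement, and finality on top; record fields here are illustrative.

```python
import hashlib
import json

class AuditChain:
    """Append-only, hash-linked migration log (tamper-evidence sketch)."""

    def __init__(self):
        self.records = []

    def append(self, event: dict) -> str:
        prev = self.records[-1]["hash"] if self.records else "0" * 64
        body = json.dumps(event, sort_keys=True)
        h = hashlib.sha256((prev + body).encode()).hexdigest()
        self.records.append({"prev": prev, "event": event, "hash": h})
        return h

    def verify(self) -> bool:
        """Recompute every link; any edited record breaks the chain."""
        prev = "0" * 64
        for r in self.records:
            body = json.dumps(r["event"], sort_keys=True)
            if r["prev"] != prev:
                return False
            if hashlib.sha256((prev + body).encode()).hexdigest() != r["hash"]:
                return False
            prev = r["hash"]
        return True
```

In the multi-operator setting described above, both providers hold the chain: neither can rewrite a migration event without the other's copy failing `verify()`.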
The on-chain audit log from Cluster 5 is not just a record — it is the basis for a proof-of-useful-work incentive layer. GPU providers earn tokens not for mining meaningless hashes, but for demonstrably improving inference efficiency. The key distinction: every reward claim is verifiable from the migration wallet's cryptographic evidence. A provider cannot fabricate a "successful migration" event — the receiving GPU's key confirmation and the batch's measured latency delta are embedded in the wallet before the reward is computed.
| Action | Token Effect | Reason |
|---|---|---|
| Host batch, GPU utilization improves | + Earn tokens | Measurable efficiency contribution |
| Successful migration, latency reduced | + Earn tokens | Proven routing quality |
| Keep high re-engagement batch warm | + Earn tokens | Prevented costly context rebuild |
| Drop batch mid-session unexpectedly | − Lose tokens | Forced full context reconstruction |
| Evict batch that returned within 60s | − Lose tokens | Incorrect re-engagement prediction acted on |
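The reward logic in the table above can be sketched as a pure function over audited wallet evidence. Event field names and token amounts are illustrative assumptions; the point carried over from the text is that every positive claim is gated on cryptographically confirmed facts (receipt confirmation, measured latency delta), not on the provider's say-so.

```python
def compute_reward(event: dict) -> int:
    """Token delta for one audited lifecycle event (illustrative weights)."""
    if event["type"] == "migration":
        # Reward only if the receiver's key confirmation is present
        # AND the measured latency actually improved.
        if event["receipt_confirmed"] and event["latency_delta_ms"] < 0:
            return +10
        return 0
    if event["type"] == "eviction":
        # Penalize evictions the batch "disproved" by returning quickly.
        returned = event.get("returned_within_s")
        if returned is not None and returned <= 60:
            return -20
        return 0
    if event["type"] == "unexpected_drop":
        return -50   # forced full context reconstruction
    return 0
```

Because every input to `compute_reward` is already sealed in the migration wallet before the claim is made, a provider cannot fabricate the inputs that produce a positive delta.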
The speculation risk. The claim that token value will remain "utility-based, not speculative" is a design goal, not a self-enforcing property. This is the hardest part of the token design to defend and it requires explicit anti-speculation mechanisms, not just a policy statement. The proposed approach: tokens should function as non-transferable compute credits with expiry — closer to airline miles than currency. They purchase compute priority slots, routing access, and decision template library access within the protocol only. They cannot be traded on secondary markets because there is no secondary market interface. If tokens become transferable instruments, this guarantee collapses and the incentive layer should be replaced with a simpler SLA-based penalty/reward structure. That alternative is under consideration and may be the more credible path for enterprise operator adoption.
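The "airline miles" model above — non-transferable, protocol-internal, expiring — can be sketched as follows. Class and method names are illustrative; the structural point is that no transfer operation exists between accounts, and expiry is enforced at read time.

```python
import time

class ComputeCredits:
    """Non-transferable compute credits with expiry (illustrative sketch).
    Note the absence of any transfer() method: credits are spendable
    only inside the protocol, never moved between accounts."""

    def __init__(self):
        self._grants = {}   # account -> list of (amount, expiry_epoch)

    def grant(self, account: str, amount: int, ttl_s: float):
        self._grants.setdefault(account, []).append((amount, time.time() + ttl_s))

    def balance(self, account: str) -> int:
        now = time.time()
        # Expired grants simply vanish — credits cannot be hoarded as an asset.
        live = [(a, e) for a, e in self._grants.get(account, []) if e > now]
        self._grants[account] = live
        return sum(a for a, _ in live)

    def spend(self, account: str, amount: int) -> bool:
        """Redeem credits for compute priority, routing access, etc."""
        if self.balance(account) < amount:
            return False
        remaining = amount
        kept = []
        for a, e in sorted(self._grants[account], key=lambda g: g[1]):
            take = min(a, remaining)     # consume soonest-to-expire first
            remaining -= take
            if a - take:
                kept.append((a - take, e))
        self._grants[account] = kept
        return True
```

If the design were to fall back to the SLA-based penalty/reward structure mentioned above, this class would be replaced by plain contractual accounting with no token object at all.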
What the ledger does not solve. The permissioned chain establishes a shared record of what happened; it does not automatically enforce correct behavior in real time. Enforcement remains the scheduler's job. The chain's role is accountability and auditability after the fact — and as the source of the fine-tune signal that improves the orchestrator over time. These are meaningful properties. They are not magic.
| Audience | Current Pain | What BATCHMIND Delivers |
|---|---|---|
| Nvidia / Hardware Vendors | GPUs sit at 28–60% utilization despite massive demand | Proactive memory orchestration raises effective utilization per rack |
| Data Center Operators | No financial instrument tied to compute efficiency | Proof-of-useful-work token with measurable ROI per batch |
| AI Model Providers | No visibility into where context lives or why it's lost | Portable batch identity with verifiable migration history |
| Enterprise / Regulated Industry | Cannot use shared infrastructure due to data sovereignty risk | Cryptographic jurisdiction enforcement — math not policy |