Technical Abstract — Independent Research
March 2026 · v1.0

BATCHMIND:
A Stateful Batch Protocol
for Intelligent GPU Memory Orchestration

A 5-cluster token architecture with scheduler-supervised LLM routing, cryptographic migration wallets, and on-chain audit provenance — designed to transform reactive GPU memory management into a proactive, portable, and self-improving system.

Independent Researcher · Electrical Engineering · Systems Architecture
Abstract
Modern AI data centers waste 40–72% of GPU capacity not because of hardware limitations, but because the workload management layer is stateless: context is rebuilt from scratch on every request, and routing decisions are reactive rather than predictive. This paper proposes BATCHMIND — a protocol where every inference batch carries a structured 5-cluster identity, a cryptographic migration wallet, and an on-chain audit record. A deterministic scheduler handles 80% of memory decisions without touching the LLM. The remaining 20% are resolved by a small orchestrating model reading pre-filtered, structured input — never raw conversation data. The result is GPU memory that moves before pressure hits, context that survives across hardware boundaries, and a verifiable incentive layer that rewards efficient compute contribution.
1. The Problem

GPU inference infrastructure operates under a fundamental architectural contradiction: the hardware is stateful — memory tiers, thermal conditions, loaded models — but the workload management layer treats every request as stateless. The consequences are measurable and severe.

~50%[1]: GPU utilization drop when KV cache offloading is active vs. baseline
<50%[2]: Sustained utilization in production AI inference under real load
≤1%[3]: GPU tensor compute utilization during memory-bound decode (single request)

The root cause is not insufficient hardware. It is the absence of a persistent, portable, self-describing batch identity — a unit that carries its own memory state, health metrics, routing history, and privacy constraints, and can be intelligently placed and moved without rebuilding context from scratch. During the decode phase, data movement so dominates execution time that GPU tensor cores can sit at near-zero utilization while the system waits for weights and KV cache tensors to arrive from memory — a condition Nvidia's own inference documentation describes as memory-bandwidth bound, not compute-bound.[3,4] The scheduler problem and the memory-hierarchy problem are the same problem.
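The memory-bandwidth-bound claim can be checked with back-of-envelope arithmetic. A minimal sketch, using illustrative (not measured) figures for a 70B-parameter FP16 model on a GPU with roughly 3.35 TB/s of HBM bandwidth and ~989 TFLOP/s of FP16 tensor compute:

```python
# Back-of-envelope check of the memory-bandwidth-bound decode condition.
# All hardware figures below are illustrative assumptions, not measurements.

PARAMS = 70e9            # model parameters
BYTES_PER_PARAM = 2      # FP16
HBM_BW = 3.35e12         # bytes/s of HBM bandwidth (assumed)
PEAK_FLOPS = 989e12      # FP16 tensor throughput (assumed)

# Single-request decode: every generated token must stream all weights
# from HBM at least once, so memory traffic lower-bounds the step time.
weight_bytes = PARAMS * BYTES_PER_PARAM
t_mem = weight_bytes / HBM_BW            # time to move the weights once

# Compute for one decoded token is roughly 2 FLOPs per parameter.
t_compute = (2 * PARAMS) / PEAK_FLOPS

# Fraction of each decode step the tensor cores are actually busy.
utilization = t_compute / t_mem
print(f"memory-bound step: {t_mem * 1e3:.1f} ms, "
      f"tensor utilization ~ {utilization:.2%}")
```

Even at these optimistic peak numbers, the tensor cores are busy for well under 1% of each single-request decode step, consistent with the ≤1% figure above.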

2. The 5-Cluster Batch Architecture

Every BATCHMIND batch unit contains 100,000 tokens divided into five functional clusters. Each cluster serves a distinct purpose and actively supports the others — forming a self-sustaining unit that can be tracked, moved, compressed, or evicted intelligently across any GPU in any datacenter.

C1 · Identity · "What am I?"
    Wallet address · Model ID · Precision format · Tenant ID · Origin region · Request IDs
C2 · Context · "What do I know?"
    KV cache tensors (encrypted) · Semantic embedding · Token sequence · Reconstruction cost · Compression flag
C3 · Vitals · "Am I still needed?"
    Re-engagement probability · TTL countdown · Memory tier · GPU utilization · Eviction risk score
C4 · Routing · "Where should I go?"
    Candidate GPU list (ranked) · Migration cost · Latency SLA · Next scheduled action · Hop history
C5 · Supervision · "Is this correct?"
    Hard constraint rules · Decision templates · Confidence gate · Audit log · Drift monitor · Fine-tune signal

Cluster 5 is the architectural keystone. It is not served to the LLM — it governs the LLM. The scheduler reads Clusters 1, 3, and 4 to pre-filter decisions. The LLM only sees a clean ~200 token summary when the scheduler cannot resolve a case with its ruleset. Cluster 5 validates the output before execution fires.
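As a concrete starting point for the schema work in the roadmap, the five clusters can be expressed as plain dataclasses. A minimal sketch; every field name and type here is an illustrative assumption, not a finalized spec:

```python
# Sketch of the 5-cluster batch schema. Field names are assumptions
# drawn from the cluster table above, not a finalized specification.
from dataclasses import dataclass, field

@dataclass
class Identity:                      # C1 — "What am I?"
    wallet_address: str
    model_id: str
    precision: str                   # e.g. "fp16", "fp8"
    tenant_id: str
    origin_region: str
    request_ids: list[str] = field(default_factory=list)

@dataclass
class Context:                       # C2 — "What do I know?"
    kv_cache_ref: str                # handle to encrypted tensors, never raw
    semantic_embedding: list[float]
    reconstruction_cost: float       # estimated cost to rebuild from scratch
    compressed: bool = False

@dataclass
class Vitals:                        # C3 — "Am I still needed?"
    reengagement_prob: float         # 0.0–1.0
    ttl_seconds: float
    memory_tier: str                 # "hbm" | "cpu" | "ssd"
    gpu_utilization: float
    eviction_risk: float

@dataclass
class Routing:                       # C4 — "Where should I go?"
    candidate_gpus: list[str]        # ranked
    migration_cost: float
    latency_sla_ms: float
    hop_history: list[str] = field(default_factory=list)

@dataclass
class Supervision:                   # C5 — "Is this correct?"
    confidence_gate: float = 0.85
    audit_log: list[str] = field(default_factory=list)

@dataclass
class Batch:
    c1: Identity
    c2: Context
    c3: Vitals
    c4: Routing
    c5: Supervision
```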

3. Relationship to Prior Work

Several active research threads address GPU memory efficiency for LLM inference. BATCHMIND does not replace them — it addresses a different layer of the problem. Understanding the gap each prior system leaves is essential to understanding what BATCHMIND proposes.

Each system is listed with what it solves and its coverage of four properties: Persistent Batch ID (ID), Cross-GPU Migration (Mig), Predictive Pre-eviction (Pred), and Jurisdiction Enforcement (Jur).

vLLM / PagedAttention[5]
  KV cache fragmentation within a single GPU — near-zero waste, 2–4× throughput
  ID: No · Mig: No · Pred: No · Jur: No
DistServe / Splitwise / TetriInfer[6]
  Prefill–decode disaggregation — separate GPU pools for compute-bound vs memory-bound phases
  ID: No · Mig: Partial · Pred: No · Jur: No
LMCache[7]
  KV cache tiering across GPU/CPU/disk — reuse of cached context across queries
  ID: No · Mig: Partial · Pred: No · Jur: No
NVIDIA Dynamo[8]
  KV cache offload to SSD and CPU RAM via NIXL low-latency transfer library
  ID: No · Mig: No · Pred: No · Jur: No
BATCHMIND (proposed)
  Cross-datacenter batch lifecycle — proactive routing, portable identity, verifiable audit
  ID: Yes · Mig: Yes · Pred: Yes · Jur: Yes

Compatibility note. BATCHMIND is designed to sit above, not replace, vLLM-compatible inference engines. The C2 Context cluster's tensor fields map directly onto PagedAttention's block-table abstraction. The C3 Vitals cluster's phase flags (decode-in-progress, prefill-complete) are compatible with disaggregated P/D deployments. The scheduler's hard rules (R1: never interrupt mid-decode) are consistent with the scheduling constraints documented in Sarathi-Serve and DistServe. BATCHMIND adds the lifecycle management layer that none of these systems provide.

The open question this prior work leaves unanswered is: what happens to a batch after it leaves one GPU? Current systems dissolve it. BATCHMIND proposes that batches carry their identity, history, and constraints across every hardware boundary — and that this identity is cryptographically verifiable, not administratively asserted.

4. The Decision Engine

The scheduler is the brain of the system. It runs every 500ms, strips noise from all four supporting clusters, and resolves approximately 80% of routing decisions using deterministic rules — with zero LLM cost, zero latency overhead, and full auditability.

SCHEDULER — cycles every 500ms across all active batches
──────────────────────────────────────────────────────
Reads C3 Vitals   → strips outliers → single health score 0.0–1.0
Reads C1 Identity → keeps model_id, priority, wallet_address only
Reads C2 Context  → strips raw tensors → semantic embedding only
Reads C4 Routing  → filters candidates → top 3 GPUs only
──────────────────────────────────────────────────────
HARD RULES (~80% resolved here, no LLM needed)
R1  batch mid-decode              → do nothing
R2  TTL > 60s AND engage > 0.7    → pin in place
R3  GPU > 85% AND idle > 30s      → compress + migrate
R4  LLM confidence < 0.85         → use nearest template
R5  two batches → same GPU        → stagger 200ms
R6  TTL expired AND engage < 0.2  → evict
──────────────────────────────────────────────────────
Unresolved (~20%) → LLM receives clean 200-token payload
LLM outputs one of: stay | migrate | compress_move
──────────────────────────────────────────────────────
C5 SUPERVISION validates before execution fires
Outcome logged → on-chain audit record → fine-tune signal

The critical design principle: the LLM performs classification, not reasoning. It selects from three pre-validated options on clean structured input. This eliminates hallucination risk from the routing layer entirely.

5. Cryptographic Migration Wallet

Every batch is assigned a migration wallet at creation — a cryptographic identity container that travels with the batch across every GPU boundary. The wallet solves privacy, jurisdiction enforcement, and tamper detection simultaneously, without adding latency to the real-time decision path.

Public Tier — Scheduler Readable
  • Wallet address (chain identity)
  • Semantic embedding (context summary)
  • Health score + routing candidates
  • Jurisdiction rules
  • Content hash (tamper proof)
  • Migration history
Private Tier — Tenant Key Only
  • Raw KV cache tensors
  • Actual token sequence
  • Full conversation content
  • Tenant private key
  • Never transmitted
  • Never on chain
On Chain — Public + Permanent
  • Wallet creation event
  • Migration transfer events
  • Key rotation log (M1→M2→M3)
  • Scheduler decision outcomes
  • Efficiency scores for rewards
Never On Chain — Ever
  • Conversation content
  • Raw tensors or weights
  • Private or migration keys
  • Tenant identity (pseudonymous)
  • Patient / financial / legal data

On every migration, the migration key rotates. The sending GPU's key is cryptographically invalidated the moment the receiving GPU confirms receipt. A compromised node gains zero access to historical context and zero access to future state. Jurisdiction rules embedded in the wallet are enforced at the key-generation layer — a migration key will not generate for a destination that violates the tenant's regional constraints. This is cryptographic enforcement, not policy enforcement.
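One way to realize "a migration key will not generate for a forbidden destination" is to make jurisdiction a precondition of key derivation itself. A hypothetical sketch using a SHA-256 ratchet; a production design would use an authenticated KDF and signed receipts:

```python
# Hypothetical migration-key rotation with jurisdiction enforced at the
# key-generation layer. The derivation scheme here is illustrative only.
import hashlib
import secrets

def derive_migration_key(prev_key: bytes, dest_region: str,
                         allowed_regions: set[str]) -> bytes:
    """Derive the next migration key for a batch hop.

    Cryptographic enforcement: no key can be produced for a destination
    outside the wallet's allowed regions, so a forbidden migration is not
    merely rejected by policy -- the key for it never exists.
    """
    if dest_region not in allowed_regions:
        raise PermissionError(f"no key derivable for region {dest_region!r}")
    # One-way ratchet: the new key reveals nothing about prev_key, and
    # prev_key is discarded once the receiving GPU confirms receipt.
    return hashlib.sha256(prev_key + b"|" + dest_region.encode()).digest()

# M1 -> M2 rotation on a permitted hop
m1 = secrets.token_bytes(32)                        # key at wallet creation
m2 = derive_migration_key(m1, "eu-west", {"eu-west", "eu-central"})
```

The one-way derivation is what makes a compromised node useless as an attack point: holding M2 yields neither M1 (history) nor M3 (future), because M3 does not exist until the next permitted hop is requested.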

6. Proof-of-Useful-Work Token Economy
Credibility Acknowledgment
The use of blockchain and tokenomics in an infrastructure paper will draw immediate skepticism from systems engineers — and that skepticism is earned. Most "crypto + AI" proposals conflate speculative token mechanics with genuine technical architecture. This section distinguishes the cryptographic layer (which has a precise technical justification) from the incentive layer (which carries real design risks and is treated here as a proposed mechanism, not a proven one). Both are described with their limitations stated.

Why a ledger at all? In a single-operator datacenter, the migration wallet's cryptographic evidence (key rotation log, migration hashes) can be held and audited by that operator. No external ledger is required. The ledger becomes necessary in multi-operator environments: when a batch migrates across a datacenter boundary between two independent GPU providers, neither party has a trusted reason to accept the other's unilateral audit record. A tamper-evident, append-only shared log — where both parties can verify migration events without relying on the other's honesty — is the correct solution to this specific problem. Blockchain is not chosen because it is novel; it is chosen because it is the standard mechanism for establishing a shared record among mutually distrusting parties.
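The tamper-evidence property alone needs no consensus machinery to demonstrate. A minimal hash-chained append-only log, sketched here purely to show why neither operator can silently rewrite a shared migration record:

```python
# Minimal hash-chained append-only log -- a sketch of the tamper-evidence
# property the permissioned ledger provides, not a consensus implementation.
import hashlib
import json

def append(log: list, event: dict) -> None:
    """Append an event, chaining it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"event": event, "prev": prev, "hash": digest})

def verify(log: list) -> bool:
    """Recompute the chain; any edited entry breaks every later hash,
    so either operator can detect tampering without trusting the other."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps(entry["event"], sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True
```

A permissioned ledger adds replication and finality on top of exactly this structure; the trust argument, however, already follows from the chaining.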

What kind of chain. This protocol does not require and does not propose a public proof-of-work or proof-of-stake chain (Ethereum, Solana, etc.). The correct architecture is a permissioned ledger — a consortium chain with a fixed, known set of validator nodes (the participating datacenter operators). Hyperledger Fabric and similar permissioned ledger systems provide the tamper-evidence and finality guarantees required here with far lower overhead, far higher throughput, and without exposing audit data to public networks. The "on-chain" records described in §5 (wallet creation, migration transfer events, key rotation log) are designed for a permissioned ledger, not a public one.

The on-chain audit log from Cluster 5 is not just a record — it is the basis for a proof-of-useful-work incentive layer. GPU providers earn tokens not for mining meaningless hashes, but for demonstrably improving inference efficiency. The key distinction: every reward claim is verifiable from the migration wallet's cryptographic evidence. A provider cannot fabricate a "successful migration" event — the receiving GPU's key confirmation and the batch's measured latency delta are embedded in the wallet before the reward is computed.
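A sketch of that verification path, using HMAC as a stand-in for the receiving GPU's key confirmation; the field names and reward rate are illustrative assumptions:

```python
# Sketch of reward-claim verification: tokens are computed only from
# evidence already bound into the wallet -- the receiving GPU's signed
# receipt over the migration fields, including the measured latency delta.
import hashlib
import hmac

IN_WALLET_FIELDS = ("batch_wallet", "migration_id", "latency_delta_ms")

def verify_claim(claim: dict, receiver_key: bytes) -> bool:
    """Check the receiving GPU's receipt signature over the claimed fields."""
    msg = "|".join(str(claim[f]) for f in IN_WALLET_FIELDS).encode()
    expected = hmac.new(receiver_key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, claim["receipt_sig"])

def reward(claim: dict, receiver_key: bytes) -> float:
    """A fabricated or altered migration event earns exactly nothing."""
    if not verify_claim(claim, receiver_key):
        return 0.0
    return max(0.0, claim["latency_delta_ms"]) * 0.01   # illustrative rate
```

Because the latency delta is inside the signed message, a provider cannot inflate its claimed improvement after the fact: changing the number invalidates the receipt.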

Action                                 Token Effect    Reason
Host batch, GPU utilization improves   + Earn tokens   Measurable efficiency contribution
Successful migration, latency reduced  + Earn tokens   Proven routing quality
Keep high re-engagement batch warm     + Earn tokens   Prevented costly context rebuild
Drop batch mid-session unexpectedly    − Lose tokens   Forced full context reconstruction
Evict batch that returned within 60s   − Lose tokens   Incorrect re-engagement prediction acted on

The speculation risk. The claim that token value will remain "utility-based, not speculative" is a design goal, not a self-enforcing property. This is the hardest part of the token design to defend and it requires explicit anti-speculation mechanisms, not just a policy statement. The proposed approach: tokens should function as non-transferable compute credits with expiry — closer to airline miles than currency. They purchase compute priority slots, routing access, and decision template library access within the protocol only. They cannot be traded on secondary markets because there is no secondary market interface. If tokens become transferable instruments, this guarantee collapses and the incentive layer should be replaced with a simpler SLA-based penalty/reward structure. That alternative is under consideration and may be the more credible path for enterprise operator adoption.
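The "airline miles" model can be made concrete: credits bound to their earner, spendable only on in-protocol services, expiring on a clock. A sketch with illustrative service names; nothing here is a finalized design:

```python
# Sketch of non-transferable, expiring compute credits. Service names and
# the lot-based bookkeeping are illustrative assumptions.
import time

class ComputeCredits:
    """Compute credits that behave like airline miles, not currency:
    no transfer interface exists, so there is nothing for a secondary
    market to trade, and unspent credits expire."""

    SERVICES = {"priority_slot", "routing_access", "template_library"}

    def __init__(self, owner: str):
        self.owner = owner
        self._lots: list[list[float]] = []      # [amount, expires_at]

    def grant(self, amount: float, ttl_s: float) -> None:
        self._lots.append([amount, time.time() + ttl_s])

    def balance(self) -> float:
        now = time.time()
        self._lots = [lot for lot in self._lots if lot[1] > now]  # expire
        return sum(amount for amount, _ in self._lots)

    def spend(self, amount: float, service: str, caller: str) -> bool:
        if caller != self.owner:                 # non-transferable, ever
            raise PermissionError("credits are bound to the account that earned them")
        if service not in self.SERVICES or self.balance() < amount:
            return False
        remaining = amount
        for lot in self._lots:                   # drain earliest lots first
            take = min(lot[0], remaining)
            lot[0] -= take
            remaining -= take
            if remaining == 0:
                break
        return True
```

The enforcement point is structural: `spend` is the only exit for value, and it accepts only protocol services and the earning account. Removing that restriction is exactly the moment the speculation risk described above materializes.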

What the ledger does not solve. The permissioned chain establishes a shared record of what happened; it does not automatically enforce correct behavior in real time. Enforcement remains the scheduler's job. The chain's role is accountability and auditability after the fact — and as the source of the fine-tune signal that improves the orchestrator over time. These are meaningful properties. They are not magic.

7. Stakeholders Addressed
Nvidia / Hardware Vendors
  Pain: GPUs sit at 28–60% utilization despite massive demand
  Delivers: Proactive memory orchestration raises effective utilization per rack
Data Center Operators
  Pain: No financial instrument tied to compute efficiency
  Delivers: Proof-of-useful-work token with measurable ROI per batch
AI Model Providers
  Pain: No visibility into where context lives or why it's lost
  Delivers: Portable batch identity with verifiable migration history
Enterprise / Regulated Industry
  Pain: Cannot use shared infrastructure due to data sovereignty risk
  Delivers: Cryptographic jurisdiction enforcement (math, not policy)
8. Why This Is Different
9. Development Roadmap
1. Python schema for all 5 clusters
   Dataclasses defining every field in every cluster. Foundation for everything else.
2. Deterministic scheduler rule engine
   Six hard rules. Fully testable. Handles 80% of decisions with zero LLM cost.
3. Decision template library
   5–10 JSON configs covering common routing patterns. Constrains the LLM decision space.
4. GPU fleet simulation
   Python + NumPy simulation of N GPUs and 1,000 batches. Compare BATCHMIND vs. baseline vs. rule-only.
5. Publish results and share with the community
   Key metric: average time a batch spends in the wrong memory tier, before vs. after. Post to HuggingFace, arXiv, and AI infrastructure forums.
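The simulation step can begin as something very small. A toy NumPy sketch (all parameters illustrative) of the comparison it proposes: a reactive scheduler that migrates only after pressure has already hit, versus a proactive one that acts earlier, scored by batch-steps spent under memory pressure:

```python
# Toy fleet simulation: a hotspot GPU under reactive vs. proactive
# migration thresholds. All parameters are illustrative assumptions.
import numpy as np

N_GPUS, N_BATCHES, STEPS = 8, 1000, 300
OVER = 1.4   # load above 1.4x fair share counts as memory pressure

def run(act_at: float, seed: int = 0) -> int:
    """Hotspot scenario: new work keeps landing on GPU 0. The scheduler
    migrates up to 10 batches per step off any GPU whose load exceeds
    act_at x fair share. Returns total batch-steps spent under pressure."""
    rng = np.random.default_rng(seed)
    gpu = rng.integers(0, N_GPUS, N_BATCHES)        # batch placement
    fair = N_BATCHES / N_GPUS
    pressure = 0
    for _ in range(STEPS):
        gpu[rng.integers(0, N_BATCHES, 5)] = 0      # churn toward GPU 0
        load = np.bincount(gpu, minlength=N_GPUS)
        pressure += int(load[load > OVER * fair].sum())
        for g in np.flatnonzero(load > act_at * fair):
            victims = np.flatnonzero(gpu == g)[:10]
            gpu[victims] = int(np.argmin(load))     # move to coolest GPU
    return pressure

reactive = run(act_at=OVER)    # act only once pressure has already hit
proactive = run(act_at=1.1)    # act before pressure hits (BATCHMIND-style)
print(f"pressure: reactive={reactive}, proactive={proactive}")
```

In this toy, the proactive threshold drains the hotspot before it ever crosses the pressure line; the real simulation would add memory tiers, TTLs, migration costs, and the rule engine itself.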
References