Dnotitia’s STAR-KV Tackles Long-Context AI Memory Bottlenecks

Selected as a Spotlight paper at ICML 2026, Dnotitia’s new STAR-KV framework achieves up to 20x compression of the KV cache. By combining low-rank approximation with mixed-precision quantization, the method addresses the memory constraints that currently limit long-context AI performance and inference speed.
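To see why pairing the two techniques pays off, here is a back-of-the-envelope sketch of the storage accounting. It assumes a generic scheme, not STAR-KV's actual algorithm: each token's cached vector of width `dim` is reduced to `rank` coefficients, and those coefficients are stored at an average of `avg_bits` bits instead of 16. The function name and all example numbers are illustrative and are not taken from the paper.

```python
def compression_ratio(dim: int, rank: int, avg_bits: float, base_bits: int = 16) -> float:
    """Ratio of uncompressed to compressed per-token storage.

    Hypothetical accounting: ignores the shared projection matrix and
    quantization metadata (scales, zero points), which are amortized
    over many tokens in a long context.
    """
    if not 0 < rank <= dim:
        raise ValueError("rank must be in (0, dim]")
    if avg_bits <= 0:
        raise ValueError("avg_bits must be positive")
    original_bits = dim * base_bits    # full-width vector at 16 bits per value
    compressed_bits = rank * avg_bits  # fewer values, each stored in fewer bits
    return original_bits / compressed_bits


# Illustrative settings only (not the paper's configuration).
low_rank_only = compression_ratio(dim=1024, rank=256, avg_bits=16)
quant_only = compression_ratio(dim=1024, rank=1024, avg_bits=4)
combined = compression_ratio(dim=1024, rank=256, avg_bits=4)
print(low_rank_only, quant_only, combined)  # 4.0 4.0 16.0
```

In this simplified model the two savings multiply: a 4x reduction in rank and a 4x reduction in bit width give 16x together, which is why combining them can reach ratios that neither achieves alone at comparable settings.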

The research, a collaboration between Dnotitia and UC San Diego’s VVIP Lab, focuses on the KV cache, a temporary store in GPU memory that lets large language models avoid recomputing previously processed tokens. As AI agents ingest ever larger amounts of external data, this cache has become a primary bottleneck. Using a LLaMA-3.1-8B model with a 128K-token context, the team found that the KV cache consumes roughly 81% of total GPU memory, which underscores the need for compression.
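The size of the cache follows directly from the model's shape. As a rough sketch, the arithmetic below uses the publicly documented LLaMA-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128) and assumes 16-bit storage. The article does not state the batch size or precision behind the 81% figure, so this illustrates the scaling rather than reproducing the team's measurement; the function name is hypothetical.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Leading factor of 2: one key vector and one value vector per position.
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_value


GIB = 1024 ** 3

# LLaMA-3.1-8B shape; 16-bit values and batch size 1 are assumptions.
per_token = kv_cache_bytes(32, 8, 128, tokens=1)
full_context = kv_cache_bytes(32, 8, 128, tokens=128 * 1024)

print(per_token // 1024, "KiB per token")          # 128 KiB per token
print(full_context / GIB, "GiB at 128K tokens")    # 16.0 GiB at 128K tokens
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache, roughly as much as the model's 16-bit weights, and the total grows linearly with every additional concurrent sequence.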

STAR-KV uses custom GPU kernels to speed up attention computation by up to 6.9x and to raise overall generation throughput by 3.1x. Beyond the memory savings, the approach maintains higher accuracy than existing compression methods. The paper was accepted into the competitive ICML 2026 program, placing it among the top 2.2% of submissions, and Dnotitia has already released the source code on GitHub. CEO MK Chung said the company intends to integrate these advances into open-source inference frameworks such as vLLM, with the aim of lowering the operating costs of long-context AI services.
