SwiftKV optimizations, developed by Snowflake and integrated into vLLM, can improve LLM inference throughput by up to 50%, the company said.
Cloud-based data warehouse company Snowflake has open-sourced SwiftKV, an approach designed to reduce the cost of inference workloads for enterprises running generative AI-based applications. SwiftKV was first launched in December.
SwiftKV is significant because inference costs for generative AI applications remain high, deterring enterprises from scaling those applications or extending generative AI to new use cases, the company explained.
SwiftKV goes beyond KV cache compression
SwiftKV, according to Snowflake’s AI research team, aims to go beyond key-value (KV) cache compression, an approach used in large language models (LLMs) to reduce the memory required to store the KV pairs generated during inference.
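To see why that memory footprint matters, consider a back-of-the-envelope calculation. The model shapes below are hypothetical and illustrative, not Snowflake's numbers: during decoding, every generated token adds one key tensor and one value tensor to the cache in every layer, so the cache grows linearly with sequence length and batch size.

```python
# Back-of-the-envelope arithmetic with hypothetical transformer shapes,
# showing how the KV cache scales; illustrative only, not Snowflake's code.
n_layers, n_heads, head_dim = 32, 32, 128   # assumed model dimensions
bytes_per_elem = 2                          # fp16 storage
seq_len, batch_size = 4096, 8

# A key tensor and a value tensor are cached per layer for every token in
# every sequence, so memory grows linearly in seq_len and batch_size.
kv_bytes = 2 * n_layers * n_heads * head_dim * bytes_per_elem * seq_len * batch_size
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # ~16 GiB for this configuration
```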
KV cache compression reduces that footprint by compressing or discarding previously computed KV pairs through methods such as pruning, quantization, and adaptive compression; a sketch of the quantization idea follows below. What …
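Of those methods, quantization stores each cached tensor at lower numeric precision. Here is a minimal sketch assuming simple per-tensor int8 scaling; the function names are illustrative, and this is not Snowflake's or vLLM's implementation (production systems typically quantize per channel or per group):

```python
# A minimal, hypothetical sketch of int8 quantization applied to cached KV
# tensors, one of the compression methods mentioned above.
import numpy as np

def quantize_kv(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Map fp16 values onto int8 plus a single scale factor, halving storage."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    q = np.round(x.astype(np.float32) / scale).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate fp16 values when the cache entry is read back."""
    return (q.astype(np.float32) * scale).astype(np.float16)

keys = np.random.randn(32, 128).astype(np.float16)  # one head's cached keys
q, scale = quantize_kv(keys)
print(q.nbytes / keys.nbytes)                               # 0.5: half the memory
print(float(np.abs(dequantize_kv(q, scale) - keys).max()))  # small rounding error
```

Going from fp16 to int8 halves the cache at the cost of some precision, which is the general trade-off all of these compression methods navigate.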