Lossless LLM (Large Language Model) compression promises a meaningful efficiency gain for GPU inference, most visibly for very large models: the newly introduced DFloat11 approach lets a 405B-parameter model run on a single node of 8×80GB GPUs, easing the infrastructure burden for research labs and startups. Compared with the traditional fallback of offloading weights to the CPU, DFloat11 reports 1.9-38.8x higher token-generation throughput, and it supports 5.3-13.17x longer context lengths within the same memory budget (a back-of-the-envelope check of the memory claim follows below). The main reservations concern speed rather than accuracy: commenters question batch-processing throughput, particularly at small batch sizes, and ask whether the method is truly "lossless" in operational speed as well as in bits. Users voice excitement over the rapid pace of developments in ML and transformer models while also questioning how broadly the technique applies to other kinds of models. The overall sentiment underscores the transformative potential of such advances, even as questions about implementation and performance remain open.
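To make the headline memory claim concrete, here is a minimal back-of-the-envelope check in Python. It assumes the uncompressed weights are BF16 (16 bits per parameter) and that DFloat11 stores roughly 11 bits per parameter, as the name suggests; the 11-bit rate and the decision to ignore KV cache and activations are illustrative assumptions, not figures taken from the discussion above.

```python
# Back-of-the-envelope check: can a 405B-parameter model fit on 8x80GB GPUs?
# Assumptions (illustrative, not from the source discussion):
#   - uncompressed weights are BF16, i.e. 16 bits per parameter
#   - DFloat11 stores roughly 11 bits per parameter (as the name suggests)
#   - KV cache and activations are ignored; this counts weights only

NUM_PARAMS = 405e9      # 405B parameters
GPU_MEMORY_GB = 80
NUM_GPUS = 8

def weight_footprint_gb(bits_per_param: float) -> float:
    """Weight-only memory footprint in gigabytes (1 GB = 1e9 bytes)."""
    return NUM_PARAMS * bits_per_param / 8 / 1e9

total_gb = GPU_MEMORY_GB * NUM_GPUS    # 640 GB across the node
bf16_gb = weight_footprint_gb(16)      # ~810 GB: does not fit
df11_gb = weight_footprint_gb(11)      # ~557 GB: fits with headroom

print(f"Node capacity: {total_gb:.0f} GB")
print(f"BF16 weights:  {bf16_gb:.0f} GB (fits: {bf16_gb <= total_gb})")
print(f"DF11 weights:  {df11_gb:.0f} GB (fits: {df11_gb <= total_gb})")
```

Under these assumptions the uncompressed model needs about 810 GB, which is why it cannot fit in 640 GB without CPU offloading, while the compressed footprint of roughly 557 GB leaves on the order of 80 GB free, headroom that plausibly accounts for the longer context lengths reported.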