Google Released Gemma 4 Yesterday. I Had It Fixing Real Bugs by Lunch.

Source: DEV Community
Google released Gemma 4 yesterday. By the time I went to bed, I had it deployed on my home lab, running real coding benchmarks at 96 tokens per second. The catch: no official llama.cpp image supported the gemma4 architecture yet. The stock CUDA images crash with unknown model architecture: 'gemma4'. So I built it from source, on the same Kubernetes cluster that serves inference. This post is about what it took to go from "model dropped" to "running in production" in about two hours on consumer hardware.

The Setup

My home inference server (I call it ShadowStack):

- 2x NVIDIA RTX 5060 Ti (16GB each, 32GB total VRAM)
- AMD Ryzen 9 7900X, 64GB DDR5
- Ubuntu 24.04, MicroK8s
- NVIDIA driver 590.48.01 (CUDA 13.1)

Everything is managed by LLMKube, a Kubernetes operator I built for running llama.cpp inference. One CRD defines the model, one CRD defines the service, and the operator handles the rest.

Step 1: The Architecture Problem

First attempt: I tried the server-cuda13 image (CUDA 13 build of llama.
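For readers who want to try the same from-source route, here is a minimal sketch of a multi-stage container build for llama.cpp's server with the CUDA backend. The base image tags and the curl-dev package are assumptions on my part (pin whatever matches your driver and distro); the `GGML_CUDA=ON` flag and the `llama-server` target are the standard llama.cpp CMake build options.

```dockerfile
# Sketch only: CUDA image tags are assumptions -- match them to your driver.
FROM nvidia/cuda:13.1.0-devel-ubuntu24.04 AS build
RUN apt-get update && apt-get install -y git cmake build-essential libcurl4-openssl-dev
RUN git clone https://github.com/ggml-org/llama.cpp /src
WORKDIR /src
# GGML_CUDA=ON enables llama.cpp's CUDA backend in the CMake build
RUN cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release \
 && cmake --build build --config Release -j --target llama-server

FROM nvidia/cuda:13.1.0-runtime-ubuntu24.04
RUN apt-get update && apt-get install -y libcurl4 && rm -rf /var/lib/apt/lists/*
COPY --from=build /src/build/bin/llama-server /usr/local/bin/llama-server
ENTRYPOINT ["llama-server"]
```

Building against the latest upstream commit is the point here: architecture support for a brand-new model lands in llama.cpp's source before it lands in any prebuilt image.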