Insights
Notes on production AI systems, distributed infrastructure, and practical model engineering.
May 2nd, 2026
Plan B in practice: four TensorRT engines (vit, embed, lm, post) plus a Python BLS router that genuinely skips the ViT for text-only requests, with the perf_analyzer numbers to show what that buys you.
April 27th, 2026
Plan A in practice: exporting llama-nemotron-embed-vl-1b-v2 as one ONNX graph, lowering it to a single TensorRT plan, serving it on Triton, and benchmarking the whole thing with perf_analyzer on an RTX 6000 Ada.
April 23rd, 2026
What the Nemotron multimodal embedding model actually is, how it processes inputs end to end, and the two serving shapes I considered before picking one.
March 29th, 2026
Three synthetic head architectures on the same hardware reveal why GPU utilization is not about model size — it is about GEMM shape.
March 10th, 2026
Load testing bert-tiny on Triton + TensorRT reveals a fundamentally different bottleneck from the one large models hit: not memory transfer latency, but CPU-GPU synchronization that keeps a fast GPU starved between batches.
March 6th, 2026
Load testing multilingual-e5-large-instruct on Triton + TensorRT and reading what Nsight Systems actually shows — batch size, instance count, and where the GPU cycles go.
March 5th, 2026
A breakdown of how GPU work is structured — SMs, warps, Tensor Cores, memory hierarchy — and what the utilization number actually measures before you try to optimize it.