Writing

Notes on production AI systems, distributed infrastructure, and practical model engineering.

When Embeddings Lie About the Facts

July 5th, 2026

I compared three metrics on XSumFaith: whole-text semantic similarity, atomic chunk alignment, and LLM-as-a-judge grounding. One metric passed a summary that invented an earthquake. Here is what each approach actually measures.

Multimodal llama-nemotron-embed-vl-1b-v2 — (Part 3): Multi-Engine BLS Router on Triton

May 2nd, 2026

Plan B in practice: four TensorRT engines (vit, embed, lm, post) plus a Python BLS router that genuinely skips the ViT for text-only requests, with the perf_analyzer numbers to show what that buys you.

Multimodal llama-nemotron-embed-vl-1b-v2 — (Part 2): Single Fused TensorRT Engine on Triton

April 27th, 2026

Plan A in practice: exporting llama-nemotron-embed-vl-1b-v2 as one ONNX graph, lowering it to a single TensorRT plan, serving it on Triton, and benchmarking the whole thing with perf_analyzer on an RTX 6000 Ada.

Multimodal llama-nemotron-embed-vl-1b-v2 — (Part 1): The Model and Inference Strategies

April 23rd, 2026

What the Nemotron multimodal embedding model actually is, how it processes inputs end to end, and the two serving shapes I considered before picking one.

GPU Utilization & Profiling — (Part 4): Model Architecture and GEMM Shape

March 29th, 2026

Three synthetic head architectures on the same hardware show why GPU utilization depends on GEMM shape rather than model size.

GPU Utilization & Profiling — (Part 3): Small Models

March 10th, 2026

Load testing bert-tiny on Triton + TensorRT reveals a fundamentally different bottleneck from the one large models hit: not memory transfer latency, but CPU-GPU synchronization that keeps a fast GPU starved between batches.

GPU Utilization & Profiling — (Part 2): Mid-Large Models

March 6th, 2026

Load testing multilingual-e5-large-instruct on Triton + TensorRT and reading what Nsight Systems actually shows about batch size, instance count, and where the GPU cycles go.

GPU Utilization — (Part 1): What That Number Actually Means

March 5th, 2026

A breakdown of how GPU work is structured — SMs, warps, Tensor Cores, memory hierarchy — and what the utilization number actually measures before you try to optimize it.