Blog
Long-form technical writing.
Optimizing a Layer Normalization Kernel with CUDA: a Worklog
An iterative guide to writing and optimizing a CUDA layer normalization kernel, from a naive single-thread implementation to vectorized loads, with the results benchmarked against PyTorch.