> you are a person
> who wants to understand llm inference
> you read papers
> “we use standard techniques”
> which ones? where is the code?
> open vllm
> 100k lines of c++ and python
> custom cuda kernel for printing
> close tab
> now you have this tweet
> and mini-sglang
> ~5k lines of python
> actual production features
> four processes
> api server
> tokenizer
> scheduler
> detokenizer
> talk over zeromq
> simple
> scheduler is the boss
> receives requests
> decides: prefill or decode
> batches them
> sends work to gpu
> prefill...
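
roughly what that scheduler loop could look like. a minimal sketch assuming pyzmq; the socket paths, the prefill-first policy, and the run_on_gpu hook are made up for illustration, not mini-sglang's actual code:

```python
# hypothetical sketch of the scheduler process described above:
# pull tokenized requests in over zeromq, decide prefill vs decode,
# batch, run one gpu step, push results to the detokenizer
import zmq

def make_prefill_batch(waiting):
    # hypothetical policy: prefill everything that's waiting
    # (a real scheduler caps this by a token budget / kv-cache memory)
    batch, waiting[:] = list(waiting), []
    return batch

def make_decode_batch(running):
    # hypothetical policy: one decode step for every in-flight request
    return list(running)

def scheduler_loop(run_on_gpu):
    # run_on_gpu stands in for the actual model forward pass
    ctx = zmq.Context()
    from_tokenizer = ctx.socket(zmq.PULL)    # tokenized requests in
    from_tokenizer.bind("ipc:///tmp/tok_to_sched")
    to_detokenizer = ctx.socket(zmq.PUSH)    # generated token ids out
    to_detokenizer.bind("ipc:///tmp/sched_to_detok")

    waiting, running = [], []                # new vs in-flight requests
    while True:
        # drain newly arrived requests without blocking the decode loop
        while True:
            try:
                waiting.append(from_tokenizer.recv_pyobj(flags=zmq.NOBLOCK))
            except zmq.Again:
                break

        # the core decision: prefill new work first, otherwise decode
        if waiting:
            kind, batch = "prefill", make_prefill_batch(waiting)
            running.extend(batch)
        elif running:
            kind, batch = "decode", make_decode_batch(running)
        else:
            continue  # nothing to do; a real loop would block here

        for out in run_on_gpu(kind, batch):  # one gpu step for the batch
            to_detokenizer.send_pyobj(out)   # stream results downstream
```

prefill-first is just one possible policy. the point is the scheduler owns the decision; every other process just moves bytes.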