it was fun to work on this. check out the blog post if you want to see how to get expert parallelism to scale linearly, with internal worklogs and kernel-level optimizations!