One of my conclusions from today: the slowness in the USER stage for Tensor.uniform() calls in @__tinygrad__ comes from the number of chained method calls involved, with each of those calls also adding some profiling/metadata overhead via __wrapper__.
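
A toy sketch of the effect, not tinygrad's actual code: `ToyTensor`, `wrapper`, `uniform_like`, and `call_log` are hypothetical names made up for illustration. It assumes every public method is decorated with a bookkeeping wrapper, so one user-level call that is built from chained methods pays the wrapper cost several times.

```python
import functools

# Stand-in for the profiling/metadata work done on every wrapped call (hypothetical).
call_log = []

def wrapper(fn):
    """Hypothetical per-method wrapper that records each call before running it."""
    @functools.wraps(fn)
    def inner(*args, **kwargs):
        call_log.append(fn.__name__)  # bookkeeping cost paid on every call
        return fn(*args, **kwargs)
    return inner

class ToyTensor:
    def __init__(self, value):
        self.value = value

    @wrapper
    def cast(self):
        return ToyTensor(float(self.value))

    @wrapper
    def mul(self, x):
        return ToyTensor(self.value * x)

    @wrapper
    def add(self, x):
        return ToyTensor(self.value + x)

    @wrapper
    def uniform_like(self, low, high):
        # One user-facing call fans out into three more wrapped calls.
        return self.cast().mul(high - low).add(low)

t = ToyTensor(1).uniform_like(2, 5)
print(len(call_log), call_log)
```

Here a single `uniform_like` call goes through the wrapper four times. If the real chain is longer, the per-call bookkeeping grows with it, which would be consistent with what I saw in the USER stage.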