Nemotron 3 Super 120B NVFP4
Separate vLLM instances on 1x H200 PCIe NVL and 1x RTX Pro 6000 Blackwell
Ran a synthetic coding-agent-style workload on each, targeting 2k-45k input tokens, 80-3k max output tokens, and 10 concurrent requests across 100 total prompts
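A minimal sketch of what such a workload driver can look like, in Python against the OpenAI-compatible endpoint configured below; the prompt generator and token ranges are illustrative, not the exact harness used:

import asyncio, random, statistics, time
from openai import AsyncOpenAI

# Points at the vLLM server configured below; model name and key match its flags.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")
sem = asyncio.Semaphore(10)  # 10 concurrent requests

async def one_request() -> float:
    # Hypothetical prompt generator: ~5 tokens per repeat lands the input
    # roughly in the 2k-45k token range.
    prompt = "Refactor this:\n" + ("x = x + 1\n" * random.randint(400, 9000))
    async with sem:
        t0 = time.perf_counter()
        stream = await client.chat.completions.create(
            model="nemotron-3-super-nvfp4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=random.randint(80, 3000),  # 80-3k max output tokens
            stream=True,
        )
        ttft = None
        async for _chunk in stream:
            if ttft is None:
                ttft = time.perf_counter() - t0  # time to first token
        return ttft

async def main():
    ttfts = await asyncio.gather(*(one_request() for _ in range(100)))  # 100 prompts
    print(f"TTFT mean {statistics.mean(ttfts)*1000:.0f}ms / "
          f"median {statistics.median(ttfts)*1000:.0f}ms")

asyncio.run(main())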
Average tok/s:
- 261.57 tok/s (H200, NVFP4 GEMM=Marlin)
- 175.44 tok/s (H200, NVFP4 GEMM=Emulated)
- 182.90 tok/s (RTX Pro 6000)
TTFT mean / median:
- 2281ms / 1091ms (H200, NVFP4 GEMM=Marlin)
- 2849ms / 1374ms (H200, NVFP4 GEMM=Emulated)
- 1799ms / 948ms (RTX Pro 6000)
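Back-of-the-envelope from those averages: the Marlin NVFP4 kernels buy roughly 1.5x throughput over the emulated path on the same H200, and the H200+Marlin run lands about 1.4x above the RTX Pro 6000:

marlin, emulated, rtx = 261.57, 175.44, 182.90  # avg tok/s from above
print(f"Marlin vs emulated (H200): {marlin / emulated:.2f}x")  # -> 1.49x
print(f"H200 Marlin vs RTX Pro 6000: {marlin / rtx:.2f}x")     # -> 1.43x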
On the 1x H200, vLLM falls back to the following backends:
- FP8 dense: Cutlass
- NVFP4 GEMM: Marlin
- NVFP4 MoE: Marlin
- Attention: Triton
- KV cache: FP8
1x RTX Pro 6000 Blackwell:
- FP8 dense: FlashInfer
- NVFP4 GEMM: FlashInfer Cutlass
- NVFP4 MoE: FlashInfer Cutlass
- Attention: Triton
- KV cache: FP8
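The selected kernels show up in the vLLM startup logs; once a server is running, its Prometheus endpoint at /metrics is also a quick cross-check on throughput and TTFT. A minimal probe (the filter substrings are illustrative; assumes /metrics is not behind the API key):

import requests

# Print the TTFT histogram and generated-token counters from vLLM's /metrics.
metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if "time_to_first_token" in line or "generation_tokens" in line:
        print(line)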



vLLM configuration on the H200 and RTX Pro 6000 instances:
vllm serve <NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4-Path> \
--async-scheduling \
--served-model-name nemotron-3-super-nvfp4 \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-seqs 512 \
--host 0.0.0.0 \
--port 8000 \
--api-key YOUR_API_KEY \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin <NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4-Path>/super_v3_reasoning_parser.py \
--reasoning-parser super_v3
====================
vLLM configuration with emulated NVFP4 GEMM on the H200:
export VLLM_USE_NVFP4_CT_EMULATIONS=1
vllm serve <NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4-Path> \
--async-scheduling \
--served-model-name nemotron-3-super-nvfp4 \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-seqs 512 \
--host 0.0.0.0 \
--port 8000 \
--api-key YOUR_API_KEY \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin <NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4-Path>/super_v3_reasoning_parser.py \
--reasoning-parser super_v3 \
-cc '{"cudagraph_mode":0}'
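Once either instance is up, it speaks the standard OpenAI-compatible API; a quick smoke test against the flags above (same served model name, API key, and port):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")
resp = client.chat.completions.create(
    model="nemotron-3-super-nvfp4",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write binary search in Python."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)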