NVIDIA Senior Solutions Architect - KV Cache and AI Storage
Requirements
• Bachelor's degree or higher in Computer Science or a related field, with a strong systems or storage background.
• 5+ years of relevant experience, including 2+ years focused on KV stores/caches or storage backends.
• Hands-on experience with distributed storage, caching, or large-scale backend systems.
• Solid understanding of Transformer / LLM inference and KV cache concepts, plus experience with at least one LLM serving stack (for example vLLM, TensorRT-LLM, or SGLang).
• Strong knowledge of NVMe SSDs, KV SSDs, and modern storage servers, including controller/firmware behavior and I/O characteristics.
• Practical experience with tiered memory and KV cache optimizations such as offloading (HBM → DRAM → NVMe), eviction/selection strategies, compression/quantization, or attention-level optimizations.
• Familiarity with at least one large-scale storage or caching system (such as Ceph, Redis, Cassandra, RocksDB-based KV, object storage, or distributed logs).
Ways to stand out from the crowd: …
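The tiered offloading pattern mentioned above (HBM → DRAM → NVMe) can be illustrated with a minimal two-tier sketch: a small "fast" tier spills least-recently-used entries into a larger "slow" backing tier, and promotes them back on access. Tier names, capacities, and the LRU policy here are illustrative assumptions, not any specific NVIDIA product's behavior.

```python
from collections import OrderedDict

class TieredKVCache:
    """Minimal two-tier KV cache sketch (illustrative only).

    The 'fast' tier stands in for HBM/DRAM, the 'slow' dict for an NVMe
    backing store; eviction is plain LRU.
    """

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # LRU order: oldest entry first
        self.slow = {}             # unbounded backing tier

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        # Spill least-recently-used entries down to the slow tier
        while len(self.fast) > self.fast_capacity:
            old_key, old_val = self.fast.popitem(last=False)
            self.slow[old_key] = old_val

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # refresh LRU position
            return self.fast[key]
        if key in self.slow:
            value = self.slow.pop(key)
            self.put(key, value)        # promote back to the fast tier
            return value
        return None
```

Real deployments layer selection policies, compression/quantization, and asynchronous I/O on top of this basic spill/promote loop.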
Responsibilities
• Lead technical exploration with customer architects to understand models, frameworks, SLOs, and KV cache usage patterns.
• Build end-to-end KV cache solutions using tiered memory and NVIDIA modern networking technologies.
• Analyze performance profiles, identify bottlenecks, and drive PoCs and benchmarks to validate improvements.
• Translate customer difficulties into clear feature requests and roadmap input for NVIDIA products.
• Build reference architectures and best-practice guides, and deliver tech talks to support our field teams and customers.
• Define the end-to-end technical architecture for the NIM Factory, from container build systems and CI/CD to Kubernetes deployment patterns and runtime optimization.
• Drive technical strategy and roadmap, making high-impact decisions on frameworks, technologies, and standards that empower dozens of engineering teams.
• Architect and influence the design of workflow orchestration systems that underpin the NIM Factory.
• Coach and mentor senior engineers across the organization, fostering a culture of technical excellence, innovation, and knowledge sharing.
• Champion best practices in software development, including API design, automation, observability, and secure supply chain management.
• Collaborate with leadership across research, backend, SRE, and product to align technical vision with product goals and influence technical roadmaps.
• Build AI/HPC infrastructure for new and existing customers.
• Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting.
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Provide feedback to internal teams, such as opening bugs, documenting workarounds, and suggesting improvements.
• Design, implement, and optimize scalable ML training pipelines for training multimodal foundation models for robotics.
• Collaborate with researchers to integrate cutting-edge model architectures into scalable training pipelines.
• Implement scalable data loaders and preprocessors for multimodal datasets, such as videos, text, and sensor data.
• Optimize GPU and cluster utilization for efficient model training and fine-tuning on massive datasets.
• Develop robust monitoring and debugging tools to ensure the reliability and performance of training workflows on large GPU clusters.
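The multimodal data-loader responsibility above can be sketched as a simple per-modality collation step: samples arrive as dicts keyed by modality, and the loader groups them into fixed-size batches with one list per modality. The field names (`video`, `text`) are illustrative assumptions; production loaders would add shuffling, prefetching, and tensor conversion.

```python
from itertools import islice

def batched_multimodal_loader(samples, batch_size):
    """Minimal multimodal batching sketch (illustrative only).

    Each sample is a dict mapping modality name -> payload; batches are
    dicts mapping modality name -> list of payloads.
    """
    it = iter(samples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Collate: gather each modality across the batch into one list
        yield {key: [s[key] for s in batch] for key in batch[0]}
```

A real pipeline would wrap this in a framework-native dataset (e.g. an iterable dataset with worker sharding) rather than a plain generator.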