Tesla Sr. Storage Engineer
Job Requirements
Job Description:
Tesla is harnessing the new wave of artificial intelligence to tackle some of the world's hardest problems in transportation and energy. We are looking for a passionate storage engineer to join this highly innovative team and help the company accelerate the world's transition to sustainable energy. As part of Tesla's IT infrastructure team, we deliver always-on storage services so that Tesla can design, build, and support its world-class products. In this critical role, you will work closely with teams up and down our complex IT systems to build and operate highly available, scalable storage platforms, ensuring they are 100% compatible with our engineering, manufacturing, and application systems.
Responsibilities:
· Deploy, configure, and maintain distributed storage, including distributed block storage, object storage, and file storage.
· Ensure our storage infrastructure meets the needs of SDS (software-defined storage), HCI (hyper-converged infrastructure), and hybrid-cloud scenarios.
· Ensure the scalability and performance of storage and backup through capacity planning, performance control, and configuration tuning.
· Demonstrate a high level of technical expertise to support complex storage and backup equipment, including SAN, NAS, and backup across all tiers, as well as backup sol…
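The capacity-planning duty above can be sketched as a simple growth projection. This is a minimal illustration, assuming a compound monthly growth model and an 85% fill threshold; the function name and thresholds are illustrative, not an actual Tesla policy:

```python
import math

def months_until_full(used_tb: float, capacity_tb: float,
                      monthly_growth_rate: float,
                      threshold: float = 0.85) -> float:
    """Estimate months until usage crosses `threshold` of capacity,
    assuming compound growth. Returns 0.0 if already over the threshold."""
    target = capacity_tb * threshold
    if used_tb >= target:
        return 0.0
    if monthly_growth_rate <= 0:
        return math.inf  # flat or shrinking usage never crosses the line
    return math.log(target / used_tb) / math.log(1 + monthly_growth_rate)
```

A projection like this is only a trigger for a closer look; real capacity planning would also account for snapshot overhead, replication factor, and seasonal load.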
Technical Development & Architecture
• Design and implement scalable AI/ML solutions for Compliance use cases.
• Lead the development of efficient ML models and end-to-end data processing pipelines, from ingestion to serving.
• Build robust, production-grade AI services using Python and modern ML frameworks.
• Make and document sound architectural decisions, ensuring systems are scalable, secure, and cost-effective.
• Establish and maintain high engineering standards, including testing, monitoring, and documentation.
Engineering Leadership
• Partner closely with data scientists, product managers, and operations teams to deliver end-to-end AI/ML solutions.
• Define and evolve the technical architecture for AI-powered features and platforms.
• Lead code reviews, enforce best practices, and elevate engineering quality across the team.
• Continuously improve AI system performance, reliability, and latency through experimentation and optimization.
Technical Collaboration & Operations
• Work with cross-functional partners to understand requirements, refine scope, and prioritize technical work.
• Provide technical guidance and mentorship to junior and mid-level engineers.
• Collaborate with platform and DevOps teams to ensure smooth deployment, monitoring, and maintenance of AI systems.
• Implement and evolve MLOps practices (e.g., CI/CD for models, feature stores, model monitoring, and retraining workflows).
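One concrete MLOps practice named above, model monitoring, often starts as drift detection. The sketch below computes the Population Stability Index (PSI) between a training-time and a serving-time categorical distribution; the function, the 1e-6 floor, and the common "PSI < 0.1 means no significant drift" reading are illustrative conventions, not a prescribed implementation:

```python
import math
from collections import Counter

def psi(expected: list[str], actual: list[str]) -> float:
    """Population Stability Index between two categorical samples.
    0.0 means identical distributions; larger values mean more drift."""
    cats = set(expected) | set(actual)
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for c in cats:
        # Floor each proportion so an unseen category does not produce log(0).
        e = max(e_counts[c] / len(expected), 1e-6)
        a = max(a_counts[c] / len(actual), 1e-6)
        score += (a - e) * math.log(a / e)
    return score
```

In a retraining workflow, a PSI alert on model inputs or predictions is a typical trigger for investigation or an automated retrain.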
About the team
The Industrial Energy team designs the eyes, ears, and brains of Tesla's Energy Storage (Megapack) products. These system boards control the central processing, communications, thermal systems, high-voltage safety, and system-level components including breakers, contactors, and pyrofuses.
The Role
The Industrial Energy team is looking for a skilled and motivated individual to support the development, debug, and continuous-improvement activities of the Megapack PCBAs and factory test infrastructure. This person will serve as a first line of support to troubleshoot PCBA failures from factory test as well as field returns. They will also perform sustaining activities such as designing in alternate components, cost-downs, and design improvements. This person will interface with PCBA vendors and Tesla staff in the supply chain, factory test, field service, and design engineering groups, requiring clear and organized communication.
Responsibilities
• Troubleshoot Megapack PCBA failures and drive corrective actions.
• Design tester PCBAs start to finish to support factory test stations.
• Collaborate on the design and improvement of electronics test infrastructure hardware and software.
• Support design updates to Megapack PCBAs.
• Develop and execute test plans to validate circuit performance.
The Role
Tesla is offering a full-time IT Support DevOps AI position in the Information Technology Department (work location: Tesla Gigafactory Shanghai). If you are a versatile expert who combines AI development with DevOps practices, someone who can efficiently tackle challenges, solve complex technical problems in user support and experience scenarios, and reject repetitive and inefficient work patterns, this role is for you. IT Support DevOps AI is a core role connecting the company's IT systems and user-facing processes, standing at the forefront of enhanced user support. You will work across multiple domains, including AI technology R&D, containerized deployment, and operational support. Through this technical practice, you will help the company optimize user interactions, improve support efficiency, and contribute to the core goal of user experience transformation.
Responsibilities
• Undertake AI algorithm R&D, model optimization, and training, with a strong emphasis on fine-tuning (FT), supervised fine-tuning (SFT), reinforcement learning (RL), and advanced tuning techniques; focus on user support scenarios such as data analysis, query resolution, issue detection, and automated assistance to ensure AI technology aligns with user experience needs.
• Deploy, monitor, and scale AI solutions on container technologies such as Kubernetes (K8s) and Docker, ensuring high availability and stability in the operational environment, while integrating underlying AI technologies such as neural networks and Transformer architectures for efficient performance.
• Participate in DevOps process development; optimize the full lifecycle of AI model and system development, testing, and deployment; and realize automated deployment, continuous integration (CI), and continuous delivery (CD), incorporating RL-based optimization and model tuning for adaptive user support systems.
• Collaborate with user-support-related departments such as helpdesk, customer service, and product teams to deeply understand user pain points and provide data-driven AI technical solutions, leveraging SFT and attention mechanisms to enhance personalized user experiences.
• Respond quickly to technical requirements and faults in user-facing systems; troubleshoot issues in AI systems, container clusters, and network environments; minimize impact on user interactions; and improve support efficiency and satisfaction through advanced AI tuning and model-level diagnostics.
• Track cutting-edge technologies in the AI and DevOps fields (e.g., large language models with FT/SFT/RL integration, cloud-native operations) and industry trends; promote the pre-research and application of new technologies in user support scenarios; and continuously optimize system performance using techniques like model compression and quantization.
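The supervised fine-tuning (SFT) emphasized throughout this role ultimately minimizes a token-level cross-entropy loss between model logits and the labeled next token. The framework-free sketch below shows that objective on toy logits; the function and values are a simplified illustration of the SFT objective, not part of any real training codebase:

```python
import math

def cross_entropy(logits: list[float], target_idx: int) -> float:
    """Token-level cross-entropy loss: -log softmax(logits)[target_idx].
    This is the quantity SFT minimizes, summed over supervised tokens."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_idx]
```

In real SFT, a framework such as PyTorch computes this loss over batches of tokenized (prompt, response) pairs and backpropagates through the model; the math per token is the same.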
The Role
Compute is the most important driver in accelerating the maturation of AI-enabled products. Today, Tesla is at the forefront of creating meaningful real-world products using AI. We design, build, and run large-scale GPU clusters that enable our teams to build better products faster. We are an extremely small team, and the work of every member carries an immense amount of weight. Working with the team, you will build performance testing tools, health check tools, tooling for better metric collection, and other fun projects.
Responsibilities
You'll be working in a cross-functional and highly versatile team that designs, implements, and maintains HPC technical stacks.
• Leverage and improve upon existing cluster management solutions to ensure rapid deployment and scalability.
• Ensure the reliability of existing systems to guarantee uptime and availability of core foundational services.
• Influence architectural decisions with a focus on security, scalability, and high performance.
• Work with engineering teams to understand which metrics are useful to collect, and implement such monitoring and alerting with existing monitoring solutions.
• Improve root cause analysis and corrective action for problems large and small; identify patterns and design task automations.
• Help develop automated tools that collect information users can apply directly to root cause analysis of issues in their job submissions.
• Organize and document implemented solutions for long-term information retention in our internal ticketing and documentation system.
• Take part in a 24x7 on-call rotation.
Must
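The health-check tooling described above often starts as simple threshold checks over per-node telemetry. The sketch below flags nodes with overheating GPUs or ECC errors; the report shape, field names, and thresholds are illustrative assumptions, not an actual cluster policy:

```python
from dataclasses import dataclass

@dataclass
class NodeReport:
    """A simplified per-node telemetry snapshot (illustrative fields)."""
    name: str
    gpu_temps_c: list[int]
    ecc_errors: int

def health_check(report: NodeReport,
                 max_temp_c: int = 85,
                 max_ecc: int = 0) -> list[str]:
    """Return human-readable failures; an empty list means the node passes."""
    failures = []
    hot = [t for t in report.gpu_temps_c if t > max_temp_c]
    if hot:
        failures.append(f"{report.name}: {len(hot)} GPU(s) over {max_temp_c}C")
    if report.ecc_errors > max_ecc:
        failures.append(f"{report.name}: {report.ecc_errors} ECC error(s)")
    return failures
```

In practice checks like these would read from a telemetry agent (e.g., NVIDIA's DCGM exports similar counters), run on a schedule, and drain failing nodes from the scheduler before user jobs land on them.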