
Tesla Sr System Engineer - HPC

Experienced hire | Full-time | IT - Infrastructure and Operations | Location: Shanghai | Status: Hiring

Requirements


Experience with cluster deployment and operations on Linux distributions (Ubuntu/RHEL).

Advanced experience with configuration management systems such as Ansible.

Demonstrable knowledge of TCP/IP, RoCE, Linux operating system internals, filesystems, disk/storage technologies, and storage protocols.

Experience designing and deploying medium- to large-scale InfiniBand networks.

Proficiency in a high-level programming and/or scripting language (Python, Go, Bash).

Experience with containers (Docker, Kubernetes).

Familiarity with Prometheus, Grafana, and Splunk for monitoring and alerting.

Experience administering HPC workload managers (SLURM, BCM, etc.).

Experience with high-throughput, low-latency networks and GPU-based computing systems.

Fluent in reading, writing, and speaking English.

Preferred

Previous experience at a large-scale data center running HPC workloads
Experience with parallel filesystems
Bachelor’s degree in computer science, electrical engineering, mathematics, or a related field, with 5+ years of additional equivalent experience, or evidence of exceptional ability related to the position.

This job application may involve an interview with an interviewer outside of Tesla China. If you complete your application, you agree that Tesla may provide your application information to overseas interviewers at Tesla, Inc. for recruitment purposes. For more details and contact information, please see: https://app.mokahr.com/social-recruitment/tesla/46129#/

Job Responsibilities


The Role

Compute is the most important driver in accelerating the maturation of AI-enabled products. Today, Tesla is at the forefront of creating meaningful real-world products using AI. We design, build, and run large-scale GPU clusters that enable our teams to build better products faster. We are an extremely small team, and the work of every member carries an immense amount of weight. Working with the team, you will build performance testing tools, health check tools, and tools for better metric collection, and take on other fun projects.



Responsibilities

You’ll be working in a cross-functional and highly versatile team that designs, implements, and maintains HPC technical stacks.

Leverage and improve upon existing cluster management solutions to ensure rapid deployment and scalability.

Ensure the reliability of the existing systems to guarantee uptime and availability of core foundational services.

Influence architectural decisions with a focus on security, scalability, and high performance. Work with engineering teams to understand which metrics are useful to collect, and implement the corresponding monitoring and alerting with existing monitoring solutions.

Improve root cause analysis and corrective action for problems large and small – identify patterns and design task automations.

Help develop automated tools that collect information users can apply directly to root cause analysis of issues in their job submissions.


Organize and document implemented solutions for long-term information retention in our internal ticketing and documentation system.

Take part in a 24x7 on-call rotation.



Must
Includes English materials
Linux
Ubuntu
Ansible
TCP/IP
Python
Go
Bash
Docker
Kubernetes
Prometheus
Grafana
HPC
Related Positions

NVIDIA
Experienced hire

NVIDIA data center systems, such as DGX and HGX, have become core to NVIDIA's rapidly growing enterprise and cloud provider businesses. These platforms bring together the full power of NVIDIA GPUs, NVIDIA NVLink, NVIDIA InfiniBand networking, NVIDIA Grace CPUs, and a fully optimized NVIDIA AI and HPC software stack. We are hiring a Sr. Software Engineer who will help build simulators for our DGX Server platforms. Simulations play a significant role in building scalable systems at Speed of Light! You will work with world-class engineering teams across HW and SW.
What you’ll be doing:
• Contribute to architecting and developing the simulation platform for next-gen NVIDIA DGX platforms.
• Build, integrate, and enhance simulator components with new HW features and write supporting technical documents.
• Bring the full SW stack up on the DGX Simulator; work closely with hardware modeling, kernel, and platform driver teams distributed globally.
• Improve performance, fix bugs across the user and kernel stack, and automate execution flow.

Updated 2025-09-22
Tesla
Experienced hire | Low-Voltage Electronic Systems

We are seeking a highly motivated and experienced embedded SW engineer to join our LV Electronic Module Design team in Tesla Shanghai. The candidate will play a crucial role in algorithm design and implementation, embedded software development, and module/vehicle-level testing and validation. The candidate should have a good understanding of and hands-on experience with RTOS, ARM architecture, embedded system hardware, etc. Knowledge of and experience with wireless phone charging, such as Qi and high-power private protocols, are highly appreciated.
Responsibilities:
• Architect, code, and debug embedded software in C/C++ for microcontrollers to implement functions required in electronic modules.
• Develop device drivers (SPI/I2C/UART/USB/CAN) and RTOS modules, and implement communication protocol stacks.
• Collaborate closely with the EE/FW engineers to ensure seamless integration and system-level evaluations.
• Troubleshoot using oscilloscopes, logic analyzers, JTAG debuggers, and protocol analyzers.
• Support manufacturing testing and CI.
• Support the production FW build and release.

Amazon
Experienced hire | Hardware

As a Battery System Engineer, you will engage with an experienced cross-disciplinary staff to conceive and design innovative consumer products. You will work closely with an internal interdisciplinary team and outside partners to drive key aspects of product definition and execution. You must be responsive, flexible, and able to succeed within an open, collaborative peer environment.
In this role, you will:
1. Lead the design, development, and delivery of Li-ion battery systems per performance and safety requirements
2. Drive battery development from NPI through mass production
3. Research and evaluate emerging battery technologies
4. Collaborate with product teams to define battery specifications
5. Design battery protection circuits and packs for NPI programs, including schematic design and component selection
6. Develop and review battery pack schematics, BOMs, and layout to meet design requirements
7. Conduct system and design reviews, failure mode and effects analysis (DFMEA), and risk assessments
8. Analyze and resolve battery-related issues in production and in the field
9. Perform battery safety assessment and design for safety
10. Support battery certification processes (CTIA/IEEE1725)
11. Manage and coordinate with CMs (contract manufacturers) on battery development for NPI programs
12. Build and maintain strong relationships with suppliers and manufacturing partners

Updated 2025-10-02
Tesla
Experienced hire | Infrastructure

The Role
Tesla is looking for a technical and industry-experienced engineer to join a team of talented engineers. As part of the Tesla IT Operation team, we are responsible for delivering 7x24 system infrastructure and provide a portfolio of services including configuration management, engineering tools, identity access and control, and management of public and private cloud infrastructure. Ensuring security and extreme reliability is our fundamental design principle. The candidate must be hands-on on a day-to-day basis, with experience in building, operating, and driving reliability and security for production systems at scale.
Responsibilities
• Responsible for the design, deployment, and support of manufacturing systems and network infrastructure.
• Provide support for China-based infrastructure build-out, including datacenter and Linux systems (both virtualized and bare-metal servers).
• Install, configure, and maintain the Linux server environment.
• Ensure the reliability of the existing systems to guarantee uptime and availability of core infrastructure services.
• Perform root-cause analysis of complex issues ranging through hardware, operating system, application, network, and information security platforms.
• Work with different business units to identify, plan, test, and deploy or upgrade Linux systems according to business requirements.
• Partner with teams from across the organization to help tackle hard problems in a collaborative, high-velocity environment.
• Tackle issues across the entire stack: hardware, software, network, and application.
• Manage engineering tools and platforms such as GitHub, Artifactory, etc.
• Perform analysis, troubleshooting, and introspection on core infrastructure components and handle incident response.
• Create and maintain a well-documented knowledge base and mentor junior engineers.
• Take on an on-call role, respond quickly to emergency bridges, and provide quick and effective solutions to minimize system downtime.