Tesla IT Incident Response Engineer

Experienced hire, full-time, Production Support. Location: Shanghai. Status: Hiring


Requirements

Must 
• Minimum 5 years of working experience with a related academic background (Information Technology, Software Engineering, Computer Science, etc.).
• Deep understanding of IT infrastructure, such as networking, servers, virtualization, and storage. Hands-on experience is preferred.
• Deep understanding of monitoring tools such as Grafana, Prometheus, or Splunk.
• Experience with change managemen…


Job Description

THE ROLE
This role is a support engineer position within the Tesla IT Infrastructure Engineering & Operations department. The Sr. Incident Response Engineer coordinates with cross-functional engineering teams on incident response and management to maintain high availability for Tesla Manufacturing, Business Operations, and Customer Service & Experience. We help reduce the occurrence of incidents through efficient IT operations monitoring, effective risk analysis, and professional team collaboration.

The Tesla APAC Incident Response Center is a growing team of professionals from diverse backgrounds, which offers a strong development environment. This role is based at Giga Factory Shanghai, China, but supports Tesla's business globally as the business and its mission grow.

RESPONSIBILITIES
• Independently lead incident response and management to minimize impact and ensure optimal response times. Develop incident response plans, conduct post-mortem analyses, and organize drills to enhance preparedness.
• Drive IT service management projects. Establish/optimize SOPs to reduce inter-team communication barriers, promote technical knowledge sharing, and improve team incident response capabilities.
• Monitor IT infrastructure and data center operations, including servers, networks, and applications. Analyze real-time stability metrics, mitigate risks, and deliver regular operational analysis reports.
• Proactively enhance team efficiency through tool automation, process refinement, and adoption of industry best practices. Support daily operations and foster a culture of continuous improvement.
• Oversee infrastructure changes to minimize risks, streamline approval workflows, and ensure compliance with change management protocols.


Related Positions

Sr. IT Incident Response Engineer

Experienced hire, Production Support

THE ROLE
This role is a senior support position within Tesla IT Infrastructure Engineering & Operations. The Incident Response team provides incident response and management support to global cross-functional engineering teams, helping maintain high availability for Tesla Manufacturing, Business Operations, Customer Service & Experience. We reduce incident occurrence through effective IT operations monitoring, risk analysis, and change management.

The Tesla APAC Incident Response Center (IRC) is a growing team of professionals from diverse backgrounds, with strong development opportunities. This role is based at Giga Factory Shanghai, China, and provides global support as Tesla's business and mission scale.

Senior engineer team positioning: Acts as the regional incident management lead: coordinates teams through investigation and resolution, owns incident management practices (ticket management, root cause investigation, data analysis, and management reporting), and continuously improves processes and tooling.

RESPONSIBILITIES
• Independently lead end-to-end, 24×7 closed-loop incident management to minimize impact and optimize response time; organize emergency response plans, post-incident reviews, and drills as needed.
• Lead or drive IT service management initiatives; establish or optimize SOPs to reduce cross-team communication barriers, promote technical and skills sharing, and raise the team's incident response capability.
• Oversee IT Infrastructure & Operations monitoring and operational processes; maintain day-to-day stability and provide periodic reporting on data centers, servers, networks, applications, and related systems; identify and mitigate risks early.
• Proactively support team operational improvement without day-to-day supervision (including tool iteration, process optimization, and adoption of industry best practices) to accelerate operational efficiency.
• Participate in Infrastructure & Operations daily operations and change management; control change risk, improve change workflows, and support execution of change events.
• Use company-approved AI tools for continuous learning and innovation to empower the organization.

Location: Shanghai

Networking Solution Test Engineer - AI Cluster Debugging

Experienced hire

We are looking for a networking test engineer with strong system-level debugging skills to join our End-to-End Verification team. You will work on cutting-edge Ethernet-based AI clusters, owning complex issues across hardware, system software and AI workloads.

What you'll be doing
• Design and review test and product requirements across the Ethernet / NIC / DPU / Switch portfolio, focusing on large-scale AI cluster behavior.
• Build and maintain realistic customer-like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
• Own end-to-end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
• Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.
• Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
• Define tests and guide the automation team to implement robust suites that produce actionable logs, metrics and traces.
• Run Regression, Performance, Functional and Scale testing, analyze results and provide clear, data-driven reports to stakeholders.
• Profile and benchmark deep learning training and inference workloads, correlating model-level metrics with system and network telemetry to uncover bottlenecks.

Updated: 2026-02-05. Location: Shanghai | Beijing

Security Operation Engineer (Intern)

Internship, Finance

About The Team
The Information Security team is at the core of the Tiger Brokers trading platform. Comprising passionate engineers from across the globe, the team endeavors to develop the best systems using the most appropriate technologies. The SOC operations function is accountable for planning and overseeing the monitoring and maintenance of security operations, and for providing guidance and leadership to internal resources. If you share the passion for cybersecurity, there's no better way to experience it firsthand.

Job Description
- Monitor and analyze security infrastructure to support detection and response to threats, vulnerabilities, and incidents.
- Conduct basic investigations of security events, including malware infections and unauthorized access attempts.
- Escalate critical cases to the incident response team and provide support where needed.
- Assist in identifying tuning opportunities to improve detection accuracy and reduce false positives.
- Handle case management, generate tickets and reports when required, and track open tickets until closure.
- Prepare scheduled and ad-hoc reports.

Updated: 2026-03-03. Location: Singapore

Networking Solution Test Engineer - AI IB and Ethernet Cluster Debugging

Experienced hire

We are looking for a networking test engineer with strong system-level debugging skills to join our End-to-End Verification team. You will work on cutting-edge Ethernet-based AI clusters, owning complex issues across hardware, system software and AI workloads.

What you'll be doing:
• Design and review test and product requirements across the InfiniBand / Ethernet / NIC / DPU / Switch portfolio, focusing on large-scale AI cluster behavior.
• Build and maintain realistic customer-like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
• Own end-to-end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
• Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.
• Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
• Define tests and guide the automation team to implement robust suites that produce actionable logs, metrics and traces.
• Run Regression, Performance, Functional and Scale testing, analyze results and provide clear, data-driven reports to stakeholders.
• Profile and benchmark deep learning training and inference workloads, correlating model-level metrics with system and network telemetry to uncover bottlenecks.

Updated: 2026-04-07. Location: Shanghai | Beijing