
Tesla IT Incident Response Engineer

Experienced hire | Full-time | Production Support | Location: Shanghai | Status: Hiring

Job Requirements


Must 
• Minimum of 5 years of work experience, with a related academic background (Information Technology, Software Engineering, Computer Science, etc.).
• Deep understanding of IT infrastructure domains such as Networking, Servers, Virtualization, and Storage. Hands-on experience is preferred.
• Deep understanding of monitoring tools such as Grafana, Prometheus, or Splunk.
• Experience with change managemen…

Job Responsibilities


THE ROLE
This role is a support engineer position within the Tesla IT Infrastructure Engineering & Operations department. The Sr. Incident Response Engineer will coordinate with cross-functional engineering teams on incident response and management to ensure high availability for Tesla Manufacturing, Business Operations, and Customer Service & Experience. We help reduce the occurrence of incidents through efficient IT operations monitoring, effective risk analysis, and professional team collaboration.

The Tesla APAC Incident Response Center is a growing team consisting of professionals from diverse backgrounds, which will offer you a fantastic development environment. This role will be based at Gigafactory Shanghai, China, but will provide support to Tesla's business globally, given the growing business and great mission.

RESPONSIBILITIES
• Independently lead incident response and management to minimize impact and ensure optimal response times. Develop incident response plans, conduct post-mortem analyses, and organize drills to enhance preparedness.
• Drive IT service management projects. Establish/optimize SOPs to reduce inter-team communication barriers, promote technical knowledge sharing, and improve team incident response capabilities.
• Monitor IT infrastructure and data center operations, including servers, networks, and applications. Analyze real-time stability metrics, mitigate risks, and deliver regular operational analysis reports.
• Proactively enhance team efficiency through tool automation, process refinement, and adoption of industry best practices. Support daily operations and foster a culture of continuous improvement.
• Oversee infrastructure changes to minimize risks, streamline approval workflows, and ensure compliance with change management protocols.
Related Positions

Experienced hire

Networking Solution Test Engineer – AI Cluster Debugging

We are looking for a networking test engineer with strong system-level debugging skills to join our End-to-End Verification team. You will work on cutting-edge Ethernet-based AI clusters, owning complex issues across hardware, system software and AI workloads.

What you'll be doing
• Design and review test and product requirements across the Ethernet / NIC / DPU / Switch portfolio, focusing on large-scale AI cluster behavior.
• Build and maintain realistic customer-like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
• Own end-to-end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
• Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.
• Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
• Define tests and guide the automation team to implement robust suites that produce actionable logs, metrics and traces.
• Run Regression, Performance, Functional and Scale testing, analyze results and provide clear, data-driven reports to stakeholders.
• Profile and benchmark deep learning training and inference workloads, correlating model-level metrics with system and network telemetry to uncover bottlenecks.

Updated 2026-02-05, Shanghai | Beijing
Internship | Finance

About The Team
The Information Security team is at the core of the Tiger Brokers trading platform. Comprising passionate engineers from across the globe, the team endeavors to develop the best systems using the most appropriate technologies. The SOC operations function is accountable for planning and overseeing the monitoring and maintenance of security operations, and for providing guidance and leadership to internal resources. If you share the passion for cybersecurity, there's no better way to experience it firsthand.

Job Description
- Monitor and analyze security infrastructure to support detection and response to threats, vulnerabilities, and incidents.
- Conduct basic investigations of security events, including malware infections and unauthorized access attempts.
- Escalate critical cases to the incident response team and provide support where needed.
- Assist in identifying opportunities for tuning to improve detection accuracy and reduce false positives.
- Handle case management, generate tickets and reports when required, and track open tickets until closure.
- Prepare scheduled and ad-hoc reports.

Updated 2026-03-03, Singapore
Experienced hire

We are looking for a networking test engineer with strong system-level debugging skills to join our End-to-End Verification team. You will work on cutting-edge Ethernet-based AI clusters, owning complex issues across hardware, system software and AI workloads.

What you'll be doing:
• Design and review test and product requirements across the InfiniBand / Ethernet / NIC / DPU / Switch portfolio, focusing on large-scale AI cluster behavior.
• Build and maintain realistic customer-like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
• Own end-to-end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
• Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.
• Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
• Define tests and guide the automation team to implement robust suites that produce actionable logs, metrics and traces.
• Run Regression, Performance, Functional and Scale testing, analyze results and provide clear, data-driven reports to stakeholders.
• Profile and benchmark deep learning training and inference workloads, correlating model-level metrics with system and network telemetry to uncover bottlenecks.

Updated 2026-04-07, Shanghai | Beijing
Experienced hire | Technology - Security Technology Oper…

L1 SOC monitoring (24x7 shift basis)
● Perform L1 SOC monitoring of security alerts 24x7 using SIEM, EDR tools, and intrusion detection systems (IDS/IPS).
● Analyse logs, network traffic, endpoint data, or other source logs to identify suspicious activity or indicators of compromise (IoCs).
● Triage and prioritize alerts based on severity, impact, and organizational risk, and perform required escalations and mitigations.

Incident response
● Perform containment and mitigation actions for incidents. Escalate confirmed or high-risk incidents to L2/L3 analysts or incident response teams.
● Collate required information to complete incident documentation and report if necessary.

Governance
● Support the Security GRC team during regulatory inspections, external audits, customer queries, security certificate programs, and internal audit projects to ensure compliance with regulations and customer requirements.
● Perform due diligence to assess the information security posture of our third parties.
● Support any on-site assessments of our third parties / outsourced parties.

Vulnerability & threat intelligence
● Stay updated on emerging threats through threat intelligence

Updated 2025-06-20, Kuala Lumpur