SAP China iXp Intern - System Reliability Engineer Intern - Shanghai
Requirements
We are looking for a highly motivated and enthusiastic intern to join our Site Reliability Engineering (SRE) team, specializing in cloud-native technologies like Docker and Kubernetes. As an intern, you will assist the SRE team in maintaining and scaling our cloud infrastructure, ensuring high availability and performance of our services. You will work with various teams, including development, support, and security, to ensure that our cloud-based applications are resilient, scalable, and secure.
• Currently pursuing a degree in Computer Science, Information Technology, or a related field
• Knowledge of cloud-native technologies such as Docker and Kubernetes
• Familiarity with infrastructure as code and automation tools like Terraform
• Proficiency in one or more programming languages like Python, Go, or Java
• Understanding of Linux operating sy…
Responsibilities
• Assist in designing, building, and maintaining a scalable and reliable cloud infrastructure
• Collaborate with developers, operations, and security teams to ensure that the infrastructure performs optimally and securely
• Implement monitoring and alarm systems for our cloud infrastructure, applications, and services
• Monitor system performance, identify and resolve issues proactively, and troubleshoot incidents when they arise
• Develop and implement automation tools to streamline processes and improve operational efficiency
• Participate in the development of disaster recovery and business continuity plans
• Document infrastructure and processes to ensure knowledge transfer and institutional memory
• Stay up to date with emerging trends and technologies in cloud-native computing and SRE practices
• Help design, develop, and improve scalable infrastructure to support the next generation of AI applications, including copilots and agentic tools. • Drive improvements in architecture, performance, and reliability, enabling teams to bring to bear LLMs and advanced agent frameworks at scale. • Stay informed of the latest advancements in AI infrastructure and contribute to continuous innovation.
• Develop and optimize the control stack, including locomotion, manipulation, and whole-body control algorithms; • Deploy and evaluate neural network models in physics simulation and on real humanoid hardware; • Design and maintain teleoperation software for controlling humanoid robots with low latency and high precision; • Implement tools and processes for regular robot maintenance, diagnostics, and troubleshooting to ensure system reliability; • Monitor teleoperators at the lab and develop quality assurance workflows to ensure high-quality data collection; • Collaborate with researchers on model training, data processing, and MLOps lifecycle.