Xiaomi Site Reliability Engineer - Experienced
Requirements
1. Proficiency in one of the following programming languages: Python, Go, or shell scripting, with demonstrated ability to independently develop modules or platforms. 2. Familiar with cloud computing; experience in managing multi-cloud or hybrid cloud platforms (e.g., Alibaba Cloud, Azure, AWS) is preferred. 3. Strong foundation in computer science, with hands-on experience in Linux, networking, load balancing,…
Responsibilities
1. Ensure the stability, reliability, and efficient operation of Xiaomi's global business, maintaining high availability of services at all times. 2. Own core operational tasks such as resource provisioning and management, incident response, capacity management, monitoring, and reliability improvements. 3. Review technical architecture designs, assess their soundness, and proactively identify and resolve reliability risks. 4. Conduct in-depth analysis of systemic deficiencies, identify bottlenecks, and develop optimization strategies; plan and execute projects to improve system reliability and ensure the cost-effectiveness and high availability of the systems. 5. Participate in the 24/7 on-call rotation, and promptly respond to and resolve production incidents to ensure service availability. 6. Analyze and improve processes to build stable, highly available systems; drive continuous automation improvements and minimize manual intervention.
* Take ownership of internal system SRE practices, including CI/CD, observability, and system reliability
* Manage and ensure the reliability of big data platforms (e.g., Hadoop, Spark, Flink) in cloud environments
* Design highly available architectures tailored to business needs and define ops standards and incident playbooks
* Lead technology choices, performance tuning, and stability enhancements for core infrastructure
Work Location: China-Shenzhen
* Assist in designing, building, and maintaining a scalable and reliable cloud infrastructure
* Collaborate with developers, operations, and security teams to ensure that the infrastructure performs optimally and securely
* Maintain monitoring and alarm systems for our cloud infrastructure, applications, and services
* Monitor system performance, identify and resolve issues proactively, and troubleshoot incidents when they arise
* Develop and implement automation tools to streamline processes and improve operational efficiency
* Participate in the development of disaster recovery and business continuity plans
* Document infrastructure and processes to ensure knowledge transfer and institutional memory
* Stay up to date with emerging trends and technologies in cloud-native computing and SRE practices
* Perform design and equipment submittal review for new data centers in your region.
* Troubleshoot, conduct Root Cause Analysis (RCA), and create Corrective Action (CA) documentation for site/equipment failures.
* Directly support operational issues with ad-hoc training, complex operating procedure reviews (including for essential equipment), and event support.
* Provide technical design support for existing data center upgrades and design solutions that add capacity, improve availability, and increase efficiency.
* Support operating partners in leading, reviewing, and approving designs for existing data center upgrades that improve availability and efficiency.
* Interface with operating partners, the data center design engineering team, the server hardware team, and the environmental health and safety team to promote standards that maintain consistency and reliability in the services delivered by operating partners.
* Work on concurrent projects, sometimes in multiple geographical regions.
* Initiate and lead engineering site audits within leased or colo data centers. Produce reports outlining risks with recommended mitigations and remediations.
* Act as resident engineer during new construction projects. Support construction, commissioning, and turnover.
A day in the life
Each day you will interact with different teams responsible for all aspects of the data centers. You will prioritize your activities to support data center capacity availability and safety, focusing on the actions that are most impactful. You will have the opportunity to work on projects locally and globally.
THE ROLE: This dynamic role drives product Quality & Reliability performance for key customers using AMD Client APU/CPU products on their production lines. The directive of this role is to provide a differentiated quality experience, improve customer satisfaction, and build customer confidence across the product introduction and volume ramp phases. This is a high-visibility role that acts as a key interface to the following organizations: AMD Customers, AMD Business unit, AMD Silicon & Package Reliability teams, AMD Engineering leadership, AMD Global Product Engineering & Operations organization, and AMD Sales Teams.