Ant Financial (Ant International) - SRE Engineer - Malaysia
Requirements
1. Bachelor’s degree in Computer Science or a related field, or equivalent practical experience.
2. 7+ years of experience in site reliability engineering.
3. Extensive experience in O&M activities, including security patching, version upgrades, and alarm management and handling, in public cloud environments, especially Google Cloud or AWS.
4. Advanced understanding of the factors and scenarios that generate technology risks in public cloud infrastructure.
5. The know-how to manage and prevent these risks, and the ability to design general technology risk solutions, systems, and products through systematic abstraction.
6. Excellent communication and interpersonal skills, with a highly proactive attitude toward solving difficult problems.
Responsibilities
About Ant International
With headquarters in Singapore and main operations across Asia, Europe, the Middle East and Latin America, Ant International is a leading global digital payment, digitisation and financial technology provider. Through collaboration across the private and public sectors, our unified techfin platform supports financial institutions and merchants of all sizes to achieve inclusive growth through a comprehensive range of cutting-edge digital payment and financial services solutions.

We are seeking Senior and Junior SRE Engineers for our Malaysia Tech Center to work on end-to-end solutions for cross-border payments for our global merchants and globalization business.
1. Collaborate with global teams on daily operations and alarm handling.
2. Identify and implement solutions for the stability, scalability and security of business infrastructure, using frameworks and industry best practices.
3. Drive and manage technical and solution architecture discussions between global teams and partners to ensure timely delivery that meets customer needs.
4. Plan and execute a roadmap for strategic infrastructure improvement, incorporating initiatives that align with company goals.
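Site Reliability Engineers (SREs) combine software and systems engineering to build highly scalable, highly available distributed systems.
1. Ensure the reliability and normal operation of multiple core big data and computing systems, while keeping system cost and stability in view.
2. Build automated operations solutions for large-scale systems; work with system development teams to ensure reliability throughout the entire lifecycle, from system design to launch.
3. Improve system visibility by monitoring component availability and performance metrics, helping system developers and teams locate faults quickly.
4. Drive improvements in service reliability and scalability, as well as cost and performance optimization, to uphold system SLAs.
5. Participate in designing and implementing automation platforms that enable rapid iteration of large-scale production clusters.
6. Based on business usage scenarios, refine and deliver service governance best practices, including but not limited to performance bottleneck analysis of critical paths, troubleshooting of business issues, and driving high-availability architecture upgrades.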
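1. Site Reliability Engineers (SREs) combine software and systems engineering to build highly scalable, highly available distributed systems.
2. Ensure the reliability and normal operation of multiple core systems spanning big data, computing, cloud native, and distributed storage, while keeping system cost and stability in view.
3. Build automated operations solutions for large-scale systems; work with system development teams to ensure reliability throughout the entire lifecycle, from system design to launch.
4. Improve system visibility by monitoring component availability and performance metrics, helping system developers and teams locate faults quickly.
5. Drive improvements in service reliability and scalability, as well as cost and performance optimization, to uphold system SLAs; participate in designing and implementing automation platforms that enable rapid iteration of large-scale production clusters.
6. Based on business usage scenarios, refine and deliver service governance best practices, including but not limited to performance bottleneck analysis of critical paths, troubleshooting of business issues, and driving high-availability architecture upgrades.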
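1. Site Reliability Engineers (SREs) combine software and systems engineering to build highly scalable, highly available distributed systems.
2. Ensure the reliability and normal operation of multiple core systems spanning big data, computing, cloud native, and distributed storage, while keeping system cost and stability in view; build automated operations solutions for large-scale systems; work with system development teams to ensure reliability throughout the entire lifecycle, from system design to launch.
3. Improve system visibility by monitoring component availability and performance metrics, helping system developers and teams locate faults quickly; participate in designing and implementing automation platforms that enable rapid iteration of large-scale production clusters.
4. Drive improvements in service reliability and scalability, as well as cost and performance optimization, to uphold system SLAs.
5. Based on business usage scenarios, refine and deliver service governance best practices, including but not limited to performance bottleneck analysis of critical paths, troubleshooting of business issues, and driving high-availability architecture upgrades.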