
Oracle Principal AIML Solution Architect

Experienced hire · Full-time · PRODEV-SWENG · Status: Hiring

Requirements


The Strategic Customers Engineering (SCE) team at OCI is tasked with managing relationships with some of our most significant AI Infra customers, who are key drivers of our revenue. In response to rapid market growth across the AI Infra business in APAC, we are establishing the APAC Strategic Pursuits team to …

Responsibilities


Collaborate with the GPU sales team and the SCE AIML TPM team to provide technical support to customers at both the pre-sales and post-sales stages. Take ownership of problems and work to identify solutions.
Design, deploy, and manage infrastructure components such as cloud resources, distributed computing systems, and data storage solutions to support AI/ML workflows.
Collaborate with customers’ scientists and software/infrastructure engineers to understand infrastructure requirements for training, testing, and deploying machine learning models.
Implement automation solutions for provisioning, configuring, and monitoring AI/ML infrastructure to streamline operations and enhance productivity.
Optimize infrastructure performance by tuning parameters, optimizing resource utilization, and implementing caching and data pre-processing techniques.
Troubleshoot infrastructure performance, scalability, and reliability issues and implement solutions to mitigate risks and minimize downtime.
Stay updated on emerging technologies and best practices in AI/ML infrastructure and evaluate their potential impact on our systems and workflows.
Document infrastructure designs, configurations, and procedures to facilitate knowledge sharing and ensure maintainability.
Qualifications: 
Experience in scripting and automation using tools like Ansible, Terraform, and/or Kubernetes. Experience with containerization technologies (e.g., Docker, Kubernetes) and orchestration tools for managing distributed systems.
Solid understanding of networking concepts, security principles, and best practices.
Excellent problem-solving skills, with the ability to troubleshoot complex issues and drive resolution in a fast-paced environment.
Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and convey technical concepts to non-technical stakeholders.
Strong documentation skills with experience documenting infrastructure designs, configurations, procedures, and troubleshooting steps to facilitate knowledge sharing, ensure maintainability, and enhance team collaboration.
Strong Linux skills with hands-on experience in Oracle Linux/RHEL/CentOS, Ubuntu, and Debian distributions, including system administration, package management, shell scripting, and performance optimization (an illustrative health-check sketch follows this list).
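
To make the automation, monitoring, and Linux scripting expectations above more concrete, here is a minimal, illustrative Python sketch of a GPU node health check. The nvidia-smi query fields are real, but the temperature threshold and the alerting hook are assumptions for illustration, not Oracle tooling.

```python
"""Minimal monitoring sketch: poll nvidia-smi on a GPU node and flag devices
that exceed an assumed temperature threshold. Threshold and alerting hook
are placeholders."""
import subprocess

QUERY = "index,utilization.gpu,temperature.gpu,memory.used,memory.total"
TEMP_LIMIT_C = 85  # assumed alert threshold, adjust per fleet policy


def read_gpus():
    # Query per-GPU stats as plain CSV (no header, no units) for easy parsing.
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        idx, util, temp, mem_used, mem_total = [v.strip() for v in line.split(",")]
        gpus.append({
            "index": int(idx),
            "util_pct": int(util),
            "temp_c": int(temp),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return gpus


def main():
    for gpu in read_gpus():
        if gpu["temp_c"] >= TEMP_LIMIT_C:
            # Placeholder for a real alerting hook (pager, ticket, dashboard).
            print(f"ALERT: GPU {gpu['index']} at {gpu['temp_c']} C")
        else:
            print(f"GPU {gpu['index']}: {gpu['util_pct']}% util, "
                  f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB")


if __name__ == "__main__":
    main()
```

In practice a script like this would be run under cron or a node agent and feed a central monitoring system rather than printing to stdout.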
Related positions

Amazon · Experienced hire · Solution

- As an AIML Specialist Solutions Architect (SA) in AI Infrastructure, you will serve as the Subject Matter Expert (SME) for providing optimal solutions for model training and inference workloads that leverage Amazon Web Services accelerator computing services. As part of the Specialist Solutions Architecture team, you will work closely with other Specialist SAs to enable large-scale customer model workloads and drive the adoption of AWS EC2, EKS, ECS, SageMaker, and other computing platforms for GenAI practice.
- You will interact with other SAs in the field, providing guidance on their customer engagements, and you will develop white papers, blogs, reference implementations, and presentations to enable customers and partners to fully leverage AI Infrastructure on Amazon Web Services. You will also create field enablement materials for the broader SA population to help them understand how to integrate Amazon Web Services GenAI solutions into customer architectures.
- You must have deep technical experience working with technologies related to Large Language Models (LLMs), Stable Diffusion, and many other SOTA model architectures, from model design, fine-tuning, and distributed training to inference acceleration (a minimal distributed-training sketch follows this listing). A strong machine learning development background is preferred, in addition to experience building applications and architecture design. You will be familiar with the Nvidia ecosystem and related technical options, and will leverage this knowledge to help Amazon Web Services customers in their selection process.
- Candidates must have great communication skills and be very technical and hands-on, with the ability to impress Amazon Web Services customers at any level, from ML engineers to executives. Previous experience with Amazon Web Services is desired but not required, provided you have experience building large-scale solutions. You will get the opportunity to work directly with senior engineers at customers, partners, and Amazon Web Services service teams, influencing their roadmaps and driving innovations.
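
As a hedged illustration of the distributed training workloads this role supports, below is a minimal PyTorch DistributedDataParallel sketch of the kind typically launched with torchrun on a multi-GPU instance; the toy model, synthetic batches, and hyperparameters are placeholders, not an AWS reference implementation.

```python
"""Minimal sketch of a DistributedDataParallel (DDP) training loop.
Launched with torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL backend for multi-GPU communication; env:// init uses torchrun's env vars.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # toy stand-in model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):  # stand-in for a real data loader loop
        batch = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(batch).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across ranks by DDP
        optimizer.step()
        if dist.get_rank() == 0 and step % 5 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A single 8-GPU node run would be launched with something like `torchrun --nproc_per_node=8 train_ddp.py` (the script name is assumed).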

Updated 2025-07-18 · Shanghai | Beijing | Shenzhen
AMD · Experienced hire · Enginee

THE ROLE:
The mission of the Principal Technical Lead is to orchestrate and elevate the quality, consistency, and competitiveness of AMD's GPU software ecosystem on Linux. This leader will bridge strategic objectives with technical execution across the ROCm stack and Linux driver portfolios (both packaged and inbox), ensuring a seamless, powerful, and reliable experience for developers, researchers, and enterprises choosing AMD for their accelerated computing needs.

KEY RESPONSIBILITIES:
Strategic Technical Leadership & SOW Definition:
- Act as the central technical nexus between Product Management, Software Architecture, and engineering teams (kernel, ROCm, QA, support).
- Translate high-level product goals and market requirements into detailed, actionable, and prioritized Technical Statements of Work (SOWs) for the RSL AI validation team; ensure validation plans are coherent, dependencies are managed, and resources are aligned to deliver on strategic commitments for both Radeon and Ryzen AI solutions.
Quality, Test & Process Optimization:
- Own the definition and evolution of the product quality bar for AMD's Linux GPU software.
- Champion and drive the implementation of a robust, scalable, and automated CI/CD and test infrastructure across native Linux, WSL, and various hardware platforms.
- Establish key performance indicators (KPIs) for software quality, release velocity, and regression rates. Use data to drive continuous improvement in development and testing efficiency.
Unified User Experience & Competitive Analysis:
- Define and monitor a holistic user experience (UX) scorecard encompassing installation, performance predictability, documentation, and debugging.
- Institute a formal, ongoing competitive analysis framework to benchmark the AMD software stack (ROCm + drivers) against key competitors across performance, feature parity, stability, and usability.
- Serve as the ultimate internal advocate for the end user, ensuring customer and community feedback is systematically integrated into the development lifecycle.
Linux Ecosystem & Driver Consistency:
- Provide technical guidance and oversight to ensure flawless synchronization between the AMD packaged driver and the upstream Linux kernel (inbox) driver.
- Strengthen AMD's partnership with the Linux kernel community and major distributions (e.g., Canonical, Red Hat, SUSE).
- Drive a consistent and high-quality user experience regardless of the driver delivery channel (OS vendor vs. AMD.com).

Updated 2025-09-24 · Shanghai
NVIDIA · Experienced hire

• Define a clear vision and roadmap for productivity and efficiency improvement solutions in alignment with business needs, and drive execution from design through delivery.
• Lead cross-functional engineering teams on project deliverable commitments to streamline the system design and verification process and workflow.
• Partner with global automation and infrastructure teams to design, build, and maintain large-scale, cloud-based and on-premises infrastructures.
• Stay hands-on technically, providing architectural guidance on complex infrastructure challenges.
• Collaborate cross-functionally with ASIC, SW, System Design, Product, Security, and Operations teams to ensure reliability, scalability, and performance, fostering a culture of technical excellence, collaboration, and ownership.
• Continuously improve processes to ensure efficiency, reliability, and adaptability.

Updated 2025-11-14 · Shanghai
Microsoft · Experienced hire · Research

• Research, design, and prototype methods to leverage LLMs for product scenarios such as text understanding, summarization, dialogue, translation, content generation, and reasoning.
• Fine-tune, adapt, and optimize pre-trained LLMs for domain-specific tasks while balancing model performance, efficiency, and cost.
• Develop scalable pipelines for data collection, cleaning, augmentation, and evaluation.
• Collaborate with product and engineering teams to translate applied research into production-quality features.
• Define and track key performance metrics for LLM-based features, including accuracy, latency, robustness, and user satisfaction (a minimal metric-tracking sketch follows this listing).
• Stay current with advances in generative AI, multimodal models, and applied ML techniques, and bring forward innovative ideas to improve our products.
• Publish technical insights internally (and externally where appropriate) to advance organizational knowledge and thought leadership.
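
As a hedged illustration of the metric-tracking responsibility above, here is a minimal Python sketch that reports exact-match accuracy and latency percentiles for an LLM-backed feature; the generate() callable is a hypothetical stand-in for whatever model endpoint is actually in use, and the metrics shown are examples rather than a prescribed scorecard.

```python
"""Minimal sketch: track latency and exact-match accuracy for an LLM-backed
feature. `generate` is a hypothetical placeholder for a real model call."""
import time
from statistics import mean, quantiles


def generate(prompt: str) -> str:
    # Placeholder for a real model call (hosted API, local pipeline, etc.).
    return "stub answer"


def evaluate(samples):
    """samples: iterable of (prompt, expected_answer) pairs."""
    latencies, correct = [], 0
    for prompt, expected in samples:
        start = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - start)
        correct += int(output.strip().lower() == expected.strip().lower())
    n = len(latencies)
    cuts = quantiles(latencies, n=100)  # percentile cut points
    return {
        "n": n,
        "exact_match": correct / n,
        "latency_p50_s": cuts[49],
        "latency_p95_s": cuts[94],
        "latency_mean_s": mean(latencies),
    }


if __name__ == "__main__":
    demo = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
    print(evaluate(demo))
```

In a production setting the same loop would run against a held-out evaluation set and emit results to a metrics store so accuracy and latency regressions can be tracked release over release.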

Updated 2025-09-03 · Suzhou