Do you ever wonder what happens inside the cloud?
DigitalOcean (NYSE: DOCN) simplifies cloud computing so builders can spend more time creating software that changes the world. With our mission-critical infrastructure and fully managed offerings, DigitalOcean enables startups and small and medium-sized businesses (SMBs) to rapidly deploy and scale modern applications. As a remote-first organization, our employees, like our customers, are based around the world.
We are seeking a Principal Engineer to lead the quality, performance, and monitoring initiatives for our GPU infrastructure and assist in the Agile transformation of the organization.
This pivotal role focuses on developing advanced testing frameworks, robust quality assurance, and comprehensive monitoring within an Agile/Kanban environment. The ideal candidate will drive rapid project delivery and continuous system improvements, ensuring our GPU solutions meet global standards for speed and reliability. This role will also guide the organization's shift toward Agile practices, enhancing speed, agility, and innovation across all operations.
Key Responsibilities:
- Lead performance and quality testing initiatives for GPU infrastructure, ensuring high standards are met.
- Develop and optimize monitoring systems to enhance real-time insights and operational efficiency.
- Drive projects with a focus on speed and agility using Agile/Kanban/Lean methodologies.
- Collaborate with cross-functional teams to integrate performance, quality, and monitoring practices throughout the product lifecycle.
- Coach and mentor teams across the company in adopting Agile/Kanban methodologies at scale, shaping model teams, and leading company-wide transformation in development practices.
What You’ll Be Doing:
- Strategize and implement advanced testing and monitoring frameworks to ensure optimal performance of GPU solutions.
- Engage with observability engineers to ensure systems are scalable, reliable, and aligned with customer needs.
- Develop and integrate advanced storage solutions, including Object, NFS, and block storage technologies, ensuring their seamless performance and scalability within GPU infrastructure projects.
- Implement comprehensive data lifecycle management practices to enhance the performance and efficiency of storage solutions within the GPU infrastructure.
- Optimize GPU infrastructure to support AI/ML workloads for both internal and external customers, ensuring robust and scalable solutions tailored to complex data processing needs.
What We’ll Expect From You:
- Proven expertise in leading, developing, and scaling complex solutions from inception to global scale, emphasizing speed, quality, and customer satisfaction.
- Excellent leadership and collaborative skills, with a capacity to drive innovation in an Agile/Kanban environment.
- Demonstrated ability to develop and optimize complex storage and networking solutions.
- Strong strategic and analytical skills, using data to enhance system stability and performance and to improve decision-making and problem-solving.
- Commitment to continuous learning and improvement, particularly in technologies that support AI/ML workloads.
- A consistent focus on customer needs and feedback to drive product excellence.
- Ability to quickly adapt to new technologies and challenges.
- Excellent communication and collaboration skills to work effectively across teams.
- Understanding of AI/ML workloads and overall industry trends.
- Strong collaborator and consensus builder who can author and review design documentation.
- Experience troubleshooting, analyzing, and debugging.
- Experience as a software engineer/developer in a large-scale, distributed environment.
- Experience writing secure, testable, and robust low-level code.
- A critical thinker dedicated to solving problems and delivering solutions.
- Industry leader in performance engineering, quality assurance, and systems monitoring.
- Proven track record of transforming an organization into a fast-flowing Agile organization.
Why You’ll Like Working for DigitalOcean:
- We reward our employees. The base salary range for this position is between $225,000.00 and $308,000.00, based on relevant years of experience and skills. The salary range for this role is specific to candidates located within the U.S. and will vary for candidates outside the U.S. Employees may qualify for a bonus in addition to base salary; bonus amounts are determined based on company and individual performance. We also provide equity compensation to eligible employees, including grants of equity upon hire and the option to participate in our Employee Stock Purchase Program.
- We value development. You will work with some of the smartest and most interesting people in the industry. We are a high-performance organization that is always challenging our teams and employees to continuously grow. We maintain a growth mindset in everything we do and invest deeply in employee development through formalized mentorship and other internal programs. We provide all employees with reimbursement for relevant conferences, training, and education.
- We care about your well-being. In addition to cash and equity compensation, we also offer employees a competitive array of benefits. In the United States, these include health insurance, flexible vacation, retirement benefits, a generous parental leave program, and additional resources to support employees' overall well-being. While the philosophy around our benefits is the same worldwide, specific benefits may vary in other countries due to local regulations and preferences.
- We value diversity and inclusivity. We are an equal opportunity employer and we do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
*This is a remote role.