Senior Staff Engineer - Observability Engineering

Senior Staff Backend Engineer (AI Infra - Observability Engineering)

Coupang is reimagining the shopping experience with the goal of wowing each customer from the instant they open the Coupang app to the moment an order is delivered to their door. Powered by an outstanding end-to-end e-commerce and logistics network and a fanatical culture of customer centricity, Coupang has broken tradeoffs around speed, selection, and price. Today, we provide exceedingly fast shipping speeds on millions of items, including fresh groceries, delivered within hours nationwide, 365 days a year.

We are doing this for millions of consumers in Korea, home to one of the largest and fastest-growing e-commerce opportunities anywhere in the world.

Summary:

As a Senior Staff Backend Engineer (AI Infra - Observability Engineering) in the Cloud Platform team, you will focus on architecting, delivering, and maintaining observability and monitoring solutions for Hybrid Cloud Platform infrastructure at 10X scale for ML workloads.

This involves collecting and using metrics and logs related to the health and reliability of Coupang ML services, as well as developing tools that help detect, alert on, and troubleshoot anomalies effectively. The Observability Engineering team's vision is to enable Coupang's engineering teams to maintain and improve the quality of their services by providing clear visibility into their systems' health, along with intelligence and actionable insights. You will have the opportunity to build the next-generation Observability Platform based on Kubernetes and other OSS solutions, as well as develop software components from scratch.

You will work directly with various engineering teams at Coupang, influence them with OE and SRE principles and best practices, and see your impact firsthand.

Key Responsibilities:

  • Design, implement, and maintain observability solutions such as metrics monitoring, alerting, logging, and tracing across various ML platforms, applications, and infrastructure using Grafana, Mimir, Loki, ClickHouse, VictoriaMetrics, Prometheus, and Thanos.
  • Collaborate with cross-functional teams, including software engineers, SREs, and infrastructure teams, to identify and define observability requirements.
  • Develop and implement best practices for creating and maintaining effective monitoring, alerting, and telemetry systems.
  • Evaluate and recommend industry-leading observability tools and technologies to improve system visibility and reliability.
  • Define and track key performance indicators (KPIs) and service-level objectives (SLOs) related to system availability, performance, and reliability.
  • Support the troubleshooting and resolution of complex incidents by analyzing data from observability tools, and participate in the team's on-call rotation.
  • Handle large-scale telemetry data, integrating inputs from GPU clusters running in the cloud and on-premises.
  • Provide guidance and mentorship to other engineers on observability principles, practices, and tools.
  • Conduct ongoing evaluations of observability systems and identify opportunities for improvements and optimizations.
  • Drive the standardization and simplification of observability processes, tools, and frameworks across the organization.
  • Contribute to the development of training materials, documentation, and runbooks for observability systems and practices.

Essential Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 7+ years of strong experience in implementing and managing observability solutions in large-scale, complex environments.
  • 7+ years of experience with, and deep knowledge of, monitoring, alerting, and logging systems and tools, such as Prometheus, Grafana, Elastic Stack, Datadog, or New Relic.
  • Familiarity with distributed tracing technologies, such as Jaeger or Zipkin.
  • Experience with cloud-based infrastructure, including AWS, Azure, or Google Cloud Platform.
  • Strong experience with Kubernetes, Docker, Linux, and cloud services integrations.
  • Strong understanding of DevOps and SRE practices, including continuous integration, continuous delivery, and infrastructure as code (IaC).
  • Skilled in PromQL and other querying languages, with a keen interest in observability data models.
  • Proficiency in scripting languages, such as Go, Python, Bash, or Ruby.
  • Excellent communication and collaboration skills, with the ability to work with teams across different functions and technical domains.
  • Strong problem-solving and analytical skills, with a focus on data-driven decision-making.
  • A proven track record of leading and delivering successful observability projects and initiatives.
  • Excellent cross-group collaboration and outstanding verbal and written communication skills.
  • Expert knowledge of performance (sub-millisecond latencies), scalability, availability (99.99% uptime), and enterprise architecture best practices.

Preferred Qualifications:

  • Experience with containerization and orchestration technologies, such as Docker and Kubernetes.
  • Familiarity with application performance management (APM) tools, such as Dynatrace or AppDynamics.
  • Professional certifications in cloud platforms, monitoring tools, or related technologies.
  • Experience in AI Infra Engineering, building observability platforms, and operating at scale.
  • NVIDIA and AI certifications.
  • 7+ years of hands-on experience in managing observability platforms.
  • Excellent cross-group collaboration and communication skills.

Information:

  • Company: Coupang
  • Position: Senior Staff Engineer - Observability Engineering
  • Location: Bangalore, Karnataka
  • Country: IN

Post Date: 2025-06-18 | Expiry Date: 2025-07-18