Parabola is a no-code data workflow automation tool for operations teams. The company uses LLMs as a core part of its product and, increasingly, to run its infrastructure.

The role is for a Site Reliability Engineer focused on observability and on keeping systems healthy and understandable. You will own the monitoring and alerting infrastructure, drive incident response, and partner with engineering to ensure deep visibility into production systems.

Responsibilities include building and maintaining the observability stack (Prometheus, Grafana, dashboards, alerting, on-call workflows); leading incident response and postmortems; defining SLIs, SLOs, and error budgets; monitoring LLM-specific infrastructure (latency, token throughput, model error rates, cost attribution); and managing AWS infrastructure across Lambda, ECS, RDS, OpenSearch, CloudFront, and related services. You will also work with CDK-based IaC and CI/CD pipelines as needed.

Requirements include hands-on experience with Prometheus and Grafana (or similar tools), strong alerting and observability instincts, the ability to debug distributed systems end-to-end, experience owning on-call and incident response, familiarity with AWS, and IaC experience (CDK or Terraform). Nice-to-haves include experience instrumenting LLM pipelines, familiarity with TypeScript/Node.js, startup experience, or a background in security or compliance.

Contact: cj@parabola.io.

