Proven experience in building and scaling observability platforms in a cloud-native environment. Observability Expertise: Deep understanding of observability pillars (metrics, logs, traces) and related tools (e.g., Prometheus, Grafana, OpenTelemetry, Jaeger, Kibana Elastic Stack). AI/ML Proficiency: Hands-on experience integrating ML/AI models into observability systems to drive advanced insights, anomaly detection, and predictive analysis. Distributed More ❯
Datadog, Sumologic, NewRelic, AppDynamics, Dynatrace, Prometheus, Logz.io, SignalFX, Instana, Splunk, Honeycomb, Jaeger Hands-on experience with Infrastructure as a Code (Terraform/Ansible) Hands-on experience in technical integrations (OpenTelemetry/fluentd/fluentbit/filebeat/logstash) Hands-on experience with complex troubleshooting of Kubernetes and Docker container Good knowledge of RegEx, Lucene, PromQL Good knowledge of Linux Experience More ❯
systems administration combined with strong SQL skills and proficiency in scripting languages such as Python or Java.* Demonstrated experience with monitoring and observability tools including Prometheus, Grafana, Splunk, Geneos, OpenTelemetry or Corvil is highly desirable.* Familiarity with cloud platforms as well as containerisation technologies like Kubernetes or Docker alongside CI/CD pipeline management is important for this role.* Comprehensive More ❯
Cloud Platform Engineer (DV Security Clearance) Position Description Secure Innovation is part of CGI's Space, Defence, and Intelligence business unit, focused primarily on the delivery of contemporary and innovative technical solutions for the government agencies most challenging problems. Our More ❯
Cambridge, Cambridgeshire, United Kingdom Hybrid / WFH Options
Arm Limited
Modernise our infrastructure by leading the migration from Docker Swarm to Kubernetes Design and operate CI/CD pipelines using CloudBees and GitLab Build out observability with Prometheus, Grafana, OpenTelemetry, and Dynatrace Automate cloud deployments (AWS-first) using Terraform and platform tooling Improve security posture across IAM, secrets, and networking Help the team ship faster and safer by mentoring on … distributed systems at scale in production. Cloud AWS (primary), Kubernetes (future), Docker (current), Terraform. Excellent debugging skills across network, systems, and data stack. Observability tooling, e.g. custom metrics pipelines, OpenTelemetry tracing, or integrations across telemetry stacks. Security engineering and practical understanding of IAM hardening, zero-trust network principles, and secrets management in data-heavy systems. Passion for building reliable, secure More ❯
experience. What will help you succeed Preferred Requirements: Experience with query languages such as SQL, SPL, or KQL. Experience with observability and log collectors/pipelines such as FluentBit, OpenTelemetry, Cribl, and Logstash. Experience with web technologies such as HTML, CSS, and JavaScript. Experience with programming/scripting side technologies such as Java, .NET, PHP, Go, Node.js and database. Advanced More ❯
and programming languages such as C++, Java or Python. Strong understanding of distributed systems and low-latency architectures Hands-on experience with observability stacks (e.g., Prometheus, Grafana, Splunk, Geneos, OpenTelemetry) and infrastructure automation (e.g., Ansible, Terraform, CI/CD pipelines) Strong understanding of the trade lifecycle, market data, and fixed income products, FX or algorithmic trading experience is a plus More ❯
and programming languages such as C++, Java or Python. Strong understanding of distributed systems and low-latency architectures Hands-on experience with observability stacks (e.g., Prometheus, Grafana, Splunk, Geneos, OpenTelemetry) and infrastructure automation (e.g., Ansible, Terraform, CI/CD pipelines) Strong understanding of the trade lifecycle, market data, and fixed income products, FX or algorithmic trading experience is a plus More ❯
artifact promotion, and release gating into the SDLC. Ensure pipeline scalability and governance while maintaining developer velocity. Observability & Troubleshooting Lead the implementation and usage of modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, Splunk, Datadog). Establish SLOs, SLIs, and error budgets with product and engineering teams. Drive root cause identification using distributed tracing, advanced log analysis, and anomaly detection. Security More ❯
and programming languages such as C++, Java or Python. Strong understanding of distributed systems and low-latency architectures Hands-on experience with observability stacks (e.g., Prometheus, Grafana, Splunk, Geneos, OpenTelemetry) and infrastructure automation (e.g., Ansible, Terraform, CI/CD pipelines) Strong understanding of the trade lifecycle, market data, and fixed income products, FX or algorithmic trading experience is a plus More ❯
under test: Containerisation (e.g. Docker), Virtualisation and Provisioning, Workload and job scheduling (e.g. Kubernetes, Ray) on high core-count machines and rack-scale installations, Management and Observability (e.g. Prometheus, OpenTelemetry, DataDog, Splunk, etc.). 10+ years of relevant experience related to quality assurance/testing teams. Experience with the Atlassian suite and CI/CD platforms such as Jenkins; GitHub More ❯
under test: Containerisation (e.g. Docker), Virtualisation and Provisioning, Workload and job scheduling (e.g. Kubernetes, Ray) on high core-count machines and rack-scale installations, Management and Observability (e.g. Prometheus, OpenTelemetry, DataDog, Splunk, etc.). 10+ years of relevant experience related to quality assurance/testing teams. Experience with the Atlassian suite and CI/CD platforms such as Jenkins; GitHub More ❯
optimization in cloud environments with Kafka for data streaming. Familiarity with CI/CD and infrastructure-as-code tools (e.g., Terraform, CloudFormation). Experience with observability tools (e.g., CloudWatch, OpenTelemetry). Experience working in a global enterprise software company. Our commitment to you! BMC's culture is built around its people. We have 6000+ brilliant minds working together across the More ❯
between Google's Load Balancer and the HTTP server in our main Elixir application causing HTTP 5XX responses to be returned to our customers. - Debugging an issue in our OpenTelemetry pipelines causing us to silently drop spans. - An enthusiasm for both software development and systems engineering. - A high bar for code and configuration quality and readability. - A good understanding of … to managing our Kubernetes configuration, using ArgoCD and Helm. - We manage a high-availability metrics collection system using Grafana, Thanos & Prometheus. We're in the process of transitioning to OpenTelemetry and Honeycomb for our application telemetry (traces and metrics). - We manage a data pipeline using Pub/Sub, Airbyte, and dbt. Our Current Focus We're currently driving a … how we think about and monitor reliability across the engineering organisation, with a focus on early detection of customer-impacting issues. We're extending and standardising our use of OpenTelemetry, and introducing Honeycomb as the single place for engineers to understand how our applications are operating in production. This project involves both technical work, on the application libraries and infrastructure More ❯
At Anaplan, we are a team of innovators who are focused on optimizing business decision-making through our leading scenario planning and analysis platform so our customers can outpace their competition and the market. What unites Anaplanners across teams and More ❯
At Anaplan, we are a team of innovators who are focused on optimizing business decision-making through our leading scenario planning and analysis platform so our customers can outpace their competition and the market. What unites Anaplanners across teams and More ❯
patterns, and packaging. Familiarity with building performant and reliable Python systems, including low-level C/C++ extensions (e.g., using pybind11, Cython) and instrumentation for production telemetry (e.g., Prometheus, OpenTelemetry). A proactive ownership mindset and the ability to navigate ambiguity. Excellent collaboration and communication skills for working effectively with teams and stakeholders. Ideally Professional experience GPGPU programming (e.g., CUDA More ❯
platform, writing new monitoring queries to drive our alerting, or coordinating across multiple teams to manage the response to an incident. Our technology stack: AWS (including ECS and RDS), OpenTelemetry, NewRelic, Python, Postgres, Liquibase, Angular, Docker Who you are: Four or more years professional experience in a customer-facing technical support or engineering role Excellent verbal and written communication skills More ❯