Kofi Gyasi

2025-05-01

The Ephemeral IP Problem: When Your DNS Doesn't Know Your Service Moved

A real-world incident where a service restart silently broke public routing — and how to build AWS-hosted services so this can never happen again.

Incident post-mortem · AWS · monitoring & synthetics (Datadog used as the concrete example) · ~10 min read · Infrastructure & reliability

In this post

  1. What happened — the incident
  2. Why it happened — root cause
  3. Setting up services the right way on AWS
  4. Detection with Datadog
  5. Resolution playbook
  6. Prevention checklist

1. What happened

A public-facing service was deployed on AWS, and its assigned public IP or DNS hostname was manually configured in Cloudflare as the routing target — telling Cloudflare “send traffic for this domain to that address.” No stable internal endpoint was linked to the service; the mapping pointed directly to whatever address the service had at that particular moment.

Then the service was restarted. On AWS, that alone can change its address: stopping and starting an EC2 instance (as opposed to rebooting it in place) releases its auto-assigned public IP, and every replacement ECS task or container comes up with a fresh one. Unless the workload sits behind a stable load balancer or holds an Elastic IP, its address is disposable. The service came back online healthy, but Cloudflare was still routing to the old address. Users began seeing connection failures.

The core failure
The Cloudflare DNS record became stale the moment the service restarted. There was no indirection layer between Cloudflare and the live service — just a static entry that was correct once, and wrong ever after.

2. Why it happened — root cause

This is a classic infrastructure anti-pattern: coupling a routing layer to a transient address. On AWS, this is especially easy to fall into because instance public IPs, ECS task IPs, and container addresses are all ephemeral by design. The failure had four compounding layers.

Failure chain

  1. No stable AWS service endpoint
    The service had no Application Load Balancer (ALB) or Network Load Balancer (NLB) in front of it. Its public identity was its ephemeral IP — which AWS releases on stop/start unless an Elastic IP is explicitly allocated.

  2. Cloudflare pointed at that ephemeral address directly
    The DNS record in Cloudflare used the raw EC2 public IP or ECS task hostname instead of an ALB DNS name, bypassing every stable endpoint AWS could have provided.

  3. No monitoring or synthetics covering the Cloudflare origin
    There was no synthetic test or origin health check (from Datadog or any equivalent tool) that would alert when the upstream behind the Cloudflare record became unreachable. The failure was discovered through user complaints, not monitoring.

  4. Service restarted and got a new address
    The service came back healthy from AWS's perspective, but Cloudflare kept routing to the old, now-unresponsive address. Users received 502 or 521 errors at the edge.

3. Setting up services the right way on AWS

The fix is architectural — not operational. The goal is a setup where a service restart can never silently break public routing. On AWS, this means placing a stable managed endpoint between Cloudflare and your running workload.

Use an Application Load Balancer (ALB) as the Cloudflare target

Every public-facing service should sit behind an ALB. AWS assigns the ALB a DNS name (e.g. my-service-alb-123456.us-east-1.elb.amazonaws.com) that stays stable for the life of the load balancer, no matter how many times the underlying EC2 instances, ECS tasks, or Lambda functions behind it restart or scale. Cloudflare should always target this ALB DNS name — never the IP of a specific instance or task.

# WRONG — pointing Cloudflare at an ephemeral EC2 public IP
# (note this must be an A record — a CNAME cannot target an IP)
A      api.yourdomain.com  →  54.23.101.12   # released on stop/start

# RIGHT — pointing at the ALB DNS name (stable forever)
CNAME  api.yourdomain.com  →  my-service-alb-123456.us-east-1.elb.amazonaws.com
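
A side note on the zone apex: a plain CNAME is not normally allowed at yourdomain.com itself, but Cloudflare flattens apex CNAMEs automatically, so pointing the root at the ALB DNS name works the same way.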

On ECS: use a Service with an ALB Target Group

When deploying containers via Amazon ECS (EC2 or Fargate launch type), define an ECS Service — not a standalone task — and attach it to an ALB Target Group. ECS will automatically register and deregister task IPs in the target group as tasks start and stop. The ALB handles all the routing; Cloudflare never needs to know a task's address.

AWS-specific note
On ECS Fargate, every new task gets its own ENI and private IP in your VPC. Without a load balancer there is no persistent address to point Cloudflare at, so for public Fargate services an ALB (or NLB) isn't just best practice — it's effectively the only stable front door.
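
A minimal Terraform sketch of that wiring, assuming a Fargate service. The cluster, task definition, subnet, and security group references are illustrative, and aws_lb.api is the ALB resource shown a little further down:

# Target group the ALB forwards to; ECS registers and deregisters task IPs here
resource "aws_lb_target_group" "api" {
  name        = "api-service-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip" # required for Fargate (awsvpc networking)

  health_check {
    path    = "/health"
    matcher = "200"
  }
}

# Listener so the ALB actually forwards traffic to the target group
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.api.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}

# An ECS Service (not a standalone task) keeps the target group in sync
resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = var.ecs_cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.service_sg_id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api" # must match the container name in the task definition
    container_port   = 8080
  }
}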

On EC2: use an ALB or allocate an Elastic IP

For EC2-hosted services, two options are viable. The preferred approach is the same: sit the instance behind an ALB. If for some reason a direct IP is required (uncommon), allocate an Elastic IP and associate it with the instance. Elastic IPs persist across instance stops and restarts, making them safe to put in Cloudflare. A regular public IP does not.
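
If you do take the Elastic IP route, it's two small resources in Terraform, plus a DNS record that targets the allocation rather than the instance. A sketch, where aws_instance.api is an illustrative instance resource:

# An Elastic IP survives instance stop/start, so it is safe to reference in DNS
resource "aws_eip" "api" {
  domain = "vpc"
}

resource "aws_eip_association" "api" {
  instance_id   = aws_instance.api.id
  allocation_id = aws_eip.api.id
}

# Cloudflare targets the stable EIP, never the instance's auto-assigned IP
resource "cloudflare_record" "api_direct" {
  zone_id = var.cloudflare_zone_id
  name    = "api"
  type    = "A"
  value   = aws_eip.api.public_ip
  proxied = true
}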

Manage DNS in code using Terraform

Manual Cloudflare DNS updates create silent dependencies that live outside version control. Define both the AWS infrastructure and the Cloudflare DNS record in Terraform so they stay in sync. When the ALB is recreated, the DNS record is updated in the same plan.

# ALB definition (simplified)
resource "aws_lb" "api" {
  name               = "api-service-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
}

# Cloudflare CNAME automatically picks up the ALB's DNS name
resource "cloudflare_record" "api" {
  zone_id = var.cloudflare_zone_id
  name    = "api"
  type    = "CNAME"
  value   = aws_lb.api.dns_name
  proxied = true
}
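
Two details in this block do real work: value = aws_lb.api.dns_name means a recreated ALB flows into DNS on the next terraform apply instead of waiting for a human to notice, and proxied = true keeps Cloudflare's edge in front of the origin, so clients resolve Cloudflare IPs and never see the ALB hostname directly.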

Avoid: direct to instance/task — an EC2 public IP, ECS task IP, or raw hostname in Cloudflare. One restart breaks everything silently, and there is no tolerance for scaling events.

Use instead: the ALB DNS name in Cloudflare — it stays stable through restarts, redeployments, and autoscaling, and the ALB routes only to healthy targets.

4. Detection with Datadog

Tooling
The examples below use Datadog (Synthetics, monitors, integrations) because that is what this post was written against. The ideas are vendor-agnostic: multi-region synthetic checks on the public domain, a second check that hits the ALB directly, alerting on ALB 5xx from your cloud metrics pipeline, and a post-deploy smoke test in CI/CD. You can implement the same playbook with Grafana Cloud, New Relic, Dynatrace, Honeycomb (plus external synthetics), AWS CloudWatch Synthetics and alarms, Azure Monitor, Google Cloud Monitoring, or any stack that gives you synthetics plus metrics and routing to on-call.

Even with a well-architected setup, misconfigurations happen — and the goal is to detect them in minutes, not from a user support ticket. A full-featured observability platform makes that realistic; Datadog is one option that covers this failure class end-to-end.

Synthetic API tests on the public domain (Datadog)

Create a Datadog Synthetic API test that makes an HTTP GET request to your public domain every minute. Configure it to assert on a 200 status code and optionally a response body substring. If the test starts failing — whether from a stale Cloudflare record, a downed origin, or an SSL issue — Datadog alerts your on-call channel immediately.

Recommended test setup in Datadog
Navigate to Synthetics → New Test → API Test. Set the URL to your public domain (e.g. https://api.yourdomain.com/health), choose multiple test locations (at least 3 regions to avoid false positives from a single PoP), set check frequency to 1 minute, and alert after 2 consecutive failures. Wire the alert to your PagerDuty or Slack integration.
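
If you manage monitoring as code, the same test can be declared with the Datadog Terraform provider. A minimal sketch, assuming the provider is configured and the @pagerduty/@slack handles exist in your Datadog account:

resource "datadog_synthetics_test" "api_public" {
  name      = "api.yourdomain.com /health (through Cloudflare)"
  type      = "api"
  subtype   = "http"
  status    = "live"
  locations = ["aws:us-east-1", "aws:eu-west-1", "aws:ap-southeast-1"]
  message   = "Public API health check is failing. @pagerduty-api @slack-incidents"

  request_definition {
    method = "GET"
    url    = "https://api.yourdomain.com/health"
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every           = 60  # run every minute
    min_failure_duration = 120 # roughly two consecutive failures before alerting
  }
}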

Separate origin health test (Datadog)

Create a second Synthetic test that targets your ALB DNS name directly — bypassing Cloudflare. If the origin test passes but the domain test fails, the fault is in Cloudflare routing, not the service. This two-tier approach cuts mean time to diagnosis dramatically, because the team immediately knows the blast radius.

# Test 1 — public domain (end-to-end, tests Cloudflare + origin)
URL       https://api.yourdomain.com/health
Assert    status_code == 200
Frequency 1 min  |  Locations us-east-1, eu-west-1, ap-southeast-1
Alert     after 2 failures → PagerDuty + #incidents Slack

# Test 2 — ALB origin direct (bypasses Cloudflare)
URL       http://my-service-alb-123456.us-east-1.elb.amazonaws.com/health
Assert    status_code == 200
Frequency 1 min  |  Locations us-east-1
Alert     after 2 failures → #incidents Slack

Datadog monitor on ALB 5xx metrics (AWS + Datadog)

Enable the Datadog AWS integration to pull CloudWatch metrics from your ALB. Create a Datadog monitor on the aws.applicationelb.httpcode_elb_5xx_count metric. A spike in 5xx responses from the ALB — especially 502 or 504 — is an early signal that the ALB can't reach its registered targets, which may indicate a mismatch between the ALB's target group and the actual running service.

# Alert when ALB 5xx count exceeds threshold
sum(last_5m):aws.applicationelb.httpcode_elb_5xx_count{load_balancer:api-service-alb}.as_count() > 10

# Also monitor Cloudflare edge errors via Cloudflare → Datadog log pipeline
sum(last_5m):cloudflare.requests.edge_response_status{status:5xx}.as_count() > 20
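
The ALB monitor can live in Terraform next to the infrastructure it watches. A sketch using the Datadog provider; the threshold is a starting point to tune, not a universal value:

resource "datadog_monitor" "alb_5xx" {
  name    = "ALB 5xx spike on api-service-alb"
  type    = "metric alert"
  query   = "sum(last_5m):aws.applicationelb.httpcode_elb_5xx_count{load_balancer:api-service-alb}.as_count() > 10"
  message = "The ALB is returning 5xx. Targets may be unhealthy or unregistered. @slack-incidents"

  monitor_thresholds {
    critical = 10
  }
}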

Add a post-deploy validation step to your CI/CD pipeline

After every ECS deployment or EC2 AMI rollout, add a pipeline step that curls the public domain and the ALB origin, asserting healthy responses before marking the deployment complete. If either check fails, the pipeline fails the deploy and rolls back — and your observability or CI system can record the event for correlation later (Datadog Deployment Tracking is one example).
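
The step itself is small. In the same pseudo-config style as the synthetic definitions above (adapt to your CI system; the URLs are the examples used throughout this post):

# Post-deploy smoke test, run after the service reports steady state
Probe 1   GET https://api.yourdomain.com/health  →  expect 200
Probe 2   GET http://my-service-alb-123456.us-east-1.elb.amazonaws.com/health  →  expect 200
On fail   fail the deploy, trigger rollback, notify #incidents
On pass   mark the deploy complete and emit a deploy event for correlation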

5. Resolution playbook

When this incident fires — the public domain returning 5xx errors, origin confirmed healthy by direct test — follow this sequence. Time-to-resolution with this playbook should be under 10 minutes.

Resolution steps — target: under 10 minutes

  1. Confirm service is running in AWS — Check the ECS Service in the AWS Console (or run aws ecs describe-services). Confirm tasks are in RUNNING state and the health check endpoint responds locally.

  2. Check synthetic test results — Open your synthetics product (e.g. Datadog → Synthetics). If the origin test (ALB direct) is passing but the domain test is failing, the issue is definitively in Cloudflare routing — not the service.

  3. Identify the current AWS service address — If no ALB is in place, find the new IP/hostname the service was assigned (EC2 public IP in the console, or ECS task ENI address). This is what Cloudflare needs to point to now.

  4. Update the Cloudflare DNS record — In Cloudflare DNS, update the A or CNAME record for the affected domain to the current service address. If using Terraform, update the value in code and apply — never patch manually long-term.

  5. Purge Cloudflare cache — Under Caching → Configuration in the Cloudflare dashboard, purge everything (or purge by URL) to clear any stale cached responses. The DNS change itself reaches Cloudflare's edge within seconds; the purge just ensures no cached error pages linger.

  6. Verify synthetics and monitors — Watch the failing synthetic run — it should recover to passing within 1–2 check intervals (1–2 minutes). Confirm the related alert or monitor clears before closing the incident.

  7. File a remediation ticket — The DNS update was a hotfix. File a ticket to place the service behind an ALB before the next restart. Document the incident and link to this playbook.
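
Several of these steps reduce to one-liners. Cluster, service, and hostnames below are illustrative:

# Fast triage (illustrative names)
Service state    aws ecs describe-services --cluster prod --services api-service
Current DNS      dig +short api.yourdomain.com
Edge response    curl -sI https://api.yourdomain.com/health | head -n 1
Origin response  curl -sI http://my-service-alb-123456.us-east-1.elb.amazonaws.com/health | head -n 1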

6. Prevention checklist

Use this as a pre-launch review for any service that will be publicly routed through Cloudflare on AWS. Where the items say Datadog, read that as your monitoring stack of choice — anything that can run multi-region synthetics, chart ALB metrics, and page on-call the same way.

  • The Cloudflare DNS record targets an AWS ALB DNS name or Elastic IP — never a raw EC2 public IP or ECS task address
  • The ALB has health checks configured on the target group, so it stops routing to unhealthy tasks automatically
  • The AWS infrastructure and Cloudflare DNS record are both defined in Terraform and live in the same repository
  • A Datadog Synthetic API test probes the public domain every minute from at least 3 regions, with PagerDuty/Slack alerting
  • A second Datadog Synthetic test probes the ALB origin directly, so routing vs. service failures are immediately distinguishable
  • A Datadog monitor is set on aws.applicationelb.httpcode_elb_5xx_count and alerts on spikes
  • The CI/CD deploy pipeline runs an end-to-end health check post-deployment and fails the release if it doesn't pass
  • A short playbook for this failure mode is easy to find for on-call (README, wiki, Notion, PagerDuty runbook, etc.)

Key takeaway
Routing stability is a design property, not an operational habit. On AWS, the ALB is your stable identity layer — not the instance or task behind it. Build the ALB first, configure Cloudflare to target it, and use synthetic checks from your monitoring tool (Datadog or otherwise) to confirm it end-to-end. A service restart should be a non-event from a routing perspective.


This article is based on patterns seen in production systems; it is written for a public audience (including LinkedIn and other channels). The aim is practical: help teams avoid coupling public DNS to ephemeral infrastructure. If it resonates or you would like to compare notes, I am happy to continue the conversation in the comments or via my contact links on this site.