Internal outage
Incident Report for EasyPost
Postmortem

Many operations on EasyPost were unavailable on Wednesday, March 25, 2020 between approximately 09:04 and 09:25 Pacific due to a DNS outage related to DNSSEC. DNS (the Domain Name System) is the system that maps hostnames (such as api.easypost.com) to IP addresses and other computer-readable metadata; DNSSEC is a set of extensions to DNS that provides cryptographically verified trust for DNS records. This trust helps ensure that the party that owns a domain is also the party responsible for mapping the domain to the addresses where its services are available. Part of DNSSEC is a system called DNSSEC Lookaside Validation (DLV), which was introduced in 2005 and deprecated for new domains in 2017. DLV relies on a server run by the Internet Systems Consortium (ISC), which maintains a cryptographically signed file of the DLV zone.

EasyPost, like many companies, uses the open-source BIND DNS server (from the aforementioned ISC), both to perform recursive resolutions (internal use) and to serve our canonical zones (to the public). BIND supports DLV via the "dnssec-lookaside" configuration parameter. We have multiple levels of internal caching between our applications and our recursive resolvers.

At 09:04 Pacific time, the cryptographic signature on ISC's DLV zone unexpectedly expired, causing all DLV lookups to fail (see this tweet thread from ISC documenting the issue). As cached entries for other records expired, those records began to fail DNSSEC verification, causing an outage of many of our internal systems that make requests to other providers (e.g., carriers). These failures ramped up slowly between 09:04 and 09:15 Pacific until they reached the threshold that triggers high-priority outage notifications to engineers.
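
One way to catch this class of failure early is to monitor how much validity remains on a zone's signatures. The following is a minimal sketch, not part of EasyPost's tooling: it assumes the expiration field of an RRSIG record has already been obtained in its standard presentation form (YYYYMMDDHHmmSS, in UTC), and the function names and the two-day threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def time_until_expiry(rrsig_expiration: str, now: datetime) -> timedelta:
    """Return the time remaining before an RRSIG signature expires.

    rrsig_expiration is the signature expiration field in presentation
    form (YYYYMMDDHHmmSS, always UTC). A negative result means the
    signature has already expired.
    """
    expires = datetime.strptime(rrsig_expiration, "%Y%m%d%H%M%S")
    expires = expires.replace(tzinfo=timezone.utc)
    return expires - now


def should_alert(rrsig_expiration: str, now: datetime,
                 threshold: timedelta = timedelta(days=2)) -> bool:
    """Hypothetical alert rule: warn when less than `threshold` remains."""
    return time_until_expiry(rrsig_expiration, now) < threshold


if __name__ == "__main__":
    # Illustrative timestamp only, not the actual DLV signature expiry.
    now = datetime.now(timezone.utc)
    print(should_alert("20200401000000", now))
```

A check like this only helps for zones whose signatures you can query and act on; for a third-party dependency such as the DLV zone, the alert would at best provide advance warning.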

We quickly realized that the issue was DNS-related and temporarily disabled DNSSEC in order to resume operations; all services were back to full capacity by 09:25 (albeit without the protection of DNSSEC). Upon consulting with outside experts (many thanks to Peter van Dijk, the maintainer of PowerDNS, who is @Habbie on Twitter), we identified the DLV signature as the source of the problem and, at 10:09 Pacific, disabled DLV entirely by setting "dnssec-lookaside no" in our BIND configurations through our configuration-management system. DLV is not required for any current DNSSEC functionality (see ISC's blog post Decommissioning the DLV for details), so if you use BIND or any other DNS resolver that supports DLV, you should disable DLV.
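
For anyone auditing their own resolvers, the check can be automated. The sketch below is an illustration under stated assumptions rather than our actual tooling: it takes the text of a BIND configuration file and reports any "dnssec-lookaside" statement whose value is not "no". The function names are hypothetical, and the comment stripping is deliberately naive (it does not account for comment markers inside quoted strings).

```python
import re
from typing import List


def strip_comments(conf: str) -> str:
    """Remove /* ... */, // and # comments from named.conf text (naively)."""
    conf = re.sub(r"/\*.*?\*/", "", conf, flags=re.DOTALL)
    return re.sub(r"(//|#).*", "", conf)


def dlv_enabled_settings(conf: str) -> List[str]:
    """Return the value of every dnssec-lookaside statement that is not 'no'."""
    values = re.findall(r"\bdnssec-lookaside\s+([^;]*);", strip_comments(conf))
    return [v.strip() for v in values if v.strip() != "no"]


if __name__ == "__main__":
    sample = """
    options {
        dnssec-validation auto;
        dnssec-lookaside auto;  // DLV still enabled here
    };
    """
    for value in dlv_enabled_settings(sample):
        print("DLV still enabled: dnssec-lookaside " + value + ";")
```

An empty result means no statement explicitly enables DLV in that file; it does not say what the default is for a given BIND version, so setting "dnssec-lookaside no" explicitly remains the safer choice.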

Any customers who experienced errors during this time should retry their requests. If you have any questions or concerns, please contact our support team at support@easypost.com.

Posted Mar 26, 2020 - 17:08 PDT

Resolved
This incident has been resolved.
Posted Mar 25, 2020 - 11:19 PDT
Monitoring
A misconfiguration in our DNS server was corrected. Some webhook deliveries may be delayed.
Posted Mar 25, 2020 - 10:21 PDT
Identified
Engineers have identified a likely cause of the outage and are implementing a workaround.
Posted Mar 25, 2020 - 09:31 PDT
Investigating
EasyPost is currently experiencing an outage for some customers. Engineers are investigating.
Posted Mar 25, 2020 - 09:22 PDT
This incident affected: EasyPost Backend (Webhooks, API) and Website.