Major Service Outage

Incident Report for FinanzaPro

Resolved

This incident has been resolved.
Posted Oct 20, 2025 - 19:59 CST

Update

AWS has restored service to its normal, pre-event levels of operation. On our platform, all pending events have finished processing and all information shared between services is now up to date.

We will continue monitoring the platform to confirm that everything is operating normally.

---

[02:48 PM PDT] We have restored EC2 instance launch throttles to pre-event levels and EC2 launch failures have recovered across all Availability Zones in the US-EAST-1 Region. AWS services that rely on EC2 instance launches, such as Redshift, are successfully working through their backlog of EC2 instance launches, and we anticipate full recovery of the backlog over the next two hours. We can confirm that Connect is handling new voice and chat sessions normally. There is a backlog of analytics and reporting data that we must process, and we anticipate that we will have worked through it over the next two hours. We will provide an update by 3:30 PM PDT.
Posted Oct 20, 2025 - 16:30 CST

Update

AWS continues to mitigate and resolve the issues across its infrastructure.

On our platform, most of the events that could not be processed during this outage have now been processed, so data in every service should soon be fully up to date with the changes made in other services.

Beyond this, the entire platform is operating normally.

---

[01:52 PM PDT] We have continued to reduce throttles for EC2 instance launches in the US-EAST-1 Region and we continue to make progress toward pre-event levels in all Availability Zones (AZs). AWS services such as ECS and Glue, which rely on EC2 instance launches, will recover as the successful instance launch rate improves. We see full recovery for Lambda invocations and are working through the backlog of queued events, which we expect to be fully processed in approximately the next two hours. We will provide another update by 2:30 PM PDT.

[01:03 PM PDT] Service recovery across all AWS services continues to improve. We continue to reduce throttles for new EC2 Instance launches in the US-EAST-1 Region that were put in place to help mitigate impact. Lambda invocation errors have fully recovered and function errors continue to improve. We have scaled up the rate of polling SQS queues via Lambda Event Source Mappings to pre-event levels. We will provide another update by 1:45 PM PDT.
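
The SQS-polling knob AWS describes here also exists on customer-managed Event Source Mappings, so the same throttle-then-restore pattern can be applied to your own SQS-triggered functions. Below is a minimal sketch in Python with boto3; the mapping UUID is hypothetical.

    import boto3  # AWS SDK for Python

    lam = boto3.client("lambda", region_name="us-east-1")
    # Hypothetical UUID of an SQS Event Source Mapping.
    ESM_UUID = "11111111-2222-3333-4444-555555555555"

    # Cap the number of concurrent pollers to slow consumption (minimum is 2)...
    lam.update_event_source_mapping(
        UUID=ESM_UUID,
        ScalingConfig={"MaximumConcurrency": 2},
    )
    # ...and later remove the cap (empty ScalingConfig) to return to normal rates.
    lam.update_event_source_mapping(UUID=ESM_UUID, ScalingConfig={})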

[12:15 PM PDT] We continue to observe recovery across all AWS services, and instance launches are succeeding across multiple Availability Zones in the US-EAST-1 Region. For Lambda, customers may face intermittent function errors for functions making network requests to other services or systems as we work to address residual network connectivity issues. To recover Lambda’s invocation errors, we slowed down the rate of SQS polling via Lambda Event Source Mappings. We are now increasing the rate of SQS polling as we experience more successful invocations and reduced function errors. We will provide another update by 1:00 PM PDT.
Posted Oct 20, 2025 - 15:23 CST

Monitoring

AWS continues to mitigate and resolve the issues across its infrastructure. On our side, almost all services are back at normal operating levels, with intermittent errors occurring while our platform stabilizes.

At this time, we are waiting for all pending messages in the event queues to be processed, so significant delays in the propagation of information between some services can still be observed (for example, a new product may still take an indeterminate amount of time to appear in POS v2).
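
For reference, "waiting for the event queues to drain" can be observed directly on a queue's approximate depth. Below is a minimal monitoring sketch in Python with boto3, assuming an SQS-backed event queue you operate; the queue URL is hypothetical and stands in for whichever queue backs your integration.

    import time
    import boto3  # AWS SDK for Python

    sqs = boto3.client("sqs", region_name="us-east-1")
    # Hypothetical queue URL; substitute the event queue you care about.
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events-queue"

    def backlog_depth(queue_url):
        """Return the approximate number of messages still waiting in the queue."""
        attrs = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=["ApproximateNumberOfMessages"],
        )
        return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    while (depth := backlog_depth(QUEUE_URL)) > 0:
        print(depth, "events still pending")
        time.sleep(60)  # poll once a minute until the backlog drains
    print("Event queue drained")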

---

[11:22 AM PDT] Our mitigations to resolve launch failures for new EC2 instances continue to progress and we are seeing increased launches of new EC2 instances and decreasing networking connectivity issues in the US-EAST-1 Region. We are also experiencing significant improvements to Lambda invocation errors, especially when creating new execution environments (including for Lambda@Edge invocations). We will provide an update by 12:00 PM PDT.
Posted Oct 20, 2025 - 13:21 CST

Update

AWS has begun rolling out a fix relevant to our platform across several of its Availability Zones. Thanks to this, we were able to restore service to the classic app, which is now working normally.

The following services remain out of operation:

1. The legacy web application (company administration and electronic invoicing configuration).
2. Report and PDF generation from the API.

Other platform services are working, but are not yet operating at normal capacity.

---

[10:38 AM PDT] Our mitigations to resolve launch failures for new EC2 instances are progressing and the internal subsystems of EC2 are now showing early signs of recovering in a few Availability Zones (AZs) in the US-EAST-1 Region. We are applying mitigations to the remaining AZs at which point we expect launch errors and network connectivity issues to subside. We will provide an update by 11:30 AM PDT.
Posted Oct 20, 2025 - 12:15 CST

Update

AWS continues working to resolve this problem. It is currently validating a potential fix for the issues it is experiencing which, if it works, should allow us to launch the missing infrastructure and fully stabilize our services.

---

[10:03 AM PDT] We continue to apply mitigation steps for network load balancer health and recovering connectivity for most AWS services. Lambda is experiencing function invocation errors because an internal subsystem was impacted by the network load balancer health checks. We are taking steps to recover this internal Lambda system. For EC2 launch instance failures, we are in the process of validating a fix and will deploy to the first AZ as soon as we have confidence we can do so safely. We will provide an update by 10:45 AM PDT.
Posted Oct 20, 2025 - 11:28 CST

Update

AWS continues working to resolve this problem. On our side, we have managed to launch a few more instances to give the platform more stability. Some services are currently out of operation:

1. The legacy web application (company administration and electronic invoicing configuration).
2. Report and PDF generation from the API.

Additionally, the session servers (the classic app) are running at limited capacity, since we have not been able to launch additional instances of this service. For this reason, you may experience difficulty accessing this app. We are still trying to add more infrastructure to this service.

---

[09:13 AM PDT] We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services. We have also identified and are applying next steps to mitigate throttling of new EC2 instance launches. We will provide an update by 10:00 AM PDT.
Posted Oct 20, 2025 - 10:39 CST

Update

AWS continues working to resolve this problem. Our ability to launch new instances to add capacity to our platform remains severely limited.

However, we have been able to bring up some additional instances of the affected servers, which has allowed us to restore services and begin providing more consistent service. There is still not enough infrastructure to handle the platform's normal load, so services may feel slow or return errors, but it is at least possible to access the applications and work with them.

---

[08:43 AM PDT] We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations.
Posted Oct 20, 2025 - 10:10 CST

Identified

We are currently experiencing a major service outage, caused by a major failure at our service provider, Amazon Web Services. The platform cannot be accessed at all.

The problem began at approximately 1:10 a.m., and AWS has been working to restore services since then. However, after several mitigation attempts, it has not yet been able to restore them.

In our case, this event began affecting us at approximately 8:15 a.m., when the normal user load started to build up and our platform was unable to scale up the resources needed for that load, something it normally does starting at 6:30 a.m.
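
For context, that 6:30 a.m. scale-up is the kind of scheduled action an Auto Scaling group runs automatically, and it was this morning's launch throttling that kept it from completing. A minimal sketch of such a schedule in Python with boto3; the group name, sizes, and UTC offset are hypothetical.

    import boto3  # AWS SDK for Python

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Hypothetical daily scale-up ahead of the morning load; Recurrence is
    # cron syntax evaluated in UTC (12:30 UTC assumed to be 6:30 a.m. local).
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="app-servers",    # hypothetical group name
        ScheduledActionName="morning-scale-up",
        Recurrence="30 12 * * *",
        MinSize=4,
        DesiredCapacity=8,
        MaxSize=12,
    )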

At this time, we are waiting for more information from AWS on the resolution of this problem.

For now, all companies should activate their contingency processes for manual invoicing while AWS determines the cause of the problem and delivers a definitive fix.

Below is the detail of the problem as provided by AWS so far. We will keep you informed through this channel as we receive more information.

---

[08:04 AM PDT] We continue to investigate the root cause for the network connectivity issues that are impacting AWS services such as DynamoDB, SQS, and Amazon Connect in the US-EAST-1 Region. We have identified that the issue originated from within the EC2 internal network. We continue to investigate and identify mitigations.

[07:29 AM PDT] We have confirmed multiple AWS services experienced network connectivity issues in the US-EAST-1 Region. We are seeing early signs of recovery for the connectivity issues and are continuing to investigate the root cause.

[07:14 AM PDT] We can confirm significant API errors and connectivity issues across multiple services in the US-EAST-1 Region. We are investigating and will provide further update in 30 minutes or soon if we have additional information.

[06:42 AM PDT] We have applied multiple mitigations across multiple Availability Zones (AZs) in US-EAST-1 and are still experiencing elevated errors for new EC2 instance launches. We are rate limiting new instance launches to aid recovery. We will provide an update at 7:30 AM PDT or sooner if we have additional information.

[05:48 AM PDT] We are making progress on resolving the issue with new EC2 instance launches in the US-EAST-1 Region and are now able to successfully launch new instances in some Availability Zones. We are applying similar mitigations to the remaining impacted Availability Zones to restore new instance launches. As we continue to make progress, customers will see an increasing number of successful new EC2 launches. We continue to recommend that customers launch new EC2 instances that are not targeted to a specific Availability Zone (AZ) so that EC2 has flexibility in selecting the appropriate AZ.
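
As an illustration of an AZ-flexible launch: omitting both Placement and SubnetId leaves EC2 free to pick a healthy Availability Zone on its own. A minimal sketch in Python with boto3, assuming a default VPC; the AMI ID is hypothetical.

    import boto3  # AWS SDK for Python

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # No Placement/AvailabilityZone and no SubnetId: EC2 chooses the AZ,
    # which is exactly the flexibility recommended above.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["Placement"]["AvailabilityZone"])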

We also wanted to share that we are continuing to successfully process the backlog of events for both EventBridge and Cloudtrail. New events published to these services are being delivered normally and are not experiencing elevated delivery latencies.

We will provide an update by 6:30 AM PDT or sooner if we have additional information to share.

[05:10 AM PDT] We confirm that we have now recovered processing of SQS queues via Lambda Event Source Mappings. We are now working through processing the backlog of SQS messages in Lambda queues.

[04:48 AM PDT] We continue to work to fully restore new EC2 launches in US-EAST-1. We recommend EC2 Instance launches that are not targeted to a specific Availability Zone (AZ) so that EC2 has flexibility in selecting the appropriate AZ. The impairment in new EC2 launches also affects services such as RDS, ECS, and Glue. We also recommend that Auto Scaling Groups are configured to use multiple AZs so that Auto Scaling can manage EC2 instance launches automatically.
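
A multi-AZ Auto Scaling group, as recommended above, simply spans subnets that sit in different Availability Zones. A minimal sketch in Python with boto3; the group name, launch template, and subnet IDs are hypothetical.

    import boto3  # AWS SDK for Python

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # The three subnets are in different AZs, so Auto Scaling can route
    # instance launches away from an impaired zone automatically.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="app-servers",  # hypothetical group name
        LaunchTemplate={"LaunchTemplateName": "app-server-template"},  # hypothetical
        MinSize=2,
        MaxSize=10,
        VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    )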

We are pursuing further mitigation steps to recover Lambda’s polling delays for Event Source Mappings for SQS. AWS features that depend on Lambda’s SQS polling capabilities such as Organization policy updates are also experiencing elevated processing times. We will provide an update by 5:30 AM PDT.

[04:08 AM PDT] We are continuing to work towards full recovery for EC2 launch errors, which may manifest as an Insufficient Capacity Error. Additionally, we continue to work toward mitigation for elevated polling delays for Lambda, specifically for Lambda Event Source Mappings for SQS. We will provide an update by 5:00 AM PDT.

[03:35 AM PDT] The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
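
Whether the DynamoDB endpoint resolves from your side is easy to verify before and after flushing a DNS cache; the flush itself is OS-specific and not shown here. A minimal diagnostic sketch in Python using only the standard library:

    import socket

    # A gaierror here suggests your resolver is still serving the failed
    # lookups and its cache should be flushed.
    try:
        infos = socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
        print(sorted({info[4][0] for info in infos}))
    except socket.gaierror as exc:
        print("DNS resolution still failing:", exc)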

[03:03 AM PDT] We continue to observe recovery across most of the affected AWS Services. We can confirm global services and features that rely on US-EAST-1 have also recovered. We continue to work towards full resolution and will provide updates as we have more information to share.

[02:27 AM PDT] We are seeing significant signs of recovery. Most requests should now be succeeding. We continue to work through a backlog of queued requests. We will continue to provide additional information.

[02:22 AM PDT] We have applied initial mitigations and we are observing early signs of recovery for some impacted AWS Services. During this time, requests may continue to fail as we work toward full resolution. We recommend customers retry failed requests. While requests begin succeeding, there may be additional latency and some services will have a backlog of work to work through, which may take additional time to fully process. We will continue to provide updates as we have more information to share, or by 3:15 AM.
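
The "retry failed requests" guidance does not require hand-rolled loops; boto3 clients can retry with exponential backoff declaratively. A minimal sketch, assuming a DynamoDB client:

    import boto3  # AWS SDK for Python
    from botocore.config import Config

    # "adaptive" retry mode layers client-side rate limiting on top of the
    # exponential backoff already used by the standard retry mode.
    retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
    dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)

    # Calls through this client are retried automatically on throttling
    # and transient errors.
    print(dynamodb.list_tables()["TableNames"])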

[02:01 AM PDT] We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery. This issue also affects other AWS Services in the US-EAST-1 Region. Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues. During this time, customers may be unable to create or update Support Cases. We recommend customers continue to retry any failed requests. We will continue to provide updates as we have more information to share, or by 2:45 AM.

[01:26 AM PDT] We can confirm significant error rates for requests made to the DynamoDB endpoint in the US-EAST-1 Region. This issue also affects other AWS Services in the US-EAST-1 Region as well. During this time, customers may be unable to create or update Support Cases. Engineers were immediately engaged and are actively working on both mitigating the issue, and fully understanding the root cause. We will continue to provide updates as we have more information to share, or by 2:00 AM.

[12:51 AM PDT] We can confirm increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. This issue may also be affecting Case Creation through the AWS Support Center or the Support API. We are actively engaged and working to both mitigate the issue and understand root cause. We will provide an update in 45 minutes, or sooner if we have additional information to share.

[12:11 AM PDT] We are investigating increased error rates and latencies for multiple AWS services in the US-EAST-1 Region. We will provide another update in the next 30-45 minutes.
Posted Oct 20, 2025 - 09:07 CST
This incident affected: Session Services (Session Services Zone 1, Session Services Zone 2, Session Services Zone 3), Web Application, Application Services, Reporting Services, and Infrastructure (Load Balancers, Server Infrastructure, Function Services, Database Services, NoSQL Database Services, Notification Services, Internal Messaging Services (between services), File Storage Services, Cache Services, CDN).