9/10/2024, 12:00:00 AM ~ 9/11/2024, 12:00:00 AM (UTC)

Recent Announcements

Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development

We are excited to announce the general availability of Amazon EKS support in SageMaker HyperPod, which enables customers to run and manage their Kubernetes workloads on SageMaker HyperPod, a purpose-built infrastructure for foundation model (FM) development that reduces time to train models by up to 40%.

Many customers use Kubernetes to orchestrate their ML workflows due to its portability, scalability, and rich ecosystem of tools. These customers want to continue using Kubernetes’ familiar interface, but still want an automated way to manage hardware failures. EKS support in HyperPod combines the self-healing, performant clusters of SageMaker HyperPod with the containerization capabilities of Amazon EKS, a managed Kubernetes service. With this launch, customers can run deep health checks during cluster creation to reduce failures during training. Further, HyperPod automatically replaces faulty nodes and resumes training from your last checkpoint on both AWS Trainium and NVIDIA GPUs at a scale of more than a thousand accelerators. Customers have the flexibility to use either the new HyperPod CLI or their preferred tools to submit, manage, and monitor workloads. The persistent cluster environment offers SSM access and the ability to customize the cluster. EKS-orchestrated HyperPod clusters also integrate with CloudWatch Container Insights to provide out-of-the-box observability by auto-discovering HyperPod node health status and visualizing it in curated dashboards. This release is generally available in the AWS Regions where SageMaker HyperPod is available, except Europe (London). To learn more, see the following resources: Webpage, AWS News Blog, Documentation, GitHub repository.

Amazon EMR on EC2 improves cluster launch experience with intelligent subnet selection

Starting today, Amazon EMR on EC2 offers improved reliability and cluster launch experience for instance fleet clusters through enhanced subnet selection. With this feature, EMR on EC2 reduces cluster launch failures caused by IP address shortages.

Amazon EMR is a cloud big data platform for data processing, interactive analysis, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto. Previously, the subnet selection for EMR clusters only considered the available IP addresses for the core instance fleet. Amazon EMR now employs subnet filtering at cluster launch and selects one of the subnets that have enough available IP addresses to launch all instance fleets. If EMR cannot find a subnet with sufficient IP addresses to launch the whole cluster, it prioritizes a subnet that can at least launch the core and primary instance fleets. In this scenario, EMR also publishes a CloudWatch warning event to notify the user. If none of the configured subnets can be used to provision the core and primary fleets, EMR fails the cluster launch and provides a critical error event. These CloudWatch events enable you to monitor your clusters and take remedial actions as necessary. Customers benefit from this feature on all EMR 5.12.1 and later releases when launching EMR instance fleet clusters using allocation strategies. No further action is needed on your end. This capability is available in all AWS Regions, including the AWS GovCloud (US) Regions, where Amazon EMR on EC2 is available. To learn more, refer to the documentation.
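The selection order described above can be sketched in a few lines of Python. This is an illustration only, not EMR's implementation: the names `Subnet`, `Selection`, and `select_subnet` are hypothetical, each instance is assumed to need exactly one IP address, and picking the first qualifying subnet in list order is an assumption (the announcement only says EMR selects "one of" the adequate subnets).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subnet:
    subnet_id: str
    available_ips: int


@dataclass
class Selection:
    subnet_id: Optional[str]  # None when the launch would fail
    event: Optional[str]      # None, "WARNING", or "CRITICAL"


def select_subnet(subnets: List[Subnet], primary_ips: int,
                  core_ips: int, task_ips: int) -> Selection:
    """Pick a subnet following the three-tier order in the announcement."""
    essential = primary_ips + core_ips   # primary and core fleets
    whole_cluster = essential + task_ips  # all instance fleets

    # Tier 1: a subnet that can host every instance fleet.
    for subnet in subnets:
        if subnet.available_ips >= whole_cluster:
            return Selection(subnet.subnet_id, None)

    # Tier 2: a subnet that can host at least the primary and core fleets;
    # the launch proceeds, accompanied by a warning event.
    for subnet in subnets:
        if subnet.available_ips >= essential:
            return Selection(subnet.subnet_id, "WARNING")

    # Tier 3: nothing fits; the launch fails with a critical error event.
    return Selection(None, "CRITICAL")
```

For example, with subnets offering 5 and 50 free addresses, a cluster needing 25 addresses in total would land in the second subnet with no event, while a cluster needing 105 in total but only 5 for its primary and core fleets would land in the first subnet with a warning.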

Container Insights now announces SageMaker HyperPod node health observability on EKS

Amazon CloudWatch Container Insights now auto-discovers the health status of your SageMaker HyperPod nodes running on EKS and visualizes it in curated dashboards to help you monitor node availability for operational excellence. Using out-of-the-box dashboards, you can easily identify unhealthy nodes and mitigate issues quickly to keep training durations efficient.

Container Insights works with SageMaker to collect deep health check test results for HyperPod nodes and displays them in preset dashboards to help you understand the health and performance of your nodes and identify whether they are ready for scheduling. Container Insights helps you optimize training durations by classifying failing nodes as “pending reboot” or “pending replacement,” and by guiding you on maintaining node health when automatic node replacement is disabled. If auto-recovery is enabled, you gain visibility into node mutations and delays in your training jobs, and can understand how your tasks resume from the last checkpoint. Getting started with Container Insights is easy. You can onboard by installing the CloudWatch Observability EKS add-on or the latest CloudWatch agent into your clusters, or by upgrading your Helm charts to the latest CloudWatch agent version. Once configured, you can navigate to the Container Insights console and view your SageMaker HyperPod node health status out of the box. SageMaker HyperPod node health observability is now available in Container Insights for EKS in all commercial Regions where SageMaker HyperPod is present. HyperPod node health metrics follow observation-based pricing; see the Container Insights pricing page for details. For further information, see the Container Insights user guide.

Amazon MSK enhances cross-cluster replication with support for identical topic names

Amazon MSK Replicator now supports a new configuration that enables you to preserve original Kafka topic names while replicating streaming data across Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters. Amazon MSK Replicator is a feature of Amazon MSK that lets you reliably replicate data across MSK clusters in the same or different AWS Regions with just a few clicks. The new configuration reduces the need to reconfigure client applications during setup and makes it even simpler to operate multi-cluster streaming architectures, while continuing to benefit from MSK Replicator’s reliability.

With Amazon MSK Replicator, you can easily build regionally resilient streaming applications for business continuity, share data with partners, aggregate data from multiple clusters for analytics, and serve clients globally with lower latency. With the new configuration, you can retain topic names during replication while automatically avoiding the risk of infinite replication loops that comes with using third-party or open-source tools for replication. If you set up an active-passive cluster architecture to build regionally resilient streaming applications, where one cluster handles live traffic while another acts as a standby, the new configuration also streamlines the failover process. Applications can seamlessly fail over to the standby cluster without requiring reconfiguration, as topic names remain intact. Support for the new configuration is available in all Regions where Amazon MSK Replicator is available. To see all the Regions where Amazon MSK Replicator is available, see the AWS Region table. To learn more, visit our developer guide or product page.
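To see why identical topic names simplify failover, consider what a client has to subscribe to on the standby cluster. The sketch below is purely illustrative: `replicated_topic_name` is a hypothetical helper, not part of any MSK API, and the `<source alias>.<topic>` form used for the prefixed case is an assumption about how prefixed replication names topics, which the announcement does not spell out.

```python
def replicated_topic_name(topic: str, source_alias: str,
                          keep_identical: bool) -> str:
    """Name a replicated topic would have on the target cluster.

    keep_identical=True models the new configuration (names preserved).
    keep_identical=False models prefixed naming (assumed format).
    """
    if keep_identical:
        return topic
    return f"{source_alias}.{topic}"


# Topic the application uses on the active (primary) cluster.
primary_topic = "orders"

# With prefixed names, a failover means subscribing to a different topic.
prefixed = replicated_topic_name(primary_topic, "primary", keep_identical=False)

# With identical names, the same subscription works on the standby.
identical = replicated_topic_name(primary_topic, "primary", keep_identical=True)

needs_reconfig_prefixed = prefixed != primary_topic
needs_reconfig_identical = identical != primary_topic
```

Under these assumptions, only the prefixed case forces the client to change its topic configuration when it moves to the standby cluster.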

Amazon OpenSearch Service now supports OpenSearch version 2.15

You can now run OpenSearch version 2.15 in Amazon OpenSearch Service. OpenSearch 2.15 brings several improvements in search performance and query optimization, and adds capabilities that help you build AI-powered applications with greater flexibility and ease.

This launch includes radial search, which allows you to search for points in a vector space that lie within a specified maximum distance or minimum score threshold from a query point, offering greater flexibility for applications such as anomaly detection and geospatial search. In addition, this release includes performance optimizations such as a two-phase processor for neural sparse search, and conditional scoring logic and optimized data handling for hybrid search. These performance improvements allow you to run complex queries on larger datasets more efficiently. OpenSearch now supports a reindex workflow, allowing users to enable vector and hybrid search on existing indexes and reduce the time and resources spent on re-indexing from source indexes. In addition, you can configure remote models to serve as guardrails that detect harmful, offensive, or inappropriate content (toxicity) more accurately. Finally, a new ML inference processor enables users to enrich ingest pipelines using inferences from OpenSearch-provided pretrained models. For information on upgrading to OpenSearch 2.15, see the documentation. OpenSearch 2.15 is now available in all AWS Regions where Amazon OpenSearch Service is available.
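As a rough illustration of radial search, the helper below builds a k-NN query body that filters by distance or by score instead of asking for a fixed number of neighbors. It is a sketch under assumptions: `radial_knn_query` and the field name `my_vector` are hypothetical, and the `max_distance` and `min_score` keys follow the k-NN query format as we understand it from the OpenSearch documentation, so check them against the documentation for your version before use.

```python
import json
from typing import Optional, Sequence


def radial_knn_query(field: str, vector: Sequence[float],
                     max_distance: Optional[float] = None,
                     min_score: Optional[float] = None) -> dict:
    """Build a radial k-NN search body with exactly one threshold."""
    # Exactly one of the two thresholds must be supplied.
    if (max_distance is None) == (min_score is None):
        raise ValueError("provide exactly one of max_distance or min_score")

    params = {"vector": list(vector)}
    if max_distance is not None:
        params["max_distance"] = max_distance
    else:
        params["min_score"] = min_score

    return {"query": {"knn": {field: params}}}


# Example: all points within distance 2.0 of the query vector.
body = radial_knn_query("my_vector", [7.1, 8.3], max_distance=2.0)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a search request with whichever client you already use; the helper itself makes no network calls.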

AWS Blogs

AWS Japan Blog (Japanese)

AWS News Blog

AWS Cloud Operations Blog

AWS Big Data Blog

AWS Database Blog

AWS HPC Blog

AWS for Industries

AWS for M&E Blog

AWS Storage Blog

Open Source Project

AWS CLI

AWS CDK

Amplify for Android

Karpenter