12/3/2025, 12:00:00 AM ~ 12/4/2025, 12:00:00 AM (UTC)
Recent Announcements
Amazon SageMaker HyperPod now supports checkpointless training
Amazon SageMaker HyperPod now supports checkpointless training, a new foundation model training capability that removes the need for checkpoint-based, job-level restarts for fault recovery. Checkpointless training maintains forward training momentum despite failures, reducing recovery time from hours to minutes. This is a fundamental shift from traditional checkpoint-based recovery, where a failure requires pausing the entire training cluster, diagnosing the issue manually, and restoring from a saved checkpoint, a process that can leave expensive AI accelerators idle for hours and waste your organization's compute spend.

Checkpointless training transforms this paradigm by preserving the model training state across the distributed cluster, automatically swapping out faulty training nodes on the fly, and using peer-to-peer state transfer from healthy accelerators for failure recovery. By removing checkpoint dependencies during recovery, checkpointless training can help your organization save on idle AI accelerator costs and accelerate time-to-market. Even at larger scales, checkpointless training on Amazon SageMaker HyperPod enables upwards of 95% training goodput on clusters with thousands of AI accelerators.
Checkpointless training on SageMaker HyperPod is available in all AWS Regions where Amazon SageMaker HyperPod is currently available. You can enable checkpointless training with zero code changes using HyperPod recipes for popular publicly available models such as Llama and GPT OSS. For custom model architectures, you can integrate checkpointless training components with minimal modifications for PyTorch-based workflows, making it accessible to your teams regardless of their distributed training expertise.
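As a conceptual illustration of the peer-to-peer recovery idea (this is not the HyperPod API; the function name and rank-selection logic are hypothetical), the sketch below shows how a replacement worker in a plain PyTorch data-parallel job could rehydrate its model state by broadcasting parameters from a healthy peer with torch.distributed instead of reloading a checkpoint from storage.

```python
# Conceptual sketch of peer-to-peer state recovery in a PyTorch data-parallel
# job. This is NOT the SageMaker HyperPod API; it only illustrates restoring a
# replacement node's weights from a healthy peer instead of from a checkpoint.
import torch
import torch.distributed as dist


def recover_from_peer(model: torch.nn.Module, healthy_src_rank: int = 0) -> None:
    """Copy parameters and buffers from a healthy rank to all other ranks."""
    for tensor in list(model.parameters()) + list(model.buffers()):
        # Broadcast each tensor from the healthy source rank; replacement
        # ranks overwrite their empty or stale copies in place.
        dist.broadcast(tensor.data, src=healthy_src_rank)


# Hypothetical usage: when the orchestrator swaps in a replacement node, that
# node joins the process group and calls recover_from_peer(model) before
# resuming the next training step, so the rest of the cluster never waits on
# a checkpoint restore.
```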
To get started, visit the Amazon SageMaker HyperPod product page and see the checkpointless training GitHub page for implementation guidance.
New serverless model customization capability in Amazon SageMaker AI
Amazon Web Services (AWS) announces a new serverless model customization capability that empowers AI developers to quickly customize popular models with supervised fine-tuning and the latest techniques like reinforcement learning. Amazon SageMaker AI is a fully managed service that brings together a broad set of tools to enable high-performance, low-cost AI model development for any use case.

Many AI developers want to customize models with proprietary data for better accuracy, but doing so often involves lengthy iteration cycles: defining a use case and preparing data, selecting a model and customization technique, training the model, and then evaluating it for deployment. Now AI developers can simplify and accelerate the end-to-end model customization workflow, from data preparation to evaluation and deployment. With an easy-to-use interface, they can quickly get started and customize popular models, including Amazon Nova, Llama, Qwen, DeepSeek, and GPT-OSS, with their own data. They can use supervised fine-tuning and the latest customization techniques such as reinforcement learning and direct preference optimization. In addition, AI developers can use the AI agent-guided workflow (in preview) and use natural language to generate synthetic data, analyze data quality, and handle model training and evaluation, all entirely serverless.
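To make the data-preparation step concrete, here is a minimal sketch of a supervised fine-tuning dataset in the widely used JSONL prompt/completion style; the field names, file layout, and example records are illustrative assumptions, not the capability's required schema.

```python
# Illustrative only: writes a tiny supervised fine-tuning dataset as JSONL.
# The "prompt"/"completion" field names follow a common convention and are
# assumed here for illustration, not taken from the SageMaker AI schema.
import json

examples = [
    {
        "prompt": "Summarize the customer ticket: 'My order #1234 arrived damaged.'",
        "completion": "Order #1234 arrived damaged; the customer requests a replacement.",
    },
    {
        "prompt": "Classify the sentiment: 'The onboarding flow was painless.'",
        "completion": "positive",
    },
]

with open("sft_dataset.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")
```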
You can use this easy-to-use interface in the following AWS Regions: Europe (Ireland), US East (N. Virginia), Asia Pacific (Tokyo), and US West (Oregon). To join the waitlist to access the AI agent-guided workflow, visit the sign-up page.
To learn more, visit the SageMaker AI model customization page and blog.
Announcing TypeScript support in Strands Agents (preview) and more
In May, we open sourced the Strands Agents SDK, an open source Python framework that takes a model-driven approach to building and running AI agents in just a few lines of code. Today, we’re announcing that TypeScript support is available in preview. Now, developers can choose between Python and TypeScript for building Strands Agents.

TypeScript support in Strands has been designed to provide an idiomatic TypeScript experience with full type safety, async/await support, and modern JavaScript/TypeScript patterns. Strands can easily run in client applications, in browsers, and in server-side applications in runtimes like AWS Lambda and Bedrock AgentCore. Developers can also build their entire stack in TypeScript using the AWS CDK. We’re also announcing three additional updates for the Strands SDK. First, edge device support for Strands Agents is generally available, extending the SDK with bidirectional streaming and additional local model providers like llama.cpp that let you run agents on small-scale devices using local models. Second, Strands steering is now available as an experimental feature, giving developers a modular prompting mechanism that provides feedback to the agent at the right moment in its lifecycle, steering agents toward a desired outcome without rigid workflows. Finally, Strands evaluations is available in preview. Evaluations gives developers the ability to systematically validate agent behavior, measure improvements, and deploy with confidence during development cycles. Head to the Strands Agents GitHub to get started building.
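For comparison with the new TypeScript preview, here is a minimal sketch using the existing Strands Agents Python SDK; the system prompt and query are placeholders, and it assumes credentials for the default model provider are already configured in your environment.

```python
# Minimal Strands Agents example using the existing Python SDK.
# Assumes the default model provider (Amazon Bedrock) is configured in the
# environment; the prompt text is illustrative.
from strands import Agent

# Create an agent with a simple system prompt.
agent = Agent(system_prompt="You are a concise assistant for AWS questions.")

# Invoke the agent with a single query and print the response.
result = agent("What is checkpointless training on SageMaker HyperPod?")
print(result)
```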
Introducing elastic training on Amazon SageMaker HyperPod
Amazon SageMaker HyperPod now supports elastic training, enabling organizations to accelerate foundation model training by automatically scaling training workloads based on resource availability and workload priorities. This represents a fundamental shift from training with a fixed set of resources, saving hours of engineering time otherwise spent reconfiguring training jobs as compute availability changes.

Previously, any change in compute availability required manually halting training, reconfiguring training parameters, and restarting jobs, a process that requires distributed training expertise and leaves expensive AI accelerators sitting idle during the reconfiguration. Elastic training automatically expands training jobs to absorb idle AI accelerators and seamlessly contracts them when higher-priority workloads need the resources, all without halting training entirely.
By eliminating manual reconfiguration overhead and ensuring continuous utilization of available compute, elastic training can help save time previously spent on infrastructure management, reduce costs by maximizing cluster utilization, and accelerate time-to-market. Training can start immediately with minimal resources and grow opportunistically as capacity becomes available.
Elastic training on SageMaker HyperPod is available in all AWS Regions where Amazon SageMaker HyperPod is currently available. Organizations can enable elastic training with zero code changes using HyperPod recipes for publicly available models including Llama and GPT OSS. For custom model architectures, customers can integrate elastic training capabilities through lightweight configuration updates and minimal code modifications, making it accessible to teams without requiring distributed systems expertise.
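As a conceptual illustration (not the HyperPod API; the function and variable names are hypothetical), the sketch below shows the batch-size arithmetic behind elastic scaling: when the number of participating accelerators changes, the gradient accumulation steps are recomputed so the effective global batch size, and therefore the training dynamics, stay constant.

```python
# Conceptual sketch of how an elastic data-parallel job can keep its global
# batch size fixed as accelerators join or leave. Names are illustrative and
# not part of SageMaker HyperPod.

def rescale_plan(global_batch_size: int, per_device_batch_size: int, world_size: int) -> dict:
    """Return per-device settings for the current number of accelerators."""
    samples_per_step = per_device_batch_size * world_size
    # Accumulate gradients over enough micro-steps to reach the global batch.
    grad_accum_steps = max(1, global_batch_size // samples_per_step)
    return {
        "world_size": world_size,
        "per_device_batch_size": per_device_batch_size,
        "grad_accum_steps": grad_accum_steps,
        "effective_batch": per_device_batch_size * world_size * grad_accum_steps,
    }

# Example: the job starts on 64 accelerators, then expands to 256 when idle
# capacity appears; the effective global batch size stays at 2048 either way.
print(rescale_plan(global_batch_size=2048, per_device_batch_size=4, world_size=64))
print(rescale_plan(global_batch_size=2048, per_device_batch_size=4, world_size=256))
```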
To get started, visit the Amazon SageMaker HyperPod product page and see the elastic training documentation for implementation guidance.
Amazon Bedrock now supports reinforcement fine-tuning
Amazon Bedrock now supports reinforcement fine-tuning, helping you improve model accuracy without needing deep machine learning expertise or large amounts of labeled data. Amazon Bedrock automates the reinforcement fine-tuning workflow, making this advanced model customization technique accessible to everyday developers. Models learn to align with your specific requirements using a small set of prompts rather than the large datasets needed for traditional fine-tuning methods, enabling teams to get started quickly. This capability teaches models through feedback on multiple possible responses to the same prompt, improving their judgment of what makes a good response. Reinforcement fine-tuning in Amazon Bedrock delivers 66% accuracy gains on average over base models, so you can use smaller, faster, and more cost-effective model variants while maintaining high quality.

Organizations struggle to adapt AI models to their unique business needs, forcing them to choose between generic models with average performance or expensive, complex customization that requires specialized talent, infrastructure, and risky data movement. Reinforcement fine-tuning in Amazon Bedrock removes this complexity by making advanced model customization fast, automated, and secure. You can train models by uploading training data directly from your computer or by choosing from datasets already stored in Amazon S3, eliminating the need for any labeled datasets. You can define reward functions using verifiable rule-based graders or AI-based judges, along with built-in templates, to optimize your models for both objective tasks such as code generation or math reasoning and subjective tasks such as instruction following or chatbot interactions. Your proprietary data never leaves AWS’s secure, governed environment during the entire customization process, mitigating security and compliance concerns.
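To illustrate what a verifiable rule-based grader can look like (the function shape and names here are hypothetical, not the Amazon Bedrock interface), the sketch below scores a model response for a math task by checking it against a programmatically verifiable answer.

```python
# Illustrative rule-based grader for reinforcement fine-tuning on a verifiable
# task. The signature is hypothetical and not the Amazon Bedrock API; it only
# shows the idea of assigning a reward by checking the response.
import re


def grade_math_response(prompt: str, response: str, expected_answer: str) -> float:
    """Return 1.0 if the response's final number matches the expected answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0  # No numeric answer found at all.
    return 1.0 if numbers[-1] == expected_answer else 0.0


# Example: two candidate responses to the same prompt receive different
# rewards, which is the feedback signal reinforcement fine-tuning learns from.
prompt = "What is 17 * 12?"
print(grade_math_response(prompt, "17 * 12 = 204", "204"))       # 1.0
print(grade_math_response(prompt, "The answer is 214.", "204"))  # 0.0
```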
You can get started with reinforcement fine-tuning in Amazon Bedrock through the Amazon Bedrock console and via the Amazon Bedrock APIs. At launch, you can use reinforcement fine-tuning with Amazon Nova 2 Lite, with support for additional models coming soon. To learn more about reinforcement fine-tuning in Amazon Bedrock, see the launch blog, pricing page, and documentation.
YouTube
AWS Black Belt Online Seminar (Japanese)
AWS Blogs
AWS Japan Blog (Japanese)
- Trust is mutual: Amazon CloudFront supports mTLS
- Introducing AWS Transform Custom: Eliminating Technical Debt with AI-Powered Code Modernization
- Multi-key support for Amazon DynamoDB global secondary indexes
- Improve data modeling accuracy with the Amazon DynamoDB Data Model Validation Tool
- How Octus reduced infrastructure costs by 85% with a zero-downtime migration to Amazon OpenSearch Service
- Amazon OpenSearch Service Improves Vector Database Performance and Costs with GPU Acceleration and Automated Optimization
- Introducing Cluster Insights: Amazon OpenSearch Service Integrated Monitoring Dashboard for Clusters
- Introducing Amazon OpenSearch lenses for the AWS Well-Architected Framework
- Amazon Kinesis Data Streams Supports 10x Larger Record Sizes: Simplifying Real-time Data Processing
- Amazon MSK Express brokers support intelligent rebalancing, with rebalancing operations up to 180 times faster
AWS News Blog
- Amazon Bedrock adds reinforcement fine-tuning simplifying how developers build smarter, more accurate AI models
- New serverless customization in Amazon SageMaker AI accelerates model fine-tuning
- Introducing checkpointless and elastic training on Amazon SageMaker HyperPod