AWS AI, the practical way
An architecture-first reference for the Amazon AI stack as of June 2026. From Amazon Bedrock and the Nova model family, to Bedrock AgentCore for production agents, to SageMaker for custom models. Trade-offs, pricing shape, and risks. No marketing.
AWS's AI story in 2026 has three layers. Amazon Bedrock is the managed gateway to dozens of foundation models (Anthropic Claude, Meta Llama, Mistral, DeepSeek, NVIDIA Nemotron, and Amazon's own Nova) behind one API, with Knowledge Bases, Guardrails, customization, and evaluations. Bedrock AgentCore turned the 2025 agent preview into a managed runtime for production agents. SageMaker - now the unified center for data + analytics + AI - is where you train, fine-tune, and host custom models. If you already run on AWS, the data-gravity, IAM integration, and custom-silicon economics make this stack the path of least resistance.
The AWS AI mental model
What sets AWS apart in 2026
| Differentiator | What it means in practice |
|---|---|
| Widest managed model catalog | Bedrock fronts Anthropic Claude, Meta Llama, Mistral, Cohere, AI21, DeepSeek, NVIDIA Nemotron, Stability, and Amazon Nova behind one API and one bill. Switching models is a parameter change, not a re-architecture. |
| Anthropic relationship + Trainium | Deep Anthropic partnership (Project Rainier Trainium clusters) means frontier Claude models are first-class on Bedrock, often with strong price/perf on AWS silicon. |
| AgentCore as managed runtime | Memory, Gateway (tools/MCP), Identity, Observability, Browser, Code Interpreter, Web Search, and Payments (preview) - framework-agnostic (Strands, LangChain, OpenAI Agents SDK, Claude Agent SDK). |
| Data gravity + IAM | If your data is already in S3/Redshift/Aurora, RAG ground truth and access control are native. No new identity plane. |
| Custom silicon economics | Trainium/Inferentia give a cost lever for training and high-volume inference that pure-GPU clouds cannot match on price. |
Where AWS is weaker (be honest)
How to read this portal
Each flagship service tab has sub-tabs (Overview / Architecture / Capabilities / Pricing / Risks / When to use) with a reference-architecture diagram. If you only read one sub-tab, read Risks. The others tell you what something does; Risks tells you what bites you in production.
What's New - late 2025 through June 2026
Material changes that affect architecture, cost, or risk. Curated, not a press-release dump.
The dominant 2026 theme is agents going to production: Bedrock AgentCore added managed Knowledge Bases, a managed agent harness, native Web Search, and (preview) autonomous Payments. Model breadth widened (NVIDIA Nemotron 3 Super on Bedrock, Nova Forge for Nova customization, Reinforcement Fine-Tuning). And SageMaker was repositioned as the unified data+AI center, with SageMaker Unified Studio now GA and Amazon Q Developer embedded throughout.
| Date | Release | Why it matters |
|---|---|---|
| Dec 2025 | Next-gen SageMaker + Unified Studio (re:Invent) | SageMaker repositioned as the single center for data, analytics, and AI - Glue, EMR, Athena, Redshift, Bedrock, and SageMaker AI in one workspace with a lakehouse. |
| Dec 2025 | Trainium3 announced | Next-gen training/inference silicon; continues AWS's price/perf lever vs pure-GPU stacks. Confirm region/instance availability before designing around it. |
| Jan 2026 | SageMaker Unified Studio GA + Amazon Q Developer GA in Studio | Data professionals get GenAI assistance across the lifecycle; Bedrock and SageMaker AI usable from one IDE. |
| Feb 2026 | Reinforcement Fine-Tuning in Bedrock | Tailor models to narrow tasks with reward signals - higher accuracy on domain workflows without full training. |
| Mar 2026 | NVIDIA Nemotron 3 Super on Bedrock; Nova Forge SDK | Open-weight frontier reasoning model managed; Nova Forge lets enterprises customize Nova on their data and deploy inside Bedrock. |
| Apr 2026 | AgentCore Payments (preview) | Agents can autonomously pay for APIs, MCP servers, web content, and other agents - built with Coinbase and Stripe. New control-plane and audit considerations. |
| May 2026 | Agent Toolkit for AWS; AgentCore managed harness | Declare and run an agent in ~3 API calls, no orchestration code. Lowers time-to-first-agent dramatically. |
| Jun 2026 | AWS Summit NY: Managed Knowledge Bases (Smart Parsing, Agentic Retriever), Web Search on AgentCore (GA), Amazon Quick, S3 Annotations, EC2 G7 (RTX PRO Blackwell) | Fully-managed RAG with multi-format parsing; grounded answers with zero data egress; mutable per-object context in S3; new inference GPU tier. |
Service Map
The AWS AI services worth knowing, grouped by what you do with them.
Managed multi-model API: catalog, Knowledge Bases, Guardrails, Flows, Evaluations, customization, AgentCore.
Amazon's own FM family: Micro, Lite, Pro, Premier, plus Canvas (image), Reel (video), Sonic (speech). Forge to customize.
Runtime, Memory, Gateway, Identity, Observability, Browser, Code Interpreter, Web Search, Payments (preview).
Train, fine-tune, host custom models; HyperPod for FM training; one studio over data+analytics+AI.
Q Developer (coding/ops agent), Q Business (enterprise RAG assistant), Q in QuickSight/Connect, Amazon Quick.
Rekognition, Textract, Comprehend, Transcribe, Polly, Translate, Lex, Kendra, Personalize.
S3 Vectors, OpenSearch vector, Aurora/RDS pgvector, MemoryDB, Kendra GenAI Index, Bedrock Data Automation.
Trainium2/3, Inferentia2, EC2 P5/P6 (Blackwell), G7, UltraClusters, Capacity Blocks.
Content filters, denied topics, PII redaction, contextual grounding, Automated Reasoning checks.
Amazon Bedrock
The managed, serverless gateway to foundation models. One API, one IAM model, one bill, many vendors.
Bedrock exposes many foundation models through a unified API (Converse / InvokeModel). You never manage servers; you pay per token on-demand, reserve capacity with Provisioned Throughput, or run Batch. Around the models sit Knowledge Bases (managed RAG), Guardrails (independent safety), Flows (orchestration), Evaluations, and customization. It is the default starting point for almost any GenAI workload on AWS.
What problem this solves
Most teams don't want to run GPU fleets, manage model weights, stand up a guardrail service, wire a vector store, and negotiate vendor contracts separately. Bedrock's offer is one IAM-governed, serverless surface where you swap Claude / Llama / Nova with a parameter, apply the same Guardrail policy across models, and keep data inside your AWS account. The trade-off is that exact model and feature availability varies by region - confirm before you design.
Two consumption modes, plus batch
| Mode | How you pay | Best for |
|---|---|---|
| On-demand | Per input/output token, no commitment. | Prototyping, variable/low volume, model comparison. |
| Provisioned Throughput | Reserved model units (hourly + term commitments). | Steady high volume needing predictable latency and cost. |
| Batch | Discounted vs on-demand, asynchronous. | Large non-interactive jobs: enrichment, classification, embedding generation. |
Reference architecture
Network and identity
Bedrock is reachable over PrivateLink (VPC endpoints) so model traffic never traverses the public internet. Authorization is IAM: scope policies to specific models, Knowledge Bases, Guardrails, and agents; callers use roles (instance/task/Lambda execution roles). Encrypt with KMS and prefer customer-managed keys for regulated data.
Where the data goes
AWS's stated position is that your prompts and completions are not used to train the base foundation models and stay within your account and region. You control model-invocation logging to CloudWatch/S3. For data residency, pin the region and confirm the model is available there; cross-region inference can move data across regions, so weigh it against compliance needs.
Capability matrix (June 2026)
| Capability | Status | Notes |
|---|---|---|
| Model catalog + Marketplace | ● | First-party + partner FMs; 100+ via Marketplace; import custom weights. |
| Knowledge Bases (RAG) | ● | Managed RAG; 2026 adds Smart Parsing + Agentic Retriever. |
| Guardrails | ● | Content filters, denied topics, PII, contextual grounding, Automated Reasoning checks. |
| Flows | ● | Visual orchestration of prompts, models, KBs, Lambda. |
| Evaluations | ● | Automatic + LLM-as-judge model and RAG evaluation. |
| Customization | ● | Fine-tuning, continued pre-training, distillation, Reinforcement Fine-Tuning. |
| Prompt caching / cross-region | ● | Cut cost/latency on repeated context; auto-route to capacity in other regions. |
| Batch inference | ● | Discounted asynchronous processing at scale. |
| AgentCore | ● | Managed agent runtime - its own tab. |
How Bedrock bills
| Lever | How it bills | Control |
|---|---|---|
| On-demand | Per input/output token, per model. | Right-size model per task; cache prompts; cap output tokens. |
| Provisioned Throughput | Reserved model units (hourly + term). | Commit only after you know the steady load. |
| Batch | Discounted vs on-demand. | Use for non-interactive enrichment. |
| Knowledge Bases | Storage + query + embedding tokens (+ vector store). | Tune chunk size; prune stale docs; pick S3 Vectors for cost. |
- Use Bedrock for almost any GenAI workload on AWS - model optionality, managed RAG/guardrails/evals, no infrastructure.
- Drop to SageMaker only when you need custom training, exotic hosting, or a model not in the catalog.
- Add AgentCore when the workload is an agent heading to production.
- Lead with a small model + prompt caching for cost; reserve Provisioned Throughput once volume is steady.
Foundation Model Catalog
Indicative view of model families on Bedrock in 2026. Exact versions and regions change frequently - confirm in the console.
| Provider | Families | Typical use |
|---|---|---|
| Anthropic | Claude (Opus / Sonnet / Haiku tiers) | Top-tier reasoning, agents, coding, long context. The frontier default on Bedrock. |
| Amazon | Nova Micro / Lite / Pro / Premier; Canvas, Reel, Sonic | Cost/latency-optimized text and multimodal; image, video, speech generation. |
| Meta | Llama (open weights) | Open-weight workloads, customization, on-prem parity. |
| Mistral | Mistral / Mixtral | Efficient European open-weight options. |
| DeepSeek | DeepSeek-R1 and successors | Strong open reasoning at low cost. |
| NVIDIA | Nemotron 3 (Super) | Open-weight frontier reasoning/agentic, hosted managed. |
| Cohere / AI21 / Stability | Command / Embed / Rerank, Jamba, Stable Diffusion / Image | Embeddings, reranking, long-context, image generation. |
Amazon Nova
Amazon's own foundation-model family - optimized for price, latency, and AWS integration.
| Model | Modality | Best for |
|---|---|---|
| Nova Micro | Text | Cheapest, fastest text - classification, routing, simple extraction at scale. |
| Nova Lite | Multimodal (text+image/video in) | Low-cost multimodal understanding, high-volume workloads. |
| Nova Pro | Multimodal | Balanced capability/cost for most enterprise tasks and agents. |
| Nova Premier | Multimodal, most capable | Complex reasoning; also the teacher model for distillation. |
| Nova Canvas | Image generation | Studio-quality images with content credentials/watermarking. |
| Nova Reel | Video generation | Short-form video from text/image prompts. |
| Nova Sonic | Speech-to-speech | Real-time voice interactions with low latency. |
Amazon Bedrock AgentCore GA
The managed runtime for production agents. Framework-agnostic - bring Strands, LangChain, OpenAI Agents SDK, or the Claude Agent SDK.
AgentCore makes the hard parts of running agents - memory, tool auth, networking, identity, tracing, and safety - managed concerns, and standardizes the integration surface (MCP, OpenAPI). The 2026 managed harness lets you declare and run an agent in ~3 API calls with no orchestration code. It is framework-agnostic, so you keep your agent logic and let AWS run the plumbing.
What problem this solves
Hand-built agent loops prototype fast and operate badly: state, retries, tool credentials, private networking, observability, and guardrails all become your code. AgentCore turns those into managed modules you opt into, and adds capabilities most teams can't easily build - a managed Browser, sandboxed Code Interpreter, zero-egress Web Search, and (preview) autonomous Payments.
Module map
| Module | What it gives you | Status |
|---|---|---|
| Runtime / Harness | Managed serverless execution; declare and run an agent in ~3 API calls, no orchestration code. | GA |
| Memory | Short-term and long-term memory so agents retain context across turns and sessions. | GA |
| Gateway | Turn APIs, Lambda, and MCP servers into governed agent tools with auth and access control. | GA |
| Identity | Scoped, least-privilege access for agents; policies verified by Automated Reasoning (same tech as IAM/S3). | GA |
| Observability | Traces of every step and tool call, and where the agent went off track; evaluation vs real traffic. | GA |
| Browser & Code Interpreter | Headless browsing and sandboxed code execution as managed tools. | GA |
| Web Search | Grounded, cited answers from the live web with zero data egress from your AWS environment. | GA |
| Payments | Agents autonomously pay for APIs, MCP servers, content, and other agents (Coinbase/Stripe). | Preview |
- Use AgentCore for any agent heading to production - managed memory, tools, identity, and observability beat hand-rolled glue.
- Adopt modules incrementally - you don't need all of them; start with Runtime + Memory + Gateway + Observability.
- Gate Payments behind budgets and human approval; pilot before trusting autonomous spend.
- Pair with Bedrock Guardrails so safety holds regardless of the model the agent uses.
AWS vs OCI vs Azure vs GCP
A practitioner's quick read. Every cloud does the basics; the differences are in defaults, data gravity, and silicon.
| Dimension | AWS | OCI | Azure | GCP |
|---|---|---|---|---|
| Model breadth (managed) | Widest (Bedrock) | Broad (OCI Gen AI) | Foundry Models (1000+) | Model Garden (200+) |
| Frontier own model | Nova (mid); Claude hosted | None (partners) | OpenAI GPT-5.x | Gemini 3.x |
| Agents | AgentCore | Enterprise AI Agents | Foundry Agent Service | Agent Platform + A2A |
| Custom silicon | Trainium/Inferentia | GPU (NVIDIA) | Maia (emerging) | TPU (Ironwood/8th) |
| Data gravity | S3 / Redshift | Oracle DB 26ai (in-DB vectors) | Fabric / OneLake | BigQuery |
| Best when | Already on AWS; want model choice + silicon economics | Run Oracle DB/EBS; want in-DB vectors + sovereignty | Microsoft shop; want OpenAI + M365 | BigQuery/Workspace central; want Gemini + TPU |
Sources
Primary AWS material used for this portal (June 2026). Verify specifics against current docs before committing - this space moves weekly.
- Top announcements - AWS Summit New York 2026
- Amazon Bedrock AgentCore and AgentCore release notes
- AgentCore Payments (preview)
- Nemotron 3 Super on Bedrock; Nova Forge SDK (Mar 2026)
- SageMaker Unified Studio - GA
- Amazon Bedrock · Bedrock or SageMaker decision guide
Knowledge Bases & RAG
Managed retrieval-augmented generation - the most common enterprise GenAI pattern.
| Use Knowledge Bases when | Build your own when |
|---|---|
| You want managed RAG with minimal code, the corpus is mostly documents, and Smart Parsing / built-in retrieval quality matter. | You need fine control over chunking, hybrid search, custom rerankers, or a vector store you already operate. |
Guardrails
An independent safety layer you apply to any model - first-party or imported.
| Control | What it catches |
|---|---|
| Content filters | Hate, insults, sexual, violence, misconduct, prompt attacks - tunable thresholds. |
| Denied topics | Block subjects out of scope for your application. |
| Sensitive info / PII | Detect and redact or block PII and custom regex patterns. |
| Contextual grounding | Score answers for grounding against source and relevance to the query - reduce hallucination. |
| Automated Reasoning checks | Mathematically verify outputs against encoded policies/rules - high-assurance domains. |
SageMaker AI
Where you train, fine-tune, and host models when Bedrock's managed path isn't enough.
SageMaker AI is the full-control path: managed training jobs, real-time / serverless / async / batch endpoints, JumpStart for one-click model deploy/fine-tune, HyperPod for large-scale FM training, and MLOps (Pipelines, Model Registry, Clarify, Model Monitor). Reach for it when Bedrock's managed surface can't express what you need - custom training, exotic hosting, or a model outside the catalog.
Bedrock vs SageMaker
Bedrock = consume and customize managed models, fast, serverless. SageMaker = own the training, hosting, and MLOps. Many teams use both: Bedrock for the app, SageMaker for the custom model behind it. Start in Bedrock; drop here only when you must.
Reference architecture
| Component | Use |
|---|---|
| JumpStart | One-click deploy/fine-tune of open and partner foundation models. |
| Training & Inference | Managed training jobs; real-time / serverless / async / batch endpoints with autoscaling. |
| HyperPod | Resilient, self-healing clusters for FM pre-training and heavy fine-tuning across thousands of accelerators. |
| Pipelines / Model Registry | Reproducible MLOps pipelines, lineage, approval gates, deployment. |
| Clarify / Model Monitor | Bias/explainability and production drift detection. |
- Use SageMaker for custom training, specialized hosting, large-scale FM training (HyperPod), or models outside the Bedrock catalog.
- Stay in Bedrock for consume/customize-managed; come here only for full control.
- Use Unified Studio as the front door tying data, analytics, and these AI tools together.
SageMaker Unified Studio GA
The single workspace over data, analytics, and AI - the front door to the next-gen SageMaker.
Unified Studio brings EMR, Glue, Athena, Redshift, Bedrock, and SageMaker AI into one IDE on a lakehouse foundation, with Amazon Q Developer embedded for code, troubleshooting, and ETL. It stitches the data and AI lifecycles together so the same governed data powers analytics and model building, and it replaces the older SageMaker Studio Classic experience.
Model Customization
Four ways to make a model better at your task, from cheapest to most involved.
| Technique | When | Cost/effort |
|---|---|---|
| Prompt + RAG | Most tasks - ground the model in your data without changing weights. | Low |
| Fine-tuning | Consistent style/format or narrow task accuracy from labeled examples. | Medium |
| Reinforcement Fine-Tuning | Optimize toward a reward signal where correctness is checkable (2026). | Medium-High |
| Distillation | Teach a small, cheap model from a large one - keep quality, cut cost/latency. | Medium |
| Continued pre-training | Inject large domain corpora; rarely needed for most enterprises. | High |
Amazon Q
AWS's family of GenAI assistants for developers, businesses, and operations.
| Product | What it does |
|---|---|
| Q Developer | Agentic coding and ops assistant - code generation, transformation/modernization, troubleshooting, AWS console help. Embedded in IDEs and SageMaker Unified Studio. |
| Q Business | Enterprise RAG assistant over your apps and documents (40+ connectors), with access controls inherited from the source systems. |
| Amazon Quick | 2026 evolution toward autonomous background agents with specialized expertise; an activity feed across email, messaging, calendar, and tasks. |
| Q in QuickSight / Connect | Natural-language BI and contact-center assistance embedded in those services. |
Applied AI Services
Task-specific managed APIs - no model selection, just call them.
| Service | Task |
|---|---|
| Rekognition | Image/video analysis: labels, faces, moderation, text-in-image. |
| Textract | Document extraction: text, forms, tables, queries from PDFs/images. |
| Comprehend | NLP: entities, sentiment, key phrases, PII, custom classification. |
| Transcribe | Speech-to-text with diarization, custom vocabulary, call analytics. |
| Polly | Text-to-speech with neural and generative voices. |
| Translate | Neural machine translation across many languages. |
| Lex | Conversational bots (the engine behind many IVR/chat flows). |
| Kendra | Enterprise search; the GenAI Index feeds RAG with permission-aware retrieval. |
| Personalize | Real-time recommendations from your interaction data. |
Vectors & Data
Where your embeddings and ground-truth live. Pick by scale, latency, and what you already run.
| Store | Best for |
|---|---|
| S3 Vectors | Cost-optimized vector storage/query at massive scale directly in S3 (2026) - cheapest for large, less latency-sensitive corpora. |
| OpenSearch Serverless (vector) | Low-latency hybrid (keyword + vector) search; the common Knowledge Base default. |
| Aurora / RDS PostgreSQL (pgvector) | Vectors next to relational data with transactional consistency. |
| MemoryDB / DocumentDB / Neptune Analytics | In-memory vectors, document-store vectors, and graph+vector analytics respectively. |
| Kendra GenAI Index | Managed, permission-aware retrieval index purpose-built for RAG. |
Chips & GPUs
The silicon under the stack. AWS's custom chips are the cost lever; NVIDIA GPUs are the compatibility lever.
| Silicon | Role |
|---|---|
| Trainium2 / Trainium3 | AWS training (and increasingly inference) accelerators; Trn2 UltraServers and Project Rainier power large Anthropic/enterprise training at strong price/perf. |
| Inferentia2 | Cost-efficient high-volume inference. |
| EC2 P5 / P6 (NVIDIA Blackwell) | Top-end GPU training/inference; maximum framework compatibility. |
| EC2 G7 (RTX PRO Blackwell) | 2026 graphics/inference tier for cost-effective serving and visual workloads. |
| UltraClusters / Capacity Blocks / HyperPod | Network-dense GPU/Trainium fabrics; reserve capacity windows; resilient FM-training clusters. |
Architecture Patterns
The handful of shapes most AWS GenAI workloads fall into.
Bedrock + Knowledge Base (OpenSearch/S3 Vectors) + Guardrails, fronted by API Gateway/Lambda or Q Business. The default enterprise knowledge assistant.
AgentCore Runtime + Memory + Gateway (tools/MCP) + Identity + Observability. Add Web Search for grounding. Human-in-the-loop on high-impact actions.
SageMaker fine-tune/host (or import to Bedrock) behind a private endpoint; distill to cut cost once quality is proven.
Bedrock Data Automation extracts from docs/images/audio/video into structured output feeding a Knowledge Base or warehouse.
Bedrock batch inference over large datasets in S3 for classification, summarization, or embedding generation at lowest cost.
Q in QuickSight/Connect, or Q Developer in the SDLC - buy the assistant rather than build it.
Decision Matrix
Fast answers to the questions that come up in every design review.
| Question | Default answer |
|---|---|
| Consume a model or train one? | Consume via Bedrock. Train/fine-tune in SageMaker only with evidence the base model can't meet the bar. |
| Which model? | Claude for hardest reasoning/agents; Nova for cost/latency; Llama/Mistral/DeepSeek for open-weight/customization. A/B on the same Bedrock API. |
| Build RAG or use Knowledge Bases? | Knowledge Bases unless you need bespoke chunking/hybrid/rerank control. |
| Bedrock Agents or AgentCore? | AgentCore for anything heading to production - managed memory, tools, identity, observability. |
| Which vector store? | OpenSearch Serverless default; S3 Vectors for cost at scale; pgvector for vectors beside relational data. |
| Buy an assistant or build? | Q Business/Q Developer first; build on Bedrock when you need custom UX/logic. |
| GPU or AWS silicon? | Trainium/Inferentia for cost at volume; NVIDIA for specific framework/CUDA needs. |
Pricing & Cost Control
Shape, not exact numbers - rates change and vary by model/region. Always confirm in the AWS pricing pages.
| Lever | How it bills | Control |
|---|---|---|
| Bedrock on-demand | Per input/output token, per model. | Right-size model per task; cache prompts; cap output tokens. |
| Provisioned Throughput | Reserved model units (hourly). | For steady high volume; commit only after you know the load. |
| Batch inference | Discounted vs on-demand. | Use for non-interactive enrichment jobs. |
| Knowledge Bases / OpenSearch | Storage + query + embedding tokens. | Tune chunk size; prune stale docs; pick S3 Vectors for cost. |
| SageMaker | Training + endpoint instance-hours. | Serverless/async endpoints; autoscale to zero where possible. |
| Agents | Model tokens x steps + tool calls. | Cap loop length; cheap model for routing, strong model only when needed. |
Risks & Gotchas
Read this one. What actually bites teams in production.