AWS AI, the practical way

An architecture-first reference for the Amazon AI stack as of June 2026. From Amazon Bedrock and the Nova model family, to Bedrock AgentCore for production agents, to SageMaker for custom models. Trade-offs, pricing shape, and risks. No marketing.

Refreshed June 2026Architecture-firstEnterprise focusVendor-neutral

Naming, 2026

No big rebrand on AWS this year - the shift is in the agent story: Bedrock Agents (2025) became Bedrock AgentCore, a managed runtime with Memory, Gateway, Identity, Observability, Web Search, and (preview) Payments. SageMaker was repositioned as the unified center for data + analytics + AI, with SageMaker Unified Studio now GA.

TL;DR

AWS's AI story in 2026 has three layers. Amazon Bedrock is the managed gateway to dozens of foundation models (Anthropic Claude, Meta Llama, Mistral, DeepSeek, NVIDIA Nemotron, and Amazon's own Nova) behind one API, with Knowledge Bases, Guardrails, customization, and evaluations. Bedrock AgentCore turned the 2025 agent preview into a managed runtime for production agents. SageMaker - now the unified center for data + analytics + AI - is where you train, fine-tune, and host custom models. If you already run on AWS, the data-gravity, IAM integration, and custom-silicon economics make this stack the path of least resistance.

The AWS AI mental model

Figure 1 - The AWS AI stack is layered. Start at Layer 3/2 (Bedrock); drop to Layer 1 only when you need custom training or chips.

What sets AWS apart in 2026

Differentiator	What it means in practice
Widest managed model catalog	Bedrock fronts Anthropic Claude, Meta Llama, Mistral, Cohere, AI21, DeepSeek, NVIDIA Nemotron, Stability, and Amazon Nova behind one API and one bill. Switching models is a parameter change, not a re-architecture.
Anthropic relationship + Trainium	Deep Anthropic partnership (Project Rainier Trainium clusters) means frontier Claude models are first-class on Bedrock, often with strong price/perf on AWS silicon.
AgentCore as managed runtime	Memory, Gateway (tools/MCP), Identity, Observability, Browser, Code Interpreter, Web Search, and Payments (preview) - framework-agnostic (Strands, LangChain, OpenAI Agents SDK, Claude Agent SDK).
Data gravity + IAM	If your data is already in S3/Redshift/Aurora, RAG ground truth and access control are native. No new identity plane.
Custom silicon economics	Trainium/Inferentia give a cost lever for training and high-volume inference that pure-GPU clouds cannot match on price.

Where AWS is weaker (be honest)

Own frontier model

Amazon Nova is competitive on price/latency and improving fast, but it is not the model you reach for when you need the absolute top of the reasoning leaderboard - that is usually Claude (also on Bedrock) or a competitor's flagship. Amazon's bet is breadth, integration, and silicon economics, not owning the #1 model.

Surface area & sprawl

The catalog of overlapping services (Bedrock vs SageMaker vs Q vs applied-AI, three vector stores, two studios) is large. Picking the right primitive is itself an architecture decision - see the Decision Matrix.

How to read this portal

Each flagship service tab has sub-tabs (Overview / Architecture / Capabilities / Pricing / Risks / When to use) with a reference-architecture diagram. If you only read one sub-tab, read Risks. The others tell you what something does; Risks tells you what bites you in production.

What's New - late 2025 through June 2026

Material changes that affect architecture, cost, or risk. Curated, not a press-release dump.

TL;DR

The dominant 2026 theme is agents going to production: Bedrock AgentCore added managed Knowledge Bases, a managed agent harness, native Web Search, and (preview) autonomous Payments. Model breadth widened (NVIDIA Nemotron 3 Super on Bedrock, Nova Forge for Nova customization, Reinforcement Fine-Tuning). And SageMaker was repositioned as the unified data+AI center, with SageMaker Unified Studio now GA and Amazon Q Developer embedded throughout.

Date	Release	Why it matters
Dec 2025	Next-gen SageMaker + Unified Studio (re:Invent)	SageMaker repositioned as the single center for data, analytics, and AI - Glue, EMR, Athena, Redshift, Bedrock, and SageMaker AI in one workspace with a lakehouse.
Dec 2025	Trainium3 announced	Next-gen training/inference silicon; continues AWS's price/perf lever vs pure-GPU stacks. Confirm region/instance availability before designing around it.
Jan 2026	SageMaker Unified Studio GA + Amazon Q Developer GA in Studio	Data professionals get GenAI assistance across the lifecycle; Bedrock and SageMaker AI usable from one IDE.
Feb 2026	Reinforcement Fine-Tuning in Bedrock	Tailor models to narrow tasks with reward signals - higher accuracy on domain workflows without full training.
Mar 2026	NVIDIA Nemotron 3 Super on Bedrock; Nova Forge SDK	Open-weight frontier reasoning model managed; Nova Forge lets enterprises customize Nova on their data and deploy inside Bedrock.
Apr 2026	AgentCore Payments (preview)	Agents can autonomously pay for APIs, MCP servers, web content, and other agents - built with Coinbase and Stripe. New control-plane and audit considerations.
May 2026	Agent Toolkit for AWS; AgentCore managed harness	Declare and run an agent in ~3 API calls, no orchestration code. Lowers time-to-first-agent dramatically.
Jun 2026	AWS Summit NY: Managed Knowledge Bases (Smart Parsing, Agentic Retriever), Web Search on AgentCore (GA), Amazon Quick, S3 Annotations, EC2 G7 (RTX PRO Blackwell)	Fully-managed RAG with multi-format parsing; grounded answers with zero data egress; mutable per-object context in S3; new inference GPU tier.

Practical read

If you piloted Bedrock Agents in 2025, plan a migration review to AgentCore: the managed Memory, Gateway, Identity, and Observability replace a lot of custom glue. If you run SageMaker Studio (classic), plan the move to Unified Studio.

Service Map

The AWS AI services worth knowing, grouped by what you do with them.

COREAmazon Bedrock

Managed multi-model API: catalog, Knowledge Bases, Guardrails, Flows, Evaluations, customization, AgentCore.

MODELSAmazon Nova

Amazon's own FM family: Micro, Lite, Pro, Premier, plus Canvas (image), Reel (video), Sonic (speech). Forge to customize.

AGENTSBedrock AgentCore

Runtime, Memory, Gateway, Identity, Observability, Browser, Code Interpreter, Web Search, Payments (preview).

BUILDSageMaker AI + Unified Studio

Train, fine-tune, host custom models; HyperPod for FM training; one studio over data+analytics+AI.

ASSISTAmazon Q

Q Developer (coding/ops agent), Q Business (enterprise RAG assistant), Q in QuickSight/Connect, Amazon Quick.

APPLIEDApplied AI

Rekognition, Textract, Comprehend, Transcribe, Polly, Translate, Lex, Kendra, Personalize.

DATAVectors & Data

S3 Vectors, OpenSearch vector, Aurora/RDS pgvector, MemoryDB, Kendra GenAI Index, Bedrock Data Automation.

SILICONChips & GPUs

Trainium2/3, Inferentia2, EC2 P5/P6 (Blackwell), G7, UltraClusters, Capacity Blocks.

GOVERNGuardrails

Content filters, denied topics, PII redaction, contextual grounding, Automated Reasoning checks.

How to read this

The flagship services (Amazon Bedrock, Bedrock AgentCore, SageMaker AI) carry full sub-tabs - Overview / Architecture / Capabilities / Pricing / Risks / When-to-use - with reference-architecture diagrams. Secondary services use a single rich page with the same architecture-first, risk-honest treatment. If you're scoping production, read a service's Risks before its Overview.

Amazon Bedrock

The managed, serverless gateway to foundation models. One API, one IAM model, one bill, many vendors.

Official documentation ↗

Overview

Architecture

Capabilities

Pricing model

Risks & gotchas

When to use

TL;DR

Bedrock exposes many foundation models through a unified API (Converse / InvokeModel). You never manage servers; you pay per token on-demand, reserve capacity with Provisioned Throughput, or run Batch. Around the models sit Knowledge Bases (managed RAG), Guardrails (independent safety), Flows (orchestration), Evaluations, and customization. It is the default starting point for almost any GenAI workload on AWS.

What problem this solves

Most teams don't want to run GPU fleets, manage model weights, stand up a guardrail service, wire a vector store, and negotiate vendor contracts separately. Bedrock's offer is one IAM-governed, serverless surface where you swap Claude / Llama / Nova with a parameter, apply the same Guardrail policy across models, and keep data inside your AWS account. The trade-off is that exact model and feature availability varies by region - confirm before you design.

Two consumption modes, plus batch

Mode	How you pay	Best for
On-demand	Per input/output token, no commitment.	Prototyping, variable/low volume, model comparison.
Provisioned Throughput	Reserved model units (hourly + term commitments).	Steady high volume needing predictable latency and cost.
Batch	Discounted vs on-demand, asynchronous.	Large non-interactive jobs: enrichment, classification, embedding generation.

Rule of thumb

Start on-demand. Move to Provisioned Throughput when sustained traffic makes reserved units cheaper than pay-go and you need latency guarantees. Push bulk, non-interactive work to Batch for the discount.

Reference architecture

Figure - Amazon Bedrock reference shape. PrivateLink + IAM keep traffic off the public internet; Guardrails sit in-line; Knowledge Bases and AgentCore are opt-in around the model call.

Network and identity

Bedrock is reachable over PrivateLink (VPC endpoints) so model traffic never traverses the public internet. Authorization is IAM: scope policies to specific models, Knowledge Bases, Guardrails, and agents; callers use roles (instance/task/Lambda execution roles). Encrypt with KMS and prefer customer-managed keys for regulated data.

Where the data goes

AWS's stated position is that your prompts and completions are not used to train the base foundation models and stay within your account and region. You control model-invocation logging to CloudWatch/S3. For data residency, pin the region and confirm the model is available there; cross-region inference can move data across regions, so weigh it against compliance needs.

Capability matrix (June 2026)

Capability	Status	Notes
Model catalog + Marketplace	●	First-party + partner FMs; 100+ via Marketplace; import custom weights.
Knowledge Bases (RAG)	●	Managed RAG; 2026 adds Smart Parsing + Agentic Retriever.
Guardrails	●	Content filters, denied topics, PII, contextual grounding, Automated Reasoning checks.
Flows	●	Visual orchestration of prompts, models, KBs, Lambda.
Evaluations	●	Automatic + LLM-as-judge model and RAG evaluation.
Customization	●	Fine-tuning, continued pre-training, distillation, Reinforcement Fine-Tuning.
Prompt caching / cross-region	●	Cut cost/latency on repeated context; auto-route to capacity in other regions.
Batch inference	●	Discounted asynchronous processing at scale.
AgentCore	●	Managed agent runtime - its own tab.

How Bedrock bills

Lever	How it bills	Control
On-demand	Per input/output token, per model.	Right-size model per task; cache prompts; cap output tokens.
Provisioned Throughput	Reserved model units (hourly + term).	Commit only after you know the steady load.
Batch	Discounted vs on-demand.	Use for non-interactive enrichment.
Knowledge Bases	Storage + query + embedding tokens (+ vector store).	Tune chunk size; prune stale docs; pick S3 Vectors for cost.

Cost surprise

On-demand token pricing varies 10-30x between a flagship and a small model. A chatty agent loop on a flagship is the classic surprise bill. Set budgets, cache, and route small models for routine steps.

Model/region availability

Not every model is in every region. Confirm the exact model+region before you design around it; cross-region inference helps but has data-residency implications.

Cost blowouts

Flagship models in agent loops dominate bills. Budget per conversation, right-size the model per task, and cache repeated context.

Guardrails are opt-in

Guardrails are not applied unless you attach them. Make them part of the deployment, not an afterthought, and validate prompts and responses at the platform layer.

Quotas

Default token/request quotas throttle real workloads. Request increases early; design for throttling and backoff.

Use Bedrock for almost any GenAI workload on AWS - model optionality, managed RAG/guardrails/evals, no infrastructure.
Drop to SageMaker only when you need custom training, exotic hosting, or a model not in the catalog.
Add AgentCore when the workload is an agent heading to production.
Lead with a small model + prompt caching for cost; reserve Provisioned Throughput once volume is steady.

Foundation Model Catalog

Indicative view of model families on Bedrock in 2026. Exact versions and regions change frequently - confirm in the console.

Official documentation ↗

Provider	Families	Typical use
Anthropic	Claude (Opus / Sonnet / Haiku tiers)	Top-tier reasoning, agents, coding, long context. The frontier default on Bedrock.
Amazon	Nova Micro / Lite / Pro / Premier; Canvas, Reel, Sonic	Cost/latency-optimized text and multimodal; image, video, speech generation.
Meta	Llama (open weights)	Open-weight workloads, customization, on-prem parity.
Mistral	Mistral / Mixtral	Efficient European open-weight options.
DeepSeek	DeepSeek-R1 and successors	Strong open reasoning at low cost.
NVIDIA	Nemotron 3 (Super)	Open-weight frontier reasoning/agentic, hosted managed.
Cohere / AI21 / Stability	Command / Embed / Rerank, Jamba, Stable Diffusion / Image	Embeddings, reranking, long-context, image generation.

Embeddings + rerank

For RAG, pair an embedding model (Amazon Titan Text Embeddings, Cohere Embed) with a reranker (Cohere Rerank) for a quality lift at low engineering cost.

Pin model IDs

Models rev and deprecate. Pin a specific model ID in production, watch deprecation notices, and evaluate before auto-upgrading.

Amazon Nova

Amazon's own foundation-model family - optimized for price, latency, and AWS integration.

Official documentation ↗

Model	Modality	Best for
Nova Micro	Text	Cheapest, fastest text - classification, routing, simple extraction at scale.
Nova Lite	Multimodal (text+image/video in)	Low-cost multimodal understanding, high-volume workloads.
Nova Pro	Multimodal	Balanced capability/cost for most enterprise tasks and agents.
Nova Premier	Multimodal, most capable	Complex reasoning; also the teacher model for distillation.
Nova Canvas	Image generation	Studio-quality images with content credentials/watermarking.
Nova Reel	Video generation	Short-form video from text/image prompts.
Nova Sonic	Speech-to-speech	Real-time voice interactions with low latency.

Nova Forge (2026)

Forge SDK lets you customize Nova on domain data (fine-tune/distill) and deploy directly within Bedrock - useful when you want Nova's economics with your own task accuracy.

Positioning

Use Nova where cost and latency dominate and the task is well-scoped. For the hardest reasoning, A/B it against Claude on the same Bedrock API before committing.

Amazon Bedrock AgentCore GA

The managed runtime for production agents. Framework-agnostic - bring Strands, LangChain, OpenAI Agents SDK, or the Claude Agent SDK.

Official documentation ↗

Overview

Architecture

Modules

Risks & gotchas

When to use

TL;DR

AgentCore makes the hard parts of running agents - memory, tool auth, networking, identity, tracing, and safety - managed concerns, and standardizes the integration surface (MCP, OpenAPI). The 2026 managed harness lets you declare and run an agent in ~3 API calls with no orchestration code. It is framework-agnostic, so you keep your agent logic and let AWS run the plumbing.

What problem this solves

Hand-built agent loops prototype fast and operate badly: state, retries, tool credentials, private networking, observability, and guardrails all become your code. AgentCore turns those into managed modules you opt into, and adds capabilities most teams can't easily build - a managed Browser, sandboxed Code Interpreter, zero-egress Web Search, and (preview) autonomous Payments.

Migrate 2025 pilots

If you built on Bedrock Agents + custom glue in 2025, AgentCore's managed Memory, Gateway, Identity, and Observability replace most of that scaffolding. Move for the operational maturity alone.

Module map

Figure - AgentCore modules. Mix and match; you don't have to adopt all of them.

Module	What it gives you	Status
Runtime / Harness	Managed serverless execution; declare and run an agent in ~3 API calls, no orchestration code.	GA
Memory	Short-term and long-term memory so agents retain context across turns and sessions.	GA
Gateway	Turn APIs, Lambda, and MCP servers into governed agent tools with auth and access control.	GA
Identity	Scoped, least-privilege access for agents; policies verified by Automated Reasoning (same tech as IAM/S3).	GA
Observability	Traces of every step and tool call, and where the agent went off track; evaluation vs real traffic.	GA
Browser & Code Interpreter	Headless browsing and sandboxed code execution as managed tools.	GA
Web Search	Grounded, cited answers from the live web with zero data egress from your AWS environment.	GA
Payments	Agents autonomously pay for APIs, MCP servers, content, and other agents (Coinbase/Stripe).	Preview

Agentic payments = new risk class

An agent that can spend money needs hard budget caps, human-in-the-loop thresholds, and immutable audit. Treat AgentCore Payments as a controlled pilot, not a default.

Runaway loops & actions

Unbounded loops and broad tool access cause cost blowouts and unintended actions. Enforce step caps, budgets, least-privilege Identity, and human approval on high-impact tools.

Tool/data egress

Gateway tools and Web Search can move data. Vet tools, prefer scoped auth, and keep agents on private networking with egress controls. Web Search is zero-egress by design - confirm other tools are too.

Use AgentCore for any agent heading to production - managed memory, tools, identity, and observability beat hand-rolled glue.
Adopt modules incrementally - you don't need all of them; start with Runtime + Memory + Gateway + Observability.
Gate Payments behind budgets and human approval; pilot before trusting autonomous spend.
Pair with Bedrock Guardrails so safety holds regardless of the model the agent uses.

AWS vs OCI vs Azure vs GCP

A practitioner's quick read. Every cloud does the basics; the differences are in defaults, data gravity, and silicon.

Dimension	AWS	OCI	Azure	GCP
Model breadth (managed)	Widest (Bedrock)	Broad (OCI Gen AI)	Foundry Models (1000+)	Model Garden (200+)
Frontier own model	Nova (mid); Claude hosted	None (partners)	OpenAI GPT-5.x	Gemini 3.x
Agents	AgentCore	Enterprise AI Agents	Foundry Agent Service	Agent Platform + A2A
Custom silicon	Trainium/Inferentia	GPU (NVIDIA)	Maia (emerging)	TPU (Ironwood/8th)
Data gravity	S3 / Redshift	Oracle DB 26ai (in-DB vectors)	Fabric / OneLake	BigQuery
Best when	Already on AWS; want model choice + silicon economics	Run Oracle DB/EBS; want in-DB vectors + sovereignty	Microsoft shop; want OpenAI + M365	BigQuery/Workspace central; want Gemini + TPU

Honest take

The cloud you already run is usually the right one for GenAI - data gravity and IAM beat a marginally better model. AWS's edge is the widest model catalog plus custom-silicon economics; its tax is service sprawl.

Sources

Primary AWS material used for this portal (June 2026). Verify specifics against current docs before committing - this space moves weekly.

Accuracy note

Compiled by Brijesh Gogia for expertoracle.com. Independent and not affiliated with Amazon/AWS. Model names, availability, and pricing change frequently - treat this as orientation, confirm in the AWS console/docs before designing.

Knowledge Bases & RAG

Managed retrieval-augmented generation - the most common enterprise GenAI pattern.

Official documentation ↗

Figure - Bedrock Knowledge Bases. 2026 adds Smart Parsing (multi-format prep) and an Agentic Retriever for multi-step queries; Guardrails check grounding on the way out.

Use Knowledge Bases when	Build your own when
You want managed RAG with minimal code, the corpus is mostly documents, and Smart Parsing / built-in retrieval quality matter.	You need fine control over chunking, hybrid search, custom rerankers, or a vector store you already operate.

Bedrock Data Automation

For multimodal corpora (documents, images, audio, video), BDA extracts structured output to feed a Knowledge Base - cleaner than rolling your own parsers.

Retrieval is the failure point

Most "the model hallucinated" incidents are retrieval misses. Tune chunking, add a reranker, enable contextual-grounding Guardrails, and evaluate retrieval before blaming the LLM.

Guardrails

An independent safety layer you apply to any model - first-party or imported.

Official documentation ↗

Control	What it catches
Content filters	Hate, insults, sexual, violence, misconduct, prompt attacks - tunable thresholds.
Denied topics	Block subjects out of scope for your application.
Sensitive info / PII	Detect and redact or block PII and custom regex patterns.
Contextual grounding	Score answers for grounding against source and relevance to the query - reduce hallucination.
Automated Reasoning checks	Mathematically verify outputs against encoded policies/rules - high-assurance domains.

Apply at the platform layer

Guardrails sit between the app and the model, so the same policy applies regardless of which model the agent picks. Validate prompts and responses here, not only in app code.

Opt-in

Guardrails do nothing until attached to an invocation or agent. Make attaching them part of the deployment template.

SageMaker AI

Where you train, fine-tune, and host models when Bedrock's managed path isn't enough.

Official documentation ↗

Overview

Architecture

Components

When to use

TL;DR

SageMaker AI is the full-control path: managed training jobs, real-time / serverless / async / batch endpoints, JumpStart for one-click model deploy/fine-tune, HyperPod for large-scale FM training, and MLOps (Pipelines, Model Registry, Clarify, Model Monitor). Reach for it when Bedrock's managed surface can't express what you need - custom training, exotic hosting, or a model outside the catalog.

Bedrock vs SageMaker

Bedrock = consume and customize managed models, fast, serverless. SageMaker = own the training, hosting, and MLOps. Many teams use both: Bedrock for the app, SageMaker for the custom model behind it. Start in Bedrock; drop here only when you must.

Reference architecture

Figure - SageMaker AI. Data to training to registry to endpoints, wrapped in MLOps pipelines with bias/drift monitoring.

Component	Use
JumpStart	One-click deploy/fine-tune of open and partner foundation models.
Training & Inference	Managed training jobs; real-time / serverless / async / batch endpoints with autoscaling.
HyperPod	Resilient, self-healing clusters for FM pre-training and heavy fine-tuning across thousands of accelerators.
Pipelines / Model Registry	Reproducible MLOps pipelines, lineage, approval gates, deployment.
Clarify / Model Monitor	Bias/explainability and production drift detection.

Use SageMaker for custom training, specialized hosting, large-scale FM training (HyperPod), or models outside the Bedrock catalog.
Stay in Bedrock for consume/customize-managed; come here only for full control.
Use Unified Studio as the front door tying data, analytics, and these AI tools together.

SageMaker Unified Studio GA

The single workspace over data, analytics, and AI - the front door to the next-gen SageMaker.

Official documentation ↗

Unified Studio brings EMR, Glue, Athena, Redshift, Bedrock, and SageMaker AI into one IDE on a lakehouse foundation, with Amazon Q Developer embedded for code, troubleshooting, and ETL. It stitches the data and AI lifecycles together so the same governed data powers analytics and model building, and it replaces the older SageMaker Studio Classic experience.

LakehouseGlue / EMR / AthenaRedshiftBedrockQ DeveloperGovernance / catalog

Migration

If you run Studio Classic, plan the move to Unified Studio - newer Bedrock and governance features land here first.

Model Customization

Four ways to make a model better at your task, from cheapest to most involved.

Official documentation ↗

Technique	When	Cost/effort
Prompt + RAG	Most tasks - ground the model in your data without changing weights.	Low
Fine-tuning	Consistent style/format or narrow task accuracy from labeled examples.	Medium
Reinforcement Fine-Tuning	Optimize toward a reward signal where correctness is checkable (2026).	Medium-High
Distillation	Teach a small, cheap model from a large one - keep quality, cut cost/latency.	Medium
Continued pre-training	Inject large domain corpora; rarely needed for most enterprises.	High

Order of operations

Exhaust prompt engineering and RAG first. Fine-tune only with evidence the base model can't hit your accuracy/format bar. Distill once a fine-tuned large model proves out, to cut run-cost.

Amazon Q

AWS's family of GenAI assistants for developers, businesses, and operations.

Official documentation ↗

Product	What it does
Q Developer	Agentic coding and ops assistant - code generation, transformation/modernization, troubleshooting, AWS console help. Embedded in IDEs and SageMaker Unified Studio.
Q Business	Enterprise RAG assistant over your apps and documents (40+ connectors), with access controls inherited from the source systems.
Amazon Quick	2026 evolution toward autonomous background agents with specialized expertise; an activity feed across email, messaging, calendar, and tasks.
Q in QuickSight / Connect	Natural-language BI and contact-center assistance embedded in those services.

Build vs buy

For internal knowledge assistants, pilot Q Business before building custom RAG - the connectors and permission inheritance save real engineering. Build on Bedrock when you need bespoke UX or logic Q can't express.

Applied AI Services

Task-specific managed APIs - no model selection, just call them.

Official documentation ↗

Service	Task
Rekognition	Image/video analysis: labels, faces, moderation, text-in-image.
Textract	Document extraction: text, forms, tables, queries from PDFs/images.
Comprehend	NLP: entities, sentiment, key phrases, PII, custom classification.
Transcribe	Speech-to-text with diarization, custom vocabulary, call analytics.
Polly	Text-to-speech with neural and generative voices.
Translate	Neural machine translation across many languages.
Lex	Conversational bots (the engine behind many IVR/chat flows).
Kendra	Enterprise search; the GenAI Index feeds RAG with permission-aware retrieval.
Personalize	Real-time recommendations from your interaction data.

Trend

Several classic tasks (doc extraction, classification, summarization) are increasingly done with Bedrock + a multimodal model or Bedrock Data Automation. Use the applied service when it is cheaper, lower-latency, or compliance-certified for that exact task; reach for Bedrock when you need flexibility.

Vectors & Data

Where your embeddings and ground-truth live. Pick by scale, latency, and what you already run.

Official documentation ↗

Store	Best for
S3 Vectors	Cost-optimized vector storage/query at massive scale directly in S3 (2026) - cheapest for large, less latency-sensitive corpora.
OpenSearch Serverless (vector)	Low-latency hybrid (keyword + vector) search; the common Knowledge Base default.
Aurora / RDS PostgreSQL (pgvector)	Vectors next to relational data with transactional consistency.
MemoryDB / DocumentDB / Neptune Analytics	In-memory vectors, document-store vectors, and graph+vector analytics respectively.
Kendra GenAI Index	Managed, permission-aware retrieval index purpose-built for RAG.

Default

Most teams start with a Bedrock Knowledge Base on OpenSearch Serverless. Move to S3 Vectors for cost at scale, or pgvector when vectors must sit beside operational rows.

Chips & GPUs

The silicon under the stack. AWS's custom chips are the cost lever; NVIDIA GPUs are the compatibility lever.

Official documentation ↗

Silicon	Role
Trainium2 / Trainium3	AWS training (and increasingly inference) accelerators; Trn2 UltraServers and Project Rainier power large Anthropic/enterprise training at strong price/perf.
Inferentia2	Cost-efficient high-volume inference.
EC2 P5 / P6 (NVIDIA Blackwell)	Top-end GPU training/inference; maximum framework compatibility.
EC2 G7 (RTX PRO Blackwell)	2026 graphics/inference tier for cost-effective serving and visual workloads.
UltraClusters / Capacity Blocks / HyperPod	Network-dense GPU/Trainium fabrics; reserve capacity windows; resilient FM-training clusters.

Architect's lever

For high-volume inference, benchmark Inferentia2/Trainium against GPU instances - the price difference can dominate TCO. Keep GPUs where you need a specific CUDA/framework path.

Architecture Patterns

The handful of shapes most AWS GenAI workloads fall into.

1. Managed RAG assistant

Bedrock + Knowledge Base (OpenSearch/S3 Vectors) + Guardrails, fronted by API Gateway/Lambda or Q Business. The default enterprise knowledge assistant.

2. Production agent

AgentCore Runtime + Memory + Gateway (tools/MCP) + Identity + Observability. Add Web Search for grounding. Human-in-the-loop on high-impact actions.

3. Custom model service

SageMaker fine-tune/host (or import to Bedrock) behind a private endpoint; distill to cut cost once quality is proven.

4. Multimodal pipeline

Bedrock Data Automation extracts from docs/images/audio/video into structured output feeding a Knowledge Base or warehouse.

5. Batch enrichment

Bedrock batch inference over large datasets in S3 for classification, summarization, or embedding generation at lowest cost.

6. Embedded BI/ops assistant

Q in QuickSight/Connect, or Q Developer in the SDLC - buy the assistant rather than build it.

Decision Matrix

Fast answers to the questions that come up in every design review.

Question	Default answer
Consume a model or train one?	Consume via Bedrock. Train/fine-tune in SageMaker only with evidence the base model can't meet the bar.
Which model?	Claude for hardest reasoning/agents; Nova for cost/latency; Llama/Mistral/DeepSeek for open-weight/customization. A/B on the same Bedrock API.
Build RAG or use Knowledge Bases?	Knowledge Bases unless you need bespoke chunking/hybrid/rerank control.
Bedrock Agents or AgentCore?	AgentCore for anything heading to production - managed memory, tools, identity, observability.
Which vector store?	OpenSearch Serverless default; S3 Vectors for cost at scale; pgvector for vectors beside relational data.
Buy an assistant or build?	Q Business/Q Developer first; build on Bedrock when you need custom UX/logic.
GPU or AWS silicon?	Trainium/Inferentia for cost at volume; NVIDIA for specific framework/CUDA needs.

Pricing & Cost Control

Shape, not exact numbers - rates change and vary by model/region. Always confirm in the AWS pricing pages.

Lever	How it bills	Control
Bedrock on-demand	Per input/output token, per model.	Right-size model per task; cache prompts; cap output tokens.
Provisioned Throughput	Reserved model units (hourly).	For steady high volume; commit only after you know the load.
Batch inference	Discounted vs on-demand.	Use for non-interactive enrichment jobs.
Knowledge Bases / OpenSearch	Storage + query + embedding tokens.	Tune chunk size; prune stale docs; pick S3 Vectors for cost.
SageMaker	Training + endpoint instance-hours.	Serverless/async endpoints; autoscale to zero where possible.
Agents	Model tokens x steps + tool calls.	Cap loop length; cheap model for routing, strong model only when needed.

The agent cost trap

Agent loops multiply token cost by the number of steps. A 10-step loop on a flagship model is the most common surprise bill. Budget per-conversation, log token usage, and route to small models for routine steps.

Risks & Gotchas

Read this one. What actually bites teams in production.

Model/region drift

Models and versions change and aren't uniform across regions. Pin model IDs, monitor deprecations, and test before auto-upgrading.

Runaway agent cost & actions

Unbounded loops and tool access cause both cost blowouts and unintended actions. Enforce step caps, budgets, least-privilege Identity, and human approval on high-impact tools. For AgentCore Payments, treat spend as a first-class control.

Data egress & residency

Cross-region inference and external tools (web search, third-party MCP) can move data. Confirm residency; prefer zero-egress options where compliance requires.

Service sprawl & lock-in

Mixing Bedrock, SageMaker, Q, and three vector stores creates operational complexity and AWS-specific coupling. Standardize on a few primitives; keep prompts/eval portable.

Guardrails are opt-in

Retrieval quality and missing guardrails, not the model, are usually the failure. Attach Guardrails by default and use contextual grounding + rerankers before blaming the LLM.

Quotas

Default account quotas (tokens/min, requests/min, concurrent training) throttle real workloads. Request increases early; design for backoff.