Most "AWS operational best practices" content gives you a nicely formatted summary of the Well-Architected Framework's design principles. You could have read that in the AWS docs. What engineering teams actually need is a map from those principles to the specific AWS services, team structures, and automation patterns that make those principles real on a Tuesday afternoon.
This guide is organized around the four Operational Excellence areas - Organize, Prepare, Operate, Evolve - and includes an actionable checklist at the end you can use as a self-assessment or share with your team. I've implemented these patterns across multi-account AWS environments for engineering teams at various stages, and this is the playbook I wish existed when I started.
Whether you're prepping for a Well-Architected Review, onboarding a new workload to production, or trying to diagnose why your on-call rotation is unsustainable, this guide covers the practical steps, specific AWS tooling, and decision frameworks that turn the framework into action.
What Are AWS Operational Best Practices?
The Well-Architected Operational Excellence pillar defines operational excellence as a commitment to build software correctly while consistently delivering a great customer experience. The goal, put plainly, is to get new features and bug fixes to customers quickly and reliably.
That definition sounds like marketing copy until you translate it: operational excellence is what separates teams that ship confidently from teams that treat every production deployment as a small crisis. It is the difference between having runbooks in version control vs. in someone's head, and between compliance drift being caught automatically vs. discovered during an audit.
The Well-Architected Operational Excellence Pillar
AWS organizes the Operational Excellence pillar into four best practice areas:
- Organization - Team structure, roles, culture of learning, incident response processes
- Prepare - Workload design for operational readiness, DR, security, and scalability
- Operate - Monitoring, logging, automated deployment, and CI/CD
- Evolve - Regular workload reviews, automated testing, IaC, and continuous improvement
Security and operational excellence are the two pillars that AWS explicitly states are not traded off against other pillars. Everything else gets balanced - cost vs. performance, reliability vs. cost. Security and ops? Those are baselines.
This matters because it means the work in this guide is not optional optimization. It is table stakes for running production workloads on AWS responsibly.
The Eight Design Principles (Practical Translations)
AWS documents eight design principles for operational excellence. Most content lists five or six and stops there. Here are all eight, with the translation your team actually needs:
1. Organize teams around business outcomes - KPIs and operational goals must be aligned at every level, with leadership committed to the operating model. If your team is measured on deployment frequency but penalized for incidents, you have a misalignment problem before you have a technical problem.
2. Implement observability for actionable insights - Gain comprehensive understanding of workload behavior: performance, reliability, cost, and health. Observability means you can ask "why is this happening?" not just "is something happening?"
3. Safely automate where possible - The word "safely" is doing a lot of work here. Automation with guardrails (rate control, error thresholds, approval gates) is the goal. Automation without guardrails is how you turn a small incident into a large one.
4. Make frequent, small, reversible changes - Smaller changes mean smaller blast radius. If something breaks, you know exactly what changed, and you can reverse it fast. This is not a nice-to-have; it is the operational philosophy behind blue/green deployments, canary releases, and feature flags.
5. Refine operations procedures frequently - As workloads evolve, so do their failure modes. Hold regular reviews to validate that procedures still match reality and that teams actually know them.
6. Anticipate failure - Drive failure scenarios proactively to understand the workload's risk profile before customers experience it. This is the philosophical foundation for chaos engineering.
7. Learn from all operational events and metrics - Post-incident analysis is not just about preventing recurrence. It is how mature engineering organizations build institutional knowledge and improve faster than their competitors.
8. Use managed services - Every managed service you adopt is operational burden you do not carry. This is why you are not running your own Kafka cluster when MSK exists, and not managing Kubernetes control planes when EKS exists.
Use these eight as a diagnostic. Which ones does your team consistently skip? That is where to start.
Organize: Build the Right Foundation Before You Ship
You cannot operate what you have not structured properly. The Organize area covers everything that happens before a workload is deployed: how your team is structured, how your AWS accounts are governed, how resources are tracked, and what security guardrails are in place.
Getting this foundation wrong is expensive to fix later. I've seen teams spend months retrofitting multi-account governance onto an existing org because they skipped it at the start. Start here.
Choose Your Operating Model
AWS documents three broad operating model topologies, and the choice shapes everything downstream.
Decentralized DevOps gives each workload team full ownership: they build it, they run it. This is Amazon's famous "two-pizza team" model - a team small enough to be fed by two pizzas that owns a workload end-to-end. It works well when teams have the skills and headcount. The catch: governance not explicitly delegated to application teams must be enforced centrally via AWS Organizations and AWS Control Tower, or it simply does not happen.
Distributed DevOps / COPE (Cloud Operations and Platform Enablement) separates application engineering from infrastructure. A platform team builds a thin layer of shared capabilities - standard networking, security baselines, approved container images - and grants application teams self-service access through AWS Service Catalog. Application teams get autonomy within guardrails. This is the right model for most growing engineering organizations: it scales without requiring every team to be a cloud expert.
Centralized operations keeps all infrastructure management with a central team. This can work for smaller organizations but becomes a bottleneck as engineering scales.
My recommendation: if you have fewer than five engineering teams, start with decentralized DevOps and enforce governance centrally via SCPs. If you have more than five teams, invest in a platform team before the coordination overhead becomes painful.
Multi-Account Architecture for Operational Isolation
This is the section most "AWS operational best practices" guides skip entirely, and it is the most consequential architectural decision you will make for operational governance.
AWS accounts are hard security and operational boundaries. Not soft boundaries. Not "well, it should be separate but we can work around it." Hard boundaries. A misconfiguration in one account cannot cascade to another. A compromise in one account does not automatically expose another. Billing is isolated. Quotas are isolated. This is why AWS recommends separating production workloads from non-production, and assigning a single or small set of related workloads to each production account.
I keep running into the same pattern with clients: everything in one account, production and dev sharing IAM policies, a single CloudTrail log that no one actually reviews. The multi-account structure feels like overhead until the day you really need it.
The AWS multi-account design principles give you a solid starting point:
- Security OU - Audit account and Log Archive account (set up automatically by AWS Control Tower)
- Infrastructure OU - Shared services and networking accounts
- Workloads OU - Production workload accounts
- Sandbox OU - Development and test environments
- Policy Staging OU - Test proposed policy changes here before promoting to production OUs
- Suspended OU - Closed or decommissioned accounts
A critical design principle: organize accounts based on security and operational needs, not your org chart. The accounts should reflect how security controls and compliance requirements differ across workloads, not how your reporting structure happens to look today.
AWS Control Tower sets up your landing zone with default guardrails, enables CloudTrail with KMS encryption in all provisioned accounts, and creates the Security OU automatically. Control Tower also scans managed SCPs, RCPs, and Declarative Policies daily to verify compliance drift has not occurred.
Service Control Policies (SCPs) are your enforcement layer. They restrict which services, regions, and actions are allowed at the organization, OU, or account level. You can also use Resource Control Policies (RCPs), Tag Policies, Backup Policies, and Declarative Policies. This is how you prevent an engineer in a dev account from accidentally enabling a service in a region that violates your data residency requirements.
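For example, a data-residency guardrail is only a few lines of SCP. Here is a sketch; the approved region list and the global-service exemptions in `NotAction` are assumptions you would adapt to your own requirements:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "route53:*",
        "cloudfront:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["eu-central-1", "eu-west-1"]
        }
      }
    }
  ]
}
```

Attached at the OU level, this denies any non-exempt action in an unapproved region regardless of what the account's IAM policies allow.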
For a deep dive on multi-account strategy, see the dedicated multi-account post, which covers the architecture in full.
Tagging Strategy as an Operational Control Plane
Tags sound boring. They are not. A missing or inconsistent tag is a cost you cannot attribute, a compliance gap you cannot audit, and an automation target you cannot reach.
AWS tagging serves four operational purposes:
- Cost allocation - Finance teams use tags to track costs across services, features, accounts, and teams. Tags enable both showback (visibility into costs per team) and chargeback (actual internal billing).
- Operations and support - Manage and discover resources, perform administrative tasks, improve troubleshooting.
- Data security and access control - Attribute-based access control (ABAC) uses resource tags in IAM policies to grant or restrict access.
- Automation filtering - Tags let automation scripts target specific resources without hardcoding identifiers.
The three common cost allocation models, in order of effort: account-based (least effort, clear per-account visibility), team-based (moderate effort, works across teams), and tag-based (most effort, highest accuracy for showbacks and chargebacks).
Here is the operational principle that matters most: use proactive tagging, not reactive cleanup. An SCP that denies resource creation without required tags prevents technical debt from accumulating in the first place. A Lambda function that scans for untagged resources after the fact is always losing.
Standard tagging format: all lowercase, hyphens separating words, organization prefix followed by a colon. For example: company:environment, company:team, company:workload. Activate tags in the Billing and Cost Management console to enable cost allocation reports.
A CDK pattern for an SCP that denies EC2 instance creation without the company:environment tag:
```typescript
import * as organizations from 'aws-cdk-lib/aws-organizations';

// workloadsOuId: the ID of the OU this policy attaches to
const taggingPolicy = new organizations.CfnPolicy(this, 'RequiredTagsPolicy', {
  name: 'RequireEnvironmentTag',
  type: 'SERVICE_CONTROL_POLICY',
  content: JSON.stringify({
    Version: '2012-10-17',
    Statement: [{
      Sid: 'DenyEC2WithoutEnvTag',
      Effect: 'Deny',
      Action: ['ec2:RunInstances'],
      Resource: 'arn:aws:ec2:*:*:instance/*',
      Condition: {
        // "Null": "true" matches requests where the tag is absent
        'Null': { 'aws:RequestTag/company:environment': 'true' },
      },
    }],
  }),
  description: 'Deny EC2 creation without required environment tag',
  targetIds: [workloadsOuId],
});
```
This is the difference between a tagging policy that exists in a wiki and one that actually works.
IAM Guardrails and Security Baseline
IAM users and roles have no permissions by default. The root user has full, unrestricted access and should be locked down immediately after account creation.
The baseline you need in place before any workload ships:
- Human users: require federation with AWS IAM Identity Center (formerly AWS SSO) for temporary credentials. No long-term access keys for humans, ever.
- Workloads: use IAM roles with temporary credentials. Do not put access keys in code, environment variables, or Lambda configuration.
- MFA: enable for all users regardless of privilege level. This is not optional.
- IAM Access Analyzer: use it to generate least-privilege policies from actual CloudTrail activity, verify public and cross-account access, and validate policy syntax before deployment.
- SCPs: set permissions guardrails at the OU level. This is your safety net for the entire account structure.
- Permissions boundaries: use for delegated access management, where application teams need to create IAM roles but should not be able to escalate their own privileges.
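The permissions boundary pattern in practice: publish a boundary policy, then deny role creation unless that boundary is attached. A sketch of the enforcing statement, attached to the application team's role (the account ID and policy name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireBoundaryOnCreatedRoles",
      "Effect": "Deny",
      "Action": [
        "iam:CreateRole",
        "iam:PutRolePermissionsBoundary",
        "iam:DeleteRolePermissionsBoundary"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "iam:PermissionsBoundary": "arn:aws:iam::111122223333:policy/workload-boundary"
        }
      }
    }
  ]
}
```

Teams can create the roles they need, but every role they create is capped at the boundary's permissions, so they cannot escalate beyond it.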
The IAM best practice that gets skipped most often: establish a user lifecycle policy and perform regular permission reviews. Most over-permissioned IAM entities in production exist because no one has a process for detecting and removing permissions that are no longer used.
Prepare: Make Your Workloads Operationally Ready
Prepare work happens before an incident. It is the difference between having a plan for when things go wrong and improvising at 2 AM. This section covers IaC, deployment strategies, operational readiness reviews, and disaster recovery.
Infrastructure as Code with AWS CDK
IaC treats infrastructure the same as application code: version-controlled, peer-reviewed, testable, and deployable through a pipeline. The goal AWS defines for programmatic deployment is that the version of software tested is identical to the version deployed. That sounds obvious until you see a team manually clicking through the console to "just make a quick change."
The benefits beyond consistency: automatic rollback capability, cryptographic proof of what was deployed, reduced human error, and increased release confidence.
My tool recommendation for teams with software engineering skills: AWS CDK best practices covers this in depth, but the short version is that CDK's TypeScript or Python constructs let you build reusable components that encode operational best practices by default. An L2 or L3 construct for an S3 bucket can enforce encryption, versioning, and lifecycle policies automatically every time someone instantiates it - rather than relying on every developer to remember those configurations.
The AWS IaC tool options, briefly:
| Tool | Best For |
|---|---|
| AWS CDK | Teams with strong software engineering skills who want reusable constructs |
| AWS CloudFormation | Teams with regulatory requirements needing strict promotion workflows and highly stable tooling |
| AWS SAM | Serverless applications - Lambda, API Gateway, Step Functions |
| Terraform | Multi-cloud deployments or enterprises already standardized on HashiCorp tooling |
IaC best practices regardless of tool:
- Store all templates in source control (not in someone's AWS account)
- Deploy through CI/CD - manual deployments should trigger a conversation about why
- Test in lower environments before production; never promote untested infrastructure
- Run schema validation and linting - cdk-nag for CDK; cfn-lint for CloudFormation
- Enable drift detection and treat drift as an incident, not a footnote
Safe Deployment Strategies
The Well-Architected best practice OPS06-BP03 is clear: the goal is a CI/CD system that automates safe rollouts, with teams required to use appropriate strategies. "Required" is the key word. Safe deployment strategies should not be optional.
AWS defines six safe deployment strategies:
| Strategy | Description | Best For |
|---|---|---|
| Feature flags | Enable/disable features without code deployment | Business-level segmentation; gradual rollout |
| One-box | Deploy to a single unit first | Limiting initial blast radius |
| Rolling/Canary | Incrementally deploy to a percentage of fleet | Reducing risk with metric observation |
| Immutable | Deploy new instances; terminate old ones | Consistent environments; no in-place mutations |
| Traffic splitting | Shift traffic percentages between versions | Progressive validation with real user traffic |
| Blue/green | Two identical environments; switch traffic at cutover | Near-zero-downtime releases with instant rollback |
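The arithmetic behind a time-based linear rollout (the idea underlying CodeDeploy's TimeBasedLinear configurations) is simple enough to sketch. This helper is illustrative, not an AWS API; it just makes the blast-radius math explicit:

```typescript
interface RolloutStep {
  minute: number;          // minutes since deployment start
  trafficPercent: number;  // traffic on the new version after this step
}

// Shift a fixed percentage of traffic every interval until 100%.
// Each interval is observation time: if alarms fire, the rollout halts
// while only `trafficPercent` of users are exposed.
export function linearRollout(stepPercent: number, intervalMinutes: number): RolloutStep[] {
  if (stepPercent <= 0 || stepPercent > 100) {
    throw new Error('stepPercent must be in (0, 100]');
  }
  const steps: RolloutStep[] = [];
  let shifted = 0;
  let minute = 0;
  while (shifted < 100) {
    shifted = Math.min(100, shifted + stepPercent);
    steps.push({ minute, trafficPercent: shifted });
    minute += intervalMinutes;
  }
  return steps;
}
```

A 10%-every-5-minutes plan reaches full traffic in 45 minutes, with nine chances to catch a regression while most users are still on the old version.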
Common anti-patterns to avoid: deploying all of production at once (when it fails, it fails everywhere), requiring extensive approval processes for every release (this pressures teams to batch changes and increases blast radius), and mutable deployments where you update running instances in place (you lose the ability to roll back).
For ECS deployments specifically, AWS CodeDeploy supports: deployment alarms that halt a rollout when error thresholds are breached, configurable bake time to validate stability, canary configuration for progressive traffic shifts, deployment circuit breakers for automatic rollback, and lifecycle hooks for custom pre/post-deployment validation.
"Make frequent, small, reversible changes" is not just a design principle. It is the operational practice that makes everything else easier.
Operational Readiness Reviews (ORR)
The Operational Readiness Review (ORR) is distinct from a Well-Architected Review, and this distinction matters. The Well-Architected Review covers broad architectural best practices across all six pillars. The ORR uses data from your organization's own post-incident analyses to generate organization-specific best practices - the ones that emerged from your incidents, not generic guidance.
AWS created the ORR program to distill learnings from AWS operational incidents into curated questions with best practice guidance. The stated outcome: "shorter, fewer, and smaller incidents."
The six ORR checklist domains:
- Architectural recommendations
- Operational processes
- Event management
- Release quality
- Security
- Governance and compliance
ORRs should be conducted before a workload launches and periodically throughout its lifecycle to catch any drift from best practices. The process: gather stakeholders, create a checklist based on your post-incident learnings, identify the workload to review, and address discoveries made during the review.
The leverage point is automating ORR findings into detection. Once you've identified a class of issue in an ORR, build a Config rule, Security Hub finding, or Control Tower guardrail that catches it automatically in future. An ORR finding that lives only in a spreadsheet will be forgotten. An ORR finding that triggers a Config rule will not.
Disaster Recovery: Defining RTO and RPO
Two definitions that must be documented for every workload before it goes to production:
RTO (Recovery Time Objective): The maximum acceptable delay before a service is restored after a failure. How long can this be down?
RPO (Recovery Point Objective): The maximum acceptable amount of data loss since the last recovery point. How much data can we afford to lose?
The Well-Architected best practice REL13-BP01 is explicit: engage business stakeholders to understand the monetary cost per minute of downtime. This is not a technical decision. Engineers propose the strategy; business stakeholders define the acceptable risk.
AWS commonly sees three DR tiers:
| Tier | Workload Type | RTO | RPO |
|---|---|---|---|
| Tier 1 | Mission-critical | 15 minutes | Near-zero |
| Tier 2 | Important, non-mission-critical | 4 hours | 2 hours |
| Tier 3 | All other applications | 8-24 hours | 4 hours |
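AWS documents four standard DR strategies (backup and restore, pilot light, warm standby, multi-site active/active), and a workload's tier largely determines which are viable. A sketch of that mapping; the thresholds here are illustrative assumptions, since the real cutoffs come out of the stakeholder conversation:

```typescript
// The four standard AWS DR strategies, ordered cheapest to most expensive.
type DrStrategy =
  | 'backup-and-restore'
  | 'pilot-light'
  | 'warm-standby'
  | 'multi-site-active-active';

// Illustrative mapping from recovery objectives to a candidate strategy.
// The binding constraint is whichever objective is tighter.
export function suggestDrStrategy(rtoMinutes: number, rpoMinutes: number): DrStrategy {
  const objective = Math.min(rtoMinutes, rpoMinutes);
  if (objective < 1) return 'multi-site-active-active'; // near-zero loss/downtime
  if (objective <= 30) return 'warm-standby';           // minutes
  if (objective <= 240) return 'pilot-light';           // up to a few hours
  return 'backup-and-restore';                          // many hours acceptable
}
```

The point of writing it down, even as a sketch, is that the strategy choice becomes an explicit function of documented objectives rather than a per-workload debate.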
The five Well-Architected DR best practices (REL13):
- Define recovery objectives with business stakeholders (REL13-BP01)
- Use defined recovery strategies to meet those objectives (REL13-BP02)
- Test the implementation - measure actual Recovery Time Capability, not assumed (REL13-BP03)
- Manage configuration drift at the DR site using AWS Config and Systems Manager Automation (REL13-BP04)
- Automate recovery processes (REL13-BP05)
The most skipped step is REL13-BP03. Teams define RTO/RPO, configure multi-region failover, and then never actually test whether it works. Assumed recovery capability is not recovery capability.
One more distinction worth noting: availability engineering focuses on redundancy within a workload's components (multiple AZs, Auto Scaling, ALB health checks). Disaster recovery focuses on discrete full copies of the entire workload that can be activated when the primary copy is unrecoverable. Both matter. They solve different problems.
Operate: Run Workloads With Confidence
Prepare work is done before incidents happen. The Operate phase is what you do when things are running - and when they go wrong. This covers observability, incident management, compliance automation, and audit logging.
Observability: Logs, Metrics, and Traces
Observability is built on three data sources: logs, metrics, and traces. Monitoring is reactive - it tells you something is wrong. Observability is proactive - it tells you why something is wrong before your users notice.
The core AWS observability stack:
| Service | Primary Use |
|---|---|
| Amazon CloudWatch | Metrics, logs, alarms, dashboards, anomaly detection |
| AWS X-Ray | Distributed tracing, application request debugging |
| AWS CloudTrail | API call history, governance, compliance |
| VPC Flow Logs | Network traffic visibility |
| Amazon EventBridge | Event-driven automation and routing |
| Amazon Managed Grafana | Visualization and dashboards |
| Amazon Managed Service for Prometheus | Metrics for containerized workloads |
| AWS Distro for OpenTelemetry (ADOT) | Open-source telemetry collection |
CloudWatch is the center of gravity for AWS observability. Key capabilities beyond the standard metrics/logs/alarms:
- CloudWatch Application Signals - Auto-discovers and visualizes application topology without requiring instrumentation changes
- CloudWatch Synthetics - Runs synthetic monitoring checks to verify availability before real users hit errors
- CloudWatch RUM (Real User Monitoring) - Visibility into actual user experience on web, iOS, and Android
- CloudWatch Investigations - AI-powered interactive incident analysis with a "5 Whys" investigation workflow (announced re:Invent 2025)
- CloudWatch Lambda Insights and Container Insights - Specialized performance monitoring for serverless and containerized workloads
Three questions to ask before creating any alert:
- Why am I monitoring this metric?
- Who gets notified when the threshold is breached?
- What is the business impact of a breach?
If you cannot answer all three, the alert is not ready. Alerts should fire before a problem affects users, notify teams (not individuals), and include human-readable diagnostic information in the alert body itself. Use composite alarms to group related alerts and reduce noise - alert fatigue is a real operational problem that degrades on-call reliability over time.
For multi-account environments, use cross-account and cross-region log centralization to consolidate logs from all accounts into a single destination account. This was announced in 2025 and directly addresses one of the most common operational pain points at scale.
Incident Management Lifecycle
Define your incident response process before an incident, not during one. Impact ratings in particular should be agreed upon in advance - not debated at 2 AM with production down.
The four incident lifecycle phases:
1. Alert and Engage - CloudWatch metrics and EventBridge alerts detect anomalies and trigger automated incident creation and escalation. Response plans define who gets paged and at what threshold.
2. Triage - Responders assess impact using pre-defined impact ratings, established in advance: Critical (full application failure impacting most customers), High, Medium, Low, and No Impact (urgent, but customers not currently affected).
3. Investigate and Mitigate - Work through runbooks, review timelines, correlate metrics. This is where Systems Manager Automation executes remediation steps.
4. Post-Incident Analysis - Reflect on the incident, identify contributing factors, define preventative actions, communicate learnings. This feeds directly into the Evolve phase.
A note on AWS Systems Manager Incident Manager: it is no longer open to new customers. For new implementations, the recommended pattern is CloudWatch Alarms and EventBridge for detection and triggering, Systems Manager Automation for response execution, Amazon SNS for routing notifications (email for low-priority, SMS/pager for high-priority), and PagerDuty or OpsGenie as the incident workflow and on-call management layer.
Keep all runbooks, alarms, and configuration in version control. Operational artifacts that live in someone's head or an informal Slack channel are not operational artifacts at all.
Compliance Automation with AWS Config
AWS Config monitors and records resource configurations, detects configuration drift, and can invoke Systems Manager Automation to fix noncompliant resources. This is your automated compliance enforcement layer - the mechanism that ensures the security and operational standards you define are actually maintained.
The most powerful Config feature that almost no third-party content covers: Config Conformance Packs. A conformance pack is a collection of Config rules and remediation actions deployed as a single YAML template across your entire organization. AWS provides pre-built conformance packs for the Well-Architected Reliability and Security pillars - you can deploy them directly.
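A conformance pack is just a CloudFormation-style YAML document of Config rules. A minimal sketch with two AWS managed rules (the rule selection is illustrative):

```yaml
Resources:
  S3BucketVersioningEnabled:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: s3-bucket-versioning-enabled
      Source:
        Owner: AWS
        SourceIdentifier: S3_BUCKET_VERSIONING_ENABLED
  RdsStorageEncrypted:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: rds-storage-encrypted
      Source:
        Owner: AWS
        SourceIdentifier: RDS_STORAGE_ENCRYPTED
```

Deployed at the organization level, the same template evaluates every account, and the compliance dashboard aggregates the results in one place.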
Auto-remediation closes the loop: a Config rule detects a noncompliant resource, triggers an SSM Automation document to fix it, and logs the remediation. No human required. This is what "operations as code" looks like in practice.
The AWS Control Tower integration is worth understanding: Control Tower uses Config for configuration history and snapshots in all provisioned accounts, and scans managed SCPs, RCPs, and Declarative Policies daily for compliance drift. You get Config coverage in every Control Tower-managed account automatically.
Audit Logging with CloudTrail
CloudTrail is enabled by default on account creation, but the default event history alone is insufficient for compliance or security purposes. To maintain an ongoing record, create trails in one or all AWS regions.
Non-negotiable CloudTrail configuration:
- Enable in all AWS regions (not just your primary region)
- Enable log file integrity validation - this detects if logs have been modified, deleted, or forged
- Encrypt logs with KMS
- Ingest into CloudWatch Logs for real-time monitoring and alerting on specific API activity
- Centralize logs from all accounts into a single S3 bucket in the Log Archive account
- Apply lifecycle policies for long-term retention
- Prevent disabling via SCP - no account should be able to turn off CloudTrail
The SCP approach is underused. If an attacker compromises an account, one of the first things they do is disable logging. An SCP that denies cloudtrail:StopLogging and cloudtrail:DeleteTrail at the OU level removes that option entirely.
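A sketch of that SCP; the break-glass role exemption is an assumption, and you may prefer to deny the actions unconditionally:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProtectCloudTrail",
      "Effect": "Deny",
      "Action": [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "cloudtrail:UpdateTrail"
      ],
      "Resource": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/org-break-glass"
        }
      }
    }
  ]
}
```

Denying `UpdateTrail` alongside the two obvious actions matters: an attacker who cannot stop logging can still try to redirect or hollow out the trail's configuration.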
For analytics: CloudTrail Lake is a fully managed, serverless data lake for immutable CloudTrail event storage with SQL querying capability. It supports multi-account, multi-region, and even multicloud/multisource event data.
A 2025 update worth knowing: CloudTrail now offers event aggregation and insights for data events, consolidating high-volume API activity into 5-minute summaries with automatic anomaly detection. This is particularly useful for S3 and Lambda data event activity that would otherwise generate overwhelming log volume.
CloudTrail participates in AWS compliance programs for SOC, PCI, FedRAMP, and HIPAA.
Evolve: Build Operational Maturity Over Time
You can operate excellently from day one and still accumulate operational debt over time. The Evolve phase is how you prevent that - and how you distinguish between organizations that get better after incidents and ones that repeat the same failures.
Post-Incident Analysis and the Correction of Errors (COE) Process
The COE (Correction of Errors) process is AWS's internal post-incident analysis methodology, and it is worth adopting as your own. After every customer-impacting event: identify the contributing factors (not just the proximate cause), define preventative actions, and communicate what happened with affected teams.
The key operational discipline here is treating COE findings as backlog items, not post-mortems that expire in a PDF. Continuous improvement means improvement work is in the sprint, not a separate "project" that never gets prioritized.
Three practices that separate mature engineering orgs from reactive ones:
- Blameless post-mortems - The system failed, not the person. When teams fear blame, incidents go under-reported and root causes go unaddressed.
- Feedback loops in procedures - Runbooks should be updated based on what actually happened in an incident, not what you assumed would happen when you wrote them.
- Cross-team retrospective analysis - Look for systemic patterns across incidents, not just individual event root causes. If three different teams hit the same class of failure in three months, that is a platform problem, not three separate team problems.
Dedicate explicit work cycles to continuous improvement. Improvement work placed in a "someday" backlog column does not happen.
Chaos Engineering and Game Days with AWS FIS
The "anticipate failure" design principle has a practical implementation: chaos engineering. AWS Fault Injection Service (AWS FIS) is the native tool for running fault injection experiments on AWS resources.
The underlying idea: if you design for resilience but never verify that resilience under real conditions, your confidence is assumed, not earned. AWS FIS lets you run controlled experiments that answer the question before customers answer it for you.
The chaos engineering flywheel:
- Define steady state - what does normal behavior look like, measured quantitatively?
- Form a hypothesis - what do you expect to happen when this specific fault is injected?
- Run the experiment by injecting the fault
- Verify the hypothesis by measuring actual outputs
- Improve workload design if steady state was not maintained
- Run experiments regularly as part of the CI/CD pipeline (treat experiments as code)
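An AWS FIS experiment template ties the flywheel together: targets, actions, and, critically, stop conditions. A hedged sketch of the JSON you would pass to `aws fis create-experiment-template` (the tags, ARNs, Availability Zone, and role name are illustrative):

```json
{
  "description": "Hypothesis: checkout stays within SLO when one AZ's instances stop",
  "targets": {
    "one-az-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "company:workload": "checkout" },
      "filters": [
        { "path": "Placement.AvailabilityZone", "values": ["eu-central-1a"] }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stop-instances": {
      "actionId": "aws:ec2:stop-instances",
      "targets": { "Instances": "one-az-instances" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:eu-central-1:111122223333:alarm:checkout-slo-breached"
    }
  ],
  "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role"
}
```

Note that the targeting reuses the tagging strategy from earlier in this guide, and the stop condition is the guardrail: if the SLO alarm fires, FIS halts the experiment automatically.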
AWS FIS supports experiments on EC2, ECS, EKS, and RDS. Fault types include termination, failover, resource stressing, and latency injection. Third-party integrations are available for Chaos Mesh, Litmus Chaos, Gremlin, and Chaos Toolkit.
Always include guardrails and stop conditions. An experiment that escapes its intended scope and impacts production traffic has achieved the opposite of its goal.
Game days are the structured version of chaos engineering: a scheduled exercise where the entire team practices responding to a simulated failure. Game days validate that runbooks actually work, that teams know their roles, and that the on-call handoff process functions under pressure. Run them at least quarterly for critical workloads.
Anti-patterns to avoid: designing for resilience without ever running experiments, running experiments only in staging (production behavior is different), and not using past post-incident analyses to inform which faults to inject next.
Well-Architected Reviews and Trusted Advisor
The AWS Well-Architected Tool is free in the AWS console. The formal architecture review should happen at least annually per OPS11-BP01. For the complete 57-question checklist, see the Well-Architected Review post. Running a Well-Architected Review before your next major production launch is always better than running one after your first major incident.
The post-review process that most teams skip:
- 1 day after: recap email to all stakeholders
- 2-3 days after: HRI (High Risk Issue) prioritization meeting
- 1 week after: improvement plan initiated, with a 90- or 180-day duration
- Routine follow-up meetings to review improvement actions
The Well-Architected Tool supports custom lenses, which let organizations incorporate their own internal best practices alongside the standard framework. This is how you formalize the learnings from your COE process into a repeatable review.
Trusted Advisor provides automated checks across six categories: cost optimization, performance, security, fault tolerance, service limits, and operational excellence. Full check access requires Business Support+ or higher. Trusted Advisor Priority, available with Enterprise Support or the Unified Operations plan, creates prioritized recommendations proactively.
Worth noting: Developer, Business, and Enterprise On-Ramp support plans are being discontinued January 1, 2027. The successor plans are Business Support+, Enterprise Support, and Unified Operations. If your support plan contracts reference the old plan names, plan for this transition.
For operational data trend analysis: export log data to S3, use AWS Glue to discover and prepare the data, store metadata in the Glue Data Catalog, query with Amazon Athena using standard SQL, and visualize with Amazon QuickSight. This is the standard AWS analytics stack for operational insights.
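As a sketch of the query step in that stack: the table and column names below (an `operational_data.cloudtrail_logs` table with `eventsource`, `eventname`, and `eventtime` columns) are assumptions standing in for whatever schema your Glue crawler produces from the centralized CloudTrail logs in S3:

```python
# Hedged sketch of the Athena query step in the S3 -> Glue -> Athena -> QuickSight
# stack. Table and column names are assumptions; adjust to your Glue Data Catalog.
ATHENA_TREND_QUERY = """
SELECT eventsource,
       eventname,
       count(*) AS calls
FROM operational_data.cloudtrail_logs
WHERE eventtime BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY eventsource, eventname
ORDER BY calls DESC
LIMIT 20
"""

# You would submit this via Athena's StartQueryExecution API (e.g. boto3's
# athena client) and point QuickSight at the same catalog table for dashboards.
```

The value of the pattern is that the same SQL serves ad hoc investigation in the Athena console and scheduled trend dashboards in QuickSight, with no separate ETL pipeline to maintain.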
Cost Optimization as an Operational Discipline
Cost management is not a quarterly finance meeting. It is a daily operational responsibility, and a cost overrun is an operational failure just as an availability outage is.
The four AWS Cloud Financial Management pillars:
- See - Establish visibility through account structure and resource tagging. AWS Control Tower, Cost Explorer, and the Cost and Usage Report support this.
- Save - Optimize through pricing model selection: EC2 Savings Plans (up to 72% vs. On-Demand), Spot Instances (up to 90% for fault-tolerant workloads), rightsizing recommendations.
- Plan - Improve forecasting with AWS Budgets and Cost Explorer.
- Run - Manage billing controls using the Billing Console, IAM policies, and SCPs.
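To make the "Save" pillar's discount figures concrete, here is a back-of-envelope comparison. The hourly rate and usage are illustrative assumptions, not real AWS prices:

```python
# Back-of-envelope comparison of the pricing levers in the "Save" pillar.
# The rate and hours are illustrative assumptions, not real AWS prices.
ON_DEMAND_RATE = 0.10   # $/hour for a hypothetical instance
HOURS_PER_MONTH = 730

def monthly_cost(discount_pct: float, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost after a percentage discount off the On-Demand rate."""
    return round(ON_DEMAND_RATE * hours * (1 - discount_pct / 100), 2)

on_demand = monthly_cost(0)      # baseline: $73.00/month
savings_plan = monthly_cost(72)  # up to 72% off with an EC2 Savings Plan
spot = monthly_cost(90)          # up to 90% off for fault-tolerant workloads
```

Run across a fleet, the gap between the baseline and the discounted figures is why pricing-model selection is an operational decision, not a one-time procurement detail.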
Cost Anomaly Detection deserves special attention: it monitors for unexpected spend spikes and alerts before a misconfiguration or runaway process becomes a budget crisis. Set it up for every production account. The alert threshold should be calibrated to your normal spend variance - too sensitive means noise, too loose means surprises.
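One way to calibrate that threshold to your normal spend variance, as suggested above, is mean daily spend plus a multiple of its standard deviation. This is a sketch of the calibration logic only; the spend history is fabricated, and the resulting dollar value is what you would feed into your Cost Anomaly Detection alert subscription:

```python
# Calibrating an anomaly alert threshold to normal spend variance:
# mean daily spend plus k standard deviations. History is fabricated.
from statistics import mean, stdev

def alert_threshold(daily_spend: list[float], k: float = 3.0) -> float:
    """Dollar threshold above which a daily spend reading counts as anomalous."""
    return round(mean(daily_spend) + k * stdev(daily_spend), 2)

history = [120.0, 118.5, 125.0, 122.0, 119.5, 121.0, 124.0]
threshold = alert_threshold(history)
# Smaller k -> more sensitive (more noise); larger k -> an anomaly must be
# bigger before it alerts (more surprises slip through).
```

Recalibrate whenever your baseline spend shifts, such as after a migration or a Savings Plan purchase, or yesterday's normal becomes today's false alarm.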
Cost Optimization Hub consolidates cost optimization opportunities across accounts and regions, offering over 15 recommendation types including EC2 rightsizing, Graviton migration candidates, idle resource detection, and Savings Plans recommendations. Review it monthly as part of your operational cadence.
Establish cost ownership at the team level. A platform team that owns cost visibility without application team accountability will never create a cost-aware culture. Teams should see their own cost trends, not just receive aggregated reports.
AWS Operational Best Practices Checklist
Turning the practices above into a self-assessment is where the real value is. Here are the highest-priority items per OE area to get you started:
Organize - OU structure deployed with Security, Infrastructure, Workloads, Sandbox, and Policy Staging OUs. SCPs applied at the OU level. Tagging policy enforced via SCP. IAM Identity Center configured for all human access. Root user access keys deleted and MFA enabled across all accounts.
Prepare - All infrastructure deployed via IaC with CI/CD enforcement. Safe deployment strategy (canary, blue/green, or rolling) configured for production workloads. ORR completed before every production launch. DR tiers assigned with tested RTO/RPO.
Operate - CloudWatch dashboards per workload with composite alarms. CloudTrail enabled in all regions with logs centralized to the Log Archive account and SCP preventing disabling. Config conformance packs deployed org-wide with auto-remediation.
Evolve - COE process documented and practiced for P1/P2 incidents. Well-Architected Review completed in the last 12 months. AWS FIS experiments run for mission-critical workloads. Cost Optimization Hub reviewed monthly.
That is a starting point, not the full picture. Our AWS Operations Checklist contains 150+ expert-verified checks organized by OE area, with specific AWS service mappings and priority ratings for every item. If you are serious about closing operational gaps, that is the tool to use.
Get the Full AWS Operations Checklist
150+ expert-verified operational checks organized by Organize, Prepare, Operate, and Evolve. Each item maps to specific AWS services with priority ratings so you know exactly where to start.
Conclusion
AWS operational best practices are not a single checklist you complete once. They are a continuous cycle: Organize your team and governance, Prepare your workloads for production, Operate with real-time observability and automated responses, and Evolve based on data from real incidents.
The biggest gaps I see consistently across AWS environments:
- No multi-account governance structure - production and dev sharing an account, SCPs absent, CloudTrail not centralized
- No incident response process - no documented impact ratings, no automated alerting, no post-incident analysis feeding back into improvement
- No programmatic compliance enforcement - no AWS Config conformance packs deployed in any account, drift detection disabled, Config rules unenforced
If you want to act on this immediately: use the checklist above to identify your three biggest gaps. Pick the one that would have the highest blast radius if it failed, and fix that first. For most teams, that is either the multi-account governance structure or the absence of centralized CloudTrail logging.
The Well-Architected Tool is free. Run a review before your next production launch, not after your first major incident. For a full architectural deep dive, the multi-account strategy post covers the OU design in complete detail. For the formal architecture review process, the Well-Architected Review checklist covers all 57 questions.
What is the biggest operational gap you have seen in AWS environments? Drop it in the comments - I am curious whether it shows up on this checklist.
