diff --git a/README.md b/README.md
index 7f87cf9..77bdcdd 100644
--- a/README.md
+++ b/README.md
@@ -38,6 +38,7 @@ AWS cost tooling is powerful, but day-to-day cost visibility can still feel frag
 - AWS account onboarding through a standardized `AssumeRole` flow
 - Persisted reporting model backed by PostgreSQL rather than live Cost Explorer requests on every view
 - Budget alerts and notification workflows backed by worker processes
+- EventBridge-scheduled Lambda execution for recurring verified-account cost sync
 - SES-backed auth and alert email delivery
 - Terraform-managed infrastructure for DNS, SES, CI/CD bootstrap, ECS, RDS, S3, and CloudFront
 
@@ -189,6 +190,7 @@ npm test
 - the current production pattern uses:
   - `underflow.` for the web frontend
   - `api.underflow.` for the API
+- scheduled cost sync runs through EventBridge + Lambda, while the ECS worker stays focused on alert evaluation
 - Terraform can provision:
   - Route 53 hosted zone, SES identity, DKIM, MAIL FROM, and DMARC records
   - bootstrap CI/CD infrastructure such as Terraform remote state and GitHub OIDC
diff --git a/docs/architecture.md b/docs/architecture.md
index 7721336..1b42ab1 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -71,6 +71,7 @@ Infrastructure code provisions the production AWS footprint and supporting integ
 - A scheduled Lambda can run cost sync across all verified AWS accounts on a fixed interval
 - Reporting endpoints expose summary, by-service, timeseries, and sync history views
 - The frontend presents this through workspace-scoped dashboards and detail pages
+- Manual syncs and scheduled syncs share the same persistence path and use advisory locks to avoid duplicate per-account work
 
 ### Alerts and notifications
 
@@ -78,6 +79,17 @@ Infrastructure code provisions the production AWS footprint and supporting integ
 - The ECS worker evaluates active alerts on a schedule
 - Notification delivery and status are persisted and surfaced in the frontend feed
 
+## Runtime Ownership
+
+- ECS API
+  - handles browser/app HTTP traffic
+  - owns auth, workspace management, AWS account onboarding, reporting APIs, and manual sync triggers
+- ECS worker
+  - handles scheduled alert evaluation and related background work
+- Lambda + EventBridge
+  - handles recurring verified-account cost sync every 6 hours
+  - writes CloudWatch invocation logs and DB-backed sync history through existing `cost_sync_runs`
+
 ## Email / SES Integration Boundary
 
 Email is treated as a real integration boundary rather than a mocked afterthought.
@@ -93,6 +105,9 @@ Email is treated as a real integration boundary rather than a mocked afterthough
 - Background processing is intentionally split by responsibility:
   - Lambda handles scheduled cost sync
   - ECS worker handles alert evaluation
+- Runtime configuration is intentionally split as well:
+  - shared DB/AWS/logging config is used by API, worker, and Lambda
+  - auth/cookie-specific config is validated only in the API runtime
 - Some cloud integrations are fully wired but still benefit from live-account validation before they should be considered fully hardened
 
 ## What A Reviewer Should Notice
diff --git a/docs/production-deployment.md b/docs/production-deployment.md
index fbd7a34..81b9a23 100644
--- a/docs/production-deployment.md
+++ b/docs/production-deployment.md
@@ -77,6 +77,8 @@ npm install
 npm run build:lambda
 ```
 
+The Lambda artifact is built from the API codebase and must exist before Terraform can package it locally.
+
 ### 4. Run the first production apply
 
 The deploy workflow is designed to own the ongoing rollout, but it is still useful to understand the shape:
@@ -155,6 +157,7 @@ Add these as `production` environment secrets in GitHub:
 - applies Terraform with the image URI
 - runs migrations through ECS
 - waits for API and worker services to stabilize
+- updates the scheduled sync Lambda code package and handler configuration when Lambda-related changes are present
 
 ### `deploy-web.yml`
 
@@ -187,6 +190,7 @@ For the split-domain setup, prefer leaving `AUTH_COOKIE_DOMAIN` empty so auth co
 - writes visible execution history through existing `cost_sync_runs` rows
 - emits invocation-level logs to CloudWatch
 - syncs all verified AWS accounts while relying on advisory locks to avoid duplicate per-account work
+- validates only the shared runtime env needed for DB/AWS/logging rather than API-only auth/cookie config
 
 ### Web
 
@@ -213,5 +217,53 @@ Run these checks immediately after the first deployment:
 ## Rollback Guidance
 
 - roll back API/worker by redeploying the previous image tag
+- roll back the scheduled sync Lambda by applying the previous Terraform/code revision if the issue is limited to recurring sync
 - redeploy the previous web build if the issue is frontend-only
 - if a migration introduced the problem, stop and restore from backup rather than improvising production SQL
+
+## Lambda Troubleshooting
+
+If the scheduled sync Lambda fails in production:
+
+1. Inspect the Lambda invocation response with tail logs:
+
+```bash
+aws lambda invoke \
+  --region us-west-2 \
+  --function-name underflow-prod-scheduled-cost-sync \
+  --log-type Tail \
+  response.json \
+  --query 'LogResult' \
+  --output text | base64 --decode
+
+cat response.json
+```
+
+2. Check the configured handler:
+
+```bash
+aws lambda get-function-configuration \
+  --region us-west-2 \
+  --function-name underflow-prod-scheduled-cost-sync \
+  --query 'Handler' \
+  --output text
+```
+
+Expected handler:
+
+```text
+dist/jobs/scheduled-cost-sync-handler.handler
+```
+
+3. If local `terraform apply` is being used, rebuild the Lambda artifact first:
+
+```powershell
+cd apps\api
+npm run build:lambda
+```
+
+4. If the Lambda still fails before structured app logs appear, look for bootstrap errors such as:
+
+- `Runtime.HandlerNotFound`
+- missing module/package errors
+- missing required runtime environment variables
diff --git a/docs/production-operations.md b/docs/production-operations.md
index 12c5f24..570cd8c 100644
--- a/docs/production-operations.md
+++ b/docs/production-operations.md
@@ -13,6 +13,7 @@ This document is the minimum viable runbook for operating Underflow with a first
 - scheduled cost sync Lambda
   - runs periodic verified-account sync every 6 hours
   - writes invocation logs to CloudWatch and sync history through existing DB tables
+  - uses shared runtime DB/AWS/logging config rather than the API-only auth/cookie config
 - `apps/web`
   - static frontend served separately from the API
 - PostgreSQL
@@ -55,6 +56,13 @@ Production defaults and expectations:
 6. Deploy the frontend.
 7. Verify health and a basic authenticated page load.
 
+If deploying locally through Terraform instead of GitHub Actions, rebuild the Lambda artifact before `plan` or `apply`:
+
+```powershell
+cd apps\api
+npm run build:lambda
+```
+
 ### Roll back
 
 1. Roll back the API and worker to the last known good image/build.
@@ -87,6 +95,8 @@ Recommended counters to track from logs:
 
 - successful sync count
 - failed sync count
+- scheduled Lambda invocation count
+- scheduled Lambda failure count
 - successful alert delivery count
 - failed alert delivery count
 - failed auth email delivery count
@@ -112,6 +122,23 @@ Check:
 - the selected account/date filters in the UI are correct
 - sync history does not show Cost Explorer permission or data-availability errors
 
+### Scheduled sync Lambda fails before app logs appear
+
+Check:
+
+- the deployed Lambda handler matches:
+  - `dist/jobs/scheduled-cost-sync-handler.handler`
+- the latest Lambda artifact was rebuilt before Terraform applied it
+- the Lambda invoke response includes tail logs from:
+  - `aws lambda invoke --log-type Tail ...`
+- required shared runtime env is present:
+  - `DATABASE_URL`
+  - `DATABASE_SSL_ENABLED`
+  - `DATABASE_SSL_REJECT_UNAUTHORIZED`
+  - `AWS_SES_REGION`
+  - `COST_SYNC_LOOKBACK_DAYS`
+  - `LOG_LEVEL`
+
 ### SES / email failures
 
 Check:
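
The docs in this change say the scheduled sync Lambda validates only the shared DB/AWS/logging environment rather than the API-only auth/cookie config. A minimal sketch of what such a check might look like follows. The variable names come from the runbook list in this diff; the helper names `findMissingSharedEnv` and `assertSharedEnv` are hypothetical and are not taken from the repository, so the real validation may be structured differently.

```typescript
// Hypothetical sketch: not the repository's actual validation code.
// The variable names are the shared runtime env listed in the runbook.
const SHARED_RUNTIME_ENV = [
  "DATABASE_URL",
  "DATABASE_SSL_ENABLED",
  "DATABASE_SSL_REJECT_UNAUTHORIZED",
  "AWS_SES_REGION",
  "COST_SYNC_LOOKBACK_DAYS",
  "LOG_LEVEL",
] as const;

type EnvSource = Record<string, string | undefined>;

// Returns the names of shared variables that are unset or blank.
function findMissingSharedEnv(env: EnvSource): string[] {
  return SHARED_RUNTIME_ENV.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Throws one error naming every missing variable, so a bootstrap failure
// reports the whole gap in a single log line instead of one name at a time.
function assertSharedEnv(env: EnvSource): void {
  const missing = findMissingSharedEnv(env);
  if (missing.length > 0) {
    throw new Error(`Missing required runtime env: ${missing.join(", ")}`);
  }
}
```

In a Node runtime this could be called with `process.env` at the top of the handler module, assuming the handler is the first place configuration is read. Auth/cookie variables are deliberately absent from the list, which mirrors the split described in `docs/architecture.md`.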