# 🏭 PRODUCTION SETUP FAQ
Context: Questions related to deploying the Traylinx Core infrastructure to a live production environment.
## Q: What are the possibilities for running Traylinx Core in production, and what is the best setup?
### A: The Production Landscape
To determine the best setup, we analyzed three primary architectural approaches:
#### 1. Quick Start: Docker Compose on a Single VM
- Description: Running the existing `docker-compose.yml` on a large EC2/Compute Engine instance.
- Pros: Extremely simple (same as local dev), cheap, fast to deploy.
- Cons: Single Point of Failure (SPOF). If the VM dies, the entire network goes down. No zero-downtime updates. Hard to scale specific components (e.g., just the Router).
- Verdict: Good for Staging or POC, unsuitable for Production.
#### 2. Serverless (Cloud Run / AWS Fargate)
- Description: Deploying each service as a containerized serverless function.
- Pros: Zero infrastructure management, pay-per-use (great for low traffic).
- Cons: Connection Pooling issues with PostgreSQL (requires PgBouncer/RDS Proxy). "Cold starts" can affect the real-time latency required for high-frequency Agent-to-Agent (A2A) communication.
- Verdict: Viable for the Router, but adds complexity for the Registry and Subscription services due to database connections.
#### 3. Kubernetes (K8s) Cluster (GKE / EKS / AKS) ⭐️ RECOMMENDED
- Description: Running the services as Pods in a managed Kubernetes cluster.
- Pros:
- Native Fit: Traylinx is built as "Kubernetes for Agents"; running it on Kubernetes is the natural choice.
- Resilience: Self-healing (auto-restart crashed pods).
- Scalability: The Router Agent can auto-scale (HPA) independently to handle traffic spikes.
- Zero-Downtime Deployments: Rolling updates for all services (see the sketch below).
- Verdict: The Best Production Setup. It provides the robustness required for an "Operating System" layer.
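For illustration, here is a minimal `Deployment` sketch for the Agent Registry showing the HA and zero-downtime behavior described above. All names here (namespace, image path, port, probe endpoint) are assumptions for the sketch, not values confirmed from the repo.

```yaml
# Hypothetical manifest: namespace, image, port, and probe path are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-registry
  namespace: traylinx
spec:
  replicas: 2                    # static HA pair (see the table in section 2 below)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop a ready replica during rollout
      maxSurge: 1                # bring up the new pod before retiring the old one
  selector:
    matchLabels:
      app: agent-registry
  template:
    metadata:
      labels:
        app: agent-registry
    spec:
      containers:
        - name: agent-registry
          image: ghcr.io/traylinx/agent-registry:1.0.0   # assumed image path
          ports:
            - containerPort: 8000
          readinessProbe:        # gates the rolling update on a healthy pod
            httpGet:
              path: /health      # assumed health endpoint
              port: 8000
```

With `maxUnavailable: 0`, Kubernetes only terminates an old pod once its replacement passes the readiness probe, which is what delivers the zero-downtime property.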
### 🏆 Recommended Production Reference Architecture
#### 1. Infrastructure Layer (The Hardware)
- Orchestration: Managed Kubernetes (GKE Autopilot or EKS).
- Database: Managed PostgreSQL (AWS RDS or Google Cloud SQL).
  - Do NOT run Postgres inside K8s for production. Let the cloud provider handle backups, replication, and patching.
  - Provision separate logical databases: `registry`, `subscription`, `cortex` (see the sketch after this list).
- Redis: NOT REQUIRED (following the successful caching refactor).
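As a sketch of the separate-logical-databases point, each service could get its own Secret holding a `DATABASE_URL` that points at the managed Postgres instance. The host, user, and password below are placeholders; repeat the pattern for `subscription` and `cortex`.

```yaml
# Hypothetical Secret: host, user, and password are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: registry-db
  namespace: traylinx
type: Opaque
stringData:
  DATABASE_URL: postgresql://traylinx:CHANGE_ME@prod-db.example.internal:5432/registry
```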
#### 2. Service Deployment Strategy
Each microservice maps to a Kubernetes workload type:
| Service | K8s Workload | Scaling Strategy | Storage Needs |
|---|---|---|---|
| Agent Registry | `Deployment` | Static (2-3 replicas) for HA. | None (stateless app; state in DB). |
| Router Agent | `Deployment` | Horizontal Pod Autoscaler (HPA), scaling on CPU/RAM. | None (totally stateless). |
| Subscription Service | `Deployment` | Static (2 replicas). | None (state in DB). |
| Traylinx Cortex | `StatefulSet` | Single replica (if holding memory state) or `Deployment` (if using external DB). | PVC for caching (optional). |
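To make the Router Agent row concrete, here is a minimal HPA sketch scaling on CPU and RAM, as the table describes. The deployment name, namespace, replica bounds, and utilization targets are assumptions.

```yaml
# Hypothetical HPA: names, bounds, and targets are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: router-agent
  namespace: traylinx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: router-agent           # the Router Agent Deployment
  minReplicas: 2                 # keep an HA floor even at idle
  maxReplicas: 10                # cap for traffic spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```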
3. Connectivity & Security¶
- Ingress Controller: NGINX Ingress or AWS ALB Controller to manage external access.
- TLS Termination: Use `cert-manager` with Let's Encrypt for automatic HTTPS (see the sketch below).
- Authentication (Sentinel):
  - Keep using the managed Traylinx Sentinel (`api.makakoo.com`) as the Identity Provider (IdP).
  - Inject `TRAYLINX_CLIENT_ID` and `TRAYLINX_CLIENT_SECRET` via Kubernetes Secrets.
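A sketch of the ingress wiring, assuming the NGINX Ingress Controller and a cert-manager `ClusterIssuer` named `letsencrypt-prod`; the hostname and backend service are placeholders.

```yaml
# Hypothetical Ingress: hostname, issuer, and backend are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: traylinx-core
  namespace: traylinx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # cert-manager requests/renews the cert
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - core.example.com
      secretName: traylinx-core-tls                    # TLS secret populated by cert-manager
  rules:
    - host: core.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: router-agent                     # route external traffic to the Router
                port:
                  number: 8000
```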
### ⚠️ Readiness Gap Analysis
Before deploying to this architecture, the following gaps must be addressed:
- Missing Docker Assets:
  - The `traylinx_router_agent` is missing a production `Dockerfile`.
- Configuration Management:
  - Services currently default to `localhost` DB strings. We need to ensure `DATABASE_URL` is overridable via environment variables (this appears supported, but verify). See the sketch after this list.
- Observability:
  - Production needs centralized logging (Fluentd/Datadog) and metrics (Prometheus/Grafana) to trace A2A calls across the cluster.
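To close the configuration gap, each Deployment's pod template can override the `localhost` defaults from Secrets. Below is a fragment of a hypothetical pod spec, reusing the `registry-db` Secret sketched earlier and an assumed `sentinel-credentials` Secret for the Sentinel values.

```yaml
# Hypothetical pod-template fragment: image and Secret names are assumptions.
containers:
  - name: agent-registry
    image: ghcr.io/traylinx/agent-registry:1.0.0
    env:
      - name: DATABASE_URL                 # overrides the localhost default
        valueFrom:
          secretKeyRef:
            name: registry-db
            key: DATABASE_URL
      - name: TRAYLINX_CLIENT_ID
        valueFrom:
          secretKeyRef:
            name: sentinel-credentials     # assumed Secret holding Sentinel creds
            key: TRAYLINX_CLIENT_ID
      - name: TRAYLINX_CLIENT_SECRET
        valueFrom:
          secretKeyRef:
            name: sentinel-credentials
            key: TRAYLINX_CLIENT_SECRET
```

If `DATABASE_URL` turns out not to be read from the environment, that is the code change to make before cutting over.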