I've spent 15 years inside data systems - agencies, ad tech, big tech - watching the same failure mode repeat. Someone pulls paid media data from three platforms, joins it with CRM exports, runs attribution logic in a notebook, and builds a slide deck. Next quarter they do it again from scratch because nothing was reproducible. The pipeline was a person.
Most marketing data teams don't need another dashboard. They need reliable ingestion, transformation, enrichment, and activation - without hiring a platform engineer or maintaining fragile pipelines stitched together from Airflow jobs and cron scripts.
I started building Ca$ino to understand what it would actually take to make that workflow reliable. Not a demo. Not a notebook. An agent that could take messy multi-source data, normalize it, analyze it, visualize it, and produce a report - and do it again next quarter without someone rebuilding the pipeline.
I built it twice.
The first version is a local data science agent. One user, one machine, one workspace. You give it a dataset, it thinks out loud, writes code, makes charts, saves reports. It runs Python in a subprocess on your laptop. That version is open source.
The second version is what powers the live demo on this site. I pushed it further to understand the production questions: What happens when you need client isolation? Reproducibility across sessions? Workflows that don't break when someone else runs them? That version isn't open source - it's full of infrastructure wiring and AWS-specific config that only makes sense in context.
The Workflow That Keeps Breaking
To make this concrete, here's the recurring pattern I kept seeing:
You're pulling paid media performance from multiple ad platforms, CRM lead data, web analytics, and campaign metadata. You need to normalize the schemas, de-duplicate across sources, enrich with AI classification, run statistical analysis, and generate structured outputs for reporting.
Most teams duct-tape Python scripts, dbt, and a prayer. Every quarter someone rebuilds the pipeline because the person who wrote it left. The notebook has 47 cells and nobody remembers which order to run them in.
The architecture I landed on uses isolated, reproducible, agent-driven workflows. The agent remembers the workspace state. The specialists know their jobs. The artifacts persist. Run it again next quarter and you get the same pipeline, not a new adventure. Whether that's the right approach at every scale is an open question - but at the scale I've been testing it, it works.
The Local Architecture
Here's Ca$ino on your laptop. Simple on purpose.
One FastAPI process. One Strands agent created per request. Code runs in a subprocess with a 30-second timeout. Artifacts land on the local filesystem. No auth. No isolation. No queues. No state beyond what's on disk.
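A minimal sketch of that local execution path. This is illustrative, not the actual source — I use `sys.executable` where the real version hardcodes `python3`, and the function name is mine:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_agent_code(code: str, workspace: Path, timeout: int = 30) -> str:
    """Write LLM-generated code to a script and run it in a subprocess.

    Mirrors the local architecture: no sandbox, no isolation - the code
    runs with the same privileges as the server process itself.
    """
    script = workspace / "step.py"
    script.write_text(code)
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
        timeout=timeout,   # the hard 30-second ceiling from the local version
        cwd=workspace,     # artifacts land on the local filesystem
    )
    if result.returncode != 0:
        return f"error: {result.stderr.strip()}"
    return result.stdout

workspace = Path(tempfile.mkdtemp())
print(run_agent_code("print(2 + 2)", workspace))  # -> 4
```

Everything that makes this unsuitable for teams is visible right there: the host process trusts the generated code, and the only resource control is the timeout.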
This is fine for personal analysis. It's a disaster for teams.
What Breaks When Teams Use It
When I started stress-testing for multi-user scenarios, these are the gaps that surfaced. I haven't had ten concurrent users yet - but preparing for it exposed how much the local architecture assumes about trust, isolation, and state.
1. Code execution is a security nightmare. The agent runs LLM-generated Python directly on the host. Locally, you trust yourself. In production with client data, you trust nobody. This is why the production version uses sandboxed code execution.
2. Client data bleeds across workspaces. If two analysts name a dataset campaign_performance.csv, they overwrite each other. There's no tenant boundary. Agencies managing multiple clients can't tolerate this.
3. The agent blocks the event loop. Ten concurrent users means ten agents competing for the same CPU and memory. There's no backpressure. The analyst running a 50MB dataset join starves the one checking a simple metric.
4. There's no state management. No server-side session means no way to resume a workflow if the connection drops. In a 20-minute multi-step analysis, that's not a minor inconvenience.
5. Secrets are environment variables. One .env file for the whole server. Every user shares the same LLM API key. You can't meter usage per client, and you can't let teams bring their own keys.
The Production Architecture
Here's what Ca$ino looks like when it needs to serve real teams with real client data.
Every layer that was implicit locally becomes an explicit, isolated component. The whole system runs in a single Docker container: FastAPI, Uvicorn, the orchestrator, all 26 specialist agents. This is a deliberate tradeoff - one container, easy to deploy, easy to reason about. The agents themselves are lightweight routing logic. The heavy compute happens externally in sandboxed environments. CloudFront handles CDN delivery for visualizations. Redis holds sessions. DynamoDB handles workspace metadata lookups in under 200ms.
People will have opinions about the monolith. More on that below.
Workspace Isolation: Where the Real Engineering Is
The hardest part of going to production isn't scaling compute. It's isolation. If you're running workflows across multiple clients or teams, data, artifacts, and outputs can't bleed across boundaries. I've seen this go wrong at agencies, at enterprise marketing teams, and at platform companies. You need three layers.
Client/Tenant Isolation
A tenant is a client or organization. Everything they produce is invisible to every other tenant.
- Storage: Artifacts are stored under a tenant- and user-scoped S3 prefix: s3://bucket/{tenant_id}/{user_id}/{workspace_id}/.... Access is enforced by server-side authorization and S3 IAM policies scoped to the caller's tenant/user context. The prefix hierarchy is an organizational convention; the enforcement comes from IAM + authz.
- Workspace lookup: DynamoDB uses tenant_id as the partition key, which makes correctly-scoped queries fast and naturally tenant-bounded. Authorization enforces that callers can only access records for their tenant.
- Credentials: Data source credentials are encrypted with KMS using tenant-scoped controls. API keys are stored as non-reversible hashes for verification. No shared secrets across tenants.
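A sketch of how the scoping convention composes. The function names and the assumption of workspace_id as the DynamoDB sort key are mine; as noted above, the real enforcement comes from IAM and server-side authorization, not string formatting:

```python
def artifact_key(tenant_id: str, user_id: str, workspace_id: str, filename: str) -> str:
    """Build the tenant-/user-scoped S3 key. The hierarchy is a naming
    convention; actual access control lives in IAM policies and authz."""
    return f"{tenant_id}/{user_id}/{workspace_id}/datasets/{filename}"

def workspace_query(tenant_id: str, workspace_id: str) -> dict:
    """Shape of a DynamoDB query that is tenant-bounded by construction:
    tenant_id is the partition key, so a single query can never range
    across tenants."""
    return {
        "KeyConditionExpression": "tenant_id = :t AND workspace_id = :w",
        "ExpressionAttributeValues": {
            ":t": {"S": tenant_id},
            ":w": {"S": workspace_id},
        },
    }

print(artifact_key("acme", "u-42", "ws-q3", "campaign_performance.csv"))
# -> acme/u-42/ws-q3/datasets/campaign_performance.csv
```

Two analysts at different agencies can both name a file campaign_performance.csv and never collide, because the tenant scope is baked into every key.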
User Isolation
Within a client account, team members share data but need boundaries:
- RBAC: Owner, editor, viewer roles per workspace. An editor can run analyses and upload data. A viewer can see results but can't execute code or modify datasets.
- Audit log: Every agent action is logged with the user ID, timestamp, and input/output hash. When a client asks "who ran this analysis and when," you can answer.
- Quotas: Per-user rate limits on LLM calls and compute minutes. One analyst can't burn through the client's API budget.
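What an audit entry of that shape might look like, as a sketch with illustrative field names. Hashing the payloads instead of storing them keeps client data out of the log while still letting you prove what ran:

```python
import hashlib
import json
import time

def audit_record(user_id: str, action: str, payload_in: str, payload_out: str) -> dict:
    """One audit entry per agent action: who, when, and content hashes
    for the input and output."""
    return {
        "user_id": user_id,
        "action": action,
        "timestamp": time.time(),
        "input_hash": hashlib.sha256(payload_in.encode()).hexdigest(),
        "output_hash": hashlib.sha256(payload_out.encode()).hexdigest(),
    }

entry = audit_record("u-42", "run_analysis", "SELECT roas FROM campaigns", "roas_by_quarter.csv")
print(json.dumps(entry, indent=2))
```

When a client asks "who ran this analysis and when," the answer is a log query, not an archaeology project.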
Execution Isolation
The agent executes arbitrary Python code generated by an LLM. In production, Ca$ino uses a dual-tier sandbox system.
Bedrock Code Interpreter is the default. Each execution runs in a fresh, ephemeral environment. Isolated compute, no network access, no persistent filesystem. Pre-installed with pandas, numpy, sklearn, matplotlib, plotly, bokeh. The tradeoff: 5-10 second cold start, no seaborn, no pip install, 25MB payload limit.
Daytona is the fallback for complex work. Sub-90ms sandbox creation. Full seaborn support. Pip install available. When Bedrock can't handle it (large datasets, custom packages), the code executor routes to Daytona automatically.
Both sandboxes write results back to the tenant-scoped S3 workspace. The sandbox is stateless. The workspace is the state.
The live demo uses the same isolation model. Each browser session auto-registers a per-session user identity. All artifacts are written under that user's scoped workspace prefix, and the backend enforces that requests can only access data owned by that identity. This gives strong logical isolation between demo visitors; production tenants use the same model, with stricter IAM/KMS policies and audit controls.
The local version runs subprocess.run(["python3", script]) with a 30-second timeout. The production version has two sandboxes. Bedrock for standard work, Daytona for anything that needs more flexibility. Same interface from the agent's perspective: code in, results out. The code executor picks the right sandbox based on the task.
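One plausible shape for that routing decision, using the limits described above as illustrative constants. The real executor's heuristics are more involved; this just shows the decision structure:

```python
BEDROCK_PAYLOAD_LIMIT = 25 * 1024 * 1024  # the 25MB payload ceiling
BEDROCK_PACKAGES = {"pandas", "numpy", "sklearn", "matplotlib", "plotly", "bokeh"}

def pick_sandbox(payload_bytes: int, required_packages: set[str]) -> str:
    """Default to Bedrock Code Interpreter; fall back to Daytona when the
    task exceeds the payload limit or needs packages outside the
    pre-installed set (e.g. seaborn, which requires pip install)."""
    if payload_bytes > BEDROCK_PAYLOAD_LIMIT:
        return "daytona"
    if not required_packages <= BEDROCK_PACKAGES:
        return "daytona"
    return "bedrock"

print(pick_sandbox(1_000_000, {"pandas", "matplotlib"}))  # -> bedrock
print(pick_sandbox(40_000_000, {"pandas"}))               # -> daytona (too big)
print(pick_sandbox(1_000_000, {"seaborn"}))               # -> daytona (needs pip)
```

From the agent's side the interface never changes: code in, results out. Only the execution environment moves.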
Governance note: Agent systems that execute arbitrary code on client data create audit surface area that most compliance frameworks aren't designed for. Every tool call, every code execution, every artifact write in this system is logged with the user ID, timestamp, and input/output hash. Any production agent system that doesn't have this is accumulating governance debt that compounds silently until an incident forces it into the open. If you're handling client data in a regulated industry, this isn't optional.
The Swarm Pattern
The production version doesn't use a single agent. It uses a swarm.
A single agent is good at following instructions. But a real data workflow isn't one instruction. It's: ingest data, clean it, explore it, run statistics, visualize it, write a report. Each step has different expertise. Asking one agent to do all of that is like asking one person to be the data engineer, statistician, designer, and writer simultaneously. They'll compromise on everything.
A swarm breaks this into specialized agents that coordinate.
How it works in production
1. The orchestrator handles the request. User says "analyze this campaign data and tell me why Q3 ROAS dropped." The orchestrator can handle simple tasks directly. If the task is complex, it calls escalate_to_swarm.
2. The strategic planner decomposes the work. This is the brain of the swarm. It breaks the request into steps, routes to specialists, tracks progress, and handles errors. If a specialist fails twice with the same error, the planner gets the error history injected into its context and tries a different approach.
3. Specialists do focused work. The data operations agent handles SQL queries (via DuckDB), pandas transforms, and statistics. The visualization agent writes matplotlib, plotly, or bokeh code and sends it to the sandbox. The Google Analytics agent pulls GA4 data. The MongoDB agent runs aggregations. 26 specialists total, each with a narrow system prompt and a focused tool set.
4. Specialists share a workspace, not a conversation. Agents don't pass messages to each other directly. They read from and write to a shared workspace on S3. The data operations agent saves a cleaned CSV. The viz agent finds it and makes charts. The workspace is the communication layer.
5. The planner decides when it's done. After each specialist completes, the planner evaluates: is there more work? Did something fail? When the full workflow is satisfied, it presents the final results.
Stateless agents, shared state
Every agent in the swarm is stateless. They boot, do their job, write to the workspace, and die. No long-lived processes. No in-memory state to lose.
The workspace is the coordination protocol.
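A toy version of that protocol, with the S3 workspace stubbed as a local directory and two hypothetical specialists. Neither agent knows the other exists; the artifact is the handoff:

```python
import csv
import tempfile
from pathlib import Path

workspace = Path(tempfile.mkdtemp())  # stands in for the S3 workspace
(workspace / "datasets").mkdir()

def data_ops_agent(workspace: Path) -> None:
    """Specialist 1: cleans data and writes an artifact to the workspace.
    It never messages the viz agent - the file IS the communication."""
    out = workspace / "datasets" / "cleaned_campaigns.csv"
    with out.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["campaign", "roas"])
        writer.writerow(["q3_search", "2.1"])

def viz_agent(workspace: Path) -> list[dict]:
    """Specialist 2: discovers whatever artifacts exist and works from them."""
    path = workspace / "datasets" / "cleaned_campaigns.csv"
    with path.open() as f:
        return list(csv.DictReader(f))

data_ops_agent(workspace)
print(viz_agent(workspace))  # -> [{'campaign': 'q3_search', 'roas': '2.1'}]
```

Because the coordination state lives in the workspace rather than in any process, either agent can die and restart without the other noticing.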
This means you can add a new specialist agent without touching any existing agent's code. A "media mix modeling" agent that reads campaign data and writes attribution outputs. An "audience segmentation" agent that clusters CRM data. Each one just reads from the workspace and writes back to it. Plug it into the orchestrator's routing table and it's live.
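A sketch of what that registration might look like. The decorator and table are illustrative, not the actual orchestrator API, but the property they demonstrate is the real one: new specialists plug in without touching existing agents:

```python
from typing import Callable

ROUTING_TABLE: dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Adding a specialist is a registration, not a code change to any
    existing agent - the orchestrator routes by capability name."""
    def wrap(agent: Callable[[str], str]):
        ROUTING_TABLE[name] = agent
        return agent
    return wrap

@register("media_mix_modeling")
def mmm_agent(task: str) -> str:
    # reads campaign data from the workspace, writes attribution outputs
    return f"attribution outputs for: {task}"

@register("audience_segmentation")
def segmentation_agent(task: str) -> str:
    # clusters CRM data from the workspace, writes segment artifacts
    return f"clusters for: {task}"

print(sorted(ROUTING_TABLE))
print(ROUTING_TABLE["media_mix_modeling"]("Q3 campaigns"))
```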
This artifact-based communication pattern maps cleanly to how marketing teams actually work. The data engineer hands off cleaned exports. The analyst produces findings. The strategist creates the deck. Nobody sends messages to each other - they share files in a common workspace. The swarm pattern isn't a technical novelty. It's a formalization of how cross-functional marketing work already happens.
Why swarms beat a single agent
A single agent with access to all tools can do everything. But it does it poorly at scale because:
- Context window pressure. A complex analysis might involve 20+ tool calls. By the end, the agent has consumed most of its context window with tool results, and its reasoning degrades.
- Role confusion. An agent that's simultaneously a data engineer, statistician, visualization designer, and report writer will compromise on all of them. Specialists with focused system prompts produce better output.
- Parallelism. The stats agent and the viz agent can work simultaneously. A single agent is sequential.
- Failure isolation. If the viz agent crashes, the data loading and analysis work is preserved. With a single agent, a late failure means retrying everything.
- Model optimization. Each specialist can use the model best suited to its task. You don't need Opus for data cleaning. You don't want Haiku writing your executive summary.
How this scales (and where it doesn't)
The agents are stateless and the compute is external. The container itself is mostly routing logic. S3 and DynamoDB scale on their own. Sandboxes are ephemeral, created on demand.
Need a new capability? Write a new specialist agent and register it with the orchestrator. No existing agent changes. Each new specialist makes the swarm more capable without making any existing agent more complex.
But the single container is a monolithic architecture, and the tradeoffs are real:
If one agent misbehaves, it can affect the others. A visualization agent that leaks memory during a heavy chart generation run puts pressure on every other agent in the same process. There's no process-level isolation between agents.
No independent scaling. If the visualization agent is getting hammered, you can't scale just that one. You scale the whole container. A Kubernetes-style microservice-per-agent architecture would let you do this, but the operational complexity of running 26 microservices is its own nightmare. That's not a problem I need right now.
Resource contention is real with concurrent users. Pandas alone on a moderately sized dataset can eat 3-4x the file size in memory. A 10MB CSV becomes 30-40MB just to load. Then analysis on top, visualizations streaming, multiple users. The ceiling arrives faster than you'd think.
The defense is legitimate: for early production, this is completely reasonable. A lot of serious production systems start exactly here and only decompose when a specific pain point forces it. I don't have that pain point yet. When I do, I'll know because the telemetry will show it.
SSE streaming helps here more than it might appear. Users aren't sitting on an open connection waiting for a batch response - they see output as it arrives. Perceived performance beats actual performance: the system can be slower than you'd expect and users won't feel it, because something is always happening on screen.
What This Costs to Run
Part of the reason I build things like this is to understand what production actually feels like - not the architecture diagram version, but the invoice version. I work in cloud infrastructure during the day. Building on it at night with my own money is a different education.
Here's what the monthly bill looks like:
- ECS Fargate (4GB / 1 vCPU, always-on): ~$120/month
- S3 (workspace storage + CloudFront): ~$5-15/month
- DynamoDB (on-demand, metadata lookups): ~$5-10/month
- Redis (ElastiCache, session cache): ~$15-25/month
- Bedrock (code interpreter sessions): ~$20-50/month depending on usage
- LLM API calls (Anthropic/OpenAI): the variable that actually matters - $50-500+/month depending on model choice and query volume
That's $200-700+/month for a single instance. Before horizontal scaling. Before monitoring. Before the ALB and WAF and all the other managed service overhead.
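The line items above sum like this:

```python
# (low, high) monthly cost per line item, in USD, from the list above
monthly = {
    "ecs_fargate": (120, 120),
    "s3_cloudfront": (5, 15),
    "dynamodb": (5, 10),
    "redis": (15, 25),
    "bedrock_sandbox": (20, 50),
    "llm_api": (50, 500),  # the variable that actually matters
}

low = sum(lo for lo, _ in monthly.values())
high = sum(hi for _, hi in monthly.values())
print(f"${low}-{high}+/month")  # -> $215-720+/month, the $200-700+ range above
```

Note where the variance lives: infrastructure is roughly fixed, and almost the entire spread comes from LLM API calls.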
I'm not building a business here. This is a learning platform and a portfolio piece. But the cost profile is real, and anyone evaluating this space should know what production agent infrastructure actually costs before they start. Multi-tenant isolation, sandboxed execution, managed databases, LLM API calls - these are enterprise costs. Running them personally to learn how they compose is a choice, and not a cheap one.
That's actually part of the lesson. You don't understand the cost pressure of a production agent system until you're paying the bill yourself.
What telemetry will tell me
I don't know yet where the real bottlenecks are. I have hypotheses - the visualization agent doing concurrent matplotlib renders, memory pressure from large DataFrames, context window overflow on complex multi-step analyses. But hypotheses aren't data.
The production system has OpenTelemetry hooks on every agent turn, tool execution, and LLM call. When I have enough production traffic to see patterns, the telemetry will tell me:
- Which agents consume the most memory and for how long
- Where latency actually accumulates (LLM calls vs. sandbox boot vs. S3 I/O)
- Whether context window pressure correlates with quality degradation
- Which agents would benefit from independent scaling vs. which are fine sharing a process
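The production system uses OpenTelemetry for this; as a stdlib stand-in, here's the kind of per-turn measurement those hooks capture (names and the decorator shape are illustrative):

```python
import time
import tracemalloc
from functools import wraps

METRICS: list[dict] = []

def instrumented(agent_name: str):
    """Record wall time and peak memory per agent turn - a toy version of
    what the OpenTelemetry spans capture in production."""
    def wrap(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            tracemalloc.start()
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                _, peak = tracemalloc.get_traced_memory()
                tracemalloc.stop()
                METRICS.append({
                    "agent": agent_name,
                    "seconds": time.perf_counter() - start,
                    "peak_bytes": peak,
                })
        return inner
    return wrap

@instrumented("viz_agent")
def render_chart(points: int) -> int:
    data = [i * 1.5 for i in range(points)]  # stands in for a matplotlib render
    return len(data)

render_chart(100_000)
print(METRICS[0]["agent"], METRICS[0]["peak_bytes"] > 0)
```

Aggregate records like these across real traffic and the decomposition question answers itself: the agent with the worst peak-memory profile is the first candidate to split out.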
That data will drive the decomposition decisions. I'm not going to prematurely split into microservices because a design document says I should. I'll split when the telemetry shows a specific agent is the bottleneck. Until then, the monolith is simpler to operate and easier to debug.
The Problem This Keeps Solving
I keep running into the same pattern, at every scale:
- The quarterly rebuild. An analyst builds a pipeline for Q3 reporting. Q4 comes and they rebuild it from scratch because nothing was saved in a reproducible way. The pipeline was a person, not a system.
- Client data bleeding. An agency runs the same analysis for five clients and keeps them separated with folder naming conventions and good intentions. One mistake and Client A sees Client B's numbers.
- The fragile stack. A growth team has Airflow, dbt, a Jupyter notebook, and three cron jobs. One breaks and nobody knows which. The person who built it left.
- AI as a toy. Someone demos an LLM analyzing a CSV. Leadership is impressed. Then they ask "can we run this on real client data, with audit trails, for 10 users?" and the answer is no.
This isn't a product pitch. It's the set of problems I kept hitting that motivated the architecture. The system exists because I got tired of watching the same failure modes repeat across teams, and I wanted to understand what it actually takes to solve them.
What Stayed the Same
I built both the local and production versions. The open source version is the one you can clone and run in 30 seconds. The production version is the one serving the live demo. Here's what stayed identical:
- Workspace as a directory structure, not a database schema. Locally it's ./workspace/{id}/datasets/. In production it's s3://bucket/{tenant}/{workspace}/datasets/. Same shape.
- Stateless agents created per request. No shared state between requests means horizontal scaling is trivial.
- Tools communicate through the workspace, not through agent memory. This is what makes the swarm pattern possible. Swap one agent for five, and they all read/write the same workspace.
- Provider abstraction via Strands SDK. The local version uses Anthropic, OpenAI, Gemini, Mistral, or Ollama. The production version uses Bedrock. Same agent code.
- SSE streaming from the start. The frontend contract doesn't change whether it's one agent or a swarm behind the endpoint.
The gap between local and production is real. It's auth, isolation, sandboxing, observability, and infrastructure. But the agent logic, the tools, the prompts, the workspace model: that's the same code.
The lesson: build the workflow right once. Scale the infrastructure around it.
From Analyst to Storyteller
The biggest change between v1 and v2 wasn't infrastructure. It was personality.
The first version of Ca$ino was a data science assistant. Polite. Competent. Forgettable. It would load a dataset, run describe(), spit out a table of statistics, and ask what you wanted next. Every data science agent does this.
I rewrote the persona entirely. Ca$ino is now a data storyteller. It doesn't just analyze data. It interrogates it. It has opinions. It notices when a distribution is suspiciously bimodal and says so. It titles charts with insights, not descriptions. "ROAS drops 40% after Q3 bid strategy change" instead of "Revenue Over Time."
The system prompt tells the agent to think out loud. Share hunches. Be honest about uncertainty. Write like a human, not a textbook.
The output is different. Instead of a correlation matrix and a flat summary, you get a briefing. The agent finds the story in the data and frames it visually. Dark backgrounds, annotated callouts, accent colors that highlight what matters.
In the swarm, this personality lives in the viz agent and the report agent. The data engineer doesn't need to be creative. The stats agent doesn't need to tell stories. But the agents that face the user do. The swarm lets you give each specialist the exact personality its job requires.
The tools and infrastructure are table stakes. The difference is in the output. It doesn't just show you numbers. It tells you what they mean.
Try Both
The local version is open source. Five providers, seven tools, one personality.
```shell
git clone https://github.com/KeonCummings/casino.git
cd casino
echo 'LLM_PROVIDER=anthropic' > .env
echo 'LLM_API_KEY=your-key' >> .env
docker compose up --build
```
The live demo is the production version. Sandboxed execution, tenant-isolated workspaces, the full swarm. Same tools, same workspace model, different infrastructure. Go break it. Connect your own data sources. Bring a Kaggle dataset or point it at your MongoDB. The workspace isolation model works the same whether it's demo data or a production database. But maybe don't wait too long - I've seen the invoice, and this thing is absolutely getting scaled down to a t3.micro and a prayer before summer.