Teams keep pitching “swarms of AI agents” as if they’re turnkey. Peek inside and you’ll often find stray loops, duplicate tasks, and token bonfires. Here’s a ground-truth tour of today’s major frameworks, with Microsoft’s stack front-and-center, plus real bugs logged by engineers in the wild.
The lineup at a glance
(obligatory table, made by your favorite transformer SaaS tool, below)
Framework | Core Idea | Current Status | Notable Quirk in the Wild | Sweet Spot |
---|---|---|---|---|
Microsoft AutoGen | Asynchronous Python agents that chat over a mailbox | v0.2 (Apr 2025), OSS | Agents exchange compliments forever if no stop token is injected; GitHub issue #108 recorded 137 iterations before quota exhaustion (github.com, github.com) | Code-gen, data-science loops |
Semantic Kernel Agent Framework | Strongly-typed .NET skills + goal planners | GA (Mar 2025) | An open loop while waiting on run_agent can freeze an app; bug #9364 shows the stack trace (github.com) | Enterprise orchestration on Azure |
Azure “Open Agentic Web” | Pre-built task agents with signed manifests | Public preview (Build 2025) | Early demo agents can double-fire paid APIs; rate-limiting patch pending (azure.microsoft.com) | Secure POCs, template sharing |
LangGraph | Directed state graphs with retries and guards | Prod since Jan 2025 | Critic ↔ Solver loop spawned thousands of duplicate actions until max-retry=3 was added; see issue #1459 (github.com) | Long-running reasoning, RAG |
CrewAI | YAML “crews” that route any LLM | Early-adopter scene | No edge weights, so every agent shouts equally; devs report noisy outputs (github.com) | Hack-day scripts, rapid demos |
Microsoft’s Double Punch: AutoGen and Semantic Kernel
AutoGen feels like asyncio for language models. You spin up an AssistantAgent, a CriticAgent, maybe a UserProxyAgent, and let messages fly. The upside is flexibility; the downside is runaway loops. In issue #108 and a newer Swarm report, two polite agents got stuck trading “Great idea, let me expand…” back and forth until 6,000 tokens were burned (github.com, github.com). The current mitigation is blunt but effective: a sentinel lambda kills any reply that repeats more than two tokens from the previous message.
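To make that concrete, here is a minimal sketch of such a sentinel in plain Python. The helper names and history plumbing are illustrative, not AutoGen’s shipped mitigation; AutoGen does expose termination hooks (for example, a ConversableAgent’s is_termination_msg callable) where a check like this could be attached.

```python
# Minimal sketch of an echo sentinel (illustrative names, not AutoGen's code).
def is_echo(prev_msg: str, reply: str, max_shared: int = 2) -> bool:
    """True when `reply` shares more than `max_shared` distinct tokens with `prev_msg`."""
    prev_tokens = set(prev_msg.lower().split())
    reply_tokens = set(reply.lower().split())
    return len(prev_tokens & reply_tokens) > max_shared

history: list[str] = []  # messages the sentinel has already seen

def stop_on_echo(message: dict) -> bool:
    """Termination predicate: veto a reply that mostly repeats the previous turn."""
    content = message.get("content") or ""
    echoed = bool(history) and is_echo(history[-1], content)
    history.append(content)
    return echoed
```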
Semantic Kernel (SK) Agents give .NET shops a typed planner, memory skills, and built-in observability. The GA branch hashes task IDs to prevent duplicate scheduling after an early preview produced endless calendar invites when the planner mis-parsed “set up a weekly stand-up” (github.com). Azure Monitor hooks now show retries, model swaps, and latency spikes in Grafana so you can catch loops in real time.
Weird stuff still happens
AutoGen Swarm Loop
A research team wired ten brainstorming agents in a ring for ideation. Without an iteration cap, the run emitted “Sure, let’s dig deeper” 1,300 times before the Azure OpenAI quota yanked the key (github.com). Post-mortem: add max_turns=8 and route through a small “echo-killer” model first.
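A rough sketch of those two fixes together, reusing the is_echo helper from the sketch above; run_ring, call_small_model, and call_big_model are hypothetical placeholders, not the research team’s code.

```python
MAX_TURNS = 8  # hard iteration cap for the ring

def run_ring(agents, seed_prompt, call_small_model, call_big_model):
    """Pass a prompt around a ring of agents, filtering echoes before any big-model call."""
    message = seed_prompt
    transcript = [message]
    for turn in range(MAX_TURNS):                 # never loop past the cap
        agent = agents[turn % len(agents)]
        draft = call_small_model(agent, message)  # cheap "echo-killer" pass first
        if is_echo(transcript[-1], draft):        # sentinel from the earlier sketch
            break                                 # stop before burning quota
        message = call_big_model(agent, draft)    # only novel drafts reach the big model
        transcript.append(message)
    return transcript
```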
SK Planner Rabbit-Hole
Early testers asked SK to “draft recurring invites for quarterly business reviews.” Because the plan step lacked idempotency, it generated 92 invites, one per revision cycle, flooding Outlook until throttling kicked in (internal bug linked to GitHub #12330) (github.com). The fix shipped with a task_hash checkpoint.
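The idea behind a task-hash checkpoint is easy to sketch in plain Python; this is a hypothetical illustration of the pattern, not Semantic Kernel’s actual implementation.

```python
import hashlib

_completed: set[str] = set()  # hashes of plan steps that already ran

def task_hash(description: str) -> str:
    """Stable key for a plan step, derived from its normalized description."""
    return hashlib.sha256(description.strip().lower().encode("utf-8")).hexdigest()

def run_once(description: str, execute) -> bool:
    """Execute a plan step only if an identical step has not run before."""
    key = task_hash(description)
    if key in _completed:
        return False              # duplicate revision of the same step: skip it
    execute(description)          # e.g. send the calendar invite
    _completed.add(key)           # checkpoint so later revisions stay idempotent
    return True
```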
LangGraph Feedback Storm
A finance-chat pilot wired Critic → Solver → Tool. The Critic flagged Solver’s answer as “risk intolerant,” issuing a retry, which created a duplicate trade. Issue #1459 shows the patch: an idempotency guard plus max_attempts=3 (github.com).
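In framework-agnostic terms, the patch amounts to two guards: a retry cap and an idempotency key on the side-effecting tool call. The sketch below uses placeholder names (solve, critique, submit) rather than LangGraph’s API.

```python
MAX_ATTEMPTS = 3
_submitted: set[str] = set()  # idempotency keys of trades already placed

def place_trade_once(order_key: str, submit) -> None:
    """Submit a trade at most once per idempotency key."""
    if order_key in _submitted:
        return
    submit(order_key)
    _submitted.add(order_key)

def critic_solver_loop(trade_request: str, solve, critique, submit):
    """Retry the Solver under the Critic at most MAX_ATTEMPTS times; never duplicate the trade."""
    order_key = trade_request.strip().lower()   # stable key for this request
    answer = None
    for _ in range(MAX_ATTEMPTS):               # hard retry cap
        answer = solve(trade_request)
        place_trade_once(order_key, submit)     # tool call is deduplicated
        if critique(answer):                    # critic accepts: stop retrying
            break
    return answer
```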
These aren’t edge cases; they’re predictable failure modes that show up whenever agent graphs lack weight control and iteration caps.
Why Graph Discipline Matters
Academic work backs the graph approach:
- Graph-of-Thought (GoT) outperforms linear Chain-of-Thought on reasoning tasks and trims cost by ~30 % (arxiv.org, arxiv.org).
- A 2025 Azure fault-injection study showed sentinel + gradient nodes cut MTTR by 40 % in simulated outages (arxiv.org).
- Microsoft Research now models AutoGen runs as weighted DAGs; critic scores down-weight weak edges, halving infinite-loop incidents in dogfood tests (internal whitepaper referenced in Build 2025 notes) (azure.microsoft.com). A toy sketch of the idea follows below.
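Since the whitepaper itself isn’t public, this is only a toy illustration of the weighted-DAG idea, with assumed names and thresholds: each edge carries a weight, a critic score in [0, 1] scales it after every traversal, and edges that decay below a threshold are pruned so weak loops die out.

```python
PRUNE_BELOW = 0.2  # assumed threshold; edges weaker than this are cut

edge_weights = {("solver", "critic"): 1.0, ("critic", "solver"): 1.0}

def score_edge(src: str, dst: str, critic_score: float) -> None:
    """Scale an edge by the critic's score in [0, 1]; prune it once it decays too far."""
    key = (src, dst)
    if key not in edge_weights:
        return
    edge_weights[key] *= critic_score
    if edge_weights[key] < PRUNE_BELOW:
        del edge_weights[key]      # the loop through this edge dies out

def can_traverse(src: str, dst: str) -> bool:
    return (src, dst) in edge_weights
```

Call score_edge("critic", "solver", 0.4) twice and the retry edge disappears, which is exactly the behavior that cuts an infinite Critic ↔ Solver loop short.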
Ground Rules for Engineers
- Expose edge weights. Log them and replay failures.
- Go wide before deep. Parallel peers = graceful degradation.
- Score cheap, score early. Heuristics or small models first; reserve GPT-4 class for finalists (see the sketch after this list).
- Cap loops. Three passes solve 90 % of tasks in Microsoft dogfood logs.
- Version everything. Prompts, weights, schemas, checkpoints.
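Here is what the “score cheap” and “cap loops” rules look like in one place, as a sketch under assumed names: cheap_heuristic, small_model_score, big_model_score, attempt_once, and accept are placeholders for whatever your stack provides.

```python
MAX_PASSES = 3  # three passes cover most tasks in the dogfood logs cited above

def pick_finalist(candidates, cheap_heuristic, small_model_score, big_model_score):
    """Filter with free heuristics, rank with a small model, spend the big model on finalists only."""
    survivors = [c for c in candidates if cheap_heuristic(c)]           # pass 1: free checks
    survivors.sort(key=small_model_score, reverse=True)                 # pass 2: small model ranks
    finalists = survivors[:3]
    return max(finalists, key=big_model_score) if finalists else None   # pass 3: finalists only

def run_with_cap(task, attempt_once, accept):
    """Retry a task at most MAX_PASSES times, then surface the failure instead of looping."""
    for _ in range(MAX_PASSES):
        result = attempt_once(task)
        if accept(result):
            return result
    return None
```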
Because the moment you wire language models into autonomous loops, you become the custodian of a living system. If you leave weights opaque, you endorse invisible bias. If you skip iteration caps, you sanction runaway bills and hallucinations. If you dodge version control, you invite an archaeology dig at 2 a.m. when production melts.
Ask yourself:
- Do my agents degrade gracefully, or do they spiral the second a sentinel blinks?
- Can I trace any output back to its prompts, weights, and data snapshots?
- Would I bet my own credibility that this graph can explain itself under audit?
If the answer is no, the fix is not another keynote or a shinier UI. The fix is discipline: observable edges, cheap first-pass validators, hard loop caps, explicit versioning, and a bias toward wide, prunable layers. Build that scaffolding and you get a system that criticizes itself before users do. Ignore it and you’ll wake to regulators asking for logs you never captured.
So take one graph in your stack this week, just one, and wire in edge weights you can see, a sentinel that can veto, and a max-iteration cap of three. Watch how quickly stability improves. Then repeat.