
Ok, it might sound crazy, but I actually got the best code quality (completely ignoring that the cost is likely 10x higher) by having a full “project team” using opencode with multiple subagents, all managed by a single Opus instance. I gave them the task of porting a legacy Java server to C# .NET 10. 9 agents, 7-stage Kanban with isolated Git worktrees.

Manager (Claude Opus 4.5): Global event loop that wakes up specific agents based on folder (Kanban) state.

Product Owner (Claude Opus 4.5): Strategy. Cuts scope creep.

Scrum Master (Opus 4.5): Prioritizes backlog and assigns tickets to technical agents.

Architect (Sonnet 4.5): Design only. Writes specs/interfaces, never implementation.

Archaeologist (Grok-Free): Lazy-loaded. Only reads legacy Java decompilation when Architect hits a doc gap.

CAB (Opus 4.5): The Bouncer. Rejects features at Design phase (Gate 1) and Code phase (Gate 2).

Dev Pair (Sonnet 4.5 + Haiku 4.5): AD-TDD loop. Junior (Haiku) writes failing NUnit tests; Senior (Sonnet) fixes them.

Librarian (Gemini 2.5): Maintains "As-Built" docs and triggers sprint retrospectives.

You might ask yourself “isn’t this extremely unnecessary?” and the answer is most likely _yes_. But I never had this much fun watching AI agents at work (especially when the CAB rejects implementations). This is an early version of the process the AI agents follow (I haven’t updated it since it was only for me anyway): https://imgur.com/a/rdEBU5I
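Roughly, the mechanics look like this; a simplified Python sketch, not my actual scripts (the stage folder names, ticket layout, and helper names are just placeholders):

```python
import subprocess
from pathlib import Path

BOARD = Path("kanban")  # stage folders double as the Kanban board
STAGES = ["00_backlog", "01_design", "02_cab_gate1", "03_implement",
          "04_cab_gate2", "05_review", "06_done"]  # placeholder names for the 7 stages

def create_worktree(ticket_id: str) -> Path:
    """Each ticket gets its own isolated Git worktree and branch."""
    wt = Path("worktrees") / ticket_id
    subprocess.run(
        ["git", "worktree", "add", "-b", f"ticket/{ticket_id}", str(wt)],
        check=True,
    )
    return wt

def advance(ticket: Path, next_stage: str) -> Path:
    """Moving the ticket file between stage folders is the only state transition."""
    target = BOARD / next_stage / ticket.name
    target.parent.mkdir(parents=True, exist_ok=True)
    return ticket.rename(target)
```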





Every time I read something like this, it strikes me as an attempt to convince people that various people-management memes are still going to be relevant moving forward, or even that they currently work when used on humans. The reality is that these roles don't even work in human organizations today. Classic "job_description == bottom_of_funnel_competency" fallacy.

If they make the LLMs more productive, it is probably explained by a less complicated phenomenon that has nothing to do with the names of the roles, or their descriptions. Adversarial techniques work well for ensuring quality, parallelism is obviously useful, important decisions should be made by stronger models, and using the weakest model for the job helps keep costs down.


My understanding is that the main reason splitting up work is effective is context management.

For instance, if an agent only has to be concerned with one task, its context can be massively reduced. Further, the next agent can just be told the outcome; it also has a reduced context load, because it doesn't need to see the inner workings, just know what the result is.

For instance, a security testing agent just needs to review code against a set of security rules, and then list the problems. The next agent then just gets a list of problems to fix, without needing a full history of working it out.
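A minimal sketch of that handoff, assuming a generic `run_agent(prompt)` helper that wraps whatever LLM call you're using (both functions here are just illustrations):

```python
def security_review(code: str, run_agent) -> list[str]:
    # The reviewer's context is only the code plus the security rules.
    report = run_agent(
        "Review this code against our security rules and list concrete problems, "
        "one per line:\n\n" + code
    )
    return [line.strip() for line in report.splitlines() if line.strip()]

def fix_problems(code: str, problems: list[str], run_agent) -> str:
    # The fixer never sees the reviewer's working-out, only the resulting list.
    return run_agent(
        "Fix the following problems in this code:\n"
        + "\n".join(f"- {p}" for p in problems)
        + "\n\nCode:\n" + code
    )
```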


Which, ultimately, is not so different from the reason we split up work for humans, either. Human job specialization is just context management over the course of 30 years.

> Which, ultimately, is not so different from the reason we split up work for humans,

That's mostly for throughput, and context management.

It's context management in that no human knows everything, but that's also throughput in a way because of how human learning works.


I’ve found that task isolation, rather than preserving your current session’s context budget, is where subagents shine.

In other words, when I have a task that specifically should not have project context, then subagents are great. Claude will also summon these “swarms” for the same reason. For example, you can ask it to analyze a specific issue from multiple relevant POVs, and it will create multiple specialized agents.

However, without fail, I’ve found that creating a subagent for a task that requires project context will result in worse outcomes than using “main CC”, because the sub simply doesn’t receive enough context.


So, two things. Yes, this helps with context and is a primary reason to break out the sub-agents.

However, one of the bigger things is that by giving it a focus on a specific task or role, you force the LLM to "pay attention" to certain aspects. The models have finite attention, and if you ask them to pay attention to "all things", they just ignore some.

The act of forcing the model to pay attention can be accomplished in other ways (a defined process, committee formation in a single prompt, etc.), but defining personas at the sub-agent level is one of the most efficient ways to encode a worldview and responsibilities, versus explicitly listing them.


What do you think context is, if not 'attention'?

You can create a context that includes info and instructions, but the agent may not pay attention to everything in the context, even if context usage is low.

IMO, "attention" is an abstraction over the result of prompt engineering, the chain reaction of the input converging on an output (both "thinking" and response).

Context is the information you give the model, attention is what parts it focuses on.

And this is finite in capacity and emergent from the architecture.


So attention is based on a smaller subset of context?

I suppose it could end up being an LLM variant of Conway’s Law.

“Organizations are constrained to produce designs which are copies of the communication structures of these organizations.”

https://en.wikipedia.org/wiki/Conway%27s_law


If so, one benefit is you can quickly and safely mix up your set of agents (a la Inverse Conway Manoeuvre) without the downsides that normally entails (people being forced to move teams or change how they work).

I think it's just the opposite, as LLMs feed on human language. "You are a scrum master" automatically encodes most of what the LLM needs to know. Trying to describe the same role in a prompt would be a lot more difficult.

Maybe a different separation of roles would be more efficient in theory, but an LLM understands "you are a scrum master" from the get go, while "you are a zhydgry bhnklorts" needs explanation.


This has been pretty comprehensively disproven:

https://arxiv.org/abs/2311.10054

Key findings:

- Tested 162 personas across 6 types of interpersonal relationships and 8 domains of expertise, with 4 LLM families and 2,410 factual questions

- Adding personas in system prompts does not improve model performance compared to the control setting where no persona is added

- Automatically identifying the best persona is challenging, with predictions often performing no better than random selection

- While adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random

Fun piece of trivia - the paper was originally designed to prove the opposite result (that personas make LLMs better). They revised it when they saw the data completely disproved their original hypothesis.


A persona is not the same thing as a role. The point of the role is to limit the work of the agent and to focus it on one or two behaviors.

What the paper is really addressing is whether keywords like "you are a helpful assistant" give better results.

The paper is not addressing a role such as "you are a system designer" or "you are a security engineer", which will produce completely different results and focus the output of the LLM.


Aside from what you said about applicability, the paper actually contradicts their claim!

In the domain alignment section:

> The coefficient for “in-domain” is 0.004 (p < 0.01), suggesting that in-domain roles generally lead to better performance than out-domain roles.

Although the effect size is small, why would you not take advantage of it?


How well does such LLM research hold up as new models are released?

Most model research decays because the evaluation harness isn’t treated as a stable artefact. If you freeze the tasks, acceptance criteria, and measurement method, you can swap models and still compare apples to apples. Without that, each release forces a reset and people mistake novelty for progress.
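As a concrete (hypothetical) illustration, "freezing the harness" can be as simple as pinning the task file and the acceptance rule, so only the model under test varies between runs; the `ask_model` hook and file name here are assumptions:

```python
import json

def evaluate(model_name: str, ask_model, tasks_path: str = "frozen_tasks.json") -> float:
    """Same tasks, same acceptance check, same metric; only the model changes."""
    with open(tasks_path) as f:
        tasks = json.load(f)  # e.g. [{"prompt": ..., "expected": ...}, ...]
    passed = 0
    for task in tasks:
        answer = ask_model(model_name, task["prompt"])
        passed += int(task["expected"].lower() in answer.lower())  # frozen acceptance rule
    return passed / len(tasks)

# Swap models, keep everything else fixed, and report the delta:
# evaluate("model-2023", ask_model) vs. evaluate("model-2025", ask_model)
```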

In a discussion about LLMs you link to a paper from 2023, when not even GPT-4 was available?

And then you say:

> comprehensively disproven

? I don't think you understand the scientific method


Fair point on the date - the paper was updated October 2024 with Llama-3 and Qwen2.5 (up to 72B), same findings. The v1 to v3 revision is interesting. They initially found personas helped, then reversed their conclusion after expanding to more models.

"Comprehensively disproven" was too strong - should have said "evidence suggests the effect is largely random." There's also Gupta et al. 2024 (arxiv.org/abs/2408.08631) with similar findings if you want more data points.


A paper’s date does not invalidate its method. Findings stay useful only when you can re-run the same protocol on newer models and report deltas. Treat conclusions as conditional on the frozen tasks, criteria, and measurement, then update with replication, not rhetoric.

...or even how fast technology is evolving in this field.

One study has “comprehensively disproven” something for you? You must be getting misled left right and centre if that’s how you absorb study results.

It shows me that there doesn’t appear to be an escape from Conway’s Law, even when you replace the people in an organisation with machines. Fundamentally, the problem is still being explored from the perspective of an organisation of people and it follows what we’ve experienced to work well (or as well as we can manage).

Developers actually do want managers, to simplify their daily lives. Otherwise they would self-manage better and keep more of the revenue share for themselves.

Unfortunately some managers get lonely and want a friendly face in their org meetings, or can’t answer any technical questions, or aren’t actually tracking what their team is doing. And so they pull in an engineer from their team.

Being a manager is a hard job but the failure mode usually means an engineer is now doing something extra.


i guess, as a human it’s easier to reason about a multi-agent system when the roles are split intuitively, as we all have mental models. but i agree - it’s a bit redundant/unnecessary

I do think there is some actual value in telling an LLM "you are an expert code reviewer". You really do tend to get better results in the output

When you think about what an LLM is, it makes more sense. It causes a strong activation of neurons related to "code review", and so the model's output sounds more like a code review.


For those unfamiliar, CAB is a Change-advisory board:

https://en.wikipedia.org/wiki/Change-advisory_board


Thank you for the link and the compliment.

Subagent orchestration without the overhead of frameworks like Gastown is genuinely exciting to see. I’ve recorded several long-running demos of Pied-Piper, a subagent orchestration system for Claude Code and ClaudeCodeRouter+OpenRouter, here: https://youtube.com/playlist?list=PLKWJ03cHcPr3OWiSBDghzh62A...

I came across a concept called DreamTeam, where someone was manually coordinating GPT 5.2 Max for planning, Opus 4.5 for coding, and Gemini Pro 3 for security and performance reviews. Interesting approach, but clearly not scalable without orchestration. In parallel, I was trying to do repeatable workflows like API migration, Language migration, Tech stack migration using Coding agents.

Pied-Piper is a subagent orchestration system built to solve these problems and enable repeatable SDLC workflows. It runs from a single Claude Code session, using an orchestrator plus multiple agents that hand off tasks to each other as part of a defined workflow called Playbooks: https://github.com/sathish316/pied-piper

Playbooks allow you to model both standard SDLC pipelines (Plan → Code → Review → Security Review → Merge) and more complex flows like language migration or tech stack migration (Problem Breakdown → Plan → Migrate → Integration Test → Tech Stack Expert Review → Code Review → Merge).

Ideally, it will require minimal changes once Claude Swarm and Claude Tasks become mainstream.


Personally, I'm fascinated by the opening for protocol languages to become relevant.

The previous generations of AI (AI in the academic sense) like JASON, when combined with a protocol language like BSPL, seem like the easiest way to organize agent armies in ways that "guarantee" specific outcomes.

The example above is very cool, but I'm not sure how flexible it would be (and there's the obvious cost concern). But, then again, I may be going far down the overengineering route.


Share the code of this “actual best quality”, or this is just another meaningless and suspicious attempt to get users to put the already expensive AI in a for-loop to make it even more expensive.

I have been using a simpler version of this pattern, with a coordinator and several more or less specialized agents (eg, backend, frontend, db expert). It really works, but I think that the key is the coordinator. It decreases my cognitive load, and generally manages to keep track of what everyone is doing.

I’ve been messing around with the BMAD process as well, which seems like a simpler workflow than you described. My only concern is that it gets 90% of the way to production-ready code, but it starts to fail at the last 10% when the tech debt gets too large.

Have you been able to build anything productionizable this way, or are you just using this workflow for rapid prototyping?


Can you share technical details please? How is this implemented? Is it pure prompt-based, plugins, or do you have like script that repeatedly calls the agents? Where does the kanban live?

Not the OP, but this is how I manage my coding agent loops:

I built a drag-and-drop UI tool that sets up a sequence of agent steps (Claude Code or Codex) and have created different workflows based on the task. I'll kick them off and monitor.

Here's the tool I built for myself for this: https://github.com/smogili1/circuit


Cool, thanks for sharing!

This is genuinely cool, the CAB rejecting implementations must be hilarious to watch in action. The Kanban + Git worktree isolation is smart for keeping agents from stepping on each other.

I've been working on something in this space too. I built https://sonars.dev specifically for orchestrating multiple Claude Code agents working in parallel on the same codebase. Each agent gets its own workspace/worktree and there's a shared context layer so they can ask each other questions about what's happening elsewhere (kind of like your Librarian role but real-time).

The "ask the architect" pattern you described is actually built into our MCP tooling: any agent can query a summary of what other agents have done/learned without needing to parse their full context.



How much does this setup cost? I don't think a regular Claude Max subscription makes this possible.

Can't you just use time-sharing and let the entire task run over night?

Very cool! A couple of questions:

1. Are you using a Claude Code subscription? Or are you using the Claude API? I'm a bit scared to use the subscription in OpenCode due to Anthropic's ToS change.

2. How did you choose what models to use in the different agents? Do you believe or know they are better for certain tasks?


> due to Anthropic's ToS change.

Not a change, but enforcing terms that have been there all the time.


Could you share some details? How many lines of code? How much time did it take, and how much did it cost?

You might as well just have a planner and workers; your architecture essentially echoes that structure. It is difficult to discern how the semantics drive different behavior among those roles, and why the planner can't create those prompts ad hoc.

This now makes me think that the only way to get AI to work well enough to actually replace programmers will probably be paying so much for compute that it's less expensive to just have a junior dev instead.

I was getting good results with a similar flow, but I was using Claude Max with ChatGPT. Unfortunately that's not an option available to me anymore unless either I or my company wants to foot the bill.

What are the costs looking like to run this? I wonder whether you would be able to use this approach within a mixture-of-experts model trained end-to-end as an ensemble. That might take out some of the guesswork as far as the roles go.

Interesting that your implementation agents are not Opus. I guess having the more rigorous spec pipeline helps scope the work to something Sonnet can knock out.

What are you building with the code you are generating?

Do you mind sharing the prompts? Would be greatly appreciated

Is it just multiple opencode instances inside tmux panels or how do you run your setup?

Is this satire?

Nope, it isn’t. I did it as a joke initially (I also had a version where every 2 stories there was a meeting, and if an agent underperformed it would get fired). I think there are multiple reasons why it actually works so well:

- I built a system where context (+ the current state + goal) is properly structured, and coding agents only get the information they actually need and nothing more. You wouldn’t let your product manager develop your backend, and I let the backend dev do only the things it is supposed to do and nothing more. If an agent crashes (or quota limits are reached), the agents can continue exactly where the others left off.

- Agents are ”fighting against” each other to some extent? The Architect tries to design while the CAB tries to reject.

- Granular control. I wouldn’t call “the manager” _a deterministic state machine that is calling probabilistic functions_, but that’s to some extent what it is? The manager has clearly defined tasks (like “if a file is in 01_design → call the Architect”).
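A stripped-down sketch of that dispatch logic (folder names and the `call_agent` helper are placeholders, not my actual setup):

```python
import time
from pathlib import Path

# Which agent gets woken up for tickets sitting in which Kanban folder.
ROUTING = {
    "01_design":    "architect",
    "02_cab_gate1": "cab",
    "03_implement": "dev_pair",
    "04_cab_gate2": "cab",
}

def manager_loop(board: Path, call_agent):
    """A deterministic state machine calling probabilistic functions."""
    while True:
        for stage, agent in ROUTING.items():
            for ticket in sorted((board / stage).glob("*.md")):
                call_agent(agent, ticket)  # the agent moves the ticket when it is done
        time.sleep(30)  # wake up, check folder state, dispatch, repeat
```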

Here’s one example of an agent log after a feature has been implemented from one of the older codebases: https://pastebin.com/7ySJL5Rg


Thanks for clarifying - I think some of the wording was throwing me off. What a wild time we are in!

What OpenCode primitive did you use to implement this? I'd quite like a "senior" Opus agent that lays out a plan, a "junior" Sonnet that does the work, and a senior Opus reviewer to check that it agrees with the plan.

You can define the tools that agents are allowed to use in the opencode.json (also works for MCP tools I think). Here’s my config: https://pastebin.com/PkaYAfsn

The models can call each other if you reference them using @username.

This is the .md file for the manager : https://pastebin.com/vcf5sVfz

I hope that helped!


This is excellent, thank you. I came up with half of this while waiting for this reply, but the extra pointers about mentioning with @ and the {file} syntax really help, thanks again!

> [...] coding agents only get the information they actually need and nothing more

Extrapolating from this concept led me to a hot-take I haven't had time to blog about: Agentic AI will revive the popularity of microservices. Mostly due to the deleterious effect of context size on agent performance.


Why would they revive the popularity of microservices? They can just as well be used to enforce strict module boundaries within a modular monolith keeping the codebase coherent without splitting off microservices.

And that's why they call it a hot take. No, it isn't going to give rise to microservices. You absolutely can have your agent perform high-level decomposition while maintaining a monolith. A well-written, composable spec is awesome. This has been true for human and AI coders for a very, very long time. The hat trick has always been getting a well-written, composable spec. AI can help with that bit, and I find that is probably the best part of this whole tooling cycle. I can actually interact with an AI to build that spec iteratively. Have it be nice and mean. Have it iterate among many instances and other models, all that fun stuff. It still won't make your idea awesome or make anyone want to spend money on it, though.

In a fresh project that is well documented and set up, it might work better. Many of the issues agents have in my work come from endpoints not always being documented correctly.

A real example that happened to me: the agent forgets to rename an expected parameter in the API spec for service 1. Now, when working on service 2, there is no way for the agent to find this mistake other than giving it access to service 1. And now you are back to "... effect of context size on agent performance ...". For context, we have ~100 services.

One could argue these issues reduce over time as instruction files are updated etc but that also assumes the models follow instructions and don't hallucinate.

That being said, I do use Agents quite successfully now - but I have to guide them a bit more than some care to admit.


> In a fresh project that is well documented and set up it might work better.

I guess this may be dependent on the domain, language, codebase, or some combination of the three. The biggest issues I've had with agents are when they go down the wrong path and it snowballs from there. Suddenly they are loading more context unrelated to the task and getting more confused. Documenting interfaces doesn't help if the source is available to the agent.

My agentic sweet spot is human-designed interfaces. Agents cannot mess up code they don't have access to, e.g. by inadvertently changing the interface contract and the implementation.

> Agent forgets to rename an expected parameter in API spec for service 1

Document and test your interfaces/logic boundaries! I have witnessed this break many times with human teams: field renames, changes in optionality, undocumented field dependencies, etc., and there are challenging trade-offs with API versioning. Agents can't fix process issues.
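For example, a tiny contract test pinning service 1's published spec so a silent rename is caught at the boundary; the spec path and field names below are made up for illustration:

```python
import json

def test_service1_contract():
    """Fail fast if a documented request parameter is renamed or dropped."""
    with open("service1/openapi.json") as f:
        spec = json.load(f)
    params = {p["name"] for p in spec["paths"]["/orders"]["get"]["parameters"]}
    # The parameters service 2 actually sends; a rename on either side breaks this test.
    assert {"customer_id", "status"} <= params
```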


Isn't all this a manual implementation of prompt routing, and, to a lesser extent, Mixture of Experts?

These tools and services are already expected to do the best job for specific prompts. The work you're doing pretty much proves that they don't, while also throwing much more money at them.

How much longer are users going to have to manually manage LLM context to get the most out of these tools? Why is this still a problem ~5 years into this tech?


quite a storyteller

I'm confused. When you say you have a manager, scrum master, architect, all supposedly sharing the same memory, does each of those "employees" "know" what it is? And if so, based on what are their identities defined? Prompts? Or something more? Or am I just too dumb to understand / swimming against the current here. Either way, it sounds amazing!

Their roles are defined by prompts. The only memory is shared files and the conversation history that’s looped back into stateless API calls to an LLM.

It's not satire but I see where you're coming from.

Applying distributed human team concepts to a porting task squeezes extra performance from LLMs much further up the diminishing returns curve. That matters because porting projects are actually well-suited for autonomous agents: existing code provides context, objective criteria catch more LLM-grade bugs than greenfield work, and established unit tests offer clear targets.

I guess what I'm trying to say is that the setup seems absurd because it is. Though it also carries real utility for this specific use case. Apply the same approach to running a startup or writing a paid service from scratch and you'd get very different results.


I don't know about something this complex, but right this moment I have something similar running in Claude Code in another window, and it is very helpful even with a much simpler setup:

If you have these agents do everything at the "top level" they lose track. The moment you introduce sub-agents, you can have the top level run in a tight loop of "tell agent X to do the next task; tell agent Y to review the work; repeat" or similar (add as many agents as makes sense), and it will take a long time to fill up the context. The agents get fresh context, and you get to manage explicitly what information is allowed to flow between them. It also tends to make it a lot easier to introduce quality gates, e.g. your testing agent and your code review agent will not decide they can skip testing because they "know" they implemented things correctly, since there is no memory of that in their context.
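Not my exact setup, but the shape of that loop, assuming a `spawn_subagent(role, briefing)` helper that starts every sub-agent with a fresh context:

```python
def run_task(task: str, spawn_subagent, max_rounds: int = 5) -> str:
    feedback = ""
    for _ in range(max_rounds):
        # The implementer sees only the task plus the reviewer's last feedback.
        diff = spawn_subagent(
            "implementer", f"Task: {task}\nReviewer feedback: {feedback or 'none yet'}"
        )
        # The reviewer sees only the diff; it has no memory of "knowing" the code works.
        verdict = spawn_subagent("reviewer", f"Review this change for task '{task}':\n{diff}")
        if verdict.strip().upper().startswith("APPROVE"):
            return diff
        feedback = verdict  # quality gate: loop until the reviewer signs off
    raise RuntimeError("Reviewer never approved the change")
```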

Sometimes too much knowledge is a bad thing.


Humans seem to be similar. If a real product designer dove into all the technical details and code of a product, they would likely forget at least some of the vision behind what the product is actually supposed to be.

Doubt it. I use a similar setup from time to time.

You need to have different skills at different times. This type of setup helps break those skills out.


why would it be? It's a creative setup.

I just actually can't tell, it reads like satire to me.

to me, it reads like mental illness

maybe it's a mix of both :)

Why would it be satire? I thought that's a pretty standard agentic workflow.

My current workplace follows a similar workflow. We have a repository full of agent.md files for different roles and associated personas.

E.g. for project managers, you might have a feature-focused one, a delivery-driven one, and one that aims to minimise scope/technology creep.


I mean no offence to anyone, but whenever new tech progresses rapidly it usually catches most people unaware, and they tend to ridicule or dismiss the concepts that come out of it.

yeah, nfts, metaverse, all great advances

same people pushing this crap


ai is actually useful tho. idk about this level of abstraction but the more basic delegation to one little guy in the terminal gives me a lot of extra time

Maybe that's because you're not using your time well in the first place

bro im using ai swarms, have you even tried them?

bro wanna buy some monkey jpegs?

100% genuine


[flagged]


> Laughing about them instead of creating intergenerational wealth for a few bucks?

it's not creating wealth, it's scamming the gullible

criminality being lucrative is not a new phenomenon


Are you sure that yours would sell for $80K, if you aren't using it to launder money with your criminal associates?

If the price floor is $80K and there are thousands of them, then even if just one was legit it would sell for $80K.

Weird, I'm getting downvoted for just stating facts again.


I think many people really like the gamification and complex role playing. That is how GitHub got popular, that is how Rube Goldberg agent/swarm/cult setups get popular.

It attracts the gamers and LARPers. Unfortunately, management is on their side until they find out after four years or so that it is all a scam.


I've heard some people say that "vibe coding" with chatbots is like slot machines: you just keep "prompting" until you hit the jackpot. And there was an earlier study where people _felt_ more productive even if they weren't (caveat that this was with older models), which aligns with the sort of time dilation people feel when gambling.

I guess "agentic swarms" are the next evolution of the meta-game, the perfect nerd-sniping strategy. Now you can spend all your time min-maxing your team, balancing strengths/weaknesses by tweaking subagents, adding more verifiers and project managers. Maybe there's some psychological draw, in that people can feel like gods and have a taste of the power execs feel, even though that power is ultimately a simulacrum as well.


Extending this: unlike real slot machines, there is no definite "won or not" state for the person prompting, only whether they've been convinced they've won. That comes down to how much you're willing to verify the code it has provided, or better, fully test it (which no one wants to do), versus the reality where they do a little light testing, say it's good enough, and move on.

I recently fixed a problem over a few days, and found that it was duplicated, though differently enough that I asked my coworker to try fixing it with an LLM (he was the originator of the duplicated code, and I didn't want to mess up what was mostly functioning code). Using an LLM, he seemingly did in one hour what took me maybe a day or two of tinkering and fixing. After we hopped off the call, I did a code read to make sure I understood it fully, immediately saw an issue, and tested it further only to find out it did not in fact fix the problem and suffered from the same issues, but it convincingly LOOKED like it fixed it. He was ecstatic at the time saved while presenting it, and afterwards, alone, all I could think about was how our business users were going to be really unhappy being gaslit into thinking it was fixed, because literally every tester I've ever met would definitely have missed it without understanding the code.

People are overjoyed with good enough, and I'm starting to think maybe I'm the problem when it comes to progress? It just gives me Big Short vibes: why am I drawing attention to this obvious issue in quality? I'm just the guy in the casino screaming "does no one else see the obvious problem with shipping this?" And then I start to understand, yes, I am the problem: people have been selling each other dog-water product for millennia because, at the end of the day, Edison is the person people remember, not the guy who came after and made it near perfect or hammered out all the issues. Good enough takes its place in history, not perfection. The trick others have found is that they just need to get to the point where they've secured the money and have time to get away before the customer realizes the world of hurt they've paid for.


I don't think so.

You probably implemented gastown.

The next stage in all of this shit is to turn what you have into a service. What's the phrase? I don't want to talk to the monkey, I want to talk to the organ grinder. So when you kick things off it should be a tough interview with the manager and program manager. Once they're on board and know what you want, they start cracking. Then they just call you in to give demos and updates. Lol

Congratulations on coming up with the cringiest thing I have ever seen. Nothing will top this, ever.

Corporate has to die


Scrum masters typically do not assign tickets.


