When I explain AI agents and agentic AI teams to leaders and teams, I use an example that is quite "old" by AI timelines but illustrates patterns we have not adjusted to yet. It comes from Stanford research published in November 2024.
That month, a group at Stanford led by James Zou published the Virtual Lab. What makes it useful as an example, eighteen months later, is that it uses GPT-4o (a frontier model at the time, now considered fairly weak) and still ends up with experimentally validated nanobodies that bind a new variant of SARS-CoV-2.
If the lesson of the paper depended on the model being unusually strong, the lesson would have expired by now. But the lesson is in the design of the collaboration, and that design gets stronger when you swap in a more capable model. James Zou explains this in his recorded talks as well: as the models improve, the Virtual Lab improves with them.
The setup
The research team built a platform they called the Virtual Lab. It is a small team of AI agents standing in for a research lab:
- A Principal Investigator (PI) agent. The professor.
- A Scientific Critic agent. A standing reviewer whose job is to push back on the team's decisions.
- A set of specialist agents that the PI agent recommended and defined for the specific problem. For this project, the PI created an Immunologist, a Computational Biologist, and a Machine Learning Specialist.
The human researchers talk almost entirely with the PI agent. In the published breakdown, humans spoke about 1% of the time and the PI about 20%. The Computational Biologist, the agent with the most domain work, spoke the most. The Critic also spoke a lot.
The agents work through two meeting types: group meetings, where the PI sets an agenda and everyone goes around the virtual table, and one-on-one meetings, where a single agent sits with the PI on a specific sub-task. A meeting takes about a minute, and hundreds of them run in a single afternoon. None of the code or experimental design is written by the humans. The agents write their own code, learn to use the relevant tools, and call those tools themselves.
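For readers who think in code, here is a minimal sketch of the team and the two meeting types described above. It is my own illustration, not the Virtual Lab's actual code: the names `Agent`, `group_meeting`, `one_on_one` and `stub_responder` are hypothetical, the speaking order is an assumption, and the stub stands in for whatever model call a real system would make.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Agent:
    title: str
    role: str


# A transcript is a list of (speaker title, message) pairs.
Transcript = list[tuple[str, str]]

# Hypothetical stand-in for a model call:
# (agent, agenda, transcript so far) -> reply text.
Responder = Callable[[Agent, str, Transcript], str]


def stub_responder(agent: Agent, agenda: str, transcript: Transcript) -> str:
    """Placeholder reply so the sketch runs without any model behind it."""
    return f"[{agent.title}] turn {len(transcript) + 1} on: {agenda}"


def group_meeting(pi: Agent, specialists: list[Agent], critic: Agent,
                  agenda: str, respond: Responder = stub_responder,
                  rounds: int = 1) -> Transcript:
    """PI opens, everyone goes around the table, PI closes with a summary."""
    transcript: Transcript = []
    transcript.append((pi.title, respond(pi, agenda, transcript)))
    for _ in range(rounds):
        for agent in [*specialists, critic]:
            transcript.append((agent.title, respond(agent, agenda, transcript)))
    transcript.append((pi.title, respond(pi, agenda, transcript)))
    return transcript


def one_on_one(pi: Agent, agent: Agent, agenda: str,
               respond: Responder = stub_responder) -> Transcript:
    """PI sets the sub-task, the agent works on it, PI responds."""
    transcript: Transcript = []
    for speaker in (pi, agent, pi):
        transcript.append((speaker.title, respond(speaker, agenda, transcript)))
    return transcript


# The team from the paper; the role descriptions are paraphrased assumptions.
pi = Agent("Principal Investigator", "sets the agenda and summarises")
critic = Agent("Scientific Critic", "pushes back on the team's decisions")
specialists = [
    Agent("Immunologist", "specialist defined by the PI"),
    Agent("Computational Biologist", "specialist defined by the PI"),
    Agent("Machine Learning Specialist", "specialist defined by the PI"),
]

for speaker, message in group_meeting(pi, specialists, critic, "choose a design approach"):
    print(speaker, "->", message)
```

The point of the sketch is how little structure there is: a handful of roles, two meeting shapes, and a transcript that every speaker can read before taking a turn.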
Four design choices to consider
Many leadership conversations about AI in R&D are still stuck on whether to use it. The Virtual Lab puts a more interesting set of design choices on the table. Four of them are worth pulling out.
1. The AI picked its own team
The human researchers told the PI agent the problem: design binders for the recent SARS-CoV-2 variant. The PI agent decided what kinds of expertise the team needed and instantiated the specialist agents. For a finance problem, or a climate science problem, the PI would have created a different team.
That step, composing the team, is a step we normally treat as deeply human. It involves judgement about what skills a problem needs, who is available, who works well with whom. In the Virtual Lab, the team-composition step is delegated to the AI.
Delegating team composition reshuffles which decisions the humans are making. The humans now choose the problem and the constraints; the AI chooses who is on the team to work on it. Whether you want that division of labour depends on what you are doing. An open-ended research problem, where the right team is part of what you are discovering, may suit it well. A regulated or political environment may not. Either way, the choice is now yours to make explicitly, and that is the part that matters.
2. Five parallel meetings, on purpose
This is the move I emphasise when presenting this to leaders.
The Virtual Lab runs the same meeting five times in parallel for any significant decision. Because the underlying LLM is non-deterministic, the five runs don't produce the same outcome. They explore different possibilities. The PI agent then reads the five summaries and synthesises a consensus, or the most interesting conclusions.
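The pattern just described can be sketched in a few lines. This is an illustration under loud assumptions, not the paper's implementation: `run_meeting` is a hypothetical stand-in for a whole meeting, a seeded random choice mimics the model's sampling variation, the option labels are made up, and I have swapped the PI agent's written synthesis for a simple tally, which a real system would not do.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up labels for the conclusions a meeting might reach (assumption).
OPTIONS = ["option A", "option B", "option C"]


def run_meeting(agenda: str, seed: int) -> str:
    """Hypothetical stand-in for one full meeting; returns its summary.

    The seeded random choice imitates the fact that the same agenda
    can lead to different conclusions on different runs.
    """
    rng = random.Random(seed)
    return rng.choice(OPTIONS)


def run_parallel(agenda: str, n_runs: int = 5, base_seed: int = 0) -> list[str]:
    """Run the same meeting n_runs times, each run unaware of the others."""
    seeds = range(base_seed, base_seed + n_runs)
    with ThreadPoolExecutor(max_workers=n_runs) as pool:
        return list(pool.map(lambda seed: run_meeting(agenda, seed), seeds))


def synthesise(summaries: list[str]) -> dict:
    """Simplified synthesis: a tally in place of the PI agent's reading.

    Keeps the most common conclusion as the consensus and the rest as
    alternatives worth a second look.
    """
    counts = Counter(summaries)
    consensus, votes = counts.most_common(1)[0]
    alternatives = [s for s in counts if s != consensus]
    return {"consensus": consensus, "votes": votes, "alternatives": alternatives}


summaries = run_parallel("choose a design approach", n_runs=5)
print(synthesise(summaries))
```

Note that the alternatives are kept and not discarded. The minority conclusions are often where "the most interesting conclusions" come from.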
There is no equivalent of this in human teams. You cannot reset the same meeting five times, with the same people, with no memory of the earlier runs. The non-determinism of LLMs, usually treated as a problem, turns into a sampling feature. So the same property critics use as a knock on LLMs ("you ask the same question twice and get different answers") becomes the engine of a more reliable system, once you stop trying to make any single run authoritative.
I think this is the move that most changes what kind of work humans should reserve for themselves. The exploration step in a research conversation, where many directions are floated and most discarded, is now something AI can do at a scale humans cannot. The framing step (what is the question worth running five hundred meetings on) is not. This is the same asymmetric-collaboration point I make in this issue's leadership section on Project Hail Mary. Divide the work by what each side is built for.
3. The agents have hands. The humans have different hands.
The Virtual Lab agents aren't advisors that produce text for a human to act on. They have tools and they use them. The Computational Biologist agent doesn't recommend that someone run AlphaFold-Multimer; it runs AlphaFold-Multimer. It reads the output, feeds it into Rosetta, iterates. The Machine Learning Specialist agent writes the pipeline code. None of that happens on a human's keyboard.
What humans bring is different but irreducible: they formulate the problem, they sense-check the agents' choices against tacit knowledge about the field, and they physically make the molecules and run the wet-lab assays. Without the wet-lab step, the Virtual Lab produces interesting-looking proposals. With it, the lab produces an experimental result. We will get back to this when looking at self-driving labs in this issue's terminology section, where even the wet-lab step starts to be automated.
4. Compute time replaces calendar time
The Virtual Lab ran hundreds of meetings in a single afternoon. Five parallel runs per decision. Multi-round iteration on a workflow. A custom scoring function emerging out of the discussions. None of that fits on a human calendar.
Or rather, the version of it that would have fit on a human calendar would have taken months and would have looked different.
The constraint that used to bind R&D was how many people we could get in the room, for how long. The constraint that now binds it is how much compute we can afford to spend on the question. That constraint rewards different decisions about what to investigate, how broadly, and how many directions to keep open. It also changes what is worth doing at all: questions that were too expensive to investigate because they would have tied up a team for two months are no longer too expensive.
For most working researchers, the entry point into this new economics is not building a Virtual Lab from scratch. It is using off-the-shelf research tools like SciSpace, covered in this issue's tools section, that already package the cheaper end of multi-agent research into a usable interface.
What I tell executives in workshops
The argument I make, and the reason this 2024 paper is still a great workshop example in mid-2026, is that the design choices the Virtual Lab makes are not bound to drug discovery. Letting the AI compose its own team, sampling decisions in parallel, giving agents real tools, defining the human role around what is irreducible: these are the design choices any organisation working with AI agents will end up making.
The question is whether you make them deliberately, while the cost of a wrong answer is small, or whether you discover them later, when the team you have already built runs into work it wasn't shaped for.
The Virtual Lab worked with GPT-4o, on a hard scientific problem, in late 2024. Whatever the next eighteen months bring in model capability, the structural moves in that paper will probably look better in hindsight, not worse.
Your action step
Pick one decision your team made last month that involved real judgement and several rounds of debate. Imagine running the same conversation five times in parallel, with the same context, between agents shaped around the same roles, and synthesising the five outcomes the next morning. Two questions:
- Would the five-run version have surfaced options the one-run version did not?
- Where would the irreducible human work have sat, and is your team currently spending its time there?
If you answer yes to the first and no to the second, you have just located the highest-leverage redesign on your roadmap.
If you are designing how your R&D, strategy, or product teams work alongside AI agents and want a structured way to make the four design choices above on purpose rather than by accident, that is the work I do in AI strategy advisory engagements and working backwards sessions with leadership teams.