
In Conversation

Building at the principal grade

Columbus's lead developer on what makes principal-grade engineering different, the technical debates the public conversation has not caught up with, and the surprises of building at the apex.

Editorial Board. Benjamin, we want to talk about the engineering work itself: what it actually means to build at principal grade, and where the public conversation gets it wrong. Let us start with the basic question. What is technically different about engineering for a sovereign-grade buyer versus a general enterprise buyer?

Benjamin Deve. The honest answer is almost everything, but the difference that surprises people most is evaluation. Most enterprise AI engineering is evaluated against benchmarks: accuracy on a held-out test set, response time, cost per token, whatever the procurement criteria specify. Those are real metrics and they do matter. But they are not what determines whether a system is acceptable to the buyer at the level we work at.

Deve. What determines acceptability at principal grade is whether the principal recognises themselves in the system. That is not a benchmark. It is a judgement, made by a specific person, against criteria they cannot fully articulate in advance. A system can pass every published evaluation and fail this test, and a system can fail several published evaluations and pass it. The engineering discipline of getting from the first to the second is the discipline most general-enterprise vendors have not yet built.

Editorial. Can you make that more concrete? What specifically do you do differently?

Deve. Sure. Take calibration. In a general enterprise deployment, calibration usually means tuning the system to perform well on a corpus of representative tasks. You optimise for aggregate performance. The system gets better on average. That is fine for most enterprise work.

Deve. For a Twin, calibration means something different. We hold out thirty to fifty real decisions the principal has actually taken and that the model has never seen, and for each one we ask: what would the system have done, and why? Then the principal personally, not us, not a benchmark, rates each response on two dimensions. Did the system reach the same directional conclusion? Did the reasoning behind the conclusion sound like how they would explain it?

Deve. The first dimension is comparatively straightforward. Most well-trained systems will reach plausible directional conclusions on most decisions. The second dimension is where the real work is, and where most systems fail. The principal looks at a piece of reasoning and either says yes, that is how I think, or no, that is not how I think, and the difference is often invisible to anyone but them. We iterate against that signal until alignment is consistently in the high eighties to low nineties per cent on novel decisions. That is the threshold below which we will not ship.

Deve. The practical consequence is that we throw away a lot of work. A general enterprise team would call something done at seventy-five per cent and ship. We have shipped systems where the seventy-five-per-cent version felt close, the principal said no, this is not me, and we went back to the elicitation and rebuilt. That discipline is what makes the difference, and it is not glamorous.
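The hold-out evaluation Deve describes can be sketched in a few lines. This is a hypothetical illustration, not Columbus's actual tooling; the field names and the ship threshold are assumptions, with the threshold placed at the bottom of the high-eighties range he cites.

```python
from dataclasses import dataclass

# Hypothetical sketch of the hold-out calibration loop described above.
# Names, fields, and the 85% threshold are illustrative assumptions.

@dataclass
class HeldOutDecision:
    decision: str               # a real decision the principal took
    system_conclusion: str      # what the Twin would have done
    same_direction: bool        # principal's rating, dimension one
    reasoning_recognised: bool  # principal's rating, dimension two

def alignment_rate(decisions: list[HeldOutDecision]) -> float:
    """Fraction of held-out decisions where the principal endorsed
    both the directional conclusion and the reasoning behind it."""
    passed = sum(
        1 for d in decisions
        if d.same_direction and d.reasoning_recognised
    )
    return passed / len(decisions)

def ready_to_ship(decisions: list[HeldOutDecision],
                  threshold: float = 0.85) -> bool:
    # Below the threshold, the answer is not "tune more";
    # it is "go back to elicitation and rebuild".
    return alignment_rate(decisions) >= threshold
```

Note that both dimensions must pass for a decision to count: a system that reaches the right conclusion with unrecognisable reasoning scores zero on that item, which is exactly the failure mode described above.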

Editorial. Where in the stack does the bulk of that work actually happen?

Deve. Mostly in the Mind layer, and within that, mostly in the parts of it that are not strictly model training. Most of the engineering effort at principal grade is not in fine-tuning a base model. It is in the retrieval, prompt construction, context architecture, and reasoning scaffolding that surround the model and shape how it responds.

Deve. This is one of the things the public AI conversation is currently confused about. The public narrative treats the base model as the thing that matters: GPT-5, Claude 4, Gemini 3, whatever. The base model matters, and we are extremely careful about which one we use for which task. But the base model is, at most, perhaps thirty per cent of what determines whether a Twin sounds like the principal. The remaining seventy per cent is the architecture around the model: how the principal's corpus is structured, how it is retrieved, how the prompt assembles context, how the reasoning is scaffolded, how the constitutional layer enforces values, how outputs are post-processed.

Deve. That is the work. It is unglamorous. It is also the reason a large general-purpose model used naïvely will produce a generic executive voice, and a smaller model wrapped in disciplined architecture will produce a recognisable principal. The architecture matters more than the model, in our experience, and the public conversation has the ratio inverted.
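The seventy-per-cent share Deve attributes to the surrounding architecture can be made concrete with a toy pipeline. This is a deliberately naive sketch under our own assumptions (keyword retrieval, a fixed prompt order, a trivial post-processor); none of the function names come from Columbus's stack. The point it illustrates is structural: the base model is one call inside a larger assembly.

```python
# Hypothetical sketch of the "architecture around the model":
# retrieval, context assembly, the model call, and post-processing.

def retrieve(corpus: dict[str, str], query: str, k: int = 3) -> list[str]:
    """Naive keyword retrieval over the principal's corpus; a real
    deployment would use a proper retriever."""
    scored = sorted(
        corpus.items(),
        key=lambda kv: sum(w in kv[1].lower() for w in query.lower().split()),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def assemble_prompt(query: str, passages: list[str], charter: str) -> str:
    """Assemble context in a fixed order: the constitutional layer first,
    then the principal's own material, then the decision at hand."""
    context = "\n\n".join(passages)
    return f"{charter}\n\nPrincipal's corpus:\n{context}\n\nDecision:\n{query}"

def postprocess(draft: str) -> str:
    # Placeholder for voice and style normalisation.
    return draft.strip()

def respond(model, query: str, corpus: dict[str, str], charter: str) -> str:
    passages = retrieve(corpus, query)
    prompt = assemble_prompt(query, passages, charter)
    draft = model(prompt)   # the base model is one component among several
    return postprocess(draft)
```

Swapping the base model changes one line of `respond`; everything that makes the output sound like the principal lives in the other functions.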

Editorial. What is a technical debate inside the field that you think the public conversation has not caught up with?

Deve. Probably the question of how much of a Twin should be frozen versus continually learning.

Deve. The intuitive answer is that a Twin should learn continuously and every decision the principal makes from now on should refine the model. That sounds correct and is, in some carefully bounded ways, correct. But it is also dangerous if implemented naïvely, and the dangers are not well understood outside the field.

Deve. The risk is drift. If the Twin is updated on every new decision, the Twin you have in year three is not the Twin you authorised in year one. It has drifted. A drifted Twin may still be useful, but it is no longer the artefact the principal originally signed off on. From a governance perspective, that matters enormously. The principal needs to be able to authorise specific versions of the Twin and revoke them. If the Twin is in continuous flux, that authorisation chain is much harder to maintain.

Deve. The position we have arrived at, and this is genuinely contested across the field, is that Twins should have governed update cycles rather than continuous learning. New material is collected continuously. But it is not incorporated into the live Twin without an explicit governance review, with the principal or their authorised reviewer signing off on what gets ingested. This is slower and operationally heavier than continuous learning. It is also auditable, revocable, and defensible to a regulator or a board, which continuous learning is not.

Deve. There are smart people in the field who disagree with us on this, and the debate is going to play out over the next two or three years. I think we are right. Time will tell.
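The governed update cycle Deve argues for, as distinct from continuous learning, might be sketched like this. The class and method names are our illustrative assumptions; the load-bearing properties are the ones he names: staged material never touches the live Twin without an explicit sign-off, every ingestion produces a new authorised version, and any version can be revoked.

```python
# Hypothetical sketch of governed update cycles versus continuous
# learning. Names are illustrative, not Columbus's implementation.

class GovernedTwin:
    def __init__(self):
        self.versions = [[]]   # version 0: the authorised baseline corpus
        self.staged = []       # collected continuously, never live
        self.revoked = set()
        self.audit = []        # (version, approved_by) sign-off trail

    def collect(self, item: str) -> None:
        """New material accumulates in staging; the live Twin is unchanged."""
        self.staged.append(item)

    def review_and_ingest(self, approved_by: str, approved: list[str]) -> int:
        """An explicit governance review creates a new authorised version
        from the approved subset of staged material."""
        self.versions.append(self.versions[-1] + approved)
        self.staged = [s for s in self.staged if s not in approved]
        version = len(self.versions) - 1
        self.audit.append((version, approved_by))
        return version

    def revoke(self, version: int) -> None:
        self.revoked.add(version)

    def live_corpus(self) -> list[str]:
        """The most recent non-revoked authorised version."""
        for v in range(len(self.versions) - 1, -1, -1):
            if v not in self.revoked:
                return self.versions[v]
        return []
```

The cost is visible in the code: nothing reaches the live Twin without a review step. The benefit is equally visible: every state the Twin has ever been in is a named version with a recorded approver, which is what makes the authorisation chain auditable and revocable.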

Editorial. What surprised you most when you started doing this work?

Deve. How much of the difficulty is non-technical.

Deve. I came into this expecting that the hard problems would be in the model layer. Better fine-tuning. Better retrieval. Better reasoning. Those are real problems and we work on them. But the harder problems, the ones that have actually defined what the work is, have been elicitation and governance.

Deve. Elicitation is the discipline of getting a principal to articulate things they cannot ordinarily articulate, in a structured way that produces material the engineering can use. It looks like interviews on the surface. It is something more specific than that. It is more like depositional questioning, except the goal is not to extract testimony but to surface decision frameworks the principal applies but does not consciously hold. The skill of doing this well is rare, and it sits closer to a discipline like clinical psychology or executive coaching than to anything in software engineering.

Deve. Governance is the discipline of building the audit, consent, charter, and revocation infrastructure that a sovereign-grade deployment requires, and of integrating it into the system so that no output can be produced outside the boundaries the principal authorised. This is more software-like, but it is software in service of legal and regulatory requirements that most engineering teams have never had to take seriously.

Deve. Most of the engineering challenge at our level has been in those two areas. The model and infrastructure work is hard, but it is not the bottleneck. The bottleneck is whether the elicitation has produced material rich enough for the engineering to encode, and whether the governance is robust enough to defend in front of a regulator or a board. If those two are right, the rest is achievable. If those two are wrong, the rest does not save the project.
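The output-boundary property Deve attributes to the governance layer, that no output can be produced outside what the principal authorised, reduces to a guard that every draft must pass, with the decision recorded either way. A minimal hypothetical illustration, with all names our own assumptions:

```python
# Hypothetical sketch of a governance guard on every output.
# A real classifier would replace the caller-supplied topic_of.

class BoundaryViolation(Exception):
    """Raised when a draft falls outside the authorised charter."""

def guarded_output(draft: str,
                   authorised_topics: set[str],
                   topic_of,
                   audit_log: list) -> str:
    """Release a draft only if its topic falls inside the authorised
    boundary; record the decision in the audit trail either way."""
    topic = topic_of(draft)
    allowed = topic in authorised_topics
    audit_log.append({"topic": topic, "released": allowed})
    if not allowed:
        raise BoundaryViolation(f"topic {topic!r} is outside the charter")
    return draft
```

The design point is that the refusal path writes to the audit trail before raising: a regulator or board asking what the system declined to produce, and why, gets a record rather than an absence.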

Editorial. What do developers entering this field underestimate?

Deve. The amount of taste required.

Deve. Most developers, myself included when I started, assume the work is mostly technical. You learn the stack, you learn the patterns, you become skilled at the engineering, and the work follows. That is true for most software. It is only partly true here.

Deve. What the work actually requires, beyond technical skill, is taste in a specific sense. The judgement to recognise when a system is producing output that is technically correct but humanly wrong. The willingness to throw away work that benchmarks well but fails the principal-recognition test. The patience to listen to a principal explain why something is not quite right, when they cannot articulate exactly what is wrong, and to derive from their dissatisfaction the engineering changes that need to be made.

Deve. This is not a skill that engineering education currently teaches. It is closer to the skill a senior editor has, or a portrait photographer, or a courtroom interpreter: a sustained discipline of paying attention to a specific human and serving their actual judgement, not their stated preferences. Developers entering principal-grade work need to know that this skill is real and not optional. The engineers we have hired who succeed are the ones who can develop it. The ones who cannot, even if they are technically excellent, do not stay at this level of work.

Editorial. Where do you think the field is going in the next five years?

Deve. Two predictions, made with appropriate humility.

Deve. The first is that principal-grade will become a recognised category, with its own evaluation standards, its own procurement specifications, and its own regulatory framework. Right now it is a niche term that we and a handful of others use. In five years it will be a procurement requirement in every sovereign and Tier-1 deployment. The institutions that have not built for it by then will be at a substantial disadvantage.

Deve. The second is that the bottleneck on the work will shift. Right now the bottleneck is a combination of base model capability, elicitation discipline, and governance maturity. Within five years base model capability will be sufficient that the bottleneck moves entirely to elicitation and governance, which means the field will become much more about the people who can do the work than about whether the technology is good enough. The companies that have built deep elicitation and governance practices by then will be in a structurally stronger position than the companies that have relied on technology to carry them.

Deve. I think Columbus is positioned for this. I think most of our visible competitors are not. We will see.

Editorial. Final question. What gets you up in the morning?

Deve. Honestly? The puzzle. The work is genuinely hard in interesting ways. Most software engineering is well-defined enough that the path from problem to solution is mostly execution. This work is not. The path from "this principal needs a Twin" to "the Twin sounds like them at high fidelity" goes through eight or nine engineering, design, elicitation, and governance decisions, each of which has multiple defensible answers, and the right combination is genuinely not obvious in advance. Figuring it out is the work.

Deve. There is also, and I will admit this, a quiet satisfaction in being early. The category we work in is going to be much bigger and much more contested in five years than it is today. The engineering practices we are developing now will, I think, be the practices the field eventually adopts. That is a rare position to be in as a developer, and I appreciate it.

Conducted by The Columbus Editorial Board, May 2026.