Type signatures as the human-AI contract: a year of FP-first AI-native Scala

TL;DR

After a year of building production Scala systems where AI agents (Claude, Codex) write 90%+ of the implementation code, I’ve found that FP’s decades-old “weakness” — implementation complexity — becomes irrelevant when AI writes it, while FP’s strength — honest, expressive type signatures — becomes the primary interface between humans and AI.

The full article (with code examples): Ming’s Spell Compendium #4 — The Art of Whipping AI Grunts: FP’s Great Comeback?

The core argument

In an AI-native workflow, humans review signatures, AI writes implementations. This flips the traditional readability calculus:

// Human reviews this (1 line, complete information):
def fetchUser(id: UserId): IO[Either[AppError, User]]

// AI writes this (humans never need to read it):
EitherT(fetchUser(id))
  .subflatMap(validate)
  .semiflatMap(user => fetchScore(user.email).map(Profile(user, _)))
  .bimap(toHttpResult, toHttpResult)
  .merge  // .merge already collapses EitherT to IO; no .value needed

Scala-specific practices covered

  • sealed trait + ADT error enums over exception hierarchies — compiler-enforced exhaustiveness as the contract between AI sessions

  • opaque type (ProjectId, OrgId) — eliminating parameter mix-ups that AI agents are surprisingly prone to

  • EitherT / cats combinators — AI handles the “alien scripture” effortlessly; humans never need to

  • Tagless final discipline — why “always use tagless final” is a slogan, not an executable rule, and what actionable rules look like

  • Metals MCP vs Grep — when to use LSP vs text search (given/implicit resolution, extension methods, overload disambiguation)

  • Rule engineering — writing CLAUDE.md / agent rules with military-grade precision to prevent style drift across stateless AI sessions
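A minimal Scala 3 sketch of the first two bullets (all names here are illustrative, not the article's actual code). The opaque types make a swapped `ProjectId`/`OrgId` a compile error, and the `enum` forces exhaustive error handling:

```scala
object Ids:
  // Opaque types: zero-cost wrappers the compiler refuses to mix up.
  // Passing an OrgId where a ProjectId is expected will not compile.
  opaque type ProjectId = String
  opaque type OrgId     = String

  object ProjectId:
    def apply(raw: String): ProjectId = raw
  object OrgId:
    def apply(raw: String): OrgId = raw

  extension (id: ProjectId) def projectValue: String = id
  extension (id: OrgId) def orgValue: String = id

import Ids.*

// ADT error enum: adding a case breaks every non-exhaustive match,
// which is the contract carried between stateless AI sessions.
enum AppError:
  case NotFound(id: ProjectId)
  case Unauthorized(org: OrgId)
  case Invalid(reason: String)

def describe(e: AppError): String = e match
  case AppError.NotFound(id)      => s"no project ${id.projectValue}"
  case AppError.Unauthorized(org) => s"org ${org.orgValue} unauthorized"
  case AppError.Invalid(reason)   => s"invalid: $reason"
```

If a future session adds `AppError.RateLimited`, the compiler flags `describe` as non-exhaustive instead of letting the new case fall through silently.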

The ironic punchline

FP has been criticized for decades as “unreadable without a PhD.” But humans now carefully read signatures — FP’s most readable part. And humans skim implementations — FP’s most off-putting part.

Discussion questions

  1. Has anyone else noticed AI agents performing measurably better with rich type signatures vs stringly-typed code?

  2. For those using Metals MCP or similar LSP tooling with AI agents — what’s your experience with given/implicit resolution?

  3. The article argues helper function extraction should require human approval in AI-native codebases (DRY becomes counter-productive when agents have no shared memory). Controversial — thoughts?

I’d love to hear from anyone experimenting with AI-assisted Scala development, whether you agree or violently disagree.


Now that LLMs have advanced enough so that they can write not-Python competently, I have indeed found that having stronger types is a plus.

However, I haven’t noticed any benefit to it being functional as opposed to anything else. This does mean that the “gosh this can be hard to write” downside is somewhat ameliorated, but the performance downside is still there when it’s there, the type signatures are still sometimes polluted by purity-bookkeeping, and even though the comprehension barriers for code size and code complexity aren’t in the same place as they are for people, they’re still there and need to be managed.

So AI + strong typing = substantial win (in my hands). AI + strong typing + functional vs AI + strong typing + not-so-functional seems a wash in general–it really depends which benefits you need. And, honestly, LLMs are quite good with weakly-typed languages like Python and R, too. (Note: LLMs are pretty good without types as long as you have adequate unit testing, which LLMs are also good at writing. Types help but the baseline is already quite good if you insist on robust tests.)

So I can have the AI write Scala in a purity-first style, which is heavily functional. But if I use it to write Rust or Java or Scala-in-a-compact-flexible-style? Still a substantial boost.

Out of the things you mention: yes, enums are great (I had a LLM write a Java project over the weekend and scolded it into using ADTs…much better error experience). Opaque types are great (either a full wrapping, or a lightweight tagging system–it really doesn’t matter what as long as it keeps the AI honest…but you can gain at least half the benefit by telling it to use explicit parameter names). EitherT and tagless final–well–if you need to go that way, sure, the AI will help you write it, and that’s nice to make it easy when you need the kinds of guarantees it offers (but you don’t always need those guarantees).

(Also, I find that some common mistakes are essentially impervious to CLAUDE.md-level rules. I do scientific programming where swallowing errors can lead to incorrect conclusions. It doesn’t really matter how I say it or how many times I repeat it; LLMs still write code now and then that silently swallows errors–fewer with good instructions, but it still happens, so I need to add an explicit review phase afterwards to go catch more of the spots.)

I disagree about DRY. It’s very important for maintaining a codebase, whether AI or human. I have a largish Python project that is mostly LLM-created, and even though I constantly interrupted the LLM to insist on better abstraction and more DRY, when it came time to refactor, it was far more painful than it needed to be, because the code was still very non-DRY and the LLM was not stellar at finding all the places that needed fixing. Strong types help, because some of the refactoring can be expressed in types, and then the compiler yells. But for the parts that aren’t expressed in types, DRY is your friend for refactoring. So it’s still very important for larger codebases.

If you are in write-once mode, creating one-off solutions, then I agree: DRY is irrelevant.


I hypothesize that the “adequate unit testing” required for untyped languages is an efficiency loss on the context window and agentic compute. There is overlap between good typing and a subset of the unit tests an untyped implementation needs. I assert that when that overlap shifts over to unit tests due to a lack of typefulness, those unit tests represent a significantly inefficient use of agentic compute (ACUs, tokens, context window consumption, etc.).

Not just in the session in which the unit tests were created, but in every subsequent session where the code under test is referenced or refactored in any way.

No, mostly not, at least not now. They’re mostly just left there, unread, unless there’s a test failure, just like with humans.

It does of course cost more compute to write them (including ones that would be unnecessary if one had stronger types), and it can be tricky to exercise all the paths well. But once there, they sit pretty much ignored until they break.

It’s really quite similar to unit-test-validated code with human coders in that way. AI can just write more, faster, so maybe you care a bit less. (It also has an annoying habit of “testing” by reinventing the same code as the test case–but humans do that too sometimes, and you can reduce the frequency if you point out that an independent way to get to the answer, not just a duplicate of the code, is preferable.)

The “tests just sit there unread” model describes human workflows accurately. But agentic workflows have different economics.

First, to be clear: I’m not arguing for types instead of tests. My production systems use e2e tests, integration tests, a human QA team, Kubernetes canary deploys with fast rollback, and myself as the last line of defense. The testing infrastructure is robust. What I’m arguing is that a subset of unit tests — the ones that merely validate constraints expressible in the type system — become redundant when types are sufficiently expressive. Those specific tests can be replaced, not all tests.

Second, even when tests “just sit there,” the feedback loop characteristics differ at the moment they do break:

  • Type error: compiler says “expected NonZeroInt, got Int at line 42.” Agent fixes it in one reasoning step, no additional context load.
  • Test failure: agent reads test → reconstructs what invariant it checked → locates the defect in source → fixes → reruns. Multiple steps, each consuming tokens and reasoning capacity.

Third — and this is from nearly a decade of maintaining production systems, well before the vibe-coding era — I routinely do upside-down-level rewrites of production systems. In my experience, a sufficiently expressive type system works like a circulatory system running through every corner of the codebase. Any inconsistent change gets rejected by the compiler, which expands the blast radius of a refactor (and I consider that a feature, not a bug — I want the compiler to force me to address every affected site). Isolated unit tests have no such hard linkage between them. Test A and Test B can both pass individually while the contract between their subjects is silently broken.

This is why my LLM rules encourage e2e and real-world integration tests, but explicitly forbid writing thousands of lines of unit tests for getters and setters. (I suspect everyone has seen an agent proudly generate 2000 lines of assertEquals(user.getName(), "Alice") — that’s pure token waste with zero safety value that types wouldn’t already provide.)

Absolutely yes. And it’s even worse than an efficiency loss — it’s a recurring tax on every future session, exactly as you point out.

A concrete example:

# The unit-test approach: a comment, a runtime check, AND a test
import pytest

# don't pass 0 for b, PLEASE!!!
def bad_div(a: int, b: int) -> float:
    if b == 0:
        raise ZeroDivisionError("b must be non-zero")
    return a / b

def test_div_zero():
    with pytest.raises(ZeroDivisionError):
        bad_div(42, 0)

// The type approach: the constraint IS the signature
def goodDiv(a: Int, b: NonZeroInt): Int = a / b
// no comment needed, no runtime check needed, no test needed
// the compiler rejects goodDiv(42, 0) before it ever runs
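For anyone wondering what a `NonZeroInt` could look like, here is a minimal Scala 3 sketch (the opaque type and smart constructor are illustrative; libraries such as refined or iron offer production-grade equivalents):

```scala
object NonZero:
  // Opaque wrapper: the only way to obtain a NonZeroInt is through the
  // smart constructor, so a zero divisor is unrepresentable downstream.
  opaque type NonZeroInt = Int

  object NonZeroInt:
    def from(i: Int): Option[NonZeroInt] =
      if i == 0 then None else Some(i)

  extension (n: NonZeroInt) def toInt: Int = n

import NonZero.*

// The constraint lives in the signature; no comment, guard, or test needed.
def goodDiv(a: Int, b: NonZeroInt): Int = a / b.toInt
```

Validation happens once, at the boundary where the raw `Int` enters the system; everywhere else the type carries the proof.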

The untyped version requires three artifacts to express one constraint: a comment (for humans), a runtime guard (for production), and a test (for CI). Each one consumes tokens — when written, when read into context, and when maintained across refactors. The typed version encodes it once, in the signature, and the compiler enforces it for free in perpetuity.

Now multiply this by every invariant in a codebase. Each untyped invariant is a comment + guard + test triple that the agent must load, comprehend, and maintain in every session it touches that code. Types collapse that triple into a single zero-cost declaration. Over hundreds of sessions, the cumulative token savings — and more importantly, the reduced surface area for the agent to misunderstand or silently violate an invariant — is substantial.

Thanks for the detailed response — clearly coming from hands-on experience.

On FP vs non-FP being a wash: We agree more than it seems — the real axis is how much information lives in the signature, not FP vs imperative. Rust’s ? is railway semantics in imperative clothing; Result<T, E> is signature honesty regardless of paradigm. But here’s where I’d push further: once you follow the “signature honesty” principle to its logical conclusion — preconditions, postconditions, contracts verified at compile time — the FP ecosystem currently has the more mature tooling. Scala has Stainless (SMT-backed require/ensuring), Rust has Flux-rs. On the imperative side… OpenJML exists, but is anyone actually running it in production? :blush: The principle is language-agnostic, but the ceiling of expressiveness currently tilts toward FP-adjacent type systems.
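For readers unfamiliar with the Stainless style mentioned above: plain Scala already ships `require`/`ensuring` in the standard library as runtime assertions, and Stainless attempts to discharge the same contracts statically with an SMT solver. A runtime-checked sketch (the function itself is illustrative):

```scala
// require states a precondition, ensuring a postcondition. In vanilla
// Scala these are runtime checks; Stainless tries to prove them at
// compile time instead, rejecting code that can violate them.
def isqrt(n: Int): Int = {
  require(n >= 0, "n must be non-negative")
  math.sqrt(n.toDouble).toInt
}.ensuring(res => res >= 0 && res.toLong * res <= n)
```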

On weak types + good tests: The gap is really about feedback loop token cost. A type error is instant, zero-token feedback — the agent fixes it in the same reasoning step. A test failure triggers a full cycle: run → read output → locate fault → fix → recompile → rerun. In practice that might mean fixing a dozen+ call sites with long context gaps between each fix attempt. Every gap is an opportunity for the LLM to make new mistakes. Shorter feedback loops don’t just save tokens — they reduce compounding error probability.

On error swallowing being impervious to rules: You’re right, and I said as much — absolute rules are “calibration parameters against training bias,” not a cure. Your “explicit review phase” maps to what I call Layer 4 (human review). That said, I’ve observed significant improvement across model generations: Opus 4.5 already reduced the frequency substantially, and with Opus 4.6 I’m catching maybe one instance every 1-2 days (rough estimate from the past month, not rigorous measurement). Rules reduce frequency; review catches the rest; model evolution is closing the gap. None alone is sufficient, but the trend is encouraging.

On DRY: I think we’re actually saying the same thing. My rule isn’t “never DRY” — it’s “the agent is forbidden from autonomously deciding to extract abstractions.” Extraction requires two conditions simultaneously: human approval AND inscription in the rule file. Without that, stateless sessions produce repeat-DRY: session A extracts HttpHelper, session B doesn’t know it exists and creates ApiClient, session C creates RequestUtil — a dozen sessions, a dozen near-identical abstractions. Your experience of “constantly interrupting the LLM to insist on better abstraction” is exactly the human-directed DRY I endorse. The anti-pattern is unsupervised DRY across memoryless sessions.

On purity bookkeeping polluting signatures: When 90-99% of the code is AI-authored, the calculus flips. The information density a signature carries matters far more than whether it looks clean to a human skimming it. IO[Either[AppError, User]] is noisier than User, sure — but the human’s job is reviewing contracts, not admiring aesthetics. More info in the signature = easier review, more honest for LLM reasoning, and when refactoring causes type-level blast radius, it’s the AI recursively chasing compiler errors, not the human. Humans aren’t frequently “polluted” by the noise because they’re not the ones swimming in it all day.

Well, yes, I fully agree there.

My point was only that you usually don’t need to load a passing test into context at all. So passing tests are free. It’s not an ever-increasing burden like gabriel suggested.

However, once you have a non-passing test, yes, it’s usually a much more context-heavy fix than a typechecking error. It’s a relatively fixed-cost burden–but it is a burden, so why have it if you can make the type system handle it instead (which is cheaper and more robust)?

Now, if you’re going to break a whole bunch of things at once–a major refactor, for instance–then, yes, you’re hit by the whole burden. Tests and major refactors interact poorly for everyone, human or AI: you have to go think through each test and decide whether it is even relevant, and if it is relevant whether the old behavior is still desired or whether you need a new test to cover the new behavior. But unless you’re refactoring in big ways constantly, this isn’t the normal experience. The normal experience is you get hit by test failures with a high per-test cost compared to a type error, but it’s roughly constant (unless you write a bunch of tests that all test the same thing–but that’s a flaw of test creation, not tests vs. types).

I sort of agree, but I don’t think the match between the FP toolset and the ideal set of type tools is that great.

SMT solvers are great! And FP people have deployed them, which is great. But to the extent that they are geared towards, for instance, a deeply monadic approach to computation, I don’t think they hit the key need of having confidence in AI-generated code. So maybe we’re not saying different things…I just don’t think the pure-functional part of FP is the key win, except in those problem domains where purity maps well onto the most challenging part of the problem for AIs (which isn’t always the same boundary as with humans).

Oh, hm, I’ve never seen that. Granted, I have a line that reads something like, “Use existing abstractions to reduce code duplication when possible. Extend abstractions or create new abstractions when duplication is significant and existing abstractions are inadequate.” Point is–I usually stress making sure existing abstractions are used, which also tends to fail in favor of code repetition. Maybe that’s why I haven’t seen the nested abstraction chains in practice?

Agreed, but on a signature like that I would ask whether, for this application, IO is doing some heavy lifting. Maybe yes. Maybe no, but you need it to fit into the FP framework. And do we need the explicit AppError, or should we swallow that as understood context? So–is it really just Result<User> in Rust-land, and that is plenty? Or is it doing less work than it should, because IO abstracts over both sync and async computations, often just for human convenience, even though there can be important differences in behavior? So maybe, even though it’s less FP-friendly, we would be better served with type information if it were Future[Either[AppError, User]]–maybe that’s more precisely what is going on here, and the composition properties you get with IO are actually not being leaned on.

So, yes, richer types. FP types specifically? Not as convinced–you have to convince me separately about (1) FP is the right abstraction for this problem, and (2) these are the right rich types to express the contract.

On test costs: I think we’ve converged — passing tests are free, failing tests have a higher per-incident cost than type errors, and the question is just how often you’re in “change mode” vs “stable mode.” Agreed.

On repeat-DRY: I think the divergence in our experiences maps to different project profiles. The repeat-DRY pattern primarily shows up in AI-native projects — codebases where 90%+ of the code has been AI-authored from day one, often without thorough human expert review of the emerging structure. For established codebases, the pattern is much rarer: project conventions are already well-formed, and tools like Claude’s /init scan the existing structure to learn where helpers live and how abstractions are organized (this has been getting noticeably better since late 2025). I suspect your experience is mostly with agents working within existing well-structured projects, where the failure mode is “agent ignores conventions and writes inline” rather than “agent invents its own conventions from scratch in every session.” Both are real failure modes, just in different contexts.

On FP, pure-FP, and IO: I should clarify a distinction I didn’t make clearly enough. I enforce FP style (no local mutation, globals only with human approval) across all my projects — Scala, Rust, and TypeScript alike. But I only enforce pure-FP (IO monad, tagless final, effect tracking) in Scala, where the ecosystem supports it idiomatically. Rust and TypeScript get FP discipline without the purity machinery.

Why FP style universally? The biggest win isn’t elegance — it’s referential transparency for agents with limited vision. A stateless agent starting a new session — or a sub-agent working from a summarized global context — doesn’t need to reason about hidden state. It reads a function signature, follows 1-2 hops of definitions or call sites, and has everything it needs to reason correctly. No guessing what mutation happened three stack frames up. No wondering if some global was modified by a concurrent operation it can’t see. This matters much less for humans, who build up a mental model of the whole system over months. But for an agent whose “memory” resets every session? Referential transparency is the difference between reliable local reasoning and fragile global guessing.
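A minimal illustration of the limited-vision point (the names are made up): with hidden mutable state, a call site underdetermines the result; with a referentially transparent function, the signature and arguments are everything an agent needs.

```scala
// Hidden state: next() returns different values on identical calls,
// so an agent must reconstruct the call history to predict it.
class Counter:
  private var n = 0
  def next(): Int = { n += 1; n }

// Referentially transparent: same input, same output, every time.
// State is threaded explicitly through the signature instead.
def nextPure(n: Int): (Int, Int) = (n + 1, n + 1)
```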

Why pure-FP only in Scala? Two reasons. First, the toolchain (cats-effect, fs2) is mature and well-represented in training data, so the model reliably produces high-quality code in that style — I’m leveraging existing high-quality patterns in the training set, not asking the model to do in-context learning of novel abstractions. Second, IO carries operationally valuable information beyond what FP-style alone provides: def doSomething[F[_]: {Database, FileSystem}](...) tells the agent exactly which effects are in play. During a production database incident, the agent skips every pure function and focuses on code paths with the relevant capabilities. That’s a significant debugging advantage.
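A sketch of what that capability style can look like (the traits and the identity interpreter are illustrative, not the article's actual code):

```scala
// One trait per capability; a function's using-clause advertises
// exactly which effects it may touch.
trait Database[F[_]]:
  def lookup(key: String): F[Option[String]]

// Pure helper: no F, no capabilities. During a database incident an
// agent can skip it mechanically rather than heuristically.
def normalize(name: String): String = name.trim.toLowerCase

// Effectful path: the signature alone says "this touches the database."
def userName[F[_]](key: String)(using db: Database[F]): F[Option[String]] =
  db.lookup(key)

// A trivial identity interpreter, just to run the sketch; production
// code would supply a cats-effect IO-backed instance instead.
type Id[A] = A
given Database[Id] with
  def lookup(key: String): Option[String] =
    Map("u1" -> "  Alice ").get(key)
```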

In Rust, I don’t use any pure-FP abstractions beyond Stream<T> and Has<T> — because that’s what’s idiomatic. I’m not fighting the language’s grain; I’m maximizing signature information within each ecosystem’s natural style.

On “(1) FP is the right abstraction (2) these are the right rich types”: You’re right to demand that separation. My claim isn’t “FP is universally correct.” It’s: maximize signature information density, using whatever paradigm has the most mature tooling in your language’s ecosystem. FP style gives you referential transparency everywhere. Pure-FP gives you effect tracking on top — but only where the ecosystem makes it essentially free. In Scala, that’s cats-effect. In Rust, it’s the type system itself (ownership, lifetimes, Result<T, E>). Different mechanisms, same principle.

I’d call it aggressive pragmatism :blush: — FP everywhere for transparency, pure-FP only where the ecosystem makes it free, and always push signature honesty as far as your language allows without fighting its grain.

That last bit is the key part. I don’t produce codebases without human expert review unless they’re tiny, because almost inevitably something goes wrong that is more human effort to fix than it would have been to pay attention along the way.

But I do have AI-authored codebases where 100% of the code is written by AI, and I have seen a lot of ill-advised repetition and essentially no hyperdeep abstraction, given what I’ve been using as prompts. Anyway, something is different–but it isn’t that the projects started well-structured in my case. It’s either scale (maybe mine are smaller–my AI-created projects are all under 100k lines), or that I don’t just say, “Hey, build me an office suite” and come back and see what happened.

But I don’t find that a win. You put what the AI needs to know in view, as part of the API description. If you have N operations that commute and transform between the same types, RT isn’t going to save you if you forget to do one, or if you do one twice. And that absolutely is a limited vision problem too. A fluent builder is the same problem as a series of maps, save that the latter has more boilerplate in exchange for being more refactorable.

If, in contrast, you can encode successful initialization in types–that’s a win. If part of the point is to be able to play with execution as values, well, yeah, it’s much less error-prone with RT. It’s the same as with people, really.

If you need to keep massive amounts of state in mind at all times in order to operate, it’s tough. If you have clean boundaries where modest amount of state needs to be kept in mind, it’s fine.

On repeat-DRY: Fair point — supervision density is probably the more precise variable than “AI-native vs established.” And honestly, this touches on a real human weakness I’ve experienced firsthand: when AI generates large volumes of stylistically clean code, review vigilance erodes. I got lazy myself. For anyone running into the same problem, the approach I describe in the article — shifting more of the burden from human review to compiler enforcement — has helped a lot in my case.

This actually cost me several production incidents in a Kotlin codebase (the predecessor to my current Scala system). The code looked fine, passed review, and broke in production. That experience is what drove me to rewrite the entire system in Scala with strict type-signature contracts earlier this year — a move that only became practical as models improved: the Cursor + Sonnet 3.7 through Opus 4.1 era struggled with complex type inference in Scala’s type system, which is why the original was in idiomatic Kotlin. Opus 4.5–4.6 made the leap feasible.

On RT and the limits of what it solves: You’re right — RT doesn’t prevent logical errors. N commuting operations on the same types, RT won’t stop you from forgetting one or doing one twice. That’s a type-level constraint problem, and encoding invariants in types (like successful initialization) is the win there, independent of FP.

I don’t claim RT is a silver bullet — no silver bullet exists. What I’m sharing is that in my production systems, it’s been a better default than the alternatives I’ve tried. It makes the agent’s job easier at the margins: a function with no effect marker can be mechanically skipped during a stateful bug hunt, rather than heuristically skipped. That margin compounds across hundreds of agent sessions in my experience. But it’s a margin, not a guarantee.

Thanks for the post. I do feel the question of how do AIs write & work with Scala code is central to Scala’s future, and a topic worth spending time exploring.

I write functional Scala 3, mostly on top of the Cats/Typelevel library ecosystem. I’m particular about how my code looks and is structured. It is only quite recently, with the Claude 4.6 models, that I have started to successfully achieve a quality level I’m happy with.

In 2026 it seems most of my code is AI-written, whereas in mid 2025 I felt I faced the awkward choice between

  • AI: fast but low quality
  • Hand-written: painstakingly slow compared to an AI, but much higher quality

A lot of that is model improvement, particularly in their ability to understand/reason about type signatures and code semantics. The slur that AIs can only rehash human code they have seen on the web is false in my experience; they can definitely compose novel solutions.

Also, keeping an AGENT.md file filled with positive and negative examples has been really helpful. Every time Claude writes code I’m not happy with, I ask it to copy the example and correction into the guideline file.

To your main point about type signatures guiding AI: broad agreement. One codebase I’m working on has a small, strongly typed domain model that I hand-wrote, encoding a lot of constraints into the type signatures. The bulk of operational code that works with that domain model is AI-written, and the types do help it write better, more correct code (as they would a human).

Also, yes, AIs do write functional code fluently. They don’t get put off by functional style, they know their Cats combinators, and they don’t randomly drop into imperative style in the middle of a method.


I would like to share the announcement of a talk that directly applies to the discussion in this thread: next week (on March 26), Martin Odersky will be talking about How can we trust our agents? at the Scalar Conference 2026.