The Hard Parts Remain Hard
Observations on AI, Specialization, and an Unexpected Shift
I’ve been thinking lately about something Andrej Karpathy wrote on X back in October, in response to a viral Richard Sutton interview that stirred the debate over whether LLMs are enough to achieve true AGI.
He described LLMs as “people spirits,” stochastic simulations of people in which the simulator is an autoregressive Transformer trained on human data, with a kind of emergent psychology.
When people found the ghost analogy provocative, he elaborated: “it captures how LLMs are purely digital artifacts that don’t interact with the physical world, how they’re a kind of ‘echo’ of the living, a statistical distillation of humanity, and how the process of training them is like summoning a ghost, an elaborate computational ritual on an exotic megastructure.”
Using Cursor and Claude daily, I’ve come to understand what he means.
These tools feel like digital ghosts, present but not quite here, capable yet fundamentally constrained by a strange, jagged intelligence that doesn’t map cleanly onto human capabilities.
The ghost metaphor resonates because it gets at something essential about our current moment. For all the progress we’ve made, these systems still need to be invoked, managed, and steered.
They don’t persist the way human expertise does. They don’t build on their experiences across sessions.
Karpathy calls this “anterograde amnesia”: the models are like coworkers who never consolidate or build long-running knowledge once training is over, relying only on short-term memory.
Every conversation is a fresh start, every context window a temporary scaffold that collapses when the session ends.
What strikes me most, using these tools day after day, is how the hard parts of building software remain stubbornly hard.
I can ask Claude to refactor a complex function, have Cursor generate a new component, or even dispatch an agent to build a feature with many nuances, and sometimes the results are startlingly good.
But the fundamental work of understanding systems, making architectural decisions, and debugging the subtle interactions between components still demands human judgment in ways that feel irreducible.
We’ve compressed the search space for solutions and eliminated certain kinds of boilerplate, but we haven’t eliminated the craft's essential difficulty.
The Evolution We’ve Witnessed
In hindsight, the past two years have traced a clear arc. We began with basic generative use cases, the kind where you’d laboriously copy code into ChatGPT, build context through careful prompting, and engineer your requests to elicit the right kind of response.
Cursor’s early appeal was precisely this: it automated the tedious work of context-passing, making it frictionless to ask questions about your codebase. That single innovation, automating the clipboard dance, felt transformative.
But then something shifted. The emergence of reasoning models (test-time compute) changed the game entirely.
Models that could think for extended periods before responding, generating long chains of reasoning rather than immediate answers, suddenly made new kinds of tasks tractable.
Research showed that simply increasing test-time compute, the computational effort allocated during inference, could enable small models to outperform much larger ones, particularly in mathematical and programming tasks. The implications reached far beyond incremental improvement. We were witnessing a different paradigm unfold.
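One way to make that concrete is best-of-N sampling, one of the simplest forms of test-time scaling: spend extra inference compute by drawing several candidate solutions and letting a cheap verifier pick the winner. The sketch below is purely illustrative, with stubbed generate() and verify() functions standing in for a real model call and a real checker.

```python
import random

# Minimal best-of-N test-time scaling sketch: instead of one greedy answer,
# sample several candidates and keep the one a verifier scores highest.
# generate() and verify() are stubs for a real model call and a real check
# (a test suite, a reward model, a proof checker).

def generate(prompt: str, temperature: float = 0.8) -> str:
    # Placeholder for a sampled model completion.
    return f"candidate (seed={random.random():.3f}) for: {prompt}"

def verify(candidate: str) -> float:
    # Placeholder for an automatic score, e.g. a unit-test pass rate.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verify)

print(best_of_n("implement a stable sort for linked lists"))
```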
The agentic loop became possible. Models could now learn a system at runtime, explore codebases, call tools to build their own context, and take actions based on that understanding.
DeepSeek-R1, for instance, developed powerful reasoning behaviors naturally during training: self-correction, reevaluation of flawed logic, and validation of solutions, all within its chain of thought.
These weren’t explicitly programmed behaviors but emergent properties of the reinforcement learning process.
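The loop itself is easy to sketch. The version below is illustrative only, with a stubbed chat() call and two toy tools; it shows the shape of the pattern rather than any particular product’s implementation.

```python
from pathlib import Path

# Illustrative agentic loop: the model sees the growing context and either
# calls a tool to gather more information or stops with a final answer.
# chat() is a stub for a real model API.

def read_file(path: str) -> str:
    return Path(path).read_text(errors="ignore")[:2000]

def grep(pattern: str, root: str = ".") -> str:
    hits = []
    for f in Path(root).rglob("*.py"):
        for i, line in enumerate(f.read_text(errors="ignore").splitlines(), 1):
            if pattern in line:
                hits.append(f"{f}:{i}: {line.strip()}")
    return "\n".join(hits[:20]) or "no matches"

TOOLS = {"read_file": read_file, "grep": grep}

def chat(messages: list[dict]) -> dict:
    # Stub: a real model would return either a tool request or a final answer.
    return {"type": "final", "content": "stubbed answer"}

def agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = chat(messages)
        if reply["type"] == "final":
            return reply["content"]
        # The model asked for a tool; run it and feed the result back as context.
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(agent("Where is retry logic implemented in this repo?"))
```

The important property is that the model assembles its own context: whatever the tools return becomes the material it reasons over on the next turn.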
What I’ve observed in my own work is that it’s becoming less important whether a model “knows” specific libraries or frameworks. As long as it can learn those things at runtime, and as long as it has enough foundational capabilities to process documentation, understand patterns, and apply them contextually, it can solve the problem at hand.
This realization has profound implications.
An Unexpected Turn
Something curious happened recently that crystallized these trends for me. Cursor released its 2.0 version with Composer 1, billing it as their first in-house model.
They claimed it achieved frontier coding results with a generation speed four times faster than similar models.
But within hours, developers noticed something odd: Chinese-language text appearing in Cursor-generated code snippets, with the model “speaking Chinese” in its reasoning traces.
Around the same time, another US coding tool, Windsurf (now part of Cognition), launched SWE-1.5 with similar fanfare. Windsurf confirmed that its core model was provided by Z.ai (formerly Zhipu AI) after the company’s official account reposted Windsurf’s launch announcement.
The pattern was hard to miss: both Cursor’s Composer and Cognition’s SWE-1.5 are speculated to be built on Chinese base models, with strong evidence pointing to GLM, Kimi, Qwen, or DeepSeek.
What interests me most here is what this reveals about the trajectory we’re on. China’s open-source models have caught up in both performance and cost.
For companies like Cursor with substantial deployed user capacity and the infrastructure to collect training data at scale, taking an open-source checkpoint like Qwen, GLM, or DeepSeek and fine-tuning it for specific use cases makes perfect sense.
You get a world-class base model, specialized reinforcement learning on your domain, and you own the entire stack. The economics are compelling: Social Capital founder Chamath Palihapitiya noted that his team migrated their workloads to Moonshot’s Kimi K2 model because it was “way more performant and frankly just a ton cheaper” than offerings from OpenAI and Anthropic.
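The recipe, stripped to its bones, looks something like the sketch below: take an open checkpoint, attach lightweight LoRA adapters, and train on your own domain data with Hugging Face transformers and peft. The model name, hyperparameters, and toy dataset are placeholders, and a real effort would add large-scale domain RL on top of this supervised step.

```python
# Sketch only: specialize an open checkpoint with LoRA adapters.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-Coder-1.5B"  # placeholder open checkpoint
tok = AutoTokenizer.from_pretrained(base)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Train small adapter matrices instead of updating every weight.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
))

# Toy stand-in for proprietary domain data (edit traces, reviews, tickets...).
examples = Dataset.from_dict({"text": [
    "### Task\nRename a variable across the file\n### Patch\n...",
]})
tokenized = examples.map(lambda e: tok(e["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specialized-sketch",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```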
Speed as the New Frontier
What I find particularly interesting about Composer is how it stakes out a specific position in the capability-speed tradeoff space.
The smartest model might be something like GPT-5 or Claude Sonnet 4.5, and the absolute best coding model might live elsewhere on the spectrum entirely.
But Composer generates around 250 tokens per second, nearly 4x faster than many competitors, while maintaining frontier-level results.
Speed opens up a different kind of workflow. When a model can complete most turns in under 30 seconds, you can run multiple attempts with different prompts or strategies, compare results, and arrive at a solution faster than a slower, more deliberate reasoning model might.
The difference feels like having one brilliant but contemplative advisor versus three quick, capable consultants working in parallel.
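In practice, that parallel workflow can be as simple as the sketch below: fan out a few prompt variants to a fast model, then keep whichever draft survives a cheap automatic check. The attempt() and passes_tests() functions are stubs standing in for real model calls and a real test run.

```python
import asyncio
import random

async def attempt(strategy: str) -> str:
    # Stands in for a fast model call; a quick model keeps each turn short.
    await asyncio.sleep(random.uniform(0.1, 0.5))
    return f"diff produced with strategy: {strategy}"

def passes_tests(diff: str) -> bool:
    # Stands in for running the test suite against the proposed diff.
    return random.random() > 0.3

async def main() -> None:
    strategies = ["minimal patch", "rewrite the module", "add a regression test first"]
    drafts = await asyncio.gather(*(attempt(s) for s in strategies))
    winners = [d for d in drafts if passes_tests(d)]
    print(winners[0] if winners else "all attempts failed; escalate to a slower model")

asyncio.run(main())
```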
This matters more than I initially realized.
In my daily work, I’ve found myself gravitating toward faster models for iterative tasks like refactoring, debugging, and implementing well-specified features, while reserving the heavy reasoning models for architectural decisions or complex algorithmic problems.
The jaggedness Karpathy describes becomes something you learn to work with, matching the tool to the task.
From Generalization to Specialization
Generalization no longer feels like the primary trajectory we’re following. That may sound surprising given how much discourse still centers on AGI and frontier models, but the actual patterns I’m observing in the industry point elsewhere. Specialization is emerging as the more material, more practical path.
Consider the evidence. Anthropic is building models with an explicit focus on software engineering capabilities. Google releases Gemini variants optimized specifically for coding tasks. The Chinese ecosystem has produced models like Qwen-3-Coder, specifically tuned for programming; GLM-4.6, excelling at agentic tasks; and Kimi K2 Thinking, shipped and branded as a “thinking agent”.
We’re watching models designed from the ground up or heavily fine-tuned for specific domains, rather than general-purpose systems that happen to perform well across many tasks.
The economics support this shift. Once you have capable open-source base models and the infrastructure to collect domain-specific data, building a specialized model becomes feasible for mid-sized companies.
I suspect we’ll see this pattern repeat across domains. Customer support models. Legal research models. Financial analysis models. Medical diagnostic assistants. Each will be fine-tuned on vast amounts of domain-specific data, each outperforming general-purpose models in their niche while being faster and cheaper to run.
The companies that realize they’re sitting on valuable training data (years of support tickets, code reviews, financial reports, medical records) will start building their own reinforcement learning environments and fine-tuning niche models.
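What one of those reinforcement learning environments might look like is easy to caricature, as in the sketch below: episodes drawn from historical tickets, rewards based on how closely the model’s draft matches what actually resolved the case. Everything here is illustrative; real reward design is much harder than word overlap.

```python
import random

class SupportTicketEnv:
    """Toy environment: one episode per historical support ticket."""

    def __init__(self, tickets: list[dict]):
        self.tickets = tickets

    def reset(self) -> str:
        self.current = random.choice(self.tickets)
        return self.current["question"]

    def step(self, draft_reply: str) -> float:
        # Reward stub: word overlap with the historical resolution.
        target = set(self.current["resolution"].lower().split())
        guess = set(draft_reply.lower().split())
        return len(target & guess) / max(len(target), 1)

env = SupportTicketEnv([{
    "question": "Refund not received after cancellation",
    "resolution": "issue refund via billing portal",
}])
prompt = env.reset()
print(env.step("please issue the refund through the billing portal"))
```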
The Meta of It All
The situation with Meta’s Llama series feels particularly poignant. Reuters reported in 2024 that many Chinese foundation models relied on Llama models for their training. Meta created the open-source movement in large language models, and the Chinese ecosystem built successfully on that foundation.
But now, in a strange twist, Chinese LLMs have crossed the threshold from experimental to production-grade infrastructure, with Alibaba’s Qwen dominating downloads and being used by companies like Airbnb in production.
Meta’s recent hiring spree and pivot toward newer Llama architectures suggest they recognize the landscape has shifted. But the strategy remains unclear.
Will they double down on open source, accepting that others will build successful businesses on top of their work? Or will they attempt more proprietary offerings?
The company that launched this entire movement now finds itself responding to innovations from models that wouldn’t exist without Llama’s initial gift to the community.
I expect Meta will lean into specialization too, but with their characteristic scale. Smaller, highly optimized models for specific benchmarks rather than trying to dominate every dimension simultaneously. The economics and the technology both seem to be pointing in that direction.
That approach won’t win the flashiest AI optics, but it’s likely the best way to extract material gains from AI productization.
What I Make of This
Looking at Google’s trajectory provides another data point. When I tried Gemini 3 Pro in Cursor recently, I found it underwhelming for coding tasks.
But I remember a similar pattern with Gemini 2.5: the initial release was competent but unexceptional. Then they released a version optimized for code that was remarkably good.
I’d bet we see the same thing here, with Gemini 3 Pro becoming the base for multiple specialized efforts, each distilled and fine-tuned for particular domains.
The philosophical tensions here are worth sitting with. We have Chinese open-source models powering American developer tools. We have reasoning capabilities emerging from pure reinforcement learning without explicit programming. We have speed becoming as important as intelligence. We have companies that can’t afford to train foundation models from scratch, finding ways to compete by specializing existing ones.
None of this follows the narrative we told ourselves about scaling, about bigger models, more parameters, and more compute driving an inevitable march toward AGI.
What we’re actually seeing is something more interesting and more distributed: a Cambrian explosion of specialized models, each excellent at particular tasks, each finding economic niches, each building on the open-source foundations that companies like Meta and later Chinese labs have provided.
For me, using these tools daily, the most striking thing is how the fundamental difficulty of software engineering remains intact.
The models help enormously. They compress certain kinds of work and make tedious tasks trivial. But the hard parts, such as understanding complex systems, making good architectural decisions, debugging subtle interactions, and knowing what to build in the first place, remain as challenging as ever.
Perhaps that’s reassuring. Perhaps it means there’s still room for human judgment, for the kind of expertise that comes from years of wrestling with real systems in production.
Or perhaps it just means we’re in an intermediate stage, and the ghosts Karpathy describes will eventually become something more present, more capable of persistence and genuine learning.
What I know is that we’re moving somewhere interesting. The next year will tell us whether specialization wins, whether open source continues to outcompete proprietary models in specific domains, and whether companies like Meta can find a viable strategy in this new landscape.
The Chinese models are forcing everyone to reconsider their assumptions, not through any geopolitical maneuvering but simply by being good, fast, and freely available.
The hard parts remain hard. But the toolkit for addressing them keeps getting better.