Cheating Is All You Need

Source
url	https://about.sourcegraph.com/blog/cheating-is-all-you-need
raw	raw/highlights-cheating-all-you-need.json

TL;DR: Every AI winner will have a data moat: proprietary data that populates the context window and makes generic models dramatically more useful. The models themselves are commoditizing fast. What you put in front of them isn’t.

What it means

Yegge’s argument cuts through three years of noise about model architecture and training compute. The models themselves are rapidly commoditizing; the data you feed them at inference time is not. In his words: “A data moat is having access to some data that others do not.” The context window is the battlefield, and whoever fills it with the best proprietary data wins.

This reframes the AI competitive landscape entirely. The question isn’t “who has the best model?” but “who has the best data to put in front of the model?” In Hamilton Helmer’s 7 Powers framework this is a cornered-resource play: preferential access to a coveted asset (proprietary data) that competitors cannot replicate without going back in time to build the same product and gather the same usage history.

The implications for startups are clear, yet most founders haven’t internalized them. If you’re building on top of foundation models, your moat isn’t your prompt engineering or your fine-tuning. Your moat is your data pipeline. The companies that win will be the ones that accumulate proprietary datasets through product usage, creating a flywheel: more users generate more data, which makes the AI better, which attracts more users. This is network effects applied to data, and it may be the defining competitive dynamic of the AI era.

The argument

Context window > model weights. The “cheating” in the title refers to stuffing the context window with relevant data rather than relying on what the model learned during training. For almost every domain-specific task, this is cheaper, faster, more controllable, and more accurate than fine-tuning. It also puts the leverage back in the hands of whoever owns the data, not whoever owns the GPUs.
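To make the “cheating” concrete, here is a minimal sketch of context stuffing: rank private snippets against a question and pack the relevant ones into the prompt. Everything in it is an assumption for illustration. The snippets in `PRIVATE_DOCS`, the keyword-overlap scoring, the character budget, and the prompt wording are hypothetical and are not taken from Yegge’s post; a real system would more likely use embeddings or a code search index for retrieval.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def overlap_score(query: str, doc: str) -> int:
    """Count query tokens that also occur in the doc (a crude relevance proxy)."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(doc))
    return sum(min(count, d[tok]) for tok, count in q.items())


def build_prompt(query: str, docs: list[str], budget_chars: int = 500) -> str:
    """Pack the most relevant docs into the prompt, within a character budget."""
    ranked = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
    context, used = [], 0
    for doc in ranked:
        if overlap_score(query, doc) == 0:
            break  # everything after this point is irrelevant
        if used + len(doc) > budget_chars:
            continue  # too big for the remaining budget; try the next one
        context.append(doc)
        used += len(doc)
    joined = "\n---\n".join(context)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


# Hypothetical private data that a generic model has never seen.
PRIVATE_DOCS = [
    "def charge_card(order_id): calls billing.v2.charge with retry=3",
    "The office coffee machine is cleaned on Fridays.",
    "billing.v2.charge raises CardDeclined when the card is declined",
]

prompt = build_prompt("How does charge_card handle a declined card?", PRIVATE_DOCS)
print(prompt)
```

The model receiving `prompt` stays generic. The only differentiated ingredient is the private data chosen to sit in front of it, which is the point of the argument.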

Data moats are the new moats. When model capabilities converge (and they will, fast), differentiation comes from data. This connects directly to software-dev-costs-moats — as AI commoditizes code, the moats shift from technology to data and distribution. The companies that survive the next five years will be the ones whose data was already accumulating before everyone realized data was the moat.

Proprietary data compounds. The best data moats are self-reinforcing. Every interaction generates more data, which improves the product, which generates more interactions. This is the AI-native version of network effects, and it’s available to any company that designs its product to capture and leverage usage data from day one. The companies that bolt on data collection later have a much harder time, because the early data is what trains the system to know what to collect.
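The compounding loop can be illustrated with a toy model. This is a sketch under made-up assumptions, not a forecast: the growth rule, the logarithmic “quality” term, and every parameter value are hypothetical choices picked only to show the shape of the feedback loop. It compares a product that captures usage data from day one with one that bolts collection on after a year. It does not model the separate claim that early data teaches the system what to collect.

```python
import math


def simulate_flywheel(start_month, months=24, users0=1000.0,
                      events_per_user=10.0, base_growth=0.02, data_boost=0.01):
    """Return (users, data_rows) after `months` of a toy data flywheel.

    Data is only captured from `start_month` onward. All parameters are
    hypothetical illustration values.
    """
    users, data_rows = users0, 0.0
    for month in range(months):
        if month >= start_month:
            # Every active user deposits usage data this month.
            data_rows += users * events_per_user
        # Assumed: product quality grows with the log of accumulated data,
        # i.e. more data helps, with diminishing returns.
        quality = math.log10(1.0 + data_rows)
        # Better quality lifts the monthly user growth rate.
        users *= 1.0 + base_growth + data_boost * quality
    return users, data_rows


early_users, early_rows = simulate_flywheel(start_month=0)
late_users, late_rows = simulate_flywheel(start_month=12)
print(f"collect from day one: {early_users:,.0f} users, {early_rows:,.0f} rows")
print(f"collect after a year: {late_users:,.0f} users, {late_rows:,.0f} rows")
```

Under these assumptions the early collector ends with both more users and more data, because each month’s extra data raises the growth rate that produces the next month’s data.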

The ChatGPT version

ChatGPT is Yegge’s argument turned into a real business at scale. OpenAI’s data moat isn’t the training data (that’s commoditizing). It’s the conversation history: billions of interactions that no other model has seen, which feed RLHF, fine-tuning, evaluation, and, since the memory feature shipped, personalized context for individual users. Every conversation adds another row to a dataset that nobody else can replicate without redoing years of consumer-product work from scratch. This is the relationship effects story told as a corporate-level data moat.
