hello@onechair.io · @onechair_io on X
13 min read

How I Built a $500K Data Synthesis Pipeline for $1,000 in 30 Days with Claude Code

Between January 23rd and February 21st of this year, I vibe coded 88,000 lines of Python into a data synthesis pipeline that scraped 1,861 universities and produced a dataset even the Department of Education doesn't have: the actual cost of attendance of every graduate and professional degree in America, uncovering a $51.8 billion funding gap that nobody has quantified yet. It's the kind of output that would have taken a team of 5 people 12-18 months to produce and cost $500K. Instead, I spent $967.08. Powered by copious amounts of Cafe Bustelo, 16-hour work days, and Chinese takeout, this is the story of the most sleepless month of my life.


The tweet

It's mid-December 2025. The Christmas lull is in full swing, people are checked out at work. Me? I'm bored. My mind is spiraling in an anxious doom loop with nothing to do. I'm waiting for my Alibaba shipment, which is somewhere in the Pacific Ocean. Wondering — should I send my supplier a message just to get a "hello dear 🌹🌹🌹" or a "yes boss"? Nah. Instead I'm mindlessly scrolling Twitter. Then I see this:

"I once did $0 to $60m just affiliate commission in 2 months on an offer."

I sit up in my chair. Like the kids say, I'm locked in 🔒. I'm clicking on Jason's tweet, reading the replies. How did he do it? I pull up Gemini and start investigating. A pattern emerges. COVID-19 tests. PPP loans. It turns out government regulation creates a forced market dislocation. If you move fast, have the right skillset and team, there's a business on the other side.

Was there a modern equivalent, I wondered? I spent days digging, prompting Gemini like a maniac. Then I found it: the One Big Beautiful Bill Act, signed July 4, 2025. It included a provision eliminating Grad PLUS loans for all new graduate borrowers effective July 1, 2026. Unlimited federal borrowing would be replaced with hard caps, meaning the displaced loan origination would have to be absorbed by the private sector. Why was this being done? My best guess was that tuition inflation had gotten out of control — since 2006 the government had guaranteed unlimited borrowing for students up to cost of attendance, and tuition predictably spiraled. Now they were pulling the plug.

By Christmas my Alibaba shipment was forgotten and I was stuck in my office planning my own affiliate business with Gemini as my chief strategist. By January 1st I had a comprehensive business plan laid out and 30+ niche domains purchased to capture the internet real estate of this regulatory change. Next I needed to somehow collect the data that would power this business.


1,861 different websites

The data I needed didn't exist. The feds track who enrolls in graduate school (IPEDS). They don't track what it actually costs. I'd have to collect it myself — from every university in the country.

Collecting this data is the hardest class of scraping problem there is: Deep Traversal with Heterogeneous Targets. Standard scrapers fail here because every university presents the data differently. University A puts tuition in a PDF. University B uses a dynamic dropdown. University C splits it across three different Bursar pages.

To get the cost of a single program at a single university you need to: find the right department or bursar page, figure out whether the numbers live in a PDF, a dynamic widget, or a table spread across multiple pages, separate resident from non-resident rates, untangle per-credit, per-semester, and annual figures, and add up every required fee hiding elsewhere on the site.

Now multiply by 1,861 universities. Every school is its own puzzle.
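The first step of that traversal — finding the right page at all — has to be a heuristic, since every site is structured differently. The post doesn't show the pipeline's actual code; here's a minimal, dependency-free sketch of one plausible approach, ranking a page's links by cost-of-attendance keywords. The keyword weights are illustrative, not the real pipeline's values.

```python
from html.parser import HTMLParser

# Hypothetical keyword weights for ranking candidate links; the real
# pipeline's heuristics are not published in the post.
KEYWORDS = {
    "tuition": 5, "cost of attendance": 5, "bursar": 4,
    "fees": 3, "graduate": 2, "financial": 1,
}

class LinkCollector(HTMLParser):
    """Collect (href, anchor_text) pairs from a page."""
    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None

def rank_cost_links(html: str) -> list[tuple[str, str]]:
    """Rank links by how likely they lead to cost-of-attendance pages."""
    parser = LinkCollector()
    parser.feed(html)
    def score(link):
        text = link[1].lower()
        return sum(w for kw, w in KEYWORDS.items() if kw in text)
    return sorted((l for l in parser.links if score(l) > 0),
                  key=score, reverse=True)
```

A crawler would fetch the top-ranked links first, recursing a level or two deep before giving up on a school and queuing it for manual review.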

Claude Opus 4.6 had just come out. I wanted to see what it could do.


The seed phase

This was not my first rodeo scraping public data. The last time I did something like this, about 5 years ago, I discovered some data that, let's just say, should not have been public. But that's a story for another time.

The first two weeks (late January) were manual. I used Gemini and Claude to scrape universities one at a time, building up a "seed" dataset of ~5,300 rows. I needed to understand the shape of the problem before I could automate it. What do fee structures actually look like? Where do universities hide costs? What are the common traps?

I felt like Michael Burry in The Big Short. Thousands of lines of Excel data. Residence status. Tuition. Fees. Program durations. Credit hours. I watched the movie twice during those two weeks. Not even joking. Ludacris "Money Maker" is still looping in my head.

By February 7th I had enough seed data and enough scar tissue to know exactly what to build.


The sprint

February 8th. I let loose with Claude Code and something clicked. Pure dopamine. 5 hours of sleep a night, 16-hour days. I couldn't stop.

The timestamps don't lie. I ran stat on every file after the project was done:

$ stat --format='%y %n' scraper/**/*.py | sort
Feb 8 · 2:17pm First line of scraper code. By 9:44pm that same day — 7.5 hours later — an 11-step pipeline: 15,795 search queries · 9,054 HTML pages · 7,000 PDFs (14GB) · zero LLM cost
Feb 9 · 2:09am LLM extraction done. 8,972 rows. By 2:32am — validation, entity resolution, first final assembly. Slept a few hours. Back at it by 9am. By end of day: v2 re-extraction, v3 clean dataset, market analysis. One day from raw extraction to finished dataset.
Feb 10–13 Came up for air. Other projects. Planning the niche domains. Staring at the dataset wondering if the numbers were real.
Feb 14 Manual review phase begins. 440 university-specific extraction folders — each a self-contained package with custom crawlers, parsers, and structured output encoding that university's billing quirks.
Feb 17 · 3:47pm All 13 state system batch generators in a single session, among them SUNY, CUNY, CSU, UW, UC, MS IHL, MN State, FL SUS, UNC. Done by 11:22pm.
Feb 18 · 2:56am 40-page research report v1. By 5:40pm — v4, fully sourced.
Feb 21 · 9:40pm Final PDF generated. Hash signed. Copyright filed.
503 of 505 Python files created in this 13-day window · 82,635 lines · 94% of the entire codebase

$36 for 919 universities in 45 minutes

This is the part that broke my brain.

I fed 11,908 raw documents — HTML pages and PDFs — to Claude Haiku 4.5. Not with templates or regex. Just: "read this page like a human researcher would, extract these fields into this Pydantic schema." It understood that "Tuition & Required Fees" at Texas A&M means something different than "Systemwide Tuition" at UC Berkeley. That a table labeled "Per Semester" at the University of Alabama actually shows annual rates. That the International Student Services page at the University of Arizona shows non-resident rates masquerading as general rates.

$36 · 8,972 tuition rows · 13,364 fee components · 919 universities · 45 minutes

The edge cases

The pipeline architecture isn't where the complexity lives. It's in the 82+ edge cases Claude and I caught in a living file called LEARNINGS.md. Some favorites:

Texas Statutory Tuition Trap

Texas law sets "Statutory Tuition" at ~$50/credit. Not the real rate — a legislative floor from the 1960s. Real cost: "Designated Tuition," 5-10x higher. The LLM extracted both, inflating every Texas school's costs.
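Once caught, a trap like this becomes a mechanical validation rule. This is a hypothetical sketch of what such a rule might look like; the `$100` threshold and field names are illustrative, not the pipeline's actual values.

```python
# Edge-case rule: Texas "Statutory Tuition" (~$50/credit) is a 1960s
# legislative floor, not a real price. Flag suspiciously low per-credit
# rates at Texas schools so the "Designated Tuition" rate (typically
# 5-10x higher) is used instead. Threshold is illustrative.
def flag_texas_statutory(row: dict) -> bool:
    return (
        row.get("state") == "TX"
        and row.get("period") == "per_credit"
        and row.get("tuition_usd", 0) < 100
    )
```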

Fee Double-Counting

Texas A&M publishes "Tuition & Required Fees" as one number that already includes campus fees. If you also list those fees as separate components, the pipeline adds them again. This one took a while to catch.
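The fix for this class of bug is a totaling function that knows when the headline number already bundles the fees. A minimal sketch, assuming hypothetical field names and label matching; the real pipeline's logic isn't shown in the post.

```python
# If the published tuition figure is already "Tuition & Required Fees",
# adding the separately-listed required fees counts them twice. When the
# label indicates a bundled figure, skip the fee components entirely.
BUNDLED_LABELS = ("tuition & required fees", "tuition and required fees")

def total_cost(tuition_label: str, tuition_usd: float,
               fees: list[dict]) -> float:
    bundled = tuition_label.strip().lower() in BUNDLED_LABELS
    extra = 0.0 if bundled else sum(
        f["amount_usd"] for f in fees if f.get("required", True)
    )
    return tuition_usd + extra
```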

Living Expense Double-Counting

The LLM stored total Cost of Attendance (tuition + living) as the living expense field for 22 universities. Then the pipeline added tuition on top. One law school showed $138K cost of attendance. Correct answer: $72K.
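A cheap sanity check catches this: published living-expense budgets rarely exceed the low tens of thousands per year, so a "living" figure far above that likely already includes tuition. The cap below is an illustrative guess, not the pipeline's actual rule; the numbers in the test mirror the law-school example above.

```python
# Heuristic: a "living expenses" value this large is probably the full
# cost of attendance stored in the wrong field. Cap is illustrative.
LIVING_CAP_USD = 60_000

def living_looks_like_coa(living_usd: float) -> bool:
    return living_usd > LIVING_CAP_USD

def safe_total(tuition_usd: float, living_usd: float) -> float:
    # If the living field already looks like full COA, use it as the
    # total instead of adding tuition on top a second time.
    if living_looks_like_coa(living_usd):
        return living_usd
    return tuition_usd + living_usd
```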

Annual Rates Labeled "Per Semester"

University of Alabama Engineering had annual tuition labeled as per-semester in their fee schedule. The pipeline doubled it.

[SCREENSHOT — Claude reverse-engineering the TAMU Angular SPA]
Claude identifying the assets/rates-cs.json endpoint inside Texas A&M's Angular bundle and fetching it directly.

What came out the other end

The dataset: 7,338 rows. 1,861 universities. 22,296 fee components. 28 degree types. 99.7% coverage of every IPEDS-listed graduate institution in the country. The headline finding: 95.2% of graduate and professional programs exceed the new federal loan caps. Annual funding shortfall: $59.9 billion at sticker price. $51.8 billion after institutional grants.

The product: I fed Claude the report and enough context about the business, and it generated the landing page UI in one shot. The corporate website — same thing. One shot. I had to do edits later but the core was there. 12 niche domain websites, each serving a specific professional vertical with interactive calculators and 5,234 programmatic SEO pages. One Next.js codebase. All fed by the same dataset. All of it in 29 days.

February 21st, 9:40pm. Report submitted to the copyright office. I sat back in my chair and thought: what have I just done?


The cost

|               | Me      | 2021 estimate |
|---------------|---------|---------------|
| People        | 1       | 4–5 (data eng, scraping specialist, policy analyst, frontend dev, QA) |
| Calendar time | 29 days | 12–18 months  |
| Hours         | ~200    | 3,000–4,500   |
| Cost          | $967.08 | $200K–$640K   |

The $967.08 is everything. Anthropic API and Max subscription. Gemini Ultra. Haiku for extraction ($36). Sonnet for re-extraction (~$95). Vision for OCR. Serper.dev for search. Vercel for hosting. Total. For two months of work.

After the project was done, I ran an experiment. I gave the finished report to a separate Claude instance that knew nothing about the codebase and asked it to estimate the cost. It guessed 450–650 hours, $150K–$300K. When I told it the real number:

"That's honestly staggering."

— Claude (blind estimate, Feb 19)

I then asked Claude Code — the one that actually built the thing — to estimate. It said 500–800 hours with LLMs. It overestimated my hours by 3-4x because it modeled the work as pair programming. What actually happened was delegation. I reviewed deliverables. I didn't write code. That's the difference between using an LLM as a copilot (5x speedup) and using it as a workforce you direct (15–22x).

~500× Cost $500K → $967
~20× Hours 3,000+ → ~200
~15× Calendar 18 mo → 29 days

The calendar compression is the most interesting one. There's an old saying in engineering management: nine women can't make a baby in one month. Some work is inherently sequential — you can't parallelize it by adding people. But one person with an LLM flips that on its head. Instead of adding people (and handoffs, and tickets, and "waiting on the data engineer"), you eliminate the team entirely. Edge case #47 informs extraction #48 instantly, because the same context holds both.


Why this matters

Nobody is talking about this. I'd bet there are people at consulting firms and hedge funds producing work at this scale with LLMs right now. But the economics haven't caught up. Engagements are still scoped, staffed, and billed as if the old production function applies. That's not dishonesty — it's institutional inertia. The pricing models, the team structures, the timelines — none of it has adjusted yet.

And the incentives to stay quiet are obvious. If you're employed, revealing the throughput means your employer needs fewer of you. If you're a consultant, revealing the cost basis changes the conversation. If you're an academic, revealing the LLM contribution raises authorship questions. I don't have those incentives. So here we are.

I don't think most people have internalized what's possible right now. I certainly hadn't before I sat down in that chair in January.