Guide

AI-generated UI quality

Agents write interface code faster than anyone reads it. The failure modes are consistent and checkable.

If your team ships with Cursor, Claude Code, or Copilot, more of your interface is now written by a model than by any single engineer. It compiles, it renders, the demo looks fine. The work didn't disappear — it moved. Writing the UI used to be the slow part; now judging it is, and most teams haven't staffed the judging.

So the practical review becomes a skim: it renders, approve. That skim is exactly where generated UI fails, because its defects are rarely bugs. They are quality problems — the kind that pass every check and still make the product feel off.

The failure modes are consistent

Review enough vibe-coded UI and the same four defects keep appearing, regardless of which model wrote it:

Accessibility regressions

Generated UI reaches for a div with an onClick before it reaches for a button. Labels, focus management, and keyboard paths are the first things dropped, because they are invisible in a screenshot and the demo mouse never notices.
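This pattern is mechanical enough to flag in a diff. The sketch below is one rough way to do it, assuming JSX source is available as a string. The function name `findClickableNonButtons` is hypothetical, and the scanner is a heuristic rather than a real JSX parser; it is not a description of how any particular tool implements its rules.

```typescript
// Hypothetical helper: flags <div>/<span> opening tags that carry onClick
// without both role and tabIndex. Heuristic only: it tracks {...} depth so
// the ">" in an arrow function does not end the tag early, but it does not
// understand ">" inside quoted attribute values.
function findClickableNonButtons(source: string): string[] {
  const hits: string[] = [];
  const opener = /<(div|span)\b/g;
  let m: RegExpExecArray | null;
  while ((m = opener.exec(source)) !== null) {
    let depth = 0;
    let end = -1;
    for (let i = opener.lastIndex; i < source.length; i++) {
      const ch = source[i];
      if (ch === "{") depth++;
      else if (ch === "}") depth--;
      else if (ch === ">" && depth === 0) {
        end = i;
        break;
      }
    }
    if (end === -1) continue; // unterminated tag, nothing to inspect
    const tag = source.slice(m.index, end + 1);
    const clickable = /\bonClick\s*=/.test(tag);
    const hasSemantics = /\brole\s*=/.test(tag) && /\btabIndex\s*=/.test(tag);
    if (clickable && !hasSemantics) hits.push(tag);
  }
  return hits;
}

// Usage: the div is flagged, the button is not.
const sample = `
  <div onClick={() => save()}>Save</div>
  <button onClick={() => save()}>Save</button>
`;
console.log(findClickableNonButtons(sample));
```

A tag that passes this check can still be inaccessible: a div with a role and tabIndex also needs key handling, which a native button provides for free.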

Design-system drift

The model doesn’t know your tokens exist unless every prompt says so. Hardcoded hex values and one-off spacing land next to a perfectly good theme file. Ten PRs later there are five grays.
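Drift of this kind can be caught by comparing the color literals in a diff against the palette the theme already defines. The sketch below assumes the token palette can be supplied as a set of hex strings. `findOffTokenColors` and `normalizeHex` are hypothetical names, and the regex only covers hex literals, not `rgb()`, `hsl()`, or named colors.

```typescript
// Expand #abc / #abcd shorthand and lowercase, so "#FFF" equals "#ffffff".
function normalizeHex(hex: string): string {
  const digits = hex.slice(1).toLowerCase();
  const full =
    digits.length <= 4 ? digits.split("").map((d) => d + d).join("") : digits;
  return "#" + full;
}

// Hypothetical helper: returns each distinct hex color in `source` that is
// not part of the token palette. Heuristic: any "#" followed by 3, 4, 6, or
// 8 hex digits counts as a color, so an anchor like "#add" would match too.
function findOffTokenColors(
  source: string,
  tokens: ReadonlySet<string>
): string[] {
  const hexLiteral = /#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b/g;
  const allowed = new Set(Array.from(tokens, (t) => normalizeHex(t)));
  const offToken = new Set<string>();
  for (const literal of source.match(hexLiteral) ?? []) {
    const color = normalizeHex(literal);
    if (!allowed.has(color)) offToken.add(color);
  }
  return Array.from(offToken);
}

// Usage: two grays come from the (assumed) theme, one was invented in the diff.
const themeColors = new Set(["#6B7280", "#fff"]);
const css = "color: #6b7280; background: #FFFFFF; border-color: #71717a;";
console.log(findOffTokenColors(css, themeColors)); // ["#71717a"]
```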

Arbitrary spacing and type

Values that are individually plausible and collectively random: 13px here, 15px there, a heading one step off the scale. Each diff looks fine. The accumulated page doesn’t.
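Because the problem is cumulative, it is easier to check against a scale than to spot by eye. The sketch below assumes a spacing and type scale expressed in pixels. The function name `findOffScalePx` and the scale values are illustrative, not taken from any particular design system.

```typescript
// Hypothetical helper: returns the distinct px values in `source` that are
// not on the given scale, sorted ascending. Only literal "Npx" values are
// considered; rem, em, and calc() expressions are ignored.
function findOffScalePx(source: string, scale: readonly number[]): number[] {
  const pxValue = /(\d+(?:\.\d+)?)px\b/g;
  const allowed = new Set(scale);
  const offScale = new Set<number>();
  let m: RegExpExecArray | null;
  while ((m = pxValue.exec(source)) !== null) {
    const value = Number(m[1]);
    if (!allowed.has(value)) offScale.add(value);
  }
  return Array.from(offScale).sort((a, b) => a - b);
}

// Usage with an assumed 4px-based scale (1 and 2 allowed for borders).
const assumedScale = [0, 1, 2, 4, 8, 12, 16, 24, 32];
const styles = "padding: 13px 16px; margin-top: 15px; gap: 8px; font-size: 13px;";
console.log(findOffScalePx(styles, assumedScale)); // [13, 15]
```

Run per diff, each value looks harmless; run across the whole codebase, the same check shows how many distinct off-scale values have accumulated.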

The default look

Trained on the median of the internet, models converge on it: the gradient hero, glow shadows, purple-to-pink accents, filler copy in the empty states. The code works. The product looks like everyone’s.

Why review at human speed doesn't hold

An agent produces a working screen in minutes; reading that screen's diff with real attention takes longer than generating it did. When generation outruns review, review gets rationed — the big PRs get skimmed, the small ones get waved through, and the four failure modes above compound quietly across the codebase. None of them throws an error, so nothing forces a stop.

How teams are adding a review layer

· Rules files and prompts. A CLAUDE.md or Cursor rules file that names your tokens and conventions raises the floor of what gets generated. It shapes the input; it doesn't verify the output, and it can't catch what the model does anyway.
· Human design review on UI PRs. The highest-judgment check, and the one that scales worst: the whole problem is that generation now outpaces the people available to read it.
· Automated design review in the PR loop. A reviewer that reads every UI diff the same way, at generation speed. Rams runs as a GitHub App: 118 rules across 8 categories (accessibility, color, typography, spacing, components, UX, motion, and craft, the last built specifically for the patterns above), posted as inline comments with one-click fixes, in about a minute per PR.

Worth calibrating the bar before you assume your generated code clears it: held to the full ruleset, shadcn/ui — hand-built and heavily reviewed — scored 54/100. The scoring methodology is public.

Find out what your agent shipped.

Free on public repos. A design score and the findings behind it, in about a minute.

