Why AI Can't Build Crosswords — And How I Fixed It
How I Ran Into the Problem
It started with a homework assignment. A subject-specific class at college, routine task: make a crossword puzzle on a given topic. I'm a student who routes most things through AI, so naturally I fed it to Claude — Opus 4.5 at the time. I assumed this was trivially easy. A grid. Some words. Intersections. How hard could it be?
Very hard, as it turned out.
The model started reasoning. And kept reasoning. Five minutes of extended thinking, hitting the reasoning token limit, then spilling over into the visible chat — still reasoning. Every few exchanges it would catch itself mid-grid, note a conflict, backtrack, try again. The model knew it was making mistakes. It just couldn't stop making them. When it finally declared the crossword done, the grid was inconsistent: words that were supposed to intersect didn't share correct letters, some words floated disconnected from the rest, some cells were double-occupied.
I tested other models out of curiosity. None reached 100% reliability either — except Grok, which sidestepped the issue entirely by generating Python code to build the grid rather than attempting the layout in its own "head." That was the tell.
Why This Is Actually a Hard Problem for LLMs
The failure mode isn't random. It's structural.
Language models have no spatial working memory. When a human places words into a crossword grid, they maintain a live mental map — a 2D structure they can "see" and navigate. Conflicts are visually obvious at a glance.
For an LLM, the grid doesn't exist as a spatial object. It exists as a serialized string of tokens. To check whether two words conflict, the model has to symbolically simulate the grid — recompute coordinates, reason about character-level overlaps across a token sequence — and do this for every candidate placement, every new word, every backtrack step. This isn't what transformer attention is optimized for. The model is solving a 2D constraint-satisfaction problem in a medium that is fundamentally 1D.
The result is predictable: the longer the chain of placements, the more accumulated context the model has to track, and the more likely it is to silently drop a constraint it already established. It doesn't hallucinate out of ignorance — it hallucinates because spatial consistency requires a kind of working memory that the architecture doesn't natively provide.
This is the same reason models struggle with other implicitly spatial tasks: certain geometry problems, ASCII art generation, path-finding in grids. The reasoning isn't wrong in principle; the representation is wrong for the task.
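To make the representation mismatch concrete, here is a toy illustration (not code from the skill): a grid serialized as a flat string, which is roughly how a model "sees" it. Every adjacency check becomes index arithmetic that must be re-derived from scratch at each step, and one slip silently corrupts the grid:

```python
# A 5x5 grid serialized the way an LLM "sees" it: one flat string.
# Cell (row, col) is just index row * WIDTH + col in a token stream;
# nothing marks which characters are vertically adjacent.
WIDTH = 5
grid = list("." * WIDTH * WIDTH)

def place(word, row, col, down):
    """Write a word into the flat string, checking each cell by arithmetic."""
    for i, ch in enumerate(word):
        r, c = (row + i, col) if down else (row, col + i)
        idx = r * WIDTH + c               # the 2D-to-1D mapping the model must
        if grid[idx] not in (".", ch):    # re-derive on every single check
            raise ValueError(f"conflict at ({r},{c})")
        grid[idx] = ch

place("CLUE", 1, 0, down=False)
try:
    place("CODE", 0, 2, down=True)        # cell (1,2) holds 'U'; CODE needs 'O'
except ValueError as e:
    print(e)                              # prints: conflict at (1,2)
```

A human looking at the rendered grid spots the clash instantly; the serialized representation only surfaces it after explicit per-cell arithmetic.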
The Solution: Don't Make the LLM Do Spatial Work
Grok's instinct was correct: offload the geometry to code. The LLM is excellent at everything around the spatial problem — generating thematic word lists, writing clues, choosing which words to prioritize, structuring output for a human reader. It should not be computing grid coordinates.
After AgentSkills became available in Claude, I built the crossword skill around this principle. The architecture is a strict division of labor:
- LLM responsibilities: topic understanding, word selection (8–12 thematic words), clue writing, language handling, final output formatting.
- Python script responsibilities: all spatial logic — placement validation, intersection detection, conflict checking, grid rendering.
The skill exposes four scripts:
| Script | Role |
|---|---|
| `suggest.py` | Given the current grid and a new word, returns all valid placements ranked by intersection count |
| `validator.py` | Full conflict check on the assembled grid; exits 0 (valid) or 1 (invalid, with an error list) |
| `render.py` | Renders the finished grid in puzzle mode (empty cells) or reveal mode (filled letters) |
| `crossword_core.py` | Shared spatial logic, imported by the other scripts |
The LLM never calculates a single coordinate. The workflow is:

- Generate words and clues.
- Call `suggest.py` for each word and pick the placement with the highest intersection count.
- After all words are placed, call `validator.py`; if it fails, re-run `suggest.py` for the conflicting word and choose a different placement.
- Call `render.py` and print its output verbatim.
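To give a feel for the placement search behind `suggest.py`, here is an illustrative reconstruction, not the actual script: it assumes a dict-based grid and omits the word-boundary and adjacency checks a real validator also needs. Each candidate placement is scored by how many existing letters it crosses:

```python
# Hypothetical sketch of a suggest.py-style placement search.
# Grid: dict mapping (row, col) -> letter. The real crossword_core.py
# may represent the grid differently and checks more invariants.

def score_placement(grid, word, row, col, down):
    """Return the crossing count, or None if the word clashes with the grid."""
    crossings = 0
    for i, ch in enumerate(word):
        cell = (row + i, col) if down else (row, col + i)
        existing = grid.get(cell)
        if existing is None:
            continue
        if existing != ch:
            return None          # letter clash: invalid placement
        crossings += 1
    return crossings

def placements(grid, word):
    """Return (row, col, down, crossings), best-crossing placements first."""
    results = []
    # Anchor the word on every matching letter already in the grid.
    for (r, c), letter in grid.items():
        for i, ch in enumerate(word):
            if ch != letter:
                continue
            for down in (True, False):
                start = (r - i, c) if down else (r, c - i)
                score = score_placement(grid, word, *start, down)
                if score:        # skip clashes and zero-crossing floats
                    results.append((*start, down, score))
    return sorted(results, key=lambda p: -p[3])

grid = {(0, c): ch for c, ch in enumerate("PYTHON")}  # PYTHON across row 0
best = placements(grid, "THEME")
```

Because the search is exhaustive and the scoring is deterministic, the LLM's only decision is which of the returned placements to take.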
The Python layer is deterministic and correct by construction. The LLM layer is creative and context-aware. Neither is doing the other's job.
Results
Crossword generation went from unreliable to consistent. The model no longer reasons for five minutes and produces a broken grid. It reasons for a moment about what words to use, then delegates the geometry entirely.
The skill also handles multilingual input naturally — Russian, English, German, and other scripts — because normalization (uppercase, Cyrillic/Latin homoglyph mapping) is handled in crossword_core.py, not in the prompt.
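For illustration, the normalization step might look like the sketch below; the mapping table and its direction are my assumptions, not the actual `crossword_core.py` code. The point is that a Latin "A" typed inside a Cyrillic word looks identical on screen but fails a naive equality check at an intersection:

```python
# Hypothetical sketch of homoglyph normalization. A handful of Cyrillic
# capitals are visually identical to Latin ones; folding them to a single
# canonical form keeps intersection comparisons honest. The exact table
# in the real script may differ.
HOMOGLYPHS = str.maketrans({
    "А": "A", "В": "B", "Е": "E", "К": "K", "М": "M",
    "Н": "H", "О": "O", "Р": "P", "С": "C", "Т": "T", "Х": "X",
})

def normalize(word: str) -> str:
    """Uppercase, strip spaces and hyphens, fold Cyrillic homoglyphs."""
    cleaned = word.upper().replace(" ", "").replace("-", "")
    return cleaned.translate(HOMOGLYPHS)
```

Keeping this in the Python layer means the prompt never has to enumerate script-specific rules, and every script that touches letters applies the same canonical form.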
The Broader Lesson
When an AI model fails at a task in a reproducible, structured way, the failure usually isn't a prompting problem — it's an architectural mismatch between the task's requirements and what the model's representation can efficiently support. Trying to fix spatial reasoning failures with better prompts is treating a symptom. The actual fix is to not ask the model to do spatial reasoning at all.
The crossword skill is a small example of a general pattern: use LLMs for what they're good at, use deterministic code for what they're not, and build the integration layer between them deliberately.