Text prompts are the wrong interface for design generation

Thirteen recent papers from Cornell, MIT, CMU, Adobe, Apple, Microsoft, and Toyota Research Institute all arrive at the same conclusion. Text prompts are too blunt for design generation. The field is moving on.

This is a literature review of what’s replacing them.

The consensus

The papers span graphic design, layout generation, UI design, CAD engineering, and shape variation. Different domains, different teams, different methods. Six findings repeat across nearly all of them.

1. Text alone is insufficient

Strongest agreement across every paper surveyed.

Designers work beyond language. They draw from visual references, spatial relationships, proportional constraints. A text prompt collapses all of that into a sentence.

The CHI 2026 paper “Bridging Gulfs in UI Generation” documents what happens in practice: users face “gulfs of execution and evaluation.” They can’t articulate intent in text. They can’t interpret why the AI made specific choices. After a few attempts at iterative refinement, they abandon the tool.

DesignPref measured agreement among 20 professional designers evaluating the same outputs. Krippendorff’s alpha: 0.25. Even experts can’t agree on what “good” means when the interface is a text box.

PRISM found that vision-language models’ pretrained style knowledge is “too general and misaligned with specific domain data.” The model knows what design looks like in aggregate. It doesn’t know what design looks like in your domain.

2. Something structured must sit between intent and output

Every paper that improves on text-only generation adds the same thing: a structured intermediate representation. The specifics vary. The pattern doesn’t.

  • Bridging Gulfs – Hierarchical semantic layer making intent explicit
  • PRISM – Design knowledge base clustered from real-world designs
  • PLay – Parametric constraints: element categories, counts, spatial guidelines
  • Parametric-ControlNet – Joint embedding of parametric specs, image components, and text

The representation carries the information that text can’t. Spatial relationships. Domain constraints. Compositional structure. The generator receives structured conditions, not prose.
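To make the pattern concrete, here is a minimal sketch of what a structured intermediate representation can look like, in the spirit of PLay's parametric constraints (element categories, counts, spatial guidelines). All class and field names are illustrative assumptions for this sketch, not taken from any paper's actual code.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ElementConstraint:
    category: str    # e.g. "button", "image", "headline"
    count: int       # how many elements of this category
    region: tuple    # (x0, y0, x1, y1) in normalized canvas coordinates

@dataclass
class LayoutCondition:
    canvas: tuple = (1.0, 1.0)  # normalized width/height
    elements: list = field(default_factory=list)

    def add(self, category, count, region):
        # Accumulate constraints; returns self so calls can be chained.
        self.elements.append(ElementConstraint(category, count, region))
        return self

# Spatial relationships and counts that a sentence would collapse:
cond = (LayoutCondition()
        .add("headline", 1, (0.1, 0.05, 0.9, 0.2))
        .add("image", 2, (0.1, 0.25, 0.9, 0.7)))
# The generator receives `cond` as a structured condition, not prose.
```

The point is not this particular schema; it is that every field here (a region, a count, a category) is information a text prompt forces the user to approximate in words.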

3. Diffusion models dominate, but the conditioning mechanism is what matters

Every paper generating visual output uses diffusion models. That’s table stakes. The differentiation is entirely in how they condition the generation.

PLay conditions on parametric layout constraints. DesignDiffusion uses character embeddings with localization loss and DPO fine-tuning. Piece it Together replaces CLIP with IP-Adapter+ space (16×2048 vectors) for part-based visual conditioning. Parametric-ControlNet fuses three encoder types into a single multimodal embedding.

PRISM takes a different approach entirely: RAG-retrieved design knowledge injected into planner prompts. No image conditioning. Domain knowledge as context.

The model architecture is converging. The input architecture is diverging. That’s where the research energy is.

4. Good design is personal, not universal

DesignPref’s core finding: personalized preference models outperform aggregated ones “even when using 20x fewer examples.”

Twenty times fewer training examples. Better results. Because design preferences are individual, not collective.

PRISM corroborates this from a different angle. Within a single style tag like “abstract,” designs vary enormously. A single cluster isn’t granular enough. You need sub-clusters to capture the actual variation in what practitioners consider good work.

The implication is direct: variant generation systems that optimize for a single “best” output are solving the wrong problem. The goal is generating a diverse set covering the space of reasonable interpretations. Then the designer chooses.

5. Sequential iteration is broken; parallel variants fix it

“Bridging Gulfs” identifies a failure mode they call “semantic drift”: successive prompt modifications pull designs away from the original intent. Each edit compounds. After three rounds, the output barely resembles what the designer wanted.

The proposed solution is consistent across the literature. Give designers multiple variants simultaneously. Stop forcing sequential refinement.

PLay generates multiple layout options from the same constraints. PRISM samples variants proportionally to cluster sizes in its knowledge base. Piece it Together generates multiple plausible completions from the same partial inputs.

The model doesn’t iterate. The human selects.

6. Domain-specific priors dramatically outperform general models

PRISM’s learned design knowledge achieves 0.999 style alignment versus 0.847 for the best general baseline. That gap is not incremental.

Piece it Together’s domain-specific IP-Prior models reinterpret arbitrary inputs within that domain’s visual language. Parametric-ControlNet produces functionally valid engineering variants that general text-to-image models cannot.

General models know everything about nothing specific. Domain models know one thing well. For professional design work, the domain model wins every time.

The emerging architecture

The field is converging on a pipeline:

  • Structured conditions – a semantic layer, knowledge base, or parametric constraints – feed
  • a conditioned generator – a diffusion model, not prompted with raw text – which produces
  • a diverse output set – multiple variants, not one “best” result – for
  • human selection – the designer picks, not the model.
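The pipeline above can be sketched in a few lines. This is an illustrative toy, assuming a generator callable that maps (conditions, seed) to an output; none of these function names come from the surveyed papers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Variant:
    payload: str
    seed: int

def generate_variants(conditions: Dict,
                      generator: Callable[[Dict, int], str],
                      n: int = 4) -> List[Variant]:
    # Same structured conditions, different seeds:
    # a diverse set, not one "best" output.
    return [Variant(generator(conditions, seed), seed) for seed in range(n)]

def human_selects(variants: List[Variant], pick: int) -> Variant:
    # The model does not iterate; the person chooses.
    return variants[pick]

# Toy stand-in for a conditioned diffusion model.
toy_generator = lambda cond, seed: f"{cond['style']}-layout-{seed}"
variants = generate_variants({"style": "minimal"}, toy_generator, n=3)
chosen = human_selects(variants, pick=1)
```

The structural point is that iteration happens across variants within one round, not across rounds of prompt edits.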

This mirrors how professional designers actually work. They explore options in parallel. They don’t iterate one image through a text box.

The open gap

No paper has closed the feedback loop. Once a designer picks a variant, that selection should inform the next round of generation. DesignPref’s personalization work points toward a solution — personalized models from few examples — but nobody has built this end-to-end in a design tool yet.

The loop is: generate variants, human selects, selection updates the model’s understanding of this user’s preferences, next generation is better calibrated. Each round gets closer with less effort.
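A minimal sketch of that loop, assuming a toy preference model that counts features of past selections and re-ranks the next round. This is an illustrative stand-in, not DesignPref's actual method.

```python
from collections import Counter

class PreferenceModel:
    """Toy per-user preference model: count features of selected variants."""

    def __init__(self):
        self.feature_counts = Counter()

    def update(self, selected_features):
        # Selection updates the model's picture of this user's taste.
        self.feature_counts.update(selected_features)

    def score(self, features):
        # Score a candidate by overlap with previously selected features.
        return sum(self.feature_counts[f] for f in features)

prefs = PreferenceModel()
# Round 1: three variants, described here by feature sets.
round1 = [{"serif", "muted"}, {"sans", "vivid"}, {"serif", "vivid"}]
prefs.update(round1[0])  # the designer picked variant 0

# Round 2: the same candidates, now ranked toward demonstrated taste.
round2 = sorted(round1, key=prefs.score, reverse=True)
```

Each round of selection sharpens the ranking, which is the "less effort per round" property the loop is after.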

That’s where the research stops. It’s where the product work starts.

An ontology-driven implementation

GenAI Studio is a research project that implements this architecture using OWL ontologies as the structured intermediate representation.

A base ontology defines the abstract vocabulary: DesignChoice (composable design decisions), Section (UI form layout), Constraint (output directives with exclusion groups), and OutputRule (assembly pipeline steps). Each design domain — logos, cinematic scenes, illustrations, music, websites — imports the base and extends it with domain-specific classes.

The logo composer, for example, adds LogoType, Style, ColorScheme, and Personality as subclasses of DesignChoice. The scene composer adds Subject, PhotographyStyle, Background, and Accessory. Different domains, same structural pattern.

This maps directly to the research consensus:

  • Structured conditions, not text. The ontology’s DesignChoice individuals carry promptFragment properties — natural-language fragments that get assembled by OutputRule templates. The designer selects structured options. The system composes the prompt. No one types into a text box.
  • Domain-specific priors. Each composer ontology encodes domain knowledge as named individuals with constrained properties. A PhotographyStyle carries cameraStyle, angle, shotType, and texture. These aren’t free-text fields. They’re enumerated choices derived from the domain.
  • UI generated from the model. Sections declare their widgetType (radio, checkbox, tag-picker), gridColumns, maxSelections, and whether they acceptImages. The form is a projection of the ontology, not a hand-coded layout. Change the ontology, the UI follows.
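The prompt-assembly step can be sketched as follows. The class names mirror the ontology described above (DesignChoice, its promptFragment property, OutputRule-style templates), but the Python is an illustrative reimplementation, not GenAI Studio's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DesignChoice:
    name: str
    prompt_fragment: str  # mirrors the ontology's promptFragment property

def assemble_prompt(template: str, choices: dict) -> str:
    # OutputRule-style assembly: the user selects structured options,
    # the system composes the prompt from their fragments.
    fragments = {slot: c.prompt_fragment for slot, c in choices.items()}
    return template.format(**fragments)

selected = {
    "logo_type": DesignChoice("Wordmark", "a wordmark logo"),
    "style":     DesignChoice("Minimal", "flat, minimal, generous whitespace"),
    "colors":    DesignChoice("Mono",    "monochrome palette"),
}
prompt = assemble_prompt("{logo_type}, {style}, {colors}", selected)
```

No one typed the sentence; it was composed from enumerated selections, which is what keeps the interface structured rather than free-text.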

The ontology also models multi-step pipelines. PipelineTemplate defines canvas configurations with TemplateNode placements and TemplateWire connections between composers. A character sheet generated in one composer flows as a reference image into a scene composer’s segment slot. This is the “part-based concepting” pattern from Piece it Together — partial outputs as structured constraints for downstream generation.
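The wiring pattern can be sketched as a tiny executable graph. The names mirror the ontology (PipelineTemplate, TemplateNode, wires between composers); the execution logic is an assumption of this sketch and deliberately simplified, e.g. it assumes nodes are added in topological order.

```python
class TemplateNode:
    def __init__(self, name, run):
        self.name, self.run = name, run
        self.inputs = {}  # filled by upstream wires before this node runs

class PipelineTemplate:
    def __init__(self):
        self.nodes, self.wires = {}, []

    def add_node(self, name, run):
        self.nodes[name] = TemplateNode(name, run)

    def wire(self, src, dst, slot):
        # Route src's output into dst's named input slot.
        self.wires.append((src, dst, slot))

    def execute(self):
        outputs = {}
        for name, node in self.nodes.items():  # assumes topological insertion order
            outputs[name] = node.run(**node.inputs)
            for src, dst, slot in self.wires:
                if src == name:
                    self.nodes[dst].inputs[slot] = outputs[name]
        return outputs

p = PipelineTemplate()
p.add_node("character", lambda: "character-sheet.png")
p.add_node("scene", lambda reference: f"scene conditioned on {reference}")
p.wire("character", "scene", slot="reference")
result = p.execute()
```

The character sheet is not pasted into a prompt; it arrives downstream as a structured input in a named slot, which is the part-based pattern the text describes.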

The source ontologies are public: studio.aktagon.com/ontology.

Papers surveyed

Graphic and visual design

  • PRISM (2601.11747) – Design knowledge from data for stylistic improvement. Cornell, Adobe Research.
  • DesignDiffusion (2503.01645) – Text-to-design with diffusion models. USTC, Microsoft Research Asia.
  • DesignPref (2511.20513) – Personal preferences in visual design generation. CMU, Apple.
  • Piece it Together (2503.10365) – Part-based concepting with IP-Priors. Tel Aviv University.

Layout

  • PLay (2301.11529) – Parametric layout generation with latent diffusion.
  • LayoutDiffusion (2303.11589) – Discrete diffusion for graphic layout.
  • Constrained Graphic Layout Generation (2108.00871) – Layout via latent optimization.

Shape and object-level variation

  • Localizing Object-level Shape Variations (2303.11306) – Localized variants with text-to-image diffusion.
  • RIVAL (2305.18729) – Real-world image variation by aligning diffusion inversion.

Engineering and CAD

  • Parametric-ControlNet (2412.04707) – Multimodal control for engineering design. MIT, Toyota Research Institute.
  • Hierarchical Neural Coding for CAD (2307.00149) – Controllable CAD generation.
  • Aligning Optimization Trajectories (2305.18470) – Constrained design with diffusion models.

UI design

  • Bridging Gulfs in UI Generation (2601.19171) – Semantic guidance for UI generation. CHI 2026.