Neural Embedding for Multi-Turn Image Generation: Object Permanence Through Persistent Scene Representations

1. The Problem

Current text-to-image systems treat each generation as independent. A user says "a golden retriever sitting on a wooden porch." They get a good image. They say "now add a tabby cat next to the dog." They get a new image with a new dog, new porch, new lighting, and a cat. The system generated a fresh sample from P(image | "dog and cat on porch") with no memory that a specific dog on a specific porch already existed.

This makes iterative image creation unusable for storyboarding, product design, game asset pipelines, architectural visualization — any workflow requiring continuity across edits.

The cause is architectural. Diffusion models (Stable Diffusion, DALL-E 2, Midjourney v5) map text embeddings to image space through denoising. The text embedding is the only conditioning signal. On a follow-up instruction, the system either generates from new text alone or uses the previous image as a noisy initialization via img2img. The first option loses everything. The second preserves low-frequency structure but mutates details unpredictably. Neither provides object permanence.

The system lacks a persistent structured representation of the scene it created. Without one, it cannot distinguish between things the user wants to change and things that should stay the same.

2. What Would Object Permanence Require?

Object permanence in multi-turn generation means entities introduced in prior turns retain their identity, appearance, spatial position, and relationships unless explicitly modified.

Let S_t be the scene state at turn t, consisting of entities {e_1, ..., e_n}, each with appearance a_i and position p_i, together with relations R between them. Let c_{t+1} be the user's instruction at turn t+1. The system should produce S_{t+1} such that:

1. Edit compliance: S_{t+1} satisfies c_{t+1}.

2. Identity preservation: For all entities e_i not referenced by c_{t+1}, appearance a_i and position p_i are unchanged.

3. Relational consistency: Relations R update only as necessitated by the edit.

No current system maintains S_t as a structured object. The scene state exists only as pixels or as a latent noise trajectory. Neither decomposes into individually addressable entities.
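The three criteria can be pinned down with a minimal sketch. The structures and the preservation check below are illustrative, not an implementation; field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    appearance: tuple  # a_i
    position: tuple    # p_i

@dataclass
class SceneState:
    entities: dict   # name -> Entity
    relations: dict  # (name_i, name_j) -> relation label

def identity_preserved(s_t, s_next, edited):
    """Criterion 2: every entity not referenced by the edit keeps the
    exact same appearance and position across the turn."""
    return all(
        name in s_next.entities
        and s_next.entities[name].appearance == e.appearance
        and s_next.entities[name].position == e.position
        for name, e in s_t.entities.items()
        if name not in edited
    )
```

Under this formulation, "add a tabby cat next to the dog" yields an `edited` set of {"cat"}, and the dog's entry must pass the equality check untouched.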

3. Pixel Conditioning Is Lossy

The standard approach to multi-turn generation feeds the previous image back as conditioning. Img2img in Stable Diffusion passes the prior output through the VAE encoder and uses it as a starting point for the next denoising pass. This gives the appearance of continuity but fails for a structural reason.

An image is a rendering of a scene. When you encode an image into latent space, you get a compressed perceptual representation — roughly, what this looks like. You do not get which regions belong to which objects, what the depth ordering is, what the lighting model is, or which visual features are identity-critical (the specific markings on a dog) versus incidental (the exact shadow angle).

When the model receives "add a cat," it cannot determine which latent features correspond to the dog and must be frozen versus which correspond to the background and can accommodate a new entity. Its only constraint is perceptual similarity to the input, which it trades off against text-alignment to the new prompt. The model treats "looks like the previous image" and "contains a cat" as competing soft constraints. There is no hard guarantee on identity. The dog drifts because the optimization allows it to.

4. Proposed Architecture: Persistent Neural Scene Graphs

We propose replacing pixel-space conditioning with a persistent latent scene graph that maintains structured state across turns.

4.1 Scene Graph Structure

Each entity is represented by:

- Identity embedding z_i^{id}: A fixed-length vector encoding visual identity (species, markings, color, texture, style). Extracted once at creation and frozen across turns.

- State embedding z_i^{state}: Mutable appearance properties (wet/dry, clean/dirty, happy/sad). Allows appearance changes without losing identity.

- Pose embedding z_i^{pose}: Position, orientation, scale. Mutable across turns.

- Relation embeddings r_{ij}: Pairwise spatial and semantic relations (in-front-of, next-to, on-top-of). Updated as entities are added or moved.

The full scene state is S_t = {(z_i^{id}, z_i^{state}, z_i^{pose})}_{i=1}^{n} ∪ {r_{ij}}.

This three-way split matters. "Make the dog wet" changes state but not identity. "Move the dog to the left" changes pose but not identity or state. "Replace the dog with a cat" changes identity. Without the identity/state distinction, any appearance modification forces a new identity embedding, and the system loses track of which entity is which.

A background representation z^{bg} captures global scene properties: lighting, environment, style, camera. Global edits like "make it sunset" modify z^{bg}, but the effect must propagate to every entity. A dog lit by midday sun looks different from one lit by sunset — the fur color shifts, shadows lengthen and warm, specular highlights move. This means the generator cannot render entities independently of z^{bg}. The conditioning mechanism must allow global lighting to modulate per-entity appearance while the identity embedding stays fixed. The identity embedding encodes what the dog is. The rendering of that identity under different lighting is the generator's job, conditioned jointly on z_i^{id} and z^{bg}.

4.2 Entity Extraction

When the system generates an image from the first prompt, it simultaneously produces entity-level decompositions:

1. A panoptic segmentation model identifies entity masks in the generated output.

2. A DINO-based identity encoder produces z_i^{id} for each masked region.

3. Depth and spatial layout estimation produces z_i^{pose} and r_{ij}.

This converts the pixel-space output into a structured scene graph. The graph becomes the persistent state.
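The three-step pipeline can be sketched as follows. `segment_panoptic`, `encode_identity`, and `estimate_layout` stand in for a panoptic segmentation model, a DINO-style identity encoder, and a depth/layout estimator; they are placeholders for this sketch, not real APIs:

```python
def extract_scene_graph(image, segment_panoptic, encode_identity, estimate_layout):
    """Convert a generated image into a structured scene graph.
    The three model arguments are placeholders: a panoptic segmenter
    returning {entity_id: mask}, an identity encoder producing z_i^id
    from a masked region, and a layout estimator producing poses and
    pairwise relations."""
    masks = segment_panoptic(image)           # step 1: entity masks
    layout = estimate_layout(image, masks)    # step 3: z_i^pose and r_ij
    entities = {}
    for eid, mask in masks.items():
        entities[eid] = {
            "identity": encode_identity(image, mask),  # step 2: z_i^id, frozen hereafter
            "state": None,                             # z_i^state, initialized lazily
            "pose": layout["poses"][eid],
        }
    return {"entities": entities, "relations": layout["relations"]}
```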

Occlusion creates a problem here. If the cat is partially behind the dog, panoptic segmentation returns a partial mask, and the identity encoder computes z_i^{id} from a partial view. When a later edit moves the cat into full view, the embedding lacks information about the previously occluded regions. Two mitigations: the identity encoder can be trained on partial views to produce embeddings that are robust to occlusion (amodal completion in embedding space), or the system can flag partially-visible entities and refine their embeddings when more of the entity becomes visible. Neither is clean. This is a real failure mode that worsens as scenes get more crowded.

4.3 Edit Application

On receiving a new instruction c_{t+1}:

1. An instruction parser maps c_{t+1} to a scene graph edit: which entities to add, remove, or modify, and which fields change.

2. The scene graph updates. New entities get new identity embeddings. Modified entities get updated pose, state, or relation embeddings depending on the edit type. Unmentioned entities are untouched.

3. The updated scene graph conditions a generation pass that renders the new scene.

Identity embeddings of unmodified entities are literally the same vectors. They are not re-inferred, re-encoded, or passed through any stochastic process. This is what makes identity preservation a hard guarantee rather than a soft constraint.
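The update step can be made concrete. The parsed-edit format below is hypothetical (the instruction parser of step 1 is out of scope here); the point of the sketch is that unmentioned entities keep the same embedding objects, with no re-encoding and no stochastic pass:

```python
def apply_edit(scene, edit):
    """Apply a parsed edit to a scene graph. `edit` is a dict like
    {"op": "modify", "target": "dog", "field": "state", "value": ...}
    -- an assumed parsed form, not a real schema. A shallow copy means
    untouched entities are shared, not reconstructed."""
    new_scene = {"entities": dict(scene["entities"]),
                 "relations": dict(scene["relations"])}
    if edit["op"] == "add":
        new_scene["entities"][edit["target"]] = edit["value"]
    elif edit["op"] == "remove":
        del new_scene["entities"][edit["target"]]
    elif edit["op"] == "modify":
        assert edit["field"] != "identity", "identity embeddings are frozen"
        old = new_scene["entities"][edit["target"]]
        new_scene["entities"][edit["target"]] = {**old, edit["field"]: edit["value"]}
    return new_scene
```

After "add a cat", the dog's entry is the same Python object as before the edit, and even a state modification ("make the dog wet") carries the original identity embedding through by reference.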

4.4 Conditioned Generation

The generator takes the full scene graph as conditioning, similar architecturally to GLIGEN (2023) with bounding boxes and SpaText (2023) with spatial text maps, but operating on richer per-entity representations.

Entity identity embeddings are injected via cross-attention at spatial locations determined by pose embeddings. Global features condition on z^{bg}. During denoising, each spatial region of the latent is influenced primarily by its assigned entity's identity embedding, which prevents cross-entity interference.
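A toy numpy illustration of the masking idea: each latent position's cross-attention is restricted to the entity assigned to its region. A hard mask is used for simplicity; in practice the assignment would be soft and learned, and this is not the GLIGEN mechanism itself:

```python
import numpy as np

def masked_cross_attention(latent, entity_embs, region_map):
    """latent: (HW, d) spatial features; entity_embs: (n, d) identity
    embeddings; region_map: (HW,) index of the entity assigned to each
    spatial position. Logits toward other entities are masked to -inf,
    so each region is driven only by its own entity's embedding."""
    logits = latent @ entity_embs.T / np.sqrt(latent.shape[1])    # (HW, n)
    mask = region_map[:, None] == np.arange(entity_embs.shape[0])[None, :]
    logits = np.where(mask, logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))  # softmax
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ entity_embs                                  # (HW, d)
```

With a hard mask and one entity per position, each position's output is exactly its assigned entity's embedding, which is the cross-entity-interference guarantee in its most extreme form.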

5. What Needs to Be True

Identity and state must be separable. The encoder must produce identity embeddings that capture what an entity is (a golden retriever with specific markings) while excluding how it currently looks (wet, in shadow, wearing a collar). DreamBooth and textual inversion suggest that base identity can be captured in a learned embedding. Whether identity and state can be cleanly disentangled in a single extraction pass — especially for ambiguous cases like "the dog is now old" — is untested.

Scene graph edits must be parseable. "Move the cat behind the dog" or "make the sky more orange" must map reliably to structured graph operations. Current LLMs handle simple edits. Ambiguous references in dense scenes ("change the red one") remain hard.

Conditioned generation from scene graphs must produce coherent images. The generator must render a plausible image from entity embeddings plus spatial layout, handling occlusion, lighting interaction, shadows, and reflections. Layout-conditioned generation (GLIGEN, ControlNet) can place objects in specified regions. Producing physically coherent interactions between independently-specified entities is open.

The extract-update-render loop must not accumulate drift. Each turn requires segmenting the output, updating the graph, and regenerating. If each cycle introduces small identity drift, the entity becomes unrecognizable after enough turns. The identity embedding must be a fixed point of the loop. Even 1% drift per turn compounds to unusability over 20 turns.
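The compounding claim is one line of arithmetic (the 99% per-turn retention figure is illustrative, not measured):

```python
# Multiplicative drift: if each extract-update-render cycle preserves
# only 99% of an entity's identity similarity, then after t turns the
# similarity to the original is 0.99**t.
def similarity_after(turns, per_turn_retention=0.99):
    return per_turn_retention ** turns

# ~18% cumulative loss after 20 turns
assert round(similarity_after(20), 3) == 0.818
```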

6. Related Approaches

Textual Inversion / DreamBooth (Gal et al. 2022, Ruiz et al. 2023) learn per-concept embeddings from reference images. They require multiple images and minutes of fine-tuning per concept, making them unusable for real-time multi-turn generation. They do prove that learned embeddings capture visual identity well enough for faithful re-rendering.

IP-Adapter (Ye et al. 2023) injects image prompt embeddings via decoupled cross-attention, using a frozen identity signal during generation. It operates on a single reference image rather than a structured scene graph, and does not decompose scenes into entities.

GLIGEN (Li et al. 2023) adds spatial grounding to diffusion models via bounding boxes and text labels without retraining the base model. Our proposal extends this from text labels to learned identity embeddings, and from single-turn layout specification to persistent multi-turn state.

Neural Radiance Fields maintain persistent 3D scene state by construction. Editing a NeRF-represented scene (Instruct-NeRF2NeRF) preserves unedited regions because the representation is spatially structured. NeRFs require multi-view input and are too slow for interactive generation. But they demonstrate the core claim: persistent structured representations enable persistent identity. The open question is whether this transfers to 2D latent space without explicit 3D reconstruction.

7. Experiments

Identity drift over sequential edits. Generate a scene with 3 objects. Apply 20 edits that each modify only one object per turn. Measure DINO and CLIP-I similarity of unmodified objects across turns against a pixel-conditioning baseline. If scene graph conditioning holds >0.95 similarity at turn 20 where pixel conditioning drops below 0.8, the thesis is confirmed.

Edit precision benchmark. 500 multi-turn editing sequences with ground-truth expectations. Measure edit compliance and preservation separately. Current systems optimize compliance at the expense of preservation. The interesting number is the size of that gap.

Entity count scaling. Test identity preservation as scenes grow from 1 to 20 entities. Pixel conditioning should degrade with entity count because there are more things to accidentally modify. Scene graph conditioning should be invariant for unmodified entities regardless of total count.

Compositional transfer. Generate a dog in one session and a car in another. Compose them into a single scene using stored identity embeddings. This tests whether the representation is compositional or only works for entities extracted from the same generation.

8. Scope

This proposal addresses object permanence within a single editing session. Style consistency across independent scenes, physical plausibility of arbitrary compositions, and frame-level temporal coherence for video are separate problems with different solutions.

9. Conclusion

Multi-turn image generation fails because the persistent state between turns is either nothing or pixels. Pixels discard the entity-level structure needed for selective editing. A neural scene graph — per-entity identity embeddings, spatial layout, relations — converts multi-turn generation from "generate a new image that resembles the old one" into "render an updated scene graph," providing hard identity guarantees for unmodified entities.

The technical requirements are significant. Identity embeddings must be expressive and disentangled. Scene graph conditioning must produce coherent renderings. The extract-update-render loop must converge rather than drift. But better diffusion models and higher-resolution img2img will not solve this. The bottleneck is the representation between turns.

10/23/2024