Editable Image Elements for
Controllable Synthesis

1University of California San Diego, 2Adobe Research

We propose editable image elements, which can faithfully reconstruct an input image while enabling various spatial editing operations.

The user simply selects image elements of interest and edits their locations and sizes. Our model automatically decodes the edited elements into a realistic image. The example shows the sequential editing results achieved with our method.


Diffusion models have made significant advances in text-guided synthesis tasks. However, editing user-provided images remains challenging, as the high-dimensional noise input space of diffusion models is not naturally suited for image inversion or spatial editing. In this work, we propose an image representation that facilitates spatial editing of input images using a diffusion model. Concretely, we learn to encode an input into "image elements" that can faithfully reconstruct the input image. These elements can be intuitively edited by a user, and are decoded by a diffusion model into realistic images. We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition.



To encode the image, we extract features from the Segment Anything Model with equally spaced query points and perform simple clustering to obtain a grouping of object parts with comparable sizes, resembling superpixels. Each element is individually encoded with our convolutional encoder and is associated with its centroid and size parameters to form image elements. The user can directly modify the image elements, such as by moving, resizing, or removing them. We pass the modified image elements to our diffusion-based decoder, along with a text description of the overall scene, to synthesize a realistic image that respects the modified elements.
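The grouping step above can be sketched as a simple k-means clustering over per-pixel features augmented with spatial coordinates, which yields compact, superpixel-like regions with comparable sizes. This is a minimal illustration, not the paper's implementation: the feature extraction (e.g., from Segment Anything) is assumed to have already produced a dense feature map, and the function names and signatures here are our own.

```python
import numpy as np

def cluster_features(feats, n_elements=256, n_iters=10, seed=0):
    """Group per-pixel features into roughly equal-sized 'image elements'
    via k-means on (feature, position), resembling a superpixel method.

    feats: (H, W, C) array of per-pixel features (assumed to come from a
    backbone such as Segment Anything; extraction is not shown here).
    Returns a label map (H, W), per-element centroids, and pixel counts.
    """
    H, W, C = feats.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pos = np.stack([ys, xs], axis=-1).astype(np.float32)
    # Concatenate features with coordinates so clusters stay spatially compact.
    X = np.concatenate([feats.reshape(-1, C), pos.reshape(-1, 2)], axis=1)

    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_elements, replace=False)]
    for _ in range(n_iters):
        # Assign each pixel to its nearest cluster center, then re-estimate.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_elements):
            m = labels == k
            if m.any():
                centers[k] = X[m].mean(0)

    flat_pos = pos.reshape(-1, 2)
    centroids = np.array([flat_pos[labels == k].mean(0) if (labels == k).any()
                          else np.zeros(2) for k in range(n_elements)])
    sizes = np.array([(labels == k).sum() for k in range(n_elements)])
    return labels.reshape(H, W), centroids, sizes
```

In the full system, each region would additionally be passed through the convolutional encoder, with the centroid and size parameters attached to form the final image element.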

Spatial Editing

The user can directly edit the image elements with simple selection, dragging, resizing, and deletion operations. The selected and edited elements are highlighted with red and green dots at the centroid of each element.
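The editing operations above amount to simple transformations of each element's centroid and size parameters, with deletion dropping an element entirely. The sketch below illustrates this interface; the `ImageElement` fields and function names are illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ImageElement:
    embedding: tuple   # opaque stand-in for the element's encoded features
    cx: float          # centroid x, normalized to [0, 1]
    cy: float          # centroid y, normalized to [0, 1]
    size: float        # relative area of the element

def move(el, dx, dy):
    """Drag an element by editing its centroid."""
    return replace(el, cx=el.cx + dx, cy=el.cy + dy)

def resize(el, scale):
    """Grow or shrink an element by scaling its size parameter."""
    return replace(el, size=el.size * scale)

def delete(elements, indices):
    """Remove selected elements; the decoder later inpaints the gap."""
    drop = set(indices)
    return [e for i, e in enumerate(elements) if i not in drop]
```

After editing, the modified element list is handed to the diffusion-based decoder, which synthesizes an image consistent with the new layout.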

Object Removal

The user selects elements to delete (shown in blue) and provides a text prompt pertaining to the background. Our diffusion decoder can generate content in the missing region (in black).
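Given the label map that assigns each pixel to an image element, the missing region for removal is simply the union of the deleted elements' pixels. A minimal sketch, assuming a `labels` map like the one produced by the clustering step:

```python
import numpy as np

def removal_mask(labels, deleted_ids):
    """Build the binary inpainting mask for object removal: pixels whose
    element was deleted become 1 (the 'missing region' shown in black),
    which the decoder fills guided by the background text prompt.

    labels: (H, W) integer map assigning each pixel to an image element.
    deleted_ids: iterable of element ids selected for deletion.
    """
    return np.isin(labels, list(deleted_ids)).astype(np.uint8)
```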

Object Variations

Our method supports object variations by deleting some image elements (shown in black in Target Edit), and performing inpainting guided by a text prompt and the remaining image elements, such as the "beak" element in the bird example.


@article{mu2024editable,
         author = {Mu, Jiteng and Gharbi, Michaël and Zhang, Richard and Shechtman, Eli
                          and Vasconcelos, Nuno and Wang, Xiaolong and Park, Taesung},
         title = {Editable Image Elements for Controllable Synthesis},
         journal = {arXiv preprint arXiv:2404.16029},
         year = {2024},
}