We are primarily interested in two complementary research directions within the art domain:
First line of investigation. We study the capability of state-of-the-art generative models to replicate and reinterpret artistic styles from the past. This line of work aims to deepen our understanding of the generative process itself, including how conditioning mechanisms influence stylistic outputs and how biases, both in training data and model architectures, affect the resulting representations. Beyond mere stylistic imitation, we are also interested in evaluating whether these models capture higher-level structural and semantic properties that characterize specific artistic movements.
Second line of investigation. We investigate the extent to which multimodal models can provide original and critically grounded interpretations of artworks, based on what they actually perceive. In this context, the visual component is not optional but foundational: the objective is to assess whether LLM-based systems can interpret visual inputs directly, rather than merely re-elaborating prior textual or art-historical knowledge. While current models are known to perform well when leveraging memorized or inferred literature, our focus is on isolating and evaluating their capacity to derive meaning from visual evidence.
The broader goal is to determine whether such systems can exhibit forms of aesthetic judgment grounded in perception, and, if so, to characterize the criteria underlying these judgments. This requires a systematic analysis of their perceptual mechanisms, including how they attend to compositional structure, stylistic features, and fine-grained visual details, and how these elements are integrated into higher-level interpretative reasoning. Particular attention will be given to disentangling perceptual understanding from prior knowledge, in order to assess whether the models genuinely “see” and interpret artworks, rather than simply recalling or reconstructing culturally encoded descriptions.
AI-Pastiche is a carefully curated dataset comprising 953 AI-generated paintings in well-known artistic styles. It was created from 73 manually crafted text prompts, which were used to test 12 modern image generators; one or more images were generated for each of the selected models. The dataset includes comprehensive metadata describing the details of the generation process.
Prompt: Generate a detailed coastal landscape painting in the Impressionist style movement of the 19th century. The painting depicts a coastal scene with a small boat near the water's edge and distant buildings on the shore. The skyline is dominated by a cloudy sky, suggesting an overcast day. The colors are muted and earthy, conveying a serene yet slightly melancholic atmosphere. The technique used appears to be Impressionism, characterized by short, visible brushstrokes that capture the essence of the scene rather than detailed realism. The focus is on the play of light and color, creating a sense of movement and the transient nature of the landscape.
Metadata Description
Metadata comprise the following columns:
generative model: the model used to generate the image
prompt: the prompt passed as input to the generator
subject: a collection of comma-separated tags describing the content (as described in the prompt)
style: the style to be imitated
period: the period the generated image should belong to
generated image: the name of the generated image in the generated_images dataset
Three additional columns provide human metrics collected through extensive surveys. All values are in the range [0,1].
defects: presence of notable defects and artifacts (0=no evident defects, 1=major problems)
authenticity: perceived authenticity of the sample (see the article for details about this metric)
adherence: adherence of the sample to the prompt request
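As an illustration of how these columns might be used, the following minimal sketch averages the three human metrics per generative model. It assumes the metadata is distributed as a CSV file whose header matches the column names listed above exactly; the actual file format and header spelling may differ. The function name `mean_metrics_by_model` and the rows in `SAMPLE` are hypothetical placeholders, not values from the dataset.

```python
import csv
import io
from collections import defaultdict

# The three survey-based metrics described above, each expected in [0, 1].
METRICS = ("defects", "authenticity", "adherence")


def mean_metrics_by_model(csv_file):
    """Return {model: {metric: mean value}} from a metadata CSV file object.

    Assumption: the header uses the column names listed in the description,
    including "generative model" with a space.
    """
    sums = defaultdict(lambda: dict.fromkeys(METRICS, 0.0))
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        model = row["generative model"]
        counts[model] += 1
        for metric in METRICS:
            value = float(row[metric])
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{metric}={value} is outside [0, 1]")
            sums[model][metric] += value
    return {
        model: {metric: sums[model][metric] / counts[model] for metric in METRICS}
        for model in counts
    }


# Placeholder rows for illustration only; these are NOT real dataset entries.
SAMPLE = """generative model,prompt,subject,style,period,generated image,defects,authenticity,adherence
model_a,placeholder prompt,"boat, sea",Impressionism,19th century,img_001.png,0.25,0.5,0.75
model_a,placeholder prompt,"boat, sea",Impressionism,19th century,img_002.png,0.75,0.5,0.25
model_b,placeholder prompt,portrait,Baroque,17th century,img_003.png,0.0,1.0,1.0
"""

result = mean_metrics_by_model(io.StringIO(SAMPLE))
for model, means in result.items():
    print(model, means)
```

With the real metadata, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file. Note that for defects a lower mean is better, while for authenticity and adherence a higher mean is better.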
Do LLMs perceive Art the same way we do?
Does CLIP perceive art the same way we do? Andrea Asperti, Leonardo Dessì, Maria Chiara Tonetti, Nico Wu. Proceedings of IEEE International Conference on Content-Based Multimedia Indexing (IEEE CBMI 2025), Dublin, Ireland, 22-24 October 2025.
3D UMAP projection of image embeddings extracted with the CLIP ViT-L/14 model from the National Gallery of Art dataset. Each point represents a painting, colored by artistic style. The visualization reveals the semantic organization of the artworks in the latent space, where similar styles tend to cluster together.
abstract: CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it 'see' the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.
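To make the idea of a probing task concrete, here is a minimal sketch of zero-shot label scoring over joint embeddings: an image embedding is compared with the text embeddings of candidate labels by cosine similarity, and the similarities are turned into a probability distribution with a softmax. This is a generic illustration, not the protocol used in the paper. The function names, the `scale` value, and the three-dimensional vectors are hypothetical; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_scores(image_emb, label_embs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label.

    `scale` sharpens the distribution; its value here is an illustrative choice.
    """
    sims = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    peak = max(sims.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(scale * (s - peak)) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy placeholder embeddings, NOT real CLIP outputs.
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "Impressionism": [1.0, 0.0, 0.0],
    "Cubism": [0.0, 1.0, 0.0],
    "Baroque": [0.0, 0.0, 1.0],
}

scores = zero_shot_scores(image_emb, label_embs)
best = max(scores, key=scores.get)
print(best, scores)
```

The predicted label (`best`) can then be compared against a human annotation, such as the style column of a labeled dataset, to measure agreement between the model and human judgment.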