Image Reconstruction as a Tool for Feature Analysis

Eduard Allakhverdov§† Dmitrii Tarasov§ Elizaveta Goncharova§ Andrey Kuznetsov§
    § AIRI, MIPT

Abstract

Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations — rather than spatial transformations — control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space.

Key Contributions

🔍 Novel Feature Analysis Method

We introduce a new approach to interpret vision encoder features through direct image reconstruction, providing insights into how these models internally represent visual information.

📊 Model Family Comparison

We reveal that encoders pre-trained on image-based tasks retain significantly more image information compared to those trained only on contrastive learning tasks.

🎨 Feature Space Control

We demonstrate that linear transformations in feature space control color encoding of reconstructed images on three different tasks: colorization, red-blue channel swap, and blue channel suppression.

Method

Feature Reconstruction Framework

Our method enables direct interpretation of vision encoder features through image reconstruction. We train a decoder network that learns to reconstruct original images from their feature representations, providing a quantitative measure of feature informativeness.

[Figure: features_reconstruction]

Figure 1. Our reconstruction framework trains a decoder to restore images from feature representations, enabling direct assessment of feature informativeness.
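As a toy illustration of the idea — assuming nothing about the paper's actual decoder architecture — the sketch below fits a closed-form linear decoder from synthetic features back to synthetic "images"; the resulting reconstruction error plays the role of a feature-informativeness score. All names and dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for images and their encoder features (the paper trains a
# neural decoder on real encoder outputs; a linear decoder illustrates the idea).
n, d_img, d_feat = 512, 64, 32
images = rng.normal(size=(n, d_img))
proj = rng.normal(size=(d_img, d_feat))   # hypothetical "encoder": lossy, d_feat < d_img
features = images @ proj

# Fit a linear decoder by least squares: features -> images.
decoder, *_ = np.linalg.lstsq(features, images, rcond=None)
recon = features @ decoder

# Reconstruction error quantifies how much image information the features retain:
# the lower the error, the more informative the feature representation.
mse = float(np.mean((recon - images) ** 2))
print(f"reconstruction MSE: {mse:.4f}")
```

With a random 32-of-64-dimensional projection, roughly half the image variance is unrecoverable, so the MSE lands near 0.5; a richer encoder would push it lower.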

Comparative Analysis: SigLIP vs SigLIP2

We compare two related model families that differ only in their training objective: SigLIP (trained with contrastive learning) and SigLIP2 (trained on image-based tasks). This controlled comparison reveals how training objectives influence feature representations.

[Figure: reconstruction_metrics]

Figure 2. Reconstruction quality comparison between SigLIP and SigLIP2 across different image resolutions demonstrates that image-based training leads to more informative feature representations.

Feature Space Analysis

Q Matrix: A Tool for Feature Manipulation

We introduce the Q matrix framework that enables controlled manipulation of feature representations. This orthogonal transformation matrix is learned to perform specific image manipulations, revealing how visual attributes are encoded in the feature space.

[Figure: features_reconstruction_manipulation_train_Q]

Figure 3. Q matrix calculation process learns the transformation needed for specific image manipulations.

[Figure: features_reconstruction_manipulation_eval_Q]

Figure 4. Application of Q matrix to feature embeddings enables controlled image manipulation.
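One standard way to obtain such an orthogonal Q is the closed-form orthogonal Procrustes solution via an SVD; the sketch below uses toy data, with `Q_true` a hypothetical hidden rotation standing in for the feature-space effect of an image manipulation. Whether the paper uses this exact closed form or an iterative fit is not specified here.

```python
import numpy as np

rng = np.random.default_rng(1)

d, n = 16, 200
X = rng.normal(size=(d, n))                         # features of source images (toy)
Q_true = np.linalg.qr(rng.normal(size=(d, d)))[0]   # hypothetical hidden rotation
Y = Q_true @ X + 0.01 * rng.normal(size=(d, n))     # features of manipulated images

# Orthogonal Procrustes:  argmin_Q ||Q X - Y||_F  s.t.  Q^T Q = I.
# Solution: SVD of Y X^T, then Q = U V^T.
U, _, Vt = np.linalg.svd(Y @ X.T)
Q = U @ Vt

print("orthogonality error:", np.abs(Q @ Q.T - np.eye(d)).max())
print("recovery error:     ", np.abs(Q - Q_true).max())
```

Because Q is a product of two orthogonal factors, orthogonality holds by construction; the recovery error shrinks as the paired feature set grows.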

Color Manipulation Studies

Using simple linear transformations in feature space, we demonstrate precise control over the color attributes of reconstructed images. These color manipulation studies validate the feature interpretation hypothesis, with the image reconstruction approach providing the supporting evidence.

Image Colorization

The colorization task transforms grayscale images into their color counterparts. This problem has the following properties:

1. Semantic Requirement: Successful colorization necessitates that the feature space geometry encodes real-world knowledge about plausible color distributions for objects and scenes.

2. Non-algorithmic Nature: Colorization cannot be achieved through simple pixel-wise transformations but requires understanding of image semantics.

[Figure: colorization_all_transformations]

Figure 5. Our method enables controlled colorization through feature space manipulation, demonstrating the structured nature of color encoding.

Red-Blue Channel Swap

Properties we would expect from the Red-Blue Channel Swap operator:

  • Orthogonal — the operator should preserve vector norms.
  • Self-inverse — applying the red-blue swap twice yields the identity transformation in image space.

Hence, the eigenvalues of the operator should be close to +1 and -1.

As we show below, these properties are largely preserved even for a linear operator fitted with no explicit constraints on them.

We trained three different operators:

  1. Orthogonal self-adjoint — as a Procrustes solution with an additional projection of the operator onto the space of self-adjoint operators.
  2. Orthogonal — as a Procrustes solution.
  3. Linear — as a regression problem. (Note that this solution cannot be directly used with the reconstructor, as it fails to preserve vector norms. Since the reconstructor was trained exclusively on normalized vectors, we first normalize the resulting outputs before feeding them to the reconstructor.)

As shown in Figure 7, the eigenvalues of all operators cluster along the real axis, indicating they primarily represent either eigenvector preservation (near +1) or inversion (near -1). While small deviations from these ideal values exist — revealing noise in the feature space — these perturbations remain relatively weak. Consequently, the feature space geometry largely preserves the properties expected from the pixel-space channel permutation operator.
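This eigenvalue behaviour is easy to reproduce numerically in a toy setting (not the paper's actual feature space): fit an orthogonal Procrustes operator to noisy observations of a self-inverse "swap" and inspect its spectrum.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n = 8, 500
# A self-inverse "channel swap" in a toy feature space: swap the two halves.
perm = list(range(d // 2, d)) + list(range(d // 2))
P = np.eye(d)[perm]
assert np.allclose(P @ P, np.eye(d))          # self-inverse by construction

X = rng.normal(size=(d, n))
Y = P @ X + 0.05 * rng.normal(size=(d, n))    # noisy observations of the swap

# Orthogonal Procrustes fit: SVD of Y X^T, then Q = U V^T.
U, _, Vt = np.linalg.svd(Y @ X.T)
Q = U @ Vt

eig = np.linalg.eigvals(Q)
# Orthogonality pins |lambda| = 1; self-inverseness pushes lambda toward +/-1.
print("max | |lambda| - 1 |:", np.abs(np.abs(eig) - 1).max())
print("max distance to {+1,-1}:",
      np.minimum(np.abs(eig - 1), np.abs(eig + 1)).max())
```

The noise perturbs the eigenvalues slightly along the unit circle, mirroring the small deviations from the ideal +1/-1 values observed in Figure 7.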

[Figure: rb_swap]

Figure 6. Red-blue channel swap demonstrates precise control over color channels in feature space.

[Figure: color_swap_all_eigen_values]

Figure 7. Eigenvalue analysis reveals that color transformations affect only specific feature dimensions while preserving others.

Blue Channel Suppression

The Blue Channel Suppression operator gradually suppresses the blue channel of an image by multiplying it by a factor less than 1.

Properties we would expect from the Blue Channel Suppression operator:

  • Under repeated application, it asymptotically approaches a projection operator (the blue component decays to zero).

Consequently, its eigenvalues are either 1 or have magnitude strictly less than 1.

We empirically observe these properties.
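In pixel space, the expected spectrum is straightforward to verify; the sketch below uses a hypothetical suppression factor of 0.7 acting on (R, G, B) values.

```python
import numpy as np

alpha = 0.7
# Pixel-space blue-suppression operator acting on (R, G, B) values.
S = np.diag([1.0, 1.0, alpha])

# Eigenvalues: 1 (twice, for the untouched R and G channels) and alpha < 1.
eig = np.linalg.eigvals(S)
print(sorted(eig.real))

# Repeated application converges to a projection that removes blue entirely.
S_many = np.linalg.matrix_power(S, 50)
print(np.round(S_many, 3))
proj = np.diag([1.0, 1.0, 0.0])
print(np.allclose(S_many, proj, atol=1e-6))
```

Since 0.7^50 is on the order of 1e-8, fifty applications are already indistinguishable from the exact projection at the printed precision.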

[Figure: b_suppression_all_transformations]

Figure 8. Selective suppression of the blue channel demonstrates fine-grained control over color attributes.

[Figure: b_suppression_all_eigen_values]

Figure 9. Eigenvalue distribution for blue suppression shows targeted modification of specific feature dimensions.

Conclusion

Our work introduces a novel approach to understanding vision encoder features through image reconstruction. We demonstrate that:

  • Training objectives significantly impact how models internally represent visual information
  • Image-based pre-training leads to more informative feature representations compared to contrastive learning
  • Color information is encoded through orthogonal rotations in feature space
  • Our method provides a general framework for analyzing any vision encoder's feature representations

These findings have important implications for model design and provide new tools for understanding and controlling vision encoder behavior. Our approach opens new avenues for feature analysis and manipulation in vision models.

BibTeX

@misc{allakhverdov2025imagereconstructiontoolfeature,
  title={Image Reconstruction as a Tool for Feature Analysis}, 
  author={Eduard Allakhverdov and Dmitrii Tarasov and Elizaveta Goncharova and Andrey Kuznetsov},
  year={2025},
  eprint={2506.07803},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.07803},
}