VISAPP 2024 Abstracts


Area 1 - Image and Video Processing and Analysis

Full Papers
Paper Nr: 65
Title:

Investigating Color Illusions from the Perspective of Computational Color Constancy

Authors:

Oguzhan Ulucan, Diclehan Ulucan and Marc Ebner

Abstract: Color constancy and color illusion perception are two phenomena occurring in the human visual system, which can help us reveal unknown mechanisms of human perception. For decades, computer vision scientists have developed numerous color constancy methods, which estimate the reflectance of the surface by discounting the illuminant. However, color illusions have not been analyzed in detail in the field of computational color constancy, which we find surprising since the relationship they share is significant and may let us design more robust systems. We argue that any model that can reproduce our sensation on color illusions should also be able to provide pixel-wise estimates of the light source. In other words, we suggest that the analysis of color illusions helps us to improve the performance of existing global color constancy methods and enables them to provide pixel-wise estimates for scenes illuminated by multiple light sources. In this study, we share the outcomes of our investigation in which we take several color constancy methods and modify them to reproduce the behavior of the human visual system on color illusions. We also show that parameters extracted purely from illusions are able to improve the performance of color constancy methods. A noteworthy outcome is that our strategy based on the investigation of color illusions outperforms the state-of-the-art methods that are specifically designed to transform global color constancy algorithms into multi-illuminant algorithms.

Paper Nr: 85
Title:

Pair-GAN: A Three-Validated Generative Model from Single Pairs of Biomedical and Ground Truth Images

Authors:

Clara Brémond-Martin, Huaqian Wu, Cédric Clouchoux and Kévin François-Bouaou

Abstract: Generating synthetic pairs of raw and ground truth (GT) images is a strategy to reduce the amount of acquisition and annotation required from biomedical experts. Pair image generation strategies from single-input paired images (SIP) focus on patch-pyramid (PP) or dual-branch generators, but the resulting synthetic images are not natural. With few input images, adversarial auto-encoders (AAE) synthesize more natural raw images. Here we propose Pair-GAN, a combination of a PP containing auto-encoder generators at each level, for biomedical image synthesis based upon a SIP. The PP allows synthesis from a SIP, while the AAE generator renders the image content more natural. We use two biomedical datasets containing raw and GT images for this work. Our architecture is evaluated against seven state-of-the-art methods updated for SIP using qualitative, similitude and segmentation metrics, Kullback-Leibler divergences between synthetic and original feature image representations, computational costs, and statistical analyses. Pair-GAN generates the most qualitative and natural outputs, similar to original pairs with complex shapes not produced by other methods, albeit with increased memory needs. Future work may use this generative procedure for multimodal biomedical dataset synthesis to aid automatic processing such as classification or segmentation with deep learning tools.

Paper Nr: 99
Title:

CaRe-CNN: Cascading Refinement CNN for Myocardial Infarct Segmentation with Microvascular Obstructions

Authors:

Franz Thaler, Matthias F. Gsell, Gernot Plank and Martin Urschler

Abstract: Late gadolinium enhanced (LGE) magnetic resonance (MR) imaging is widely established to assess the viability of myocardial tissue of patients after acute myocardial infarction (MI). We propose the Cascading Refinement CNN (CaRe-CNN), which is a fully 3D, end-to-end trained, 3-stage CNN cascade that exploits the hierarchical structure of such labeled cardiac data. Throughout the three stages of the cascade, the label definition changes and CaRe-CNN learns to gradually refine its intermediate predictions accordingly. Furthermore, to obtain more consistent qualitative predictions, we propose a series of post-processing steps that take anatomical constraints into account. Our CaRe-CNN was submitted to the FIMH 2023 MYOSAIQ challenge, where it ranked second out of 18 participating teams. CaRe-CNN showed great improvements most notably when segmenting the difficult but clinically most relevant myocardial infarct tissue (MIT) as well as microvascular obstructions (MVO). When computing the average scores over all labels, our method obtained the best score in eight out of ten metrics. Thus, accurate cardiac segmentation after acute MI via our CaRe-CNN allows generating patient-specific models of the heart serving as an important step towards personalized medicine.

Paper Nr: 100
Title:

Efficiency Optimization Strategies for Point Transformer Networks

Authors:

Jannis Unkrig and Markus Friedrich

Abstract: The Point Transformer, and especially its successor Point Transformer V2, are among the state-of-the-art architectures for point cloud processing in terms of accuracy. However, like many other point cloud processing architectures, they suffer from the inherently irregular structure of point clouds, which makes efficient processing computationally expensive. Common workarounds include reducing the point cloud density, or cropping out partitions, processing them sequentially, and then stitching them back together. However, those approaches inherently limit the architecture by providing either less detail or less context. This work provides strategies that directly address efficiency bottlenecks in the Point Transformer architecture and therefore allow processing larger point clouds in a single feed-forward operation. Specifically, we propose using uniform point cloud sizes in all stages of the architecture; a k-D tree-based k-nearest neighbor search algorithm that is not only efficient on large point clouds but also generates intermediate results that can be reused for downsampling; and a technique for normalizing local densities that improves overall accuracy. Furthermore, our architecture is simpler to implement and does not require custom CUDA kernels to run efficiently.

Paper Nr: 115
Title:

Simple Base Frame Guided Residual Network for RAW Burst Image Super-Resolution

Authors:

Anderson N. Cotrim, Gerson Barbosa, Cid N. Santos and Helio Pedrini

Abstract: Burst super-resolution or multi-frame super-resolution (MFSR) has gained significant attention in recent years, particularly in the context of mobile photography. With modern handheld devices consistently increasing their processing power and their ability to capture multiple images ever faster, the development of robust MFSR algorithms has become increasingly feasible. Furthermore, in contrast to extensively studied single-image super-resolution (SISR), burst super-resolution mitigates the ill-posed nature of reconstructing high-resolution images from low-resolution ones by merging information from multiple shifted frames. This research introduces a novel and effective deep learning approach, SBFBurst, designed to tackle this challenging problem. Our network takes multiple noisy RAW images as input and generates a denoised, super-resolved RGB image as output. We demonstrate that significant enhancements can be achieved in this problem by incorporating base frame-guided mechanisms through operations such as feature map concatenation and skip connections. Additionally, we highlight the significance of employing mosaicked convolution to improve alignment, thus enhancing the overall network performance in super-resolution tasks. These relatively simple improvements underscore the competitiveness of our proposed method when compared to other state-of-the-art approaches.

Paper Nr: 128
Title:

Multispectral Stereo-Image Fusion for 3D Hyperspectral Scene Reconstruction

Authors:

Eric L. Wisotzky, Jost Triller, Anna Hilsmann and Peter Eisert

Abstract: Spectral imaging enables the analysis of optical material properties that are invisible to the human eye. Different spectral capturing setups, e.g., based on filter-wheel, push-broom, line-scanning, or mosaic cameras, have been introduced in recent years to support a wide range of applications in agriculture, medicine, and industrial surveillance. However, these systems often suffer from different disadvantages, such as a lack of real-time capability, limited spectral coverage, or low spatial resolution. To address these drawbacks, we present a novel approach that combines two calibrated multispectral real-time capable snapshot cameras, covering different spectral ranges, into a stereo system. In this way, a hyperspectral data-cube can be captured continuously. The combined use of different multispectral snapshot cameras enables both 3D reconstruction and spectral analysis. Both captured images are demosaicked while avoiding spatial resolution loss. We fuse the spectral data from one camera into the other to obtain a spatially and spectrally high-resolution video stream. Experiments demonstrate the feasibility of this approach, and the system is investigated with regard to its applicability for surgical assistance monitoring.

Paper Nr: 138
Title:

Pre-Training and Fine-Tuning Attention Based Encoder Decoder Improves Sea Surface Height Multi-Variate Inpainting

Authors:

Théo Archambault, Arthur Filoche, Anastase Charantonis and Dominique Béréziat

Abstract: The ocean is observed through satellites measuring physical data of various natures. Among them, Sea Surface Height (SSH) and Sea Surface Temperature (SST) are physically linked quantities involving different remote sensing technologies and therefore different image inverse problems. In this work, we propose to use an Attention-based Encoder-Decoder to perform the inpainting of the SSH, using the SST as contextual information. We propose to pre-train this neural network on a realistic twin experiment of the observing system and to fine-tune it in an unsupervised manner on real-world observations. We demonstrate the value of this strategy by comparing it to existing methods. Our training methodology achieves state-of-the-art performance, and we report a 25% decrease in error compared to the most widely used interpolation product.

Paper Nr: 140
Title:

Deep Learning-Based Models for Performing Multi-Instance Multi-Label Event Classification in Gameplay Footage

Authors:

Etienne Julia, Marcelo Zanchetta do Nascimento, Matheus P. Faria and Rita S. Julia

Abstract: In dynamic environments such as videos, events are among the key pieces of information for improving the performance of autonomous agents since, broadly speaking, they represent the dynamic changes and interactions that occur in the environment. Video games stand out among the most suitable domains for investigating the effectiveness of machine learning techniques. Among the challenging activities explored in such research is endowing automatic game systems with the ability to identify, in game footage, the events that other players interacting with them provoke in the game environment. Thus, the main contribution of this work is the implementation of deep learning models to perform multi-instance multi-label (MIML) game event classification in gameplay footage, which are composed of: a data generator script to automatically produce multi-labeled frames from game footage (where the labels correspond to game events); a pre-processing method that makes the frames generated by the script suitable for use in the training datasets; a fine-tuned MobileNetV2 to perform feature extraction (trained on the pre-processed frames); an algorithm to produce MIML samples from the pre-processed frames (each sample corresponds to a set of frames named a chunk); and a deep neural network (NN) to classify game events, trained on the chunks. In this investigation, Super Mario Bros is used as a case study.

Paper Nr: 155
Title:

Image Inpainting on the Sketch-Pencil Domain with Vision Transformers

Authors:

Jose F. Campana, Luís L. Decker, Marcos R. Souza, Helena A. Maia and Helio Pedrini

Abstract: Image inpainting aims to realistically fill missing regions in images, which requires both structural and textural understanding. Traditionally, methods in the literature have employed Convolutional Neural Networks (CNN), especially Generative Adversarial Networks (GAN), to restore missing regions in a coherent and reliable manner. However, CNNs’ limited receptive fields can sometimes result in unreliable outcomes due to their inability to capture the broader context of the image. Transformer-based models, on the other hand, can learn long-range dependencies through self-attention mechanisms. In order to generate more consistent results, some approaches have further incorporated auxiliary information to guide the model’s understanding of structural information. In this work, we propose a new method for image inpainting that uses sketch-pencil information to guide the restoration of structural, as well as textural elements. Unlike previous works that employ edges, lines, or segmentation maps, we leverage the sketch-pencil domain and the capabilities of Transformers to learn long-range dependencies to properly match structural and textural information, resulting in more consistent results. Experimental results show the effectiveness of our approach, demonstrating either superior or competitive performance when compared to existing methods, especially in scenarios involving complex images and large missing areas.

Paper Nr: 163
Title:

EBA-PRNetCC: An Efficient Bridge Attention-Integration PoseResNet for Coordinate Classification in 2D Human Pose Estimation

Authors:

Ali Zakir, Sartaj A. Salman, Gibran Benitez-Garcia and Hiroki Takahashi

Abstract: In the current era, 2D Human Pose Estimation (HPE) has emerged as an essential component in advanced Computer Vision tasks, particularly for understanding human behaviors. While challenges such as occlusion and unfavorable lighting conditions persist, the advent of deep learning has significantly strengthened the efficacy of 2D HPE. Yet, traditional 2D heatmap methodologies face quantization errors and demand complex post-processing. Addressing this, we introduce the EBA-PRNetCC model, an innovative coordinate classification approach for 2D HPE, emphasizing improved prediction accuracy and optimized model parameters. Our EBA-PRNetCC model employs a modified ResNet34 framework. A key feature is its head, which includes a dual-layer Multi-Layer Perceptron augmented by the Mish activation function. This design not only improves pose estimation precision but also minimizes model parameters. Integrating the Efficient Bridge Attention Net further enriches feature extraction, granting the model deep contextual insights. By enhancing pixel-level discretization, joint localization accuracy is improved. Comprehensive evaluations on the COCO dataset validate our model’s superior accuracy and computational efficiency compared to prevailing 2D HPE techniques.

Paper Nr: 165
Title:

Training Methods for Regularizing Gradients on Multi-Task Image Restoration Problems

Authors:

Samuel Willingham, Mårten Sjöström and Christine Guillemot

Abstract: Inverse problems refer to the task of reconstructing a clean signal from a degraded observation. In imaging, this pertains to restoration problems like denoising, super-resolution or in-painting. Because inverse problems are often ill-posed, regularization based on prior information is needed. Plug-and-play (PnP) approaches take a general approach to regularization and plug a deep denoiser into an iterative solver for inverse problems. However, considering the inverse problems at hand during training could improve reconstruction performance at test time. Deep equilibrium (DEQ) models allow for the training of multi-task priors on the reconstruction error via an estimate of the iterative method’s fixed point (FP). This paper investigates the intersection of PnP and DEQ models for the training of a regularizing gradient (RG) and derives an upper bound for the reconstruction loss of a gradient-descent (GD) procedure. Based on this upper bound, two procedures for the training of RGs are proposed and compared: one optimizes the upper bound directly, while the other trains a deep equilibrium GD (DEQGD) procedure and uses the bound for regularization. The resulting regularized RG (RERG) produces consistently good reconstructions across different inverse problems, while the other RGs tend to have some inverse problems on which they provide inferior reconstructions.

Paper Nr: 218
Title:

Feature Selection for Unsupervised Anomaly Detection and Localization Using Synthetic Defects

Authors:

Lars Heckler and Rebecca König

Abstract: Expressive features are crucial for unsupervised visual Anomaly Detection and Localization. State-of-the-art methods like PatchCore or SimpleNet heavily exploit such features from pretrained extractor networks and model their distribution or utilize them for training further parts of the model. However, the layers commonly used for feature extraction might not represent the optimal choice for reaching maximum performance. Thus, we present the first application-specific feature selection strategy for the task of unsupervised Anomaly Detection and Localization, which identifies the most suitable layer of a pretrained feature extractor based on the performance on a synthetic validation set. The proposed selection strategy is applicable to any feature extraction-based AD method and may serve as a competitive baseline for future work, outperforming not only single-layer baselines but also features ensembled from the outputs of multiple layers.

Paper Nr: 234
Title:

Robust Denoising and DenseNet Classification Framework for Plant Disease Detection

Authors:

Kevin Zhou and Dimah Dera

Abstract: Plant disease is one of many obstacles encountered in the field of agriculture. Machine learning models have been used to classify and detect diseases among plants by analyzing and extracting features from plant images. However, a common problem for many models is that they are trained on clean laboratory images that do not exemplify real conditions, where noise can be present. In addition, the emergence of adversarial noise that can mislead models into wrong predictions poses a severe challenge to developing models that remain robust in noisy environments. In this paper, we propose an end-to-end robust plant disease detection framework that combines DenseNet-based classification with a robust deep learning denoising model. We evaluate a variety of deep learning denoising models and adopt the Real Image Denoising network (RIDNet). The experiments have shown that the proposed denoising classification framework for plant disease detection is more robust against noisy or corrupted input images compared to a single classification model and can also successfully defend against adversarial noise in images.

Paper Nr: 240
Title:

SIDAR: Synthetic Image Dataset for Alignment & Restoration

Authors:

Monika Kwiatkowski, Simon Matern and Olaf Hellwich

Abstract: In this paper, we present a synthetic dataset generation to create large-scale datasets for various image restoration and registration tasks. Illumination changes, shadows, occlusions, and perspective distortions are added to a given image using a 3D rendering pipeline. Each sequence contains the undistorted image, occlusion masks, and homographies. Although we provide two specific datasets, the data generation itself can be customized and used to generate an arbitrarily large dataset with an arbitrary combination of distortions. The datasets allow end-to-end training of deep learning methods for tasks such as image restoration, background subtraction, image matching, and homography estimation. We evaluate multiple image restoration methods to reconstruct the content from a sequence of distorted images. Additionally, a benchmark is provided that evaluates keypoint detectors and image matching methods. Our evaluations show that even learned image descriptors struggle to identify and match keypoints under varying lighting conditions.

Paper Nr: 250
Title:

Beyond Variational Models and Self-Similarity in Super-Resolution: Unfolding Models and Multi-Head Attention

Authors:

Ivan Pereira-Sánchez, Eloi Sans, Julia Navarro and Joan Duran

Abstract: Classical variational methods for solving image processing problems are more interpretable and flexible than pure deep learning approaches, but their performance is limited by the use of rigid priors. Deep unfolding networks combine the strengths of both by unfolding the steps of the optimization algorithm used to estimate the minimizer of an energy functional into a deep learning framework. In this paper, we propose an unfolding approach to extend a variational model exploiting self-similarity of natural images in the data fidelity term for single-image super-resolution. The proximal, downsampling and upsampling operators are written in terms of a neural network specifically designed for each purpose. Moreover, we include a new multi-head attention module to replace the nonlocal term in the original formulation. A comprehensive evaluation covering a wide range of sampling factors and noise realizations proves the benefits of the proposed unfolding techniques. The model shows to better preserve image geometry while being robust to noise.

Paper Nr: 298
Title:

The Risk of Image Generator-Specific Traces in Synthetic Training Data

Authors:

Georg Wimmer, Dominik Söllinger and Andreas Uhl

Abstract: Deep learning based methods require large amounts of annotated training data. Using synthetic images to train deep learning models is a faster and cheaper alternative to gathering and manually annotating training data. However, synthetic images have been demonstrated to exhibit a unique model-specific fingerprint that is not present in real images. In this work, we investigate the effect of such model-specific traces on the training of CNN-based classifiers. Two different methods are applied to generate synthetic training data, a conditional GAN-based image-to-image translation method (BicycleGAN) and a conditional diffusion model (Palette). Our results show that CNN-based classifiers can easily be fooled by generator-specific traces contained in synthetic images. As we will show, classifiers can learn to discriminate based on the traces left by the generator, instead of class-specific features.

Paper Nr: 317
Title:

Facial Point Graphs for Amyotrophic Lateral Sclerosis Identification

Authors:

Nicolas B. Gomes, Arissa Yoshida, Mateus Roder, Guilherme Camargo de Oliveira and João P. Papa

Abstract: Identifying Amyotrophic Lateral Sclerosis (ALS) in its early stages is essential for establishing the beginning of treatment, enriching the outlook, and enhancing the overall well-being of those affected individuals. However, early diagnosis and detecting the disease’s signs is not straightforward. A simpler and cheaper way arises by analyzing the patient’s facial expressions through computational methods. When a patient with ALS engages in specific actions, e.g., opening their mouth, the movement of specific facial muscles differs from that observed in a healthy individual. This paper proposes Facial Point Graphs to learn information from the geometry of facial images to identify ALS automatically. The experimental outcomes in the Toronto Neuroface dataset show the proposed approach outperformed state-of-the-art results, fostering promising developments in the area.

Paper Nr: 384
Title:

Single-Class Instance Segmentation for Vectorization of Line Drawings

Authors:

Rhythm Vohra, Amanda Dash and Alexandra Branzan Albu

Abstract: Images can be represented and stored either in raster or in vector formats. Raster images are the most ubiquitous and are defined as matrices of pixel intensities/colours, while vector images consist of a finite set of geometric primitives, such as lines, curves, and polygons. Since geometric shapes are expressed via mathematical equations and defined by a limited number of control points, they can be manipulated much more easily than by directly working with pixels; hence, the vector format is much preferred to raster for image editing and understanding purposes. The conversion of a raster image into its vector correspondent is a non-trivial process, called image vectorization. This paper presents a vectorization method for line drawings that is much faster and more accurate than the state-of-the-art. We propose a novel segmentation method that processes the input raster image by labeling each pixel as belonging to a particular stroke instance. Our contributions consist of a segmentation model (called Multi-Focus Attention UNet), as well as a loss function that handles infrequent labels well and yields outputs that accurately capture the human drawing style.

Paper Nr: 387
Title:

Frames Preprocessing Methods for Chromakey Classification in Video

Authors:

Evgeny Bessonnitsyn, Artyom Chebykin, Grigorii Stafeev and Valeria Efimova

Abstract: Currently, video games, movies, commercials, and television shows are ubiquitous in modern society. However, beneath the surface of their visual variety lies sophisticated technology, which can produce impressive effects. One such technology is chromakey, a method that allows the background to be replaced with any other image or video. Recognizing chromakey technology in video plays a key role in finding fake materials. In this paper, we consider approaches based on deep learning models that recognize chromakey in video based on unnatural artifacts that arise during the transition between frames. A video consists of a sequence of frames, and its accuracy can be determined in different ways. If we consider accuracy frame by frame, our method reaches an F1 score of 0.67. If we consider the entire video to be fake when it contains one or more fake segments, the F1 score is 0.76. The proposed methods showed better results on the dataset we collected in comparison with existing methods for chromakey detection.

Paper Nr: 413
Title:

Evaluating Multiple Combinations of Models and Encoders to Segment Clouds in Satellite Images

Authors:

Jocsan L. Ferreira, Leandro P. Silva, Mauricio C. Escarpinati, André R. Backes and João F. Mari

Abstract: This work evaluates deep learning-based methods for cloud segmentation in satellite images. We compared several semantic segmentation architectures using different encoder structures. In this sense, we fine-tuned three architectures (U-Net, LinkNet, and PSPNet) with four pre-trained encoders (ResNet-50, VGG-16, MobileNet V2, and EfficientNet B2). The performance of the models was evaluated using the Cloud-38 dataset. The training process was carried out until the validation loss stabilized, according to the early stopping criterion, providing a comparative analysis of the best models and training strategies for cloud segmentation in satellite images. We evaluated performance using classic evaluation metrics, i.e., pixel accuracy, mean pixel accuracy, mean IoU, and frequency-based IoU. Results demonstrated that the tested models are capable of segmenting clouds with considerable performance, with emphasis on the following values: (i) 96.19% pixel accuracy for LinkNet with the VGG-16 encoder, (ii) 92.58% mean pixel accuracy for U-Net with the MobileNet V2 encoder, (iii) 87.21% mean IoU for U-Net with the VGG-16 encoder, and (iv) 92.89% frequency-based IoU for LinkNet with the VGG-16 encoder. In short, the results of this study provide valuable information for developing satellite image analysis solutions in the context of precision agriculture.

Paper Nr: 448
Title:

FingerSeg: Highly-Efficient Dual-Resolution Architecture for Precise Finger-Level Semantic Segmentation

Authors:

Gibran Benitez-Garcia and Hiroki Takahashi

Abstract: Semantic segmentation at the finger level poses unique challenges, including the limited pixel representation of some classes and the complex interdependency of the hand anatomy. In this paper, we propose FingerSeg, a novel architecture inspired by Deep Dual-Resolution Networks, specifically adapted to address the nuances of finger-level hand semantic segmentation. To this end, we introduce three modules: Enhanced Bilateral Fusion (EBF), which refines low- and high-resolution feature fusion via attention mechanisms; Multi-Attention Module (MAM), designed to augment high-level features with a composite of channel, spatial, orientational, and categorical attention; and Asymmetric Dilated Up-sampling (ADU), which combines standard and asymmetric atrous convolutions to capture rich contextual information for pixel-level classification. To properly evaluate our proposal, we introduce IPN-Finger, a subset of the IPN-Hand dataset, manually annotated pixel-wise for 13 finger-related classes. Our extensive empirical analysis, including evaluations on the synthetic RHD dataset against current state-of-the-art methods, demonstrates that our proposal achieves top results. FingerSeg reaches 73.8 and 71.1 mIoU on the IPN-Finger and RHD datasets, respectively, while maintaining an efficient computational cost of about 7 GFLOPs and 6 million parameters at VGA resolution. The dataset, source code, and a demo of FingerSeg will be available upon the publication of this paper.

Short Papers
Paper Nr: 22
Title:

Learning End-to-End Deep Learning Based Image Signal Processing Pipeline Using a Few-Shot Domain Adaptation

Authors:

Georgy Perevozchikov and Egor Ershov

Abstract: Nowadays, the quality of mobile phone cameras plays one of the most important roles in modern smartphones; as a result, more attention is being paid to the camera Image Signal Processing (ISP) pipeline. The current goal of the scientific community is to develop a neural-based end-to-end pipeline that removes the expensive and exhausting process of classical ISP tuning for each new device. The main drawback of the neural-based approach is the necessity of preparing large-scale datasets each time a new smartphone is designed. In this paper, we address this problem and propose a new method for few-shot domain adaptation of an existing neural ISP to a new domain. We show that 10 labeled images of the target domain are sufficient to achieve state-of-the-art performance on real camera benchmark datasets. We also provide a comparative analysis of our proposed approach with other existing ISP domain adaptation methods and show that our approach achieves better results, with only a marginal 2% drop in performance compared to a baseline trained from scratch on the whole dataset. We believe that this solution will significantly reduce the cost of neural-based ISP production for each new device.

Paper Nr: 35
Title:

Machine Learning in Industrial Quality Control of Glass Bottle Prints

Authors:

Maximilian Bundscherer, Thomas H. Schmitt and Tobias Bocklet

Abstract: In industrial manufacturing of glass bottles, quality control of bottle prints is necessary as numerous factors can negatively affect the printing process. Even minor defects in the bottle prints must be detected despite reflections in the glass or manufacturing-related deviations. In cooperation with our medium-sized industrial partner, two ML-based approaches for quality control of these bottle prints were developed and evaluated, which can also be used in this challenging scenario. Our first approach utilized different filters to suppress reflections (e.g. Sobel or Canny) and image quality metrics for image comparison (e.g. MSE or SSIM) as features for different supervised classification models (e.g. SVM or k-Neighbors), which resulted in an accuracy of 84%. The images were aligned based on the ORB algorithm, which allowed us to estimate the rotations of the prints, which may serve as an indicator for anomalies in the manufacturing process. In our second approach, we fine-tuned different pre-trained CNN models (e.g. ResNet or VGG) for binary classification, which resulted in an accuracy of 87%. Utilizing Grad-CAM on our fine-tuned ResNet-34, we were able to localize and visualize frequently defective bottle print regions. This method allowed us to provide insights that could be used to optimize the actual manufacturing process. This paper also describes our general approach and the challenges we encountered in practice with data collection during ongoing production, unsupervised preselection, and labeling.
Download

Paper Nr: 38
Title:

Generative Texture Super-Resolution via Differential Rendering

Authors:

Milena Bagdasarian, Peter Eisert and Anna Hilsmann

Abstract: Image super-resolution is a well-studied field that aims at generating high-resolution images from low-resolution inputs while preserving fine details and realistic features. Despite significant progress on regular images, inferring high-resolution textures of 3D models poses unique challenges. Due to the non-contiguous arrangement of texture patches, intended for wrapping around 3D meshes, applying conventional image super-resolution techniques to texture maps often results in artifacts and seams at texture discontinuities on the mesh. Additionally, obtaining ground truth data for texture super-resolution becomes highly complex due to the labor-intensive process of hand-crafting ground truth textures for each mesh. We propose a generative deep learning network for texture map super-resolution using a differentiable renderer and calibrated reference images. Combining a super-resolution generative adversarial network (GAN) with differentiable rendering, we guide our network towards learning realistic details and seamless texture map super-resolution without a high-resolution ground truth of the texture. Instead, we use high-resolution reference images. Through the differentiable rendering approach, we include model knowledge such as 3D meshes, projection matrices, and calibrated images to bridge the domain gap between 2D image super-resolution and texture map super-resolution. Our results show textures with fine structures and improved detail, which is especially of interest in virtual and augmented reality environments depicting humans.
Download

Paper Nr: 49
Title:

Iterative Saliency Enhancement over Superpixel Similarity

Authors:

Leonardo M. Joao and Alexandre X. Falcao

Abstract: Salient Object Detection (SOD) has several applications in image analysis. The methods have evolved from image-intrinsic to object-inspired (deep-learning-based) models. However, when a model fails, there is no alternative to enhance its saliency map. We fill this gap by introducing a hybrid approach, the Iterative Saliency Enhancement over Superpixel Similarity (ISESS), that iteratively generates enhanced saliency maps by executing two operations alternately: object-based superpixel segmentation and superpixel-based saliency estimation, a cycle of operations not previously exploited. ISESS estimates seeds for superpixel delineation from a given saliency map and defines superpixel queries in the foreground and background. A new saliency map results from color similarities between queries and superpixels at each iteration. The process repeats and, after a given number of iterations, the generated saliency maps are combined into one by cellular automata. Finally, the resulting map is merged with the initial one by the maximum between their average values per superpixel. We demonstrate that our hybrid model consistently outperforms three state-of-the-art deep-learning-based methods on five image datasets.
Download

Paper Nr: 54
Title:

Estimation of Package-Boundary Confidence for Object Recognition in Rainbow-SKU Depalletizing Automation

Authors:

Kento Sekiya, Taiki Yano, Nobutaka Kimura and Kiyoto Ito

Abstract: We developed a reliable object recognition method for a rainbow-SKU depalletizing robot. Rainbow SKUs include various types of objects such as boxes, bags, and bottles. The objects’ areas need to be estimated in order to automate a depalletizing robot; however, it is difficult to detect the boundaries between adjacent objects. To solve this problem, we focus on the difference in the shape of the boundaries and propose package-boundary confidence, which assesses whether the recognized boundary correctly corresponds to that of an object unit. This method classifies recognition results into four categories on the basis of the objects’ shape and calculates the package-boundary confidence for each category. The results of our experimental evaluation indicate that the proposed method, combined with automatic recovery via slight displacement, can achieve a recognition success rate of 99.0%. This is higher than that of a conventional object recognition method. Furthermore, we verified that the proposed method is applicable to a real-world depalletizing robot by combining package-boundary confidence with automatic recovery.
Download

Paper Nr: 56
Title:

Calibration-Accuracy Measurement in Railway Overlapping Multi-Camera Systems

Authors:

Martí Sánchez, Nerea Aranjuelo, Jon A. Iñiguez de Gordoa, Pablo Alonso, Mikel García, Marcos Nieto and Mikel Labayen

Abstract: This paper presents a method for assessing calibration quality in overlapping multi-camera systems used in railway transportation. We propose a novel approach that considers the extrinsic and intrinsic parameters of the cameras and extracts features from their images, providing relevant patterns regarding the pose of the cameras to detect cameras’ calibration misalignment. Three feature extractors, including traditional image processing techniques and deep learning approaches, are evaluated and compared. The extracted features are used to provide a calibration quality metric, enabling real-time detection of camera calibration degradation. Additionally, we introduce a radial grid design that weights the contribution of pixels based on their distance from the camera’s optical center. The results demonstrate the effectiveness of our method in assessing the calibration degree between camera pairs. The findings highlight the superior performance of the deep learning approaches in analyzing the similarity degree between captured images. Overall, our method lays a solid foundation for the development of an online camera calibration pipeline.
Download

Paper Nr: 69
Title:

Vision-Perceptual Transformer Network for Semantic Scene Understanding

Authors:

Mohamad Alansari, Hamad AlRemeithi, Bilal Hassan, Sara Alansari, Jorge Dias, Majid Khonji, Naoufel Werghi and Sajid Javed

Abstract: Semantic segmentation, essential in computer vision, involves labeling each image pixel with its semantic class. Transformer-based models, recognized for their exceptional performance, have been pivotal in advancing this field. Our contribution, the Vision-Perceptual Transformer Network (VPTN), ingeniously combines transformer encoders with a feature pyramid-based decoder to deliver precise segmentation maps with minimal computational burden. VPTN’s transformative power lies in its integration of the pyramiding technique, enhancing multi-scale variations handling. In direct comparisons with Vision Transformer-based networks and variants, VPTN consistently excels. On average, it achieves 4.2%, 3.41%, and 6.24% higher mean Intersection over Union (mIoU) compared to Dense Prediction (DPT), Data-efficient image Transformer (DeiT), and Swin Transformer networks, while demanding only 15.63%, 3.18%, and 10.05% of their Giga Floating-Point Operations (GFLOPs). Our validation spans five diverse datasets, including Cityscapes, BDD100K, Mapillary Vistas, CamVid, and ADE20K. VPTN secures the position of state-of-the-art (SOTA) on BDD100K and CamVid and consistently outperforms existing deep learning models on other datasets, boasting mIoU scores of 82.6%, 67.29%, 61.2%, 86.3%, and 55.3%, respectively. Impressively, it does so with an average computational complexity just 11.44% of SOTA models. VPTN represents a significant advancement in semantic segmentation, balancing efficiency and performance. It shows promising potential, especially for autonomous driving and natural setting computer vision applications.
Download

Paper Nr: 78
Title:

Data Quality Aware Approaches for Addressing Model Drift of Semantic Segmentation Models

Authors:

Samiha Mirza, Vuong D. Nguyen, Pranav Mantini and Shishir K. Shah

Abstract: In the midst of the rapid integration of artificial intelligence (AI) into real world applications, one pressing challenge we confront is the phenomenon of model drift, wherein the performance of AI models gradually degrades over time, compromising their effectiveness in real-world, dynamic environments. Once identified, we need techniques for handling this drift to preserve the model performance and prevent further degradation. This study investigates two prominent quality aware strategies to combat model drift: data quality assessment and data conditioning based on prior model knowledge. The former leverages image quality assessment metrics to meticulously select high-quality training data, improving the model robustness, while the latter makes use of learned feature vectors from existing models to guide the selection of future data, aligning it with the model’s prior knowledge. Through comprehensive experimentation, this research aims to shed light on the efficacy of these approaches in enhancing the performance and reliability of semantic segmentation models, thereby contributing to the advancement of computer vision capabilities in real-world scenarios.
Download

Paper Nr: 79
Title:

Privacy Preservation in Image Classification Using Seam Doppelganger

Authors:

Nishitha Prakash and James Pope

Abstract: Cloud storage usage continues to increase and many cloud storage sites use advanced machine learning models to classify users’ images for various purposes, possibly malicious in nature. This introduces very serious privacy concerns where users want to store and view their images on the cloud storage but do not want the models to be able to accurately classify their images. This is a difficult problem and there are many proposed solutions including the seam doppelganger algorithm. Seam Doppelganger uses the seam carving content-aware resizing approach to modify the image in a way that is still human-understandable and has been shown to reduce model accuracy. However, the approach was not tested with different classifiers, is not able to provide complete restoration, and uses a limited dataset. We propose several modifications to the Seam Doppelganger algorithm to better enhance the privacy of the image while keeping it human-readable and able to be fully restored. We modify the energy function to use a histogram of gradients, comprehensively compare seam selection, and evaluate with several pre-trained (on ImageNet and Kaggle datasets) image classification models. We use the structural similarity index measure (SSIM) to determine the degree of distortion as a proxy for human understanding. The approach degrades the classification performance by 70% and guarantees 100% restoration of the original image.
Download

Paper Nr: 89
Title:

Automated Generation of Instance Segmentation Labels for Traffic Surveillance Models

Authors:

D. Scholte, T. T. G. Urselmann, M. H. Zwemer, E. Bondarev and P. H. N. de With

Abstract: This paper focuses on instance segmentation and object detection for real-time traffic surveillance applications. Although instance segmentation is currently a hot topic in literature, no suitable dataset for traffic surveillance applications is publicly available and limited work is available with real-time performance. A custom proprietary dataset is available for training, but it contains only bounding-box annotations and lacks segmentation annotations. The paper explores methods for automated generation of instance segmentation labels for custom datasets that can be utilized to fine-tune state-of-the-art segmentation models to specific application domains. Real-time performance is obtained by adopting the recent YOLACT instance segmentation with the YOLOv7 backbone. Nevertheless, it requires modification of the loss function and an implementation of ground-truth matching to overcome handling imperfect instance labels in custom datasets. Experiments show that it is possible to achieve a high instance segmentation performance using a semi-automatically generated dataset, especially when using the Segment Anything Model for generating the labels.
Download

Paper Nr: 101
Title:

SAMMI: Segment Anything Model for Malaria Identification

Authors:

Luca Zedda, Andrea Loddo and Cecilia Di Ruberto

Abstract: Malaria, a life-threatening disease caused by the Plasmodium parasite, is a pressing global health challenge. Timely detection is critical for effective treatment. This paper introduces a novel computer-aided diagnosis system for detecting Plasmodium parasites in blood smear images, aiming to enhance automation and accessibility in comprehensive screening scenarios. Our approach integrates the Segment Anything Model for precise unsupervised parasite detection. It then employs a deep learning framework, combining Convolutional Neural Networks and Vision Transformer to accurately classify malaria-infected cells. We rigorously evaluate our system using the IML public dataset and compare its performance against various off-the-shelf object detectors. The results underscore the efficacy of our method, demonstrating superior accuracy in detecting and classifying malaria-infected cells. This innovative computer-aided diagnosis system presents a reliable and near real-time solution for malaria diagnosis, offering significant potential for widespread implementation in healthcare settings. By automating the diagnosis process and ensuring high accuracy, our system can contribute to timely interventions, thereby advancing the fight against malaria globally.
Download

Paper Nr: 103
Title:

Stereo-Event-Camera-Technique for Insect Monitoring

Authors:

Regina Pohle-Fröhlich, Colin Gebler and Tobias Bolten

Abstract: To investigate the causes of declining insect populations, a monitoring system is needed that automatically records insect activity and additional environmental factors over an extended period of time. For this reason, we use a sensor-based method with two event cameras. In this paper, we describe the system, the view volume that can be recorded with it, and a database used for insect detection. We also present the individual steps of our developed processing pipeline for insect monitoring. For the extraction of insect trajectories, a U-Net based segmentation was tested. For this purpose, the events within a time period of 50 ms were transformed into a frame representation using four different encoding types. The tested histogram encoding achieved the best results with an F1 score for insect segmentation of 0.897 and 0.967 for plant movement and noise parts. The detected trajectories were then transformed into a 4D representation, including depth, and visualized.
Download

Paper Nr: 117
Title:

CAVC: Cosine Attention Video Colorization

Authors:

Leandro Stival, Ricardo S. Torres and Helio Pedrini

Abstract: Video colorization is a challenging task, demanding deep learning models to employ diverse abstractions for a comprehensive grasp of the task, ultimately yielding high-quality results. Currently, in example-based colorization approaches, the use of attention processes and convolutional layers has proven to be the most effective way to produce good results. Following this line, in this paper we propose Cosine Attention Video Colorization (CAVC), an approach that uses a single attention head with shared weights to produce a refinement of the monochromatic frame, as well as the cosine similarity between this sample and the other channels present in the image. This entire process acts as a pre-processing of the data for our autoencoder, which performs a feature fusion with the latent space extracted from the reference frame, as well as with its histogram. This architecture was trained on the DAVIS, UVO and LDV datasets and achieved superior results compared to state-of-the-art models in terms of the FID metric on all datasets.
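As a toy illustration of the similarity computation the abstract refers to (not the authors' attention mechanism), cosine similarity between a monochromatic frame and a reference channel reduces to a normalized dot product over the flattened pixels:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two images, flattened to vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Degenerate all-zero inputs get similarity 0 by convention.
    return float(a @ b / denom) if denom else 0.0
```

A value near 1 indicates that the two channels share the same spatial structure up to a global scale, which is what makes the measure useful for matching a gray frame against colored reference channels.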
Download

Paper Nr: 121
Title:

Efficient Posterior Sampling for Diverse Super-Resolution with Hierarchical VAE Prior

Authors:

Jean Prost, Antoine Houdard, Andrés Almansa and Nicolas Papadakis

Abstract: We investigate the problem of producing diverse solutions to an image super-resolution problem. From a probabilistic perspective, this can be done by sampling from the posterior distribution of an inverse problem, which requires the definition of a prior distribution on the high-resolution images. In this work, we propose to use a pretrained hierarchical variational autoencoder (HVAE) as a prior. We train a lightweight stochastic encoder to encode low-resolution images in the latent space of a pretrained HVAE. At inference, we combine the low-resolution encoder and the pretrained generative model to super-resolve an image. We demonstrate on the task of face super-resolution that our method provides an advantageous trade-off between the computational efficiency of conditional normalizing flows techniques and the sample quality of diffusion based methods.
Download

Paper Nr: 144
Title:

Concept Basis Extraction for Latent Space Interpretation of Image Classifiers

Authors:

Alexandros Doumanoglou, Dimitrios Zarpalas and Kurt Driessens

Abstract: Previous research has shown that, to a large extent, deep feature representations of image patches that belong to the same semantic concept lie in the same direction of an image classifier’s feature space. Conventional approaches compute these directions using annotated data, forming an interpretable feature space basis (also referred to as a concept basis). Unsupervised Interpretable Basis Extraction (UIBE) was recently proposed as a novel method that can suggest an interpretable basis without annotations. In this work, we show that the addition of a classification loss term to the unsupervised basis search can lead to basis suggestions that align even more with interpretable concepts. This loss term enforces the basis vectors to point towards directions that maximally influence the classifier’s predictions, exploiting concept knowledge encoded by the network. We evaluate our work by deriving a concept basis for three popular convolutional networks, trained on three different datasets. Experiments show that our contributions enhance the interpretability of the learned bases, according to the interpretability metrics, by up to +45.8% relative improvement. As an additional practical contribution, we report hyper-parameters, found by hyper-parameter search in controlled benchmarks, that can serve as a starting point for applications of the proposed method in real-world scenarios that lack annotations.
Download

Paper Nr: 158
Title:

Assessing the Performance of Autoencoders for Particle Density Estimation in Acoustofluidic Medium: A Visual Analysis Approach

Authors:

Lucas M. Massa, Tiago F. Vieira, Allan M. Martins and Bruno G. Ferreira

Abstract: Micro-particle density is important for understanding different cell types, their growth stages, and how they respond to external stimuli. In previous work, a Gaussian curve fitting method was used to estimate the size of particles, in order to later calculate their density. This approach required a long processing time, making the development of a Point of Care (PoC) device difficult. The current work proposes the application of a convolutional autoencoder (AE) to estimate single-particle density, aiming to develop a PoC device that overcomes the limitations presented in the previous study. Thus, we used the AE to bottleneck a set of particle images into a single latent variable to evaluate its ability to represent the particle’s diameter. We employed an identical physical apparatus involving a microscope to take pictures of particles in a liquid submitted to ultrasonic waves before the settling process. The AE was initially trained with a set of images for calibration. The acquired parameters were applied to the test set to estimate the velocity at which the particle falls within the ultrasonic chamber. This velocity was later used to infer the particle density. Our results demonstrated that the AE model performed much better, with significantly higher computational speed and comparable error in density estimation.
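The abstract infers density from the settling velocity inside the chamber. One standard route from velocity to density (a hypothetical sketch, not necessarily the authors' exact model) is Stokes' law for a small sphere settling in a viscous fluid, which is directly invertible:

```python
def stokes_velocity(rho_p, rho_f=1000.0, mu=1.0e-3, d=1.0e-5, g=9.81):
    """Terminal settling velocity (m/s) of a sphere in the Stokes regime.

    rho_p: particle density (kg/m^3), rho_f: fluid density,
    mu: dynamic viscosity (Pa*s), d: particle diameter (m).
    Defaults (water at room temperature, 10-micron particle) are illustrative.
    """
    return g * d ** 2 * (rho_p - rho_f) / (18.0 * mu)

def density_from_velocity(v, rho_f=1000.0, mu=1.0e-3, d=1.0e-5, g=9.81):
    """Invert Stokes' law: particle density (kg/m^3) from settling velocity."""
    return rho_f + 18.0 * mu * v / (g * d ** 2)
```

The inversion shows why the measured fall velocity, together with a diameter estimate (here, the AE's latent variable), suffices to recover density.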
Download

Paper Nr: 160
Title:

Image Edge Enhancement for Effective Image Classification

Authors:

Bu Tianhao, Michalis Lazarou and Tania Stathaki

Abstract: Image classification has been a popular task due to its feasibility in real-world applications. Training neural networks on RGB images has demonstrated success at this task. Nevertheless, improving the classification accuracy and computational efficiency of this process continues to present challenges that researchers are actively addressing. A widely embraced method to improve the classification performance of neural networks is to incorporate data augmentations during the training process. Data augmentations are simple transformations that create slightly modified versions of the training data, and can be very effective in training neural networks to mitigate overfitting and improve their accuracy. In this study, we draw inspiration from high-boost image filtering and propose an edge enhancement-based method as a means to enhance both the accuracy and training speed of neural networks. Specifically, our approach involves extracting high-frequency features, such as edges, from images within the available dataset and fusing them with the original images to generate new, enriched images. Our comprehensive experiments, conducted on two distinct datasets (CIFAR10 and CALTECH101) and three different network architectures (ResNet-18, LeNet-5, and CNN-9), demonstrate the effectiveness of our proposed method.
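In the spirit of the high-boost filtering the abstract draws on (a minimal numpy sketch, not the paper's exact pipeline), the high-frequency component is the image minus a low-pass version of itself, scaled and added back:

```python
import numpy as np

def box_blur(img, k=3):
    """k x k mean filter: a simple low-pass."""
    h, w = img.shape
    p = np.pad(img.astype(float), k // 2, mode="edge")
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            out += p[i:i + h, j:j + w]
    return out / (k * k)

def high_boost(img, alpha=1.0):
    """Fuse extracted high-frequency detail (edges) back into the image."""
    detail = img - box_blur(img)           # edges / fine structure
    return np.clip(img + alpha * detail, 0.0, 255.0)
```

The fused output could then stand alongside the original as an enriched training sample, which is the augmentation idea the abstract describes.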
Download

Paper Nr: 168
Title:

Instance Segmentation of Event Camera Streams in Outdoor Monitoring Scenarios

Authors:

Tobias Bolten, Regina Pohle-Fröhlich and Klaus D. Tönnies

Abstract: Event cameras are a new type of image sensor. The pixels of these sensors operate independently and asynchronously from each other. The sensor output is a variable rate data stream that spatio-temporally encodes the detection of brightness changes. This type of output and sensor operating paradigm poses processing challenges for computer vision applications, as frame-based methods are not natively applicable. We provide the first systematic evaluation of different state-of-the-art deep learning based instance segmentation approaches in the context of event-based outdoor surveillance. For processing, we consider transforming the event output stream into representations of different dimensionalities, including point-, voxel-, and frame-based variants. We introduce a new dataset variant that provides annotations at the level of instances per output event, as well as a density-based preprocessing to generate regions of interest (RoI). The achieved instance segmentation results show that the adaptation of existing algorithms for the event-based domain is a promising approach.
Download

Paper Nr: 173
Title:

Large Filter Low-Level Processing by Edge TPU

Authors:

Gerald Krell and Thilo Pionteck

Abstract: Edge TPUs offer high processing power at a low cost and with minimal power consumption. They are particularly suitable for demanding tasks such as classification or segmentation using Deep Learning frameworks, acting as a neural coprocessor in host computers and mobile devices. The question arises as to whether this potential can be utilized beyond the specific domains for which the frameworks are originally designed. One example pertains to addressing various error classes by utilizing a trained deconvolution filter with a large filter size, requiring computation power that can be efficiently accelerated by the powerful matrix multiplication unit of the TPU. However, the application of the TPU is restricted because Edge TPU software is not fully open source, which restricts integration to existing Deep Learning frameworks and the Edge TPU compiler. Nonetheless, we demonstrate a method of estimating and utilizing a convolutional filter of large size on the TPU for this purpose. The deconvolution process is accomplished by utilizing pre-estimated convolutional filters offline to perform low-level preprocessing for various error classes, such as denoising, deblurring, and distortion removal.
Download

Paper Nr: 185
Title:

Comparing 3D Shape and Texture Descriptors Towards Tourette’s Syndrome Prediction Using Pediatric Magnetic Resonance Imaging

Authors:

Murilo Costa de Barros, Kaue N. Duarte, Chia-Jui Hsu, Wang-Tso Lee and Marco A. Garcia de Carvalho

Abstract: Tourette Syndrome (TS) is a neuropsychiatric disorder characterized by the presence of involuntary motor and vocal tics, with its etiology suggesting a strong and complex genetic basis. The detection of TS is mainly performed clinically, but brain imaging provides additional insights about anatomical structures. Interpreting brain patterns is challenging due to the complexity of the texture and shape of the anatomical regions. This study compares three-dimensional texture and shape features using Gray-Level Co-occurrence Matrix and Scale-Invariant Heat Kernel Signature. These features are analyzed in the context of TS classification (via Support Vector Machines), focusing on anatomical regions believed to be associated with TS. The evaluation is performed on structural Magnetic Resonance (MR) images of 68 individuals (34 TS patients and 34 healthy subjects). Results show that shape features achieve 92.6% accuracy in brain regions like the right thalamus and accumbens area, while texture features reach 73.5% accuracy in regions such as right putamen and left thalamus. Majority voting ensembles using shape features obtain 96% accuracy, with texture features achieving 79.4%. These findings highlight the influence of subcortical regions in the limbic system, consistent with existing literature on TS.
Download

Paper Nr: 201
Title:

Feature Selection Using Quantum Inspired Island Model Genetic Algorithm for Wheat Rust Disease Detection and Severity Estimation

Authors:

Sourav Samanta, Sanjay Chatterji and Sanjoy Pratihar

Abstract: In the context of smart agriculture, an early disease detection system is crucial to increase agricultural yield. A disease detection system based on machine learning can be an excellent tool in this regard. Wheat is one of the world’s most important crops. Leaf rust is one of the most significant wheat diseases. In this work, we have proposed a method to detect the leaf rust disease-affected areas in wheat leaves to estimate the severity of the disease. The method works on a reduced Color-GLCM (C-GLCM) feature set. The proposed feature selection method employs Quantum Inspired Island Model Genetic Algorithm to select the most compelling features from the C-GLCM set. The proposed feature selection method outperforms the classical feature selection methods. The healthy and diseased leaves are classified using four classifiers: Decision Tree, KNN, Support Vector Machine, and MLP. The MLP classifier achieved the highest accuracy of 99.20% with the proposed feature selection method. Following the detection of the diseased leaf, the k-means algorithm has been utilized to localize the lesion area. Finally, disease severity scores have been calculated and reported for various sample leaves.
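A gray-level co-occurrence matrix for a single offset, with one Haralick-style feature, can be sketched as follows. This is an illustrative C-GLCM building block only; the per-color-channel setup and the quantum-inspired selection are not shown.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Symmetric, normalized co-occurrence matrix for offset (dx, dy)."""
    img = np.asarray(img, dtype=float)
    top = img.max()
    q = (img / top * (levels - 1)).astype(int) if top > 0 else np.zeros(img.shape, int)
    m = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[q[y, x], q[y + dy, x + dx]] += 1   # count co-occurring gray levels
    m = m + m.T                                   # make symmetric
    return m / m.sum()

def glcm_contrast(m):
    """Haralick contrast: co-occurrences weighted by squared level difference."""
    i, j = np.indices(m.shape)
    return float(np.sum(m * (i - j) ** 2))
```

Features such as contrast (and energy, homogeneity, correlation) computed per color channel form the C-GLCM set from which the genetic algorithm selects.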
Download

Paper Nr: 228
Title:

Investigation of Deep Neural Network Compression Based on Tucker Decomposition for the Classification of Lesions in the Oral Cavity

Authors:

Vitor L. Fernandes, Adriano B. Silva, Danilo C. Pereira, Sérgio V. Cardoso, Paulo R. de Faria, Adriano M. Loyola, Thaína A. Tosta, Leandro A. Neves and Marcelo Z. do Nascimento

Abstract: Cancer of the oral cavity is one of the most common cancers, making it necessary to investigate lesions that could develop into cancer. Initial-stage lesions, called dysplasia, can develop into more severe stages of the disease and are characterized by variations in the shape and size of the nuclei of epithelial tissue cells. Due to advances in the areas of digital image processing and artificial intelligence, computer-aided diagnosis (CAD) systems have become a tool to help reduce the difficulties of analyzing and classifying lesions. This paper presents an investigation of the Tucker decomposition of tensors for different CNN models to classify dysplasia in histological images of the oral cavity. In addition to the Tucker decomposition, this study investigates the normalization of H&E dyes on the optimized CNN models to evaluate the behavior of the architectures in the classification stage of dysplasia lesions. The results show that for some of the optimized models, the use of normalization contributed to the performance of the CNNs for classifying dysplasia lesions. However, when the features obtained from the final layers of the CNNs associated with the machine learning algorithms were analyzed, it was noted that the normalization process affected performance during classification.
Download

Paper Nr: 243
Title:

Efficient and Accurate Hyperspectral Image Demosaicing with Neural Network Architectures

Authors:

Eric L. Wisotzky, Lara Wallburg, Anna Hilsmann, Peter Eisert, Thomas Wittenberg and Stephan Göb

Abstract: Neural network architectures for image demosaicing have become increasingly complex. This results in long training periods for such deep networks, and the networks themselves are huge. These two factors prevent practical implementation and usage of the networks on real-time platforms, which generally only have limited resources. This study investigates the effectiveness of neural network architectures in hyperspectral image demosaicing. We introduce a range of network models and modifications, and compare them with classical interpolation methods and existing reference network approaches. The aim is to identify robust and efficiently performing network architectures. Our evaluation is conducted on two datasets, "SimpleData" and "SimReal-Data," representing different degrees of realism in multispectral filter array (MSFA) data. The results indicate that our networks outperform or match reference models on both datasets, demonstrating exceptional performance. Notably, our approach focuses on achieving correct spectral reconstruction rather than just visual appeal, and this emphasis is supported by quantitative and qualitative assessments. Furthermore, our findings suggest that efficient demosaicing solutions, which require fewer parameters, are essential for practical applications. This research contributes valuable insights into hyperspectral imaging and its potential applications in various fields, including medical imaging.
Download

Paper Nr: 254
Title:

Two Nonlocal Variational Models for Retinex Image Decomposition

Authors:

Frank W. Hammond, Catalina Sbert and Joan Duran

Abstract: Retinex theory assumes that an image can be decomposed into illumination and reflectance components. In this work, we introduce two variational models to solve the ill-posed inverse problem of estimating illumination and reflectance from a given observation. Nonlocal regularization exploiting image self-similarities is used to estimate the reflectance, since it is assumed to contain fine details and texture. The difference between the proposed models comes from the selected prior for the illumination. Specifically, Tychonoff regularization, which promotes smooth solutions, and the total variation, which favours piecewise constant solutions, are independently proposed. A comprehensive theoretical analysis of the resulting functionals is presented within appropriate functional spaces, complemented by an experimental validation for thorough examination.
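In generic notation (a hedged sketch, not the paper's exact functionals), with the usual log-domain decomposition $s = l + r$ into illumination $l$ and reflectance $r$, the two models differ only in the illumination prior:

```latex
\begin{aligned}
E_{1}(l,r) &= \mathrm{TV}_{\mathrm{NL}}(r)
             + \tfrac{\mu}{2}\,\lVert \nabla l \rVert_{2}^{2}
             + \tfrac{\lambda}{2}\,\lVert l + r - s \rVert_{2}^{2}
             && \text{(Tychonoff prior: smooth } l\text{)},\\
E_{2}(l,r) &= \mathrm{TV}_{\mathrm{NL}}(r)
             + \mu\,\mathrm{TV}(l)
             + \tfrac{\lambda}{2}\,\lVert l + r - s \rVert_{2}^{2}
             && \text{(TV prior: piecewise-constant } l\text{)},
\end{aligned}
```

where $\mathrm{TV}_{\mathrm{NL}}$ denotes the nonlocal total variation on the reflectance and $\mu, \lambda > 0$ are weighting parameters.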
Download

Paper Nr: 258
Title:

Avoiding Undesirable Solutions of Deep Blind Image Deconvolution

Authors:

Antonie Brožová and Václav Šmídl

Abstract: Blind image deconvolution (BID) is a severely ill-posed optimization problem requiring additional information, typically in the form of regularization. Deep image prior (DIP) promises to model a naturally looking image due to a well-chosen structure of a neural network. The use of DIP in BID results in a significant performance improvement in terms of average PSNR. In this contribution, we offer a qualitative analysis of selected DIP-based methods w.r.t. two types of undesired solutions: a blurred image (no-blur) and a visually corrupted image (solution with artifacts). We perform a sensitivity study showing which aspects of the DIP-based algorithms help to avoid which undesired mode. We confirm that the no-blur can be avoided using either a sharp image prior or tuning of the hyperparameters of the optimizer. The artifact solution is a harder problem since variations that suppress the artifacts often suppress good solutions as well. Switching the loss from the L2 norm to the structural similarity index measure was found to be the most successful approach to mitigate the artifacts.
Download

Paper Nr: 265
Title:

SWViT-RRDB: Shifted Window Vision Transformer Integrating Residual in Residual Dense Block for Remote Sensing Super-Resolution

Authors:

Mohamed R. Ibrahim, Robert Benavente, Daniel Ponsa and Felipe Lumbreras

Abstract: Remote sensing applications, impacted by acquisition season and sensor variety, require high-resolution images. Transformer-based models improve satellite image super-resolution but are less effective than convolutional neural networks (CNNs) at extracting local details, crucial for image clarity. This paper introduces SWViT-RRDB, a new deep learning model for satellite imagery super-resolution. The SWViT-RRDB, combining transformer with convolution and attention blocks, overcomes the limitations of existing models by better representing small objects in satellite images. In this model, a pipeline of residual fusion group (RFG) blocks is used to combine the multi-headed self-attention (MSA) with residual in residual dense block (RRDB). This combines global and local image data for better super-resolution. Additionally, an overlapping cross-attention block (OCAB) is used to enhance fusion and allow interaction between neighboring pixels to maintain long-range pixel dependencies across the image. The SWViT-RRDB model and its larger variants outperform state-of-the-art (SoTA) models on two different satellite datasets in terms of PSNR and SSIM.
Download

Paper Nr: 289
Title:

An Image Sharpening Technique Based on Dilated Filters and 2D-DWT Image Fusion

Authors:

Victor Bogdan, Cosmin Bonchiş and Ciprian Orhei

Abstract: Image sharpening techniques are pivotal in image processing, serving to accentuate the contrast between darker and lighter regions in images. Building upon prior research that highlights the advantages of dilated kernels in edge detection algorithms, our study introduces a multi-level dilation wavelet scheme. This novel approach to Unsharp Masking involves processing the input image through a low-pass filter with varying dilation factors, followed by wavelet fusion. The visual outcomes of this method demonstrate marked improvements in image quality, notably enhancing details without introducing any undesirable crisping effects. Given the absence of a universally accepted index for optimal image sharpness in current literature, we have employed a range of metrics to evaluate the effectiveness of our proposed technique.
Download

Paper Nr: 300
Title:

Using Extended Light Sources for Relighting from a Small Number of Images

Authors:

Toshiki Hirao, Ryo Kawahara and Takahiro Okabe

Abstract: Relighting real scenes/objects is useful for applications such as augmented reality and mixed reality. In general, relighting of glossy objects requires a large number of images, because specular reflection components are sensitive to light source positions/directions, and linear interpolation with sparse light sources therefore does not work well. In this paper, we make use of not only point light sources but also extended light sources to efficiently capture specular reflection components and achieve relighting from a small number of images. Specifically, we propose a CNN-based method that simultaneously learns, in an end-to-end manner, the illumination module (illumination condition), i.e., the linear combinations of the point light sources and the extended light sources under which a small number of input images are taken, and the reconstruction module, which recovers the images under arbitrary point light sources from the captured images. We conduct a number of experiments using real images captured with a display-camera system, and confirm the effectiveness of our proposed method.
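The linearity of image formation that underlies such illumination modules can be sketched in a few lines: an image under a target light is a weighted sum of the images taken under the basis (point or extended) light sources. This is only a sketch of that principle, not the paper's CNN; names are illustrative:

```python
import numpy as np

def relight(basis_images, weights):
    """Combine basis-light images linearly to simulate a new light.

    basis_images: (n, h, w) stack, one image per basis light source;
    weights: n coefficients describing the target illumination as a
    linear combination of the basis lights.
    """
    basis = np.asarray(basis_images, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w, basis, axes=1)  # weighted sum over sources
```

Specular highlights break this simple interpolation when the basis lights are sparse, which is exactly why the paper adds extended light sources and a learned reconstruction module.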
Download

Paper Nr: 303
Title:

Transformer-Based Video Saliency Prediction with High Temporal Dimension Decoding

Authors:

Morteza Moradi, Simone Palazzo and Concetto Spampinato

Abstract: In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been a hot research topic in video saliency prediction (VSP). With the emergence of spatio-temporal transformers, the weakness of the prior strategies, e.g., 3D convolutional networks and LSTM-based networks, for capturing long-range dependencies has been effectively compensated. While VSP has drawn benefits from spatio-temporal transformers, finding the most effective way for aggregating temporal features is still challenging. To address this concern, we propose a transformer-based video saliency prediction approach with high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between features that are extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and aims at gradually reducing temporal features’ dimensions in the decoder. This decoder-based architecture yields comparable performance to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.
Download

Paper Nr: 312
Title:

FuDensityNet: Fusion-Based Density-Enhanced Network for Occlusion Handling

Authors:

Zainab Ouardirhi, Otmane Amel, Mostapha Zbakh and Sidi A. Mahmoudi

Abstract: Our research introduces an innovative approach for detecting occlusion levels and identifying objects with varying degrees of occlusion. We integrate 2D and 3D data through advanced network architectures, utilizing voxelized density-based occlusion assessment for improved visibility of occluded objects. By combining 2D image and 3D point cloud data through carefully designed network components, our method achieves superior detection accuracy in complex scenarios with occlusions. Experimental evaluation demonstrates adaptability across concatenation techniques, resulting in notable Average Precision (AP) improvements. Despite initial testing on a limited dataset, our method shows competitive performance, suggesting potential for further refinement and scalability. This research significantly contributes to advancements in effective occlusion handling for object detection methodologies.
Download

Paper Nr: 315
Title:

On the Use of Visual Transformer for Image Complexity Assessment

Authors:

Luigi Celona, Gianluigi Ciocca and Raimondo Schettini

Abstract: Perceiving image complexity is a crucial aspect of human visual understanding, yet explicitly assessing image complexity poses challenges. Historically, this aspect has been understudied due to its inherent subjectivity, stemming from its reliance on human perception, and the semantic dependency of image complexity in the face of diverse real-world images. Different computational models for image complexity estimation have been proposed in the literature. These models leverage a variety of techniques ranging from low-level, hand-crafted features, to advanced machine learning algorithms. This paper explores the use of recent deep-learning approaches based on Visual Transformer to extract robust information for image complexity estimation in a transfer learning paradigm. Specifically, we propose to leverage three visual backbones, CLIP, DINO-v2, and ImageNetViT, as feature extractors, coupled with a Support Vector Regressor with Radial Basis Function kernel as an image complexity estimator. We test our approach on two widely used benchmark datasets (i.e. IC9600 and SAVOIAS) in an intra-dataset and inter-dataset workflow. Our experiments demonstrate the effectiveness of the CLIP-based features for accurate image complexity estimation with results comparable to end-to-end solutions.
Download

Paper Nr: 339
Title:

Camera Self-Calibration from Two Views with a Common Direction

Authors:

Yingna Su, Xinnian Guo and Yang Shen

Abstract: Camera calibration is crucial for enabling accurate and robust visual perception. This paper addresses the challenge of recovering intrinsic camera parameters from two views of a planar surface, a problem that has received limited attention due to its inherent degeneracy. For cameras equipped with Inertial Measurement Units (IMUs), such as those in smartphones and drones, the camera’s y-axes can be aligned with the gravity direction, reducing the relative orientation to a single degree of freedom (1-DoF). A key insight is the general orthogonality between the ground plane and the gravity direction. Leveraging this ground plane constraint, the paper introduces new homography-based minimal solutions for camera self-calibration with a known gravity direction. We derive 2.5- and 3.5-point camera self-calibration algorithms for points in the ground plane to enable simultaneous estimation of the camera’s focal length and principal point. The paper demonstrates the practicality and efficiency of these algorithms through comparisons to existing state-of-the-art methods, confirming their reliability under various levels of noise and different camera configurations.
Download

Paper Nr: 340
Title:

Neural Style Transfer for Vector Graphics

Authors:

Ivan Jarsky, Valeria Efimova, Artyom Chebykin, Viacheslav Shalamov and Andrey Filchenkov

Abstract: Neural style transfer draws researchers’ attention, but the interest focuses on bitmap images. Various models have been developed for bitmap image generation, both online and offline, with arbitrary and pre-trained styles. However, style transfer between vector images has hardly been considered. Our research shows that applying standard content and style losses changes the drawing style of a vector image only insignificantly, because the structure of vector primitives differs greatly from that of pixels. To handle this problem, we introduce new loss functions. We also develop a new method based on differentiable rasterization that uses these loss functions and can change the color and shape parameters of the content image corresponding to the drawing of the style image. Qualitative experiments demonstrate the effectiveness of the proposed VectorNST method compared with the state-of-the-art neural style transfer approaches for bitmap images and the only existing approach for stylizing vector images, DiffVG. Although the proposed model does not achieve the quality and smoothness of style transfer between bitmap images, we consider our work an important early step in this area. VectorNST code and demo service are available at https://github.com/IzhanVarsky/VectorNST.
Download

Paper Nr: 353
Title:

Fast and Reliable Inpainting for Real-Time Immersive Video Rendering

Authors:

Jakub Stankowski and Adrian Dziembowski

Abstract: In this paper, the authors describe a fast view inpainting algorithm dedicated to practical, real-time immersive video systems. Inpainting is an inherent step of the entire virtual view rendering process, allowing a high Quality of Experience (QoE) for a user of the immersive video system. The authors propose a novel approach for inpainting, based on dividing the inpainting process into two independent, highly parallelizable stages: view analysis and hole filling. In total, four methods of view analysis and two methods of hole filling were developed, implemented, and evaluated, both in terms of computational time and quality of the virtual view. The proposed technique was compared against an efficient state-of-the-art iterative inpainting technique. The results show that the proposal achieves good objective and subjective quality, requiring less than 2 ms to inpaint a frame of a typical FullHD multiview sequence.
Download

Paper Nr: 359
Title:

ELSA: Expanded Latent Space Autoencoder for Image Feature Extraction and Classification

Authors:

Emerson Vilar de Oliveira, Dunfrey P. Aragão and Luiz G. Gonçalves

Abstract: In the field of computer vision, image classification has been aiding in the understanding and labeling of images. Machine learning and artificial intelligence algorithms, especially artificial neural networks, are widely used tools for this task. In this work, we present the Expanded Latent Space Autoencoder (ELSA). The ELSA network consists of more than one autoencoder in its internal structure, concatenating their latent spaces to construct an expanded latent space. The expanded latent space aims to extract more information from input data. Thus, this expanded latent space can be used by other networks for general tasks such as prediction and classification. To evaluate these capabilities, we created an image classification network for the Fashion-MNIST and MNIST datasets, achieving 99.97 and 99.98 accuracy on the test datasets. The classifier trained with the expanded latent space dataset outperforms some models in public benchmarks.
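The core ELSA idea, concatenating several latent codes into one expanded representation, can be sketched as follows; the encoders here are arbitrary callables standing in for the trained autoencoder halves, and all names are illustrative:

```python
import numpy as np

def expanded_latent(encoders, x):
    """Concatenate the latent codes produced by several encoders.

    Each encoder maps the input to its own latent vector; the
    concatenation forms the expanded latent space that a downstream
    classifier or predictor can consume.
    """
    return np.concatenate([np.asarray(enc(x)) for enc in encoders], axis=-1)
```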
Download

Paper Nr: 366
Title:

On Granularity Variation of Air Quality Index Vizualization from Sentinel-5

Authors:

Jordan S. Cuno, Arthur A. Bezerra, Aura Conci and Luiz G. Gonçalves

Abstract: Air quality has been a hot research topic not only because it is directly related to climate change and the greenhouse effect, but mostly because it has been strongly associated with the transmission of respiratory diseases. Considering that different pollutants affect air quality, a methodology based on satellite data processing is proposed. The objective is to obtain images and measure the main atmospheric pollutants in Brazil. Using satellite systems with spectrometers is an alternative technology that has been recently developed for dealing with such a problem. Sentinel-5 is one of these satellites; it constantly monitors the Earth's surface, generating a vast amount of data mainly for climate monitoring, and it is used in this research. The main contribution of this research is a computational workflow that uses Sentinel-5 data to generate images of Brazil and its states, in addition to calculating the average values of the main atmospheric pollutants; these data can be used to predict pollution and to identify the most polluted regions.
Download

Paper Nr: 369
Title:

Improving Low-Light Image Recognition Performance Based on Image-Adaptive Learnable Module

Authors:

Seitaro Ono, Yuka Ogino, Takahiro Toizumi, Atsushi Ito and Masato Tsukada

Abstract: In recent years, significant progress has been made in image recognition technology based on deep neural networks. However, improving recognition performance under low-light conditions remains a significant challenge. This study addresses the enhancement of recognition model performance in low-light conditions. We propose an image-adaptive learnable module, which applies appropriate image processing to input images, and a hyperparameter predictor to forecast the optimal parameters used in the module. Our proposed approach enhances recognition performance under low-light conditions, since it can easily be integrated as a front-end filter without the need to retrain existing recognition models designed for low-light conditions. Through experiments, our proposed method demonstrates its contribution to enhancing image recognition performance under low-light conditions.
Download

Paper Nr: 374
Title:

Word and Image Embeddings in Pill Recognition

Authors:

Richárd Rádli, Zsolt Vörösházi and László Czúni

Abstract: Pill recognition is a key task in healthcare and has a wide range of applications. In this study, we address the challenge of improving the accuracy of pill recognition in a metrics learning framework. A multi-stream visual feature extraction and processing architecture, with multi-head attention layers, is used to estimate the similarity of pills. We introduce an essential enhancement to the triplet loss function that leverages word embeddings to inject textual pill similarity into the visual model. This improvement refines the visual embedding on a finer scale than conventional triplet loss models, resulting in higher accuracy of the visual model. Experiments and evaluations are made on a new, freely available pill dataset.
Download

Paper Nr: 382
Title:

RecViT: Enhancing Vision Transformer with Top-Down Information Flow

Authors:

Štefan Pócoš, Iveta Bečková and Igor Farkaš

Abstract: We propose and analyse a novel neural network architecture — recurrent vision transformer (RecViT). Building upon the popular vision transformer (ViT), we add a biologically inspired top-down connection, letting the network ‘reconsider’ its initial prediction. Moreover, using a recurrent connection creates space for feeding multiple similar, yet slightly modified or augmented inputs into the network in a single forward pass. As it has been shown that a top-down connection can increase accuracy in the case of convolutional networks, we analyse our architecture, combined with multiple training strategies, in the adversarial examples (AEs) scenario. Our results show that some versions of RecViT indeed exhibit more robust behaviour than the baseline ViT, yielding ≈18 % and ≈22 % absolute improvement in robustness on the tested datasets while the accuracy drop was only ≈1 %. We also leverage the fact that transformer networks have a certain level of inherent explainability. By visualising attention maps of various input images, we gain some insight into the inner workings of our network. Finally, using annotated segmentation masks, we numerically compare the quality of attention maps on original and adversarial images.
Download

Paper Nr: 385
Title:

A Learning Paradigm for Interpretable Gradients

Authors:

Felipe T. Figueroa, Hanwei Zhang, Ronan Sicre, Yannis Avrithis and Stephane Ayache

Abstract: This paper studies the interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from fully connected layers and gradients through variants of backpropagation. However, it is well understood that gradients are noisy, and alternatives like guided backpropagation have been proposed to obtain better visualization at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and quantitatively improves the interpretability properties of different networks, using several interpretability methods.
Download

Paper Nr: 402
Title:

Analysis of Scattering Media by High-Frequency Polarized Light Projection Using Polarizing Projector

Authors:

Aigo Ohno, Fumihiko Sakaue and Jun Sato

Abstract: This paper proposes a special projection method called high-frequency polarized light projection, which uses a polarizing projector to analyze scenes filled with a scattering medium, and proposes a method to separate, in the observed image, light reflected by objects from light scattered by the medium. In high-frequency polarized light projection, a high-frequency pattern is created by light with different polarization directions and projected onto the scattering medium, and the reflected light is observed. The light scattered by the medium and the light reflected from objects have different polarization properties, and we show that these two types of light can be easily separated.
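A heavily simplified version of the two-measurement separation idea: if light scattered by the medium retains the projected polarization while object reflection is depolarized (an assumption made here purely for illustration; the paper's high-frequency projection scheme is more involved), two analyzer orientations suffice:

```python
import numpy as np

def separate_polarized(i_par, i_perp):
    """Split two analyzer images into reflected and scattered parts.

    Assumes the depolarized (object-reflected) component splits
    evenly between the parallel and perpendicular images, while the
    polarized (medium-scattered) component appears only in one.
    """
    i_par = np.asarray(i_par, dtype=float)
    i_perp = np.asarray(i_perp, dtype=float)
    reflected = 2.0 * np.minimum(i_par, i_perp)  # depolarized part
    scattered = (i_par + i_perp) - reflected     # polarized residue
    return reflected, scattered
```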
Download

Paper Nr: 403
Title:

Probabilistic NeRF for 3D Shape Recovery in Scattered Medium

Authors:

Yoshiki Ono, Fumihiko Sakaue and Jun Sato

Abstract: This research proposes a method for analyzing scene information, including the characteristics of the medium, by representing a space containing objects and scattering media such as fog and smoke using the NeRF (Neural Radiance Fields) (Mildenhall et al., 2020) light ray field representation. In this study, we focus on the fact that the behavior of rays inside a scattering medium can be expressed probabilistically, and we present a method for rendering a probabilistically varying image from a single ray rather than the entire scattering process. By combining this method with a scene representation based on stochastic gradient descent and a neural network, we show that it is possible to analyze scene information without generating images that directly render light scattering.
Download

Paper Nr: 411
Title:

Dense Light Field Imaging with Mixed Focus Camera

Authors:

Masato Hirose, Fumihiko Sakaue and Jun Sato

Abstract: In this study, we propose a method for acquiring a dense light field in a single shot by taking advantage of the sparsity of the 4D light field (LF). Acquiring the LF with one camera is a challenging task due to the amount of data. Various methods exist to acquire the LF efficiently, such as using micro-lens arrays. However, these methods capture images with a single image sensor, which improves directional resolution but reduces positional resolution. In our method, the focal length of the lens is varied, and the exposure is controlled on a pixel-by-pixel level when capturing a single image to obtain a mixed focus image, in which each pixel is captured at a different focal length. Furthermore, by analyzing the captured image with an image generator that does not require prior learning, we show how to recover an LF image that is denser than the captured image. With our method, a high-density LF consisting of 5x5 images can be successfully reconstructed from only a single mixed-focus image taken in a simulated environment.
Download

Paper Nr: 416
Title:

Optimization and Learning Rate Influence on Breast Cancer Image Classification

Authors:

Gleidson G. Barbosa, Larissa R. Moreira, Pedro Moises de Sousa, Rodrigo Moreira and André R. Backes

Abstract: Breast cancer is a prevalent and challenging pathology, with significant mortality rates, affecting both women and men. Despite advancements in technology, such as Computer-Aided Diagnosis (CAD) and awareness campaigns, timely and accurate diagnosis remains a crucial issue. This study investigates the performance of Convolutional Neural Networks (CNNs) in predicting and supporting breast cancer diagnosis, considering the BreakHis and Biglycan datasets. Through a partial factorial method, we measured the impact of the optimization and learning rate factors on prediction model accuracy. By measuring each factor’s level of influence on the validation accuracy response variable, this paper brings valuable insights into relevance analyses and CNN behavior. Furthermore, the study sheds light on the explainability of Artificial Intelligence (AI) through a partial factorial performance evaluation design. Among the results, we determine which hyperparameters influenced model performance and to what extent. The findings contribute to the image-based medical diagnosis field, fostering the integration of computational and machine learning approaches to enhance breast cancer diagnosis and treatment.
Download

Paper Nr: 426
Title:

Multimodal Crowd Counting with Pix2Pix GANs

Authors:

Muhammad Asif Khan, Hamid Menouar and Ridha Hamila

Abstract: Most state-of-the-art crowd counting methods use color (RGB) images to learn the density map of the crowd. However, these methods often struggle to achieve higher accuracy in densely crowded scenes with poor illumination. Recently, some studies have reported improvement in the accuracy of crowd counting models using a combination of RGB and thermal images. Although multimodal data can lead to better predictions, multimodal data might not always be available beforehand. In this paper, we propose the use of generative adversarial networks (GANs) to automatically generate thermal infrared (TIR) images from color (RGB) images and use both to train crowd counting models to achieve higher accuracy. We first use a Pix2Pix GAN network to translate RGB images to TIR images. Our experiments on several state-of-the-art crowd counting models and benchmark crowd datasets report significant improvement in accuracy.
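The pipeline reduces to two steps: translate RGB to a synthetic TIR channel, then stack both as the counting model's input. A minimal sketch, with any callable standing in for the trained Pix2Pix generator (names are illustrative):

```python
import numpy as np

def fuse_rgb_tir(rgb, tir_generator):
    """Build a 4-channel RGB+TIR input for a crowd counting model.

    rgb: (h, w, 3) array; tir_generator: a callable returning an
    (h, w) synthetic thermal map, standing in for the Pix2Pix
    RGB-to-TIR translator.
    """
    tir = np.asarray(tir_generator(rgb), dtype=float)
    return np.concatenate([rgb, tir[..., None]], axis=-1)
```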
Download

Paper Nr: 36
Title:

Towards Better Morphed Face Images Without Ghosting Artifacts

Authors:

Clemens Seibold, Anna Hilsmann and Peter Eisert

Abstract: Automatic generation of morphed face images often produces ghosting artifacts due to poorly aligned structures in the input images. Manual processing can mitigate these artifacts. However, this is not feasible for the generation of large datasets, which are required for training and evaluating robust morphing attack detectors. In this paper, we propose a method for automatic prevention of ghosting artifacts based on a pixel-wise alignment during morph generation. We evaluate our proposed method on state-of-the-art detectors and show that our morphs are harder to detect, particularly, when combined with style-transfer-based improvement of low-level image characteristics. Furthermore, we show that our approach does not impair the biometric quality, which is essential for high quality morphs.
Download

Paper Nr: 47
Title:

Teeth Localization and Lesion Segmentation in CBCT Images Using SpatialConfiguration-Net and U-Net

Authors:

Arnela Hadzic, Barbara Kirnbauer, Darko Štern and Martin Urschler

Abstract: The localization of teeth and segmentation of periapical lesions in cone-beam computed tomography (CBCT) images are crucial tasks for clinical diagnosis and treatment planning, which are often time-consuming and require a high level of expertise. However, automating these tasks is challenging due to variations in shape, size, and orientation of lesions, as well as similar topologies among teeth. Moreover, the small volumes occupied by lesions in CBCT images pose a class imbalance problem that needs to be addressed. In this study, we propose a deep learning-based method utilizing two convolutional neural networks: the SpatialConfiguration-Net (SCN) and a modified version of the U-Net. The SCN accurately predicts the coordinates of all teeth present in an image, enabling precise cropping of teeth volumes that are then fed into the U-Net which detects lesions via segmentation. To address class imbalance, we compare the performance of three reweighting loss functions. After evaluation on 144 CBCT images, our method achieves a 97.3% accuracy for teeth localization, along with a promising sensitivity and specificity of 0.97 and 0.88, respectively, for subsequent lesion detection.
Download

Paper Nr: 96
Title:

Estimation of the Inference Quality of Machine Learning Models for Cutting Tools Inspection

Authors:

Kacper Marciniak, Paweł Majewski and Jacek Reiner

Abstract: The ongoing trend in industry to continuously improve the efficiency of production processes is driving the development of vision-based inspection and measurement systems. With recent significant advances in artificial intelligence, machine learning methods are increasingly being applied to these systems. Strict requirements are placed on measurement and control systems regarding accuracy, repeatability, and robustness against variation in working conditions. Machine learning solutions are often unable to meet these requirements, as they are highly sensitive to input data variability. Given these difficulties, an original method for estimating inference quality is proposed. It is based on feature space analysis and an assessment of the degree of dissimilarity between the input data and the training set, described using explicit metrics proposed by the authors. The developed solution has been integrated with an existing system for measuring geometric parameters and determining cutting tool wear, allowing continuous monitoring of the quality of the obtained results and enabling the system operator to take appropriate action in case of a drop below the adopted threshold values.
Download

Paper Nr: 126
Title:

Character Identification in Images Extracted from Portuguese Manuscript Historical Documents

Authors:

Gustavo C. Lacerda and Raimundo S. Vasconcelos

Abstract: The creation of writing has facilitated humanity's accumulation and sharing of knowledge; it is a vital part of what differentiates humans from other animals and is highly important to the culture of all peoples. The first human records (manuscripts) and the historical documents of organizations and families have thus gained new perspectives with the digital age. These handwritten records remained the primary source for the history of countries, including Brazil before the period of independence, until the Gutenberg movable-type printing press dominated the archival world. Over the decades, these handwritten documents, due to their fragility, became difficult to access and manipulate. This has changed with the possibility of digitization and, consequently, distribution over the internet. Therefore, this work presents a solution for transcribing historical texts written in Portuguese, bringing accessibility, searchability, sharing, and preservation to these records, and achieving recognition of 97% of the letters in the database used.
Download

Paper Nr: 127
Title:

Identifying Representative Images for Events Description Using Machine Learning

Authors:

Marcos V. Soares de Sousa and Raimundo S. Vasconcelos

Abstract: The use of social networks to record events – disasters, demonstrations, parties – has grown considerably and has begun to receive attention in recent years. Existing research focuses primarily on analyzing text-based messages from social media platforms such as Twitter. Images, photos and other media are increasingly used and can provide valuable information to enhance the understanding of an event, and they can serve as indicators of relevance. This work explores the Twitter social media platform, based on image and text, in the case of the demonstrations that took place in Brazil on September 7, 2021, during the Independence celebrations. This work uses machine learning techniques (VGG-16, VGG-19, ResNet50v2 and InceptionResNetv2) to find relevant Twitter images. The results show that the existence of an image within a social media message can serve as a high-probability indicator of relevant content. An extensive experimental evaluation was carried out and demonstrated that high efficiency gains can be obtained compared to state-of-the-art methods.
Download

Paper Nr: 146
Title:

A Comparative Analysis of the Three-Alternative Forced Choice Method and the Slider-Based Method in Subjective Experiments: A Case Study on Contrast Preference Task

Authors:

Olga Cherepkova, Seyed Ali Amirshahi and Marius Pedersen

Abstract: When it comes to collecting subjective data in the field of image quality assessment, different approaches have been proposed. Most datasets in the field ask observers to evaluate the quality of different test and reference images. However, a number of datasets ask observers to change one or more properties of the image to enhance it to its best possible quality. Among the methods used in the second approach are the Three-Alternative Forced Choice (3AFC) and slider-based methods. In this paper, we study and compare the two mentioned methods in the case of collecting contrast preferences for natural images. Fifteen observers participated in two experiments under controlled settings, incorporating 499 unique and 100 repeated images. The reliability of the answers and the differences between the two methods were analyzed. The results revealed a general lack of correlation in contrast preferences between the two methods. The slider-based method generally yielded lower contrast preference values than the 3AFC experiment. In the case of repeated images, the slider-based method showed greater consistency in the subjective scores given by each observer. These results suggest that neither method can serve as a direct substitute for the other, as they exhibited low correlation and statistically significant differences in results. The slider-based experiment offered the advantage of significantly shorter completion times, contributing to higher observer satisfaction. In contrast, the 3AFC task provided a more robust interface for collecting preferences. By comparing the results obtained by the two methods, this study provides information on their respective strengths, limitations, and suitability for use in similar preference acquisition tasks.
Download

Paper Nr: 199
Title:

Most Relevant Viewpoint of an Object: A View-Dependent 3D Saliency Approach

Authors:

Marie Pelissier-Combescure, Sylvie Chambon and Géraldine Morin

Abstract: A viewpoint of a 3D object is the position from which we observe the object. A viewpoint always highlights some 3D parts of an object and discards others. Here, we define a good viewpoint as one offering a relevant view of the object: a view that best showcases the object and that is the most representative of it. Best view selection plays an essential role in many computer vision and virtual reality applications. In this paper, given a model and a particular viewpoint, we want to quantify its relevance, not its aesthetics. We propose a geometric method for selecting the most relevant viewpoint for a 3D object by combining visibility and view-dependent saliency. Evaluating the quality of an estimated best viewpoint is a challenge. Thus, we propose an evaluation protocol that considers two different and complementary solutions: a user study with more than 200 participants to collect human preferences, and an analysis of an image dataset picturing objects of interest. This evaluation highlights the correlation between our method and human preferences. A quantitative comparison demonstrates the efficiency of our approach over reference methods.
Download

Paper Nr: 214
Title:

XYZ Unsupervised Network: A Robust Image Dehazing Approach

Authors:

Percy Maldonado-Quispe and Helio Pedrini

Abstract: In this work, we examine a major but little-explored topic in image dehazing neural networks: how to remove haze (a natural phenomenon) from a given image in an unsupervised manner. By considering a hazy image as the entanglement of several “simpler” layers, such as a haze-free image layer, a transmission map layer, and an atmospheric light layer, as described by the atmospheric scattering model, we propose a method based on the concept of layer disentanglement. Our XYZ approach, a combination of the XHOT, YOLY, and ZID methods that retains the advantages of each, improves on the SSIM and PSNR metrics. The main benefits of the proposed XYZ are twofold. First, since it is an unsupervised approach, no clean photos, including hazy-clear pairs, are used as ground truth; in other words, it departs from the traditional paradigm of training deep models on large datasets. Second, it treats haze as being composed of several layers.
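As a rough illustration of the layer view, the atmospheric scattering model the abstract refers to composes a hazy pixel from a haze-free radiance layer, a transmission layer, and an atmospheric light (airlight) layer. A minimal pure-Python sketch (the numeric values are illustrative, not from the paper):

```python
# Atmospheric scattering model: a hazy intensity I is modelled as
#   I = J * t + A * (1 - t)
# where J is the haze-free radiance, t the transmission, A the airlight.

def hazy_pixel(J, t, A):
    """Compose a hazy intensity from its three 'layers'."""
    return J * t + A * (1.0 - t)

def recover_radiance(I, t, A, t_min=0.1):
    """Invert the model; t is clamped to avoid division blow-up."""
    return (I - A * (1.0 - t)) / max(t, t_min)

I = hazy_pixel(J=0.8, t=0.6, A=0.9)      # simulate haze on one pixel
J_hat = recover_radiance(I, t=0.6, A=0.9)  # disentangle it again
```

The unsupervised setting amounts to estimating J, t, and A jointly from I alone, which is why the paper decomposes the image into layers rather than learning from hazy-clear pairs.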
Download

Paper Nr: 221
Title:

Combining Total Variation and Nonlocal Variational Models for Low-Light Image Enhancement

Authors:

Daniel Torres, Catalina Sbert and Joan Duran

Abstract: Images captured under low-light conditions impose significant limitations on the performance of computer vision applications. Therefore, improving their quality by discounting the effects of the illumination is crucial. In this paper, we present a low-light image enhancement method based on the Retinex theory. Our approach estimates illumination and reflectance in two steps. First, the illumination is obtained as the minimizer of an energy functional involving total variation regularization, which favours piecewise smooth solutions. Next, the reflectance component is computed as the minimizer of an energy functional involving contrast-invariant nonlocal regularization and a fidelity term preserving the largest gradients of the input image.
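As a hedged illustration of the first step, TV-regularized illumination estimation can be sketched in one dimension with a smoothed TV term and plain gradient descent (all parameter values here are illustrative assumptions, not the paper's):

```python
import math

def estimate_illumination(I, lam=0.05, steps=500, lr=0.1, eps=1e-3):
    """Minimize sum_i (L_i - I_i)^2 + lam * sum_i sqrt((L_{i+1} - L_i)^2 + eps)
    by gradient descent; the smoothed TV term favours piecewise smooth L."""
    L = list(I)
    n = len(L)
    for _ in range(steps):
        g = [2.0 * (L[i] - I[i]) for i in range(n)]  # fidelity gradient
        for i in range(n - 1):                       # smoothed-TV gradient
            d = L[i + 1] - L[i]
            w = d / math.sqrt(d * d + eps)
            g[i] -= lam * w
            g[i + 1] += lam * w
        L = [L[i] - lr * g[i] for i in range(n)]
    return L

# A toy 1D "image": two nearly flat plateaus with a sharp illumination edge.
I = [0.2, 0.21, 0.19, 0.8, 0.82, 0.81]
L = estimate_illumination(I)
```

The estimate flattens the small wiggles inside each plateau while largely preserving the sharp edge, which is the piecewise-smooth behaviour the functional is designed to favour.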
Download

Paper Nr: 229
Title:

Oral Dysplasia Classification by Using Fractal Representation Images and Convolutional Neural Networks

Authors:

Rafael O. Carvalho, Adriano B. Silva, Alessandro S. Martins, Sérgio V. Cardoso, Guilherme R. Freire, Paulo R. de Faria, Adriano M. Loyola, Thaína A. Tosta, Leandro A. Neves and Marcelo Z. do Nascimento

Abstract: Oral cavity lesions can be graded by specialists, a task that is both difficult and subjective. The challenges in defining patterns can lead to inconsistencies in the diagnosis, often due to color variations in the histological images. The development of computational systems has emerged as an effective approach for aiding specialists in the diagnosis process, with color normalization techniques shown to enhance diagnostic accuracy. There remains an open challenge in understanding the impact of color normalization on the classification of histological tissues representing dysplasia groups. This study presents an approach to classify dysplasia lesions based on ensemble models, fractal representations, and convolutional neural networks (CNN). Additionally, this work evaluates the influence of color normalization in the preprocessing stage. The results obtained with the proposed methodology were analyzed with and without the preprocessing stage. This approach was applied to a dataset composed of 296 histological images categorized into healthy, mild, moderate, and severe oral epithelial dysplasia tissues. The proposed ensemble-based approaches were evaluated with cross-validation, resulting in accuracy rates ranging from 96.1% to 98.5% on the non-normalized dataset. This approach can be employed as a supplementary tool for clinical applications, aiding specialists in decision-making regarding lesion classification.
Download

Paper Nr: 236
Title:

Automated Brain Lobe Segmentation and Feature Extraction from Multiple Sclerosis Lesions Using Deep Learning

Authors:

Nada Haj Messaoud, Rim Ayari, Asma Ben Abdallah and Mohamed Hedi Bedoui

Abstract: This study focuses on automating the segmentation of brain lobes in MRI images of Multiple Sclerosis (MS) lesions to extract crucial features for predicting disability levels. Extracting significant features from MRI images of MS lesions is a complex task due to the variability in lesion characteristics and the detailed nature of MRI images. Furthermore, existing studies require continuous patient monitoring. Our contribution therefore lies in proposing an approach for the automatic segmentation of brain lobes and the extraction of lesion features (number, size, location, etc.) to predict disability levels in MS patients. To achieve this, we introduce a model inspired by U-Net to segment the different brain lobes, aiming to accurately locate the MS lesions. We utilized two databases, one private and one public, and achieved a mean IoU score of 0.70, which can be considered encouraging. Following the segmentation phase, approximately 7,200 features were extracted from the MRI scans of MS patients.
Download

Paper Nr: 259
Title:

SynthRSF: A Novel Photorealistic Synthetic Dataset for Adverse Weather Condition Denoising

Authors:

Angelos Kanlis, Vazgken Vanian, Sotiris Karvarsamis, Ioanna Gkika, Konstantinos Konstantoudakis and Dimitrios Zarpalas

Abstract: This paper presents the SynthRSF dataset for training and evaluating single-image rain, snow and haze denoising algorithms, as well as evaluating object detection, semantic segmentation, and depth estimation performance in noisy or denoised images. Our dataset features 26,893 noisy images, each accompanied by its corresponding ground truth image. It further includes 13,800 noisy images accompanied by ground truth, 16-bit depth maps and pixel-accurate annotations for various object instances in each frame. The utility of SynthRSF is assessed by training unified models for rain, snow, and haze removal, achieving good objective metrics and excellent subjective results compared to existing adverse weather condition datasets. Furthermore, we demonstrate its use as a benchmark for the performance of an object detection algorithm in weather-degraded image datasets.
Download

Paper Nr: 282
Title:

Curriculum for Crowd Counting: Is It Worthy?

Authors:

Muhammad Asif Khan, Hamid Menouar and Ridha Hamila

Abstract: Recent advances in deep learning techniques have achieved remarkable performance in several computer vision problems. A notably intuitive technique called Curriculum Learning (CL) has recently been introduced for training deep learning models. Surprisingly, curriculum learning achieves significantly improved results in some tasks but marginal or no improvement in others. Hence, there is still debate about its adoption as a standard method for training supervised learning models. In this work, we investigate the impact of curriculum learning on crowd counting using the density estimation method. We performed detailed investigations by conducting 112 experiments covering six different CL settings and eight different crowd models. Our experiments show that curriculum learning improves model learning performance and shortens the convergence time.
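For readers unfamiliar with CL, its core idea can be sketched as a pacing function that feeds the model an easy subset first and grows it over epochs. The difficulty scores, schedule, and sample names below are hypothetical, not the paper's settings:

```python
def curriculum_subset(samples, difficulty, epoch, total_epochs, start_frac=0.3):
    """Return the training subset for `epoch`: easiest samples first,
    with the subset growing linearly until the full set is used."""
    order = sorted(range(len(samples)), key=lambda i: difficulty[i])
    frac = min(1.0, start_frac + (1.0 - start_frac) * epoch / max(1, total_epochs - 1))
    k = max(1, round(frac * len(samples)))
    return [samples[i] for i in order[:k]]

samples = ["img_a", "img_b", "img_c", "img_d"]
difficulty = [0.9, 0.1, 0.5, 0.3]  # e.g. crowd density as a difficulty proxy
first = curriculum_subset(samples, difficulty, epoch=0, total_epochs=10)
last = curriculum_subset(samples, difficulty, epoch=9, total_epochs=10)
```

Here `first` contains only the easiest image, while `last` is the full set ordered from easy to hard; the six CL settings studied in the paper vary this kind of scoring and pacing.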
Download

Paper Nr: 294
Title:

Learning Projection Patterns for Direct-Global Separation

Authors:

Takaoki Ueda, Ryo Kawahara and Takahiro Okabe

Abstract: Separating the direct component, such as diffuse and specular reflection, from the global component, such as inter-reflection and subsurface scattering, is important for various computer vision and computer graphics applications. Conventionally, high-frequency patterns designed by physics-based models or signal processing theory are projected from a projector onto a scene, but their assumptions do not necessarily hold for real images due to the shallow depth of field of a projector and the limited spatial resolution of a camera. Accordingly, in this paper, we propose a data-driven approach to direct-global separation. Specifically, our method learns not only the separation module but also the imaging module, i.e., the projection patterns, at the same time in an end-to-end manner. We conduct a number of experiments using real images captured with a projector-camera system and confirm the effectiveness of our method.
Download

Paper Nr: 310
Title:

Influence of Pixel Perturbation on eXplainable Artificial Intelligence Methods

Authors:

Juliana C. Feitosa, Mateus Roder, João P. Papa and José F. Brega

Abstract: The current scenario around Artificial Intelligence (AI) has demanded increasingly transparent explanations of existing models. The use of eXplainable Artificial Intelligence (XAI) has been considered a solution in the search for explainability. As such, XAI methods can be used to verify the influence of adverse scenarios, such as pixel perturbation, on AI models for segmentation. This paper presents experiments performed with fish images of the Pacu species to determine the influence of pixel perturbation through the following explainable methods: Grad-CAM, Saliency Map, Layer Grad-CAM, and CNN Filters. The perturbed pixels were those considered most important for the model during the segmentation of the input image regions. Among existing pixel perturbation techniques, the images were subjected to three main ones: white noise, black noise, and random noise. From the results obtained, it was observed that the Grad-CAM method behaved differently for each perturbation technique tested, while the CNN Filters method showed more stability in the variation of the image averaging. The Saliency Map was the least sensitive to the three types of perturbation, as it required fewer iterations. Furthermore, of the perturbation techniques tested, black noise showed the least ability to impact segmentation. Thus, we conclude that the perturbation methods influence the outcome of the explainable models tested and interfere with these models in different ways. We suggest that the experiments presented here be replicated on other AI models, other explainability methods, and other existing perturbation techniques to gather more evidence about this influence and, from that, quantify which combination of XAI method and pixel perturbation is best for a given problem.
Download

Paper Nr: 318
Title:

Convolutional Neural Networks and Image Patches for Lithological Classification of Brazilian Pre-Salt Rocks

Authors:

Mateus Roder, Leandro A. Passos, Clayton Pereira, João P. Papa, Altanir D. Mello Junior, Marcelo Fagundes de Rezende, Yaro P. Silva and Alexandre Vidal

Abstract: Lithological classification is a process employed to recognize and interpret distinct structures of rocks, providing essential information regarding their petrophysical, morphological, textural, and geological aspects. The process is particularly interesting regarding carbonate sedimentary rocks in the context of petroleum basins since such rocks can store large quantities of natural gas and oil. Thus, their features are intrinsically correlated with the production potential of an oil reservoir. This paper proposes an automatic pipeline for the lithological classification of carbonate rocks into seven distinct classes, comparing nine state-of-the-art deep learning architectures. As far as we know, this is the largest study in the field. Experiments were performed over a private dataset obtained from a Brazilian petroleum company, showing that MobileNetV3-large is the most suitable approach for the task.
Download

Paper Nr: 332
Title:

Error Analysis of Aerial Image-Based Relative Object Position Estimation

Authors:

Zsombor Páncsics, Nelli Nyisztor, Tekla Tóth, Imre B. Juhász, Gergely Treplán and Levente Hajder

Abstract: This paper presents a thorough analysis of precision and sensitivity in aerial image-based relative object position estimation, exploring factors such as camera tilt, 3D projection error, marker misalignment, rotation, and calibration error. Our unique contribution lies in simulating complex 3D geometries at varying camera altitudes (20-130 m). The simulator has a unique built-in mathematical model offering an extensive set of error parameters to improve the reliability of aerial image-based position estimation in practical applications.
Download

Paper Nr: 333
Title:

A Computer Vision Approach to Compute Bubble Flow of Offshore Wells

Authors:

Rogerio C. Hart and Aura Conci

Abstract: This work presents two approaches for detecting and quantifying the offshore flow of leaks, using video recorded by a remotely operated vehicle (ROV), through underwater image analysis and under the premise of no bubble overlap. One is designed using only traditional digital image processing, such as Mathematical Morphology operators and Canny edge detection, while the second uses a segmentation Convolutional Neural Network. Implementation and experimentation details are presented, enabling comparison and reproduction. The results are compared with videos acquired under controlled conditions and in an operational situation, as well as with all comparable previous works. The comparison considers the estimation of the average diameter of rising bubbles, rise velocity, leak flow rate, computational automation, and flexibility in bubble recognition. The results of the two techniques are nearly identical, depending on the video content under analysis.
Download

Paper Nr: 338
Title:

Blind Deblurring of THz Time-Domain Images Based on Low-Rank Representation

Authors:

Marina Ljubenović, Mário T. Figueiredo and Arianna Traviglia

Abstract: Terahertz (THz) time-domain imaging holds immense potential for material characterization, capturing three-dimensional data across spatial and temporal dimensions. Despite its capabilities, the technology faces hurdles such as frequency-dependent beam-shape effects and noise. This paper proposes a novel, dual-stage framework for improving THz image resolution beyond the wavelength limit. Our method combats blur at lower frequencies and noise at higher frequencies. The first stage entails selective deblurring of lower-frequency bands, addressing beam-related blurring, while the second stage involves denoising the entire THz hyperspectral cube through dimensionality reduction, exploiting its low-rank structure. The synergy of these advanced techniques (beam shaping, noise removal, and low-rank representation) forms a comprehensive approach to enhancing THz time-domain images. We present promising preliminary results, showcasing significant improvements across all frequency bands, which is crucial as samples may display varying features across the THz spectrum. Our ongoing work is extending this methodology to complex scenarios such as analyzing multilayered structures in closed ancient manuscripts. This approach paves the way for broader application and refinement of THz imaging in diverse research fields.
Download

Paper Nr: 379
Title:

Optical Illusion in Which Line Segments Continue to Grow or Shrink by Displaying Two Images Alternately

Authors:

Kazuhisa Yanaka and Sota Mihara

Abstract: A new illusion has been discovered, wherein line segments, when alternately displayed with their tonal inversion or monochromatic images for approximately 120 ms each on a monochromatic background, seem to grow or shrink continuously. For instance, if the first image features black line segments on a white background and the second image shows the inverse brightness, switching between these two images causes the line segments to give the illusion of continuous expansion. Although a single line segment suffices, aligning multiple line segments parallel to each other enhances the effect of this illusion. This illusion can be achieved using achromatic colors, such as black and white, as well as chromatic colors, such as red, blue, and green. Specifically, when using an image with a black line segment on a red background alongside its brightness-inverted counterpart, the line segments appear to steadily decrease in length. Our hypothesis suggests a comparison between the mechanisms of this illusion and the changes in water volume in a pond.
Download

Paper Nr: 381
Title:

SAM-Based Detection of Structural Anomalies in 3D Models for Preserving Cultural Heritage

Authors:

David Jurado-Rodríguez, Alfonso López, J. R. Jiménez, Antonio Garrido, Francisco R. Feito and Juan M. Jurado

Abstract: The detection of structural defects and anomalies in cultural heritage emerges as an essential component to ensure the integrity and safety of buildings, plan preservation strategies, and promote the sustainability and durability of buildings over time. In the search to enhance the effectiveness and efficiency of structural health monitoring of cultural heritage, this work aims to develop an automated method focused on detecting unwanted materials and geometric anomalies on the 3D surfaces of ancient buildings. In this study, the proposed solution combines an AI-based technique for fast-forward image labeling and a fully automatic detection of target classes in 3D point clouds. As an advantage of our method, the use of spatial and geometric features in the 3D models enables the recognition of target materials in the whole point cloud from seed, resulting from partial detection in a few images. The results demonstrate the feasibility and utility of detecting self-healing materials, unwanted vegetation, lichens, and encrusted elements in a real-world scenario.
Download

Paper Nr: 391
Title:

A Generative Model for Guided Thermal Image Super-Resolution

Authors:

Patricia L. Suárez and Angel D. Sappa

Abstract: This paper presents a novel approach to thermal image super-resolution based on a fusion prior: the low-resolution thermal image and the brightness channel of the corresponding visible spectrum image. The method combines bicubic interpolation of the ×8 scale target image with the brightness component. To enhance the guidance process, the original RGB image is converted to HSV and the brightness channel is extracted. Bicubic interpolation is then applied to the low-resolution thermal image, resulting in a bicubic-brightness channel blend. This luminance-bicubic fusion is used as an input image to aid the training process. With this fused image, the cyclic generative adversarial network obtains high-resolution thermal image results. Experimental evaluations show that the proposed approach significantly improves spatial resolution and pixel intensity levels compared to other state-of-the-art techniques, making it a promising method for obtaining high-resolution thermal images.
Download

Paper Nr: 417
Title:

Colorectal Image Classification Using Randomized Neural Network Descriptors

Authors:

Jarbas M. Sá Junior and André R. Backes

Abstract: Colorectal cancer is among the highest incident cancers in the world. A fundamental procedure to diagnose it is the analysis of histological images acquired from a biopsy. Because of this, computer vision approaches have been proposed to help human specialists in such a task. In order to contribute to this field of research, this paper presents a novel way of analyzing colorectal images by using a very discriminative texture signature based on weights of a randomized neural network. For this, we addressed an important multi-class problem composed of eight types of tissues. The results were promising, surpassing the accuracies of many methods present in the literature. Thus, this performance confirms that the randomized neural network signature is an efficient tool for discriminating histological images from colorectal tissues.
Download

Paper Nr: 431
Title:

Deformable Pose Network: A Multi-Stage Deformable Convolutional Network for 2D Hand Pose Estimation

Authors:

Sartaj A. Salman, Ali Zakir and Hiroki Takahashi

Abstract: Hand pose estimation has undergone significant advancement with the evolution of Convolutional Neural Networks (CNNs) in the field of computer vision. However, existing CNNs fail in many scenarios to learn the unknown transformations and geometric constraints, along with other existing challenges, for accurate estimation of hand keypoints. To tackle these issues, we propose a multi-stage deformable convolutional network for accurate 2D hand pose estimation from monocular RGB images while considering computational complexity. We utilize EfficientNet as a backbone due to its powerful feature extraction capability, and deformable convolution to learn the geometric constraints. Our proposed model, called Deformable Pose Network (DPN), outperforms existing methods in predicting 2D keypoints in complex scenarios. Our analysis on the Panoptic Studio hand dataset shows that our proposed model improves accuracy by 2.36% and 7.29% compared to the existing OCPM and CPM methods, respectively.
Download

Paper Nr: 442
Title:

Selection of Backbone for Feature Extraction with U-Net in Pancreas Segmentation

Authors:

Alexandre C. Araújo, Joao D. Sousa de Almeida, Anselmo Cardoso de Paiva and Geraldo Braz Junior

Abstract: The survival rate for pancreatic cancer is among the worst, with a mortality rate of 98%. Diagnosis in the early stage of the disease is the main factor that defines the prognosis. Imaging scans, such as Computerized Tomography scans, are the primary tools for early diagnosis. Computer-Assisted Diagnosis tools that use these scans usually include the segmentation of the pancreas in their pipeline as one of the initial steps for diagnosis. This paper presents a comparative study of the use of different backbones in combination with U-Net. This study aims to demonstrate that using pre-trained backbones is a valuable tool for pancreas segmentation and to provide a comparative benchmark for this task. The best result obtained was a Dice score of 85.96% on the MSD dataset for pancreas segmentation, using the EfficientNetB7 backbone.
Download

Paper Nr: 452
Title:

RetailKLIP: Finetuning OpenCLIP Backbone Using Metric Learning on a Single GPU for Zero-Shot Retail Product Image Classification

Authors:

Muktabh M. Srivastava

Abstract: Images of retail products and packaged grocery goods need to be classified in various computer vision applications such as self-checkout stores, supply chain automation, and retail execution evaluation. Previous works explore ways to finetune deep models for this purpose. However, because finetuning a large model, or even a linear layer on top of a pretrained backbone, requires running at least a few epochs of gradient descent for every new retail product added to the classification range, frequent retrainings are needed in real-world scenarios. In this work, we propose finetuning the vision encoder of a CLIP model so that its embeddings can be used directly for nearest-neighbor classification, while achieving accuracy close to or exceeding full finetuning. A nearest-neighbor classifier needs no incremental training for new products, thus saving resources and wait time.
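A minimal sketch of the nearest-neighbor step in pure Python. The toy 3-d vectors and product names below are hypothetical stand-ins for embeddings that would come from the finetuned CLIP vision encoder:

```python
import math

def normalize(v):
    """Scale a vector to unit length, so dot product = cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class NearestNeighborClassifier:
    """Index of (embedding, label) pairs; adding a product needs no training."""
    def __init__(self):
        self.index = []

    def add(self, embedding, label):
        self.index.append((normalize(embedding), label))

    def predict(self, embedding):
        q = normalize(embedding)
        sims = [(sum(a * b for a, b in zip(e, q)), lbl) for e, lbl in self.index]
        return max(sims)[1]  # label of the most cosine-similar entry

clf = NearestNeighborClassifier()
clf.add([0.9, 0.1, 0.0], "cereal_box")  # enrolling a new product is just an append
clf.add([0.0, 0.2, 0.9], "soda_can")
pred = clf.predict([0.8, 0.2, 0.1])
```

This is why the approach avoids retraining: adding a product only appends one embedding to the index, while gradient-descent finetuning would need another pass over the data.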
Download

Area 2 - Mobile and Egocentric Vision for Humans and Robots

Full Papers
Paper Nr: 57
Title:

Event-Based Semantic-Aided Motion Segmentation

Authors:

Chenao Jiang, Julien Moreau and Franck Davoine

Abstract: Event cameras are emerging visual sensors inspired by biological systems. They capture intensity changes asynchronously with temporal precision on the order of microseconds, in contrast to traditional frame imaging techniques running at a fixed frequency of tens of Hz. However, effectively utilizing the data generated by these sensors requires the development of new algorithms and processing. In light of event cameras’ significant advantages in capturing high-speed motion, researchers have turned their attention to event-based motion segmentation. Building upon the framework of (Mitrokhin et al., 2019), we propose leveraging semantic segmentation to enable the end-to-end network not only to segment moving objects from background motion, but also to achieve semantic segmentation of distinct moving objects. Remarkably, these capabilities are achieved while maintaining the network’s low parameter count of 2.5M. To validate the effectiveness of our approach, we conduct experiments using the EVIMO dataset and the new and more challenging EVIMO2 dataset (Burner et al., 2022). The results demonstrate the improvements attained by our method, showcasing its potential in event-based multi-object motion segmentation.
Download

Paper Nr: 166
Title:

Semantic State Estimation in Robot Cloth Manipulations Using Domain Adaptation from Human Demonstrations

Authors:

Georgies Tzelepis, Eren E. Aksoy, Júlia Borràs and Guillem Alenyà

Abstract: Deformable object manipulations, such as those involving textiles, present a significant challenge due to their high dimensionality and complexity. In this paper, we propose a solution for estimating semantic states in cloth manipulation tasks. To this end, we introduce a new, large-scale, fully-annotated RGB image dataset of semantic states featuring a diverse range of human demonstrations of various complex cloth manipulations. This effectively transforms the problem of action recognition into a classification task. We then evaluate the generalizability of our approach by employing domain adaptation techniques to transfer knowledge from human demonstrations to two distinct robotic platforms: Kinova and UR robots. Additionally, we further improve performance by utilizing a semantic state graph learned from human manipulation data.
Download

Paper Nr: 174
Title:

Hand Mesh and Object Pose Reconstruction Using Cross Model Autoencoder

Authors:

Chaitanya Bandi and Ulrike Thomas

Abstract: Hands and objects severely occlude each other, making it extremely challenging to estimate the hand-object pose during human-robot interactions. In this work, we propose a framework that jointly estimates 3D hand mesh and 6D object pose in real-time. The framework shares the features of a single network with both the hand pose estimation network and the object pose estimation network. Hand pose estimation is a parametric model that regresses the shape and pose parameters of the hand. The object pose estimation network is a cross-model variational autoencoder network for the direct reconstruction of an object’s 6D pose. Our method shows substantial improvement in object pose estimation on two large-scale open-source datasets.
Download

Paper Nr: 175
Title:

Multi-View Inversion for 3D-aware Generative Adversarial Networks

Authors:

Florian Barthel, Anna Hilsmann and Peter Eisert

Abstract: Current 3D GAN inversion methods for human heads typically use only one single frontal image to reconstruct the whole 3D head model. This leaves out meaningful information when multi-view data or dynamic videos are available. Our method builds on existing state-of-the-art 3D GAN inversion techniques to allow for consistent and simultaneous inversion of multiple views of the same subject. We employ a multi-latent extension to handle inconsistencies present in dynamic face videos to re-synthesize consistent 3D representations from the sequence. As our method uses additional information about the target subject, we observe significant enhancements in both geometric accuracy and image quality, particularly when rendering from wide viewing angles. Moreover, we demonstrate the editability of our inverted 3D renderings, which distinguishes them from NeRF-based scene reconstructions.
Download

Paper Nr: 188
Title:

HD-VoxelFlex: Flexible High-Definition Voxel Grid Representation

Authors:

Igor Vozniak, Pavel Astreika, Philipp Müller, Nils Lipp, Christian Müller and Philipp Slusallek

Abstract: Voxel grids are an effective means to represent 3D data, as they accurately preserve spatial relations. However, the inherent sparseness of voxel grid representations leads to significant memory consumption in deep learning architectures, in particular for high-resolution (HD) inputs. As a result, current state-of-the-art approaches to the reconstruction of 3D data tend to avoid voxel grid inputs. In this work, we propose HD-VoxelFlex, a novel 3D CNN architecture that can be flexibly applied to HD voxel grids with only moderate increase in training parameters and memory consumption. HD-VoxelFlex introduces three architectural novelties. First, to improve the models’ generalizability, we introduce a random shuffling layer. Second, to reduce information loss, we introduce a novel reducing skip connection layer. Third, to improve modelling of local structure that is crucial for HD inputs, we incorporate a kNN distance mask as input. We combine these novelties with a “bag of tricks” identified in a comprehensive literature review. Based on these novelties we propose six novel building blocks for our encoder-decoder HD-VoxelFlex architecture. In evaluations on the ModelNet10/40 and PCN datasets, HD-VoxelFlex outperforms the state-of-the-art in all point cloud reconstruction metrics. We show that HD-VoxelFlex is able to process high-definition (128³, 192³) voxel grid inputs at much lower memory consumption than previous approaches. Furthermore, we show that HD-VoxelFlex, without additional fine-tuning, demonstrates competitive performance in the classification task, proving its generalization ability. As such, our results underline the neglected potential of voxel grid input for deep learning architectures.
Download

Paper Nr: 337
Title:

BEVFastLine: Single Shot Fast BEV Line Detection for Automated Parking Applications

Authors:

Praveen Narasappareddygari, Venkatesh M. Karunamoorthy, Shubham Sonarghare, Ganesh Sistu and Prasad Deshpande

Abstract: In autonomous parking scenarios, accurate near-field environmental perception is crucial for smooth operations. Parking line detection, unlike the well-understood lane detection, poses unique challenges due to its lack of spatial consistency in orientation, location, and varied appearances in color, pattern, and background surfaces. Consequently, state-of-the-art models for lane detection, which rely on anchors and offsets, are not directly applicable. This paper introduces BEVFastLine, a novel end-to-end line marking detection architecture in Bird's Eye View (BEV) space, designed for 360° multi-camera perception applications. BEVFastLine integrates our single-shot line detection methodology with advanced Inverse Perspective Mapping (IPM) techniques, notably our fast splatting technique, to efficiently detect line markings in varied spatial contexts. This approach is suitable for real-time hardware in Level-3 automated vehicles. BEVFastLine accurately localizes parking lines in BEV space with up to 10 cm precision. Our methods, including the 4X faster Fast Splat and single-shot detection, surpass LSS and OFT in accuracy, achieving 80.1% precision, 90% recall, and nearly doubling the performance of BEV-based segmentation and polyline models. This streamlined solution is highly effective in complex, dynamic parking environments, offering high precision localization within 10 meters around the ego vehicle.
Download

Paper Nr: 427
Title:

Fooling Neural Networks for Motion Forecasting via Adversarial Attacks

Authors:

Edgar Medina and Leyong Loh

Abstract: Human motion prediction is still an open problem, which is extremely important for autonomous driving and safety applications. Although there are great advances in this area, the widely studied topic of adversarial attacks has not been applied to multi-regression models such as GCNs and MLP-based architectures in human motion prediction. This work intends to reduce this gap using extensive quantitative and qualitative experiments on state-of-the-art architectures, similar to the initial studies of adversarial attacks in image classification. The results suggest that models are susceptible to attacks even at low levels of perturbation. We also show experiments with 3D transformations that affect the model performance, in particular, we show that most models are sensitive to simple rotations and translations which do not alter joint distances. We conclude that, similar to earlier CNN models, motion forecasting tasks are susceptible to small perturbations and simple 3D transformations.
Download

Short Papers
Paper Nr: 41
Title:

Informative Rays Selection for Few-Shot Neural Radiance Fields

Authors:

Marco Orsingher, Anthony Dell’Eva, Paolo Zani, Paolo Medici and Massimo Bertozzi

Abstract: Neural Radiance Fields (NeRF) have recently emerged as a powerful method for image-based 3D reconstruction, but the lengthy per-scene optimization limits their practical usage, especially in resource-constrained settings. Existing approaches solve this issue by reducing the number of input views and regularizing the learned volumetric representation with either complex losses or additional inputs from other modalities. In this paper, we present KeyNeRF, a simple yet effective method for training NeRF in few-shot scenarios by focusing on key informative rays. Such rays are first selected at camera level by a view selection algorithm that promotes baseline diversity while guaranteeing scene coverage, then at pixel level by sampling from a probability distribution based on local image entropy. Our approach performs favorably against state-of-the-art methods, while requiring minimal changes to existing NeRF codebases.
Download

Paper Nr: 52
Title:

Augmenting Human-Robot Collaboration Task by Human Hand Position Forecasting

Authors:

Shyngyskhan Abilkassov, Michael Gentner and Mirela Popa

Abstract: Human-Robot collaboration (HRC) plays a critical role in enhancing productivity and safety across various industries. While reactive motion re-planning strategies have proven useful, there is a pressing need for proactive control involving computing human intentions to enable efficient collaboration. This work addresses this challenge by proposing a deep learning-based approach for forecasting human hand trajectories and a heuristic optimization algorithm for proactive optimization of the robotic task sequencing problem. This work presents a human hand trajectory forecasting deep learning model that achieves state-of-the-art performance on the Ego4D Future Hand Prediction benchmark in all evaluation metrics. In addition, this work presents a problem formulation and a Dynamic Variable Neighborhood Search (DynamicVNS) heuristic optimization algorithm enabling robots to pre-plan their task sequences to avoid human hands. The proposed algorithm exhibits significant computational improvements over the generalized VNS approach. The final framework efficiently incorporates predictions made by the deep learning model into the task sequencer, which is evaluated in an experimental setup for the HRC use-case of the UR10e robot in a visual inspection task. The results indicate the effectiveness and practicality of the proposed approach, showcasing its potential to improve human-robot collaboration in various industrial settings.
Download

Paper Nr: 137
Title:

Analysis of Point Cloud Domain Gap Effects for 3D Object Detection Evaluation

Authors:

Aitor Iglesias, Mikel García, Nerea Aranjuelo, Ignacio Arganda-Carreras and Marcos Nieto

Abstract: The development of autonomous driving systems heavily relies on high-quality LiDAR data, which is essential for robust object detection and scene understanding. Nevertheless, obtaining a substantial amount of such data for effective training and evaluation of autonomous driving algorithms is a major challenge. To overcome this limitation, recent studies are taking advantage of advancements in realistic simulation engines, such as CARLA, which have provided a breakthrough in generating synthetic LiDAR data that closely resembles real-world scenarios. However, these data are far from being identical to real data. In this study, we address the domain gap between real LiDAR data and synthetic data. We train deep-learning models for object detection using real data. Then, those models are rigorously evaluated using synthetic data generated in CARLA. By quantifying the discrepancies between the model’s performance on real and synthetic data, the present study shows that there is indeed a domain gap between the two types of data and that it does not affect all model architectures equally. Finally, we propose a method for synthetic data processing to reduce this domain gap. This research contributes to enhancing the use of synthetic data for autonomous driving systems.
Download

Paper Nr: 148
Title:

Incorporating Temporal Information into 3D Hand Pose Estimation Using Scene Flow

Authors:

Niklas Hermes, Alexander Bigalke and Mattias P. Heinrich

Abstract: In this paper we present a novel approach that uses 3D point cloud sequences to integrate temporal information and spatial constraints into existing 3D hand pose estimation methods in order to establish an improved prediction of 3D hand poses. We utilize scene flow to match correspondences between two point sets and present a method that optimizes and harnesses existing scene flow networks for the application of 3D hand pose estimation. For increased generalizability, we propose a module that learns to recognize spatial hand pose associations to transform existing poses into a low-dimensional pose space. In a comprehensive evaluation on the public dataset NYU, we show the benefits of our individual modules and provide insights into the generalization capabilities and the behaviour of our method with noisy data. Furthermore, we demonstrate that our method reduces the error of existing state-of-the-art 3D hand pose estimation methods by up to 7.6%. With a speed of over 40 fps our method is real-time capable and can be integrated into existing 3D hand pose estimation methods with little computational overhead.
Download

Paper Nr: 170
Title:

PIRO: Permutation-Invariant Relational Network for Multi-Person 3D Pose Estimation

Authors:

Nicolas Ugrinovic, Adria Ruiz, Antonio Agudo, Alberto Sanfeliu and Francesc Moreno-Noguer

Abstract: Recovering multi-person 3D poses from a single RGB image is an ill-conditioned problem due to the inherent 2D-3D depth ambiguity, inter-person occlusions, and body truncation. To tackle these issues, recent works have shown promising results by simultaneously reasoning for different individuals. However, in most cases this is done by only considering pairwise inter-person interactions or between pairs of body parts, thus hindering a holistic scene representation able to capture long-range interactions. Some approaches that jointly process all people in the scene require defining one of the individuals as a reference and a pre-defined person ordering, or limiting the number of individuals, thus being sensitive to these choices. In this paper, we overcome both these limitations, and we propose an approach for multi-person 3D pose estimation that captures long-range interactions independently of the input order. We build a residual-like permutation-invariant network that successfully refines potentially corrupted initial 3D poses estimated by off-the-shelf detectors. The residual function is learned via a Set Attention (Lee et al., 2019) mechanism. Despite our model being relatively straightforward, a thorough evaluation demonstrates that our approach is able to boost the performance of the initially estimated 3D poses by large margins, achieving state-of-the-art results on two standardized benchmarks.

Paper Nr: 183
Title:

Social Distancing Monitoring by Human Detection Through Bird’s-Eye View Technique

Authors:

Gona Rozhbayani, Amel Tuama and Fadwa Al-Azzo

Abstract: The objective of this study is to offer a YOLOv5 deep learning-based system for social distance monitoring. The YOLOv5 model has been used to detect humans in real-time video frames, and to obtain information on the detected bounding box for the bird’s eye view perspective technique. The pairwise distances of the identified bounding box centroids of people are calculated by utilizing Euclidean distance. In addition, a threshold value has been set and applied as an approximation of social distance in pixels for determining social distance violations between people. The effectiveness of this proposed system is tested by experiments on four different video frames. The suggested system’s performance showed a high level of efficiency, monitoring social distancing with accuracy of up to 100%.
Download
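The distance-thresholding step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: bounding boxes are assumed to be (x1, y1, x2, y2) pixel tuples in the bird's-eye-view plane, and the function names are illustrative.

```python
from itertools import combinations
import math

def centroid(box):
    # box given as (x1, y1, x2, y2) in pixel coordinates
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def find_violations(boxes, threshold_px):
    """Return index pairs of detections whose centroid distance falls
    below the pixel threshold approximating the social distance."""
    centroids = [centroid(b) for b in boxes]
    violations = []
    for i, j in combinations(range(len(centroids)), 2):
        if math.dist(centroids[i], centroids[j]) < threshold_px:
            violations.append((i, j))
    return violations
```

In practice the threshold would be calibrated after the bird's-eye-view transform, so that a fixed pixel distance corresponds to the required physical separation.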

Paper Nr: 277
Title:

World-Map Misalignment Detection for Visual Navigation Systems

Authors:

Rosario Forte, Michele Mazzamuto, Francesco Ragusa, Giovanni M. Farinella and Antonino Furnari

Abstract: We consider the problem of inferring when the internal map of an indoor navigation system is misaligned with respect to the real world (world-map misalignment), which can lead to misleading directions given to the user. We note that world-map misalignment can be predicted from an RGB image of the environment and the floor segmentation mask obtained from the internal map of the navigation system. Since collecting and labelling large amounts of real data is expensive, we developed a tool to simulate human navigation, which is used to generate automatically labelled synthetic data from 3D models of environments. Thanks to this tool, we generate a dataset considering 15 different environments, which is complemented by a small set of videos acquired in a real-world scenario and manually labelled for validation purposes. We hence benchmark an approach based on different ResNet18 configurations and compare their results on both synthetic and real images. We achieved an F1 score of 92.37% in the synthetic domain and 75.42% on the proposed real dataset using our best approach. While the results are promising, we also note that the proposed problem is challenging, due to the domain shift between synthetic and real data, and the difficulty in acquiring real data. The dataset and the developed tool are publicly available to encourage research on the topic at the following URL: https://github.com/fpv-iplab/WMM-detection-for-visual-navigation-systems.
Download

Paper Nr: 309
Title:

Region-Transformer: Self-Attention Region Based Class-Agnostic Point Cloud Segmentation

Authors:

Dipesh Gyawali, Jian Zhang and Bijaya B. Karki

Abstract: Point cloud segmentation, which helps us understand the environment of specific structures and objects, can be performed in class-specific and class-agnostic ways. We propose a novel region-based transformer model called Region-Transformer for performing class-agnostic point cloud segmentation. The model utilizes a region-growth approach and self-attention mechanism to iteratively expand or contract a region by adding or removing points. It is trained on simulated point clouds with instance labels only, avoiding semantic labels. Attention-based networks have succeeded in many previous methods of performing point cloud segmentation. However, a region-growth approach with attention-based networks has yet to be used to explore its performance gain. To our knowledge, we are the first to use a self-attention mechanism in a region-growth approach. With the introduction of self-attention to region-growth that can utilize local contextual information of neighborhood points, our experiments demonstrate that the Region-Transformer model outperforms previous class-agnostic and class-specific methods on indoor datasets regarding clustering metrics. The model generalizes well to large-scale scenes. Key advantages include capturing long-range dependencies through self-attention, avoiding the need for semantic labels during training, and applicability to a variable number of objects. The Region-Transformer model represents a promising approach for flexible point cloud segmentation with applications in robotics, digital twinning, and autonomous vehicles.
Download

Paper Nr: 325
Title:

Multidimensional Compressed Sensing for Spectral Light Field Imaging

Authors:

Wen Cao, Ehsan Miandji and Jonas Unger

Abstract: This paper considers a compressive multi-spectral light field camera model that utilizes a one-hot spectral-coded mask and a microlens array to capture spatial, angular, and spectral information using a single monochrome sensor. We propose a model that employs compressed sensing techniques to reconstruct the complete multi-spectral light field from undersampled measurements. Unlike previous work where a light field is vectorized to a 1D signal, our method employs a 5D basis and a novel 5D measurement model, hence, matching the intrinsic dimensionality of multispectral light fields. We mathematically and empirically show the equivalence of 5D and 1D sensing models, and most importantly that the 5D framework achieves orders of magnitude faster reconstruction while requiring a small fraction of the memory. Moreover, our new multidimensional sensing model opens new research directions for designing efficient visual data acquisition algorithms and hardware.
Download

Paper Nr: 356
Title:

Visual Perception of Obstacles: Do Humans and Machines Focus on the Same Image Features?

Authors:

Constantinos A. Kyriakides, Marios Thoma, Zenonas Theodosiou, Harris Partaourides, Loizos Michael and Andreas Lanitis

Abstract: Contemporary cities are fractured by a growing number of barriers, such as on-going construction and infrastructure damages, which endanger pedestrian safety. Automated detection and recognition of such barriers from visual data has been of particular concern to the research community in recent years. Deep Learning (DL) algorithms are now the dominant approach in visual data analysis, achieving excellent results in a wide range of applications, including obstacle detection. However, explaining the underlying operations of DL models remains a key challenge in gaining significant understanding on how they arrive at their decisions. The use of heatmaps that highlight the focal points in input images that helped the models reach their predictions has emerged as a form of post-hoc explainability for such models. In an effort to gain insights into the learning process of DL models, we studied the similarities between heatmaps generated by a number of architectures trained to detect obstacles on sidewalks in images collected via smartphones, and eye-tracking heatmaps generated by humans as they detect the corresponding obstacles on the same data. Our findings indicate that the focus points of humans more closely align with those of a Vision Transformer architecture, as opposed to the other network architectures we examined in our experiments.
Download

Paper Nr: 378
Title:

Reliability and Stability of Mean Opinion Score for Image Aesthetic Quality Assessment Obtained Through Crowdsourcing

Authors:

Egor Ershov, Artyom Panshin, Ivan Ermakov, Nikola Banić, Alex Savchik and Simone Bianco

Abstract: Image quality assessment (IQA) is widely used to evaluate the results of image processing methods. While in recent years the development of objective IQA metrics has seen much progress, there are still many tasks where subjective IQA is significantly more preferred. Using subjective IQA has become even more attractive ever since crowdsourcing platforms such as Amazon Mechanical Turk and Toloka have become available. However, for some specific image processing tasks, there are still some questions related to subjective IQA that have not been solved in a satisfactory way. An example of such a task is the evaluation of image rendering styles where, unlike in the case of distortions, none of the evaluated styles is to be objectively regarded as a priori better or worse. The questions that have not been properly answered up until now are whether the scores for such a task obtained through crowdsourced subjective IQA are reliable and whether they remain stable, i.e., similar if the evaluation is repeated over time. To answer these questions, in this paper first several images and styles are selected and defined, they are then evaluated by using crowdsourced subjective IQA on the Toloka platform, and the obtained scores are numerically analyzed. Experimental results confirm the reliability and stability of the crowdsourced subjective IQA for the problem in question. The experimental data is available at https://zenodo.org/records/10458531.
Download

Paper Nr: 383
Title:

Detecting Anomalous 3D Point Clouds Using Pre-Trained Feature Extractors

Authors:

Dario Mantegazza and Alessandro Giusti

Abstract: In this paper we explore the status of the research effort for the task of 3D visual anomaly detection; in particular, we investigate whether it is possible to find anomalies on 3D point clouds using off-the-shelf feature extractors, similar to what is already feasible on images, without the requirement of an ad-hoc one. Our work uses a model composed of two parts: a feature extraction module and an anomaly detection head. The latter is fixed and works on the embeddings from the feature extraction module. Using the MVTec-3D dataset, we contribute a comparison between a 3D point cloud features extractor, a 2D image features extractor, a combination of the two, and three baselines. We also compare our work with other models on the dataset’s DETECTION-AUROC benchmark. The experiment results demonstrate that, while our proposed approach surpasses the baselines and some other approaches, our best-performing model cannot beat purposely developed ones. We conclude that a combination of dataset size and 3D data complexity is the culprit behind the lack of off-the-shelf feature extractors for solving complex 3D vision tasks.
Download

Paper Nr: 430
Title:

Decoding Visual Stimuli and Visual Imagery Information from EEG Signals Utilizing Multi-Perspective 3D-CNN Based Hierarchical Deep-Fusion Learning Network

Authors:

Fatma Y. Emanet and Kazim Sekeroglu

Abstract: Brain-Computer Interface Systems (BCIs) facilitate communication between the brain and machines, enabling applications such as diagnosis, understanding brain function, and cognitive augmentation. This study explores the classification of visual stimuli and visual imagery using electroencephalographic (EEG) data. The proposed method utilizes 3D EEG data generated by transforming 1D EEG data into 2D Spatiotemporal EEG image mappings for feature extraction and classification. Additionally, a multi-perspective 3D CNN-based hierarchical deep fusion learning network is employed to classify multi-dimensional spatiotemporal EEG data, decoding brain activity for visual and visual imagery stimulation. The findings show that the suggested multi-perspective fusion method performs better than a standalone model, indicating promising progress in using BCIs to understand and utilize brain signals for visual and imagined stimulation.
Download

Paper Nr: 28
Title:

Finding and Navigating to Humans in Complex Environments for Assistive Tasks

Authors:

Asfand Yaar, Antonino Furnari, Marco Rosano, Aki Härmä and Giovanni M. Farinella

Abstract: Finding and reaching humans in unseen environments is a major challenge for intelligent agents and social robots. Effective exploration and navigation strategies are necessary to locate the human performing various activities. In this paper, we propose a problem formulation in which the robot is required to locate and reach humans in unseen environments. To tackle this task, we design an approach that makes use of state-of-the-art components to allow the agent to explore the environment, identify the human’s location on the map, and approach them while maintaining a safe distance. To include human models, we utilized Blender to modify the scenes of the Gibson dataset. We conducted experiments using the Habitat simulator, where the proposed approach achieves promising results. The success of our approach is measured by the distance and orientation difference between the robot and the human at the end of the episode. We will release the source code and 3D human models for researchers to benchmark their assistive systems.
Download

Paper Nr: 114
Title:

Enhancing Object Detection Accuracy with Variational Autoencoders as a Filter in YOLO

Authors:

Shubham K. Dubey, J. V. Satyanarayana and C. K. Mohan

Abstract: Object detection is an important task in computer vision systems, encompassing a diverse spectrum of applications, including but not limited to autonomous vehicular navigation and surveillance. Despite considerable advancements in object detection models such as YOLO, the issue of false positive detections remains a prevalent concern, causing misclassifications and diminishing the reliability of these systems. This research endeavors to present an innovative methodology designed to augment object detection accuracy by incorporating Variational Autoencoders (VAEs) as a filtration mechanism within the YOLO framework. This integration seeks to rectify the issue of false positive detections, ultimately fostering a marked enhancement in detection precision and strengthening the overall dependability of object detection systems.
Download
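The filtering idea in the abstract above can be sketched in a few lines. This is an assumed pipeline, not the paper's implementation: a VAE trained on true-positive crops tends to reconstruct them well, so a detection whose crop yields a high reconstruction error is treated as a likely false positive. The dictionary layout and `recon_error_fn` are hypothetical.

```python
def filter_detections(detections, recon_error_fn, max_error):
    """Keep only detections whose cropped patch the VAE reconstructs
    well; a high reconstruction error suggests a false positive.

    detections    : list of dicts with a "crop" entry (image patch)
    recon_error_fn: callable mapping a crop to its VAE reconstruction error
    max_error     : calibrated rejection threshold
    """
    kept = []
    for det in detections:
        if recon_error_fn(det["crop"]) <= max_error:
            kept.append(det)
    return kept
```

The threshold would typically be chosen on a validation set to trade recall against the false-positive rate.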

Paper Nr: 189
Title:

Automatic Error Correction of GPT-Based Robot Motion Generation by Partial Affordance of Tool

Authors:

Takahiro Suzuki, Yuta Ando and Manabu Hashimoto

Abstract: In this research, we proposed a technique that, given a simple instruction such as “Please make a cup of coffee” as would commonly be used when one human gives another human an instruction, determines an appropriate robot motion sequence and the tools to be used for that task and generates a motion trajectory for a robot to execute the task. The proposed method uses a large language model (GPT) to determine robot motion sequences and tools to be used. However, GPT may select tools that do not exist in the scene or are not appropriate. To correct this error, our research focuses on function and functional consistency. An everyday object has a role assigned to each region of that object, such as “scoop” or “contain”. There are also constraints such as the fact that a ladle must have scoop and grasp functions. The proposed method judges whether the tools in the scene are inconsistent with these constraints, and automatically corrects the tools as necessary. Experimental results confirmed that the proposed method was able to generate motion sequences from a simple instruction and that the proposed method automatically corrects errors in GPT outputs.
Download

Paper Nr: 306
Title:

Combining Progressive Hierarchical Image Encoding and YOLO to Detect Fish in Their Natural Habitat

Authors:

Antoni Burguera

Abstract: This paper explores the advantages of evaluating Progressive Image Encoding (PIE) methods in the context of the specific task for which they will be used. By focusing on a particular task —fish detection in their natural habitat— and a specific PIE algorithm — Progressive Hierarchical Image Encoding (PHIE)—, the paper investigates the performance of You Only Look Once (YOLO) in detecting fish in underwater images using PHIE-encoded images. This is particularly relevant in underwater environments where image transmission is slow. Results provide insights into the advantages and drawbacks of PHIE image encoding and decoding, not from the perspective of general metrics such as reconstructed image quality but from the viewpoint of its impact on a task —fish detection— that depends on the PHIE encoded and decoded images.
Download

Area 3 - Image and Video Understanding

Full Papers
Paper Nr: 23
Title:

Enabling On-Device Continual Learning with Binary Neural Networks and Latent Replay

Authors:

Lorenzo Vorabbi, Davide Maltoni, Guido Borghi and Stefano Santi

Abstract: On-device learning remains a formidable challenge, especially when dealing with resource-constrained devices that have limited computational capabilities. This challenge is primarily rooted in two key issues: first, the memory available on embedded devices is typically insufficient to accommodate the memory-intensive back-propagation algorithm, which often relies on floating-point precision. Second, the development of learning algorithms on models with extreme quantization levels, such as Binary Neural Networks (BNNs), is critical due to the drastic reduction in bit representation. In this study, we propose a solution that combines recent advancements in the field of Continual Learning (CL) and Binary Neural Networks to enable on-device training while maintaining competitive performance. Specifically, our approach leverages binary latent replay (LR) activations and a novel quantization scheme that significantly reduces the number of bits required for gradient computation. The experimental validation demonstrates a significant accuracy improvement in combination with a noticeable reduction in memory requirement, confirming the suitability of our approach in expanding the practical applications of deep learning in real-world scenarios.
Download

Paper Nr: 40
Title:

Uncertainty-Based Detection of Adversarial Attacks in Semantic Segmentation

Authors:

Kira Maag and Asja Fischer

Abstract: State-of-the-Art deep neural networks have proven to be highly powerful in a broad range of tasks, including semantic image segmentation. However, these networks are vulnerable against adversarial attacks, i.e., non-perceptible perturbations added to the input image causing incorrect predictions, which is hazardous in safety-critical applications like automated driving. Adversarial examples and defense strategies are well studied for the image classification task, while there has been limited research in the context of semantic segmentation. First works however show that the segmentation outcome can be severely distorted by adversarial attacks. In this work, we introduce an uncertainty-based approach for the detection of adversarial attacks in semantic segmentation. We observe that uncertainty as for example captured by the entropy of the output distribution behaves differently on clean and perturbed images and leverage this property to distinguish between the two cases. Our method works in a light-weight and post-processing manner, i.e., we do not modify the model or need knowledge of the process used for generating adversarial examples. In a thorough empirical analysis, we demonstrate the ability of our approach to detect perturbed images across multiple types of adversarial attacks.
Download
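The entropy cue described in the abstract above can be made concrete with a small sketch. This is an illustrative, assumed pipeline rather than the authors' code: compute the mean per-pixel entropy of the segmentation network's softmax output and flag an image when it exceeds a threshold calibrated on clean data.

```python
import numpy as np

def mean_pixel_entropy(logits):
    """logits: (C, H, W) array of per-class scores for one image.
    Returns the mean per-pixel entropy of the softmax distribution."""
    z = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=0, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=0)       # (H, W) entropy map
    return float(ent.mean())

def flag_adversarial(logits, threshold):
    # Simple post-processing detector: high average uncertainty
    # on the output distribution -> suspected perturbed input.
    return mean_pixel_entropy(logits) > threshold
```

Because the detector only reads the output distribution, it needs no access to the attack-generation process and no changes to the segmentation model, matching the light-weight, post-processing setting of the abstract.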

Paper Nr: 44
Title:

Synthesizing Classifiers from Prior Knowledge

Authors:

G. J. Burghouts, K. Schutte, M. Kruithof, W. Huizinga, F. Ruis and H. Kuijf

Abstract: Various good methods have been proposed for either zero-shot or few-shot learning, but these are commonly unsuited for both; whereas in practice one often starts without labels and some might become available later. We propose a method that naturally ties zero- and few-shot learning together. We initiate a zero-shot model from prior knowledge about the classes, by recombining the weights from a classification head via a linear reconstruction that is sparse to avoid overfitting. Our mapping is an explicit transfer of knowledge from known to new classes, hence it can be inspected and visualized, which is impossible with recently popular implicit prompt learning strategies. Our mapping is used to construct a classifier for the new class, by adapting the neural weights of the classifiers for the known classes. Effectively we synthesize a new classifier. Our method is flexible: we show its efficacy for various knowledge representations and various neural networks (whereas prompt learning is limited to language-vision models). Our synthesized classifier can operate directly on test samples in a zero-shot fashion. We outperform CLIP especially for uncommon image classes, sometimes by margins up to 32%. Because the synthesized classifier consists of a tensor layer, it can be optimized further when a (few) labeled images become available. For few-shot learning, our synthesized classifier provides a kickstart. With one label per class, it outperforms strong baselines that require annotation of attributes or heavy pretraining (CLIP) by 8%, and increases accuracy by 39% relative to conventional classifier initialization. The code is available.
Download
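The core recombination idea of the abstract above, reconstructing a new class from known classes via a sparse linear mapping and transferring the same coefficients to the classifier weights, can be sketched as follows. This is only an illustration of the principle, not the paper's method: it uses ordinary least squares followed by hard top-k truncation in place of a proper sparse solver, and all names are hypothetical.

```python
import numpy as np

def synthesize_classifier(W_known, target_attr, known_attrs, k=5):
    """Sketch: express a new class's attribute vector as a sparse linear
    combination of known classes' attributes, then apply the same
    coefficients to the known classification-head weights.

    W_known     : (n_classes, d_feat) weights of the known classifiers
    target_attr : (d_attr,) knowledge vector of the new class
    known_attrs : (n_classes, d_attr) knowledge vectors of known classes
    k           : number of coefficients kept (sparsity, to avoid overfitting)
    """
    alpha, *_ = np.linalg.lstsq(known_attrs.T, target_attr, rcond=None)
    keep = np.argsort(np.abs(alpha))[-k:]    # retain the k largest coefficients
    sparse = np.zeros_like(alpha)
    sparse[keep] = alpha[keep]
    return W_known.T @ sparse                # weight vector for the new class
```

Because the coefficients are explicit, the mapping from known to new classes can be inspected directly, which is the interpretability advantage the abstract claims over implicit prompt learning.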

Paper Nr: 45
Title:

StyleHumanCLIP: Text-Guided Garment Manipulation for StyleGAN-Human

Authors:

Takato Yoshikawa, Yuki Endo and Yoshihiro Kanamori

Abstract: This paper tackles text-guided control of StyleGAN for editing garments in full-body human images. Existing StyleGAN-based methods suffer from handling the rich diversity of garments and body shapes and poses. We propose a framework for text-guided full-body human image synthesis via an attention-based latent code mapper, which enables more disentangled control of StyleGAN than existing mappers. Our latent code mapper adopts an attention mechanism that adaptively manipulates individual latent codes on different StyleGAN layers under text guidance. In addition, we introduce feature-space masking at inference time to avoid unwanted changes caused by text inputs. Our quantitative and qualitative evaluations reveal that our method can control generated images more faithfully to given texts than existing methods.
Download

Paper Nr: 62
Title:

S3Aug: Segmentation, Sampling, and Shift for Action Recognition

Authors:

Taiki Sugiura and Toru Tamaki

Abstract: Action recognition is a well-established area of research in computer vision. In this paper, we propose S3Aug, a video data augmentation for action recognition. Unlike conventional video data augmentation methods that involve cutting and pasting regions from two videos, the proposed method generates new videos from a single training video through segmentation and label-to-image transformation. Furthermore, the proposed method modifies certain categories of label images by sampling to generate a variety of videos, and shifts intermediate features to enhance the temporal coherency between frames of the generated videos. Experimental results on the UCF101, HMDB51, and Mimetics datasets demonstrate the effectiveness of the proposed method, particularly for out-of-context videos of the Mimetics dataset.
Download

Paper Nr: 77
Title:

Attention-Based Shape and Gait Representations Learning for Video-Based Cloth-Changing Person Re-Identification

Authors:

Vuong D. Nguyen, Samiha Mirza, Pranav Mantini and Shishir K. Shah

Abstract: Current state-of-the-art Video-based Person Re-Identification (Re-ID) primarily relies on appearance features extracted by deep learning models. These methods are not applicable for long-term analysis in real-world scenarios where persons have changed clothes, making appearance information unreliable. In this work, we deal with the practical problem of Video-based Cloth-Changing Person Re-ID (VCCRe-ID) by proposing “Attention-based Shape and Gait Representations Learning” (ASGL) for VCCRe-ID. Our ASGL framework improves Re-ID performance under clothing variations by learning clothing-invariant gait cues using a Spatial-Temporal Graph Attention Network (ST-GAT). Given the 3D-skeleton-based spatial-temporal graph, our proposed ST-GAT comprises multi-head attention modules, which are able to enhance the robustness of gait embeddings under viewpoint changes and occlusions. The ST-GAT amplifies the important motion ranges and reduces the influence of noisy poses. Then, the multi-head learning module effectively reserves beneficial local temporal dynamics of movement. We also boost discriminative power of person representations by learning body shape cues using a GAT. Experiments on two large-scale VCCRe-ID datasets demonstrate that our proposed framework outperforms state-of-the-art methods by 12.2% in rank-1 accuracy and 7.0% in mAP.
Download

Paper Nr: 110
Title:

Reducing Bias in Pre-Trained Models by Tuning While Penalizing Change

Authors:

Niklas Penzel, Gideon Stein and Joachim Denzler

Abstract: Deep models trained on large amounts of data often incorporate implicit biases present during training time. If later such a bias is discovered during inference or deployment, it is often necessary to acquire new data and retrain the model. This behavior is especially problematic in critical areas such as autonomous driving or medical decision-making. In these scenarios, new data is often expensive and hard to come by. In this work, we present a method based on change penalization that takes a pre-trained model and adapts the weights to mitigate a previously detected bias. We achieve this by tuning a zero-initialized copy of a frozen pre-trained network. Our method needs very few examples, in extreme cases only a single one, that contradict the bias to increase performance. Additionally, we propose an early stopping criterion to modify baselines and reduce overfitting. We evaluate our approach on a well-known bias in skin lesion classification and three other datasets from the domain shift literature. We find that our approach works especially well with very few images. Simple fine-tuning combined with our early stopping also leads to performance benefits for a larger number of tuning samples.
Download

Paper Nr: 111
Title:

Important Pixels Sampling for NeRF Training Based on Edge Values and Squared Errors Between the Ground Truth and the Estimated Colors

Authors:

Kohei Fukuda, Takio Kurita and Hiroaki Aizawa

Abstract: Neural Radiance Fields (NeRF) has impacted computer graphics and computer vision by enabling fine 3D representations using neural networks. However, depending on the data (especially on synthetic datasets with single-color backgrounds), the neural network training of NeRF is often unstable, and the rendering results become poor. This paper proposes a method to sample the informative pixels to remedy these shortcomings. The sampling method consists of two phases. In the early stage of learning (up to 1/10 of all iterations), the sampling probability is determined based on the edge strength obtained by edge detection. Also, we use the squared errors between the ground truth and the estimated color of the pixels for sampling. The introduction of these tweaks improves the learning of NeRF. In the experiment, we confirmed the effectiveness of the method. In particular, for small amounts of data, the training process of the neural network for NeRF was accelerated and stabilized.
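The two-phase sampling schedule described in this abstract can be sketched as follows. This is an illustrative NumPy reconstruction, not the authors' code: the finite-difference edge detector, the exact 1/10 phase boundary, and the batch size are assumptions.

```python
import numpy as np

def sampling_probs(image_gray, sq_errors, step, total_steps, edge_phase=0.1):
    """Per-pixel probabilities for selecting training rays.

    For the first `edge_phase` fraction of iterations the probability
    follows edge strength; afterwards it follows the squared error
    between the estimated and ground-truth colors.
    """
    if step < edge_phase * total_steps:
        gy, gx = np.gradient(image_gray)      # simple finite-difference edges
        weights = np.hypot(gx, gy)
    else:
        weights = sq_errors
    weights = weights + 1e-8                  # keep every pixel selectable
    return weights / weights.sum()

H, W = 8, 8
img = np.zeros((H, W))
img[:, 4:] = 1.0                              # toy image with one vertical edge
errs = np.random.default_rng(0).random((H, W))
p_early = sampling_probs(img, errs, step=5, total_steps=100)
idx = np.random.default_rng(0).choice(H * W, size=16, p=p_early.ravel())
```

In the edge phase, pixels along the vertical edge receive almost all of the sampling mass, which matches the abstract's motivation of concentrating rays on informative regions early in training.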
Download

Paper Nr: 124
Title:

Pixel-Wise Gradient Uncertainty for Convolutional Neural Networks Applied to Out-of-Distribution Segmentation

Authors:

Kira Maag and Tobias Riedlinger

Abstract: In recent years, deep neural networks have defined the state-of-the-art in semantic segmentation where their predictions are constrained to a predefined set of semantic classes. They are to be deployed in applications such as automated driving, although their categorically confined expressive power runs contrary to such open world scenarios. Thus, the detection and segmentation of objects from outside their predefined semantic space, i.e., out-of-distribution (OoD) objects, is of highest interest. Since uncertainty estimation methods like softmax entropy or Bayesian models are sensitive to erroneous predictions, these methods are a natural baseline for OoD detection. Here, we present a method for obtaining uncertainty scores from pixel-wise loss gradients which can be computed efficiently during inference. Our approach is simple to implement for a large class of models, does not require any additional training or auxiliary data and can be readily used on pre-trained segmentation models. Our experiments show the ability of our method to identify wrong pixel classifications and to estimate prediction quality at negligible computational overhead. In particular, we observe superior performance in terms of OoD segmentation to comparable baselines on the SegmentMeIfYouCan benchmark, clearly outperforming other methods.
Download

Paper Nr: 153
Title:

Enabling RAW Image Classification Using Existing RGB Classifiers

Authors:

Rasmus Munksø, Mathias V. Andersen, Lau Nørgaard, Andreas Møgelmose and Thomas B. Moeslund

Abstract: Unprocessed RAW data stands out as a highly valuable image format in image editing and computer vision because it preserves more details, colors, and a wider dynamic range, as captured directly from the camera’s sensor, than non-linearly processed RGB images. Despite its advantages, the computer vision community has largely overlooked RAW files, especially in domains where preserving precise details and accurate colors is crucial. This work addresses this oversight by leveraging transfer learning techniques. By exploiting the vast amount of available RGB data, we enhance the usability of a limited RAW image dataset for image classification. Surprisingly, applying transfer learning from an RGB-trained model to a RAW dataset yields impressive performance, reducing the dataset size barrier in RAW research. These results are promising, demonstrating the potential of cross-domain transfer learning between RAW and RGB data and opening doors for further exploration in this area of research.
Download

Paper Nr: 156
Title:

Non-Local Context-Aware Attention for Object Detection in Remote Sensing Images

Authors:

Yassin Terraf, El M. Mercha and Mohammed Erradi

Abstract: Object detection in remote sensing images has been widely studied due to the valuable insights it provides for different fields. Detecting objects in remote sensing images is a very challenging task due to the diverse range of sizes, orientations, and appearances of objects within the images. Many approaches have been developed to address these challenges, primarily focusing on capturing semantic information while missing out on contextual details that can bring more insights to the analysis. In this work, we propose a Non-Local Context-Aware Attention (NLCAA) approach for object detection in remote sensing images. NLCAA includes semantic and contextual attention modules to capture both semantic and contextual information. Extensive experiments were conducted on two publicly available datasets, namely NWPU VHR and DIOR, to evaluate the performance of the proposed approach. The experimental results demonstrate the effectiveness of the NLCAA approach against various state-of-the-art methods.
Download

Paper Nr: 178
Title:

Mediapi-RGB: Enabling Technological Breakthroughs in French Sign Language (LSF) Research Through an Extensive Video-Text Corpus

Authors:

Yanis Ouakrim, Hannah Bull, Michèle Gouiffès, Denis Beautemps, Thomas Hueber and Annelies Braffort

Abstract: We introduce Mediapi-RGB, a new dataset of French Sign Language (LSF) along with the first LSF-to-French machine translation model. With 86 hours of video, it is the largest LSF corpus with translations. The corpus consists of original content in French Sign Language produced by deaf journalists, and has subtitles in written French aligned to the signing. The current release of Mediapi-RGB is available at the Ortolang corpus repository (https://www.ortolang.fr/workspaces/mediapi-rgb), and can be used for academic research purposes. The test and validation sets contain 13 and 7 hours of video respectively. The training set contains 66 hours of video that will be released progressively until December 2024. Additionally, the current release contains skeleton keypoints, sign temporal segmentation, spatio-temporal features and subtitles for all the videos in the train, validation and test sets, as well as a suggested vocabulary of nouns for evaluation purposes. In addition, we present the results obtained on this corpus with the first LSF-to-French translation baseline to give an overview of the possibilities offered by this corpus of unprecedented caliber for LSF. Finally, we suggest potential technological and linguistic applications for this new video-text dataset.
Download

Paper Nr: 209
Title:

When Medical Imaging Met Self-Attention: A Love Story That Didn’t Quite Work out

Authors:

Tristan Piater, Niklas Penzel, Gideon Stein and Joachim Denzler

Abstract: A substantial body of research has focused on developing systems that assist medical professionals during labor-intensive early screening processes, many based on convolutional deep-learning architectures. Recently, multiple studies explored the application of so-called self-attention mechanisms in the vision domain. These studies often report empirical improvements over fully convolutional approaches on various datasets and tasks. To evaluate this trend for medical imaging, we extend two widely adopted convolutional architectures with different self-attention variants on two different medical datasets. With this, we aim to specifically evaluate the possible advantages of additional self-attention. We compare our models with similarly sized convolutional and attention-based baselines and evaluate performance gains statistically. Additionally, we investigate how including such layers changes the features learned by these models during the training. Following a hyperparameter search, and contrary to our expectations, we observe no significant improvement in balanced accuracy over fully convolutional models. We also find that important features, such as dermoscopic structures in skin lesion images, are still not learned by employing self-attention. Finally, analyzing local explanations, we confirm biased feature usage. We conclude that merely incorporating attention is insufficient to surpass the performance of existing fully convolutional methods.
Download

Paper Nr: 248
Title:

RailCloud-HdF: A Large-Scale Point Cloud Dataset for Railway Scene Semantic Segmentation

Authors:

Mahdi Abid, Mathis Teixeira, Ankur Mahtani and Thomas Laurent

Abstract: Semantic scene perception is critical for various applications, including railway systems where safety and efficiency are paramount. Railway applications demand precise knowledge of the environment, making Light Detection and Ranging (LiDAR) a fundamental component of sensor suites. Despite the significance of 3D semantic scene understanding in railway context, there exists no publicly available railborne LiDAR dataset tailored for this purpose. In this work, we present a large-scale point cloud dataset designed to advance research in LiDAR-based semantic scene segmentation for railway applications. Our dataset offers dense point-wise annotations for diverse railway scenes, covering over 267 km. To facilitate rigorous evaluation and benchmarking, we propose semantic segmentation of point clouds from a single LiDAR scan as a challenging task. Furthermore, we provide baseline experiments to showcase some state-of-the-art deep learning methods for this task. Our findings highlight the need for more advanced models to effectively address this task. This dataset not only catalyzes the development of sophisticated methods for railway applications, but also encourages exploration of novel research directions.
Download

Paper Nr: 255
Title:

Investigating the Corruption Robustness of Image Classifiers with Random p-norm Corruptions

Authors:

Georg Siedel, Weijia Shao, Silvia Vock and Andrey Morozov

Abstract: Robustness is a fundamental property of machine learning classifiers required to achieve safety and reliability. In the field of adversarial robustness of image classifiers, robustness is commonly defined as the stability of a model to all input changes within a p-norm distance. However, in the field of random corruption robustness, variations observed in the real world are used, while p-norm corruptions are rarely considered. This study investigates the use of random p-norm corruptions to augment the training and test data of image classifiers. We evaluate the model robustness against imperceptible random p-norm corruptions and propose a novel robustness metric. We empirically investigate whether robustness transfers across different p-norms and derive conclusions on which p-norm corruptions a model should be trained and evaluated. We find that training data augmentation with a combination of p-norm corruptions significantly improves corruption robustness, even on top of state-of-the-art data augmentation schemes.
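The training and test augmentation described in this abstract relies on drawing random corruptions with a prescribed p-norm. A minimal sketch of such a draw is given below; the Gaussian direction, the corruption budget `eps`, and the clipping to a valid pixel range are assumptions for illustration, not the paper's exact sampling scheme.

```python
import numpy as np

def random_pnorm_corruption(image, p, eps, rng):
    """Additive random corruption whose p-norm equals eps (a sketch).

    For finite p, a random direction is rescaled to p-norm eps; for
    p = inf, every component is pushed to +/- eps.
    """
    delta = rng.standard_normal(image.shape)
    if np.isinf(p):
        delta = np.sign(delta) * eps
    else:
        delta = delta * (eps / np.linalg.norm(delta.ravel(), ord=p))
    return np.clip(image + delta, 0.0, 1.0)   # keep a valid pixel range

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
corrupted = random_pnorm_corruption(img, p=2, eps=0.5, rng=rng)
```

Training-time augmentation would then draw a fresh `p` and `delta` per batch, which is one way to realize the "combination of p-norm corruptions" the abstract evaluates.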
Download

Paper Nr: 267
Title:

Calisthenics Skills Temporal Video Segmentation

Authors:

Antonio Finocchiaro, Giovanni M. Farinella and Antonino Furnari

Abstract: Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.
Download

Paper Nr: 288
Title:

Detecting Anomalies in Textured Images Using Modified Transformer Masked Autoencoder

Authors:

Afshin Dini and Esa Rahtu

Abstract: We present a new method for detecting and locating anomalies in textured-type images using transformer-based autoencoders. In this approach, a rectangular patch of an image is masked by setting its value to gray and then fed into a pre-trained autoencoder with several blocks of transformer encoders and decoders in order to reconstruct the unknown part. It is shown that the pre-trained model is not able to reconstruct the defective parts properly when they are inside the masked patch. In this regard, the combination of the Structural Similarity Index Measure and the absolute error between the reconstructed image and the original one can be used to define a new anomaly map to find and locate anomalies. In the experiment with the textured images of the MVTec dataset, we discover that this approach not only finds anomalous samples properly, but the anomaly map itself also specifies the exact locations of defects correctly at the same time. Moreover, our method is not only computationally efficient, as it utilizes a pre-trained model and does not require any training, but also performs better than previous autoencoders and other reconstruction-based methods. For these reasons, one can use this method as a base approach to find and locate irregularities in real-world applications.
Download

Paper Nr: 304
Title:

Parts-Based Implicit 3D Face Modeling

Authors:

Yajie Gu and Nick Pears

Abstract: Previous 3D face analysis has focussed on 3D facial identity, expression and pose disentanglement. However, the independent control of different facial parts and the ability to learn explainable parts-based latent shape embeddings for implicit surfaces remain open problems. We propose a method for 3D face modeling that learns a continuous parts-based deformation field that maps the various semantic parts of a subject’s face to a template. By swapping affine-mapped facial features among different individuals from predefined regions we achieve significant parts-based training data augmentation. Moreover, by sequentially morphing the surface points of these parts, we learn corresponding latent representations, shape deformation fields, and the signed distance function of a template shape. This gives improved shape controllability and better interpretability of the face latent space, while retaining all of the known advantages of implicit surface modelling. Unlike previous works that generated new faces based on full-identity latent representations, our approach enables independent control of different facial parts, i.e. nose, mouth, eyes, and also the remaining surface, and yet generates new faces with high reconstruction quality. Evaluations demonstrate both facial expression and parts disentanglement, independent control of those facial parts, as well as state-of-the-art facial parts reconstruction, when evaluated on the FaceScape and Headspace datasets.
Download

Paper Nr: 328
Title:

Robust Long-Tailed Image Classification via Adversarial Feature Re-Calibration

Authors:

Jinghao Zhang, Zhenhua Feng and Yaochu Jin

Abstract: Long-tailed data distribution is a common issue in many practical learning-based approaches, causing Deep Neural Networks (DNNs) to under-fit minority classes. Although this biased problem has been extensively studied by the research community, the existing approaches mainly focus on the class-wise (inter-class) imbalance problem. In contrast, this paper considers both inter-class and intra-class data imbalance problems for network training. To this end, we present Adversarial Feature Re-calibration (AFR), a method that improves the standard accuracy of a trained deep network by adding adversarial perturbations to the majority samples of each class. To be specific, an adversarial attack model is fine-tuned to perturb the majority samples by injecting the features from their corresponding intra-class long-tailed minority samples. This procedure makes the dataset more evenly distributed from both the inter- and intra-class perspectives, thus encouraging DNNs to learn better representations. The experimental results obtained on CIFAR-100-LT demonstrate the effectiveness and superiority of the proposed AFR method over the state-of-the-art long-tailed learning methods.
Download

Paper Nr: 331
Title:

Alias-Free GAN for 3D-Aware Image Generation

Authors:

Attila Szabó, Yevgeniy Puzikov, Sahan Ayvaz, Sonia Aurelio, Peter Gehler, Reza Shirvany and Malte Alf

Abstract: In this work we build a 3D-aware generative model that produces high quality results with fast inference times. A 3D-aware model generates images and offers control over camera parameters to the user, so that an object can be shown from different viewpoints. The model we build combines the best of two worlds in a very direct way: alias-free Generative Adversarial Networks (GAN) and Neural Radiance Field (NeRF) rendering, followed by image super-resolution. We show that fast and high-quality image synthesis is possible with careful modifications of the well-designed architecture of StyleGAN3. Our design overcomes the problem of viewpoint inconsistency and aliasing artefacts that a direct application of lower-resolution NeRF would exhibit. We show experimental evaluation on two standard benchmark datasets, FFHQ and AFHQv2, and achieve the best or competitive performance on both. Our method does not sacrifice speed: we can render images at megapixel resolution at interactive frame rates.
Download

Paper Nr: 360
Title:

Conditional Vector Graphics Generation for Music Cover Images

Authors:

Ivan Jarsky, Valeria Efimova, Ilya Bizyaev and Andrey Filchenkov

Abstract: Generative Adversarial Networks (GAN) have motivated a rapid growth of the domain of computer image synthesis. As almost all the existing image synthesis algorithms consider an image as a pixel matrix, high-resolution image synthesis is complicated. A good alternative can be vector images. However, they belong to a highly sophisticated parametric space, which is a restriction for solving the task of synthesizing vector graphics by GANs. In this paper, we consider a specific application domain that softens this restriction dramatically, allowing the usage of vector image synthesis. Music cover images should meet the requirements of Internet streaming services and printing standards, which imply high resolution of graphic materials without any additional requirements on the content of such images. Existing music cover image generation services do not analyze the tracks themselves; at most, some consider only genre tags. To generate music covers as vector images that reflect the music and consist of simple geometric objects, we suggest a GAN-based algorithm called CoverGAN. The assessment of resulting images is based on their correspondence to the music, compared with AttnGAN and DALL-E text-to-image generation according to title or lyrics. Moreover, the significance of the patterns found by CoverGAN has been evaluated in terms of the correspondence of the generated cover images to the musical tracks. Listeners evaluate the music covers generated by the proposed algorithm as quite satisfactory and corresponding to the tracks. Music cover image generation code and demo are available at https://github.com/IzhanVarsky/CoverGAN.
Download

Paper Nr: 365
Title:

Anomaly Detection on Roads Using an LSTM and Normal Maps

Authors:

Yusuke Nonaka, Hideo Saito, Hideaki Uchiyama, Kyota Higa and Masahiro Yamaguchi

Abstract: Detecting anomalies on the road is crucial for generating hazard maps within factory premises and facilitating navigation for visually impaired individuals or robots. This paper proposes a method for anomaly detection on road surfaces using normal maps and a Long Short-Term Memory (LSTM). While existing research primarily focuses on detecting anomalies on the road based on variations in height or color information of images, our approach leverages anomaly detection to identify changes in the spatial structure of the walking scenario. The normal (non-anomaly) data consists of time-series normal maps depicting previously traversed roads, which are utilized to predict the upcoming road conditions. Subsequently, an anomaly score is computed by comparing the predicted normal map with the normal map at time t+1. If the anomaly score exceeds a dynamically set threshold, it indicates the presence of anomalies on the road. The proposed method employs unsupervised learning for anomaly detection. To assess the effectiveness of the proposed method, we conducted accuracy assessments using a custom dataset, taking into account a qualitative comparison with the results of existing methods. The results confirm that the proposed method effectively detects anomalies on road surfaces through anomaly detection.
Download

Paper Nr: 368
Title:

Beyond the Known: Adversarial Autoencoders in Novelty Detection

Authors:

Muhammad Asad, Ihsan Ullah, Ganesh Sistu and Michael G. Madden

Abstract: In novelty detection, the goal is to decide if a new data point should be categorized as an inlier or an outlier, given a training dataset that primarily captures the inlier distribution. Recent approaches typically use deep encoder and decoder network frameworks to derive a reconstruction error, and employ this error either to determine a novelty score, or as the basis for a one-class classifier. In this research, we use a similar framework but with a lightweight deep network, and we adopt a probabilistic score with reconstruction error. Our methodology calculates the probability of whether the sample comes from the inlier distribution or not. This work makes two key contributions. The first is that we compute the novelty probability by linearizing the manifold that holds the structure of the inlier distribution. This allows us to interpret how the probability is distributed and can be determined in relation to the local coordinates of the manifold tangent space. The second contribution is that we improve the training protocol for the network. Our results indicate that our approach is effective at learning the target class, and it outperforms recent state-of-the-art methods on several benchmark datasets.
Download

Paper Nr: 399
Title:

Image Generation from Hyper Scene Graphs with Trinomial Hyperedges Using Object Attention

Authors:

Ryosuke Miyake, Tetsu Matsukawa and Einoshin Suzuki

Abstract: Conditional image generation, which aims to generate images consistent with a user’s input, is one of the critical problems in computer vision. Text-to-image models have succeeded in generating realistic images for simple situations in which a few objects are present. Yet, they often fail to generate consistent images for texts representing complex situations. Scene-graph-to-image models have the advantage of generating images for complex situations based on the structure of a scene graph. We extended a scene-graph-to-image model to an image generation model from a hyper scene graph with trinomial hyperedges. Our model, termed hsg2im, improved the consistency of the generated images. However, hsg2im has difficulty in generating natural and consistent images for hyper scene graphs with many objects. The reason is that the graph convolutional network in hsg2im struggles to capture relations of distant objects. In this paper, we propose a novel image generation model which addresses this shortcoming by introducing object attention layers. We also use an auxiliary layout-to-image model to generate higher-resolution images. Experimental validations on the COCO-Stuff and Visual Genome datasets show that the proposed model generates images that are more natural and more consistent with users’ inputs than the cutting-edge hyper scene-graph-to-image model.
Download

Paper Nr: 424
Title:

CSE: Surface Anomaly Detection with Contrastively Selected Embedding

Authors:

Simon Thomine and Hichem Snoussi

Abstract: Detecting surface anomalies of industrial materials poses a significant challenge within a myriad of industrial manufacturing processes. In recent times, various methodologies have emerged, capitalizing on the advantages of employing a network pre-trained on natural images for the extraction of representative features. Subsequently, these features are subjected to processing through a diverse range of techniques including memory banks, normalizing flow, and knowledge distillation, which have exhibited exceptional accuracy. This paper revisits approaches based on pre-trained features by introducing a novel method centered on target-specific embedding. To capture the most representative features of the texture under consideration, we employ a variant of a contrastive training procedure that incorporates both artificially generated defective samples and anomaly-free samples during training. Exploiting the intrinsic properties of surfaces, we derived a meaningful representation from the defect-free samples during training, facilitating a straightforward yet effective calculation of anomaly scores. The experiments conducted on the MVTEC AD and TILDA datasets demonstrate the competitiveness of our approach compared to state-of-the-art methods.
Download

Short Papers
Paper Nr: 25
Title:

Improving Pseudo-Labelling and Enhancing Robustness for Semi-Supervised Domain Generalization

Authors:

Adnan Khan, Mai A. Shaaban and Muhammad Haris Khan

Abstract: Beyond attaining domain generalization (DG), visual recognition models should also be data-efficient during learning by leveraging limited labels. We study the problem of Semi-Supervised Domain Generalization (SSDG), which is crucial for real-world applications like automated healthcare. SSDG requires learning a cross-domain generalizable model when the given training data is only partially labelled. Empirical investigations reveal that DG methods tend to underperform in SSDG settings, likely because they are unable to exploit the unlabelled data. Semi-supervised learning (SSL) shows improved but still inferior results compared to fully-supervised learning. A key challenge, faced by the best performing SSL-based SSDG methods, is selecting accurate pseudo-labels under multiple domain shifts and reducing overfitting to source domains under limited labels. In this work, we propose a new SSDG approach that utilizes a novel uncertainty-guided pseudo-labelling with model averaging (UPLM). Our uncertainty-guided pseudo-labelling (UPL) uses model uncertainty to improve pseudo-label selection, addressing poor model calibration under multi-source unlabelled data. The UPL technique, enhanced by our novel model averaging (MA) strategy, mitigates overfitting to source domains with limited labels. Extensive experiments on key representative DG datasets suggest that our method demonstrates effectiveness against existing methods. Our code and chosen labelled data seeds are available on GitHub: https://github.com/Adnan-Khan7/UPLM.
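The general idea of uncertainty-guided pseudo-label selection can be sketched as below. This is an illustrative sketch, not the paper's UPLM: treating predictive entropy as the uncertainty measure and the threshold values are assumptions.

```python
import numpy as np

def select_pseudo_labels(probs, tau_conf=0.9, tau_unc=0.5):
    """Keep unlabelled samples whose prediction is both confident and
    low-entropy; returns hard pseudo-labels and a selection mask."""
    conf = probs.max(axis=1)                                # top-class confidence
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # predictive entropy
    mask = (conf >= tau_conf) & (entropy <= tau_unc)
    return probs.argmax(axis=1), mask

probs = np.array([[0.97, 0.02, 0.01],   # confident, low entropy: kept
                  [0.40, 0.35, 0.25]])  # uncertain: rejected
labels, keep = select_pseudo_labels(probs)
```

Only samples passing both tests would contribute to the semi-supervised loss, which is the mechanism by which poorly calibrated predictions under domain shift are filtered out.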
Download

Paper Nr: 32
Title:

Classification of Towels in a Robotic Workcell Using Deep Neural Networks

Authors:

Jens M. Rossen, Patrick S. Terp, Norbert Krüger, Laus S. Bigum and Tudor Morar

Abstract: The industrial laundry industry is becoming increasingly automated. Inwatec, a company specializing in this field, is developing a new robot (BLIZZ) to automate the process of grasping individual clean towels from a pile and handing them over to an external folding machine. However, to ensure that towels are folded consistently, information about the type and faces of the towels is required. This paper presents a proof of concept for a towel type and towel face classification system integrated in BLIZZ. These two classification problems are solved by means of a Deep Neural Network (DNN). The performance of the proposed DNN on each of the two classification problems is presented, along with its performance when solving both classification problems at the same time. It is concluded that the proposed network achieves classification accuracies of 94.48%, 97.71% and 98.52% on the face classification problem for three different towel types with non-identical faces. On the type classification problem, it achieves an accuracy of 99.10% on the full dataset. Additionally, it is concluded that the system achieves an accuracy of 96.96% when simultaneously classifying the type and face of a towel on the full dataset.
Download

Paper Nr: 33
Title:

Evaluating Learning Potential with Internal States in Deep Neural Networks

Authors:

Shogo Takasaki and Shuichi Enokida

Abstract: Deploying deep learning models on small-scale computing devices necessitates considering computational resources. However, reducing the model size to accommodate these resources often results in a trade-off with accuracy. The iterative process of training and validating to optimize model size and accuracy can be inefficient. A potential solution to this dilemma is the extrapolation of learning curves, which evaluates a model’s potential based on initial learning curves. As a result, it is possible to efficiently search for a network that achieves a balance between accuracy and model size. Nonetheless, we posit that a more effective approach to analyzing the latent potential of training models is to focus on the internal state, rather than merely relying on the validation scores. In this vein, we propose a module dedicated to scrutinizing the network’s internal state, with the goal of automating the optimization of both accuracy and network size. Specifically, this paper delves into analyzing the latent potential of the network by leveraging the internal state of the Long Short-Term Memory (LSTM) in a traffic accident prediction network.
Download

Paper Nr: 51
Title:

Cybersecurity Intrusion Detection with Image Classification Model Using Hilbert Curve

Authors:

Punyawat Jaroensiripong, Karin Sumongkayothin, Prarinya Siritanawan and Kazunori Kotani

Abstract: Cybersecurity intrusion detection is crucial for protecting an online system from cyber-attacks. Traditional monitoring methods used in the Security Operation Center (SOC) are insufficient to handle the vast volume of traffic data, producing an overwhelming number of false alarms, and eventually resulting in the neglect of intrusion incidents. The recent integration of Machine Learning (ML) and Deep Learning (DL) into SOC monitoring systems has enhanced intrusion detection capabilities by learning the patterns of network traffic data. Despite many ML methods implemented for intrusion detection, the Convolutional Neural Network (CNN), one of the most high-performing ML algorithms, has not been widely adopted for intrusion detection systems. This research aims to explore the potential of applying CNNs to network data flows. Since the CNN was originally designed for image processing applications, it is necessary to convert the 1-dimensional network data flows into 2-dimensional image data. This research presents a novel approach to convert the network data flow into an image (flow-to-image) by the Hilbert curve mapping algorithm, which preserves the locality of the data. Then, we apply the converted images to the CNN-based intrusion detection system. Eventually, the proposed method and model outperform recent methods with 92.43% accuracy and 93.05% F1-score on the CIC-IDS2017 dataset, and 81.78% accuracy and 83.46% F1-score on the NSL-KDD dataset. In addition to the classification capability, the flow-to-image mapping algorithm can also visualize the characteristics of network attacks on the generated images, which can serve as an alternative monitoring approach for the SOC.
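The flow-to-image idea rests on the standard Hilbert-curve index-to-coordinate conversion, which keeps features that are adjacent in the 1-D flow adjacent in the 2-D image. The sketch below uses the well-known iterative d2xy algorithm; the `flow_to_image` wrapper, its `order` parameter, and the zero-padding are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def d2xy(order, d):
    """Map distance d along a Hilbert curve of side 2**order to (x, y),
    using the standard iterative conversion."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def flow_to_image(features, order=4):
    """Lay a 1-D feature vector onto a 2**order square along the curve,
    so neighbouring flow features stay spatial neighbours."""
    side = 1 << order
    img = np.zeros((side, side), dtype=np.float32)
    for d, value in enumerate(features[: side * side]):
        x, y = d2xy(order, d)
        img[y, x] = value
    return img

img = flow_to_image(list(range(256)), order=4)   # 16x16 image
```

Consecutive curve indices always land on 4-neighbouring pixels, which is the locality-preservation property the abstract highlights as the advantage over naive row-major reshaping.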
Download
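As an illustration of the flow-to-image step described in the abstract, the sketch below maps a 1-D feature vector onto a 2^k x 2^k grid along a Hilbert curve, whose defining property is locality: consecutive 1-D indices land in adjacent cells. This uses the standard d2xy index-to-coordinate conversion, not the authors' implementation; the function names and the zero-padding of short vectors are our own choices.

```python
def d2xy(n, d):
    """Map 1-D index d to (x, y) on an n x n Hilbert curve (n a power of 2).
    Standard iterative conversion with quadrant rotation."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:          # rotate the quadrant when moving along x
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def flow_to_image(features, order):
    """Place a 1-D feature vector into a 2**order x 2**order image along
    the Hilbert curve; missing trailing values are left as zeros."""
    n = 2 ** order
    img = [[0.0] * n for _ in range(n)]
    for d, v in enumerate(features[: n * n]):
        x, y = d2xy(n, d)
        img[y][x] = v
    return img
```

Because neighboring flow attributes end up in neighboring pixels, convolutional filters can pick up local patterns that a row-major reshape would scatter.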

Paper Nr: 61
Title:

Lens Flare-Aware Detector in Autonomous Driving

Authors:

Shanxing Ma and Jan Aelterman

Abstract: Autonomous driving has the potential to reduce traffic accidents, and object detection plays a key role in achieving this. This paper studies object detection in the presence of lens flare. We analyze the impact of lens flare on object detection in autonomous driving tasks and propose a lens flare adaptation method, based on Bayesian reasoning, to optimize existing object detection models. This allows us to adjust detection scores and re-rank the detections of existing models according to the intensity of lens flare, achieving higher average precision. Furthermore, the method requires only simple modifications to the outputs of existing object detection models, making it easy to deploy on existing devices.
Download
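A Bayesian score adjustment of the kind described can be sketched as follows. We treat the detector's confidence as a prior probability that a detection is a true positive and update it with an assumed likelihood ratio that decays with flare intensity; both the exponential-decay model and its rate `k` are hypothetical choices for illustration, not the paper's calibration.

```python
import math

def flare_likelihood_ratio(flare, k=3.0):
    """Assumed ratio P(flare | true positive) / P(flare | false positive),
    modeled as exponentially decaying in flare intensity in [0, 1]."""
    return math.exp(-k * flare)

def rerank(detections, k=3.0):
    """detections: list of (score, flare_intensity) pairs.
    Returns Bayes-updated scores in the input order; sorting these
    descending gives the re-ranked detection list."""
    adjusted = []
    for p, flare in detections:
        l = flare_likelihood_ratio(flare, k)
        posterior = p * l / (p * l + (1.0 - p))
        adjusted.append(posterior)
    return adjusted
```

With zero flare the likelihood ratio is 1 and the score is unchanged; increasing flare pushes the posterior down, demoting flare-affected detections in the ranking.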

Paper Nr: 74
Title:

Towards Rapid Prototyping and Comparability in Active Learning for Deep Object Detection

Authors:

Tobias Riedlinger, Marius Schubert, Karsten Kahl, Hanno Gottschalk and Matthias Rottmann

Abstract: Active learning as a paradigm in deep learning is especially important in applications involving intricate perception tasks such as object detection, where labels are difficult and expensive to acquire. Developing active learning methods in such fields is highly computationally expensive and time-consuming, which obstructs the progression of research and leads to a lack of comparability between methods. In this work, we propose and investigate a sandbox setup for rapid development and transparent evaluation of active learning in deep object detection. Our experiments with commonly used configurations of datasets and detection architectures found in the literature show that results obtained in our sandbox environment are representative of results on standard configurations. The total compute time to obtain results and assess the learning behavior can be reduced by factors of up to 14 compared to Pascal VOC and up to 32 compared to BDD100k. This allows data acquisition and labeling strategies to be tested and evaluated in under half a day and contributes to the transparency and development speed in the field of active learning for object detection.
Download

Paper Nr: 76
Title:

Deep Active Learning with Noisy Oracle in Object Detection

Authors:

Marius Schubert, Tobias Riedlinger, Karsten Kahl and Matthias Rottmann

Abstract: Obtaining annotations for complex computer vision tasks such as object detection is an expensive and time-intensive endeavor involving numerous human workers or expert opinions. Reducing the number of annotations required while maintaining algorithm performance is, therefore, desirable for machine learning practitioners and has been successfully achieved by active learning. However, it is not merely the number of annotations that influences model performance but also their quality. In practice, oracles that are queried for new annotations frequently produce significant amounts of noise. Therefore, cleansing procedures are often necessary to review and correct given labels. This process is subject to the same budget as the initial annotation itself, since it requires human workers or even domain experts. Here, we propose a composite active learning framework that includes a label review module for deep object detection. We show that utilizing part of the annotation budget to partially correct noisy annotations in the active dataset leads to early improvements in model performance, especially when coupled with uncertainty-based query strategies. The precision of the label error proposals significantly influences the measured effect of the label review. In our experiments we achieve improvements of up to 4.5 mAP points by incorporating label reviews at an equal annotation budget.
Download

Paper Nr: 82
Title:

Multi-Task Learning Based on Log Dynamic Loss Weighting for Sex Classification and Age Estimation on Panoramic Radiographs

Authors:

Igor Prado, David Lima, Julian Liang, Ana Hougaz, Bernardo Peters and Luciano Oliveira

Abstract: This paper introduces a multi-task learning (MTL) approach for simultaneous sex classification and age estimation in panoramic radiographs, aligning with tasks pertinent to forensic dentistry. To this end, we dynamically optimize the logarithm of the task-specific loss weights during training. Our results demonstrate the superior performance of the proposed MTL network compared to the individual task-based networks, particularly evident across a diverse dataset comprising 7,666 images, spanning ages from 1 to 90 years and encompassing significant sex variability. Our network achieved an F1-score of 90.37%±0.54 and a mean absolute error of 5.66±0.22 under a cross-validation assessment procedure, corresponding to gains of 1.69 percentage points and 1.15 years over the individual sex classification and age estimation networks, respectively. To the best of our knowledge, this is the first successful MTL-based network for these two tasks.
Download
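Log dynamic loss weighting can be read as a homoscedastic-uncertainty-style scheme in which each task loss L_i is scaled by exp(-s_i) and a regularizing term s_i is added, with the log-weights s_i learned jointly with the network (in the spirit of Kendall et al.; the paper's exact formulation may differ). A minimal sketch under that assumption, in plain Python with an analytic gradient step standing in for an autodiff framework:

```python
import math

def combined_loss(task_losses, log_vars):
    """Weighted multi-task loss: sum_i exp(-s_i) * L_i + s_i.
    A task with a persistently large loss drives its s_i up, which
    down-weights it; the +s_i term keeps weights from collapsing."""
    return sum(math.exp(-s) * L + s for L, s in zip(task_losses, log_vars))

def update_log_vars(task_losses, log_vars, lr=0.1):
    """One gradient-descent step on the log-weights,
    using d/ds_i [exp(-s_i) * L_i + s_i] = -exp(-s_i) * L_i + 1."""
    return [s - lr * (-math.exp(-s) * L + 1.0) for L, s in zip(task_losses, log_vars)]
```

For example, with losses 2.0 and 0.5 and both log-weights at 0, one step raises s for the high-loss task and lowers it for the low-loss one, rebalancing the two objectives automatically.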

Paper Nr: 88
Title:

A Comparative Evaluation of Self-Supervised Methods Applied to Rock Images Classification

Authors:

Van T. Nguyen, Dominique Fourer, Désiré Sidibé, Jean-François Lecomte and Souhail Youssef

Abstract: Digital Rock Physics (DRP) is a discipline that employs advanced computational techniques to analyze and simulate rock properties at the pore-scale level. Recently, Self-Supervised Learning (SSL) has shown promising outcomes in various application domains, but its potential in DRP applications remains largely unexplored. In this study, we assess several self-supervised representation learning methods designed for automatic rock category recognition. We demonstrate how different SSL approaches can be specifically adapted for DRP and comparatively evaluated on a new dataset. Our objective is to leverage unlabeled micro-CT (Computed Tomography) image data to train models that capture intricate rock features and obtain representations that enhance the accuracy of classical machine-learning-based rock image classification. Experimental results on the newly proposed rock image dataset indicate that a model initialized with SSL pretraining outperforms its non-self-supervised counterpart. In particular, we find that MoCo-v2 pretraining provides the greatest benefit when labeled training data are limited, compared to other models, including a fully supervised one.
Download

Paper Nr: 123
Title:

Kore Initial Clustering for Unsupervised Domain Adaptation

Authors:

Kyungsik Lee, Youngmi Jun, EunJi Kim, Suhyun Kim, Seong J. Hwang and Jonghyun Choi

Abstract: In the unsupervised domain adaptation (UDA) literature, there exists an array of techniques for deriving domain-adaptive features. Among them, a particularly successful family of approaches, which pseudo-labels the unlabeled target data, has shown promising results. Yet, the majority of existing methods focus primarily on leveraging only the target domain knowledge for pseudo-labeling, while insufficiently considering the source domain knowledge. Here, we hypothesize that high-quality pseudo-labels obtained via classical K-means clustering, considering both the source and target domains, bring simple yet significant benefits. In particular, we propose to assign pseudo-labels to the target domain's instances, better aligned with the source domain labels, through a simple modification of K-means clustering that emphasizes a strengthened notion of centroids, namely Kore Initial Clustering (KIC). The proposed KIC is readily usable with a wide array of UDA models, consistently improving UDA performance on multiple datasets, including Office-Home and Office-31, and demonstrating the efficacy of pseudo-labels in UDA.
Download
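The core idea of initializing target-domain clustering from source-class centroids can be sketched in a few lines. The sketch below seeds K-means with the per-class means of the labeled source features and then runs Lloyd iterations on the target features, so each resulting cluster inherits a source label; this is a generic reading of the approach, not the authors' code, and the function names are our own.

```python
def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kic_pseudo_labels(source_x, source_y, target_x, iters=10):
    """Seed one centroid per source class, refine on the target with
    Lloyd iterations, and return a source-aligned pseudo-label per
    target instance."""
    classes = sorted(set(source_y))
    cents = [centroid([x for x, y in zip(source_x, source_y) if y == c])
             for c in classes]
    assign = []
    for _ in range(max(1, iters)):
        assign = [min(range(len(cents)), key=lambda k: sqdist(x, cents[k]))
                  for x in target_x]
        for k in range(len(cents)):
            members = [x for x, a in zip(target_x, assign) if a == k]
            if members:
                cents[k] = centroid(members)
    return [classes[a] for a in assign]
```

Because the centroids start at the source-class means, cluster indices map directly to class labels, avoiding the cluster-to-class matching step that random K-means initialization would require.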

Paper Nr: 135
Title:

Interpretable Anomaly Analysis for Surveillance Video

Authors:

Meng Dong

Abstract: Nowadays, there exist plenty of techniques for surveillance video anomaly detection. However, most works focus on detection alone, ignoring the process of interpreting the reasons behind an anomaly, especially in real-time anomaly monitoring. Automatic surveillance systems respond to large numbers of alarms based on the types and scores of anomalies and then report to the proper parties. Usually, various types of anomalies, such as abnormal objects, motion, and behaviors, are captured by surveillance cameras and defined by application requirements. In this work, we investigate the underlying reasons for anomalies and propose a general and interpretable anomaly analysis framework formed by three branches: abnormal object category detection, anomalous motion detection, and abnormal/violent behavior recognition; the related scores are then combined to obtain the final result. These three branches cover the various anomaly types found in the real world. Moreover, the fusion of the branches is multivariate, based on specific domains or user requirements; the branches can work together or individually. In particular, an online non-parametric hierarchical event-updating motion model is proposed to detect general motion anomalies, so that events that occur with low frequency or have never been seen before can be detected in an unsupervised, continually updating way. In addition, abnormal human behaviors, such as falling and violence, can be recognized by a spatial-temporal transformer model. The three branches cover different regions but complement each other for joint detection and interpretable anomaly output. Evaluated on existing datasets, our results are competitive with the online and offline state-of-the-art on several public benchmarks, demonstrating the proposed method's scene-independent and interpretable abilities even with simple motion update methods. Moreover, the performance of the individual anomaly detectors also validates the effectiveness of our proposed method.
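The branch-fusion step admits many instantiations; the sketch below shows one illustrative choice, a weighted mean over whichever branches are active, with per-branch weights standing in for the domain- or user-specific fusion rules the abstract mentions. Names and the fusion rule itself are assumptions, not the paper's implementation.

```python
def fuse_anomaly_scores(branch_scores, weights=None):
    """Combine per-branch anomaly scores (e.g. object, motion, behavior)
    into one frame-level score. Branches set to None are inactive and
    are skipped, so the branches can run together or individually."""
    active = {k: v for k, v in branch_scores.items() if v is not None}
    if not active:
        return 0.0
    if weights is None:
        weights = {k: 1.0 for k in active}
    total_w = sum(weights[k] for k in active)
    return sum(weights[k] * v for k, v in active.items()) / total_w
```

Keeping the per-branch scores alongside the fused value is what makes the alarm interpretable: the operator sees not just that a frame is anomalous, but which branch (object, motion, or behavior) drove the score.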

Paper Nr: 142
Title:

How Quality Affects Deep Neural Networks in Fine-Grained Image Classification

Authors:

Joseph Smith, Zheming Zuo, Jonathan Stonehouse and Boguslaw Obara

Abstract: In this paper, we propose a No-Reference Image Quality Assessment (NRIQA) guided cut-off point selection (CPS) strategy to enhance the performance of a fine-grained classification system. Scores given by existing NRIQA methods on the same image may vary and are not as independent of natural image augmentations as expected, which weakens their connection and explainability with respect to fine-grained image classification. Taking the three most commonly adopted image augmentation configurations – cropping, rotating, and blurring – as the entry point, we formulate a two-step mechanism for selecting the most discriminative subset of a given image dataset by considering both the confidence of model predictions and the density distribution of image qualities over several NRIQA methods. Concretely, the cut-off points yielded by those methods are aggregated via majority voting to inform the image subset selection process. The efficacy and efficiency of this mechanism are confirmed by comparing models trained on high-quality images against those trained on a combination of high- and low-quality ones, yielding improvements of 0.7% to 4.2% in mean accuracy on a commercial product dataset across four deep neural classifiers. The robustness of the mechanism is demonstrated by the observation that all selected high-quality images can work jointly with 70% of the low-quality images, with only 1.3% of classification precision sacrificed, when using ResNet34 in an ablation study.
Download
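The majority-voting selection step can be sketched as follows. Each NRIQA method contributes a cut-off point; an image is kept when a strict majority of methods rate it above their respective cut-off. This is an illustrative reading of the mechanism: it assumes scores have been normalized so that higher means better quality (raw BRISQUE/NIQE scores run the other way), and the method names and thresholds are placeholders.

```python
def select_high_quality(quality_scores, cutoffs):
    """quality_scores: one dict per image mapping NRIQA method -> score
    (assumed normalized so higher = better quality).
    cutoffs: dict mapping NRIQA method -> cut-off point.
    Returns the indices of images kept by strict majority vote."""
    selected = []
    for idx, scores in enumerate(quality_scores):
        votes = sum(1 for method, s in scores.items() if s >= cutoffs[method])
        if votes * 2 > len(cutoffs):  # strict majority of the methods
            selected.append(idx)
    return selected
```

Voting across several NRIQA methods hedges against any single method's sensitivity to augmentations such as cropping, rotation, or blur, which is the inconsistency the abstract identifies.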

Paper Nr: 151
Title:

Efficient Parameter Mining and Freezing for Continual Object Detection

Authors:

Angelo G. Menezes, Augusto J. Peter