VISAPP 2024 Abstracts


Area 1 - Image and Video Processing and Analysis

Full Papers
Paper Nr: 65
Title:

Investigating Color Illusions from the Perspective of Computational Color Constancy

Authors:

Oguzhan Ulucan, Diclehan Ulucan and Marc Ebner

Abstract: Color constancy and color illusion perception are two phenomena occurring in the human visual system, which can help us reveal unknown mechanisms of human perception. For decades, computer vision scientists have developed numerous color constancy methods, which estimate the reflectance of the surface by discounting the illuminant. However, color illusions have not been analyzed in detail in the field of computational color constancy, which we find surprising since the relationship they share is significant and may let us design more robust systems. We argue that any model that can reproduce our sensation on color illusions should also be able to provide pixel-wise estimates of the light source. In other words, we suggest that the analysis of color illusions helps us to improve the performance of existing global color constancy methods and enables them to provide pixel-wise estimates for scenes illuminated by multiple light sources. In this study, we share the outcomes of our investigation in which we take several color constancy methods and modify them to reproduce the behavior of the human visual system on color illusions. We also show that parameters extracted purely from illusions are able to improve the performance of color constancy methods. A noteworthy outcome is that our strategy based on the investigation of color illusions outperforms the state-of-the-art methods that are specifically designed to transform global color constancy algorithms into multi-illuminant algorithms.

Paper Nr: 85
Title:

Pair-GAN: A Three-Validated Generative Model from Single Pairs of Biomedical and Ground Truth Images

Authors:

Clara Brémond-Martin, Huaqian Wu, Cédric Clouchoux and Kévin François-Bouaou

Abstract: Generating synthetic pairs of raw and ground truth (GT) images is a strategy to reduce the amount of acquisition and annotation required from biomedical experts. Pair image generation strategies from single-input paired images (SIP) focus on patch-pyramid (PP) or dual-branch generators, but the resulting synthetic images are not natural. With few input images, adversarial auto-encoders (AAE) synthesize more natural raw images. Here we propose Pair-GAN, a combination of a PP containing auto-encoder generators at each level, for biomedical image synthesis based upon a SIP. The PP allows synthesis from a SIP, while the AAE generator renders the image content more natural. We use two biomedical datasets containing raw and GT images for this work. Our architecture is evaluated against seven state-of-the-art methods updated for SIP using qualitative, similitude and segmentation metrics, Kullback-Leibler divergences between synthetic and original feature image representations, computational costs, and statistical analyses. Pair-GAN generates the most qualitative and natural outputs, similar to original pairs with complex shapes not produced by other methods, albeit with increased memory needs. Future work may use this generative procedure for multimodal biomedical dataset synthesis to aid automatic processing such as classification or segmentation with deep learning tools.

Paper Nr: 99
Title:

CaRe-CNN: Cascading Refinement CNN for Myocardial Infarct Segmentation with Microvascular Obstructions

Authors:

Franz Thaler, Matthias F. Gsell, Gernot Plank and Martin Urschler

Abstract: Late gadolinium enhanced (LGE) magnetic resonance (MR) imaging is widely established to assess the viability of myocardial tissue of patients after acute myocardial infarction (MI). We propose the Cascading Refinement CNN (CaRe-CNN), which is a fully 3D, end-to-end trained, 3-stage CNN cascade that exploits the hierarchical structure of such labeled cardiac data. Throughout the three stages of the cascade, the label definition changes and CaRe-CNN learns to gradually refine its intermediate predictions accordingly. Furthermore, to obtain more consistent qualitative predictions, we propose a series of post-processing steps that take anatomical constraints into account. Our CaRe-CNN was submitted to the FIMH 2023 MYOSAIQ challenge, where it ranked second out of 18 participating teams. CaRe-CNN showed great improvements most notably when segmenting the difficult but clinically most relevant myocardial infarct tissue (MIT) as well as microvascular obstructions (MVO). When computing the average scores over all labels, our method obtained the best score in eight out of ten metrics. Thus, accurate cardiac segmentation after acute MI via our CaRe-CNN allows generating patient-specific models of the heart serving as an important step towards personalized medicine.

Paper Nr: 100
Title:

Efficiency Optimization Strategies for Point Transformer Networks

Authors:

Jannis Unkrig and Markus Friedrich

Abstract: The Point Transformer, and especially its successor Point Transformer V2, are among the state-of-the-art architectures for point cloud processing in terms of accuracy. However, like many other point cloud processing architectures, they suffer from the inherently irregular structure of point clouds, which makes efficient processing computationally expensive. Common workarounds include reducing the point cloud density, or cropping out partitions, processing them sequentially, and then stitching them back together. However, those approaches inherently limit the architecture by providing either less detail or less context. This work provides strategies that directly address efficiency bottlenecks in the Point Transformer architecture and therefore allow processing larger point clouds in a single feed-forward operation. Specifically, we propose using uniform point cloud sizes in all stages of the architecture; a k-D tree-based k-nearest neighbor search algorithm that is not only efficient on large point clouds but also generates intermediate results that can be reused for downsampling; and a technique for normalizing local densities that improves overall accuracy. Furthermore, our architecture is simpler to implement and does not require custom CUDA kernels to run efficiently.

Paper Nr: 115
Title:

Simple Base Frame Guided Residual Network for RAW Burst Image Super-Resolution

Authors:

Anderson N. Cotrim, Gerson Barbosa, Cid N. Santos and Helio Pedrini

Abstract: Burst super-resolution or multi-frame super-resolution (MFSR) has gained significant attention in recent years, particularly in the context of mobile photography. With modern handheld devices consistently increasing their processing power and their ability to capture multiple images ever faster, the development of robust MFSR algorithms has become increasingly feasible. Furthermore, in contrast to extensively studied single-image super-resolution (SISR), burst super-resolution mitigates the ill-posed nature of reconstructing high-resolution images from low-resolution ones by merging information from multiple shifted frames. This research introduces a novel and effective deep learning approach, SBFBurst, designed to tackle this challenging problem. Our network takes multiple noisy RAW images as input and generates a denoised, super-resolved RGB image as output. We demonstrate that significant enhancements can be achieved in this problem by incorporating base frame-guided mechanisms through operations such as feature map concatenation and skip connections. Additionally, we highlight the significance of employing mosaicked convolution to improve alignment, thus enhancing the overall network performance in super-resolution tasks. These relatively simple improvements underscore the competitiveness of our proposed method when compared to other state-of-the-art approaches.

Paper Nr: 128
Title:

Multispectral Stereo-Image Fusion for 3D Hyperspectral Scene Reconstruction

Authors:

Eric L. Wisotzky, Jost Triller, Anna Hilsmann and Peter Eisert

Abstract: Spectral imaging enables the analysis of optical material properties that are invisible to the human eye. Different spectral capturing setups, e.g., based on filter-wheel, push-broom, line-scanning, or mosaic cameras, have been introduced in recent years to support a wide range of applications in agriculture, medicine, and industrial surveillance. However, these systems often suffer from different disadvantages, such as a lack of real-time capability, limited spectral coverage, or low spatial resolution. To address these drawbacks, we present a novel approach that combines two calibrated multispectral real-time capable snapshot cameras, covering different spectral ranges, into a stereo system. In this way, a hyperspectral data-cube can be captured continuously. The combined use of different multispectral snapshot cameras enables both 3D reconstruction and spectral analysis. Both captured images are demosaicked while avoiding spatial resolution loss. We fuse the spectral data from one camera into the other to obtain a spatially and spectrally high-resolution video stream. Experiments demonstrate the feasibility of this approach, and the system is investigated with regard to its applicability for surgical assistance monitoring.

Paper Nr: 138
Title:

Pre-Training and Fine-Tuning Attention Based Encoder Decoder Improves Sea Surface Height Multi-Variate Inpainting

Authors:

Théo Archambault, Arthur Filoche, Anastase Charantonis and Dominique Béréziat

Abstract: The ocean is observed through satellites measuring physical data of various natures. Among them, Sea Surface Height (SSH) and Sea Surface Temperature (SST) are physically linked quantities involving different remote sensing technologies and therefore different image inverse problems. In this work, we propose to use an Attention-based Encoder-Decoder to perform the inpainting of the SSH, using the SST as contextual information. We propose to pre-train this neural network on a realistic twin experiment of the observing system and to fine-tune it in an unsupervised manner on real-world observations. We demonstrate the value of this strategy by comparing it to existing methods. Our training methodology achieves state-of-the-art performance, and we report a 25% decrease in error compared to the most widely used interpolation product.

Paper Nr: 140
Title:

Deep Learning-Based Models for Performing Multi-Instance Multi-Label Event Classification in Gameplay Footage

Authors:

Etienne Julia, Marcelo Zanchetta do Nascimento, Matheus P. Faria and Rita S. Julia

Abstract: In dynamic environments such as videos, events are among the key pieces of information for improving the performance of autonomous agents since, broadly speaking, they represent the dynamic changes and interactions that occur in the environment. Video games stand out among the most suitable domains for investigating the effectiveness of machine learning techniques. Among the challenging activities explored in such research is endowing automatic game systems with the ability to identify, in game footage, the events that other players interacting with them provoke in the game environment. Thus, the main contribution of this work is the implementation of deep learning models to perform multi-instance multi-label (MIML) game event classification in gameplay footage, which are composed of: a data generator script to automatically produce multi-labeled frames from game footage (where the labels correspond to game events); a pre-processing method that makes the frames generated by the script suitable for use in the training datasets; a fine-tuned MobileNetV2 to perform feature extraction (trained on the pre-processed frames); an algorithm to produce MIML samples from the pre-processed frames (each sample corresponds to a set of frames named a chunk); and a deep neural network (NN) to classify game events, trained on the chunks. In this investigation, Super Mario Bros is used as a case study.

Paper Nr: 155
Title:

Image Inpainting on the Sketch-Pencil Domain with Vision Transformers

Authors:

Jose F. Campana, Luís L. Decker, Marcos R. Souza, Helena A. Maia and Helio Pedrini

Abstract: Image inpainting aims to realistically fill missing regions in images, which requires both structural and textural understanding. Traditionally, methods in the literature have employed Convolutional Neural Networks (CNN), especially Generative Adversarial Networks (GAN), to restore missing regions in a coherent and reliable manner. However, CNNs’ limited receptive fields can sometimes result in unreliable outcomes due to their inability to capture the broader context of the image. Transformer-based models, on the other hand, can learn long-range dependencies through self-attention mechanisms. In order to generate more consistent results, some approaches have further incorporated auxiliary information to guide the model’s understanding of structural information. In this work, we propose a new method for image inpainting that uses sketch-pencil information to guide the restoration of structural, as well as textural elements. Unlike previous works that employ edges, lines, or segmentation maps, we leverage the sketch-pencil domain and the capabilities of Transformers to learn long-range dependencies to properly match structural and textural information, resulting in more consistent results. Experimental results show the effectiveness of our approach, demonstrating either superior or competitive performance when compared to existing methods, especially in scenarios involving complex images and large missing areas.

Paper Nr: 163
Title:

EBA-PRNetCC: An Efficient Bridge Attention-Integration PoseResNet for Coordinate Classification in 2D Human Pose Estimation

Authors:

Ali Zakir, Sartaj A. Salman, Gibran Benitez-Garcia and Hiroki Takahashi

Abstract: In the current era, 2D Human Pose Estimation (HPE) has emerged as an essential component in advanced Computer Vision tasks, particularly for understanding human behaviors. While challenges such as occlusion and unfavorable lighting conditions persist, the advent of deep learning has significantly strengthened the efficacy of 2D HPE. Yet, traditional 2D heatmap methodologies face quantization errors and demand complex post-processing. Addressing this, we introduce the EBA-PRNetCC model, an innovative coordinate classification approach for 2D HPE, emphasizing improved prediction accuracy and optimized model parameters. Our EBA-PRNetCC model employs a modified ResNet34 framework. A key feature is its head, which includes a dual-layer Multi-Layer Perceptron augmented by the Mish activation function. This design not only improves pose estimation precision but also minimizes model parameters. Integrating the Efficient Bridge Attention Net further enriches feature extraction, granting the model deep contextual insights. By enhancing pixel-level discretization, joint localization accuracy is improved. Comprehensive evaluations on the COCO dataset validate our model’s superior accuracy and computational efficiency compared to prevailing 2D HPE techniques.

Paper Nr: 165
Title:

Training Methods for Regularizing Gradients on Multi-Task Image Restoration Problems

Authors:

Samuel Willingham, Mårten Sjöström and Christine Guillemot

Abstract: Inverse problems refer to the task of reconstructing a clean signal from a degraded observation. In imaging, this pertains to restoration problems like denoising, super-resolution or in-painting. Because inverse problems are often ill-posed, regularization based on prior information is needed. Plug-and-play (PnP) approaches take a general approach to regularization and plug a deep denoiser into an iterative solver for inverse problems. However, considering the inverse problems at hand during training could improve reconstruction performance at test time. Deep equilibrium (DEQ) models allow for the training of multi-task priors on the reconstruction error via an estimate of the iterative method’s fixed point (FP). This paper investigates the intersection of PnP and DEQ models for the training of a regularizing gradient (RG) and derives an upper bound for the reconstruction loss of a gradient-descent (GD) procedure. Based on this upper bound, two procedures for the training of RGs are proposed and compared: one optimizes the upper bound directly, while the other trains a deep equilibrium GD (DEQGD) procedure and uses the bound for regularization. The resulting regularized RG (RERG) produces consistently good reconstructions across different inverse problems, while the other RGs tend to have some inverse problems on which they provide inferior reconstructions.

Paper Nr: 218
Title:

Feature Selection for Unsupervised Anomaly Detection and Localization Using Synthetic Defects

Authors:

Lars Heckler and Rebecca König

Abstract: Expressive features are crucial for unsupervised visual Anomaly Detection and Localization. State-of-the-art methods like PatchCore or SimpleNet heavily exploit such features from pretrained extractor networks and model their distribution or utilize them for training further parts of the model. However, the layers commonly used for feature extraction might not represent the optimal choice for reaching maximum performance. Thus, we present the first application-specific feature selection strategy for the task of unsupervised Anomaly Detection and Localization, which identifies the most suitable layer of a pretrained feature extractor based on the performance on a synthetic validation set. The proposed selection strategy is applicable to any feature extraction-based AD method and may serve as a competitive baseline for future work, outperforming not only single-layer baselines but also features ensembled from the outputs of multiple layers.

Paper Nr: 234
Title:

Robust Denoising and DenseNet Classification Framework for Plant Disease Detection

Authors:

Kevin Zhou and Dimah Dera

Abstract: Plant disease is one of many obstacles encountered in the field of agriculture. Machine learning models have been used to classify and detect diseases among plants by analyzing and extracting features from plant images. However, a common problem for many models is that they are trained on clean laboratory images that do not exemplify real conditions, where noise can be present. In addition, the emergence of adversarial noise that can mislead models into wrong predictions poses a severe challenge to developing models that remain robust in noisy environments. In this paper, we propose an end-to-end robust plant disease detection framework that combines DenseNet-based classification with a robust deep learning denoising model. We evaluate a variety of deep learning denoising models and adopt the Real Image Denoising network (RIDNet). The experiments have shown that the proposed denoising classification framework for plant disease detection is more robust against noisy or corrupted input images compared to a single classification model and can also successfully defend against adversarial noise in images.

Paper Nr: 240
Title:

SIDAR: Synthetic Image Dataset for Alignment & Restoration

Authors:

Monika Kwiatkowski, Simon Matern and Olaf Hellwich

Abstract: In this paper, we present a synthetic dataset generation to create large-scale datasets for various image restoration and registration tasks. Illumination changes, shadows, occlusions, and perspective distortions are added to a given image using a 3D rendering pipeline. Each sequence contains the undistorted image, occlusion masks, and homographies. Although we provide two specific datasets, the data generation itself can be customized and used to generate an arbitrarily large dataset with an arbitrary combination of distortions. The datasets allow end-to-end training of deep learning methods for tasks such as image restoration, background subtraction, image matching, and homography estimation. We evaluate multiple image restoration methods to reconstruct the content from a sequence of distorted images. Additionally, a benchmark is provided that evaluates keypoint detectors and image matching methods. Our evaluations show that even learned image descriptors struggle to identify and match keypoints under varying lighting conditions.

Paper Nr: 250
Title:

Beyond Variational Models and Self-Similarity in Super-Resolution: Unfolding Models and Multi-Head Attention

Authors:

Ivan Pereira-Sánchez, Eloi Sans, Julia Navarro and Joan Duran

Abstract: Classical variational methods for solving image processing problems are more interpretable and flexible than pure deep learning approaches, but their performance is limited by the use of rigid priors. Deep unfolding networks combine the strengths of both by unfolding the steps of the optimization algorithm used to estimate the minimizer of an energy functional into a deep learning framework. In this paper, we propose an unfolding approach to extend a variational model exploiting self-similarity of natural images in the data fidelity term for single-image super-resolution. The proximal, downsampling and upsampling operators are written in terms of a neural network specifically designed for each purpose. Moreover, we include a new multi-head attention module to replace the nonlocal term in the original formulation. A comprehensive evaluation covering a wide range of sampling factors and noise realizations proves the benefits of the proposed unfolding techniques. The model shows to better preserve image geometry while being robust to noise.

Paper Nr: 298
Title:

The Risk of Image Generator-Specific Traces in Synthetic Training Data

Authors:

Georg Wimmer, Dominik Söllinger and Andreas Uhl

Abstract: Deep learning based methods require large amounts of annotated training data. Using synthetic images to train deep learning models is a faster and cheaper alternative to gathering and manually annotating training data. However, synthetic images have been demonstrated to exhibit a unique model-specific fingerprint that is not present in real images. In this work, we investigate the effect of such model-specific traces on the training of CNN-based classifiers. Two different methods are applied to generate synthetic training data, a conditional GAN-based image-to-image translation method (BicycleGAN) and a conditional diffusion model (Palette). Our results show that CNN-based classifiers can easily be fooled by generator-specific traces contained in synthetic images. As we will show, classifiers can learn to discriminate based on the traces left by the generator, instead of class-specific features.

Paper Nr: 317
Title:

Facial Point Graphs for Amyotrophic Lateral Sclerosis Identification

Authors:

Nicolas B. Gomes, Arissa Yoshida, Mateus Roder, Guilherme Camargo de Oliveira and João P. Papa

Abstract: Identifying Amyotrophic Lateral Sclerosis (ALS) in its early stages is essential for establishing the beginning of treatment, enriching the outlook, and enhancing the overall well-being of those affected individuals. However, early diagnosis and detecting the disease’s signs is not straightforward. A simpler and cheaper way arises by analyzing the patient’s facial expressions through computational methods. When a patient with ALS engages in specific actions, e.g., opening their mouth, the movement of specific facial muscles differs from that observed in a healthy individual. This paper proposes Facial Point Graphs to learn information from the geometry of facial images to identify ALS automatically. The experimental outcomes in the Toronto Neuroface dataset show the proposed approach outperformed state-of-the-art results, fostering promising developments in the area.

Paper Nr: 384
Title:

Single-Class Instance Segmentation for Vectorization of Line Drawings

Authors:

Rhythm Vohra, Amanda Dash and Alexandra Branzan Albu

Abstract: Images can be represented and stored either in raster or in vector formats. Raster images are the most ubiquitous and are defined as matrices of pixel intensities/colours, while vector images consist of a finite set of geometric primitives, such as lines, curves, and polygons. Since geometric shapes are expressed via mathematical equations and defined by a limited number of control points, they can be manipulated much more easily than by directly working with pixels; hence, the vector format is much preferred to raster for image editing and understanding purposes. The conversion of a raster image into its vector correspondent is a non-trivial process, called image vectorization. This paper presents a vectorization method for line drawings that is much faster and more accurate than the state-of-the-art. We propose a novel segmentation method that processes the input raster image by labeling each pixel as belonging to a particular stroke instance. Our contributions consist of a segmentation model (called Multi-Focus Attention UNet), as well as a loss function that handles infrequent labels well and yields outputs that accurately capture the human drawing style.

Paper Nr: 387
Title:

Frames Preprocessing Methods for Chromakey Classification in Video

Authors:

Evgeny Bessonnitsyn, Artyom Chebykin, Grigorii Stafeev and Valeria Efimova

Abstract: Currently, video games, movies, commercials, and television shows are ubiquitous in modern society. However, beneath the surface of their visual variety lies sophisticated technology, which can produce impressive effects. One such technology is chromakey, a method that allows the background to be replaced with any other image or video. Recognizing chromakey technology in video plays a key role in finding fake materials. In this paper, we consider approaches based on deep learning models that recognize chromakey in video based on unnatural artifacts that arise during the transition between frames. A video consists of a sequence of frames, and its accuracy can be determined in different ways. If we consider accuracy frame by frame, our method reaches an F1 score of 0.67. If we consider the entire video to be fake when it contains one or more fake segments, the F1 score is 0.76. The proposed methods showed better results on the dataset we collected in comparison with existing methods for chromakey detection.

Paper Nr: 413
Title:

Evaluating Multiple Combinations of Models and Encoders to Segment Clouds in Satellite Images

Authors:

Jocsan L. Ferreira, Leandro P. Silva, Mauricio C. Escarpinati, André R. Backes and João F. Mari

Abstract: This work evaluates deep learning-based methods for cloud segmentation in satellite images. We compared several semantic segmentation architectures using different encoder structures. In this sense, we fine-tuned three architectures (U-Net, LinkNet, and PSPNet) with four pre-trained encoders (ResNet-50, VGG-16, MobileNet V2, and EfficientNet B2). The performance of the models was evaluated using the Cloud-38 dataset. The training process was carried out until the validation loss stabilized, according to the early stopping criterion, providing a comparative analysis of the best models and training strategies for cloud segmentation in satellite images. We evaluated performance using classic evaluation metrics, i.e., pixel accuracy, mean pixel accuracy, mean IoU, and frequency-based IoU. Results demonstrated that the tested models are capable of segmenting clouds with considerable performance, with emphasis on the following values: (i) 96.19% pixel accuracy for LinkNet with the VGG-16 encoder, (ii) 92.58% mean pixel accuracy for U-Net with the MobileNet V2 encoder, (iii) 87.21% mean IoU for U-Net with the VGG-16 encoder, and (iv) 92.89% frequency-based IoU for LinkNet with the VGG-16 encoder. In short, the results of this study provide valuable information for developing satellite image analysis solutions in the context of precision agriculture.

Paper Nr: 448
Title:

FingerSeg: Highly-Efficient Dual-Resolution Architecture for Precise Finger-Level Semantic Segmentation

Authors:

Gibran Benitez-Garcia and Hiroki Takahashi

Abstract: Semantic segmentation at the finger level poses unique challenges, including the limited pixel representation of some classes and the complex interdependency of the hand anatomy. In this paper, we propose FingerSeg, a novel architecture inspired by Deep Dual-Resolution Networks, specifically adapted to address the nuances of finger-level hand semantic segmentation. To this end, we introduce three modules: Enhanced Bilateral Fusion (EBF), which refines low- and high-resolution feature fusion via attention mechanisms; Multi-Attention Module (MAM), designed to augment high-level features with a composite of channel, spatial, orientational, and categorical attention; and Asymmetric Dilated Up-sampling (ADU), which combines standard and asymmetric atrous convolutions to capture rich contextual information for pixel-level classification. To properly evaluate our proposal, we introduce IPN-Finger, a subset of the IPN-Hand dataset, manually annotated pixel-wise for 13 finger-related classes. Our extensive empirical analysis, including evaluations on the synthetic RHD dataset against current state-of-the-art methods, demonstrates that our proposal achieves top results. FingerSeg reaches 73.8 and 71.1 mIoU on the IPN-Finger and RHD datasets, respectively, while maintaining an efficient computational cost of about 7 GFLOPs and 6 million parameters at VGA resolution. The dataset, source code, and a demo of FingerSeg will be available upon the publication of this paper.

Short Papers
Paper Nr: 22
Title:

Learning End-to-End Deep Learning Based Image Signal Processing Pipeline Using a Few-Shot Domain Adaptation

Authors:

Georgy Perevozchikov and Egor Ershov

Abstract: Nowadays, the quality of mobile phone cameras plays one of the most important roles in modern smartphones; as a result, more attention is being paid to the camera Image Signal Processing (ISP) pipeline. The current goal of the scientific community is to develop a neural-based end-to-end pipeline that removes the expensive and exhausting process of classical ISP tuning for each new device. The main drawback of the neural-based approach is the necessity of preparing large-scale datasets each time a new smartphone is designed. In this paper, we address this problem and propose a new method for few-shot domain adaptation of an existing neural ISP to a new domain. We show that 10 labeled images of the target domain are sufficient to achieve state-of-the-art performance on real camera benchmark datasets. We also provide a comparative analysis of our proposed approach with other existing ISP domain adaptation methods and show that our approach achieves better results, with only a marginal 2% drop in performance compared to a baseline trained from scratch on the whole dataset. We believe that this solution will significantly reduce the cost of neural-based ISP production for each new device.

Paper Nr: 35
Title:

Machine Learning in Industrial Quality Control of Glass Bottle Prints

Authors:

Maximilian Bundscherer, Thomas H. Schmitt and Tobias Bocklet

Abstract: In industrial manufacturing of glass bottles, quality control of bottle prints is necessary as numerous factors can negatively affect the printing process. Even minor defects in the bottle prints must be detected despite reflections in the glass or manufacturing-related deviations. In cooperation with our medium-sized industrial partner, two ML-based approaches for quality control of these bottle prints were developed and evaluated, which can also be used in this challenging scenario. Our first approach utilized different filters to suppress reflections (e.g. Sobel or Canny) and image quality metrics for image comparison (e.g. MSE or SSIM) as features for different supervised classification models (e.g. SVM or k-Neighbors), which resulted in an accuracy of 84%. The images were aligned based on the ORB algorithm, which allowed us to estimate the rotations of the prints, which may serve as an indicator for anomalies in the manufacturing process. In our second approach, we fine-tuned different pre-trained CNN models (e.g. ResNet or VGG) for binary classification, which resulted in an accuracy of 87%. Utilizing Grad-CAM on our fine-tuned ResNet-34, we were able to localize and visualize frequently defective bottle print regions. This method allowed us to provide insights that could be used to optimize the actual manufacturing process. This paper also describes our general approach and the challenges we encountered in practice with data collection during ongoing production, unsupervised preselection, and labeling.
Download

Paper Nr: 38
Title:

Generative Texture Super-Resolution via Differential Rendering

Authors:

Milena Bagdasarian, Peter Eisert and Anna Hilsmann

Abstract: Image super-resolution is a well-studied field that aims at generating high-resolution images from low-resolution inputs while preserving fine details and realistic features. Despite significant progress on regular images, inferring high-resolution textures of 3D models poses unique challenges. Due to the non-contiguous arrangement of texture patches, intended for wrapping around 3D meshes, applying conventional image super-resolution techniques to texture maps often results in artifacts and seams at texture discontinuities on the mesh. Additionally, obtaining ground truth data for texture super-resolution becomes highly complex due to the labor-intensive process of hand-crafting ground truth textures for each mesh. We propose a generative deep learning network for texture map super-resolution using a differentiable renderer and calibrated reference images. Combining a super-resolution generative adversarial network (GAN) with differentiable rendering, we guide our network towards learning realistic details and seamless texture map super-resolution without a high-resolution ground truth of the texture. Instead, we use high-resolution reference images. Through the differentiable rendering approach, we include model knowledge such as 3D meshes, projection matrices, and calibrated images to bridge the domain gap between 2D image super-resolution and texture map super-resolution. Our results show textures with fine structures and improved detail, which is especially of interest in virtual and augmented reality environments depicting humans.
Download

Paper Nr: 49
Title:

Iterative Saliency Enhancement over Superpixel Similarity

Authors:

Leonardo M. Joao and Alexandre X. Falcao

Abstract: Salient Object Detection (SOD) has several applications in image analysis. The methods have evolved from image-intrinsic to object-inspired (deep-learning-based) models. However, when a model fails, there is no alternative to enhance its saliency map. We fill this gap by introducing a hybrid approach, the Iterative Saliency Enhancement over Superpixel Similarity (ISESS), that iteratively generates enhanced saliency maps by executing two operations alternately: object-based superpixel segmentation and superpixel-based saliency estimation, a cycle of operations not previously exploited. ISESS estimates seeds for superpixel delineation from a given saliency map and defines superpixel queries in the foreground and background. A new saliency map results from color similarities between queries and superpixels at each iteration. The process repeats and, after a given number of iterations, the generated saliency maps are combined into one by cellular automata. Finally, the resulting map is merged with the initial one by the maximum between their average values per superpixel. We demonstrate that our hybrid model consistently outperforms three state-of-the-art deep-learning-based methods on five image datasets.
Download

Paper Nr: 54
Title:

Estimation of Package-Boundary Confidence for Object Recognition in Rainbow-SKU Depalletizing Automation

Authors:

Kento Sekiya, Taiki Yano, Nobutaka Kimura and Kiyoto Ito

Abstract: We developed a reliable object recognition method for a rainbow-SKU depalletizing robot. Rainbow SKUs include various types of objects such as boxes, bags, and bottles. The objects’ areas need to be estimated in order to automate a depalletizing robot; however, it is difficult to detect the boundaries between adjacent objects. To solve this problem, we focus on the difference in the shape of the boundaries and propose package-boundary confidence, which assesses whether the recognized boundary correctly corresponds to that of an object unit. This method classifies recognition results into four categories on the basis of the objects’ shape and calculates the package-boundary confidence for each category. The results of our experimental evaluation indicate that the proposed method, combined with automatic recovery via slight displacement, can achieve a recognition success rate of 99.0%. This is higher than that of a conventional object recognition method. Furthermore, we verified that the proposed method is applicable to a real-world depalletizing robot by combining package-boundary confidence with automatic recovery.
Download

Paper Nr: 56
Title:

Calibration-Accuracy Measurement in Railway Overlapping Multi-Camera Systems

Authors:

Martí Sánchez, Nerea Aranjuelo, Jon A. Iñiguez de Gordoa, Pablo Alonso, Mikel García, Marcos Nieto and Mikel Labayen

Abstract: This paper presents a method for assessing calibration quality in overlapping multi-camera systems used in railway transportation. We propose a novel approach that considers the extrinsic and intrinsic parameters of the cameras and extracts features from their images, providing relevant patterns regarding the pose of the cameras to detect cameras’ calibration misalignment. Three feature extractors, including traditional image processing techniques and deep learning approaches, are evaluated and compared. The extracted features are used to provide a calibration quality metric, enabling real-time detection of camera calibration degradation. Additionally, we introduce a radial grid design that weights the contribution of pixels based on their distance from the camera’s optical center. The results demonstrate the effectiveness of our method in assessing the calibration degree between camera pairs. The findings highlight the superior performance of the deep learning approaches in analyzing the similarity degree between captured images. Overall, our method lays a solid foundation for the development of an online camera calibration pipeline.
Download

Paper Nr: 69
Title:

Vision-Perceptual Transformer Network for Semantic Scene Understanding

Authors:

Mohamad Alansari, Hamad AlRemeithi, Bilal Hassan, Sara Alansari, Jorge Dias, Majid Khonji, Naoufel Werghi and Sajid Javed

Abstract: Semantic segmentation, essential in computer vision, involves labeling each image pixel with its semantic class. Transformer-based models, recognized for their exceptional performance, have been pivotal in advancing this field. Our contribution, the Vision-Perceptual Transformer Network (VPTN), ingeniously combines transformer encoders with a feature pyramid-based decoder to deliver precise segmentation maps with minimal computational burden. VPTN’s transformative power lies in its integration of the pyramiding technique, enhancing multi-scale variations handling. In direct comparisons with Vision Transformer-based networks and variants, VPTN consistently excels. On average, it achieves 4.2%, 3.41%, and 6.24% higher mean Intersection over Union (mIoU) compared to Dense Prediction (DPT), Data-efficient image Transformer (DeiT), and Swin Transformer networks, while demanding only 15.63%, 3.18%, and 10.05% of their Giga Floating-Point Operations (GFLOPs). Our validation spans five diverse datasets, including Cityscapes, BDD100K, Mapillary Vistas, CamVid, and ADE20K. VPTN secures the position of state-of-the-art (SOTA) on BDD100K and CamVid and consistently outperforms existing deep learning models on other datasets, boasting mIoU scores of 82.6%, 67.29%, 61.2%, 86.3%, and 55.3%, respectively. Impressively, it does so with an average computational complexity just 11.44% of SOTA models. VPTN represents a significant advancement in semantic segmentation, balancing efficiency and performance. It shows promising potential, especially for autonomous driving and natural setting computer vision applications.
Download

Paper Nr: 78
Title:

Data Quality Aware Approaches for Addressing Model Drift of Semantic Segmentation Models

Authors:

Samiha Mirza, Vuong D. Nguyen, Pranav Mantini and Shishir K. Shah

Abstract: In the midst of the rapid integration of artificial intelligence (AI) into real world applications, one pressing challenge we confront is the phenomenon of model drift, wherein the performance of AI models gradually degrades over time, compromising their effectiveness in real-world, dynamic environments. Once identified, we need techniques for handling this drift to preserve the model performance and prevent further degradation. This study investigates two prominent quality aware strategies to combat model drift: data quality assessment and data conditioning based on prior model knowledge. The former leverages image quality assessment metrics to meticulously select high-quality training data, improving the model robustness, while the latter makes use of learned feature vectors from existing models to guide the selection of future data, aligning it with the model’s prior knowledge. Through comprehensive experimentation, this research aims to shed light on the efficacy of these approaches in enhancing the performance and reliability of semantic segmentation models, thereby contributing to the advancement of computer vision capabilities in real-world scenarios.
Download

Paper Nr: 79
Title:

Privacy Preservation in Image Classification Using Seam Doppelganger

Authors:

Nishitha Prakash and James Pope

Abstract: Cloud storage usage continues to increase and many cloud storage sites use advanced machine learning models to classify users’ images for various purposes, possibly malicious in nature. This introduces very serious privacy concerns where users want to store and view their images on the cloud storage but do not want the models to be able to accurately classify their images. This is a difficult problem and there are many proposed solutions including the seam doppelganger algorithm. Seam Doppelganger uses the seam carving content-aware resizing approach to modify the image in a way that is still human-understandable and has been shown to reduce model accuracy. However, the approach was not tested with different classifiers, is not able to provide complete restoration, and uses a limited dataset. We propose several modifications to the Seam Doppelganger algorithm to better enhance the privacy of the image while keeping it human-readable and able to be fully restored. We modify the energy function to use a histogram of gradients, comprehensively compare seam selection, and evaluate with several pre-trained (on ImageNet and Kaggle datasets) image classification models. We use the structural similarity index measure (SSIM) to determine the degree of distortion as a proxy for human understanding. The approach degrades the classification performance by 70% and guarantees 100% restoration of the original image.
Download

Paper Nr: 89
Title:

Automated Generation of Instance Segmentation Labels for Traffic Surveillance Models

Authors:

D. Scholte, T. T. G. Urselmann, M. H. Zwemer, E. Bondarev and P. H. N. de With

Abstract: This paper focuses on instance segmentation and object detection for real-time traffic surveillance applications. Although instance segmentation is currently a hot topic in literature, no suitable dataset for traffic surveillance applications is publicly available and limited work is available with real-time performance. A custom proprietary dataset is available for training, but it contains only bounding-box annotations and lacks segmentation annotations. The paper explores methods for automated generation of instance segmentation labels for custom datasets that can be utilized to fine-tune state-of-the-art segmentation models to specific application domains. Real-time performance is obtained by adopting the recent YOLACT instance segmentation with the YOLOv7 backbone. Nevertheless, it requires modification of the loss function and an implementation of ground-truth matching to overcome handling imperfect instance labels in custom datasets. Experiments show that it is possible to achieve a high instance segmentation performance using a semi-automatically generated dataset, especially when using the Segment Anything Model for generating the labels.
Download

Paper Nr: 101
Title:

SAMMI: Segment Anything Model for Malaria Identification

Authors:

Luca Zedda, Andrea Loddo and Cecilia Di Ruberto

Abstract: Malaria, a life-threatening disease caused by the Plasmodium parasite, is a pressing global health challenge. Timely detection is critical for effective treatment. This paper introduces a novel computer-aided diagnosis system for detecting Plasmodium parasites in blood smear images, aiming to enhance automation and accessibility in comprehensive screening scenarios. Our approach integrates the Segment Anything Model for precise unsupervised parasite detection. It then employs a deep learning framework, combining Convolutional Neural Networks and Vision Transformer to accurately classify malaria-infected cells. We rigorously evaluate our system using the IML public dataset and compare its performance against various off-the-shelf object detectors. The results underscore the efficacy of our method, demonstrating superior accuracy in detecting and classifying malaria-infected cells. This innovative computer-aided diagnosis system presents a reliable and near real-time solution for malaria diagnosis, offering significant potential for widespread implementation in healthcare settings. By automating the diagnosis process and ensuring high accuracy, our system can contribute to timely interventions, thereby advancing the fight against malaria globally.
Download

Paper Nr: 103
Title:

Stereo-Event-Camera-Technique for Insect Monitoring

Authors:

Regina Pohle-Fröhlich, Colin Gebler and Tobias Bolten

Abstract: To investigate the causes of declining insect populations, a monitoring system is needed that automatically records insect activity and additional environmental factors over an extended period of time. For this reason, we use a sensor-based method with two event cameras. In this paper, we describe the system, the view volume that can be recorded with it, and a database used for insect detection. We also present the individual steps of our developed processing pipeline for insect monitoring. For the extraction of insect trajectories, a U-Net based segmentation was tested. For this purpose, the events within a time period of 50 ms were transformed into a frame representation using four different encoding types. The tested histogram encoding achieved the best results with an F1 score for insect segmentation of 0.897 and 0.967 for plant movement and noise parts. The detected trajectories were then transformed into a 4D representation, including depth, and visualized.
Download

Paper Nr: 117
Title:

CAVC: Cosine Attention Video Colorization

Authors:

Leandro Stival, Ricardo S. Torres and Helio Pedrini

Abstract: Video colorization is a challenging task, demanding deep learning models to employ diverse abstractions for a comprehensive grasp of the task, ultimately yielding high-quality results. Currently, in example-based colorization approaches, the use of attention processes and convolutional layers has proven to be the most effective way to produce good results. Following this line, in this paper we propose Cosine Attention Video Colorization (CAVC), an approach that uses a single attention head with shared weights to produce a refinement of the monochromatic frame, as well as the cosine similarity between this sample and the other channels present in the image. This entire process acts as a pre-processing of the data for our autoencoder, which performs a feature fusion with the latent space extracted from the reference frame, as well as with its histogram. This architecture was trained on the DAVIS, UVO and LDV datasets and achieved superior results compared to state-of-the-art models in terms of the FID metric on all datasets.
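As a toy illustration of the similarity computation the abstract refers to (not the authors' attention mechanism), cosine similarity between a monochromatic frame and a reference channel reduces to a normalized dot product over the flattened pixels:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two images, flattened to vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Degenerate all-zero inputs get similarity 0 by convention.
    return float(a @ b / denom) if denom else 0.0
```

A value near 1 indicates that the two channels share the same spatial structure up to a global scale, which is what makes the measure useful for matching a gray frame against colored reference channels.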
Download

Paper Nr: 121
Title:

Efficient Posterior Sampling for Diverse Super-Resolution with Hierarchical VAE Prior

Authors:

Jean Prost, Antoine Houdard, Andrés Almansa and Nicolas Papadakis

Abstract: We investigate the problem of producing diverse solutions to an image super-resolution problem. From a probabilistic perspective, this can be done by sampling from the posterior distribution of an inverse problem, which requires the definition of a prior distribution on the high-resolution images. In this work, we propose to use a pretrained hierarchical variational autoencoder (HVAE) as a prior. We train a lightweight stochastic encoder to encode low-resolution images in the latent space of a pretrained HVAE. At inference, we combine the low-resolution encoder and the pretrained generative model to super-resolve an image. We demonstrate on the task of face super-resolution that our method provides an advantageous trade-off between the computational efficiency of conditional normalizing flows techniques and the sample quality of diffusion based methods.
Download

Paper Nr: 144
Title:

Concept Basis Extraction for Latent Space Interpretation of Image Classifiers

Authors:

Alexandros Doumanoglou, Dimitrios Zarpalas and Kurt Driessens

Abstract: Previous research has shown that, to a large extent, deep feature representations of image patches that belong to the same semantic concept lie in the same direction of an image classifier’s feature space. Conventional approaches compute these directions using annotated data, forming an interpretable feature space basis (also referred to as a concept basis). Unsupervised Interpretable Basis Extraction (UIBE) was recently proposed as a novel method that can suggest an interpretable basis without annotations. In this work, we show that the addition of a classification loss term to the unsupervised basis search can lead to basis suggestions that align even more with interpretable concepts. This loss term enforces the basis vectors to point towards directions that maximally influence the classifier’s predictions, exploiting concept knowledge encoded by the network. We evaluate our work by deriving a concept basis for three popular convolutional networks, trained on three different datasets. Experiments show that our contributions enhance the interpretability of the learned bases, according to the interpretability metrics, by up to +45.8% relative improvement. As an additional practical contribution, we report hyper-parameters, found by hyper-parameter search in controlled benchmarks, that can serve as a starting point for applications of the proposed method in real-world scenarios that lack annotations.
Download

Paper Nr: 158
Title:

Assessing the Performance of Autoencoders for Particle Density Estimation in Acoustofluidic Medium: A Visual Analysis Approach

Authors:

Lucas M. Massa, Tiago F. Vieira, Allan M. Martins and Bruno G. Ferreira

Abstract: Micro-particle density is important for understanding different cell types, their growth stages, and how they respond to external stimuli. In previous work, a Gaussian curve fitting method was used to estimate the size of particles, in order to later calculate their density. This approach required a long processing time, making the development of a Point of Care (PoC) device difficult. The current work proposes the application of a convolutional autoencoder (AE) to estimate single-particle density, aiming to develop a PoC device that overcomes the limitations presented in the previous study. Thus, we used the AE to bottleneck a set of particle images into a single latent variable to evaluate its ability to represent the particle’s diameter. We employed an identical physical apparatus involving a microscope to take pictures of particles in a liquid submitted to ultrasonic waves before the settling process. The AE was initially trained with a set of images for calibration. The acquired parameters were applied to the test set to estimate the velocity at which the particle falls within the ultrasonic chamber. This velocity was later used to infer the particle density. Our results demonstrated that the AE model performed much better, with significantly higher computational speed and comparable error in density estimation.
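The abstract infers density from the settling velocity inside the chamber. One standard route from velocity to density (a hypothetical sketch, not necessarily the authors' exact model) is Stokes' law for a small sphere settling in a viscous fluid, which is directly invertible:

```python
def stokes_velocity(rho_p, rho_f=1000.0, mu=1.0e-3, d=1.0e-5, g=9.81):
    """Terminal settling velocity (m/s) of a sphere in the Stokes regime.

    rho_p: particle density (kg/m^3), rho_f: fluid density,
    mu: dynamic viscosity (Pa*s), d: particle diameter (m).
    Defaults (water at room temperature, 10-micron particle) are illustrative.
    """
    return g * d ** 2 * (rho_p - rho_f) / (18.0 * mu)

def density_from_velocity(v, rho_f=1000.0, mu=1.0e-3, d=1.0e-5, g=9.81):
    """Invert Stokes' law: particle density (kg/m^3) from settling velocity."""
    return rho_f + 18.0 * mu * v / (g * d ** 2)
```

The inversion shows why the measured fall velocity, together with a diameter estimate (here, the AE's latent variable), suffices to recover density.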
Download

Paper Nr: 160
Title:

Image Edge Enhancement for Effective Image Classification

Authors:

Bu Tianhao, Michalis Lazarou and Tania Stathaki

Abstract: Image classification has been a popular task due to its feasibility in real-world applications. Training neural networks on RGB images has demonstrated success at this task. Nevertheless, improving the classification accuracy and computational efficiency of this process continues to present challenges that researchers are actively addressing. A widely embraced method to improve the classification performance of neural networks is to incorporate data augmentations during the training process. Data augmentations are simple transformations that create slightly modified versions of the training data, and can be very effective in training neural networks to mitigate overfitting and improve their accuracy. In this study, we draw inspiration from high-boost image filtering and propose an edge enhancement-based method as a means to enhance both the accuracy and training speed of neural networks. Specifically, our approach involves extracting high-frequency features, such as edges, from images within the available dataset and fusing them with the original images to generate new, enriched images. Our comprehensive experiments, conducted on two distinct datasets (CIFAR10 and CALTECH101) and three different network architectures (ResNet-18, LeNet-5, and CNN-9), demonstrate the effectiveness of our proposed method.
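In the spirit of the high-boost filtering the abstract draws on (a minimal numpy sketch, not the paper's exact pipeline), the high-frequency component is the image minus a low-pass version of itself, scaled and added back:

```python
import numpy as np

def box_blur(img, k=3):
    """k x k mean filter: a simple low-pass."""
    h, w = img.shape
    p = np.pad(img.astype(float), k // 2, mode="edge")
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            out += p[i:i + h, j:j + w]
    return out / (k * k)

def high_boost(img, alpha=1.0):
    """Fuse extracted high-frequency detail (edges) back into the image."""
    detail = img - box_blur(img)           # edges / fine structure
    return np.clip(img + alpha * detail, 0.0, 255.0)
```

The fused output could then stand alongside the original as an enriched training sample, which is the augmentation idea the abstract describes.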
Download

Paper Nr: 168
Title:

Instance Segmentation of Event Camera Streams in Outdoor Monitoring Scenarios

Authors:

Tobias Bolten, Regina Pohle-Fröhlich and Klaus D. Tönnies

Abstract: Event cameras are a new type of image sensor. The pixels of these sensors operate independently and asynchronously from each other. The sensor output is a variable rate data stream that spatio-temporally encodes the detection of brightness changes. This type of output and sensor operating paradigm poses processing challenges for computer vision applications, as frame-based methods are not natively applicable. We provide the first systematic evaluation of different state-of-the-art deep learning based instance segmentation approaches in the context of event-based outdoor surveillance. For processing, we consider transforming the event output stream into representations of different dimensionalities, including point-, voxel-, and frame-based variants. We introduce a new dataset variant that provides annotations at the level of instances per output event, as well as a density-based preprocessing to generate regions of interest (RoI). The achieved instance segmentation results show that the adaptation of existing algorithms for the event-based domain is a promising approach.
Download

Paper Nr: 173
Title:

Large Filter Low-Level Processing by Edge TPU

Authors:

Gerald Krell and Thilo Pionteck

Abstract: Edge TPUs offer high processing power at a low cost and with minimal power consumption. They are particularly suitable for demanding tasks such as classification or segmentation using Deep Learning frameworks, acting as a neural coprocessor in host computers and mobile devices. The question arises as to whether this potential can be utilized beyond the specific domains for which the frameworks are originally designed. One example pertains to addressing various error classes by utilizing a trained deconvolution filter with a large filter size, requiring computation power that can be efficiently accelerated by the powerful matrix multiplication unit of the TPU. However, the application of the TPU is restricted because Edge TPU software is not fully open source, which restricts integration to existing Deep Learning frameworks and the Edge TPU compiler. Nonetheless, we demonstrate a method of estimating and utilizing a convolutional filter of large size on the TPU for this purpose. The deconvolution process is accomplished by utilizing pre-estimated convolutional filters offline to perform low-level preprocessing for various error classes, such as denoising, deblurring, and distortion removal.
Download

Paper Nr: 185
Title:

Comparing 3D Shape and Texture Descriptors Towards Tourette’s Syndrome Prediction Using Pediatric Magnetic Resonance Imaging

Authors:

Murilo Costa de Barros, Kaue N. Duarte, Chia-Jui Hsu, Wang-Tso Lee and Marco A. Garcia de Carvalho

Abstract: Tourette Syndrome (TS) is a neuropsychiatric disorder characterized by the presence of involuntary motor and vocal tics, with its etiology suggesting a strong and complex genetic basis. The detection of TS is mainly performed clinically, but brain imaging provides additional insights about anatomical structures. Interpreting brain patterns is challenging due to the complexity of the texture and shape of the anatomical regions. This study compares three-dimensional texture and shape features using Gray-Level Co-occurrence Matrix and Scale-Invariant Heat Kernel Signature. These features are analyzed in the context of TS classification (via Support Vector Machines), focusing on anatomical regions believed to be associated with TS. The evaluation is performed on structural Magnetic Resonance (MR) images of 68 individuals (34 TS patients and 34 healthy subjects). Results show that shape features achieve 92.6% accuracy in brain regions like the right thalamus and accumbens area, while texture features reach 73.5% accuracy in regions such as right putamen and left thalamus. Majority voting ensembles using shape features obtain 96% accuracy, with texture features achieving 79.4%. These findings highlight the influence of subcortical regions in the limbic system, consistent with existing literature on TS.
Download

Paper Nr: 201
Title:

Feature Selection Using Quantum Inspired Island Model Genetic Algorithm for Wheat Rust Disease Detection and Severity Estimation

Authors:

Sourav Samanta, Sanjay Chatterji and Sanjoy Pratihar

Abstract: In the context of smart agriculture, an early disease detection system is crucial to increase agricultural yield. A disease detection system based on machine learning can be an excellent tool in this regard. Wheat is one of the world’s most important crops. Leaf rust is one of the most significant wheat diseases. In this work, we have proposed a method to detect the leaf rust disease-affected areas in wheat leaves to estimate the severity of the disease. The method works on a reduced Color-GLCM (C-GLCM) feature set. The proposed feature selection method employs Quantum Inspired Island Model Genetic Algorithm to select the most compelling features from the C-GLCM set. The proposed feature selection method outperforms the classical feature selection methods. The healthy and diseased leaves are classified using four classifiers: Decision Tree, KNN, Support Vector Machine, and MLP. The MLP classifier achieved the highest accuracy of 99.20% with the proposed feature selection method. Following the detection of the diseased leaf, the k-means algorithm has been utilized to localize the lesion area. Finally, disease severity scores have been calculated and reported for various sample leaves.
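A gray-level co-occurrence matrix for a single offset, with one Haralick-style feature, can be sketched as follows. This is an illustrative C-GLCM building block only; the per-color-channel setup and the quantum-inspired selection are not shown.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Symmetric, normalized co-occurrence matrix for offset (dx, dy)."""
    img = np.asarray(img, dtype=float)
    top = img.max()
    q = (img / top * (levels - 1)).astype(int) if top > 0 else np.zeros(img.shape, int)
    m = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[q[y, x], q[y + dy, x + dx]] += 1   # count co-occurring gray levels
    m = m + m.T                                   # make symmetric
    return m / m.sum()

def glcm_contrast(m):
    """Haralick contrast: co-occurrences weighted by squared level difference."""
    i, j = np.indices(m.shape)
    return float(np.sum(m * (i - j) ** 2))
```

Features such as contrast (and energy, homogeneity, correlation) computed per color channel form the C-GLCM set from which the genetic algorithm selects.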
Download

Paper Nr: 228
Title:

Investigation of Deep Neural Network Compression Based on Tucker Decomposition for the Classification of Lesions in the Oral Cavity

Authors:

Vitor L. Fernandes, Adriano B. Silva, Danilo C. Pereira, Sérgio V. Cardoso, Paulo R. de Faria, Adriano M. Loyola, Thaína A. Tosta, Leandro A. Neves and Marcelo Z. do Nascimento

Abstract: Cancer of the oral cavity is one of the most common cancers, making it necessary to investigate lesions that could develop into cancer. Initial-stage lesions, called dysplasia, can develop into more severe stages of the disease and are characterized by variations in the shape and size of the nuclei of epithelial tissue cells. Due to advances in the areas of digital image processing and artificial intelligence, computer-aided diagnosis (CAD) systems have become a tool to help reduce the difficulties of analyzing and classifying lesions. This paper presents an investigation of the Tucker decomposition of tensors for different CNN models to classify dysplasia in histological images of the oral cavity. In addition to the Tucker decomposition, this study investigates the normalization of H&E dyes on the optimized CNN models to evaluate the behavior of the architectures in the classification stage of dysplasia lesions. The results show that for some of the optimized models, the use of normalization contributed to the performance of the CNNs for classifying dysplasia lesions. However, when the features obtained from the final layers of the CNNs associated with the machine learning algorithms were analyzed, it was noted that the normalization process affected performance during classification.
Download

Paper Nr: 243
Title:

Efficient and Accurate Hyperspectral Image Demosaicing with Neural Network Architectures

Authors:

Eric L. Wisotzky, Lara Wallburg, Anna Hilsmann, Peter Eisert, Thomas Wittenberg and Stephan Göb

Abstract: Neural network architectures for image demosaicing have become increasingly complex. This results in long training periods for such deep networks, and the networks themselves are huge. These two factors prevent practical implementation and usage of the networks on real-time platforms, which generally only have limited resources. This study investigates the effectiveness of neural network architectures in hyperspectral image demosaicing. We introduce a range of network models and modifications, and compare them with classical interpolation methods and existing reference network approaches. The aim is to identify robust and efficiently performing network architectures. Our evaluation is conducted on two datasets, "SimpleData" and "SimReal-Data," representing different degrees of realism in multispectral filter array (MSFA) data. The results indicate that our networks outperform or match reference models on both datasets, demonstrating exceptional performance. Notably, our approach focuses on achieving correct spectral reconstruction rather than just visual appeal, and this emphasis is supported by quantitative and qualitative assessments. Furthermore, our findings suggest that efficient demosaicing solutions, which require fewer parameters, are essential for practical applications. This research contributes valuable insights into hyperspectral imaging and its potential applications in various fields, including medical imaging.
Download

Paper Nr: 254
Title:

Two Nonlocal Variational Models for Retinex Image Decomposition

Authors:

Frank W. Hammond, Catalina Sbert and Joan Duran

Abstract: Retinex theory assumes that an image can be decomposed into illumination and reflectance components. In this work, we introduce two variational models to solve the ill-posed inverse problem of estimating illumination and reflectance from a given observation. Nonlocal regularization exploiting image self-similarities is used to estimate the reflectance, since it is assumed to contain fine details and texture. The difference between the proposed models comes from the selected prior for the illumination. Specifically, Tychonoff regularization, which promotes smooth solutions, and the total variation, which favours piecewise constant solutions, are independently proposed. A comprehensive theoretical analysis of the resulting functionals is presented within appropriate functional spaces, complemented by an experimental validation for thorough examination.
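In generic notation (a hedged sketch, not the paper's exact functionals), with the usual log-domain decomposition $s = l + r$ into illumination $l$ and reflectance $r$, the two models differ only in the illumination prior:

```latex
\begin{aligned}
E_{1}(l,r) &= \mathrm{TV}_{\mathrm{NL}}(r)
             + \tfrac{\mu}{2}\,\lVert \nabla l \rVert_{2}^{2}
             + \tfrac{\lambda}{2}\,\lVert l + r - s \rVert_{2}^{2}
             && \text{(Tychonoff prior: smooth } l\text{)},\\
E_{2}(l,r) &= \mathrm{TV}_{\mathrm{NL}}(r)
             + \mu\,\mathrm{TV}(l)
             + \tfrac{\lambda}{2}\,\lVert l + r - s \rVert_{2}^{2}
             && \text{(TV prior: piecewise-constant } l\text{)},
\end{aligned}
```

where $\mathrm{TV}_{\mathrm{NL}}$ denotes the nonlocal total variation on the reflectance and $\mu, \lambda > 0$ are weighting parameters.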
Download

Paper Nr: 258
Title:

Avoiding Undesirable Solutions of Deep Blind Image Deconvolution

Authors:

Antonie Brožová and Václav Šmídl

Abstract: Blind image deconvolution (BID) is a severely ill-posed optimization problem requiring additional information, typically in the form of regularization. Deep image prior (DIP) promises to model a naturally looking image due to a well-chosen structure of a neural network. The use of DIP in BID results in a significant performance improvement in terms of average PSNR. In this contribution, we offer a qualitative analysis of selected DIP-based methods w.r.t. two types of undesired solutions: a blurred image (no-blur) and a visually corrupted image (solution with artifacts). We perform a sensitivity study showing which aspects of the DIP-based algorithms help to avoid which undesired mode. We confirm that the no-blur can be avoided using either a sharp image prior or tuning of the hyperparameters of the optimizer. The artifact solution is a harder problem since variations that suppress the artifacts often suppress good solutions as well. Switching the loss from the L2 norm to the structural similarity index measure was found to be the most successful approach to mitigate the artifacts.
Download

Paper Nr: 265
Title:

SWViT-RRDB: Shifted Window Vision Transformer Integrating Residual in Residual Dense Block for Remote Sensing Super-Resolution

Authors:

Mohamed R. Ibrahim, Robert Benavente, Daniel Ponsa and Felipe Lumbreras

Abstract: Remote sensing applications, impacted by acquisition season and sensor variety, require high-resolution images. Transformer-based models improve satellite image super-resolution but are less effective than convolutional neural networks (CNNs) at extracting local details, crucial for image clarity. This paper introduces SWViT-RRDB, a new deep learning model for satellite imagery super-resolution. The SWViT-RRDB, combining transformer with convolution and attention blocks, overcomes the limitations of existing models by better representing small objects in satellite images. In this model, a pipeline of residual fusion group (RFG) blocks is used to combine the multi-headed self-attention (MSA) with residual in residual dense block (RRDB). This combines global and local image data for better super-resolution. Additionally, an overlapping cross-attention block (OCAB) is used to enhance fusion and allow interaction between neighboring pixels to maintain long-range pixel dependencies across the image. The SWViT-RRDB model and its larger variants outperform state-of-the-art (SoTA) models on two different satellite datasets in terms of PSNR and SSIM.
Download

Paper Nr: 289
Title:

An Image Sharpening Technique Based on Dilated Filters and 2D-DWT Image Fusion

Authors:

Victor Bogdan, Cosmin Bonchiş and Ciprian Orhei

Abstract: Image sharpening techniques are pivotal in image processing, serving to accentuate the contrast between darker and lighter regions in images. Building upon prior research that highlights the advantages of dilated kernels in edge detection algorithms, our study introduces a multi-level dilation wavelet scheme. This novel approach to Unsharp Masking involves processing the input image through a low-pass filter with varying dilation factors, followed by wavelet fusion. The visual outcomes of this method demonstrate marked improvements in image quality, notably enhancing details without introducing any undesirable crisping effects. Given the absence of a universally accepted index for optimal image sharpness in current literature, we have employed a range of metrics to evaluate the effectiveness of our proposed technique.
Download

Paper Nr: 300
Title:

Using Extended Light Sources for Relighting from a Small Number of Images

Authors:

Toshiki Hirao, Ryo Kawahara and Takahiro Okabe

Abstract: Relighting real scenes/objects is useful for applications such as augmented reality and mixed reality. In general, relighting of glossy objects requires a large number of images, because specular reflection components are sensitive to light source positions/directions, and linear interpolation with sparse light sources therefore does not work well. In this paper, we make use of not only point light sources but also extended light sources to efficiently capture specular reflection components and achieve relighting from a small number of images. Specifically, we propose a CNN-based method that simultaneously learns, in an end-to-end manner, the illumination module (illumination condition), i.e., the linear combinations of the point light sources and the extended light sources under which a small number of input images are taken, and the reconstruction module, which recovers the images under arbitrary point light sources from the captured images. We conduct a number of experiments using real images captured with a display-camera system, and confirm the effectiveness of our proposed method.
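The linearity of image formation that underlies such illumination modules can be sketched in a few lines: an image under a target light is a weighted sum of the images taken under the basis (point or extended) light sources. This is only a sketch of that principle, not the paper's CNN; names are illustrative:

```python
import numpy as np

def relight(basis_images, weights):
    """Combine basis-light images linearly to simulate a new light.

    basis_images: (n, h, w) stack, one image per basis light source;
    weights: n coefficients describing the target illumination as a
    linear combination of the basis lights.
    """
    basis = np.asarray(basis_images, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w, basis, axes=1)  # weighted sum over sources
```

Specular highlights break this simple interpolation when the basis lights are sparse, which is exactly why the paper adds extended light sources and a learned reconstruction module.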
Download

Paper Nr: 303
Title:

Transformer-Based Video Saliency Prediction with High Temporal Dimension Decoding

Authors:

Morteza Moradi, Simone Palazzo and Concetto Spampinato

Abstract: In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been a hot research topic in video saliency prediction (VSP). With the emergence of spatio-temporal transformers, the weakness of the prior strategies, e.g., 3D convolutional networks and LSTM-based networks, for capturing long-range dependencies has been effectively compensated. While VSP has drawn benefits from spatio-temporal transformers, finding the most effective way for aggregating temporal features is still challenging. To address this concern, we propose a transformer-based video saliency prediction approach with high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between features that are extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and aims at gradually reducing temporal features’ dimensions in the decoder. This decoder-based architecture yields comparable performance to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.
Download

Paper Nr: 312
Title:

FuDensityNet: Fusion-Based Density-Enhanced Network for Occlusion Handling

Authors:

Zainab Ouardirhi, Otmane Amel, Mostapha Zbakh and Sidi A. Mahmoudi

Abstract: Our research introduces an innovative approach for detecting occlusion levels and identifying objects with varying degrees of occlusion. We integrate 2D and 3D data through advanced network architectures, utilizing voxelized density-based occlusion assessment for improved visibility of occluded objects. By combining 2D image and 3D point cloud data through carefully designed network components, our method achieves superior detection accuracy in complex scenarios with occlusions. Experimental evaluation demonstrates adaptability across concatenation techniques, resulting in notable Average Precision (AP) improvements. Despite initial testing on a limited dataset, our method shows competitive performance, suggesting potential for further refinement and scalability. This research significantly contributes to advancements in effective occlusion handling for object detection methodologies.
Download

Paper Nr: 315
Title:

On the Use of Visual Transformer for Image Complexity Assessment

Authors:

Luigi Celona, Gianluigi Ciocca and Raimondo Schettini

Abstract: Perceiving image complexity is a crucial aspect of human visual understanding, yet explicitly assessing image complexity poses challenges. Historically, this aspect has been understudied due to its inherent subjectivity, stemming from its reliance on human perception, and the semantic dependency of image complexity in the face of diverse real-world images. Different computational models for image complexity estimation have been proposed in the literature. These models leverage a variety of techniques ranging from low-level, hand-crafted features, to advanced machine learning algorithms. This paper explores the use of recent deep-learning approaches based on Visual Transformer to extract robust information for image complexity estimation in a transfer learning paradigm. Specifically, we propose to leverage three visual backbones, CLIP, DINO-v2, and ImageNetViT, as feature extractors, coupled with a Support Vector Regressor with Radial Basis Function kernel as an image complexity estimator. We test our approach on two widely used benchmark datasets (i.e. IC9600 and SAVOIAS) in an intra-dataset and inter-dataset workflow. Our experiments demonstrate the effectiveness of the CLIP-based features for accurate image complexity estimation with results comparable to end-to-end solutions.
Download

Paper Nr: 339
Title:

Camera Self-Calibration from Two Views with a Common Direction

Authors:

Yingna Su, Xinnian Guo and Yang Shen

Abstract: Camera calibration is crucial for enabling accurate and robust visual perception. This paper addresses the challenge of recovering intrinsic camera parameters from two views of a planar surface, a problem that has received limited attention due to its inherent degeneracy. For cameras equipped with Inertial Measurement Units (IMUs), such as those in smartphones and drones, the camera’s y-axes can be aligned with the gravity direction, reducing the relative orientation to a single degree of freedom (1-DoF). A key insight is the general orthogonality between the ground plane and the gravity direction. Leveraging this ground plane constraint, the paper introduces new homography-based minimal solutions for camera self-calibration with a known gravity direction. We derive 2.5- and 3.5-point camera self-calibration algorithms for points in the ground plane to enable simultaneous estimation of the camera’s focal length and principal point. The paper demonstrates the practicality and efficiency of these algorithms through comparisons to existing state-of-the-art methods, confirming their reliability under various levels of noise and different camera configurations.
Download

Paper Nr: 340
Title:

Neural Style Transfer for Vector Graphics

Authors:

Ivan Jarsky, Valeria Efimova, Artyom Chebykin, Viacheslav Shalamov and Andrey Filchenkov

Abstract: Neural style transfer draws researchers’ attention, but the interest focuses on bitmap images. Various models have been developed for bitmap image generation, both online and offline, with arbitrary and pre-trained styles. However, style transfer between vector images has hardly been considered. Our research shows that applying standard content and style losses changes the drawing style of a vector image only insignificantly, because the structure of vector primitives differs greatly from that of pixels. To handle this problem, we introduce new loss functions. We also develop a new method based on differentiable rasterization that uses these loss functions and can change the color and shape parameters of the content image corresponding to the drawing of the style image. Qualitative experiments demonstrate the effectiveness of the proposed VectorNST method compared with the state-of-the-art neural style transfer approaches for bitmap images and the only existing approach for stylizing vector images, DiffVG. Although the proposed model does not achieve the quality and smoothness of style transfer between bitmap images, we consider our work an important early step in this area. VectorNST code and demo service are available at https://github.com/IzhanVarsky/VectorNST.
Download

Paper Nr: 353
Title:

Fast and Reliable Inpainting for Real-Time Immersive Video Rendering

Authors:

Jakub Stankowski and Adrian Dziembowski

Abstract: In this paper, the authors describe a fast view inpainting algorithm dedicated to practical, real-time immersive video systems. Inpainting is an inherent step of the entire virtual view rendering process, allowing a high Quality of Experience (QoE) for a user of the immersive video system. The authors propose a novel approach for inpainting, based on dividing the inpainting process into two independent, highly parallelizable stages: view analysis and hole filling. In total, four methods of view analysis and two methods of hole filling were developed, implemented, and evaluated, both in terms of computational time and quality of the virtual view. The proposed technique was compared against an efficient state-of-the-art iterative inpainting technique. The results show that the proposal achieves good objective and subjective quality, requiring less than 2 ms to inpaint a frame of a typical FullHD multiview sequence.
Download

Paper Nr: 359
Title:

ELSA: Expanded Latent Space Autoencoder for Image Feature Extraction and Classification

Authors:

Emerson Vilar de Oliveira, Dunfrey P. Aragão and Luiz G. Gonçalves

Abstract: In the field of computer vision, image classification has been aiding in the understanding and labeling of images. Machine learning and artificial intelligence algorithms, especially artificial neural networks, are widely used tools for this task. In this work, we present the Expanded Latent Space Autoencoder (ELSA). The ELSA network consists of more than one autoencoder in its internal structure, concatenating their latent spaces to construct an expanded latent space. The expanded latent space aims to extract more information from input data. Thus, this expanded latent space can be used by other networks for general tasks such as prediction and classification. To evaluate these capabilities, we created an image classification network for the Fashion-MNIST and MNIST datasets, achieving 99.97 and 99.98 accuracy on the test datasets. The classifier trained with the expanded latent space dataset outperforms some models in public benchmarks.
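The core ELSA idea, concatenating several latent codes into one expanded representation, can be sketched as follows; the encoders here are arbitrary callables standing in for the trained autoencoder halves, and all names are illustrative:

```python
import numpy as np

def expanded_latent(encoders, x):
    """Concatenate the latent codes produced by several encoders.

    Each encoder maps the input to its own latent vector; the
    concatenation forms the expanded latent space that a downstream
    classifier or predictor can consume.
    """
    return np.concatenate([np.asarray(enc(x)) for enc in encoders], axis=-1)
```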
Download

Paper Nr: 366
Title:

On Granularity Variation of Air Quality Index Vizualization from Sentinel-5

Authors:

Jordan S. Cuno, Arthur A. Bezerra, Aura Conci and Luiz G. Gonçalves

Abstract: Air quality has been a hot research topic not only because it is directly related to climate change and the greenhouse effect, but mostly because it has been strongly associated with the transmission of respiratory diseases. Considering that different pollutants affect air quality, a methodology based on satellite data processing is proposed. The objective is to obtain images and measure the main atmospheric pollutants in Brazil. Using satellite systems with spectrometers is an alternative technology that has been recently developed for dealing with such a problem. Sentinel-5 is one of these satellites; it constantly monitors the Earth's surface, generating a vast amount of data mainly for climate monitoring, and it is used in this research. The main contribution of this research is a computational workflow that uses Sentinel-5 data to generate images of Brazil and its states, in addition to calculating the average values of the main atmospheric pollutants; these data can be used to predict pollution and to identify the most polluted regions.
Download

Paper Nr: 369
Title:

Improving Low-Light Image Recognition Performance Based on Image-Adaptive Learnable Module

Authors:

Seitaro Ono, Yuka Ogino, Takahiro Toizumi, Atsushi Ito and Masato Tsukada

Abstract: In recent years, significant progress has been made in image recognition technology based on deep neural networks. However, improving recognition performance under low-light conditions remains a significant challenge. This study addresses the enhancement of recognition model performance in low-light conditions. We propose an image-adaptive learnable module, which applies appropriate image processing to input images, and a hyperparameter predictor to forecast the optimal parameters used in the module. Our proposed approach enhances recognition performance under low-light conditions, since it can easily be integrated as a front-end filter without the need to retrain existing recognition models designed for low-light conditions. Through experiments, our proposed method demonstrates its contribution to enhancing image recognition performance under low-light conditions.
Download

Paper Nr: 374
Title:

Word and Image Embeddings in Pill Recognition

Authors:

Richárd Rádli, Zsolt Vörösházi and László Czúni

Abstract: Pill recognition is a key task in healthcare and has a wide range of applications. In this study, we address the challenge of improving the accuracy of pill recognition in a metrics learning framework. A multi-stream visual feature extraction and processing architecture, with multi-head attention layers, is used to estimate the similarity of pills. We introduce an essential enhancement to the triplet loss function that leverages word embeddings to inject textual pill similarity into the visual model. This improvement refines the visual embedding on a finer scale than conventional triplet loss models, resulting in higher accuracy of the visual model. Experiments and evaluations are made on a new, freely available pill dataset.
Download

Paper Nr: 382
Title:

RecViT: Enhancing Vision Transformer with Top-Down Information Flow

Authors:

Štefan Pócoš, Iveta Bečková and Igor Farkaš

Abstract: We propose and analyse a novel neural network architecture — recurrent vision transformer (RecViT). Building upon the popular vision transformer (ViT), we add a biologically inspired top-down connection, letting the network ‘reconsider’ its initial prediction. Moreover, using a recurrent connection creates space for feeding multiple similar, yet slightly modified or augmented inputs into the network in a single forward pass. As it has been shown that a top-down connection can increase accuracy in the case of convolutional networks, we analyse our architecture, combined with multiple training strategies, in the adversarial examples (AEs) scenario. Our results show that some versions of RecViT indeed exhibit more robust behaviour than the baseline ViT, yielding ≈18 % and ≈22 % absolute improvement in robustness on the tested datasets while the accuracy drop was only ≈1 %. We also leverage the fact that transformer networks have a certain level of inherent explainability. By visualising attention maps of various input images, we gain some insight into the inner workings of our network. Finally, using annotated segmentation masks, we numerically compare the quality of attention maps on original and adversarial images.
Download

Paper Nr: 385
Title:

A Learning Paradigm for Interpretable Gradients

Authors:

Felipe T. Figueroa, Hanwei Zhang, Ronan Sicre, Yannis Avrithis and Stephane Ayache

Abstract: This paper studies the interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from fully connected layers and gradients through variants of backpropagation. However, it is well understood that gradients are noisy, and alternatives like guided backpropagation have been proposed to obtain better visualization at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and quantitatively improves the interpretability properties of different networks, using several interpretability methods.
Download

Paper Nr: 402
Title:

Analysis of Scattering Media by High-Frequency Polarized Light Projection Using Polarizing Projector

Authors:

Aigo Ohno, Fumihiko Sakaue and Jun Sato

Abstract: This paper proposes a special projection method called high-frequency polarized light projection, which uses a polarizing projector to analyze scenes filled with a scattering medium, and proposes a method to separate, in the observed image, light reflected by objects from light scattered by the medium. In high-frequency polarized light projection, a high-frequency pattern is created by light with different polarization directions and projected onto the scattering medium, and the reflected light is observed. The light scattered by the medium and the light reflected from objects have different polarization properties, and we show that these two types of light can be easily separated.
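A heavily simplified version of the two-measurement separation idea: if light scattered by the medium retains the projected polarization while object reflection is depolarized (an assumption made here purely for illustration; the paper's high-frequency projection scheme is more involved), two analyzer orientations suffice:

```python
import numpy as np

def separate_polarized(i_par, i_perp):
    """Split two analyzer images into reflected and scattered parts.

    Assumes the depolarized (object-reflected) component splits
    evenly between the parallel and perpendicular images, while the
    polarized (medium-scattered) component appears only in one.
    """
    i_par = np.asarray(i_par, dtype=float)
    i_perp = np.asarray(i_perp, dtype=float)
    reflected = 2.0 * np.minimum(i_par, i_perp)  # depolarized part
    scattered = (i_par + i_perp) - reflected     # polarized residue
    return reflected, scattered
```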
Download

Paper Nr: 403
Title:

Probabilistic NeRF for 3D Shape Recovery in Scattered Medium

Authors:

Yoshiki Ono, Fumihiko Sakaue and Jun Sato

Abstract: This research proposes a method for analyzing scene information, including the characteristics of the medium, by representing a space containing objects and scattering media such as fog and smoke using the NeRF (Neural Radiance Fields) (Mildenhall et al., 2020) light ray field representation. In this study, we focus on the fact that the behavior of rays inside a scattering medium can be expressed probabilistically, and we present a method for rendering a probabilistically varying image from a single ray rather than the entire scattering process. By combining this method with a scene representation based on stochastic gradient descent and a neural network, we show that it is possible to analyze scene information without generating images that directly render light scattering.
Download

Paper Nr: 411
Title:

Dense Light Field Imaging with Mixed Focus Camera

Authors:

Masato Hirose, Fumihiko Sakaue and Jun Sato

Abstract: In this study, we propose a method for acquiring a dense light field in a single shot by taking advantage of the sparsity of the 4D light field (LF). Acquiring the LF with one camera is a challenging task due to the amount of data. Various methods exist to acquire the LF efficiently, such as using micro-lens arrays. However, these methods capture images with a single image sensor, which improves directional resolution but reduces positional resolution. In our method, the focal length of the lens is varied, and the exposure is controlled on a pixel-by-pixel level when capturing a single image to obtain a mixed focus image, in which each pixel is captured at a different focal length. Furthermore, by analyzing the captured image with an image generator that does not require prior learning, we show how to recover an LF image that is denser than the captured image. With our method, a high-density LF consisting of 5x5 images can be successfully reconstructed from only a single mixed-focus image taken in a simulated environment.
Download

Paper Nr: 416
Title:

Optimization and Learning Rate Influence on Breast Cancer Image Classification

Authors:

Gleidson G. Barbosa, Larissa R. Moreira, Pedro Moises de Sousa, Rodrigo Moreira and André R. Backes

Abstract: Breast cancer is a prevalent and challenging pathology, with significant mortality rates, affecting both women and men. Despite advancements in technology, such as Computer-Aided Diagnosis (CAD) and awareness campaigns, timely and accurate diagnosis remains a crucial issue. This study investigates the performance of Convolutional Neural Networks (CNNs) in predicting and supporting breast cancer diagnosis, considering the BreakHis and Biglycan datasets. Through a partial factorial method, we measured the impact of the optimization and learning rate factors on prediction model accuracy. By measuring each factor’s level of influence on the validation accuracy response variable, this paper brings valuable insights into relevance analyses and CNN behavior. Furthermore, the study sheds light on the explainability of Artificial Intelligence (AI) through a partial factorial performance evaluation design. Among the results, we determine which hyperparameters influenced model performance and to what extent. The findings contribute to the image-based medical diagnosis field, fostering the integration of computational and machine learning approaches to enhance breast cancer diagnosis and treatment.
Download

Paper Nr: 426
Title:

Multimodal Crowd Counting with Pix2Pix GANs

Authors:

Muhammad Asif Khan, Hamid Menouar and Ridha Hamila

Abstract: Most state-of-the-art crowd counting methods use color (RGB) images to learn the density map of the crowd. However, these methods often struggle to achieve higher accuracy in densely crowded scenes with poor illumination. Recently, some studies have reported improvement in the accuracy of crowd counting models using a combination of RGB and thermal images. Although multimodal data can lead to better predictions, multimodal data might not always be available beforehand. In this paper, we propose the use of generative adversarial networks (GANs) to automatically generate thermal infrared (TIR) images from color (RGB) images and use both to train crowd counting models to achieve higher accuracy. We first use a Pix2Pix GAN network to translate RGB images to TIR images. Our experiments on several state-of-the-art crowd counting models and benchmark crowd datasets report significant improvement in accuracy.
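The pipeline reduces to two steps: translate RGB to a synthetic TIR channel, then stack both as the counting model's input. A minimal sketch, with any callable standing in for the trained Pix2Pix generator (names are illustrative):

```python
import numpy as np

def fuse_rgb_tir(rgb, tir_generator):
    """Build a 4-channel RGB+TIR input for a crowd counting model.

    rgb: (h, w, 3) array; tir_generator: a callable returning an
    (h, w) synthetic thermal map, standing in for the Pix2Pix
    RGB-to-TIR translator.
    """
    tir = np.asarray(tir_generator(rgb), dtype=float)
    return np.concatenate([rgb, tir[..., None]], axis=-1)
```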
Download

Paper Nr: 36
Title:

Towards Better Morphed Face Images Without Ghosting Artifacts

Authors:

Clemens Seibold, Anna Hilsmann and Peter Eisert

Abstract: Automatic generation of morphed face images often produces ghosting artifacts due to poorly aligned structures in the input images. Manual processing can mitigate these artifacts. However, this is not feasible for the generation of large datasets, which are required for training and evaluating robust morphing attack detectors. In this paper, we propose a method for automatic prevention of ghosting artifacts based on a pixel-wise alignment during morph generation. We evaluate our proposed method on state-of-the-art detectors and show that our morphs are harder to detect, particularly, when combined with style-transfer-based improvement of low-level image characteristics. Furthermore, we show that our approach does not impair the biometric quality, which is essential for high quality morphs.
Download

Paper Nr: 47
Title:

Teeth Localization and Lesion Segmentation in CBCT Images Using SpatialConfiguration-Net and U-Net

Authors:

Arnela Hadzic, Barbara Kirnbauer, Darko Štern and Martin Urschler

Abstract: The localization of teeth and segmentation of periapical lesions in cone-beam computed tomography (CBCT) images are crucial tasks for clinical diagnosis and treatment planning, which are often time-consuming and require a high level of expertise. However, automating these tasks is challenging due to variations in shape, size, and orientation of lesions, as well as similar topologies among teeth. Moreover, the small volumes occupied by lesions in CBCT images pose a class imbalance problem that needs to be addressed. In this study, we propose a deep learning-based method utilizing two convolutional neural networks: the SpatialConfiguration-Net (SCN) and a modified version of the U-Net. The SCN accurately predicts the coordinates of all teeth present in an image, enabling precise cropping of teeth volumes that are then fed into the U-Net which detects lesions via segmentation. To address class imbalance, we compare the performance of three reweighting loss functions. After evaluation on 144 CBCT images, our method achieves a 97.3% accuracy for teeth localization, along with a promising sensitivity and specificity of 0.97 and 0.88, respectively, for subsequent lesion detection.
Download

Paper Nr: 96
Title:

Estimation of the Inference Quality of Machine Learning Models for Cutting Tools Inspection

Authors:

Kacper Marciniak, Paweł Majewski and Jacek Reiner

Abstract: The ongoing trend in industry to continuously improve the efficiency of production processes is driving the development of vision-based inspection and measurement systems. With recent significant advances in artificial intelligence, machine learning methods are increasingly being applied to these systems. Strict requirements are placed on measurement and control systems regarding accuracy, repeatability, and robustness against variation in working conditions. Machine learning solutions are often unable to meet these requirements, as they are highly sensitive to input data variability. Given these difficulties, an original method for estimating inference quality is proposed. It is based on feature space analysis and an assessment of the degree of dissimilarity between the input data and the training set, described using explicit metrics proposed by the authors. The developed solution has been integrated with an existing system for measuring geometric parameters and determining cutting tool wear, allowing continuous monitoring of the quality of the obtained results and enabling the system operator to take appropriate action in case of a drop below the adopted threshold values.
Download

Paper Nr: 126
Title:

Character Identification in Images Extracted from Portuguese Manuscript Historical Documents

Authors:

Gustavo C. Lacerda and Raimundo S. Vasconcelos

Abstract: The creation of writing has facilitated humanity's accumulation and sharing of knowledge; it is a vital part of what differentiates humans from other animals and is highly important to the culture of all peoples. The first human records (manuscripts) and the historical documents of organizations and families have thus gained new perspectives with the digital age. These handwritten records remained the primary source for the history of countries, including Brazil before the period of independence, until the Gutenberg movable-type printing press dominated the archival world. Over the decades, these handwritten documents, due to their fragility, became difficult to access and manipulate. This has changed with the possibility of digitization and, consequently, distribution over the internet. Therefore, this work presents a solution for transcribing historical texts written in Portuguese, bringing accessibility, searchability, sharing, and preservation to these records, and achieving recognition of 97% of the letters in the database used.
Download

Paper Nr: 127
Title:

Identifying Representative Images for Events Description Using Machine Learning

Authors:

Marcos V. Soares de Sousa and Raimundo S. Vasconcelos

Abstract: The use of social networks to record events – disasters, demonstrations, parties – has grown considerably and has begun to receive attention in recent years. Existing research focuses primarily on analyzing text-based messages from social media platforms such as Twitter. Images, photos and other media are increasingly used and can provide valuable information to enhance the understanding of an event, and they can serve as indicators of relevance. This work explores the Twitter social media platform, based on image and text, in the case of the demonstrations that took place in Brazil on September 7, 2021, during the Independence celebrations. This work uses machine learning techniques (VGG-16, VGG-19, ResNet50v2 and InceptionResNetv2) to find relevant Twitter images. The results show that the existence of an image within a social media message can serve as a high-probability indicator of relevant content. An extensive experimental evaluation was carried out and demonstrated that high efficiency gains can be obtained compared to state-of-the-art methods.
Download

Paper Nr: 146
Title:

A Comparative Analysis of the Three-Alternative Forced Choice Method and the Slider-Based Method in Subjective Experiments: A Case Study on Contrast Preference Task

Authors:

Olga Cherepkova, Seyed Ali Amirshahi and Marius Pedersen

Abstract: When it comes to collecting subjective data in the field of image quality assessment, different approaches have been proposed. Most datasets in the field ask observers to evaluate the quality of different test and reference images. However, a number of datasets ask observers to change one or more properties of the image to enhance it to its best possible quality. Among the methods used in the second approach are the Three-Alternative Forced Choice (3AFC) and slider-based methods. In this paper, we study and compare the two mentioned methods in the case of collecting contrast preferences for natural images. Fifteen observers participated in two experiments under controlled settings, incorporating 499 unique and 100 repeated images. The reliability of the answers and the differences between the two methods were analyzed. The results revealed a general lack of correlation in contrast preferences between the two methods. The slider-based method generally yielded lower contrast preference values than the 3AFC experiment. In the case of repeated images, the slider-based method showed greater consistency in the subjective scores given by each observer. These results suggest that neither method can serve as a direct substitute for the other, as they exhibited low correlation and statistically significant differences in results. The slider-based experiment offered the advantage of significantly shorter completion times, contributing to higher observer satisfaction. In contrast, the 3AFC task provided a more robust interface for collecting preferences. By comparing the results obtained by the two methods, this study provides information on their respective strengths, limitations, and suitability for use in similar preference acquisition tasks.
Download

Paper Nr: 199
Title:

Most Relevant Viewpoint of an Object: A View-Dependent 3D Saliency Approach

Authors:

Marie Pelissier-Combescure, Sylvie Chambon and Géraldine Morin

Abstract: A viewpoint of a 3D object is the position from which we observe the object. A viewpoint always highlights some 3D parts of an object and discards others. Here, we define a good viewpoint as one offering a relevant view of the object: a view that best showcases the object and that is the most representative of it. Best view selection plays an essential role in many computer vision and virtual reality applications. In this paper, given a model and a particular viewpoint, we want to quantify its relevance, not its aesthetics. We propose a geometric method for selecting the most relevant viewpoint for a 3D object by combining visibility and view-dependent saliency. Evaluating the quality of an estimated best viewpoint is a challenge. Thus, we propose an evaluation protocol that considers two different and complementary solutions: a user study with more than 200 participants to collect human preferences, and an analysis of an image dataset picturing objects of interest. This evaluation highlights the correlation between our method and human preferences. A quantitative comparison demonstrates the efficiency of our approach over reference methods.
Download

Paper Nr: 214
Title:

XYZ Unsupervised Network: A Robust Image Dehazing Approach

Authors:

Percy Maldonado-Quispe and Helio Pedrini

Abstract: In this work, we examine a major but little-explored topic in image dehazing neural networks: how to remove haze (a natural phenomenon) from a given image in an unsupervised manner. By considering a hazy image as the entanglement of several “simpler” layers, such as a haze-free image layer, a transmission map layer, and an atmospheric light layer, as described by the atmospheric scattering model, we propose a method based on the concept of layer disentanglement. Our XYZ approach, a combination of the XHOT, YOLY, and ZID methods that retains the advantages of each, improves on the SSIM and PSNR metrics. The main benefits of the proposed XYZ are twofold. First, since it is an unsupervised approach, no clean photos, including hazy-clear pairs, are used as ground truth; in other words, it departs from the traditional paradigm of training deep models on large datasets. Second, it treats haze as being composed of several layers.
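As a rough illustration of the layer view, the atmospheric scattering model the abstract refers to composes a hazy pixel from a haze-free radiance layer, a transmission layer, and an atmospheric light (airlight) layer. A minimal pure-Python sketch (the numeric values are illustrative, not from the paper):

```python
# Atmospheric scattering model: a hazy intensity I is modelled as
#   I = J * t + A * (1 - t)
# where J is the haze-free radiance, t the transmission, A the airlight.

def hazy_pixel(J, t, A):
    """Compose a hazy intensity from its three 'layers'."""
    return J * t + A * (1.0 - t)

def recover_radiance(I, t, A, t_min=0.1):
    """Invert the model; t is clamped to avoid division blow-up."""
    return (I - A * (1.0 - t)) / max(t, t_min)

I = hazy_pixel(J=0.8, t=0.6, A=0.9)      # simulate haze on one pixel
J_hat = recover_radiance(I, t=0.6, A=0.9)  # disentangle it again
```

The unsupervised setting amounts to estimating J, t, and A jointly from I alone, which is why the paper decomposes the image into layers rather than learning from hazy-clear pairs.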
Download

Paper Nr: 221
Title:

Combining Total Variation and Nonlocal Variational Models for Low-Light Image Enhancement

Authors:

Daniel Torres, Catalina Sbert and Joan Duran

Abstract: Images captured under low-light conditions impose significant limitations on the performance of computer vision applications. Therefore, improving their quality by discounting the effects of the illumination is crucial. In this paper, we present a low-light image enhancement method based on the Retinex theory. Our approach estimates illumination and reflectance in two steps. First, the illumination is obtained as the minimizer of an energy functional involving total variation regularization, which favours piecewise smooth solutions. Next, the reflectance component is computed as the minimizer of an energy functional involving contrast-invariant nonlocal regularization and a fidelity term preserving the largest gradients of the input image.
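As a hedged illustration of the first step, TV-regularized illumination estimation can be sketched in one dimension with a smoothed TV term and plain gradient descent (all parameter values here are illustrative assumptions, not the paper's):

```python
import math

def estimate_illumination(I, lam=0.05, steps=500, lr=0.1, eps=1e-3):
    """Minimize sum_i (L_i - I_i)^2 + lam * sum_i sqrt((L_{i+1} - L_i)^2 + eps)
    by gradient descent; the smoothed TV term favours piecewise smooth L."""
    L = list(I)
    n = len(L)
    for _ in range(steps):
        g = [2.0 * (L[i] - I[i]) for i in range(n)]  # fidelity gradient
        for i in range(n - 1):                       # smoothed-TV gradient
            d = L[i + 1] - L[i]
            w = d / math.sqrt(d * d + eps)
            g[i] -= lam * w
            g[i + 1] += lam * w
        L = [L[i] - lr * g[i] for i in range(n)]
    return L

# A toy 1D "image": two nearly flat plateaus with a sharp illumination edge.
I = [0.2, 0.21, 0.19, 0.8, 0.82, 0.81]
L = estimate_illumination(I)
```

The estimate flattens the small wiggles inside each plateau while largely preserving the sharp edge, which is the piecewise-smooth behaviour the functional is designed to favour.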
Download

Paper Nr: 229
Title:

Oral Dysplasia Classification by Using Fractal Representation Images and Convolutional Neural Networks

Authors:

Rafael O. Carvalho, Adriano B. Silva, Alessandro S. Martins, Sérgio V. Cardoso, Guilherme R. Freire, Paulo R. de Faria, Adriano M. Loyola, Thaína A. Tosta, Leandro A. Neves and Marcelo Z. do Nascimento

Abstract: Oral cavity lesions can be graded by specialists, a task that is both difficult and subjective. The challenges in defining patterns can lead to inconsistencies in the diagnosis, often due to color variations in the histological images. The development of computational systems has emerged as an effective approach for aiding specialists in the diagnosis process, with color normalization techniques shown to enhance diagnostic accuracy. There remains an open challenge in understanding the impact of color normalization on the classification of histological tissues representing dysplasia groups. This study presents an approach to classify dysplasia lesions based on ensemble models, fractal representations, and convolutional neural networks (CNN). Additionally, this work evaluates the influence of color normalization in the preprocessing stage. The results obtained with the proposed methodology were analyzed with and without the preprocessing stage. This approach was applied to a dataset composed of 296 histological images categorized into healthy, mild, moderate, and severe oral epithelial dysplasia tissues. The proposed ensemble-based approaches were evaluated with cross-validation, resulting in accuracy rates ranging from 96.1% to 98.5% on the non-normalized dataset. This approach can be employed as a supplementary tool for clinical applications, aiding specialists in decision-making regarding lesion classification.
Download

Paper Nr: 236
Title:

Automated Brain Lobe Segmentation and Feature Extraction from Multiple Sclerosis Lesions Using Deep Learning

Authors:

Nada Haj Messaoud, Rim Ayari, Asma Ben Abdallah and Mohamed Hedi Bedoui

Abstract: This study focuses on automating the segmentation of brain lobes in MRI images of Multiple Sclerosis (MS) lesions to extract crucial features for predicting disability levels. Extracting significant features from MRI images of MS lesions is a complex task due to the variability in lesion characteristics and the detailed nature of MRI images. Furthermore, existing studies require continuous patient monitoring. Our contribution therefore lies in proposing an approach for the automatic segmentation of brain lobes and the extraction of lesion features (number, size, location, etc.) to predict disability levels in MS patients. To achieve this, we introduce a model inspired by U-Net to segment the different brain lobes, aiming to accurately locate the MS lesions. We utilized two databases, one private and one public, and achieved a mean IoU score of 0.70, which can be considered encouraging. Following the segmentation phase, approximately 7,200 features were extracted from the MRI scans of MS patients.
Download

Paper Nr: 259
Title:

SynthRSF: A Novel Photorealistic Synthetic Dataset for Adverse Weather Condition Denoising

Authors:

Angelos Kanlis, Vazgken Vanian, Sotiris Karvarsamis, Ioanna Gkika, Konstantinos Konstantoudakis and Dimitrios Zarpalas

Abstract: This paper presents the SynthRSF dataset for training and evaluating single-image rain, snow and haze denoising algorithms, as well as evaluating object detection, semantic segmentation, and depth estimation performance in noisy or denoised images. Our dataset features 26,893 noisy images, each accompanied by its corresponding ground truth image. It further includes 13,800 noisy images accompanied by ground truth, 16-bit depth maps and pixel-accurate annotations for various object instances in each frame. The utility of SynthRSF is assessed by training unified models for rain, snow, and haze removal, achieving good objective metrics and excellent subjective results compared to existing adverse weather condition datasets. Furthermore, we demonstrate its use as a benchmark for the performance of an object detection algorithm in weather-degraded image datasets.
Download

Paper Nr: 282
Title:

Curriculum for Crowd Counting: Is It Worthy?

Authors:

Muhammad Asif Khan, Hamid Menouar and Ridha Hamila

Abstract: Recent advances in deep learning techniques have achieved remarkable performance in several computer vision problems. A notably intuitive technique called Curriculum Learning (CL) has recently been introduced for training deep learning models. Surprisingly, curriculum learning achieves significantly improved results in some tasks but marginal or no improvement in others. Hence, there is still debate about its adoption as a standard method for training supervised learning models. In this work, we investigate the impact of curriculum learning on crowd counting using the density estimation method. We performed detailed investigations by conducting 112 experiments covering six different CL settings and eight different crowd models. Our experiments show that curriculum learning improves model learning performance and shortens the convergence time.
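For readers unfamiliar with CL, its core idea can be sketched as a pacing function that feeds the model an easy subset first and grows it over epochs. The difficulty scores, schedule, and sample names below are hypothetical, not the paper's settings:

```python
def curriculum_subset(samples, difficulty, epoch, total_epochs, start_frac=0.3):
    """Return the training subset for `epoch`: easiest samples first,
    with the subset growing linearly until the full set is used."""
    order = sorted(range(len(samples)), key=lambda i: difficulty[i])
    frac = min(1.0, start_frac + (1.0 - start_frac) * epoch / max(1, total_epochs - 1))
    k = max(1, round(frac * len(samples)))
    return [samples[i] for i in order[:k]]

samples = ["img_a", "img_b", "img_c", "img_d"]
difficulty = [0.9, 0.1, 0.5, 0.3]  # e.g. crowd density as a difficulty proxy
first = curriculum_subset(samples, difficulty, epoch=0, total_epochs=10)
last = curriculum_subset(samples, difficulty, epoch=9, total_epochs=10)
```

Here `first` contains only the easiest image, while `last` is the full set ordered from easy to hard; the six CL settings studied in the paper vary this kind of scoring and pacing.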
Download

Paper Nr: 294
Title:

Learning Projection Patterns for Direct-Global Separation

Authors:

Takaoki Ueda, Ryo Kawahara and Takahiro Okabe

Abstract: Separating the direct component, such as diffuse and specular reflection, from the global component, such as inter-reflection and subsurface scattering, is important for various computer vision and computer graphics applications. Conventionally, high-frequency patterns designed by physics-based models or signal processing theory are projected from a projector onto a scene, but their assumptions do not necessarily hold for real images due to the shallow depth of field of a projector and the limited spatial resolution of a camera. Accordingly, in this paper, we propose a data-driven approach to direct-global separation. Specifically, our method learns not only the separation module but also the imaging module, i.e., the projection patterns, at the same time in an end-to-end manner. We conduct a number of experiments using real images captured with a projector-camera system and confirm the effectiveness of our method.
Download

Paper Nr: 310
Title:

Influence of Pixel Perturbation on eXplainable Artificial Intelligence Methods

Authors:

Juliana C. Feitosa, Mateus Roder, João P. Papa and José F. Brega

Abstract: The current scenario around Artificial Intelligence (AI) has demanded increasingly transparent explanations of existing models. The use of eXplainable Artificial Intelligence (XAI) has been considered a solution in the search for explainability. As such, XAI methods can be used to verify the influence of adverse scenarios, such as pixel perturbation, on AI models for segmentation. This paper presents experiments performed with fish images of the Pacu species to determine the influence of pixel perturbation through the following explainable methods: Grad-CAM, Saliency Map, Layer Grad-CAM, and CNN Filters. The perturbed pixels were those considered most important for the model during the segmentation of the input image regions. Among existing pixel perturbation techniques, the images were subjected to three main ones: white noise, black noise, and random noise. From the results obtained, it was observed that the Grad-CAM method behaved differently for each perturbation technique tested, while the CNN Filters method showed more stability in the variation of the image averaging. The Saliency Map was the least sensitive to the three types of perturbation, as it required fewer iterations. Furthermore, of the perturbation techniques tested, black noise showed the least ability to impact segmentation. Thus, we conclude that the perturbation methods influence the outcome of the explainable models tested and interfere with these models in different ways. We suggest that the experiments presented here be replicated on other AI models, other explainability methods, and other existing perturbation techniques to gather more evidence about this influence and, from that, quantify which combination of XAI method and pixel perturbation is best for a given problem.
Download

Paper Nr: 318
Title:

Convolutional Neural Networks and Image Patches for Lithological Classification of Brazilian Pre-Salt Rocks

Authors:

Mateus Roder, Leandro A. Passos, Clayton Pereira, João P. Papa, Altanir D. Mello Junior, Marcelo Fagundes de Rezende, Yaro P. Silva and Alexandre Vidal

Abstract: Lithological classification is a process employed to recognize and interpret distinct structures of rocks, providing essential information regarding their petrophysical, morphological, textural, and geological aspects. The process is particularly interesting regarding carbonate sedimentary rocks in the context of petroleum basins since such rocks can store large quantities of natural gas and oil. Thus, their features are intrinsically correlated with the production potential of an oil reservoir. This paper proposes an automatic pipeline for the lithological classification of carbonate rocks into seven distinct classes, comparing nine state-of-the-art deep learning architectures. As far as we know, this is the largest study in the field. Experiments were performed over a private dataset obtained from a Brazilian petroleum company, showing that MobileNetV3-large is the most suitable approach for the task.
Download

Paper Nr: 332
Title:

Error Analysis of Aerial Image-Based Relative Object Position Estimation

Authors:

Zsombor Páncsics, Nelli Nyisztor, Tekla Tóth, Imre B. Juhász, Gergely Treplán and Levente Hajder

Abstract: This paper presents a thorough analysis of precision and sensitivity in aerial image-based relative object position estimation, exploring factors such as camera tilt, 3D projection error, marker misalignment, rotation, and calibration error. Our unique contribution lies in simulating complex 3D geometries at varying camera altitudes (20-130 m). The simulator has a unique built-in mathematical model offering an extensive set of error parameters to improve the reliability of aerial image-based position estimation in practical applications.
Download

Paper Nr: 333
Title:

A Computer Vision Approach to Compute Bubble Flow of Offshore Wells

Authors:

Rogerio C. Hart and Aura Conci

Abstract: This work presents two approaches for detecting and quantifying the offshore flow of leaks, using video recorded by a remotely operated vehicle (ROV), through underwater image analysis and under the premise of no bubble overlap. One is designed using only traditional digital image processing, such as Mathematical Morphology operators and Canny edge detection, while the second uses a segmentation Convolutional Neural Network. Implementation and experimentation details are presented, enabling comparison and reproduction. The results are compared with videos acquired under controlled conditions and in an operational situation, as well as with all comparable previous works. The comparison considers the estimation of the average diameter of rising bubbles, rise velocity, leak flow rate, computational automation, and flexibility in bubble recognition. The results of the two techniques are nearly identical, depending on the video content under analysis.
Download

Paper Nr: 338
Title:

Blind Deblurring of THz Time-Domain Images Based on Low-Rank Representation

Authors:

Marina Ljubenović, Mário T. Figueiredo and Arianna Traviglia

Abstract: Terahertz (THz) time-domain imaging holds immense potential for material characterization, capturing three-dimensional data across spatial and temporal dimensions. Despite its capabilities, the technology faces hurdles such as frequency-dependent beam-shape effects and noise. This paper proposes a novel, dual-stage framework for improving THz image resolution beyond the wavelength limit. Our method combats blur at lower frequencies and noise at higher frequencies. The first stage entails selective deblurring of lower-frequency bands, addressing beam-related blurring, while the second stage involves denoising the entire THz hyperspectral cube through dimensionality reduction, exploiting its low-rank structure. The synergy of these advanced techniques (beam shaping, noise removal, and low-rank representation) forms a comprehensive approach to enhancing THz time-domain images. We present promising preliminary results, showcasing significant improvements across all frequency bands, which is crucial as samples may display varying features across the THz spectrum. Our ongoing work is extending this methodology to complex scenarios such as analyzing multilayered structures in closed ancient manuscripts. This approach paves the way for broader application and refinement of THz imaging in diverse research fields.
Download

Paper Nr: 379
Title:

Optical Illusion in Which Line Segments Continue to Grow or Shrink by Displaying Two Images Alternately

Authors:

Kazuhisa Yanaka and Sota Mihara

Abstract: A new illusion has been discovered, wherein line segments, when alternately displayed with their tonal inversion or monochromatic images for approximately 120 ms each on a monochromatic background, seem to grow or shrink continuously. For instance, if the first image features black line segments on a white background and the second image shows the inverse brightness, switching between these two images causes the line segments to give the illusion of continuous expansion. Although a single line segment suffices, aligning multiple line segments parallel to each other enhances the effect of this illusion. This illusion can be achieved using achromatic colors, such as black and white, as well as chromatic colors, such as red, blue, and green. Specifically, when using an image with a black line segment on a red background alongside its brightness-inverted counterpart, the line segments appear to steadily decrease in length. Our hypothesis suggests a comparison between the mechanisms of this illusion and the changes in water volume in a pond.
Download

Paper Nr: 381
Title:

SAM-Based Detection of Structural Anomalies in 3D Models for Preserving Cultural Heritage

Authors:

David Jurado-Rodríguez, Alfonso López, J. R. Jiménez, Antonio Garrido, Francisco R. Feito and Juan M. Jurado

Abstract: The detection of structural defects and anomalies in cultural heritage emerges as an essential component to ensure the integrity and safety of buildings, plan preservation strategies, and promote the sustainability and durability of buildings over time. In the search to enhance the effectiveness and efficiency of structural health monitoring of cultural heritage, this work aims to develop an automated method focused on detecting unwanted materials and geometric anomalies on the 3D surfaces of ancient buildings. In this study, the proposed solution combines an AI-based technique for fast-forward image labeling and a fully automatic detection of target classes in 3D point clouds. As an advantage of our method, the use of spatial and geometric features in the 3D models enables the recognition of target materials in the whole point cloud from seed, resulting from partial detection in a few images. The results demonstrate the feasibility and utility of detecting self-healing materials, unwanted vegetation, lichens, and encrusted elements in a real-world scenario.
Download

Paper Nr: 391
Title:

A Generative Model for Guided Thermal Image Super-Resolution

Authors:

Patricia L. Suárez and Angel D. Sappa

Abstract: This paper presents a novel approach to thermal image super-resolution based on a fusion prior: the low-resolution thermal image and the brightness channel of the corresponding visible spectrum image. The method combines bicubic interpolation of the ×8 scale target image with the brightness component. To enhance the guidance process, the original RGB image is converted to HSV and the brightness channel is extracted. Bicubic interpolation is then applied to the low-resolution thermal image, resulting in a bicubic-brightness channel blend. This luminance-bicubic fusion is used as an input image to aid the training process. With this fused image, the cyclic generative adversarial network obtains high-resolution thermal image results. Experimental evaluations show that the proposed approach significantly improves spatial resolution and pixel intensity levels compared to other state-of-the-art techniques, making it a promising method for obtaining high-resolution thermal images.
Download

Paper Nr: 417
Title:

Colorectal Image Classification Using Randomized Neural Network Descriptors

Authors:

Jarbas M. Sá Junior and André R. Backes

Abstract: Colorectal cancer is among the highest incident cancers in the world. A fundamental procedure to diagnose it is the analysis of histological images acquired from a biopsy. Because of this, computer vision approaches have been proposed to help human specialists in such a task. In order to contribute to this field of research, this paper presents a novel way of analyzing colorectal images by using a very discriminative texture signature based on weights of a randomized neural network. For this, we addressed an important multi-class problem composed of eight types of tissues. The results were promising, surpassing the accuracies of many methods present in the literature. Thus, this performance confirms that the randomized neural network signature is an efficient tool for discriminating histological images from colorectal tissues.
Download

Paper Nr: 431
Title:

Deformable Pose Network: A Multi-Stage Deformable Convolutional Network for 2D Hand Pose Estimation

Authors:

Sartaj A. Salman, Ali Zakir and Hiroki Takahashi

Abstract: Hand pose estimation has undergone significant advancement with the evolution of Convolutional Neural Networks (CNNs) in the field of computer vision. However, existing CNNs fail in many scenarios to learn the unknown transformations and geometric constraints, along with other existing challenges, for accurate estimation of hand keypoints. To tackle these issues, we propose a multi-stage deformable convolutional network for accurate 2D hand pose estimation from monocular RGB images while considering computational complexity. We utilize EfficientNet as a backbone due to its powerful feature extraction capability, and deformable convolution to learn the geometric constraints. Our proposed model, called Deformable Pose Network (DPN), outperforms existing methods in predicting 2D keypoints in complex scenarios. Our analysis on the Panoptic Studio hand dataset shows that our proposed model improves accuracy by 2.36% and 7.29% compared to the existing OCPM and CPM methods, respectively.
Download

Paper Nr: 442
Title:

Selection of Backbone for Feature Extraction with U-Net in Pancreas Segmentation

Authors:

Alexandre C. Araújo, Joao D. Sousa de Almeida, Anselmo Cardoso de Paiva and Geraldo Braz Junior

Abstract: The survival rate for pancreatic cancer is among the worst, with a mortality rate of 98%. Diagnosis in the early stage of the disease is the main factor that defines the prognosis. Imaging scans, such as Computerized Tomography scans, are the primary tools for early diagnosis. Computer-Assisted Diagnosis tools that use these scans usually include the segmentation of the pancreas in their pipeline as one of the initial steps for diagnosis. This paper presents a comparative study of the use of different backbones in combination with U-Net. This study aims to demonstrate that using pre-trained backbones is a valuable tool for pancreas segmentation and to provide a comparative benchmark for this task. The best result obtained was a Dice score of 85.96% on the MSD dataset for pancreas segmentation, using the EfficientNetB7 backbone.
Download

Paper Nr: 452
Title:

RetailKLIP: Finetuning OpenCLIP Backbone Using Metric Learning on a Single GPU for Zero-Shot Retail Product Image Classification

Authors:

Muktabh M. Srivastava

Abstract: Images of retail products and packaged grocery goods need to be classified in various computer vision applications such as self-checkout stores, supply chain automation, and retail execution evaluation. Previous works explore ways to finetune deep models for this purpose. However, because finetuning a large model, or even a linear layer on top of a pretrained backbone, requires running at least a few epochs of gradient descent for every new retail product added to the classification range, frequent retrainings are needed in real-world scenarios. In this work, we propose finetuning the vision encoder of a CLIP model so that its embeddings can be used directly for nearest-neighbor classification, while achieving accuracy close to or exceeding full finetuning. A nearest-neighbor classifier needs no incremental training for new products, thus saving resources and wait time.
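A minimal sketch of the nearest-neighbor step in pure Python. The toy 3-d vectors and product names below are hypothetical stand-ins for embeddings that would come from the finetuned CLIP vision encoder:

```python
import math

def normalize(v):
    """Scale a vector to unit length, so dot product = cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class NearestNeighborClassifier:
    """Index of (embedding, label) pairs; adding a product needs no training."""
    def __init__(self):
        self.index = []

    def add(self, embedding, label):
        self.index.append((normalize(embedding), label))

    def predict(self, embedding):
        q = normalize(embedding)
        sims = [(sum(a * b for a, b in zip(e, q)), lbl) for e, lbl in self.index]
        return max(sims)[1]  # label of the most cosine-similar entry

clf = NearestNeighborClassifier()
clf.add([0.9, 0.1, 0.0], "cereal_box")  # enrolling a new product is just an append
clf.add([0.0, 0.2, 0.9], "soda_can")
pred = clf.predict([0.8, 0.2, 0.1])
```

This is why the approach avoids retraining: adding a product only appends one embedding to the index, while gradient-descent finetuning would need another pass over the data.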
Download

Area 2 - Mobile and Egocentric Vision for Humans and Robots

Full Papers
Paper Nr: 57
Title:

Event-Based Semantic-Aided Motion Segmentation

Authors:

Chenao Jiang, Julien Moreau and Franck Davoine

Abstract: Event cameras are emerging visual sensors inspired by biological systems. They capture intensity changes asynchronously with temporal precision on the order of microseconds, in contrast to traditional frame imaging techniques running at a fixed frequency of tens of Hz. However, effectively utilizing the data generated by these sensors requires the development of new algorithms and processing. In light of event cameras’ significant advantages in capturing high-speed motion, researchers have turned their attention to event-based motion segmentation. Building upon the framework of (Mitrokhin et al., 2019), we propose leveraging semantic segmentation to enable the end-to-end network not only to segment moving objects from background motion, but also to achieve semantic segmentation of distinct moving objects. Remarkably, these capabilities are achieved while maintaining the network’s low parameter count of 2.5M. To validate the effectiveness of our approach, we conduct experiments using the EVIMO dataset and the new and more challenging EVIMO2 dataset (Burner et al., 2022). The results demonstrate the improvements attained by our method, showcasing its potential in event-based multi-object motion segmentation.
Download

Paper Nr: 166
Title:

Semantic State Estimation in Robot Cloth Manipulations Using Domain Adaptation from Human Demonstrations

Authors:

Georgies Tzelepis, Eren E. Aksoy, Júlia Borràs and Guillem Alenyà

Abstract: Deformable object manipulations, such as those involving textiles, present a significant challenge due to their high dimensionality and complexity. In this paper, we propose a solution for estimating semantic states in cloth manipulation tasks. To this end, we introduce a new, large-scale, fully-annotated RGB image dataset of semantic states featuring a diverse range of human demonstrations of various complex cloth manipulations. This effectively transforms the problem of action recognition into a classification task. We then evaluate the generalizability of our approach by employing domain adaptation techniques to transfer knowledge from human demonstrations to two distinct robotic platforms: Kinova and UR robots. Additionally, we further improve performance by utilizing a semantic state graph learned from human manipulation data.
Download

Paper Nr: 174
Title:

Hand Mesh and Object Pose Reconstruction Using Cross Model Autoencoder

Authors:

Chaitanya Bandi and Ulrike Thomas

Abstract: Hands and objects severely occlude each other, making it extremely challenging to estimate the hand-object pose during human-robot interactions. In this work, we propose a framework that jointly estimates 3D hand mesh and 6D object pose in real-time. The framework shares the features of a single network with both the hand pose estimation network and the object pose estimation network. Hand pose estimation is a parametric model that regresses the shape and pose parameters of the hand. The object pose estimation network is a cross-model variational autoencoder network for the direct reconstruction of an object’s 6D pose. Our method shows substantial improvement in object pose estimation on two large-scale open-source datasets.
Download

Paper Nr: 175
Title:

Multi-View Inversion for 3D-aware Generative Adversarial Networks

Authors:

Florian Barthel, Anna Hilsmann and Peter Eisert

Abstract: Current 3D GAN inversion methods for human heads typically use only one single frontal image to reconstruct the whole 3D head model. This leaves out meaningful information when multi-view data or dynamic videos are available. Our method builds on existing state-of-the-art 3D GAN inversion techniques to allow for consistent and simultaneous inversion of multiple views of the same subject. We employ a multi-latent extension to handle inconsistencies present in dynamic face videos to re-synthesize consistent 3D representations from the sequence. As our method uses additional information about the target subject, we observe significant enhancements in both geometric accuracy and image quality, particularly when rendering from wide viewing angles. Moreover, we demonstrate the editability of our inverted 3D renderings, which distinguishes them from NeRF-based scene reconstructions.
Download

Paper Nr: 188
Title:

HD-VoxelFlex: Flexible High-Definition Voxel Grid Representation

Authors:

Igor Vozniak, Pavel Astreika, Philipp Müller, Nils Lipp, Christian Müller and Philipp Slusallek

Abstract: Voxel grids are an effective means to represent 3D data, as they accurately preserve spatial relations. However, the inherent sparseness of voxel grid representations leads to significant memory consumption in deep learning architectures, in particular for high-resolution (HD) inputs. As a result, current state-of-the-art approaches to the reconstruction of 3D data tend to avoid voxel grid inputs. In this work, we propose HD-VoxelFlex, a novel 3D CNN architecture that can be flexibly applied to HD voxel grids with only moderate increase in training parameters and memory consumption. HD-VoxelFlex introduces three architectural novelties. First, to improve the models’ generalizability, we introduce a random shuffling layer. Second, to reduce information loss, we introduce a novel reducing skip connection layer. Third, to improve modelling of local structure that is crucial for HD inputs, we incorporate a kNN distance mask as input. We combine these novelties with a “bag of tricks” identified in a comprehensive literature review. Based on these novelties we propose six novel building blocks for our encoder-decoder HD-VoxelFlex architecture. In evaluations on the ModelNet10/40 and PCN datasets, HD-VoxelFlex outperforms the state-of-the-art in all point cloud reconstruction metrics. We show that HD-VoxelFlex is able to process high-definition (128³, 192³) voxel grid inputs at much lower memory consumption than previous approaches. Furthermore, we show that HD-VoxelFlex, without additional fine-tuning, demonstrates competitive performance in the classification task, proving its generalization ability. As such, our results underline the neglected potential of voxel grid input for deep learning architectures.
Download

Paper Nr: 337
Title:

BEVFastLine: Single Shot Fast BEV Line Detection for Automated Parking Applications

Authors:

Praveen Narasappareddygari, Venkatesh M. Karunamoorthy, Shubham Sonarghare, Ganesh Sistu and Prasad Deshpande

Abstract: In autonomous parking scenarios, accurate near-field environmental perception is crucial for smooth operations. Parking line detection, unlike the well-understood lane detection, poses unique challenges due to its lack of spatial consistency in orientation, location, and varied appearances in color, pattern, and background surfaces. Consequently, state-of-the-art models for lane detection, which rely on anchors and offsets, are not directly applicable. This paper introduces BEVFastLine, a novel end-to-end line marking detection architecture in Bird's Eye View (BEV) space, designed for 360° multi-camera perception applications. BEVFastLine integrates our single-shot line detection methodology with advanced Inverse Perspective Mapping (IPM) techniques, notably our fast splatting technique, to efficiently detect line markings in varied spatial contexts. This approach is suitable for real-time hardware in Level-3 automated vehicles. BEVFastLine accurately localizes parking lines in BEV space with up to 10 cm precision. Our methods, including the 4X faster Fast Splat and single-shot detection, surpass LSS and OFT in accuracy, achieving 80.1% precision, 90% recall, and nearly doubling the performance of BEV-based segmentation and polyline models. This streamlined solution is highly effective in complex, dynamic parking environments, offering high precision localization within 10 meters around the ego vehicle.
Download

Paper Nr: 427
Title:

Fooling Neural Networks for Motion Forecasting via Adversarial Attacks

Authors:

Edgar Medina and Leyong Loh

Abstract: Human motion prediction is still an open problem, which is extremely important for autonomous driving and safety applications. Although there are great advances in this area, the widely studied topic of adversarial attacks has not been applied to multi-regression models such as GCNs and MLP-based architectures in human motion prediction. This work intends to reduce this gap using extensive quantitative and qualitative experiments on state-of-the-art architectures, similar to the initial studies of adversarial attacks in image classification. The results suggest that models are susceptible to attacks even at low levels of perturbation. We also show experiments with 3D transformations that affect the model performance, in particular, we show that most models are sensitive to simple rotations and translations which do not alter joint distances. We conclude that, similar to earlier CNN models, motion forecasting tasks are susceptible to small perturbations and simple 3D transformations.
Download

Short Papers
Paper Nr: 41
Title:

Informative Rays Selection for Few-Shot Neural Radiance Fields

Authors:

Marco Orsingher, Anthony Dell’Eva, Paolo Zani, Paolo Medici and Massimo Bertozzi

Abstract: Neural Radiance Fields (NeRF) have recently emerged as a powerful method for image-based 3D reconstruction, but the lengthy per-scene optimization limits their practical usage, especially in resource-constrained settings. Existing approaches solve this issue by reducing the number of input views and regularizing the learned volumetric representation with either complex losses or additional inputs from other modalities. In this paper, we present KeyNeRF, a simple yet effective method for training NeRF in few-shot scenarios by focusing on key informative rays. Such rays are first selected at camera level by a view selection algorithm that promotes baseline diversity while guaranteeing scene coverage, then at pixel level by sampling from a probability distribution based on local image entropy. Our approach performs favorably against state-of-the-art methods, while requiring minimal changes to existing NeRF codebases.
Download

Paper Nr: 52
Title:

Augmenting Human-Robot Collaboration Task by Human Hand Position Forecasting

Authors:

Shyngyskhan Abilkassov, Michael Gentner and Mirela Popa

Abstract: Human-Robot collaboration (HRC) plays a critical role in enhancing productivity and safety across various industries. While reactive motion re-planning strategies have proven useful, there is a pressing need for proactive control involving computing human intentions to enable efficient collaboration. This work addresses this challenge by proposing a deep learning-based approach for forecasting human hand trajectories and a heuristic optimization algorithm for proactive optimization of the robotic task sequencing problem. This work presents a human hand trajectory forecasting deep learning model that achieves state-of-the-art performance on the Ego4D Future Hand Prediction benchmark in all evaluation metrics. In addition, this work presents a problem formulation and a Dynamic Variable Neighborhood Search (DynamicVNS) heuristic optimization algorithm enabling robots to pre-plan their task sequences to avoid human hands. The proposed algorithm exhibits significant computational improvements over the generalized VNS approach. The final framework efficiently incorporates predictions made by the deep learning model into the task sequencer, which is evaluated in an experimental setup for the HRC use-case of the UR10e robot in a visual inspection task. The results indicate the effectiveness and practicality of the proposed approach, showcasing its potential to improve human-robot collaboration in various industrial settings.
Download

Paper Nr: 137
Title:

Analysis of Point Cloud Domain Gap Effects for 3D Object Detection Evaluation

Authors:

Aitor Iglesias, Mikel García, Nerea Aranjuelo, Ignacio Arganda-Carreras and Marcos Nieto

Abstract: The development of autonomous driving systems heavily relies on high-quality LiDAR data, which is essential for robust object detection and scene understanding. Nevertheless, obtaining a substantial amount of such data for effective training and evaluation of autonomous driving algorithms is a major challenge. To overcome this limitation, recent studies are taking advantage of advancements in realistic simulation engines, such as CARLA, which have provided a breakthrough in generating synthetic LiDAR data that closely resembles real-world scenarios. However, these data are far from being identical to real data. In this study, we address the domain gap between real LiDAR data and synthetic data. We train deep-learning models for object detection using real data. Then, those models are rigorously evaluated using synthetic data generated in CARLA. By quantifying the discrepancies between the model’s performance on real and synthetic data, the present study shows that there is indeed a domain gap between the two types of data and that it does not affect all model architectures equally. Finally, we propose a method for synthetic data processing to reduce this domain gap. This research contributes to enhancing the use of synthetic data for autonomous driving systems.
Download

Paper Nr: 148
Title:

Incorporating Temporal Information into 3D Hand Pose Estimation Using Scene Flow

Authors:

Niklas Hermes, Alexander Bigalke and Mattias P. Heinrich

Abstract: In this paper we present a novel approach that uses 3D point cloud sequences to integrate temporal information and spatial constraints into existing 3D hand pose estimation methods in order to establish an improved prediction of 3D hand poses. We utilize scene flow to match correspondences between two point sets and present a method that optimizes and harnesses existing scene flow networks for the application of 3D hand pose estimation. For increased generalizability, we propose a module that learns to recognize spatial hand pose associations to transform existing poses into a low-dimensional pose space. In a comprehensive evaluation on the public dataset NYU, we show the benefits of our individual modules and provide insights into the generalization capabilities and the behaviour of our method with noisy data. Furthermore, we demonstrate that our method reduces the error of existing state-of-the-art 3D hand pose estimation methods by up to 7.6%. With a speed of over 40 fps our method is real-time capable and can be integrated into existing 3D hand pose estimation methods with little computational overhead.
Download

Paper Nr: 170
Title:

PIRO: Permutation-Invariant Relational Network for Multi-Person 3D Pose Estimation

Authors:

Nicolas Ugrinovic, Adria Ruiz, Antonio Agudo, Alberto Sanfeliu and Francesc Moreno-Noguer

Abstract: Recovering multi-person 3D poses from a single RGB image is an ill-conditioned problem due to the inherent 2D-3D depth ambiguity, inter-person occlusions, and body truncation. To tackle these issues, recent works have shown promising results by simultaneously reasoning for different individuals. However, in most cases this is done by only considering pairwise inter-person interactions or between pairs of body parts, thus hindering a holistic scene representation able to capture long-range interactions. Some approaches that jointly process all people in the scene require defining one of the individuals as a reference and a pre-defined person ordering, or limiting the number of individuals, thus being sensitive to these choices. In this paper, we overcome both these limitations, and we propose an approach for multi-person 3D pose estimation that captures long-range interactions independently of the input order. We build a residual-like permutation-invariant network that successfully refines potentially corrupted initial 3D poses estimated by off-the-shelf detectors. The residual function is learned via a Set Attention (Lee et al., 2019) mechanism. Despite our model being relatively straightforward, a thorough evaluation demonstrates that our approach is able to boost the performance of the initially estimated 3D poses by large margins, achieving state-of-the-art results on two standardized benchmarks.

Paper Nr: 183
Title:

Social Distancing Monitoring by Human Detection Through Bird’s-Eye View Technique

Authors:

Gona Rozhbayani, Amel Tuama and Fadwa Al-Azzo

Abstract: The objective of this study is to offer a YOLOv5 deep learning-based system for social distance monitoring. The YOLOv5 model has been used to detect humans in real-time video frames, and to obtain information on the detected bounding box for the bird’s eye view perspective technique. The pairwise distances of the identified bounding box centroids of people are calculated by utilizing Euclidean distance. In addition, a threshold value has been set and applied as an approximation of social distance in pixels for determining social distance violations between people. The effectiveness of this proposed system is tested by experiments on four different video frames. The suggested system’s performance showed a high level of efficiency, monitoring social distancing with accuracy of up to 100%.
Download
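The distance-thresholding step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: bounding boxes are assumed to be (x1, y1, x2, y2) pixel tuples in the bird's-eye-view plane, and the function names are illustrative.

```python
from itertools import combinations
import math

def centroid(box):
    # box given as (x1, y1, x2, y2) in pixel coordinates
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def find_violations(boxes, threshold_px):
    """Return index pairs of detections whose centroid distance falls
    below the pixel threshold approximating the social distance."""
    centroids = [centroid(b) for b in boxes]
    violations = []
    for i, j in combinations(range(len(centroids)), 2):
        if math.dist(centroids[i], centroids[j]) < threshold_px:
            violations.append((i, j))
    return violations
```

In practice the threshold would be calibrated after the bird's-eye-view transform, so that a fixed pixel distance corresponds to the required physical separation.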

Paper Nr: 277
Title:

World-Map Misalignment Detection for Visual Navigation Systems

Authors:

Rosario Forte, Michele Mazzamuto, Francesco Ragusa, Giovanni M. Farinella and Antonino Furnari

Abstract: We consider the problem of inferring when the internal map of an indoor navigation system is misaligned with respect to the real world (world-map misalignment), which can lead to misleading directions given to the user. We note that world-map misalignment can be predicted from an RGB image of the environment and the floor segmentation mask obtained from the internal map of the navigation system. Since collecting and labelling large amounts of real data is expensive, we developed a tool to simulate human navigation, which is used to generate automatically labelled synthetic data from 3D models of environments. Thanks to this tool, we generate a dataset considering 15 different environments, which is complemented by a small set of videos acquired in a real-world scenario and manually labelled for validation purposes. We hence benchmark an approach based on different ResNet18 configurations and compare their results on both synthetic and real images. We achieved an F1 score of 92.37% in the synthetic domain and 75.42% on the proposed real dataset using our best approach. While the results are promising, we also note that the proposed problem is challenging, due to the domain shift between synthetic and real data, and the difficulty in acquiring real data. The dataset and the developed tool are publicly available to encourage research on the topic at the following URL: https://github.com/fpv-iplab/WMM-detection-for-visual-navigation-systems.
Download

Paper Nr: 309
Title:

Region-Transformer: Self-Attention Region Based Class-Agnostic Point Cloud Segmentation

Authors:

Dipesh Gyawali, Jian Zhang and Bijaya B. Karki

Abstract: Point cloud segmentation, which helps us understand the environment of specific structures and objects, can be performed in class-specific and class-agnostic ways. We propose a novel region-based transformer model called Region-Transformer for performing class-agnostic point cloud segmentation. The model utilizes a region-growth approach and self-attention mechanism to iteratively expand or contract a region by adding or removing points. It is trained on simulated point clouds with instance labels only, avoiding semantic labels. Attention-based networks have succeeded in many previous methods of performing point cloud segmentation. However, a region-growth approach with attention-based networks has yet to be used to explore its performance gain. To our knowledge, we are the first to use a self-attention mechanism in a region-growth approach. With the introduction of self-attention to region-growth that can utilize local contextual information of neighborhood points, our experiments demonstrate that the Region-Transformer model outperforms previous class-agnostic and class-specific methods on indoor datasets regarding clustering metrics. The model generalizes well to large-scale scenes. Key advantages include capturing long-range dependencies through self-attention, avoiding the need for semantic labels during training, and applicability to a variable number of objects. The Region-Transformer model represents a promising approach for flexible point cloud segmentation with applications in robotics, digital twinning, and autonomous vehicles.
Download

Paper Nr: 325
Title:

Multidimensional Compressed Sensing for Spectral Light Field Imaging

Authors:

Wen Cao, Ehsan Miandji and Jonas Unger

Abstract: This paper considers a compressive multi-spectral light field camera model that utilizes a one-hot spectral-coded mask and a microlens array to capture spatial, angular, and spectral information using a single monochrome sensor. We propose a model that employs compressed sensing techniques to reconstruct the complete multi-spectral light field from undersampled measurements. Unlike previous work where a light field is vectorized to a 1D signal, our method employs a 5D basis and a novel 5D measurement model, hence, matching the intrinsic dimensionality of multispectral light fields. We mathematically and empirically show the equivalence of 5D and 1D sensing models, and most importantly that the 5D framework achieves orders of magnitude faster reconstruction while requiring a small fraction of the memory. Moreover, our new multidimensional sensing model opens new research directions for designing efficient visual data acquisition algorithms and hardware.
Download

Paper Nr: 356
Title:

Visual Perception of Obstacles: Do Humans and Machines Focus on the Same Image Features?

Authors:

Constantinos A. Kyriakides, Marios Thoma, Zenonas Theodosiou, Harris Partaourides, Loizos Michael and Andreas Lanitis

Abstract: Contemporary cities are fractured by a growing number of barriers, such as on-going construction and infrastructure damages, which endanger pedestrian safety. Automated detection and recognition of such barriers from visual data has been of particular concern to the research community in recent years. Deep Learning (DL) algorithms are now the dominant approach in visual data analysis, achieving excellent results in a wide range of applications, including obstacle detection. However, explaining the underlying operations of DL models remains a key challenge in gaining significant understanding on how they arrive at their decisions. The use of heatmaps that highlight the focal points in input images that helped the models reach their predictions has emerged as a form of post-hoc explainability for such models. In an effort to gain insights into the learning process of DL models, we studied the similarities between heatmaps generated by a number of architectures trained to detect obstacles on sidewalks in images collected via smartphones, and eye-tracking heatmaps generated by humans as they detect the corresponding obstacles on the same data. Our findings indicate that the focus points of humans more closely align with those of a Vision Transformer architecture, as opposed to the other network architectures we examined in our experiments.
Download

Paper Nr: 378
Title:

Reliability and Stability of Mean Opinion Score for Image Aesthetic Quality Assessment Obtained Through Crowdsourcing

Authors:

Egor Ershov, Artyom Panshin, Ivan Ermakov, Nikola Banić, Alex Savchik and Simone Bianco

Abstract: Image quality assessment (IQA) is widely used to evaluate the results of image processing methods. While in recent years the development of objective IQA metrics has seen much progress, there are still many tasks where subjective IQA is significantly more preferred. Using subjective IQA has become even more attractive ever since crowdsourcing platforms such as Amazon Mechanical Turk and Toloka have become available. However, for some specific image processing tasks, there are still some questions related to subjective IQA that have not been solved in a satisfactory way. An example of such a task is the evaluation of image rendering styles where, unlike in the case of distortions, none of the evaluated styles is to be objectively regarded as a priori better or worse. The questions that have not been properly answered up until now are whether the scores for such a task obtained through crowdsourced subjective IQA are reliable and whether they remain stable, i.e., similar if the evaluation is repeated over time. To answer these questions, in this paper first several images and styles are selected and defined, they are then evaluated by using crowdsourced subjective IQA on the Toloka platform, and the obtained scores are numerically analyzed. Experimental results confirm the reliability and stability of the crowdsourced subjective IQA for the problem in question. The experimental data is available at https://zenodo.org/records/10458531.
Download

Paper Nr: 383
Title:

Detecting Anomalous 3D Point Clouds Using Pre-Trained Feature Extractors

Authors:

Dario Mantegazza and Alessandro Giusti

Abstract: In this paper we explore the status of the research effort for the task of 3D visual anomaly detection; in particular, we investigate whether it is possible to find anomalies on 3D point clouds using off-the-shelf feature extractors, similar to what is already feasible on images, without the requirement of an ad-hoc one. Our work uses a model composed of two parts: a feature extraction module and an anomaly detection head. The latter is fixed and works on the embeddings from the feature extraction module. Using the MVTec-3D dataset, we contribute a comparison between a 3D point cloud features extractor, a 2D image features extractor, a combination of the two, and three baselines. We also compare our work with other models on the dataset’s DETECTION-AUROC benchmark. The experiment results demonstrate that, while our proposed approach surpasses the baselines and some other approaches, our best-performing model cannot beat purposely developed ones. We conclude that a combination of dataset size and 3D data complexity is the culprit behind the lack of off-the-shelf feature extractors for solving complex 3D vision tasks.
Download

Paper Nr: 430
Title:

Decoding Visual Stimuli and Visual Imagery Information from EEG Signals Utilizing Multi-Perspective 3D-CNN Based Hierarchical Deep-Fusion Learning Network

Authors:

Fatma Y. Emanet and Kazim Sekeroglu

Abstract: Brain-Computer Interface Systems (BCIs) facilitate communication between the brain and machines, enabling applications such as diagnosis, understanding brain function, and cognitive augmentation. This study explores the classification of visual stimuli and visual imagery using electroencephalographic (EEG) data. The proposed method utilizes 3D EEG data generated by transforming 1D EEG data into 2D Spatiotemporal EEG image mappings for feature extraction and classification. Additionally, a multi-perspective 3D CNN-based hierarchical deep fusion learning network is employed to classify multi-dimensional spatiotemporal EEG data, decoding brain activity for visual and visual imagery stimulation. The findings show that the suggested multi-perspective fusion method performs better than a standalone model, indicating promising progress in using BCIs to understand and utilize brain signals for visual and imagined stimulation.
Download

Paper Nr: 28
Title:

Finding and Navigating to Humans in Complex Environments for Assistive Tasks

Authors:

Asfand Yaar, Antonino Furnari, Marco Rosano, Aki Härmä and Giovanni M. Farinella

Abstract: Finding and reaching humans in unseen environments is a major challenge for intelligent agents and social robots. Effective exploration and navigation strategies are necessary to locate the human performing various activities. In this paper, we propose a problem formulation in which the robot is required to locate and reach humans in unseen environments. To tackle this task, we design an approach that makes use of state-of-the-art components to allow the agent to explore the environment, identify the human’s location on the map, and approach them while maintaining a safe distance. To include human models, we utilized Blender to modify the scenes of the Gibson dataset. We conducted experiments using the Habitat simulator, where the proposed approach achieves promising results. The success of our approach is measured by the distance and orientation difference between the robot and the human at the end of the episode. We will release the source code and 3D human models for researchers to benchmark their assistive systems.
Download

Paper Nr: 114
Title:

Enhancing Object Detection Accuracy with Variational Autoencoders as a Filter in YOLO

Authors:

Shubham K. Dubey, J. V. Satyanarayana and C. K. Mohan

Abstract: Object detection is an important task in computer vision systems, encompassing a diverse spectrum of applications, including but not limited to autonomous vehicular navigation and surveillance. Despite considerable advancements in object detection models such as YOLO, the issue of false positive detections remains a prevalent concern, causing misclassifications and diminishing the reliability of these systems. This research endeavors to present an innovative methodology designed to augment object detection accuracy by incorporating Variational Autoencoders (VAEs) as a filtration mechanism within the YOLO framework. This integration seeks to rectify the issue of false positive detections, ultimately fostering a marked enhancement in detection precision and strengthening the overall dependability of object detection systems.
Download
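The filtering idea in the abstract above can be sketched in a few lines. This is an assumed pipeline, not the paper's implementation: a VAE trained on true-positive crops tends to reconstruct them well, so a detection whose crop yields a high reconstruction error is treated as a likely false positive. The dictionary layout and `recon_error_fn` are hypothetical.

```python
def filter_detections(detections, recon_error_fn, max_error):
    """Keep only detections whose cropped patch the VAE reconstructs
    well; a high reconstruction error suggests a false positive.

    detections    : list of dicts with a "crop" entry (image patch)
    recon_error_fn: callable mapping a crop to its VAE reconstruction error
    max_error     : calibrated rejection threshold
    """
    kept = []
    for det in detections:
        if recon_error_fn(det["crop"]) <= max_error:
            kept.append(det)
    return kept
```

The threshold would typically be chosen on a validation set to trade recall against the false-positive rate.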

Paper Nr: 189
Title:

Automatic Error Correction of GPT-Based Robot Motion Generation by Partial Affordance of Tool

Authors:

Takahiro Suzuki, Yuta Ando and Manabu Hashimoto

Abstract: In this research, we proposed a technique that, given a simple instruction such as “Please make a cup of coffee” as would commonly be used when one human gives another human an instruction, determines an appropriate robot motion sequence and the tools to be used for that task and generates a motion trajectory for a robot to execute the task. The proposed method uses a large language model (GPT) to determine robot motion sequences and tools to be used. However, GPT may select tools that do not exist in the scene or are not appropriate. To correct this error, our research focuses on function and functional consistency. An everyday object has a role assigned to each region of that object, such as “scoop” or “contain”. There are also constraints such as the fact that a ladle must have scoop and grasp functions. The proposed method judges whether the tools in the scene are inconsistent with these constraints, and automatically corrects the tools as necessary. Experimental results confirmed that the proposed method was able to generate motion sequences from a simple instruction and that the proposed method automatically corrects errors in GPT outputs.
Download

Paper Nr: 306
Title:

Combining Progressive Hierarchical Image Encoding and YOLO to Detect Fish in Their Natural Habitat

Authors:

Antoni Burguera

Abstract: This paper explores the advantages of evaluating Progressive Image Encoding (PIE) methods in the context of the specific task for which they will be used. By focusing on a particular task —fish detection in their natural habitat— and a specific PIE algorithm — Progressive Hierarchical Image Encoding (PHIE)—, the paper investigates the performance of You Only Look Once (YOLO) in detecting fish in underwater images using PHIE-encoded images. This is particularly relevant in underwater environments where image transmission is slow. Results provide insights into the advantages and drawbacks of PHIE image encoding and decoding, not from the perspective of general metrics such as reconstructed image quality but from the viewpoint of its impact on a task —fish detection— that depends on the PHIE encoded and decoded images.
Download

Area 3 - Image and Video Understanding

Full Papers
Paper Nr: 23
Title:

Enabling On-Device Continual Learning with Binary Neural Networks and Latent Replay

Authors:

Lorenzo Vorabbi, Davide Maltoni, Guido Borghi and Stefano Santi

Abstract: On-device learning remains a formidable challenge, especially when dealing with resource-constrained devices that have limited computational capabilities. This challenge is primarily rooted in two key issues: first, the memory available on embedded devices is typically insufficient to accommodate the memory-intensive back-propagation algorithm, which often relies on floating-point precision. Second, the development of learning algorithms on models with extreme quantization levels, such as Binary Neural Networks (BNNs), is critical due to the drastic reduction in bit representation. In this study, we propose a solution that combines recent advancements in the field of Continual Learning (CL) and Binary Neural Networks to enable on-device training while maintaining competitive performance. Specifically, our approach leverages binary latent replay (LR) activations and a novel quantization scheme that significantly reduces the number of bits required for gradient computation. The experimental validation demonstrates a significant accuracy improvement in combination with a noticeable reduction in memory requirement, confirming the suitability of our approach in expanding the practical applications of deep learning in real-world scenarios.
Download

Paper Nr: 40
Title:

Uncertainty-Based Detection of Adversarial Attacks in Semantic Segmentation

Authors:

Kira Maag and Asja Fischer

Abstract: State-of-the-Art deep neural networks have proven to be highly powerful in a broad range of tasks, including semantic image segmentation. However, these networks are vulnerable against adversarial attacks, i.e., non-perceptible perturbations added to the input image causing incorrect predictions, which is hazardous in safety-critical applications like automated driving. Adversarial examples and defense strategies are well studied for the image classification task, while there has been limited research in the context of semantic segmentation. First works however show that the segmentation outcome can be severely distorted by adversarial attacks. In this work, we introduce an uncertainty-based approach for the detection of adversarial attacks in semantic segmentation. We observe that uncertainty as for example captured by the entropy of the output distribution behaves differently on clean and perturbed images and leverage this property to distinguish between the two cases. Our method works in a light-weight and post-processing manner, i.e., we do not modify the model or need knowledge of the process used for generating adversarial examples. In a thorough empirical analysis, we demonstrate the ability of our approach to detect perturbed images across multiple types of adversarial attacks.
Download
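The entropy cue described in the abstract above can be made concrete with a small sketch. This is an illustrative, assumed pipeline rather than the authors' code: compute the mean per-pixel entropy of the segmentation network's softmax output and flag an image when it exceeds a threshold calibrated on clean data.

```python
import numpy as np

def mean_pixel_entropy(logits):
    """logits: (C, H, W) array of per-class scores for one image.
    Returns the mean per-pixel entropy of the softmax distribution."""
    z = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=0, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=0)       # (H, W) entropy map
    return float(ent.mean())

def flag_adversarial(logits, threshold):
    # Simple post-processing detector: high average uncertainty
    # on the output distribution -> suspected perturbed input.
    return mean_pixel_entropy(logits) > threshold
```

Because the detector only reads the output distribution, it needs no access to the attack-generation process and no changes to the segmentation model, matching the light-weight, post-processing setting of the abstract.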

Paper Nr: 44
Title:

Synthesizing Classifiers from Prior Knowledge

Authors:

G. J. Burghouts, K. Schutte, M. Kruithof, W. Huizinga, F. Ruis and H. Kuijf

Abstract: Various good methods have been proposed for either zero-shot or few-shot learning, but these are commonly unsuited for both; whereas in practice one often starts without labels and some might become available later. We propose a method that naturally ties zero- and few-shot learning together. We initiate a zero-shot model from prior knowledge about the classes, by recombining the weights from a classification head via a linear reconstruction that is sparse to avoid overfitting. Our mapping is an explicit transfer of knowledge from known to new classes, hence it can be inspected and visualized, which is impossible with recently popular implicit prompt learning strategies. Our mapping is used to construct a classifier for the new class, by adapting the neural weights of the classifiers for the known classes. Effectively we synthesize a new classifier. Our method is flexible: we show its efficacy for various knowledge representations and various neural networks (whereas prompt learning is limited to language-vision models). Our synthesized classifier can operate directly on test samples in a zero-shot fashion. We outperform CLIP especially for uncommon image classes, sometimes by margins up to 32%. Because the synthesized classifier consists of a tensor layer, it can be optimized further when a (few) labeled images become available. For few-shot learning, our synthesized classifier provides a kickstart. With one label per class, it outperforms strong baselines that require annotation of attributes or heavy pretraining (CLIP) by 8%, and increases accuracy by 39% relative to conventional classifier initialization. The code is available.
Download
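The core recombination idea of the abstract above, reconstructing a new class from known classes via a sparse linear mapping and transferring the same coefficients to the classifier weights, can be sketched as follows. This is only an illustration of the principle, not the paper's method: it uses ordinary least squares followed by hard top-k truncation in place of a proper sparse solver, and all names are hypothetical.

```python
import numpy as np

def synthesize_classifier(W_known, target_attr, known_attrs, k=5):
    """Sketch: express a new class's attribute vector as a sparse linear
    combination of known classes' attributes, then apply the same
    coefficients to the known classification-head weights.

    W_known     : (n_classes, d_feat) weights of the known classifiers
    target_attr : (d_attr,) knowledge vector of the new class
    known_attrs : (n_classes, d_attr) knowledge vectors of known classes
    k           : number of coefficients kept (sparsity, to avoid overfitting)
    """
    alpha, *_ = np.linalg.lstsq(known_attrs.T, target_attr, rcond=None)
    keep = np.argsort(np.abs(alpha))[-k:]    # retain the k largest coefficients
    sparse = np.zeros_like(alpha)
    sparse[keep] = alpha[keep]
    return W_known.T @ sparse                # weight vector for the new class
```

Because the coefficients are explicit, the mapping from known to new classes can be inspected directly, which is the interpretability advantage the abstract claims over implicit prompt learning.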

Paper Nr: 45
Title:

StyleHumanCLIP: Text-Guided Garment Manipulation for StyleGAN-Human

Authors:

Takato Yoshikawa, Yuki Endo and Yoshihiro Kanamori

Abstract: This paper tackles text-guided control of StyleGAN for editing garments in full-body human images. Existing StyleGAN-based methods suffer from handling the rich diversity of garments and body shapes and poses. We propose a framework for text-guided full-body human image synthesis via an attention-based latent code mapper, which enables more disentangled control of StyleGAN than existing mappers. Our latent code mapper adopts an attention mechanism that adaptively manipulates individual latent codes on different StyleGAN layers under text guidance. In addition, we introduce feature-space masking at inference time to avoid unwanted changes caused by text inputs. Our quantitative and qualitative evaluations reveal that our method can control generated images more faithfully to given texts than existing methods.
Download

Paper Nr: 62
Title:

S3Aug: Segmentation, Sampling, and Shift for Action Recognition

Authors:

Taiki Sugiura and Toru Tamaki

Abstract: Action recognition is a well-established area of research in computer vision. In this paper, we propose S3Aug, a video data augmentation for action recognition. Unlike conventional video data augmentation methods that involve cutting and pasting regions from two videos, the proposed method generates new videos from a single training video through segmentation and label-to-image transformation. Furthermore, the proposed method modifies certain categories of label images by sampling to generate a variety of videos, and shifts intermediate features to enhance the temporal coherency between frames of the generated videos. Experimental results on the UCF101, HMDB51, and Mimetics datasets demonstrate the effectiveness of the proposed method, particularly for out-of-context videos of the Mimetics dataset.
Download

Paper Nr: 77
Title:

Attention-Based Shape and Gait Representations Learning for Video-Based Cloth-Changing Person Re-Identification

Authors:

Vuong D. Nguyen, Samiha Mirza, Pranav Mantini and Shishir K. Shah

Abstract: Current state-of-the-art Video-based Person Re-Identification (Re-ID) primarily relies on appearance features extracted by deep learning models. These methods are not applicable for long-term analysis in real-world scenarios where persons have changed clothes, making appearance information unreliable. In this work, we deal with the practical problem of Video-based Cloth-Changing Person Re-ID (VCCRe-ID) by proposing “Attention-based Shape and Gait Representations Learning” (ASGL) for VCCRe-ID. Our ASGL framework improves Re-ID performance under clothing variations by learning clothing-invariant gait cues using a Spatial-Temporal Graph Attention Network (ST-GAT). Given the 3D-skeleton-based spatial-temporal graph, our proposed ST-GAT comprises multi-head attention modules, which are able to enhance the robustness of gait embeddings under viewpoint changes and occlusions. The ST-GAT amplifies the important motion ranges and reduces the influence of noisy poses. Then, the multi-head learning module effectively reserves beneficial local temporal dynamics of movement. We also boost discriminative power of person representations by learning body shape cues using a GAT. Experiments on two large-scale VCCRe-ID datasets demonstrate that our proposed framework outperforms state-of-the-art methods by 12.2% in rank-1 accuracy and 7.0% in mAP.
Download

Paper Nr: 110
Title:

Reducing Bias in Pre-Trained Models by Tuning While Penalizing Change

Authors:

Niklas Penzel, Gideon Stein and Joachim Denzler

Abstract: Deep models trained on large amounts of data often incorporate implicit biases present during training time. If later such a bias is discovered during inference or deployment, it is often necessary to acquire new data and retrain the model. This behavior is especially problematic in critical areas such as autonomous driving or medical decision-making. In these scenarios, new data is often expensive and hard to come by. In this work, we present a method based on change penalization that takes a pre-trained model and adapts the weights to mitigate a previously detected bias. We achieve this by tuning a zero-initialized copy of a frozen pre-trained network. Our method needs very few examples, in extreme cases only a single one, that contradict the bias to increase performance. Additionally, we propose an early stopping criterion to modify baselines and reduce overfitting. We evaluate our approach on a well-known bias in skin lesion classification and three other datasets from the domain shift literature. We find that our approach works especially well with very few images. Simple fine-tuning combined with our early stopping also leads to performance benefits for a larger number of tuning samples.
Download

Paper Nr: 111
Title:

Important Pixels Sampling for NeRF Training Based on Edge Values and Squared Errors Between the Ground Truth and the Estimated Colors

Authors:

Kohei Fukuda, Takio Kurita and Hiroaki Aizawa

Abstract: Neural Radiance Fields (NeRF) has impacted computer graphics and computer vision by enabling fine 3D representations using neural networks. However, depending on the data (especially on synthetic datasets with single-color backgrounds), the neural network training of NeRF is often unstable, and the rendering results become poor. This paper proposes a method to sample the informative pixels to remedy these shortcomings. The sampling method consists of two phases. In the early stage of learning (up to 1/10 of all iterations), the sampling probability is determined based on the edge strength obtained by edge detection. Also, we use the squared errors between the ground truth and the estimated color of the pixels for sampling. The introduction of these tweaks improves the learning of NeRF. In the experiment, we confirmed the effectiveness of the method. In particular, for small amounts of data, the training process of the neural network for NeRF was accelerated and stabilized.
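The two-phase sampling schedule described in this abstract can be sketched as follows. This is an illustrative NumPy reconstruction, not the authors' code: the finite-difference edge detector, the exact 1/10 phase boundary, and the batch size are assumptions.

```python
import numpy as np

def sampling_probs(image_gray, sq_errors, step, total_steps, edge_phase=0.1):
    """Per-pixel probabilities for selecting training rays.

    For the first `edge_phase` fraction of iterations the probability
    follows edge strength; afterwards it follows the squared error
    between the estimated and ground-truth colors.
    """
    if step < edge_phase * total_steps:
        gy, gx = np.gradient(image_gray)      # simple finite-difference edges
        weights = np.hypot(gx, gy)
    else:
        weights = sq_errors
    weights = weights + 1e-8                  # keep every pixel selectable
    return weights / weights.sum()

H, W = 8, 8
img = np.zeros((H, W))
img[:, 4:] = 1.0                              # toy image with one vertical edge
errs = np.random.default_rng(0).random((H, W))
p_early = sampling_probs(img, errs, step=5, total_steps=100)
idx = np.random.default_rng(0).choice(H * W, size=16, p=p_early.ravel())
```

In the edge phase, pixels along the vertical edge receive almost all of the sampling mass, which matches the abstract's motivation of concentrating rays on informative regions early in training.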
Download

Paper Nr: 124
Title:

Pixel-Wise Gradient Uncertainty for Convolutional Neural Networks Applied to Out-of-Distribution Segmentation

Authors:

Kira Maag and Tobias Riedlinger

Abstract: In recent years, deep neural networks have defined the state-of-the-art in semantic segmentation where their predictions are constrained to a predefined set of semantic classes. They are to be deployed in applications such as automated driving, although their categorically confined expressive power runs contrary to such open world scenarios. Thus, the detection and segmentation of objects from outside their predefined semantic space, i.e., out-of-distribution (OoD) objects, is of highest interest. Since uncertainty estimation methods like softmax entropy or Bayesian models are sensitive to erroneous predictions, these methods are a natural baseline for OoD detection. Here, we present a method for obtaining uncertainty scores from pixel-wise loss gradients which can be computed efficiently during inference. Our approach is simple to implement for a large class of models, does not require any additional training or auxiliary data and can be readily used on pre-trained segmentation models. Our experiments show the ability of our method to identify wrong pixel classifications and to estimate prediction quality at negligible computational overhead. In particular, we observe superior performance in terms of OoD segmentation to comparable baselines on the SegmentMeIfYouCan benchmark, clearly outperforming other methods.
Download

Paper Nr: 153
Title:

Enabling RAW Image Classification Using Existing RGB Classifiers

Authors:

Rasmus Munksø, Mathias V. Andersen, Lau Nørgaard, Andreas Møgelmose and Thomas B. Moeslund

Abstract: Unprocessed RAW data stands out as a highly valuable image format in image editing and computer vision because it preserves more details, colors, and a wider dynamic range, as captured directly from the camera’s sensor, than non-linearly processed RGB images. Despite its advantages, the computer vision community has largely overlooked RAW files, especially in domains where preserving precise details and accurate colors is crucial. This work addresses this oversight by leveraging transfer learning techniques. By exploiting the vast amount of available RGB data, we enhance the usability of a limited RAW image dataset for image classification. Surprisingly, applying transfer learning from an RGB-trained model to a RAW dataset yields impressive performance, reducing the dataset size barrier in RAW research. These results are promising, demonstrating the potential of cross-domain transfer learning between RAW and RGB data and opening doors for further exploration in this area of research.
Download

Paper Nr: 156
Title:

Non-Local Context-Aware Attention for Object Detection in Remote Sensing Images

Authors:

Yassin Terraf, El M. Mercha and Mohammed Erradi

Abstract: Object detection in remote sensing images has been widely studied due to the valuable insights it provides for different fields. Detecting objects in remote sensing images is a very challenging task due to the diverse range of sizes, orientations, and appearances of objects within the images. Many approaches have been developed to address these challenges, primarily focusing on capturing semantic information while missing out on contextual details that can bring more insights to the analysis. In this work, we propose a Non-Local Context-Aware Attention (NLCAA) approach for object detection in remote sensing images. NLCAA includes semantic and contextual attention modules to capture both semantic and contextual information. Extensive experiments were conducted on two publicly available datasets, namely NWPU VHR and DIOR, to evaluate the performance of the proposed approach. The experimental results demonstrate the effectiveness of the NLCAA approach against various state-of-the-art methods.
Download

Paper Nr: 178
Title:

Mediapi-RGB: Enabling Technological Breakthroughs in French Sign Language (LSF) Research Through an Extensive Video-Text Corpus

Authors:

Yanis Ouakrim, Hannah Bull, Michèle Gouiffès, Denis Beautemps, Thomas Hueber and Annelies Braffort

Abstract: We introduce Mediapi-RGB, a new dataset of French Sign Language (LSF) along with the first LSF-to-French machine translation model. With 86 hours of video, it is the largest LSF corpus with translations. The corpus consists of original content in French Sign Language produced by deaf journalists, and has subtitles in written French aligned to the signing. The current release of Mediapi-RGB is available at the Ortolang corpus repository (https://www.ortolang.fr/workspaces/mediapi-rgb), and can be used for academic research purposes. The test and validation sets contain 13 and 7 hours of video respectively. The training set contains 66 hours of video that will be released progressively until December 2024. Additionally, the current release contains skeleton keypoints, sign temporal segmentation, spatio-temporal features and subtitles for all the videos in the train, validation and test sets, as well as a suggested vocabulary of nouns for evaluation purposes. In addition, we present the results obtained on this corpus with the first LSF-to-French translation baseline to give an overview of the possibilities offered by this corpus of unprecedented caliber for LSF. Finally, we suggest potential technological and linguistic applications for this new video-text dataset.
Download

Paper Nr: 209
Title:

When Medical Imaging Met Self-Attention: A Love Story That Didn’t Quite Work out

Authors:

Tristan Piater, Niklas Penzel, Gideon Stein and Joachim Denzler

Abstract: A substantial body of research has focused on developing systems that assist medical professionals during labor-intensive early screening processes, many based on convolutional deep-learning architectures. Recently, multiple studies explored the application of so-called self-attention mechanisms in the vision domain. These studies often report empirical improvements over fully convolutional approaches on various datasets and tasks. To evaluate this trend for medical imaging, we extend two widely adopted convolutional architectures with different self-attention variants on two different medical datasets. With this, we aim to specifically evaluate the possible advantages of additional self-attention. We compare our models with similarly sized convolutional and attention-based baselines and evaluate performance gains statistically. Additionally, we investigate how including such layers changes the features learned by these models during the training. Following a hyperparameter search, and contrary to our expectations, we observe no significant improvement in balanced accuracy over fully convolutional models. We also find that important features, such as dermoscopic structures in skin lesion images, are still not learned by employing self-attention. Finally, analyzing local explanations, we confirm biased feature usage. We conclude that merely incorporating attention is insufficient to surpass the performance of existing fully convolutional methods.
Download

Paper Nr: 248
Title:

RailCloud-HdF: A Large-Scale Point Cloud Dataset for Railway Scene Semantic Segmentation

Authors:

Mahdi Abid, Mathis Teixeira, Ankur Mahtani and Thomas Laurent

Abstract: Semantic scene perception is critical for various applications, including railway systems where safety and efficiency are paramount. Railway applications demand precise knowledge of the environment, making Light Detection and Ranging (LiDAR) a fundamental component of sensor suites. Despite the significance of 3D semantic scene understanding in railway context, there exists no publicly available railborne LiDAR dataset tailored for this purpose. In this work, we present a large-scale point cloud dataset designed to advance research in LiDAR-based semantic scene segmentation for railway applications. Our dataset offers dense point-wise annotations for diverse railway scenes, covering over 267 km. To facilitate rigorous evaluation and benchmarking, we propose semantic segmentation of point clouds from a single LiDAR scan as a challenging task. Furthermore, we provide baseline experiments to showcase some state-of-the-art deep learning methods for this task. Our findings highlight the need for more advanced models to effectively address this task. This dataset not only catalyzes the development of sophisticated methods for railway applications, but also encourages exploration of novel research directions.
Download

Paper Nr: 255
Title:

Investigating the Corruption Robustness of Image Classifiers with Random p-norm Corruptions

Authors:

Georg Siedel, Weijia Shao, Silvia Vock and Andrey Morozov

Abstract: Robustness is a fundamental property of machine learning classifiers required to achieve safety and reliability. In the field of adversarial robustness of image classifiers, robustness is commonly defined as the stability of a model to all input changes within a p-norm distance. However, in the field of random corruption robustness, variations observed in the real world are used, while p-norm corruptions are rarely considered. This study investigates the use of random p-norm corruptions to augment the training and test data of image classifiers. We evaluate the model robustness against imperceptible random p-norm corruptions and propose a novel robustness metric. We empirically investigate whether robustness transfers across different p-norms and derive conclusions on which p-norm corruptions a model should be trained and evaluated. We find that training data augmentation with a combination of p-norm corruptions significantly improves corruption robustness, even on top of state-of-the-art data augmentation schemes.
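The training and test augmentation described in this abstract relies on drawing random corruptions with a prescribed p-norm. A minimal sketch of such a draw is given below; the Gaussian direction, the corruption budget `eps`, and the clipping to a valid pixel range are assumptions for illustration, not the paper's exact sampling scheme.

```python
import numpy as np

def random_pnorm_corruption(image, p, eps, rng):
    """Additive random corruption whose p-norm equals eps (a sketch).

    For finite p, a random direction is rescaled to p-norm eps; for
    p = inf, every component is pushed to +/- eps.
    """
    delta = rng.standard_normal(image.shape)
    if np.isinf(p):
        delta = np.sign(delta) * eps
    else:
        delta = delta * (eps / np.linalg.norm(delta.ravel(), ord=p))
    return np.clip(image + delta, 0.0, 1.0)   # keep a valid pixel range

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
corrupted = random_pnorm_corruption(img, p=2, eps=0.5, rng=rng)
```

Training-time augmentation would then draw a fresh `p` and `delta` per batch, which is one way to realize the "combination of p-norm corruptions" the abstract evaluates.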
Download

Paper Nr: 267
Title:

Calisthenics Skills Temporal Video Segmentation

Authors:

Antonio Finocchiaro, Giovanni M. Farinella and Antonino Furnari

Abstract: Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.
Download

Paper Nr: 288
Title:

Detecting Anomalies in Textured Images Using Modified Transformer Masked Autoencoder

Authors:

Afshin Dini and Esa Rahtu

Abstract: We present a new method for detecting and locating anomalies in textured-type images using transformer-based autoencoders. In this approach, a rectangular patch of an image is masked by setting its value to gray and then fed into a pre-trained autoencoder with several blocks of transformer encoders and decoders in order to reconstruct the unknown part. It is shown that the pre-trained model is not able to reconstruct the defective parts properly when they are inside the masked patch. In this regard, the combination of the Structural Similarity Index Measure and the absolute error between the reconstructed image and the original one can be used to define a new anomaly map to find and locate anomalies. In the experiment with the textured images of the MVTec dataset, we discover that this approach not only finds anomalous samples properly, but the anomaly map itself also specifies the exact locations of defects correctly at the same time. Moreover, our method is not only computationally efficient, as it utilizes a pre-trained model and does not require any training, but also performs better than previous autoencoders and other reconstruction-based methods. For these reasons, one can use this method as a base approach to find and locate irregularities in real-world applications.
Download

Paper Nr: 304
Title:

Parts-Based Implicit 3D Face Modeling

Authors:

Yajie Gu and Nick Pears

Abstract: Previous 3D face analysis has focussed on 3D facial identity, expression and pose disentanglement. However, the independent control of different facial parts and the ability to learn explainable parts-based latent shape embeddings for implicit surfaces remain open problems. We propose a method for 3D face modeling that learns a continuous parts-based deformation field that maps the various semantic parts of a subject’s face to a template. By swapping affine-mapped facial features among different individuals from predefined regions we achieve significant parts-based training data augmentation. Moreover, by sequentially morphing the surface points of these parts, we learn corresponding latent representations, shape deformation fields, and the signed distance function of a template shape. This gives improved shape controllability and better interpretability of the face latent space, while retaining all of the known advantages of implicit surface modelling. Unlike previous works that generated new faces based on full-identity latent representations, our approach enables independent control of different facial parts, i.e. nose, mouth, eyes, and also the remaining surface, and yet generates new faces with high reconstruction quality. Evaluations demonstrate both facial expression and parts disentanglement, independent control of those facial parts, as well as state-of-the-art facial parts reconstruction, when evaluated on the FaceScape and Headspace datasets.
Download

Paper Nr: 328
Title:

Robust Long-Tailed Image Classification via Adversarial Feature Re-Calibration

Authors:

Jinghao Zhang, Zhenhua Feng and Yaochu Jin

Abstract: Long-tailed data distribution is a common issue in many practical learning-based approaches, causing Deep Neural Networks (DNNs) to under-fit minority classes. Although this biased problem has been extensively studied by the research community, the existing approaches mainly focus on the class-wise (inter-class) imbalance problem. In contrast, this paper considers both inter-class and intra-class data imbalance problems for network training. To this end, we present Adversarial Feature Re-calibration (AFR), a method that improves the standard accuracy of a trained deep network by adding adversarial perturbations to the majority samples of each class. To be specific, an adversarial attack model is fine-tuned to perturb the majority samples by injecting the features from their corresponding intra-class long-tailed minority samples. This procedure makes the dataset more evenly distributed from both the inter- and intra-class perspectives, thus encouraging DNNs to learn better representations. The experimental results obtained on CIFAR-100-LT demonstrate the effectiveness and superiority of the proposed AFR method over the state-of-the-art long-tailed learning methods.
Download

Paper Nr: 331
Title:

Alias-Free GAN for 3D-Aware Image Generation

Authors:

Attila Szabó, Yevgeniy Puzikov, Sahan Ayvaz, Sonia Aurelio, Peter Gehler, Reza Shirvany and Malte Alf

Abstract: In this work we build a 3D-aware generative model that produces high quality results with fast inference times. A 3D-aware model generates images and offers control over camera parameters to the user, so that an object can be shown from different viewpoints. The model we build combines the best of two worlds in a very direct way: alias-free Generative Adversarial Networks (GAN) and Neural Radiance Field (NeRF) rendering, followed by image super-resolution. We show that fast and high-quality image synthesis is possible with careful modifications of the well-designed architecture of StyleGAN3. Our design overcomes the problem of viewpoint inconsistency and aliasing artefacts that a direct application of lower-resolution NeRF would exhibit. We show experimental evaluation on two standard benchmark datasets, FFHQ and AFHQv2, and achieve the best or competitive performance on both. Our method does not sacrifice speed: we can render images at megapixel resolution at interactive frame rates.
Download

Paper Nr: 360
Title:

Conditional Vector Graphics Generation for Music Cover Images

Authors:

Ivan Jarsky, Valeria Efimova, Ilya Bizyaev and Andrey Filchenkov

Abstract: Generative Adversarial Networks (GAN) have motivated a rapid growth of the domain of computer image synthesis. As almost all the existing image synthesis algorithms consider an image as a pixel matrix, high-resolution image synthesis is complicated. A good alternative can be vector images. However, they belong to a highly sophisticated parametric space, which is a restriction for solving the task of synthesizing vector graphics by GANs. In this paper, we consider a specific application domain that softens this restriction dramatically, allowing the usage of vector image synthesis. Music cover images should meet the requirements of Internet streaming services and printing standards, which imply high resolution of graphic materials without any additional requirements on the content of such images. Existing music cover image generation services do not analyze the tracks themselves; at most, some consider only genre tags. To generate music covers as vector images that reflect the music and consist of simple geometric objects, we suggest a GAN-based algorithm called CoverGAN. The assessment of resulting images is based on their correspondence to the music, compared with AttnGAN and DALL-E text-to-image generation according to title or lyrics. Moreover, the significance of the patterns found by CoverGAN has been evaluated in terms of the correspondence of the generated cover images to the musical tracks. Listeners evaluate the music covers generated by the proposed algorithm as quite satisfactory and corresponding to the tracks. Music cover image generation code and demo are available at https://github.com/IzhanVarsky/CoverGAN.
Download

Paper Nr: 365
Title:

Anomaly Detection on Roads Using an LSTM and Normal Maps

Authors:

Yusuke Nonaka, Hideo Saito, Hideaki Uchiyama, Kyota Higa and Masahiro Yamaguchi

Abstract: Detecting anomalies on the road is crucial for generating hazard maps within factory premises and facilitating navigation for visually impaired individuals or robots. This paper proposes a method for anomaly detection on road surfaces using normal maps and a Long Short-Term Memory (LSTM). While existing research primarily focuses on detecting anomalies on the road based on variations in height or color information of images, our approach leverages anomaly detection to identify changes in the spatial structure of the walking scenario. The normal (non-anomaly) data consists of time-series normal maps depicting previously traversed roads, which are utilized to predict the upcoming road conditions. Subsequently, an anomaly score is computed by comparing the predicted normal map with the normal map at time t+1. If the anomaly score exceeds a dynamically set threshold, it indicates the presence of anomalies on the road. The proposed method employs unsupervised learning for anomaly detection. To assess the effectiveness of the proposed method, we conducted accuracy assessments using a custom dataset, taking into account a qualitative comparison with the results of existing methods. The results confirm that the proposed method effectively detects anomalies on road surfaces through anomaly detection.
Download

Paper Nr: 368
Title:

Beyond the Known: Adversarial Autoencoders in Novelty Detection

Authors:

Muhammad Asad, Ihsan Ullah, Ganesh Sistu and Michael G. Madden

Abstract: In novelty detection, the goal is to decide if a new data point should be categorized as an inlier or an outlier, given a training dataset that primarily captures the inlier distribution. Recent approaches typically use deep encoder and decoder network frameworks to derive a reconstruction error, and employ this error either to determine a novelty score, or as the basis for a one-class classifier. In this research, we use a similar framework but with a lightweight deep network, and we adopt a probabilistic score with reconstruction error. Our methodology calculates the probability of whether the sample comes from the inlier distribution or not. This work makes two key contributions. The first is that we compute the novelty probability by linearizing the manifold that holds the structure of the inlier distribution. This allows us to interpret how the probability is distributed and can be determined in relation to the local coordinates of the manifold tangent space. The second contribution is that we improve the training protocol for the network. Our results indicate that our approach is effective at learning the target class, and it outperforms recent state-of-the-art methods on several benchmark datasets.
Download

Paper Nr: 399
Title:

Image Generation from Hyper Scene Graphs with Trinomial Hyperedges Using Object Attention

Authors:

Ryosuke Miyake, Tetsu Matsukawa and Einoshin Suzuki

Abstract: Conditional image generation, which aims to generate images consistent with a user’s input, is one of the critical problems in computer vision. Text-to-image models have succeeded in generating realistic images for simple situations in which a few objects are present. Yet, they often fail to generate consistent images for texts representing complex situations. Scene-graph-to-image models have the advantage of generating images for complex situations based on the structure of a scene graph. We extended a scene-graph-to-image model to an image generation model from a hyper scene graph with trinomial hyperedges. Our model, termed hsg2im, improved the consistency of the generated images. However, hsg2im has difficulty in generating natural and consistent images for hyper scene graphs with many objects. The reason is that the graph convolutional network in hsg2im struggles to capture relations of distant objects. In this paper, we propose a novel image generation model which addresses this shortcoming by introducing object attention layers. We also use an auxiliary layout-to-image model to generate higher-resolution images. Experimental validations on the COCO-Stuff and Visual Genome datasets show that the proposed model generates images that are more natural and more consistent with users’ inputs than the cutting-edge hyper scene-graph-to-image model.
Download

Paper Nr: 424
Title:

CSE: Surface Anomaly Detection with Contrastively Selected Embedding

Authors:

Simon Thomine and Hichem Snoussi

Abstract: Detecting surface anomalies of industrial materials poses a significant challenge within a myriad of industrial manufacturing processes. In recent times, various methodologies have emerged, capitalizing on the advantages of employing a network pre-trained on natural images for the extraction of representative features. Subsequently, these features are subjected to processing through a diverse range of techniques including memory banks, normalizing flow, and knowledge distillation, which have exhibited exceptional accuracy. This paper revisits approaches based on pre-trained features by introducing a novel method centered on target-specific embedding. To capture the most representative features of the texture under consideration, we employ a variant of a contrastive training procedure that incorporates both artificially generated defective samples and anomaly-free samples during training. Exploiting the intrinsic properties of surfaces, we derived a meaningful representation from the defect-free samples during training, facilitating a straightforward yet effective calculation of anomaly scores. The experiments conducted on the MVTEC AD and TILDA datasets demonstrate the competitiveness of our approach compared to state-of-the-art methods.
Download

Short Papers
Paper Nr: 25
Title:

Improving Pseudo-Labelling and Enhancing Robustness for Semi-Supervised Domain Generalization

Authors:

Adnan Khan, Mai A. Shaaban and Muhammad Haris Khan

Abstract: Beyond attaining domain generalization (DG), visual recognition models should also be data-efficient during learning by leveraging limited labels. We study the problem of Semi-Supervised Domain Generalization (SSDG), which is crucial for real-world applications like automated healthcare. SSDG requires learning a cross-domain generalizable model when the given training data is only partially labelled. Empirical investigations reveal that DG methods tend to underperform in SSDG settings, likely because they are unable to exploit the unlabelled data. Semi-supervised learning (SSL) shows improved but still inferior results compared to fully-supervised learning. A key challenge, faced by the best performing SSL-based SSDG methods, is selecting accurate pseudo-labels under multiple domain shifts and reducing overfitting to source domains under limited labels. In this work, we propose a new SSDG approach that utilizes a novel uncertainty-guided pseudo-labelling with model averaging (UPLM). Our uncertainty-guided pseudo-labelling (UPL) uses model uncertainty to improve pseudo-label selection, addressing poor model calibration under multi-source unlabelled data. The UPL technique, enhanced by our novel model averaging (MA) strategy, mitigates overfitting to source domains with limited labels. Extensive experiments on key representative DG datasets suggest that our method demonstrates effectiveness against existing methods. Our code and chosen labelled data seeds are available on GitHub: https://github.com/Adnan-Khan7/UPLM.
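The general idea of uncertainty-guided pseudo-label selection can be sketched as below. This is an illustrative sketch, not the paper's UPLM: treating predictive entropy as the uncertainty measure and the threshold values are assumptions.

```python
import numpy as np

def select_pseudo_labels(probs, tau_conf=0.9, tau_unc=0.5):
    """Keep unlabelled samples whose prediction is both confident and
    low-entropy; returns hard pseudo-labels and a selection mask."""
    conf = probs.max(axis=1)                                # top-class confidence
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # predictive entropy
    mask = (conf >= tau_conf) & (entropy <= tau_unc)
    return probs.argmax(axis=1), mask

probs = np.array([[0.97, 0.02, 0.01],   # confident, low entropy: kept
                  [0.40, 0.35, 0.25]])  # uncertain: rejected
labels, keep = select_pseudo_labels(probs)
```

Only samples passing both tests would contribute to the semi-supervised loss, which is the mechanism by which poorly calibrated predictions under domain shift are filtered out.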
Download

Paper Nr: 32
Title:

Classification of Towels in a Robotic Workcell Using Deep Neural Networks

Authors:

Jens M. Rossen, Patrick S. Terp, Norbert Krüger, Laus S. Bigum and Tudor Morar

Abstract: The industrial laundry industry is becoming increasingly automated. Inwatec, a company specializing in this field, is developing a new robot (BLIZZ) to automate the process of grasping individual clean towels from a pile and handing them over to an external folding machine. However, to ensure that towels are folded consistently, information about the type and faces of the towels is required. This paper presents a proof of concept for a towel type and towel face classification system integrated in BLIZZ. These two classification problems are solved by means of a Deep Neural Network (DNN). The performance of the proposed DNN on each of the two classification problems is presented, along with its performance when solving both classification problems at the same time. It is concluded that the proposed network achieves classification accuracies of 94.48%, 97.71% and 98.52% on the face classification problem for three different towel types with non-identical faces. On the type classification problem, it achieves an accuracy of 99.10% on the full dataset. Additionally, it is concluded that the system achieves an accuracy of 96.96% when simultaneously classifying the type and face of a towel on the full dataset.
Download

Paper Nr: 33
Title:

Evaluating Learning Potential with Internal States in Deep Neural Networks

Authors:

Shogo Takasaki and Shuichi Enokida

Abstract: Deploying deep learning models on small-scale computing devices necessitates considering computational resources. However, reducing the model size to accommodate these resources often results in a trade-off with accuracy. The iterative process of training and validating to optimize model size and accuracy can be inefficient. A potential solution to this dilemma is the extrapolation of learning curves, which evaluates a model’s potential based on initial learning curves. As a result, it is possible to efficiently search for a network that achieves a balance between accuracy and model size. Nonetheless, we posit that a more effective approach to analyzing the latent potential of training models is to focus on the internal state, rather than merely relying on the validation scores. In this vein, we propose a module dedicated to scrutinizing the network’s internal state, with the goal of automating the optimization of both accuracy and network size. Specifically, this paper delves into analyzing the latent potential of the network by leveraging the internal state of the Long Short-Term Memory (LSTM) in a traffic accident prediction network.
Download

Paper Nr: 51
Title:

Cybersecurity Intrusion Detection with Image Classification Model Using Hilbert Curve

Authors:

Punyawat Jaroensiripong, Karin Sumongkayothin, Prarinya Siritanawan and Kazunori Kotani

Abstract: Cybersecurity intrusion detection is crucial for protecting an online system from cyber-attacks. Traditional monitoring methods used in the Security Operation Center (SOC) are insufficient to handle the vast volume of traffic data, producing an overwhelming number of false alarms, and eventually resulting in the neglect of intrusion incidents. The recent integration of Machine Learning (ML) and Deep Learning (DL) into SOC monitoring systems has enhanced intrusion detection capabilities by learning the patterns of network traffic data. Despite many ML methods implemented for intrusion detection, the Convolutional Neural Network (CNN), one of the most high-performing ML algorithms, has not been widely adopted for intrusion detection systems. This research aims to explore the potential of applying CNNs to network data flows. Since the CNN was originally designed for image processing applications, it is necessary to convert the 1-dimensional network data flows into 2-dimensional image data. This research presents a novel approach to convert the network data flow into an image (flow-to-image) by the Hilbert curve mapping algorithm, which preserves the locality of the data. Then, we apply the converted images to the CNN-based intrusion detection system. Eventually, the proposed method and model outperform recent methods with 92.43% accuracy and 93.05% F1-score on the CIC-IDS2017 dataset, and 81.78% accuracy and 83.46% F1-score on the NSL-KDD dataset. In addition to the classification capability, the flow-to-image mapping algorithm can also visualize the characteristics of network attacks on the generated images, which can serve as an alternative monitoring approach for the SOC.
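The flow-to-image idea rests on the standard Hilbert-curve index-to-coordinate conversion, which keeps features that are adjacent in the 1-D flow adjacent in the 2-D image. The sketch below uses the well-known iterative d2xy algorithm; the `flow_to_image` wrapper, its `order` parameter, and the zero-padding are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def d2xy(order, d):
    """Map distance d along a Hilbert curve of side 2**order to (x, y),
    using the standard iterative conversion."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def flow_to_image(features, order=4):
    """Lay a 1-D feature vector onto a 2**order square along the curve,
    so neighbouring flow features stay spatial neighbours."""
    side = 1 << order
    img = np.zeros((side, side), dtype=np.float32)
    for d, value in enumerate(features[: side * side]):
        x, y = d2xy(order, d)
        img[y, x] = value
    return img

img = flow_to_image(list(range(256)), order=4)   # 16x16 image
```

Consecutive curve indices always land on 4-neighbouring pixels, which is the locality-preservation property the abstract highlights as the advantage over naive row-major reshaping.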
Download
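As an illustration of the flow-to-image step described in the abstract, the sketch below maps a 1-D feature vector onto a 2^k x 2^k grid along a Hilbert curve, whose defining property is locality: consecutive 1-D indices land in adjacent cells. This uses the standard d2xy index-to-coordinate conversion, not the authors' implementation; the function names and the zero-padding of short vectors are our own choices.

```python
def d2xy(n, d):
    """Map 1-D index d to (x, y) on an n x n Hilbert curve (n a power of 2).
    Standard iterative conversion with quadrant rotation."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:          # rotate the quadrant when moving along x
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def flow_to_image(features, order):
    """Place a 1-D feature vector into a 2**order x 2**order image along
    the Hilbert curve; missing trailing values are left as zeros."""
    n = 2 ** order
    img = [[0.0] * n for _ in range(n)]
    for d, v in enumerate(features[: n * n]):
        x, y = d2xy(n, d)
        img[y][x] = v
    return img
```

Because neighboring flow attributes end up in neighboring pixels, convolutional filters can pick up local patterns that a row-major reshape would scatter.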

Paper Nr: 61
Title:

Lens Flare-Aware Detector in Autonomous Driving

Authors:

Shanxing Ma and Jan Aelterman

Abstract: Autonomous driving has the potential to reduce traffic accidents, and object detection plays a key role in achieving this. This paper studies object detection in the presence of lens flare. We analyze the impact of lens flare on object detection in autonomous driving tasks and propose a lens flare adaptation method, based on Bayesian reasoning, to optimize existing object detection models. This allows us to adjust detection scores and re-rank the detections of existing models according to the intensity of lens flare, achieving higher average precision. Furthermore, the method requires only simple modifications to the outputs of existing object detection models, making it easy to deploy on existing devices.
Download
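A Bayesian score adjustment of the kind described can be sketched as follows. We treat the detector's confidence as a prior probability that a detection is a true positive and update it with an assumed likelihood ratio that decays with flare intensity; both the exponential-decay model and its rate `k` are hypothetical choices for illustration, not the paper's calibration.

```python
import math

def flare_likelihood_ratio(flare, k=3.0):
    """Assumed ratio P(flare | true positive) / P(flare | false positive),
    modeled as exponentially decaying in flare intensity in [0, 1]."""
    return math.exp(-k * flare)

def rerank(detections, k=3.0):
    """detections: list of (score, flare_intensity) pairs.
    Returns Bayes-updated scores in the input order; sorting these
    descending gives the re-ranked detection list."""
    adjusted = []
    for p, flare in detections:
        l = flare_likelihood_ratio(flare, k)
        posterior = p * l / (p * l + (1.0 - p))
        adjusted.append(posterior)
    return adjusted
```

With zero flare the likelihood ratio is 1 and the score is unchanged; increasing flare pushes the posterior down, demoting flare-affected detections in the ranking.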

Paper Nr: 74
Title:

Towards Rapid Prototyping and Comparability in Active Learning for Deep Object Detection

Authors:

Tobias Riedlinger, Marius Schubert, Karsten Kahl, Hanno Gottschalk and Matthias Rottmann

Abstract: Active learning as a paradigm in deep learning is especially important in applications involving intricate perception tasks such as object detection, where labels are difficult and expensive to acquire. Developing active learning methods in such fields is highly computationally expensive and time-consuming, which obstructs the progression of research and leads to a lack of comparability between methods. In this work, we propose and investigate a sandbox setup for rapid development and transparent evaluation of active learning in deep object detection. Our experiments with commonly used configurations of datasets and detection architectures found in the literature show that results obtained in our sandbox environment are representative of results on standard configurations. The total compute time to obtain results and assess the learning behavior can be reduced by factors of up to 14 compared to Pascal VOC and up to 32 compared to BDD100k. This allows data acquisition and labeling strategies to be tested and evaluated in under half a day and contributes to the transparency and development speed in the field of active learning for object detection.
Download

Paper Nr: 76
Title:

Deep Active Learning with Noisy Oracle in Object Detection

Authors:

Marius Schubert, Tobias Riedlinger, Karsten Kahl and Matthias Rottmann

Abstract: Obtaining annotations for complex computer vision tasks such as object detection is an expensive and time-intensive endeavor involving numerous human workers or expert opinions. Reducing the number of annotations required while maintaining algorithm performance is, therefore, desirable for machine learning practitioners and has been successfully achieved by active learning. However, it is not merely the number of annotations that influences model performance but also their quality. In practice, oracles that are queried for new annotations frequently produce significant amounts of noise. Therefore, cleansing procedures are often necessary to review and correct given labels. This process is subject to the same budget as the initial annotation itself, since it requires human workers or even domain experts. Here, we propose a composite active learning framework that includes a label review module for deep object detection. We show that utilizing part of the annotation budget to partially correct noisy annotations in the active dataset leads to early improvements in model performance, especially when coupled with uncertainty-based query strategies. The precision of the label error proposals significantly influences the measured effect of the label review. In our experiments we achieve improvements of up to 4.5 mAP points by incorporating label reviews at an equal annotation budget.
Download

Paper Nr: 82
Title:

Multi-Task Learning Based on Log Dynamic Loss Weighting for Sex Classification and Age Estimation on Panoramic Radiographs

Authors:

Igor Prado, David Lima, Julian Liang, Ana Hougaz, Bernardo Peters and Luciano Oliveira

Abstract: This paper introduces a multi-task learning (MTL) approach for simultaneous sex classification and age estimation in panoramic radiographs, aligning with tasks pertinent to forensic dentistry. To this end, we dynamically optimize the logarithm of the task-specific loss weights during training. Our results demonstrate the superior performance of the proposed MTL network compared to the individual task-based networks, particularly evident across a diverse dataset comprising 7,666 images, spanning ages from 1 to 90 years and encompassing significant sex variability. Our network achieved an F1-score of 90.37%±0.54 and a mean absolute error of 5.66±0.22 under a cross-validation assessment procedure, corresponding to gains of 1.69 percentage points and 1.15 years over the individual sex classification and age estimation networks, respectively. To the best of our knowledge, this is the first successful MTL-based network for these two tasks.
Download
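Log dynamic loss weighting can be read as a homoscedastic-uncertainty-style scheme in which each task loss L_i is scaled by exp(-s_i) and a regularizing term s_i is added, with the log-weights s_i learned jointly with the network (in the spirit of Kendall et al.; the paper's exact formulation may differ). A minimal sketch under that assumption, in plain Python with an analytic gradient step standing in for an autodiff framework:

```python
import math

def combined_loss(task_losses, log_vars):
    """Weighted multi-task loss: sum_i exp(-s_i) * L_i + s_i.
    A task with a persistently large loss drives its s_i up, which
    down-weights it; the +s_i term keeps weights from collapsing."""
    return sum(math.exp(-s) * L + s for L, s in zip(task_losses, log_vars))

def update_log_vars(task_losses, log_vars, lr=0.1):
    """One gradient-descent step on the log-weights,
    using d/ds_i [exp(-s_i) * L_i + s_i] = -exp(-s_i) * L_i + 1."""
    return [s - lr * (-math.exp(-s) * L + 1.0) for L, s in zip(task_losses, log_vars)]
```

For example, with losses 2.0 and 0.5 and both log-weights at 0, one step raises s for the high-loss task and lowers it for the low-loss one, rebalancing the two objectives automatically.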

Paper Nr: 88
Title:

A Comparative Evaluation of Self-Supervised Methods Applied to Rock Images Classification

Authors:

Van T. Nguyen, Dominique Fourer, Désiré Sidibé, Jean-François Lecomte and Souhail Youssef

Abstract: Digital Rock Physics (DRP) is a discipline that employs advanced computational techniques to analyze and simulate rock properties at the pore-scale level. Recently, Self-Supervised Learning (SSL) has shown promising outcomes in various application domains, but its potential in DRP applications remains largely unexplored. In this study, we assess several self-supervised representation learning methods designed for automatic rock category recognition. We demonstrate how different SSL approaches can be specifically adapted for DRP and comparatively evaluated on a new dataset. Our objective is to leverage unlabeled micro-CT (Computed Tomography) image data to train models that capture intricate rock features and obtain representations that enhance the accuracy of classical machine-learning-based rock image classification. Experimental results on the newly proposed rock image dataset indicate that a model initialized with SSL pretraining outperforms its non-self-supervised counterpart. In particular, we find that MoCo-v2 pretraining provides the greatest benefit when labeled training data are limited, compared to other models, including a fully supervised one.
Download

Paper Nr: 123
Title:

Kore Initial Clustering for Unsupervised Domain Adaptation

Authors:

Kyungsik Lee, Youngmi Jun, EunJi Kim, Suhyun Kim, Seong J. Hwang and Jonghyun Choi

Abstract: In the unsupervised domain adaptation (UDA) literature, there exists an array of techniques for deriving domain-adaptive features. Among them, a particularly successful family of approaches, which pseudo-labels the unlabeled target data, has shown promising results. Yet, the majority of existing methods focus primarily on leveraging only the target domain knowledge for pseudo-labeling, while insufficiently considering the source domain knowledge. Here, we hypothesize that high-quality pseudo-labels obtained via classical K-means clustering, considering both the source and target domains, bring simple yet significant benefits. In particular, we propose to assign pseudo-labels to the target domain's instances, better aligned with the source domain labels, through a simple modification of K-means clustering that emphasizes a strengthened notion of centroids, namely Kore Initial Clustering (KIC). The proposed KIC is readily usable with a wide array of UDA models, consistently improving UDA performance on multiple datasets, including Office-Home and Office-31, and demonstrating the efficacy of pseudo-labels in UDA.
Download
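The core idea of initializing target-domain clustering from source-class centroids can be sketched in a few lines. The sketch below seeds K-means with the per-class means of the labeled source features and then runs Lloyd iterations on the target features, so each resulting cluster inherits a source label; this is a generic reading of the approach, not the authors' code, and the function names are our own.

```python
def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kic_pseudo_labels(source_x, source_y, target_x, iters=10):
    """Seed one centroid per source class, refine on the target with
    Lloyd iterations, and return a source-aligned pseudo-label per
    target instance."""
    classes = sorted(set(source_y))
    cents = [centroid([x for x, y in zip(source_x, source_y) if y == c])
             for c in classes]
    assign = []
    for _ in range(max(1, iters)):
        assign = [min(range(len(cents)), key=lambda k: sqdist(x, cents[k]))
                  for x in target_x]
        for k in range(len(cents)):
            members = [x for x, a in zip(target_x, assign) if a == k]
            if members:
                cents[k] = centroid(members)
    return [classes[a] for a in assign]
```

Because the centroids start at the source-class means, cluster indices map directly to class labels, avoiding the cluster-to-class matching step that random K-means initialization would require.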

Paper Nr: 135
Title:

Interpretable Anomaly Analysis for Surveillance Video

Authors:

Meng Dong

Abstract: Nowadays, there exist plenty of techniques for surveillance video anomaly detection. However, most works focus on detection alone, ignoring the process of interpreting the reasons behind an anomaly, especially in real-time anomaly monitoring. Automatic surveillance systems respond to large numbers of alarms based on the types and scores of anomalies and then report to the proper parties. Usually, various types of anomalies, such as abnormal objects, motion, and behaviors, are captured by surveillance cameras and defined by application requirements. In this work, we investigate the underlying reasons for anomalies and propose a general and interpretable anomaly analysis framework formed by three branches: abnormal object category detection, anomalous motion detection, and abnormal/violent behavior recognition; the related scores are then combined to obtain the final result. These three branches cover the various anomaly types found in the real world. Moreover, the fusion of the branches is multivariate, based on specific domains or user requirements; the branches can work together or individually. In particular, an online non-parametric hierarchical event-updating motion model is proposed to detect general motion anomalies, so that events that occur with low frequency or have never been seen before can be detected in an unsupervised, continually updating way. In addition, abnormal human behaviors, such as falling and violence, can be recognized by a spatial-temporal transformer model. The three branches cover different regions but complement each other for joint detection and interpretable anomaly output. Evaluated on existing datasets, our results are competitive with the online and offline state-of-the-art on several public benchmarks, demonstrating the proposed method's scene-independent and interpretable abilities even with simple motion update methods. Moreover, the performance of the individual anomaly detectors also validates the effectiveness of our proposed method.
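The branch-fusion step admits many instantiations; the sketch below shows one illustrative choice, a weighted mean over whichever branches are active, with per-branch weights standing in for the domain- or user-specific fusion rules the abstract mentions. Names and the fusion rule itself are assumptions, not the paper's implementation.

```python
def fuse_anomaly_scores(branch_scores, weights=None):
    """Combine per-branch anomaly scores (e.g. object, motion, behavior)
    into one frame-level score. Branches set to None are inactive and
    are skipped, so the branches can run together or individually."""
    active = {k: v for k, v in branch_scores.items() if v is not None}
    if not active:
        return 0.0
    if weights is None:
        weights = {k: 1.0 for k in active}
    total_w = sum(weights[k] for k in active)
    return sum(weights[k] * v for k, v in active.items()) / total_w
```

Keeping the per-branch scores alongside the fused value is what makes the alarm interpretable: the operator sees not just that a frame is anomalous, but which branch (object, motion, or behavior) drove the score.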

Paper Nr: 142
Title:

How Quality Affects Deep Neural Networks in Fine-Grained Image Classification

Authors:

Joseph Smith, Zheming Zuo, Jonathan Stonehouse and Boguslaw Obara

Abstract: In this paper, we propose a No-Reference Image Quality Assessment (NRIQA) guided cut-off point selection (CPS) strategy to enhance the performance of a fine-grained classification system. Scores given by existing NRIQA methods on the same image may vary and are not as independent of natural image augmentations as expected, which weakens their connection and explainability with respect to fine-grained image classification. Taking the three most commonly adopted image augmentation configurations – cropping, rotating, and blurring – as the entry point, we formulate a two-step mechanism for selecting the most discriminative subset of a given image dataset by considering both the confidence of model predictions and the density distribution of image qualities over several NRIQA methods. Concretely, the cut-off points yielded by those methods are aggregated via majority voting to inform the image subset selection process. The efficacy and efficiency of this mechanism are confirmed by comparing models trained on high-quality images against those trained on a combination of high- and low-quality ones, yielding improvements of 0.7% to 4.2% in mean accuracy on a commercial product dataset across four deep neural classifiers. The robustness of the mechanism is demonstrated by the observation that all selected high-quality images can work jointly with 70% of the low-quality images, with only 1.3% of classification precision sacrificed, when using ResNet34 in an ablation study.
Download
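The majority-voting selection step can be sketched as follows. Each NRIQA method contributes a cut-off point; an image is kept when a strict majority of methods rate it above their respective cut-off. This is an illustrative reading of the mechanism: it assumes scores have been normalized so that higher means better quality (raw BRISQUE/NIQE scores run the other way), and the method names and thresholds are placeholders.

```python
def select_high_quality(quality_scores, cutoffs):
    """quality_scores: one dict per image mapping NRIQA method -> score
    (assumed normalized so higher = better quality).
    cutoffs: dict mapping NRIQA method -> cut-off point.
    Returns the indices of images kept by strict majority vote."""
    selected = []
    for idx, scores in enumerate(quality_scores):
        votes = sum(1 for method, s in scores.items() if s >= cutoffs[method])
        if votes * 2 > len(cutoffs):  # strict majority of the methods
            selected.append(idx)
    return selected
```

Voting across several NRIQA methods hedges against any single method's sensitivity to augmentations such as cropping, rotation, or blur, which is the inconsistency the abstract identifies.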

Paper Nr: 151
Title:

Efficient Parameter Mining and Freezing for Continual Object Detection

Authors:

Angelo G. Menezes, Augusto J. Peter