VISAPP 2023 Abstracts


Area 1 - Image and Video Processing and Analysis

Full Papers
Paper Nr: 14
Title:

Salient Mask-Guided Vision Transformer for Fine-Grained Classification

Authors:

Dmitry Demidov, Muhammad H. Sharif, Aliakbar Abdurahimov, Hisham Cholakkal and Fahad S. Khan

Abstract: Fine-grained visual classification (FGVC) is a challenging computer vision problem, where the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative inter-class variances among visually similar classes. Recently, methods with Vision Transformer (ViT) have demonstrated noticeable achievements in FGVC, generally by employing the self-attention mechanism with additional resource-consuming techniques to distinguish potentially discriminative regions while disregarding the rest. However, such approaches may struggle to effectively focus on truly discriminative regions due to only relying on the inherent self-attention mechanism, resulting in the classification token likely aggregating global information from less-important background patches. Moreover, due to the scarcity of datapoints, classifiers may fail to find the most helpful inter-class distinguishing features, since other unrelated but distinctive background regions may be falsely recognised as being valuable. To this end, we introduce a simple yet effective Salient Mask-Guided Vision Transformer (SM-ViT), where the discriminability of the standard ViT's attention maps is boosted through salient masking of potentially discriminative foreground regions. Extensive experiments demonstrate that with the standard training procedure our SM-ViT achieves state-of-the-art performance on popular FGVC benchmarks among existing ViT-based approaches while requiring fewer resources and lower input image resolution.
Download

Paper Nr: 17
Title:

A Model-agnostic Approach for Generating Saliency Maps to Explain Inferred Decisions of Deep Learning Models

Authors:

Savvas Karatsiolis and Andreas Kamilaris

Abstract: The widespread use of black-box AI models has raised the need for algorithms and methods that explain the decisions made by these models. In recent years, the AI research community has become increasingly interested in model explainability, since black-box models take over more and more complicated and challenging tasks. In the direction of understanding the inference process of deep learning models, many methods that provide human-comprehensible evidence for the decisions of AI models have been developed, with the vast majority relying on access to the internal architecture and parameters of these models (e.g., the weights of neural networks). We propose a model-agnostic method for generating saliency maps that has access only to the output of the model and does not require additional information such as gradients. We use Differential Evolution (DE) to identify which image pixels are the most influential in a model's decision-making process and produce class activation maps (CAMs) whose quality is comparable to the quality of CAMs created with model-specific algorithms. DE-CAM achieves good performance without requiring access to the internal details of the model's architecture, albeit at the cost of higher computational complexity.
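
The following minimal Python sketch illustrates the black-box, perturbation-style idea outlined in the abstract, under several assumptions: a hypothetical predict(image) callable returning the target-class probability, an 8x8 occlusion grid, and an area-penalised objective. The actual DE-CAM encoding, objective and evolutionary operators are not specified in the abstract; scipy's Differential Evolution is used here purely for illustration.

    import numpy as np
    from scipy.optimize import differential_evolution
    from scipy.ndimage import zoom

    GRID = 8  # coarse occlusion-mask resolution (illustrative choice)

    def saliency_by_de(image, predict, area_weight=0.1, maxiter=30):
        """Black-box saliency: find a small occlusion that hurts the class score most.

        image   : HxWx3 float array in [0, 1]
        predict : callable(image) -> probability of the target class
        """
        h, w = image.shape[:2]
        base_prob = predict(image)

        def objective(flat_occl):                        # values in [0, 1]; 1 = occlude
            occl = zoom(flat_occl.reshape(GRID, GRID), (h / GRID, w / GRID), order=1)
            occluded = image * (1.0 - occl)[..., None]   # darken occluded pixels
            drop = base_prob - predict(occluded)         # confidence drop caused by occlusion
            return -drop + area_weight * flat_occl.mean()  # DE minimises this

        result = differential_evolution(objective,
                                        bounds=[(0.0, 1.0)] * (GRID * GRID),
                                        maxiter=maxiter, popsize=15, seed=0)
        # regions whose occlusion most reduces the class score are treated as salient
        return zoom(result.x.reshape(GRID, GRID), (h / GRID, w / GRID), order=1)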
Download

Paper Nr: 22
Title:

Flexible Extrinsic Structured Light Calibration Using Circles

Authors:

Robert Fischer, Michael Hödlmoser and Margrit Gelautz

Abstract: We introduce a novel structured light extrinsic calibration framework that emphasizes calibration flexibility while maintaining satisfactory accuracy. The proposed method facilitates extrinsic calibration by projecting circles into non-planar and dynamically changing scenes over multiple distances without relying on the structured light's intrinsics. Our approach relies on extracting depth information using stereo cameras. The implementation reconstructs light rays by detecting the centers of circles and reconstructing their 3D positions using triangulation. We evaluate our method using synthetically rendered images under relevant lighting and scene conditions, including detection drop-out, circle-center detection error, impact of distances and impact of different scenes. Our implementation achieves a rotational accuracy below 1 degree and a translational accuracy of approximately 1 cm. Based on our experimental results, we expect our approach to be applicable to use cases in which more flexible extrinsic structured light calibration techniques are required, such as automotive headlight calibration.
Download

Paper Nr: 46
Title:

Deformable and Structural Representative Network for Remote Sensing Image Captioning

Authors:

Jaya Sharm, Peketi Divya, C. Vishnu, C. L. Reddy, B. H. Sekhar and C. K. Mohan

Abstract: Remote sensing image captioning, which automatically generates textual descriptions of aerial images, is of great significance for image understanding. The majority of existing architectures work within an encoder-decoder framework. However, existing encoder-decoder based methods for remote sensing image captioning neglect fine-grained structural representations of objects and lack deep encoding representations of an image. In this paper, we propose a novel structural representative network for capturing fine-grained structures of remote sensing imagery to produce fine-grained captions. First, a deformable network is incorporated into the intermediate layers of a convolutional neural network to extract spatially invariant features from an image. Second, a contextual network is incorporated into the last layers of the proposed network to produce multi-level contextual features. In order to extract dense contextual features, an attention mechanism is applied in the contextual network. Thus, holistic representations of aerial images are obtained through the structural representative network by combining spatial and contextual features. Further, features from the structural representative network are provided to multi-level decoders for generating spatially and semantically meaningful captions. The textual descriptions obtained with our proposed approach are evaluated on two standard datasets, namely the Sydney-Captions dataset and the UCM-Captions dataset. A comparative analysis with recently proposed approaches demonstrates the performance of the proposed approach and suggests that it is well suited for remote sensing image captioning tasks.
Download

Paper Nr: 66
Title:

Fine-Tuning Restricted Boltzmann Machines Using No-Boundary Jellyfish

Authors:

Douglas Rodrigues, Gustavo Henrique de Rosa, Kelton A. Pontara da Costa, Danilo S. Jodas and João P. Papa

Abstract: Metaheuristic algorithms present elegant solutions to many problems regardless of their domain. The Jellyfish Search (JS) algorithm is inspired by how jellyfish search for food in ocean currents and move within the swarm. In this work, we propose a new version of the JS algorithm called No-Boundary Jellyfish Search (NBJS) to improve the convergence rate. The NBJS was applied to fine-tune a Restricted Boltzmann Machine (RBM) in the context of image reconstruction. To validate the proposal, experiments were carried out on three public datasets to compare the performance of the NBJS algorithm with its original version and two other metaheuristic algorithms. The results showed that the proposed approach is viable, as it obtained similar or even lower errors compared to models trained without fine-tuning.
Download

Paper Nr: 76
Title:

Masking and Mixing Adversarial Training

Authors:

Hiroki Adachi, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Yasunori Ishii and Kazuki Kozuka

Abstract: While convolutional neural networks (CNNs) have achieved excellent performance in various computer vision tasks, they often misclassify maliciously crafted samples, a.k.a. adversarial examples. Adversarial training is a popular and straightforward technique to defend against the threat of adversarial examples. Unfortunately, CNNs must sacrifice the accuracy of standard samples to improve robustness against adversarial examples when adversarial training is used. In this work, we propose Masking and Mixing Adversarial Training (M2AT) to mitigate the trade-off between accuracy and robustness. We focus on creating diverse adversarial examples during training. Specifically, our approach consists of two processes: 1) masking a perturbation with a binary mask and 2) mixing two partially perturbed images. Experimental results on the CIFAR-10 dataset demonstrate that our method achieves better robustness against several adversarial attacks than previous methods.
Download

Paper Nr: 80
Title:

Robust RGB-D-IMU Calibration Method Applied to GPS-Aided Pose Estimation

Authors:

Abanob Soliman, Fabien Bonardi, Désiré Sidibé and Samia Bouchafa

Abstract: The challenging problem of multi-modal sensor fusion for 3D pose estimation in robotics, known as odometry, relies on the precise calibration of all sensor modalities within the system. Optimal values for time-invariant intrinsic and extrinsic parameters are estimated using various methodologies, from deterministic filters to nondeterministic optimization models. We propose a novel optimization-based method for intrinsic and extrinsic calibration of an RGB-D-IMU visual-inertial setup with a GPS-aided optimizer bootstrapping algorithm. Our front-end pipeline provides reliable initial estimates for the RGB camera intrinsics and trajectory based on an optical flow Visual Odometry (VO) method. Besides calibrating all time-invariant properties, our back-end optimizes the spatio-temporal parameters such as the target’s pose, 3D point cloud, and IMU biases. Experimental results on real-world and realistically high-quality simulated sequences validate the proposed first complete RGB-D-IMU setup calibration algorithm. Ablation studies on ground and aerial vehicles are conducted to estimate each sensor’s contribution in the multi-modal (RGB-D-IMU-GPS) setup on the vehicle’s pose estimation accuracy. GitHub repository: https://github.com/AbanobSoliman/HCALIB.
Download

Paper Nr: 126
Title:

Uncertainty-Aware DPP Sampling for Active Learning

Authors:

Robby Neven and Toon Goedemé

Abstract: Recently, deep learning approaches excel in important computer vision tasks like classification and segmentation. The downside, however, is that they are very data-hungry, and labelling such data is very costly. One way to address this issue is by using active learning: only label and train on diverse and informative data points, not wasting any effort on redundant data. While recent active learning approaches have difficulty combining diversity and informativeness, we propose a sampling technique which efficiently combines these two metrics into a single algorithm. This is achieved by adapting a Determinantal Point Process to also consider model uncertainty. We first show competitive results on the academic classification datasets CIFAR10 and CalTech101, and the CityScapes segmentation task. To further increase the performance of our sampler on segmentation tasks, we extend our method to a patch-based active learning approach, improving the performance by not wasting labelling effort on redundant image regions. Lastly, we demonstrate our method on a more challenging real-world industrial use case, segmenting defects in steel sheet material, which greatly benefits from an active learning approach due to a vast amount of redundant data, and show promising results.
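
One common way to fold per-sample informativeness into a Determinantal Point Process, sketched below, is to scale a pairwise similarity kernel by per-sample uncertainty scores and then select points greedily by log-determinant gain. This quality-diversity decomposition and the greedy sampler are assumptions made for illustration; the abstract does not state the exact kernel or sampling procedure used.

    import numpy as np

    def select_batch(features, uncertainty, k):
        """Greedy selection from an uncertainty-weighted DPP-style kernel.

        features    : (N, D) array of (e.g. L2-normalised) embeddings
        uncertainty : (N,) positive scores, e.g. predictive entropy
        k           : number of samples to select for labelling
        """
        S = features @ features.T                              # pairwise similarity
        L = uncertainty[:, None] * S * uncertainty[None, :]    # quality-weighted kernel
        N = L.shape[0]
        selected = []
        for _ in range(k):
            best_gain, best_i = -np.inf, -1
            for i in range(N):
                if i in selected:
                    continue
                idx = selected + [i]
                sub = L[np.ix_(idx, idx)] + 1e-6 * np.eye(len(idx))
                sign, logdet = np.linalg.slogdet(sub)
                if sign > 0 and logdet > best_gain:
                    best_gain, best_i = logdet, i
            selected.append(best_i)
        return selected

Items that are both uncertain and dissimilar from the already selected ones yield the largest determinant gain, which is exactly the combination of informativeness and diversity the abstract refers to.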
Download

Paper Nr: 167
Title:

Complement Objective Mining Branch for Optimizing Attention Map

Authors:

Takaaki Iwayoshi, Hiroki Adachi, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: The attention branch network (ABN) can achieve high accuracy by visualizing the attention area of the network during inference and utilizing it in the recognition process. However, if the attention area does not highlight the target object to be recognized, it may cause recognition failure. While there is a method for fine-tuning the ABN using attention maps modified by human knowledge, it requires a lot of human labor and time because the attention map needs to be modified manually. The method introducing the attention mining branch (AMB) to ABN improves the attention area without using human knowledge by learning while considering whether the attention area is effective for recognition. However, even with AMB, attention regions other than the target object, i.e., unnecessary attention regions, may remain. In this paper, we investigate the effects of unwanted attention areas and propose a method to further improve the attention areas of ABN and AMB. Our evaluation experiments show that the proposed method improves the recognition accuracy and obtains an attention map that appropriately focuses on the target object to be recognized.
Download

Paper Nr: 173
Title:

Study of Coding Units Depth for Depth Maps Quality Scalable Compression Using SHVC

Authors:

Dorsaf Sebai, Faouzi Ghorbel and Sounia Messbahi

Abstract: Scalable High Efficiency Video Coding (SHVC) is used to adaptively encode texture images. The SHVC architecture is composed of Base and Enhancement Layers (BL and EL), with an interlayer picture processing module between them. In order to ensure effective encoding, each picture is divided into a certain number of Coding Units (CUs), with different depths, composing the Coding Tree Unit (CTU). Being initially dedicated to texture images, SHVC does not provide the same efficiency when applied to depth maps. To understand the causes behind this, we propose to study the SHVC CTU partitioning for depth maps. This can be a starting point for proposing an efficient scalable 3D video compression scheme. The main observations of this study show that the depth of most CUs is 2 or 3 for texture images, whereas it is either 0 or 1 for depth maps. Moreover, CU depths frequently change when passing from the base to the enhancement layer of SHVC in non-flat regions. This is not the case for smooth regions, which generally preserve the same CU depths in the two SHVC layers.
Download

Paper Nr: 204
Title:

Predicting Eye Gaze Location on Websites

Authors:

Ciheng Zhang, Decky Aspandi and Steffen Staab

Abstract: The World Wide Web, with websites and webpages as its main interface, facilitates the dissemination of important information. Hence it is crucial to optimize webpage design for better user interaction, which is primarily done by analyzing users' behavior, especially users' eye-gaze locations on the webpage. However, gathering these data is still considered to be labor and time intensive. In this work, we enable the development of automatic eye-gaze estimation given webpage screenshots as input by curating a unified dataset that consists of webpage screenshots, eye-gaze heatmaps and website layout information in the form of image and text masks. Our curated dataset allows us to propose a deep learning-based model that leverages both the webpage screenshot and content information (image and text spatial location), which are combined through an attention mechanism for effective eye-gaze prediction. In our experiments, we show the benefits of careful fine-tuning using our unified dataset to improve the accuracy of eye-gaze predictions. We further observe the capability of our model to focus on targeted areas (images and text) to achieve accurate eye-gaze area predictions. Finally, a comparison with other alternatives shows that our approach achieves state-of-the-art results, establishing a benchmark for the webpage-based eye-gaze prediction task.
Download

Paper Nr: 261
Title:

Toward a Thermal Image-Like Representation

Authors:

Patricia L. Suárez and Angel D. Sappa

Abstract: This paper proposes a novel model to obtain thermal image-like representations to be used as an input in any thermal image compressive sensing approach (e.g., thermal image filtering, enhancing, super-resolution). Thermal images offer interesting information about the objects in the scene, in addition to their temperature. Unfortunately, in most cases thermal cameras acquire low-resolution/low-quality images. Hence, in order to improve these images, there are several state-of-the-art approaches that exploit complementary information from a low-cost channel (visible image) to increase the image quality of an expensive channel (infrared image). In these SOTA approaches, visible images are fused at different levels without taking into account that the images capture information in different bands of the spectrum. In this paper, a novel approach is proposed to generate thermal image-like representations from low-cost visible images by means of a contrastive cycled GAN network. The obtained representations (synthetic thermal images) can later be used to improve the low-quality thermal image of the same scene. Experimental results on different datasets are presented.
Download

Short Papers
Paper Nr: 7
Title:

Deep Learning Semantic Segmentation Models for Detecting the Tree Crown Foliage

Authors:

Danilo S. Jodas, Giuliana N. Velasco, Reinaldo Araujo de Lima, Aline R. Machado and João P. Papa

Abstract: Urban tree monitoring yields significant benefits to the environment and human society. Several aspects are essential to ensure the good condition of the trees and eventually predict their mortality or the risk of falling. So far, the most common strategy relies on the tree’s physical measures acquired from fieldwork analysis, which includes its height, diameter of the trunk, and metrics from the crown for a first glance condition analysis. The canopy of the tree is essential for predicting the resistance to extreme climatic conditions. However, the manual process is laborious considering the massive number of trees in the urban environment. Therefore, computer-aided methods are desirable to provide forestry managers with a rapid estimation of the tree foliage covering. This paper proposes a deep learning semantic segmentation strategy to detect the tree crown foliage in images acquired from the street-view perspective. The proposed approach employs several improvements to the well-known U-Net architecture in order to increase the prediction accuracy and reduce the network size. Compared to several vegetation indices found in the literature, the proposed model achieved competitive results considering the overlapping with the reference annotations.
Download

Paper Nr: 51
Title:

Turkish Sign Language Recognition Using CNN with New Alphabet Dataset

Authors:

Tuğçe Temel and Revna A. Vural

Abstract: Sign Language Recognition (SLR), also referred to as hand gesture recognition, is an active area of research in computer vision that aims to facilitate communication between the deaf-mute community and people who do not understand sign language. The objective of this study is to examine how this problem can be tackled specifically for Turkish Sign Language (TSL). For this problem, we present a real-time system based on convolutional neural networks (CNN); the most important contribution of this study, however, is that we present what is, to our knowledge, the first open-source TSL alphabet dataset. This dataset focuses on finger spelling and has been collected from 30 people. We conduct and present experiments with this new dataset. Our system scores an average accuracy of 99.5% and a top accuracy of 99.9% on our dataset. Further tests were conducted to measure the performance of our model in real time and are included in the study. Finally, our proposed model is trained on a couple of American Sign Language (ASL) datasets, where the results turn out to be state-of-the-art. Our dataset is available at https://github.com/tugcetemel1/TSL-Recognition-with-CNN.
Download

Paper Nr: 68
Title:

An Extension of the Radial Line Model to Predict Spatial Relations

Authors:

Logan Servant, Camille Kurtz and Laurent Wendling

Abstract: Analysing the spatial organization of objects in images is fundamental to increasing both the understanding of a scene and the explicability of perceived similarity between images. In this article, we propose to describe the spatial positioning of objects by an extension of the original Radial Line Model to any pair of objects present in an image, by defining a reference point from the convex hulls and not the enclosing rectangles, as done in the initial version of this descriptor. The recognition of spatial configurations is then considered as a classification task where the achieved descriptors can be embedded in a neural learning mechanism to predict from object pairs their directional spatial relationships. An experimental study, carried out on different image datasets, highlights the interest of this approach and also shows that such a representation makes it possible to automatically correct or denoise datasets whose construction has been rendered ambiguous by the human evaluation of 2D/3D views. Source code: https://github.com/Logan-wilson/extendedRLM.
Download

Paper Nr: 69
Title:

Persistent Homology Based Generative Adversarial Network

Authors:

Jinri Bao, Zicong Wang, Junli Wang and Chungang Yan

Abstract: In recent years, image generation has become one of the most popular research areas in the field of computer vision. Significant progress has been made in image generation based on generative adversarial network (GAN). However, the existing generative models fail to capture enough global structural information, which makes it difficult to coordinate the global structural features and local detail features during image generation. This paper proposes the Persistent Homology based Generative Adversarial Network (PHGAN). A topological feature transformation algorithm is designed based on the persistent homology method and then the topological features are integrated into the discriminator of GAN through the fully connected layer module and the self-attention module, so that the PHGAN has an excellent ability to capture global structural information and improves the generation performance of the model. We conduct an experimental evaluation of the PHGAN on the CIFAR10 dataset and the STL10 dataset, and compare it with several classic generative adversarial network models. The better results achieved by our proposed PHGAN show that the model has better image generation ability.
Download

Paper Nr: 82
Title:

Adaptive Fourier Single-Pixel Imaging Based on Probability Estimation

Authors:

Wei Lun Tey, Mau-Luen Tham, Yeong-Nan Phua and Sing Yee Chua

Abstract: Fourier single-pixel imaging (FSI) is able to reconstruct images by sampling the information in the Fourier domain. The conventional sampling method of FSI acquires the low frequency Fourier coefficients to obtain the image outlines but misses out on the image details in high frequency bands. The variable density sampling method improves the image quality but follows a predefined mechanism where the power of image information decreases when frequency increases. In this paper, an adaptive approach is proposed to sample the Fourier coefficients based on probability estimation. While the low frequency Fourier coefficients are fully sampled to secure the image outlines, the high frequency Fourier coefficients are sparsely sampled adaptively, and the image is reconstructed through Compressed sensing (CS) algorithm. Results show that the proposed adaptive FSI sampling method improves the image quality with sampling ratio ranging from 0.05 to 0.25, as compared to the commonly used conventional low frequency sampling and variable density sampling methods.
Download

Paper Nr: 86
Title:

Hand Segmentation with Mask-RCNN Using Mainly Synthetic Images as Training Sets and Repetitive Training Strategy

Authors:

Amin Dadgar and Guido Brunnett

Abstract: We propose an approach to segment hands in real scenes. To that end, we employ 1) a relatively large amount of very simplistic synthetic images and 2) a small number of real images, and propose 3) a repetitive training scheme to resolve the phenomenon we call premature learning saturation (which arises when using a relatively large training set). The results suggest the feasibility of hand segmentation, provided that the parameters and specifications of each category are attended to with meticulous care. We conduct a short study to quantitatively demonstrate the benefits of our repetitive training in a more general setting with the Mask-RCNN framework.
Download

Paper Nr: 89
Title:

Data-Driven Fingerprint Reconstruction from Minutiae Based on Real and Synthetic Training Data

Authors:

Andrey Makrushin, Venkata S. Mannam and Jana Dittmann

Abstract: Fingerprint reconstruction from minutiae performed by model-based approaches often leads to fingerprint patterns that lack realism. In contrast, data-driven reconstruction leads to realistic fingerprints, but the reproduction of a fingerprint's identity remains a challenging problem. In this paper, we examine how the pix2pix network can be fitted to the reconstruction of realistic high-quality fingerprint images from minutiae maps. For encoding minutiae in minutiae maps we propose directed line and pointing minutiae approaches. We extend the pix2pix architecture to process complete plain fingerprints at their native resolution. Although our focus is on biometric fingerprints, the same concept applies to the synthesis of latent fingerprints. We train models based on real and synthetic datasets and compare their performance regarding the realistic appearance of generated fingerprints and reconstruction success. Our experiments establish pix2pix to be a valid and scalable solution. Reconstruction from minutiae enables identity-aware generation of synthetic fingerprints, which in turn enables the compilation of large-scale, privacy-friendly synthetic fingerprint datasets including mated impressions.
Download

Paper Nr: 92
Title:

EHDI: Enhancement of Historical Document Images via Generative Adversarial Network

Authors:

Abir Fathallah, Mounim A. El-Yacoubi and Najoua E. Ben Amara

Abstract: Images of historical documents suffer from significant degradation over time. Due to this degradation, exploiting the information contained in these documents has become a challenging task. Consequently, it is important to develop an efficient tool for the quality enhancement of such documents. To address this issue, we present in this paper a new model known as EHDI (Enhancement of Historical Document Images), which is based on generative adversarial networks. The task is considered as an image-to-image conversion process where our GAN model produces a clean version of a degraded historical document. EHDI employs a global loss function that combines content, adversarial, perceptual and total variation losses to recover global image information and generate realistic local textures. Both quantitative and qualitative experiments demonstrate that our proposed EHDI significantly outperforms state-of-the-art methods on the widely used DIBCO 2013, DIBCO 2017, and H-DIBCO 2018 datasets. Our model is adaptable to other document enhancement problems, as shown by results across a wide range of degradations. Our code is available at https://github.com/Abir1803/EHDI.git.
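
Among the four loss terms mentioned, the total variation component has a standard form; a minimal PyTorch sketch is shown below for illustration only (the relative weights and the exact content, perceptual and adversarial terms used by EHDI are not given in the abstract).

    import torch

    def total_variation_loss(img):
        """Anisotropic total variation of a batch of images, shape (B, C, H, W).
        Penalises differences between neighbouring pixels, encouraging smooth,
        artefact-free restorations."""
        dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
        dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
        return dh + dw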
Download

Paper Nr: 97
Title:

Concept Explainability for Plant Diseases Classification

Authors:

Jihen Amara, Birgitta König-Ries and Sheeba Samuel

Abstract: Plant diseases remain a considerable threat to food security and agricultural sustainability. Rapid and early identification of these diseases has become a significant concern, motivating several studies to rely on the increasing global digitalization and the recent advances in computer vision based on deep learning. In fact, plant disease classification based on deep convolutional neural networks has shown impressive performance. However, these methods have yet to be adopted globally due to concerns regarding their robustness, transparency, and the lack of explainability compared with their human expert counterparts. Methods such as saliency-based approaches, which associate the network output with perturbations of the input pixels, have been proposed to give insights into these algorithms. Still, they are not easily comprehensible or intuitive for human users and are threatened by bias. In this work, we deploy a method called Testing with Concept Activation Vectors (TCAV) that shifts the focus from pixels to user-defined concepts. To the best of our knowledge, our paper is the first to employ this method in the field of plant disease classification. Important concepts such as color, texture and disease-related concepts were analyzed. The results suggest that concept-based explanation methods can significantly benefit automated plant disease identification.
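
For readers unfamiliar with TCAV, the sketch below shows its core computation under simplifying assumptions: a concept activation vector (CAV) is the normal of a linear classifier separating layer activations of concept examples from random examples, and the TCAV score is the fraction of class inputs whose logit gradient with respect to those activations (assumed precomputed here) has a positive component along the CAV. This follows the published TCAV recipe at a high level, not necessarily the authors' exact pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def tcav_score(concept_acts, random_acts, class_grads):
        """concept_acts, random_acts : (Nc, D), (Nr, D) activations at one layer
        class_grads : (Nx, D) gradients of the class logit w.r.t. that layer,
                      one row per test image of the class under study (precomputed).
        """
        X = np.vstack([concept_acts, random_acts])
        y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])   # concept activation vector
        sensitivities = class_grads @ cav                   # directional derivatives
        return float((sensitivities > 0).mean())            # fraction of positive sensitivities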
Download

Paper Nr: 105
Title:

Exploiting GAN Capacity to Generate Synthetic Automotive Radar Data

Authors:

Mauren S. C. de Andrade, Matheus V. Nogueira, Eduardo C. Fidelis, Luiz A. Campos, Pietro P. Campos, Torsten Schön and Lester de Abreu Faria

Abstract: In this paper, we evaluate the training of GANs for synthetic RAD image generation for four objects reflected by a Frequency Modulated Continuous Wave radar: car, motorcycle, pedestrian and truck. This evaluation adds a new possibility for data augmentation when the available labeled radar data is not sufficient. The results show that the GANs generate RAD images well, even when a specific object class is required. We also compared the scores of three GAN architectures, vanilla GAN, CGAN, and DCGAN, in synthetic RAD image generation, and the analyzed results show that the generator can produce RAD images of sufficient quality.
Download

Paper Nr: 125
Title:

Search for Rotational Symmetry of Binary Images via Radon Transform and Fourier Analysis

Authors:

Nikita Lomov, Oleg Seredin, Olesia Kushnir and Daniil Liakhov

Abstract: We consider the optimization of rotational symmetry properties of 2D shapes, such as the focus position, the degree of symmetry, and a symmetry measure expressed as the Jaccard index generalized to a group of two or more shapes. We propose to reduce symmetry detection to averaging the Jaccard indices over all possible pairs of rotated shapes. It is sufficient to consider a number of pairs linearly proportional to the degree of symmetry. It is shown that, for the class of planar affine transformations that map lines to lines, and rotations in particular, an upper bound on the Jaccard index can be derived directly from the Radon transform of the shape. We propose a fast estimation of a shape's degree of symmetry by applying Fourier analysis to secondary features derived from the Radon transform. The proposed methods were implemented as a highly efficient computational procedure. The results are consistent with expert judgment of the quality of symmetry.
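
A toy numpy sketch of the averaged-Jaccard symmetry measure mentioned above, assuming a binary shape mask, a candidate symmetry degree n, and rotation about the image centre; the paper's optimization of the focus position and the Radon-transform and Fourier refinements are not reproduced here.

    import numpy as np
    from scipy.ndimage import rotate

    def jaccard(a, b):
        """Jaccard index (intersection over union) of two binary masks."""
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 1.0

    def rotational_symmetry_measure(mask, n):
        """Average Jaccard index between the shape and its rotations by k*360/n degrees."""
        scores = []
        for k in range(1, n):
            rotated = rotate(mask.astype(float), angle=k * 360.0 / n,
                             reshape=False, order=0) > 0.5
            scores.append(jaccard(mask, rotated))
        return float(np.mean(scores))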
Download

Paper Nr: 127
Title:

N-MuPeTS: Event Camera Dataset for Multi-Person Tracking and Instance Segmentation

Authors:

Tobias Bolten, Christian Neumann, Regina Pohle-Fröhlich and Klaus D. Tönnies

Abstract: Compared to well-studied frame-based imagers, event-based cameras form a new paradigm. They are biologically inspired optical sensors and differ in operation and output. While a conventional frame is dense and ordered, the output of an event camera is a sparse and unordered stream of output events. Therefore, to take full advantage of these sensors new datasets are needed for research and development. Despite their ongoing use, the selection and availability of event-based datasets is currently still limited. To address this limitation, we present a technical recording setup as well as a software processing pipeline for generating event-based recordings in the context of multi-person tracking. Our approach enables the automatic generation of highly accurate instance labels for each individual output event using color features in the scene. Additionally, we employed our method to release a dataset including one to four persons addressing the common challenges arising in multi-person tracking scenarios. This dataset contains nine different scenarios, with a total duration of over 85 minutes.
Download

Paper Nr: 155
Title:

Neural Style Transfer for Image-Based Garment Interchange Through Multi-Person Human Views

Authors:

Hajer Ghodhbani, Mohamed Neji and Adel M. Alimi

Abstract: The generation of photorealistic images of human appearance under the guidance of body pose enables a wide range of applications, including virtual fitting and style synthesis. Several advances have been made in this direction using image-based deep learning generation approaches. The issue with these methods is that they produce significant aberrations in the final output, such as blurring of fine details and texture alterations. Our work addresses this objective by proposing a system able to perform garment transfer between different views of a person while overcoming these issues. To this end, two fundamental steps are carried out. First, we use a conditioning adversarial network to deal with pose and appearance separately, create a human shape image with precise control over pose, and align the target garment with the appropriate body parts in the human image. Second, we introduce a neural approach for style transfer that can separate and merge the content and style of the edited images. We design an architecture with distinct levels to ensure style transfer while preserving the quality of the original texture in the generated results.
Download

Paper Nr: 164
Title:

Image Quality Assessment for Object Detection Performance in Surveillance Videos

Authors:

Poonam Beniwal, Pranav Mantini and Shishir K. Shah

Abstract: The proliferation of video surveillance cameras in recent years has increased the volume of visual data produced. This exponential growth in data has led to greater use of automated analysis. However, the performance of such systems depends upon the image/video quality, which varies heavily in the surveillance network. Compression is one such factor that introduces artifacts in the data. It is crucial to assess the quality of visual data to determine the reliability of the automated analysis. However, traditional image quality assessment (IQA) methods focus on the human perspective to objectively determine the quality of images. This paper focuses on assessing the image quality for the object detection task. We propose a full-reference quality metric based on the cosine similarity between features extracted from lossless compressed and lossy compressed images. However, the use of full-reference metrics is limited by the availability of reference images. To overcome this limitation, we also propose a no-reference metric. We evaluated our metric on a video surveillance dataset. The proposed quality metrics are evaluated using error vs. reject curves, demonstrating a better correlation with false negatives.
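
A minimal sketch of the full-reference variant described above, assuming a hypothetical extract_features function (e.g. a detection backbone's pooled feature map); the paper's exact feature layer and aggregation are not stated in the abstract.

    import numpy as np

    def detection_quality(reference_img, compressed_img, extract_features):
        """Full-reference quality score: cosine similarity between features of the
        lossless reference and the lossy-compressed image."""
        f_ref = extract_features(reference_img).ravel()
        f_cmp = extract_features(compressed_img).ravel()
        denom = np.linalg.norm(f_ref) * np.linalg.norm(f_cmp)
        return float(f_ref @ f_cmp / denom) if denom else 0.0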
Download

Paper Nr: 171
Title:

Advanced Deep Transfer Learning Using Ensemble Models for COVID-19 Detection from X-ray Images

Authors:

Walid Hariri and Imed E. Haouli

Abstract: The Coronavirus disease (COVID-19) pandemic has become one of the main causes of mortality around the world. In this paper, we employ a transfer learning-based method using five pre-trained deep convolutional neural network (CNN) architectures fine-tuned with an X-ray image dataset to detect COVID-19. Hence, we use VGG-16, ResNet50, InceptionV3, ResNet101 and Inception-ResNetV2 models in order to classify the input images into three classes (COVID-19 / Healthy / Other viral pneumonia). The results of each model are presented in detail using 10-fold cross-validation, and a comparative analysis of these models is given, taking into account different elements in order to find the most suitable model. To further enhance the performance of the single models, we propose to combine their predictions using the majority vote strategy. The proposed method has been validated on a publicly available chest X-ray image database that contains more than one thousand images per class. Evaluation measures of the classification performance have been reported and discussed in detail. Promising results have been achieved compared to state-of-the-art methods, with the proposed ensemble model achieving higher performance than any single model. This study gives more insight to researchers for choosing the best models to accurately detect the COVID-19 virus.
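
The majority-vote combination mentioned above can be sketched in a few lines; per-model class predictions are assumed to be available as integer labels (the mapping 0 = COVID-19, 1 = healthy, 2 = other viral pneumonia is only an illustrative encoding, not taken from the paper).

    import numpy as np

    def majority_vote(predictions):
        """predictions : (n_models, n_samples) array of predicted class labels.
        Returns the per-sample label chosen by most models (ties -> lowest label)."""
        predictions = np.asarray(predictions)
        n_classes = predictions.max() + 1
        votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
        return votes.argmax(axis=0)

    # e.g. combining VGG-16, ResNet50, InceptionV3, ResNet101 and Inception-ResNetV2:
    # final = majority_vote([vgg_pred, rn50_pred, incv3_pred, rn101_pred, incrn_pred])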
Download

Paper Nr: 172
Title:

Counting People in Crowds Using Multiple Column Neural Networks

Authors:

Christian M. Konishi and Helio Pedrini

Abstract: Crowd counting from images is a research field of great interest due to its various applications, such as surveillance camera monitoring and urban planning. In this work, a model (MCNN-U) based on Generative Adversarial Networks (GANs) with a Wasserstein cost and Multiple Column Neural Networks (MCNNs) is proposed to obtain better estimates of the number of people. The model was evaluated using two crowd counting databases, UCF-CC-50 and ShanghaiTech. In the first database, the reduction in the mean absolute error was greater than 30%, whereas the gains were smaller in the second database. An adaptation of the LayerCAM method was also proposed for visualizing the crowd counting network.
Download

Paper Nr: 180
Title:

A Low-Cost Process for Plant Motion Magnification for Smart Indoor Farming

Authors:

Danilo Pena, Parinaz Dehaghani, Oussama H. Abdelkader, Hadjer Bouzebiba and A. P. Aguiar

Abstract: Smart indoor farming promises to improve the capacity to feed people in urban centers in future production. Non-invasive sensing and monitoring technologies play a crucial role in enabling such controlled environments. In this paper, we propose a new architecture to magnify subtle movements of plants in videos, highlighting non-perceptible motions that can be used for analyzing and obtaining characteristic traits of plants. We investigate the limitations of the technique with synthetic and real data and evaluate different plant samples. Experimental results present leaf movements from short videos that could not be noticed before the magnification.
Download

Paper Nr: 196
Title:

Generating Pedestrian Views from In-Vehicle Camera Images

Authors:

Daina Shimoyama, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a method for predicting and generating pedestrian viewpoint images from images captured by an in-vehicle camera. Since the viewpoints of an in-vehicle camera and a pedestrian are very different, viewpoint transfer to the pedestrian viewpoint generally results in a large amount of missing information. To cope with this problem, in this research we use the semantic structure of the road scene. In general, it is considered that there are certain regularities in the driving environment, such as the positional relationship between roads, vehicles, and buildings. We generate accurate pedestrian views by using such structural information about road scenes.
Download

Paper Nr: 197
Title:

Adaptive Resolution Selection for Improving Segmentation Accuracy of Small Objects

Authors:

Haruki Fujii and Kazuhiro Hotta

Abstract: This paper proposes a segmentation method using adaptive resolution selection for improving the accuracy of small objects. In semantic segmentation, the segmentation of small objects is more difficult than that of large objects. Semantic segmentation requires both spatial details to locate objects and strong semantics to classify objects well, which are likely to exist at different resolution/scale levels. We believe that small objects are well represented by high-resolution feature maps, while large objects are suitable for low-resolution feature maps with high semantic information, and propose a method to automatically select a resolution and assign it to each object in the HRNet with multi-resolution feature maps. We propose Adaptive Resolution Selection Module (ARSM), which selects the resolution for segmentation of each class. The proposed method considers the feature map of each resolution in the HRNet as an Expert Network, and a Gating Network selects adequate resolution for each class. We conducted experiments on Drosophila cell images and the Covid 19 dataset, and confirmed that the proposed method achieved higher accuracy than the conventional method.
Download

Paper Nr: 199
Title:

IACT: Intensive Attention in Convolution-Transformer Network for Facial Landmark Localization

Authors:

Zhanyu Gao, Kai Chen and Dahai Yu

Abstract: Recently, facial landmark localization based on deep learning methods has achieved promising results, but these methods ignore global context information and the long-range relationships among the landmarks. To address this issue, we propose a parallel multi-branch architecture combining convolutional blocks and transformer layers for facial landmark localization, named the Intensive Attention in Convolution-Transformer Network (IACT), which has the advantages of capturing detailed features and gathering global dynamic attention weights. To further improve the performance, the Intensive Attention mechanism is incorporated into the Convolution-Transformer Network, which includes Multi-head Spatial attention, Feature attention and Channel attention. In addition, we present a novel loss function named the Smooth Wing Loss that addresses the gradient discontinuity of the Adaptive Wing loss, resulting in better convergence. Our IACT achieves state-of-the-art performance on the WFLW, 300W, and COFW datasets with Normalized Mean Errors of 4.04, 2.82 and 3.12, respectively.
Download

Paper Nr: 215
Title:

Concept Study for Dynamic Vision Sensor Based Insect Monitoring

Authors:

Regina Pohle-Fröhlich and Tobias Bolten

Abstract: A decline in insect populations has been observed for many years. Therefore, it is necessary to measure the number and species of insects to evaluate the effectiveness of the interventions taken against this decline. We describe a sensor-based approach to realize an insect monitoring system utilizing a Dynamic Vision Sensor (DVS). In this concept study, the processing steps required for this are discussed and suggestions for suitable processing methods are given. On the basis of a small dataset, a clustering and filtering-based labeling approach is proposed, which is a promising option for the preparation of larger DVS insect monitoring datasets. An U-Net based segmentation was tested for the extraction of insect flight trajectories, achieving an F1-score of ≈ 0.91. For the discrimination between different species, the classification of polarity images or simulated grayscale images is favored.
Download

Paper Nr: 223
Title:

Multichannel Analysis in Weed Detection

Authors:

Hericles F. Ferraz, Jocival D. Dias Junior, André R. Backes, Daniel D. Abdala and Mauricio C. Escarpinati

Abstract: In this paper, a new classification scheme is investigated, aiming to improve the current classification models used in weed detection based on UAV imaging data. The premise is that investigating the relevance of a given color-space channel with respect to its power to classify important features could lead to a better selection of training data and, consequently, culminate in a superior classification result. A hybrid image is constructed using only the channels that overlap the least with respect to their contribution to representing the weed and soil data. It is then fed to a deep neural network in which a process of transfer learning takes place, incorporating the previously trained knowledge with the new data provided by the hybrid images. Three publicly available datasets were used for both training and testing. Preliminary results seem to indicate the feasibility of the proposed methodology.
Download

Paper Nr: 228
Title:

Fast Skeletons of Handwritten Texts in Digital Images

Authors:

Leonid Mestetskiy and Dimitry Koptelov

Abstract: The article considers the problem of constructing a Voronoi Diagram (VD) of a polygonal figure - a polygon with polygonal holes. A plane-sweep algorithm is proposed for constructing the VD of the interior of a polygonal figure with n vertices, which has complexity O(n log n). Two factors provide a reduction in the amount of computation and an increase in robustness compared to known solutions: the direct construction of only the inner part of the VD, and the use of the pairwise incidence property of linear segments formed by the sides of a polygonal figure. The proposed algorithm has been implemented and practically tested on polygonal figures of dimension n ≈ 10^5 in studies on the analysis and recognition of handwriting. Computational experiments illustrate the robustness and efficiency of the proposed method.
Download

Paper Nr: 250
Title:

You Can Dance! Generating Music-Conditioned Dances on Real 3D Scans

Authors:

Elona Dupont, Inder P. Singh, Laura L. Fuentes, Sk A. Ali, Anis Kacem, Enjie Ghorbel and Djamila Aouada

Abstract: The generation of realistic body dances that are coherent with music has recently caught the attention of the Computer Vision community, due to its various real-world applications. In this work, we are the first to present a fully automated framework ‘You Can Dance’ that generates a personalized music-conditioned 3D dance sequence, given a piece of music and a real 3D human scan. ‘You Can Dance’ is composed of two modules: (1) The first module fits a parametric body model to an input 3D scan; (2) the second generates realistic dance poses that are coherent with music. These dance poses are injected into the body model to generate animated 3D dancing scans. Furthermore, the proposed framework is used to generate a synthetic dataset consisting of music-conditioned dancing 3D body scans. A human-based evaluation study is conducted to assess the quality and realism of the generated 3D dances. This study along with the qualitative results shows that the proposed framework can generate plausible music-conditioned 3D dances.
Download

Paper Nr: 256
Title:

Inverse Rendering Based on Compressed Spatiotemporal Information by Neural Networks

Authors:

Eito Itonaga, Fumihiko Sakaue and Jun Sato

Abstract: This paper proposes a method for the simultaneous estimation of the time-varying light source distribution and the shape of a target object from time-series images. The method exploits the representational capability of neural networks, which can represent arbitrarily complex functions, to efficiently represent the light source distribution, object shape, and reflection characteristics. Using this method, we show how to stably estimate the time variation of the light source distribution and the object shape simultaneously.
Download

Paper Nr: 259
Title:

Combined Unsupervised and Supervised Learning for Improving Chest X-Ray Classification

Authors:

Anca Ignat and Robert-Adrian Găină

Abstract: This paper studies the problem of pneumonia classification of chest X-ray images. We first apply clustering algorithms to eliminate contradictory images from each of the two classes (normal and pneumonia) of the dataset. We then train different classifiers on the reduced dataset and test for improvement in performance evaluators. For feature extraction and also for classification we use ten well-known Convolutional Neural Networks (Resnet18, Resnet50, VGG16, VGG19, Densenet, Mobilenet, Inception, Xception, InceptionResnet and Shufflenet). For clustering, we employed 2-means, agglomerative clustering and spectral clustering. Besides the above-mentioned CNN, linear SVMs (Support Vector Machines) and Random Forest (RF) were employed for classification. The tests were performed on Kermany dataset. Our experiments show that this approach leads to improvement in classification results.
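
A simplified sketch of the combined unsupervised/supervised pipeline, assuming CNN feature vectors have already been extracted for each image; the criterion used here for discarding "contradictory" images (dropping the minority 2-means cluster within each class) is an assumption made for illustration, since the abstract does not state the exact filtering rule, and a linear SVM stands in for the various classifiers compared in the paper.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def filter_class(features):
        """Keep only the samples of one class that fall in its larger 2-means cluster."""
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
        keep_cluster = np.bincount(labels).argmax()
        return labels == keep_cluster

    def train_on_reduced(X_normal, X_pneumonia):
        """Drop the presumed-contradictory images, then train a classifier on the rest."""
        keep_n = filter_class(X_normal)
        keep_p = filter_class(X_pneumonia)
        X = np.vstack([X_normal[keep_n], X_pneumonia[keep_p]])
        y = np.concatenate([np.zeros(keep_n.sum()), np.ones(keep_p.sum())])
        return LinearSVC().fit(X, y)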
Download

Paper Nr: 272
Title:

Multimodal Light-Field Camera with External Optical Filters Based on Unsupervised Learning

Authors:

Takumi Shibata, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a method of capturing multimodal images in a single shot by attaching various optical filters to the front of a light-field (LF) camera. However, when a filter is attached to the front of the lens, the result of capturing images from each viewpoint will be a mixture of multiple modalities. Therefore, the proposed method uses a neural network that does not require prior learning to analyze such a modal mixture image to generate an image of all the modalities at all viewpoints. By using external filters as in the proposed method, it is possible to easily switch filters and realize a flexible configuration of the shooting system according to the purpose.
Download

Paper Nr: 285
Title:

FlexPooling with Simple Auxiliary Classifiers in Deep Networks

Authors:

Muhammad Ali, Omar Alsuwaidi and Salman Khan

Abstract: In Computer Vision, the basic pipeline of most convolutional neural networks (CNNs) consists of multiple feature extraction processing layers, wherein the input signal is downsampled into a lower resolution in each subsequent layer. This downsampling process is commonly referred to as pooling, an essential operation in CNNs. It improves the model’s robustness against variances in transformation, reduces the number of trainable parameters, increases the receptive field size, and reduces computation time. Since pooling is a lossy process yet crucial in inferring high-level information from low-level information, we must ensure that each subsequent layer perpetuates the most prominent information from previous activations to aid the network’s discriminability. The standard way to apply this process is to use dense pooling (max or average) or strided convolutional kernels. In this paper, we propose a simple yet effective adaptive pooling method, referred to as FlexPooling, which generalizes the concept of average pooling by learning a weighted average pooling over the activations jointly with the rest of the network. Moreover, attaching the CNN with Simple Auxiliary Classifiers (SAC) further demonstrates the superiority of our method as compared to the standard methods. Finally, we show that our simple approach consistently outperforms baseline networks on multiple popular datasets in image classification, giving us around a 1-3% increase in accuracy.
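
The sketch below illustrates one plausible way to parameterize a learned weighted-average pooling in PyTorch: a softmax over learnable per-position weights within each pooling window, shared across channels. This parameterization (and the omission of the Simple Auxiliary Classifiers) is an assumption for illustration; the paper's exact FlexPooling formulation may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FlexPool2d(nn.Module):
        """Learned weighted-average pooling over k x k windows (stride = k)."""
        def __init__(self, kernel_size=2):
            super().__init__()
            self.k = kernel_size
            # one learnable logit per position inside the pooling window
            self.logits = nn.Parameter(torch.zeros(kernel_size * kernel_size))

        def forward(self, x):                  # x: (B, C, H, W), H and W divisible by k
            b, c, h, w = x.shape
            patches = F.unfold(x, kernel_size=self.k, stride=self.k)   # (B, C*k*k, L)
            patches = patches.view(b, c, self.k * self.k, -1)          # (B, C, k*k, L)
            weights = torch.softmax(self.logits, dim=0).view(1, 1, -1, 1)
            pooled = (patches * weights).sum(dim=2)                    # (B, C, L)
            return pooled.view(b, c, h // self.k, w // self.k)

With zero-initialised logits this starts out identical to standard average pooling and then learns to re-weight window positions jointly with the rest of the network.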
Download

Paper Nr: 290
Title:

Shape-based Features Investigation for Preneoplastic Lesions on Cervical Cancer Diagnosis

Authors:

Daniela C. Terra, Adriano C. Lisboa, Mariana T. Rezende, Claudia M. Carneiro and Andrea C. Bianchi

Abstract: The diagnosis of cervical lesions is an interpretative process carried out by specialists based on cellular information from the nucleus and cytoplasm. Some authors have used cell nucleus detection and segmentation algorithms to support the computer-assisted diagnosis process. These approaches are based on the assumption that the nucleus contains the most important information for lesion detection. This work investigates the influence of morphological information from the nucleus, the cytoplasm, and both on cervical cell diagnosis. Experiments were performed on 3,233 real cells, extracting from each one 200 attributes related to size, shape, and edge contours. Results showed that morphological attributes could accurately represent lesions in binary and ternary classifications. However, identifying specific cell anomalies, like Bethesda System classes, requires adding new attributes such as texture.
Download

Paper Nr: 9
Title:

Generative Adversarial Network Synthesis for Improved Deep Learning Model Training of Alpine Plants with Fuzzy Structures

Authors:

Christoph Praschl, Roland Kaiser and Gerald A. Zwettler

Abstract: Deep learning approaches are highly influenced by two factors, namely the complexity of the task and the size of the training data set. In terms of both, the extraction of features of low-stature alpine plants represents a challenging domain due to their fuzzy appearance, a great structural variety in plant organs and the high effort associated with acquiring high-quality training data for such plants. For this reason, this study proposes an approach for training deep learning models in the context of alpine vegetation based on a combination of real-world and artificial data synthesised using Generative Adversarial Networks. The evaluation of this approach indicates that synthetic data can be used to increase the size of training data sets. With this at hand, the results and robustness of deep learning models are demonstrated with a U-Net segmentation model. The evaluation is carried out using a cross-validation for three alpine plants, namely Soldanella pusilla, Gnaphalium supinum, and Euphrasia minima. Improved segmentation accuracy was achieved for the latter two species. Dice Scores of 24.16% vs 26.18% were quantified for Gnaphalium with 100 real-world training images. In the case of Euphrasia, Dice Scores improved from 33.56% to 42.96% using only 20 real-world training images.
Download

Paper Nr: 31
Title:

Multimodal Unsupervised Spatio-Temporal Interpolation of Satellite Ocean Altimetry Maps

Authors:

Théo Archambault, Arthur Filoche, Anastase Charantonis and Dominique Béréziat

Abstract: Satellite remote sensing is a key technique for understanding ocean dynamics. Due to measurement difficulties, various ill-posed image inverse problems occur, and among them, gridding satellite ocean altimetry maps is a challenging interpolation of sparse along-track data. In this work, we show that it is possible to take advantage of better-resolved physical data to enhance Sea Surface Height (SSH) gridding using only partial data acquired via satellites. For instance, the Sea Surface Temperature (SST) is easier to measure via satellite and has an underlying physical link with altimetry. We train a deep neural network to estimate a time series of SSH using a time series of SST in an unsupervised way. We compare with state-of-the-art methods and report a 13% RMSE decrease compared to the operational altimetry algorithm.
Download

Paper Nr: 37
Title:

Semantic Segmentation on Neuromorphic Vision Sensor Event-Streams Using PointNet++ and UNet Based Processing Approaches

Authors:

Tobias Bolten, Regina Pohle-Fröhlich and Klaus D. Tönnies

Abstract: Neuromorphic Vision Sensors, which are also called Dynamic Vision Sensors, are bio-inspired optical sensors which have a completely different output paradigm compared to classic frame-based sensors. Each pixel of these sensors operates independently and asynchronously, detecting only local changes in brightness. The output of such a sensor is a spatially sparse stream of events, which has a high temporal resolution. However, the novel output paradigm raises challenges for processing in computer vision applications, as standard methods are not directly applicable on the sensor output without conversion. Therefore, we consider different event representations by converting the sensor output into classical 2D frames, highly multichannel frames, 3D voxel grids as well as a native 3D space-time event cloud representation. Using PointNet++ and UNet, these representations and processing approaches are systematically evaluated to generate a semantic segmentation of the sensor output stream. This involves experiments on two different publicly available datasets within different application contexts (urban monitoring and autonomous driving). In summary, PointNet++ based processing has been found advantageous over a UNet approach on lower resolution recordings with a comparatively lower event count. On the other hand, for recordings with ego-motion of the sensor and a resulting higher event count, UNet-based processing is advantageous.
Download

Paper Nr: 70
Title:

A Basic Tool for Improving Bad Illuminated Archaeological Pictures

Authors:

Michela Lecca

Abstract: Gathering visual documentation of archaeological sites and monuments helps monitor their status and preserve and transmit the memory of the cultural heritage. Good lighting is essential to provide pictures with clear visibility of details and content, but it is a challenging task. Indeed, illuminating a site may require complex infrastructure, while uncontrolled lights may damage the artifacts. In this framework, computer vision techniques may greatly help archaeology by relighting and/or improving the images of archaeological objects that cannot be acquired under good light. This work presents MEEK, a basic tool to improve low-light, back-light and spot-light images, increasing the visibility of their details and content while mitigating undesired effects due to illumination. MEEK embeds three algorithms: the Retinex-inspired image enhancer SuPeR, the backlight and spotlight image relighting method REK, and the popular contrast enhancer CLAHE. One or more of these algorithms can be applied to the input image, depending on the light conditions of the acquired environment as well as on the final task for which the image is used. Here, MEEK is tested on many archaeological color pictures acquired under bad light, showing good performance. The code of MEEK is freely available at https://github.com/MichelaLecca/MEEK.
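
Of the three algorithms bundled in MEEK, SuPeR and REK are the authors' own methods (available in the linked repository); only the CLAHE stage has a widely available off-the-shelf implementation, sketched below with OpenCV and applied to the lightness channel so that colours are preserved. The parameter values are illustrative, not necessarily the tool's defaults.

    import cv2

    def clahe_on_lightness(bgr_image, clip_limit=2.0, tile_grid_size=(8, 8)):
        """Contrast-limited adaptive histogram equalization on the L channel of LAB."""
        lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
        l, a, b = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
        l_eq = clahe.apply(l)
        return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)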
Download
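
MEEK embeds CLAHE as one of its three algorithms. The snippet below is only a generic illustration of applying OpenCV's CLAHE to the luminance channel of a colour picture; it is not taken from the MEEK repository, and the clip limit and tile size are arbitrary defaults.

```python
import cv2

def clahe_luminance(bgr_image, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Apply CLAHE to the L channel of a colour image (generic illustration)."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```

Operating on the L channel rather than on each BGR channel keeps colours stable while stretching local contrast.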

Paper Nr: 102
Title:

Model Fitting on Noisy Images from an Acoustofluidic Micro-Cavity for Particle Density Measurement

Authors:

Lucas M. Massa, Tiago F. Vieira, Allan M. Martins, Ícaro B. Q. de Araújo, Glauber T. Silva and Harrisson A. Santos

Abstract: We use a 3D-printed device to measure the density of a micro-particle with acoustofluidics, which consists of using sound waves to trap particles in free space. Initially, the particle is trapped in the microscope’s focal plane (no blur). Then the transducers are shut off and the particle falls inside the fluid, and its apparent diameter increases due to the defocus caused by its distance to the lens. This increase in diameter over time provides its velocity, which can, in turn, be used to compute its density. Manually annotating the diameter in the recorded images is a tedious and error-prone task, owing to the high noise present in the images, especially in the last frames where the defocus is high. Because of that, we use a 2D Gaussian model fitting process to estimate the particle diameter throughout different depth frames. To find the diameters, we initially fit the Gaussian parameters with a Genetic Algorithm in each frame of the recorded particle trajectory to avoid local minima. Then we refine the fit with Gradient Descent using TensorFlow in order to compensate for any randomness present in the Genetic Algorithm fit. We validate the method by retrieving a known particle’s density with acceptable performance.
Download
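
The abstract above describes refining a 2D Gaussian fit with gradient descent in TensorFlow after a Genetic Algorithm initialisation. A minimal sketch of such a refinement step, assuming an isotropic Gaussian parameterised by amplitude, centre and width (the paper's exact parameterisation and optimiser settings may differ):

```python
import tensorflow as tf

def refine_gaussian_fit(image, init_params, steps=200, lr=0.05):
    """Refine [amplitude, cx, cy, sigma] of an isotropic 2D Gaussian by
    minimising the squared error against `image` (H, W). Illustrative only."""
    h, w = image.shape
    yy, xx = tf.meshgrid(tf.range(h, dtype=tf.float32),
                         tf.range(w, dtype=tf.float32), indexing="ij")
    params = tf.Variable(init_params, dtype=tf.float32)  # e.g. from the GA stage
    target = tf.constant(image, dtype=tf.float32)
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            amp, cx, cy, sigma = tf.unstack(params)
            model = amp * tf.exp(-((xx - cx) ** 2 + (yy - cy) ** 2)
                                 / (2.0 * sigma ** 2))
            loss = tf.reduce_mean((model - target) ** 2)
        grads = tape.gradient(loss, [params])
        opt.apply_gradients(zip(grads, [params]))
    return params.numpy()
```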

Paper Nr: 124
Title:

Fruit Defect Detection Using CNN Models with Real and Virtual Data

Authors:

Renzo Pacheco, Paula González, Luis E. Chuquimarca, Boris X. Vintimilla and Sergio A. Velastin

Abstract: The present study evaluates different CNN models in order to compare their performance in recognizing a range of defects in apples and mangoes, with the aim of ensuring the quality of the production of these foods. The CNN models InceptionV3, MobileNetV2, VGG16 and DenseNet121 were trained with a dataset of real and synthetic images of apples and mangoes covering fruit in acceptable quality condition and with defects: rot, bruises, scabs and black spots. Training was performed with variations of the hyper-parameters, and accuracy was used as the evaluation metric. The MobileNetV2 model achieved the highest accuracy in training and testing, obtaining 97.50% for apples and 92.50% for mangoes, making it the most suitable model for defect detection in these fruits. The InceptionV3 and DenseNet121 models presented accuracy values above 90%, while the VGG16 model showed the poorest performance, not exceeding 80% accuracy for either fruit. The trained models, especially MobileNetV2, are capable of recognizing a range of defects in the fruits under study with a high degree of accuracy and are suitable for use in the development of automation applications for the quality assessment process of apples and mangoes.
Download
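
As a hedged illustration of the kind of transfer-learning setup evaluated in the study above (not the authors' exact configuration, hyper-parameters or class list), a MobileNetV2 classifier for a handful of fruit-defect classes could be assembled in Keras as follows:

```python
import tensorflow as tf

NUM_CLASSES = 5  # e.g. healthy, rot, bruise, scab, black spot (assumed labels)

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # start by training only the new classification head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20)  # datasets are placeholders
```

A later fine-tuning stage that unfreezes the upper layers of the backbone would typically follow.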

Paper Nr: 128
Title:

Football360: Introducing a New Dataset for Camera Calibration in Sports Domain

Authors:

Igor Jánoš and Vanda Benešová

Abstract: In many computer vision domains, the input images must conform with the pinhole camera model, where straight lines in the real world are projected as straight lines in the image. Many existing camera calibration or distortion compensation methods have been developed using either ImageNet or other generic computer vision datasets, but they are difficult to compare and evaluate when applied to a specific sports domain. We present a new dataset, explicitly designed for the task of radial distortion correction, consisting of high-resolution panoramas of football arenas. From these panoramas, we produce a large number of cropped images distorted using known radial distortion parameters. We also present extensible open-source software to reproducibly export sets of training images conforming to the chosen radial distortion model. We evaluate a chosen radial distortion correction method on the proposed dataset. All data and software can be found at https://vgg.fiit.stuba.sk/football360.
Download
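
The dataset above is built by distorting panorama crops with known radial distortion parameters. Purely as an illustration of what such a forward model looks like, the sketch below applies a simple polynomial radial distortion to normalised image coordinates; the distortion model actually chosen for Football360 may differ.

```python
import numpy as np

def apply_radial_distortion(points, k1, k2=0.0):
    """Distort normalised image coordinates (N, 2) with a polynomial model:
    x_d = x * (1 + k1*r^2 + k2*r^4). Illustrative; the dataset's model may differ."""
    r2 = np.sum(points ** 2, axis=1, keepdims=True)
    factor = 1.0 + k1 * r2 + k2 * r2 ** 2
    return points * factor
```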

Paper Nr: 140
Title:

Finger-UNet: A U-Net Based Multi-Task Architecture for Deep Fingerprint Enhancement

Authors:

Ekta Gavas and Anoop Namboodiri

Abstract: For decades, fingerprint recognition has been prevalent in security, forensics, and other biometric applications. However, the availability of good-quality fingerprints is challenging, making recognition difficult. Fingerprint images might be degraded with a poor ridge structure and noisy or low-contrast backgrounds. Hence, fingerprint enhancement plays a vital role in the early stages of the fingerprint recognition/verification pipeline. In this paper, we investigate and improve upon the encoder-decoder style architecture and suggest intuitive modifications to U-Net to enhance low-quality fingerprints effectively. We investigate the use of the Discrete Wavelet Transform (DWT) for fingerprint enhancement and use a wavelet attention module instead of max pooling, which proves advantageous for our task. Moreover, we replace regular convolutions with depthwise separable convolutions, which significantly reduces the memory footprint of the model without degrading the performance. We also demonstrate that incorporating domain knowledge through a fingerprint minutiae prediction task can improve fingerprint reconstruction through multi-task learning. Furthermore, we integrate an orientation estimation task to propagate the knowledge of ridge orientations and enhance the performance further. We present experimental results and evaluate our model on the FVC 2002 and NIST SD302 databases to show the effectiveness of our approach compared to previous works.
Download
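
The abstract above replaces regular convolutions with depthwise separable convolutions to shrink the memory footprint. A minimal PyTorch sketch of such a block (channel counts and normalisation choices are placeholders, not the Finger-UNet design):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```

Splitting the spatial and channel mixing in this way cuts parameters and multiply-adds roughly by a factor of the kernel area, which is where the memory savings come from.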

Paper Nr: 143
Title:

Deep Neural Network Based Attention Model for Structural Component Recognition

Authors:

Sangeeth D. Sarangi and Bappaditya Mandal

Abstract: The recognition of structural components from images/videos is a highly complex task because very large components appear alongside relatively small ones, and the latter are frequently overestimated or overlooked by existing methodologies. For the purpose of automating bridge visual inspection efficiently, this research examines and aids vision-based automated bridge component recognition. In this work, we propose a novel deep neural network-based attention model (DNNAM) architecture, which comprises synchronous dual attention modules (SDAM) and residual modules to recognise structural components. These modules help us to extract local discriminative features from structural component images and classify different categories of bridge components. These innovative modules are constructed at the contextual level of information encoding across spatial and channel dimensions. Experimental results and ablation studies on benchmark bridge component and semantically augmented datasets show that our proposed architecture outperforms current state-of-the-art methodologies for structural component recognition.
Download

Paper Nr: 156
Title:

Contactless Optical Detection of Nocturnal Respiratory Events

Authors:

Belmin Alić, Tim Zauber, Chen Zhang, Wang Liao, Alina Wildenauer, Noah Leosz, Torsten Eggert, Sarah Dietz-Terjung, Sivagurunathan Sutharsan, Gerhard Weinreich, Christoph Schöbel, Gunther Notni, Christian Wiede and Karsten Seidl

Abstract: Obstructive sleep apnea (OSA) is a common sleep-related breathing disorder characterized by the collapse of the upper airway and associated with various diseases. For clinical diagnosis, a patient’s sleep is recorded during the night via polysomnography (PSG) and evaluated the next day regarding nocturnal respiratory events. The most prevalent events include obstructive apneas and hypopneas. In this paper, we introduce a fully automatic contactless optical method for the detection of nocturnal respiratory events. The goal of this study is to demonstrate how nocturnal respiratory events, such as apneas and hypopneas, can be autonomously detected through the analysis of multi-spectral image data. This represents the first step towards a fully automatic and contactless diagnosis of OSA. We conducted a trial patient study in a sleep laboratory and evaluated our results in comparison with PSG, the gold standard in sleep diagnostics. In a study sample with three patients, 24 hours of recorded video material and 245 respiratory events, we achieved a classification accuracy of 82% with a random forest classifier.
Download

Paper Nr: 176
Title:

Re-Learning ShiftIR for Super-Resolution of Carbon Nanotube Images

Authors:

Yoshiki Kakamu, Takahiro Maruyama and Kazuhiro Hotta

Abstract: In this study, we perform super-resolution of carbon nanotube images using deep learning. In order to achieve super-resolution with higher accuracy than the conventional SwinIR, we introduce an encoder-decoder structure to allow larger input images and a shift mechanism for local feature extraction. In addition, we propose a re-training-based super-resolution method that achieves high accuracy even with a small number of images. Experiments were conducted on DIV2K, General100, Set5, and a carbon nanotube image dataset for evaluation. The experimental results confirm that the proposed method provides higher accuracy than the conventional SwinIR and show that it can super-resolve carbon nanotube images. The main contribution is a network model with better performance for super-resolution of carbon nanotube images even when no crisp supervised images are available; the proposed method is well suited to such images. The effectiveness of our method was demonstrated by experimental results on a general super-resolution dataset and a carbon nanotube image dataset.
Download

Paper Nr: 224
Title:

ResNet Classifier Using Shearlet-Based Features for Detecting Change in Satellite Images

Authors:

Emna Brahim, Sonia Bouzidi and Walid Barhoumi

Abstract: In this paper, we present an effective method to extract the change between two optical remote-sensing images. The proposed method is mainly composed of the following steps. First, the two input Normalized Difference Vegetation Index (NDVI) images are smoothed using the Shearlet transform. Then, we use the ResNet152 architecture to extract the final change detection image. We validated the performance of the proposed method on three challenging datasets covering areas of Brazil, Virginia, and California. The experiments, performed on 38416 patches, show that the suggested method outperforms many relevant state-of-the-art works with an accuracy of 99.50%.
Download
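
The method above takes NDVI images as input. For reference, NDVI is computed from the red and near-infrared bands with the standard definition below (a general formula, not specific to this paper):

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)
```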

Paper Nr: 232
Title:

Image Quality Assessment in the Context of the Brazilian Electoral System

Authors:

Marcondes D. Silva Júnior, Jairton S. Falcão Filho, Zilde M. Neto, Julia D. Tavares de Souza, Vinícius L. Ventura and João M. Teixeira

Abstract: The Brazilian electoral system uses an electronic voting machine to increase voting reliability. This voting machine goes through a series of security procedures, and the one that uses the most human resources is the integrity test. The proposed solution to optimize these resources is to use a robotic arm and computer vision methods to replace the eight persons currently needed to carry out the test. However, there is a problem with the LCD screen of the poll worker's terminal: it has no backlight, which may cause visual pollution in the images captured by the camera, depending on the ambient lighting and camera position. Accordingly, this paper proposes two methods to make it easier to choose the best images to be used in the information extraction process: OCR and blur analysis. We analyzed 27 images with three ambient lighting configurations, then compared our results with three no-reference image quality evaluators and research on human perception of image quality. The OCR analysis matched human perception and the other evaluators very closely.
Download
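
One of the two cues proposed above is blur analysis. A widely used no-reference sharpness proxy, given here only as a plausible example of such an analysis (the paper's exact measure may differ), is the variance of the Laplacian:

```python
import cv2

def variance_of_laplacian(image_bgr):
    """Higher values indicate sharper images; low values suggest blur."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()
```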

Paper Nr: 235
Title:

Evaluation of U-Net Backbones for Cloud Segmentation in Satellite Images

Authors:

Laura M. Arakaki, Leandro P. Silva, Matheus V. Silva, Bruno M. Melo, André R. Backes, Mauricio C. Escarpinati and João F. Mari

Abstract: Remote sensing images are an important resource for obtaining information for different types of applications. The occlusion of regions of interest by clouds is a common problem in this type of image. Thus, the objective of this work is to evaluate methods based on convolutional neural networks (CNNs) for cloud segmentation in satellite images. We compared three segmentation models, all of them based on the U-Net architecture with different backbones. The first considered backbone is simpler and consists of three contraction blocks followed by three expansion blocks. The second model has a backbone based on the VGG-16 CNN and the third one on the ResNet-18. The methods were tested using the Cloud-38 dataset, composed of 8400 satellite images in the training set and 9201 in the test set. The model considering the simplest backbone was trained from scratch, while the models with backbones based on VGG-16 and ResNet-18 were trained using fine-tuning on pre-trained models with ImageNet. The results demonstrate that the tested models can segment the clouds in the images satisfactorily, reaching up to 97% accuracy on the validation set and 95% on the test set.
Download

Paper Nr: 283
Title:

An Unsupervised IR Approach Based Density Clustering Algorithm

Authors:

Achref Ouni

Abstract: Finding the images most similar to an input query in a database is an important task in computer vision. Many approaches based on visual content have been proposed and have proven effective in retrieving the most relevant images. The bag of visual words (BoVW) model is one of the most widely used algorithms for image classification and recognition. Despite the discriminative power of BoVW, retrieving the relevant images from a dataset is still a challenge. In this paper, we propose an efficient method inspired by the BoVW algorithm. Our key idea is to convert the standard BoVW model into a BoVP (Bag of Visual Phrases) model based on a density-spatial clustering algorithm. We show experimentally that the proposed model is able to perform better than classical methods. We examine the performance of the proposed method on four different datasets.
Download

Area 2 - Mobile and Egocentric Vision for Humans and Robots

Full Papers
Paper Nr: 132
Title:

Estimation of Robot Motion Parameters Based on Functional Consistency for Randomly Stacked Parts

Authors:

Takahiro Suzuki and Manabu Hashimoto

Abstract: In this paper, we propose a method for estimating the robot motion parameters necessary for robots to automatically assemble objects. Generally, parts used in assembly are often randomly stacked, and the proposed method estimates the robot motion parameters from this state. Each region of a part has a role referred to as a “function”, such as “to be grasped” or “to be assembled with other parts”. Related works have defined functions for everyday objects, but in this paper we define a novel functional label for industrial parts. In addition, we propose a novel idea, the functional consistency of parts, which refers to the constraints that functional labels impose. Functional consistency helps the method adapt to various bin scenes because it is invariant to the state in which the parts are placed. It is used in the proposed method as a cue, and the robot motion parameters are estimated on the basis of the relationship between parameters and functions. In an experiment using connecting rods, the average success rate was 81.5%. The effectiveness of the proposed method was confirmed through ablation studies and comparison with related work.
Download

Paper Nr: 182
Title:

SynMotor: A Benchmark Suite for Object Attribute Regression and Multi-Task Learning

Authors:

Chengzhi Wu, Linxi Qiu, Kanran Zhou, Julius Pfrommer and Jürgen Beyerer

Abstract: In this paper, we develop a novel benchmark suite including both a 2D synthetic image dataset and a 3D synthetic point cloud dataset. Our work is a sub-task in the framework of a remanufacturing project, in which small electric motors are used as fundamental objects. Apart from the given detection, classification, and segmentation annotations, the key objects also have multiple learnable attributes with ground truth provided. This benchmark can be used for computer vision tasks including 2D/3D detection, classification, segmentation, and multi-attribute learning. It is worth mentioning that most attributes of the motors are quantified as continuously variable rather than binary, which makes our benchmark well-suited for the less explored regression tasks. In addition, appropriate evaluation metrics are adopted or developed for each task and promising baseline results are provided. We hope this benchmark can stimulate more research efforts on the sub-domain of object attribute learning and multi-task learning in the future.
Download

Paper Nr: 243
Title:

Face-Based Gaze Estimation Using Residual Attention Pooling Network

Authors:

Chaitanya Bandi and Ulrike Thomas

Abstract: Gaze estimation reveals a person’s intent and willingness to interact, which is an important cue in human-robot interaction applications to gain a robot’s attention. With tremendous developments in deep learning architectures and easily accessible cameras, human eye gaze estimation has received a lot of attention. Compared to traditional model-based gaze estimation methods, appearance-based methods have shown a substantial improvement in accuracy. In this work, we present an appearance-based gaze estimation architecture that adopts convolutions, residuals, and attention blocks to increase gaze accuracy further. Face and eye images are generally adopted separately or in combination for the estimation of eye gaze. In this work, we rely entirely on facial features, since the gaze can be tracked under extreme head pose variations. With the proposed architecture, we attain better than state-of-the-art accuracy on the MPIIFaceGaze dataset and the ETH-XGaze open-source benchmark.
Download

Short Papers
Paper Nr: 34
Title:

Classification and Embedding of Semantic Scene Graphs for Active Cross-Domain Self-Localization

Authors:

Yoshida Mitsuki, Yamamoto Ryogo, Wakayama Kazuki, Hiroki Tomoe and Tanaka Kanji

Abstract: In visual robot self-localization, the semantic scene graph (S2G) has attracted recent research attention as a valuable scene model that is robust against both viewpoint and appearance changes. However, the use of S2G in the context of active self-localization has not been sufficiently explored yet. In general, an active self-localization system consists of two essential modules. One is the visual place recognition (VPR) model, which aims to classify an input scene into a specific place class. The other is the next-best-view (NBV) planner, which aims to map the current state to the NBV action. We propose an efficient trainable framework for active self-localization in which a graph neural network (GNN) is effectively shared by these two modules. Specifically, the GNN is first trained as an S2G classifier for VPR in a self-supervised learning manner. Second, the trained GNN is reused as a means of dissimilarity-based embedding to map an S2G to a fixed-length state vector. To summarize, our approach uses the GNN in two ways: (1) passive single-view self-localization, and (2) knowledge transfer from passive to active self-localization. Experiments using the public NCLT dataset show that the proposed framework outperforms other baseline self-localization methods.
Download

Paper Nr: 47
Title:

Deep Distance Metric Learning for Similarity Preserving Embedding of Point Clouds

Authors:

Ahmed Abouelazm, Igor Vozniak, Nils Lipp, Pavel Astreika and Christian Mueller

Abstract: Point cloud processing and 3D model retrieval methods have received a lot of interest as a result of recent advances in deep learning, computing hardware, and the wide range of available 3D sensors. Many state-of-the-art approaches use distance metric learning to solve the 3D model retrieval problem. However, the majority of these approaches disregard the variation in shape and properties of instances belonging to the same class, known as intra-class variance, and focus on semantic labels as a measure of relevance. In this work, we present two novel loss functions for similarity-preserving point cloud embedding, in which the distance between point clouds in the embedding space is directly proportional to the ground-truth distance between them under a similarity or distance measure. The building block of both loss functions is the forward pass of n-pair input point clouds through a Siamese network. We use the ModelNet10 dataset for numerical evaluations under classification and mean average precision metrics. The reported results demonstrate a significant quantitative and qualitative improvement in the retrieved models.
Download
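
The central idea above is that embedding-space distances should be proportional to a ground-truth distance between point clouds, computed through a Siamese network. The PyTorch sketch below shows one plausible form of such a similarity-preserving objective; the encoder, the smooth L1 penalty and the pairwise (rather than n-pair) formulation are assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def similarity_preserving_loss(encoder, cloud_a, cloud_b, gt_distance):
    """Penalise the gap between embedding-space distance and a ground-truth
    shape distance for a pair of point clouds (illustrative formulation)."""
    emb_a = encoder(cloud_a)   # shared weights: same encoder for both inputs
    emb_b = encoder(cloud_b)
    pred_distance = torch.norm(emb_a - emb_b, dim=-1)
    return F.smooth_l1_loss(pred_distance, gt_distance)
```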

Paper Nr: 48
Title:

Point Cloud Neighborhood Estimation Method Using Deep Neuro-Evolution

Authors:

Ahmed Abouelazm, Igor Vozniak, Nils Lipp, Pavel Astreika and Christian Mueller

Abstract: Due to recent advances in computing hardware, deep learning, and 3D sensors, point clouds have become an essential 3D data structure, and their processing and analysis have received considerable attention. Given the unstructured and irregular nature of point clouds, encoding local geometries is a significant barrier in point cloud analysis. This challenge is known as neighborhood estimation, and it is commonly addressed by fitting a plane to the points within a local neighborhood defined by estimated parameters. The estimated neighborhood parameters for each point should adapt to the point cloud’s irregularities and to the sizes and shapes of different local geometries. Different objective functions have been derived in the literature for optimal parameter selection, but no efficient approach for optimizing these objective functions exists so far. In this work, we propose a novel neighborhood estimation pipeline for such optimization that is invariant to the objective function and neighborhood type, utilizing a modified version of a deep Neuro-Evolution algorithm and Farthest Point Sampling as an intelligent sampling approach. Results demonstrate that the proposed pipeline can optimize state-of-the-art objective functions and enhance the estimation of neighborhood properties such as the normal vector.
Download
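
The pipeline above uses Farthest Point Sampling (FPS) as its sampling step. FPS itself is a standard algorithm, sketched below in NumPy independently of the paper's Neuro-Evolution component:

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already selected set.

    `points` is an (N, 3) array; returns the indices of the sampled points."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=int)
    min_sq_dist = np.full(n, np.inf)   # squared distance to the selected set
    selected[0] = np.random.randint(n)
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        min_sq_dist = np.minimum(min_sq_dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = np.argmax(min_sq_dist)
    return selected
```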

Paper Nr: 90
Title:

3D Ego-Pose Lift-Up Robustness Study for Fisheye Camera Perturbations

Authors:

Teppei Miura, Shinji Sako and Tsutomu Kimura

Abstract: 3D egocentric human pose estimation from a mounted fisheye camera has been developed following advances in convolutional neural networks and synthetic data generation. The camera captures images that are affected by the optical properties, the mounting position, and the camera perturbations caused by body motion. Therefore, data collection and model training are the main challenges in estimating the 3D ego-pose from a mounted fisheye camera. Past works proposed synthetic data generation and a two-step estimation model consisting of 2D human pose estimation and a subsequent 3D lift-up to overcome these tasks. However, these works insufficiently verify robustness to camera perturbations. In this paper, we evaluate existing models for robustness using a synthetic dataset with camera perturbations that increase in several steps. Our study provides useful knowledge for deploying 3D ego-pose estimation from a mounted fisheye camera in practice.
Download

Paper Nr: 98
Title:

UMVpose++: Unsupervised Multi-View Multi-Person 3D Pose Estimation Using Ground Point Matching

Authors:

Diógenes F. Silva, João M. Lima, Diego F. Thomas, Hideaki Uchiyama and Veronica Teichrieb

Abstract: We present UMVpose++ to address the problem of 3D pose estimation of multiple persons in a multi-view scenario. Different from the most recent state-of-the-art methods, which are based on supervised techniques, our work does not need labeled data to perform 3D pose estimation; indeed, generating 3D annotations is costly and has a high probability of containing errors. Our approach uses a plane sweep method to generate the 3D pose estimates. We define one view as the target and the remainder as reference views. We estimate the depth of each 2D skeleton in the target view to obtain our 3D poses. Instead of comparing them with ground-truth poses, we project the estimated 3D poses onto the reference views and compare the 2D projections with the 2D poses obtained using an off-the-shelf method. 2D poses of the same pedestrian obtained from the target and reference views must be matched to allow comparison. By performing a matching process based on ground points, we identify the corresponding 2D poses and compare them with our respective projections. Furthermore, we propose a new reprojection loss based on the smooth L1 norm. We evaluated our proposed method on the publicly available Campus dataset. As a result, we obtained better accuracy than state-of-the-art unsupervised methods, scoring 0.5 percentage points above the best geometric method. Furthermore, we outperform some state-of-the-art supervised methods, and our results are comparable with the best supervised method, falling only 0.2 percentage points below it.
Download
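
The abstract above introduces a reprojection loss based on the smooth L1 norm, comparing projections of the estimated 3D poses against off-the-shelf 2D poses in the reference views. A hedged sketch of what such a loss can look like; the projection callback, tensor shapes and unweighted reduction are assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def reprojection_loss(poses_3d, poses_2d_ref, project_fn):
    """Smooth-L1 discrepancy between projected 3D joints and reference 2D joints.

    poses_3d:     (num_people, num_joints, 3) estimated 3D poses
    poses_2d_ref: (num_people, num_joints, 2) matched 2D poses in a reference view
    project_fn:   camera projection for that view (assumed to be given)"""
    projected = project_fn(poses_3d)          # -> (num_people, num_joints, 2)
    return F.smooth_l1_loss(projected, poses_2d_ref)
```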

Paper Nr: 150
Title:

Near-infrared Lipreading System for Driver-Car Interaction

Authors:

Samar Daou, Ahmed Rekik, Achraf Ben-Hamadou and Abdelaziz Kallel

Abstract: In this paper, we propose a new lipreading approach for driver-car interaction in a cockpit monitoring environment, and we introduce and release the first lipreading dataset dedicated to intuitive driver-car interaction using near-infrared driver monitoring cameras. We propose a two-stream deep learning architecture that combines geometric and global visual features extracted from the mouth region to improve the performance of lipreading based only on visual cues. Geometric features are extracted by a graph convolutional network applied to a series of 2D facial landmarks, while a 2D-3D convolutional network is used to extract the global visual features from the near-infrared frame sequence. These features are then decoded by a multi-scale temporal convolutional network to generate the output word sequence classification. Our proposed model achieved high accuracy for both training scenarios, overlapped speakers and unseen speakers, with 98.5% and 92.2% respectively.
Download

Paper Nr: 159
Title:

Surface-Biased Multi-Level Context 3D Object Detection

Authors:

Sultan A. Ghazal, Jean Lahoud and Rao Anwer

Abstract: Object detection in 3D point clouds is a crucial task in a range of computer vision applications including robotics, autonomous cars, and augmented reality. This work addresses the object detection task in 3D point clouds using a highly efficient, surface-biased feature extraction method (Wang et al., 2022) that also captures contextual cues on multiple levels. We propose a 3D object detector that extracts accurate feature representations of object candidates and leverages self-attention on point patches, object candidates, and the global 3D scene. Self-attention has been shown to be effective in encoding correlation information in 3D point clouds (Qian et al., 2020), whereas other 3D detectors focus on enhancing point cloud feature extraction by selectively obtaining more meaningful local features (Wang et al., 2022) while overlooking contextual information. To this end, the proposed architecture uses ray-based surface-biased feature extraction and multi-level context encoding to outperform the state-of-the-art 3D object detector. In this work, 3D detection experiments are performed on scenes from the ScanNet dataset, whereby the self-attention modules are introduced one after the other to isolate the effect of self-attention at each level. The code is available at https://github.com/SultanAbuGhazal/SurfaceBaisedMLevelContext.

Paper Nr: 181
Title:

Put Your PPE on: A Tool for Synthetic Data Generation and Related Benchmark in Construction Site Scenarios

Authors:

Camillo Quattrocchi, Daniele Di Mauro, Antonino Furnari, Antonino Lopes, Marco Moltisanti and Giovanni M. Farinella

Abstract: Using machine learning algorithms to enforce safety in construction sites has attracted a lot of interest in recent years. Being able to understand whether a worker is wearing personal protective equipment, has fallen to the ground, or is too close to a moving vehicle or a dangerous tool could be useful to prevent accidents and to take immediate rescue actions. While these problems can be tackled with machine learning algorithms, a large amount of labeled data, which is difficult and expensive to obtain, is required. Motivated by these observations, we propose a pipeline to produce synthetic data of a construction site to mitigate real data scarcity. We present a benchmark to test the usefulness of the generated data, focusing on three different tasks: safety compliance through object detection, fall detection through pose estimation, and distance regression from a monocular view. Experiments show that the use of synthetic data helps to reduce the amount of real data needed and allows good performance to be achieved.
Download

Paper Nr: 190
Title:

A Wearable Device Application for Human-Object Interactions Detection

Authors:

Michele Mazzamuto, Francesco Ragusa, Alessandro Resta, Giovanni M. Farinella and Antonino Furnari

Abstract: Over the past ten years, wearable technologies have continued to evolve. In the development of wearable technology, smart glasses for augmented and mixed reality are becoming particularly prominent. We believe that it is crucial to incorporate artificial intelligence algorithms that can understand real-world human behavior into these devices if we want them to be able to properly mix the real and virtual worlds and give assistance to the users. In this paper, we present an application for smart glasses that provides assistance to workers in an industrial site recognizing human-object interactions. We propose a system that utilizes a 2D object detector to locate and identify the objects in the scene and classic mixed reality features like plane detector, virtual object anchoring, and hand pose estimation to predict the interaction between a person and the objects placed on a working area in order to avoid the 3D object annotation and detection problem. We have also performed a user study with 25 volunteers who have been asked to complete a questionnaire after using the application to assess the usability and functionality of the developed application.
Download

Paper Nr: 213
Title:

Sentiment-Based Engagement Strategies for Intuitive Human-Robot Interaction

Authors:

Thorsten Hempel, Laslo Dinges and Ayoub Al-Hamadi

Abstract: Emotion expressions serve as important communicative signals and are crucial cues in intuitive interactions between humans. Hence, it is essential to include these fundamentals in robotic behavior strategies when interacting with humans to promote mutual understanding and to reduce misjudgements. We tackle this challenge by detecting and using the emotional state and attention for a sentiment analysis of potential human interaction partners to select well-adjusted engagement strategies. This way, we pave the way for more intuitive human-robot interactions, as the robot’s action conforms to the person’s mood and expectation. We propose four different engagement strategies with implicit and explicit communication techniques that we implement on a mobile robot platform for initial experiments.
Download

Paper Nr: 279
Title:

When Continual Learning Meets Robotic Grasp Detection: A Novel Benchmark on the Jacquard Dataset

Authors:

Rui Yang, Matthieu Grard, Emmanuel Dellandréa and Liming Chen

Abstract: Robotic grasp detection aims to predict a grasp configuration, e.g., grasp location and gripper opening size, to enable a suitable end-effector to stably grasp a given object in the scene, whereas continual learning (CL) refers to the ability of an artificial learning system to learn continuously about the external changing world. Because it corresponds to real-life scenarios where data and tasks continuously arrive, CL has aroused increasing interest in research communities. Numerous studies have focused so far on image classification, but none of them involve robotic grasp detection, although continuously extending robots with novel grasp capabilities when facing novel objects in unknown scenes is a major requirement of real-life applications. In this paper, we propose a first benchmark, namely Jacquard-CL, that uses a small part of the Jacquard dataset with variations in illumination and background to create an NI (new instances)-like scenario. We then adapt and benchmark several state-of-the-art continual learning methods on the grasp detection problem and establish a baseline for continual grasp detection. The experiments show that regularization-based methods struggle to retain previously learned knowledge, while memory-based methods perform better.
Download

Paper Nr: 21
Title:

Robust Path Planning in the Wild for Automatic Look-Ahead Camera Control

Authors:

Sander R. Klomp and Peter H. N. de With

Abstract: Finding potential driving paths on unstructured roads is a challenging problem for autonomous driving and robotics applications. Although the rise of autonomous driving has resulted in massive public datasets, most of these datasets focus on urban environments and feature almost exclusively paved roads. To circumvent the problem of limited public datasets of unpaved roads, we combine seven public vehicle-mounted-camera datasets with a very small private dataset and train a neural network to achieve accurate road segmentation on almost any type of road. This trained network vastly outperforms networks trained on individual datasets when validated on our unpaved road datasets, with only a minor performance reduction on the highly challenging public WildDash dataset, which is mostly urban. Finally, we develop an algorithm to robustly transform these road segmentations to road centerlines, used to automatically control a vehicle-mounted PTZ camera.
Download

Paper Nr: 65
Title:

Fully Convolutional Neural Network for Event Camera Pose Estimation

Authors:

Ahmed Tabia, Fabien Bonardi and Samia Bouchafa-Bruneau

Abstract: Event cameras are bio-inspired vision sensors that record the dynamics of a scene while filtering out unnecessary data. Many classic pose estimation methods have been superseded by camera relocalization approaches based on convolutional neural networks (CNN) and long short-term memory (LSTM) in the investigation of simultaneous localization and mapping systems. However, due to the use of LSTM layers, these methods are prone to overfitting and usually take a long time to converge. In this paper, we introduce a new method to estimate the 6DOF pose of an event camera with deep learning. Our approach starts by processing the events to generate a set of images. It then uses two CNNs to extract relevant features from the generated images. Those features are multiplied using the outer product at each location of the image and pooled across locations. The model ends with a regression layer which outputs the estimated position and orientation of the event camera. Our approach has been evaluated on different datasets. The results show its superiority compared to state-of-the-art methods.
Download
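
The model above multiplies the two CNNs' features with an outer product at each image location and pools across locations, i.e. a form of bilinear pooling. A compact PyTorch sketch of that operation (feature-map shapes are placeholders):

```python
import torch

def bilinear_pool(feat_a, feat_b):
    """Outer-product pooling of two feature maps of shape (B, C, H, W).

    Returns a (B, C_a * C_b) descriptor averaged over spatial locations."""
    b, ca, h, w = feat_a.shape
    cb = feat_b.shape[1]
    fa = feat_a.reshape(b, ca, h * w)
    fb = feat_b.reshape(b, cb, h * w)
    # Sum of per-location outer products, normalised by the number of locations.
    bilinear = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)   # (B, C_a, C_b)
    return bilinear.reshape(b, ca * cb)
```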

Paper Nr: 107
Title:

Data-Efficient Transformer-Based 3D Object Detection

Authors:

Aidana Nurakhmetova, Jean Lahoud and Hisham Cholakkal

Abstract: Recent 3D detection models rely on the Transformer architecture due to its natural ability to abstract global context features. One such model is the 3DETR network, a pure transformer-based model designed to generate 3D boxes on indoor dataset scans. It is generally known that transformers are data-hungry; however, data collection and annotation in 3D are more challenging than in 2D. Thus, our goal is to study the data-hungriness of the 3DETR-m model and propose a solution for its data efficiency. Our methodology is based on the observation that PointNet++ provides more locally aggregated features that can support 3DETR-m predictions in the small-dataset regime. We suggest three methods of backbone fusion based on addition (Fusion I), concatenation (Fusion II), and replacement (Fusion III). We utilize pre-trained weights from the Group-Free model trained on the SUN RGB-D dataset. The proposed 3DETR-m outperforms the original model in all data proportions (10%, 25%, 50%, 75%, and 100%). We improve on the 3DETR-m paper's results by 1.46% and 2.46% in mAP@25 and mAP@50 on the full dataset. Hence, we believe our research efforts can provide new insights into the data-hungriness issue of 3D transformer detectors and inspire the usage of pre-trained models in 3D as one way towards data efficiency.
Download

Paper Nr: 131
Title:

Trajectory Prediction in First-Person Video: Utilizing a Pre-Trained Bird's-Eye View Model

Authors:

Masashi Hatano, Ryo Hachiuma and Hideo Saito

Abstract: In recent years, much attention has been paid to the prediction of pedestrian trajectories, as they are one of the key factors for a better society, for example in automatic driving, guidance for blind people, and social robots interacting with humans. To tackle this task, many methods have been proposed, but few address the first-person perspective because of the lack of a publicly available dataset. Therefore, we propose a method that uses egocentric vision and does not need to be trained with a first-person video dataset. We make it possible to utilize existing methods that predict from a bird’s-eye view. In addition, we propose a novel way to consider semantic information without changing the shape of the input, applicable to all existing bird’s-eye-view methods that use only past trajectories. Therefore, there is no need to create a new dataset from egocentric vision. The experimental results demonstrate that the proposed method makes it possible to predict from an egocentric view via existing bird’s-eye-view methods. The proposed method qualitatively improves trajectory predictions without degrading quantitative accuracy, and demonstrates the effectiveness of predicting the trajectories of multiple people simultaneously.
Download

Paper Nr: 165
Title:

Memory-Efficient Implementation of GMM-MRCoHOG for Human Recognition Hardware

Authors:

Ryogo Takemoto, Yuya Nagamine, Kazuki Yoshihiro, Masatoshi Shibata, Hideo Yamada, Yuichiro Tanaka, Shuichi Enokida and Hakaru Tamukoh

Abstract: High-speed and accurate human recognition is necessary to realize safe autonomous mobile robots. Recently, human recognition methods based on deep learning have been studied extensively. However, these methods consume large amounts of power. Therefore, this study focuses on the Gaussian mixture model of multiresolution co-occurrence histograms of oriented gradients (GMM-MRCoHOG), which is a feature extraction method for human recognition that entails lower computational costs compared to deep learning-based methods, and aims to implement its hardware for high-speed, high-accuracy, and low-power human recognition. A digital hardware implementation method of GMM-MRCoHOG has been proposed. However, the method requires numerous look-up tables (LUTs) to store state spaces of GMM-MRCoHOG, thereby impeding the realization of human recognition systems. This study proposes a LUT reduction method to overcome this drawback by standardizing basis function arrangements of Gaussian mixture distributions in GMM-MRCoHOG. Experimental results show that the proposed method is as accurate as the previous method, and the memory required for state spaces consuming LUTs can be reduced to 1/504th of that required in the previous method.
Download

Paper Nr: 200
Title:

Seeing Risk of Accident from In-Vehicle Cameras

Authors:

Takuya Goto, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a method for visualizing the risk of car accidents in in-vehicle camera images by using deep learning. Our network predicts the future risk of car accidents and generates a risk map image that represents the degree of accident risk at each point in the image. For training our network, we need pairs of in-vehicle images and risk map images, but such datasets do not exist and are very difficult to create. In this research, we derive a method for computing the degree of the future risk of car accidents at each point in the image and use it for constructing the training dataset. By using the dataset, our network learns to generate risk map images from in-vehicle images. The efficiency of our method is tested by using real car accident images.
Download

Paper Nr: 236
Title:

Automatic Robotic Arm Calibration for the Integrity Test of Voting Machines in the Brazillian 2022's Election Context

Authors:

Marcondes D. Silva Júnior, Jonas F. Silva and João M. Teixeira

Abstract: The Brazilian electoral system uses the electronic ballot box to increase the security of the vote and the speed of counting the votes. It is subjected to several security tests, and the one that requires the most human interaction and personnel is the integrity test. Our macro project proposed a solution to optimize the testing process and reduce the number of people involved, using a robotic arm with the aid of computer vision to cut the personnel demand from eight people to two. However, in order to use the robot, technical knowledge was still required, and it could not be used by any user, as it was necessary to manually map the keys to the places where the robotic arm would press to perform the test. We present a solution for automatically mapping a workspace to a robotic arm. Using an RGB-D camera and computer vision techniques with deep learning, we can move the robotic arm with 6 Degrees of Freedom (DoF) through Cartesian actions within a workspace. For this, we use a YOLO network, a mapping of the robot workspace, and a correlation of 3D points from the camera to the robot workspace coordinates. Based on the tests carried out, the results show that we were able to map the points of interest with high precision and trace a path plan for the robot to reach them. The solution was then applied in a real test scenario during the first round of the Brazilian elections of 2022, and the obtained results were compatible with the conventional non-assisted approach.
Download

Paper Nr: 237
Title:

ENIGMA: Egocentric Navigator for Industrial Guidance, Monitoring and Anticipation

Authors:

Francesco Ragusa, Antonino Furnari, Antonino Lopes, Marco Moltisanti, Emanuele Ragusa, Marina Samarotto, Luciano Santo, Nicola Picone, Leo Scarso and Giovanni M. Farinella

Abstract: We present ENIGMA (Egocentric Navigator for Industrial Guidance, Monitoring and Anticipation), an integrated system to support workers in an industrial laboratory. ENIGMA includes a wearable assistant which understands the worker’s behavior through Computer Vision algorithms which 1) localize the operator, 2) recognize the objects present in the laboratory, 3) detect the human-object interactions which happen and 4) anticipate the next-active object with which the worker will interact. Furthermore, a back-end extracts high semantic information about the worker behavior to provide useful services and to improve his safety. Preliminary experiments were conducted showing good performance on the tasks of localization, object detection and recognition and egocentric human-object interaction detection considering the challenging industrial scenario.
Download

Area 3 - Image and Video Understanding

Full Papers
Paper Nr: 1
Title:

Unfolding Local Growth Rate Estimates for (Almost) Perfect Adversarial Detection

Authors:

Peter Lorenz, Margret Keuper and Janis Keuper

Abstract: Convolutional neural networks (CNN) define the state-of-the-art solution on many perceptual tasks. However, current CNN approaches largely remain vulnerable against adversarial perturbations of the input that have been crafted specifically to fool the system while being quasi-imperceptible to the human eye. In recent years, various approaches have been proposed to defend CNNs against such attacks, for example by model hardening or by adding explicit defence mechanisms. In the latter case, a small “detector” is included in the network and trained on the binary classification task of distinguishing genuine data from data containing adversarial perturbations. In this work, we propose a simple and light-weight detector, which leverages recent findings on the relation between networks’ local intrinsic dimensionality (LID) and adversarial attacks. Based on a re-interpretation of the LID measure and several simple adaptations, we surpass the state-of-the-art on adversarial detection by a significant margin and reach almost perfect results in terms of F1-score for several networks and datasets. Sources available at: https://github.com/adverML/multiLID
Download
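
The detector above builds on the local intrinsic dimensionality (LID) of network activations. As background, the maximum-likelihood (Hill-type) LID estimate commonly used in this literature is shown below; the paper's multiLID re-interpretation goes beyond this plain estimator.

```python
import numpy as np

def lid_mle(distances):
    """Maximum-likelihood (Hill-type) LID estimate for a query point, given its
    distances to the k nearest neighbours. Standard background formula only."""
    r = np.sort(np.asarray(distances, dtype=np.float64))
    eps = 1e-12
    # LID = -( (1/k) * sum_i log(r_i / r_k) )^(-1), with r_k the largest distance.
    return -1.0 / np.mean(np.log((r + eps) / (r[-1] + eps)))
```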

Paper Nr: 12
Title:

DaDe: Delay-Adaptive Detector for Streaming Perception

Authors:

Wonwoo Jo, Kyungshin Lee, Jaewon Baik, Sangsun Lee, Dongho Choi and Hyunkyoo Park

Abstract: Recognizing the surrounding environment at low latency is critical in autonomous driving. In a real-time setting, the surrounding environment has already changed by the time processing is over, and current detection models are incapable of dealing with such changes that occur after processing. Streaming perception has been proposed to assess the latency and accuracy of real-time video perception. However, additional problems arise in real-world applications due to limited hardware resources, high temperatures, and other factors. In this study, we develop a model that can reflect processing delays in real time and produce the most reasonable results. By incorporating the proposed feature queue and feature select module, the system gains the ability to forecast specific time steps without any additional computational cost. Our method is tested on the Argoverse-HD dataset and achieves higher performance than the current state-of-the-art methods (as of December 2022) in various environments when processing is delayed. The code is available at https://github.com/danjos95/DADE.
Download

Paper Nr: 13
Title:

A Patch-Based Architecture for Multi-Label Classification from Single Positive Annotations

Authors:

Warren Jouanneau, Aurélie Bugeau, Marc Palyart, Nicolas Papadakis and Laurent Vézard

Abstract: Supervised methods rely on correctly curated and annotated datasets. However, data annotation can be a cumbersome step requiring costly hand labeling. In this paper, we tackle multi-label classification problems where only a single positive label is available per image of the dataset. This weakly supervised setting aims at simplifying dataset assembly by collecting only positive image examples for each label without further annotation refinement. Our contributions are twofold. First, we introduce a light patch-based architecture built on the attention mechanism. Second, leveraging patch embedding self-similarities, we provide a novel strategy for estimating negative examples and dealing with positive and unlabeled learning problems. Experiments demonstrate that our architecture can be trained from scratch, whereas pre-training on similar databases is required for related methods from the literature.
Download

Paper Nr: 16
Title:

Human Object Interaction Detection Primed with Context

Authors:

Maya Antoun and Daniel Asmar

Abstract: Recognizing Human-Object Interaction (HOI) in images is a difficult yet fundamental requirement for scene understanding. Despite the significant advances deep learning has achieved so far in this field, the performance of state-of-the-art HOI detection systems is still very low. Contextual information about the scene has been shown to improve the prediction; however, most works that use semantic features rely on general word embedding models to represent the objects or the actions rather than contextual embeddings. Motivated by evidence from the field of human psychology, this paper suggests contextualizing actions by pairing their verbs with their relative objects at an early stage. The proposed system consists of two streams: a semantic memory stream, where verb-object pairs are represented via a graph network by their corresponding feature vectors; and an episodic memory stream, in which human-object interactions are represented by their corresponding visual features. Experimental results indicate that our proposed model achieves comparable results on the HICO-DET dataset with a pretrained object detector and superior results on HICO-DET with a finetuned detector.
Download

Paper Nr: 54
Title:

Absolute-ROMP: Absolute Multi-Person 3D Mesh Prediction from a Single Image

Authors:

Bilal Abdulrahman and Zhigang Zhu

Abstract: Recovering multi-person 3D poses and shapes with absolute scales from a single RGB image is a challenging task due to the inherent depth and scale ambiguity of a single view. Current works on 3D pose and shape estimation tend to focus mainly on estimating the 3D joint locations relative to the root joint, usually defined as the joint closest to the shape centroid, which in the case of humans is the pelvis joint. In this paper, we build upon an existing multi-person 3D mesh predictor network, ROMP, to create Absolute-ROMP. By adding absolute root joint localization in the camera coordinate frame, we are able to estimate multi-person 3D poses and shapes with absolute scales from a single RGB image. Such a single-shot approach allows the system to better learn and reason about the inter-person depth relationship, thus improving multi-person 3D estimation. In addition to this end-to-end network, we also train a CNN and transformer hybrid network, called TransFocal, to predict the focal length of the image’s camera. Absolute-ROMP estimates the 3D mesh coordinates of all persons in the image and their root joint locations normalized by the focal point. We then use TransFocal to obtain the focal length and recover absolute depth information of all joints in the camera coordinate frame. We evaluate Absolute-ROMP on the root joint localization and root-relative 3D pose estimation tasks on publicly available multi-person 3D pose datasets, and we evaluate TransFocal on a dataset created from the Pano360 dataset. Both networks are applicable to in-the-wild images and videos due to their real-time performance.
Download

Paper Nr: 55
Title:

Semi-Supervised Domain Adaptation with CycleGAN Guided by Downstream Task Awareness

Authors:

Annika Mütze, Matthias Rottmann and Hanno Gottschalk

Abstract: Domain adaptation is of huge interest as labeling is an expensive and error-prone task, especially on the pixel level as in semantic segmentation. Therefore, one would like to train neural networks on synthetic domains, where data is abundant. However, these models often perform poorly on out-of-domain images. Image-to-image approaches can bridge domains on the input level. Nevertheless, standard image-to-image approaches do not focus on the downstream task but rather on the visual inspection level. We therefore propose a “task aware” generative adversarial network in an image-to-image domain adaptation approach. Assisted by some labeled data, we guide the image-to-image translation to a more suitable input for a semantic segmentation network trained on synthetic data. This constitutes a modular semi-supervised domain adaptation method for semantic segmentation based on CycleGAN where we refrain from adapting the semantic segmentation expert. Our experiments involve evaluations on complex domain adaptation tasks and refined domain gap analyses using from-scratch-trained networks. We demonstrate that our method outperforms CycleGAN by 7 percentage points in accuracy in image classification using only 70 (10%) labeled images. For semantic segmentation, we show an improvement of up to 12.5 percentage points in mean intersection over union on Cityscapes using up to 148 labeled images.
Download

Paper Nr: 61
Title:

A General Context Learning and Reasoning Framework for Object Detection in Urban Scenes

Authors:

Xuan Wang, Hao Tang and Zhigang Zhu

Abstract: Contextual information has been widely used in many computer vision tasks. However, existing approaches design specific contextual information mechanisms for different tasks. In this work, we propose a general context learning and reasoning framework for object detection with three components: local contextual labeling, contextual graph generation and spatial contextual reasoning. With simple user-defined parameters, local contextual labeling automatically enlarges small object labels to include more local contextual information. A Graph Convolutional Network learns over the generated contextual graph to build a semantic space. A general spatial relation is used in spatial contextual reasoning to optimize the detection results. All three components can be easily added to and removed from a standard object detector. In addition, our approach also automates the training process to find the optimal combinations of user-defined parameters. The general framework can be easily adapted to different tasks. In this paper we compare our framework with a previous multistage context learning framework specifically designed for storefront accessibility detection and with a state-of-the-art detector for pedestrian detection. Experimental results on two urban scene datasets demonstrate that our proposed general framework achieves the same performance as the specifically designed multistage framework on storefront accessibility detection, and improved performance on pedestrian detection over the state-of-the-art detector.
Download

Paper Nr: 67
Title:

Rethinking the Backbone Architecture for Tiny Object Detection

Authors:

Jinlai Ning, Haoyan Guan and Michael Spratling

Abstract: Tiny object detection has become an active area of research because images with tiny targets are common in several important real-world scenarios. However, existing tiny object detection methods use standard deep neural networks as their backbone architecture. We argue that such backbones are inappropriate for detecting tiny objects as they are designed for the classification of larger objects, and do not have the spatial resolution to identify small targets. Specifically, such backbones use max-pooling or a large stride at early stages in the architecture. This produces lower resolution feature-maps that can be efficiently processed by subsequent layers. However, such low-resolution feature-maps do not contain information that can reliably discriminate tiny objects. To solve this problem we design “bottom-heavy” versions of backbones that allocate more resources to processing higher-resolution features without introducing any additional computational burden overall. We also investigate if pre-training these backbones on images of appropriate size, using CIFAR100 and ImageNet32, can further improve performance on tiny object detection. Results on TinyPerson and WiderFace show that detectors with our proposed backbones achieve better results than the current state-of-the-art methods.
Download

Paper Nr: 83
Title:

Rotation Equivariance for Diamond Identification

Authors:

Floris De Feyter, Bram Claes and Toon Goedemé

Abstract: To guarantee integrity when trading diamonds, a certified company can grade the diamonds and give them a unique ID. While this is often done for high-valued diamonds, it is economically less interesting to do this for lower-valued diamonds. While integrity could be checked manually as well, this involves a high labour cost. Instead, we present a computer vision-based technique for diamond identification. We propose to apply a polar transformation to the diamond image before passing the image to a CNN. This makes the network equivariant to rotations of the diamond. With this set-up, our best model achieves an mAP of 100% under a stringent evaluation regime. Moreover, we provide a custom implementation of the polar warp that is multiple orders of magnitude faster than the frequently used implementation of OpenCV.
Download
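
The key step above is a polar transformation of the diamond image before the CNN, so that a rotation of the diamond becomes a shift along the angular axis of the warped image. OpenCV's built-in warp, which the abstract notes is much slower than the authors' custom implementation, can serve as a reference:

```python
import cv2
import numpy as np

def to_polar(image, out_size=(256, 256)):
    """Reference polar warp around the image centre using OpenCV.

    A rotation of the input becomes a shift along the angular (row) axis."""
    h, w = image.shape[:2]
    center = (w / 2.0, h / 2.0)
    max_radius = np.hypot(w, h) / 2.0
    return cv2.warpPolar(image, out_size, center, max_radius,
                         cv2.INTER_LINEAR + cv2.WARP_POLAR_LINEAR)
```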

Paper Nr: 84
Title:

Toward Few Pixel Annotations for 3D Segmentation of Material from Electron Tomography

Authors:

Cyril Li, Christophe Ducottet, Sylvain Desroziers and Maxime Moreaud

Abstract: Segmentation is a notoriously tedious task, especially for 3D volumes of material obtained via electron tomography. In this paper, we propose a new method for the segmentation of such data with only a few partially labeled slices extracted from the volume. This method handles very restricted training data, in particular less than a single slice of the volume. Moreover, unlabeled data also contributes to the segmentation. To achieve this, a combination of self-supervised and contrastive learning methods is used on top of any 2D segmentation backbone. This method has been evaluated on three real electron tomography volumes.
Download

Paper Nr: 94
Title:

Leveraging Unsupervised and Self-Supervised Learning for Video Anomaly Detection

Authors:

Devashish Lohani, Carlos Crispim-Junior, Quentin Barthélemy, Sarah Bertrand, Lionel Robinault and Laure T. Rodet

Abstract: Video anomaly detection consists of detecting abnormal events in videos. Since abnormal events are rare, anomaly detection methods are mainly not fully supervised. One popular family of methods learns normality by training an autoencoder (AE) on normal data and detects anomalies as deviations from this normality. However, the powerful reconstruction capacity of the AE still makes it difficult to separate anomalies from normality. To address this issue, some works enhance the AE with an external memory bank or attention modules, but these methods still struggle to detect diverse spatial and temporal anomalies. In this work, we propose a method that leverages unsupervised and self-supervised learning on a single AE. The AE is trained in an end-to-end manner and jointly learns to discriminate anomalies using three chosen tasks: (i) unsupervised video clip reconstruction; (ii) unsupervised future frame prediction; (iii) self-supervised playback rate prediction. Furthermore, to correctly emphasize the detected anomalous regions in the video, we introduce a new error measure, called the blur pooled error. Our experiments reveal that the chosen tasks enrich the representational capability of the autoencoder to detect anomalous events in videos. Results demonstrate that our approach outperforms the state-of-the-art methods on three public video anomaly datasets.
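
For illustration, the sketch below shows one plausible reading of a "blur pooled" error, assumed here to mean smoothing the per-pixel reconstruction error with an averaging kernel before pooling it into a frame-level score, so that isolated noisy pixels are suppressed while spatially coherent anomalous regions dominate; this is an assumption, not the paper's exact definition.

import torch
import torch.nn.functional as F

def blur_pooled_error(frame, recon, kernel_size=15):
    # Hypothetical "blur pooled" score: average the per-pixel squared error over
    # channels, blur it with an average-pooling kernel, then take the spatial
    # maximum, so coherent anomalous regions outweigh isolated noisy pixels.
    err = ((frame - recon) ** 2).mean(dim=1, keepdim=True)          # (B, 1, H, W)
    blurred = F.avg_pool2d(err, kernel_size, stride=1, padding=kernel_size // 2)
    return blurred.flatten(1).max(dim=1).values                     # one score per frame

frame, recon = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(blur_pooled_error(frame, recon))
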
Download

Paper Nr: 100
Title:

Visual Anomaly Detection and Localization with a Patch-Wise Transformer and Convolutional Model

Authors:

Afshin Dini and Esa Rahtu

Abstract: We present a one-class classification approach for detecting and locating anomalies in vision applications based on the combination of convolutional networks and transformers. This method utilizes a pre-trained model with four blocks of patch-wise transformer encoders and convolutional layers to extract patch embeddings from normal samples. The patch features from the third and fourth blocks of the model are then combined to form the final representations, and several multivariate Gaussian distributions are then fitted to these normal embeddings. At test time, irregularities are detected and located by thresholding an anomaly score and map defined by the Mahalanobis distances between the patch embeddings of test samples and the corresponding normal distributions. By evaluating the proposed method on the MVTec dataset, we find that this method not only detects anomalies properly, thanks to the ability of the convolutional and transformer layers to represent local and global properties of an image, respectively, but is also computationally efficient, as it skips the training phase by using a pre-trained network as the feature extractor. These properties make our method a good candidate for detecting and locating irregularities in real-world industrial applications.
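
The scoring step described above, fitting a multivariate Gaussian per patch position on normal embeddings and thresholding the Mahalanobis distance of test patches, can be sketched as follows; the feature extractor is abstracted away and the embedding dimensions are illustrative assumptions.

import numpy as np

# Fit one Gaussian per patch position from normal-sample embeddings, then score
# test patches by Mahalanobis distance (dimensions below are assumptions).
def fit_gaussians(train_emb, eps=1e-2):
    # train_emb: (N, P, D) = normal samples x patch positions x feature dim
    mean = train_emb.mean(axis=0)                          # (P, D)
    cov_inv = []
    for p in range(train_emb.shape[1]):
        centred = train_emb[:, p] - mean[p]
        cov = centred.T @ centred / (len(train_emb) - 1)
        cov += eps * np.eye(cov.shape[0])                  # regularise
        cov_inv.append(np.linalg.inv(cov))
    return mean, np.stack(cov_inv)                         # (P, D), (P, D, D)

def anomaly_map(test_emb, mean, cov_inv):
    # test_emb: (P, D) for one image -> (P,) Mahalanobis distances
    diff = test_emb - mean
    return np.sqrt(np.einsum("pd,pde,pe->p", diff, cov_inv, diff))

train = np.random.randn(100, 196, 64)      # 100 normal images, 14x14 patch grid
mean, cov_inv = fit_gaussians(train)
scores = anomaly_map(np.random.randn(196, 64), mean, cov_inv)
print(scores.shape, scores.max())          # threshold max(scores) to flag an image
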
Download

Paper Nr: 117
Title:

YOLO: You Only Look 10647 Times

Authors:

Christian Limberg, Andrew Melnik, Helge Ritter and Helmut Prendinger

Abstract: In this work, we explore the You Only Look Once (YOLO) single-stage object detection architecture and compare it to the simultaneous classification of 10647 fixed region proposals. We use two different approaches to demonstrate that each of YOLO’s grid cells is attentive to a specific sub-region of previous layers. This finding makes YOLO’s method comparable to local region proposals. Such insight reduces the conceptual gap between YOLO-like single-stage object detection models, R-CNN-like two-stage region proposal based models, and ResNet-like image classification models. For this work, we created interactive exploration tools for a better visual understanding of the YOLO information processing streams: https://limchr.github.io/yolo_visu
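
The number in the title follows from YOLOv3-style dense prediction at a 416x416 input: three detection scales with 13x13, 26x26 and 52x52 grid cells, each predicting 3 anchor boxes, give 10647 fixed proposals.

# 10647 fixed proposals for a YOLOv3-style detector at 416x416 input:
grids = [13, 26, 52]                 # grid sizes at the three detection scales
anchors_per_cell = 3
total = sum(g * g * anchors_per_cell for g in grids)
print(total)                         # 3 * (169 + 676 + 2704) = 10647
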
Download

Paper Nr: 139
Title:

On Attribute Aware Open-Set Face Verification

Authors:

Arun K. Subramanian and Anoop Namboodiri

Abstract: Deep Learning on face recognition problems has shown extremely high accuracy owing to its ability to find strongly discriminating features. However, face images in the wild show variations in pose, lighting, expressions, and the presence of facial attributes (for example eyeglasses). We ask: why then are these variations not detected and used during the matching process? We demonstrate that this is indeed possible while restricting ourselves to facial attribute variation, to prove the case in point. We show two ways of doing so. a) By using the face attribute labels as a form of prior, we bin the matching template pairs into three bins depending on whether each template of the matching pair possesses a given facial attribute or not. By operating on each bin and averaging the result, we improve the EER of the SOTA by over 1% over a large set of matching pairs. b) We use the attribute labels and correlate them with each neuron of an embedding generated by a DNN with a SOTA architecture, pre-trained on a large face dataset and fine-tuned on face-attribute labels. We then suppress a set of maximally correlating neurons and perform matching after doing so. We demonstrate that this improves the EER by over 2%.
Download

Paper Nr: 163
Title:

A Lightweight Gaussian-Based Model for Fast Detection and Classification of Moving Objects

Authors:

Joaquin Palma-Ugarte, Laura Estacio-Cerquin, Victor Flores-Benites and Rensso Mora-Colque

Abstract: Moving object detection and classification are fundamental tasks in computer vision. However, current solutions detect all objects, and then another algorithm is used to determine which objects are in motion. Furthermore, diverse solutions employ complex networks that require a lot of computational resources, unlike lightweight solutions that could lead to widespread use. We introduce TRG-Net, a unified model that can be executed on computationally limited devices to detect and classify just moving objects. This proposal is based on the Faster R-CNN architecture, MobileNetV3 as a feature extractor, and a Gaussian mixture model for a fast search of regions of interest based on motion. TRG-Net reduces the inference time by unifying moving object detection and image classification tasks, and by limiting the regions of interest to the number of moving objects. Experiments over surveillance videos and the Kitti dataset for 2D object detection show that our approach improves the inference time of Faster R-CNN (0.221 to 0.138s) using fewer parameters (18.91 M to 18.30 M) while maintaining average precision (AP=0.423). Therefore, TRG-Net achieves a balance between precision and speed, and could be applied in various real-world scenarios.
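
A Gaussian-mixture background model for the motion-based search of regions of interest can be sketched with OpenCV's MOG2 subtractor; the thresholds below and the hand-off of the resulting boxes to the detection head are simplifying assumptions.

import cv2

# Motion-driven region proposals with a Gaussian mixture background model
# (requires OpenCV >= 4; parameters and filtering are illustrative assumptions).
def moving_object_rois(frames, min_area=400):
    subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16,
                                                    detectShadows=False)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    for frame in frames:
        mask = subtractor.apply(frame)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        rois = [cv2.boundingRect(c) for c in contours
                if cv2.contourArea(c) >= min_area]
        yield frame, rois      # only these boxes would be classified downstream
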
Download

Paper Nr: 168
Title:

Image Generation from a Hyper Scene Graph with Trinomial Hyperedges

Authors:

Ryosuke Miyake, Tetsu Matsukawa and Einoshin Suzuki

Abstract: Generating realistic images is one of the important problems in the field of computer vision. In image generation tasks, generating images consistent with an input given by the user is called conditional image generation. Due to the recent advances in generating high-quality images with Generative Adversarial Networks, many conditional image generation models have been proposed, such as text-to-image, scene-graph-to-image, and layout-to-image models. Among them, scene-graph-to-image models have the advantage of generating an image for a complex situation according to the structure of a scene graph. However, existing scene-graph-to-image models have difficulty in capturing positional relations among three or more objects, since a scene graph can only represent relations between two objects. In this paper, we propose a novel image generation model which addresses this shortcoming by generating images from a hyper scene graph with trinomial hyperedges. We also use a layout-to-image model supplementally to generate higher resolution images. Experimental validations on the COCO-Stuff and Visual Genome datasets show that the proposed model generates images that are more natural and more faithful to the user's inputs than a cutting-edge scene-graph-to-image model.
Download

Paper Nr: 174
Title:

Automatic Defect Detection in Leather

Authors:

João Soares, Luís Magalhães, Rafaela Pinho, Mehrab Allahdad and Manuel Ferreira

Abstract: Traditionally, leather defect detection is performed manually by specialized workers in the leather inspection process. However, this task is slow and prone to error. In the last two decades, several researchers have therefore proposed new solutions to automate this procedure. Efficient solutions already exist in the literature. However, these solutions are based on supervised machine learning techniques that require large annotated datasets. As the leather annotation process is time-consuming, it is necessary to find a solution to overcome this challenge. This research therefore explores novelty detection techniques. Moreover, this work evaluates the performance of SSIM Autoencoder, CFLOW, STFPM, RDOCE, and DRAEM on the leather defect detection problem. These techniques are trained and tested on two distinct datasets: MVTec and Neadvance. They perform well on MVTec defect detection; however, they have difficulties with the Neadvance dataset. This research presents the best methodology to use for two distinct scenarios. When the real-world samples have only one color, DRAEM should be used; when the real-world samples have more than one color, STFPM should be applied.
Download

Paper Nr: 178
Title:

IFMix: Utilizing Intermediate Filtered Images for Domain Adaptation in Classification

Authors:

Saeed B. Germi and Esa Rahtu

Abstract: This paper proposes an iterative intermediate domain generation method using low- and high-pass filters. Domain shift is one of the prime reasons for the poor generalization of trained models in most real-life applications. In a typical case, the target domain differs from the source domain due to either controllable factors (e.g., different sensors) or uncontrollable factors (e.g., weather conditions). Domain adaptation methods bridge this gap by training a domain-invariant network. However, a significant gap between the source and the target domains would still result in bad performance. Gradual domain adaptation methods utilize intermediate domains that gradually shift from the source to the target domain to counter the effect of the significant gap. Still, the assumption of having sufficiently large intermediate domains at hand for any given task is hard to fulfill in real-life scenarios. The proposed method utilizes low- and high-pass filters to create two distinct representations of a single sample. After that, the filtered samples from two domains are mixed with a dynamic ratio to create intermediate domains, which are used to train two separate models in parallel. The final output is obtained by averaging out both models. The method’s effectiveness is demonstrated with extensive experiments on public benchmark datasets: Office-31, Office-Home, and VisDa-2017. The empirical evaluation suggests that the proposed method performs better than the current state-of-the-art works.
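
The core construction, a low-pass and a high-pass view of each sample mixed across domains with a ratio that changes during training, can be sketched as follows; the Gaussian filter, the linear ratio schedule and the way the two views feed the two parallel models are simplified assumptions rather than the authors' exact formulation.

import torch
import torchvision.transforms.functional as TF

def low_high_pass(x, sigma=2.0, kernel_size=9):
    # Split an image batch into a low-frequency (blurred) and a high-frequency
    # (residual) component.
    low = TF.gaussian_blur(x, kernel_size=kernel_size, sigma=sigma)
    return low, x - low

def intermediate_views(src, tgt, step, total_steps, sigma=2.0):
    # Two intermediate-domain views (one per parallel model): the low-pass and
    # high-pass components of source and target are mixed with a ratio that
    # drifts from source-dominated to target-dominated over training.
    lam = step / float(total_steps)                     # 0 -> source, 1 -> target
    src_low, src_high = low_high_pass(src, sigma)
    tgt_low, tgt_high = low_high_pass(tgt, sigma)
    view_low = (1 - lam) * src_low + lam * tgt_low      # fed to model 1
    view_high = (1 - lam) * src_high + lam * tgt_high   # fed to model 2
    return view_low, view_high

src, tgt = torch.rand(4, 3, 224, 224), torch.rand(4, 3, 224, 224)
v1, v2 = intermediate_views(src, tgt, step=3, total_steps=10)
print(v1.shape, v2.shape)
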
Download

Paper Nr: 192
Title:

DeepCaps+: A Light Variant of DeepCaps

Authors:

Pouya Shiri and Amirali Baniasadi

Abstract: Image classification is one of the fundamental problems in the field of computer vision. Convolutional Neural Networks (CNN) are complex feed-forward neural networks that represent outstanding solutions for this problem. Capsule Network (CapsNet) is considered the next generation of classifiers based on Convolutional Neural Networks. Despite its advantages, including higher robustness to affine transformations, CapsNet does not perform well on complex data. Several works have tried to realize the true potential of CapsNet to provide better performance. DeepCaps is one such network with significantly improved performance. Despite its better performance on complex datasets such as CIFAR-10, DeepCaps fails to work on more complex datasets with a higher number of categories such as CIFAR-100. In this work, we introduce DeepCaps+, an optimized variant of DeepCaps with fewer parameters and higher accuracy. Using a 7-ensemble model on the CIFAR-10 dataset, DeepCaps+ obtains an accuracy of 91.63% while performing inference 2.51x faster than DeepCaps. DeepCaps+ also obtains 67.56% test accuracy on the CIFAR-100 dataset, showing that the network is capable of handling complex datasets.
Download

Paper Nr: 208
Title:

Body Part Information Additional in Multi-decoder Transformer-Based Network for Human Object Interaction Detection

Authors:

Zihao Guo, Fei Li, Rujie Liu, Ryo Ishida and Genta Suzuki

Abstract: Human Object Interaction Detection is one of the essential branches of video understanding. However, many complex scenes exist, such as humans interacting with multiple objects. Using the whole human body as the subject of interaction in such complex environments may cause interactions to be attributed to the wrong objects. In this paper, we propose a Transformer-based structure with a body part additional module to solve this problem. The Transformer structure is applied to provide powerful information mining capability. Moreover, a multi-decoder structure is adopted for solving different sub-problems, enabling the model to focus on different regions for stronger performance. The most important contribution of our work is the proposed body part additional module. It introduces body part information for Human-Object Interaction (HOI) detection, which refines the subject of the HOI triplet and assists the interaction detection. The body part additional module also includes a Channel Attention module to balance the information, preventing the model from paying too much attention to either the body part or the Human-Object pair. We achieve better performance than the state-of-the-art model.
Download

Paper Nr: 209
Title:

Multi-View Video Synthesis Through Progressive Synthesis and Refinement

Authors:

Mohamed I. Lakhal, Oswald Lanz and Andrea Cavallaro

Abstract: Multi-view video synthesis aims to reproduce a video as seen from a targeted viewpoint. This paper proposes to tackle this problem using a multi-stage framework that progressively adds more details to the synthesized frames and refines wrong pixels from previous predictions. First, we reconstruct the foreground and the background by using a 3D mesh. To do so, we leverage the one-to-one correspondence of rendered mesh faces between the input and the target views. Then, the predicted frames are defined with a recurrence formula to correct wrong pixels and add high-frequency details. Results on the NTU RGB+D dataset show the effectiveness of the proposed approach against frame-based and video-based state-of-the-art models.
Download

Paper Nr: 211
Title:

BGD: Generalization Using Large Step Sizes to Attract Flat Minima

Authors:

Muhammad Ali, Omar Alsuwaidi and Salman Khan

Abstract: In the digital age of ever-increasing data sources, accessibility, and collection, the demand for generalizable machine learning models that are effective at capitalizing on given limited training datasets is unprecedented due to the labor-intensiveness and expensiveness of data collection. The deployed model must efficiently exploit patterns and regularities in the data to achieve desirable predictive performance on new, unseen datasets. Naturally, due to the various sources of data pools within different domains from which data can be collected, such as in Machine Learning, Natural Language Processing, and Computer Vision, selection bias will evidently creep into the gathered data, resulting in distribution (domain) shifts. In practice, it is typical for learned deep neural networks to yield sub-optimal generalization performance as a result of pursuing sharp local minima when simply solving empirical risk minimization (ERM) on highly complex and non-convex loss functions. Hence, this paper aims to tackle the generalization error by first introducing the notion of a local minimum’s sharpness, which is an attribute that induces a model’s non-generalizability and can serve as a simple guiding heuristic to theoretically distinguish satisfactory (flat) local minima from poor (sharp) local minima. Secondly, motivated by the introduced concept of variance-stability ∼ exploration-exploitation tradeoff, we propose a novel gradient-based adaptive optimization algorithm that is a variant of SGD, named Bouncing Gradient Descent (BGD). BGD’s primary goal is to ameliorate SGD’s deficiency of getting trapped in suboptimal minima by utilizing relatively large step sizes and "unorthodox" approaches in the weight updates in order to achieve better model generalization by attracting flatter local minima. We empirically validate the proposed approach on several benchmark classification datasets, showing that it contributes to significant and consistent improvements in model generalization performance and produces state-of-the-art results when compared to the baseline approaches.
Download

Paper Nr: 218
Title:

Tackling Data Bias in Painting Classification with Style Transfer

Authors:

Mridula Vijendran, Frederick B. Li and Hubert P. H. Shum

Abstract: It is difficult to train classifiers on paintings collections due to model bias from domain gaps and data bias from the uneven distribution of artistic styles. Previous techniques like data distillation, traditional data augmentation and style transfer improve classifier training using task specific training datasets or domain adaptation. We propose a system to handle data bias in small paintings datasets like the Kaokore dataset while simultaneously accounting for domain adaptation in fine-tuning a model trained on real world images. Our system consists of two stages which are style transfer and classification. In the style transfer stage, we generate the stylized training samples per class with uniformly sampled content and style images and train the style transformation network per domain. In the classification stage, we can interpret the effectiveness of the style and content layers at the attention layers when training on the original training dataset and the stylized images. We can tradeoff the model performance and convergence by dynamically varying the proportion of augmented samples in the majority and minority classes. We achieve comparable results to the SOTA with fewer training epochs and a classifier with fewer training parameters.
Download

Paper Nr: 246
Title:

Dynamically Modular and Sparse General Continual Learning

Authors:

Arnav Varma, Elahe Arani and Bahram Zonooz

Abstract: Real-world applications often require learning continuously from a stream of data under ever-changing conditions. When trying to learn from such non-stationary data, deep neural networks (DNNs) undergo catastrophic forgetting of previously learned information. Among the common approaches to avoid catastrophic forgetting, rehearsal-based methods have proven effective. However, they are still prone to forgetting due to task-interference as all parameters respond to all tasks. To counter this, we take inspiration from sparse coding in the brain and introduce dynamic modularity and sparsity (Dynamos) for rehearsal-based general continual learning. In this setup, the DNN learns to respond to stimuli by activating relevant subsets of neurons. We demonstrate the effectiveness of Dynamos on multiple datasets under challenging continual learning evaluation protocols. Finally, we show that our method learns representations that are modular and specialized, while maintaining reusability by activating subsets of neurons with overlaps corresponding to the similarity of stimuli. The code is available at https://github.com/NeurAI-Lab/DynamicContinualLearning.
Download

Paper Nr: 252
Title:

Emotion Transformer: Attention Model for Pose-Based Emotion Recognition

Authors:

Pedro V. Paiva, Josué G. Ramos, Marina L. Gavrilova and Marco G. Carvalho

Abstract: Capturing humans' emotional states from images in real-world scenarios is a key problem in affective computing, which has various real-life applications. Emotion recognition methods can enhance video games to increase engagement, help students stay motivated during e-learning sessions, or make interaction more natural in social robotics. Body movements, a crucial component of non-verbal communication, remain less explored in the domain of emotion recognition, while facial expression-based methods are widely investigated. Transformer networks have been successfully applied across several domains, bringing significant breakthroughs. Transformers' self-attention mechanism captures relationships between different features across different spatial locations, allowing contextual information extraction. In this work, we introduce Emotion Transformer, a self-attention architecture leveraging spatial configurations of body joints for Body Emotion Recognition. Our approach is based on the Vision Transformer linear projection function, allowing the conversion of 2D joint coordinates to a regular matrix representation. The projected matrix then feeds a standard transformer multi-head attention architecture. The developed method correlates joint movements over time more robustly, recognizing emotions through contextual information learning. We present an evaluation benchmark for acted emotional sequences extracted from movie scenes using the BoLD dataset. The proposed methodology outperforms several state-of-the-art architectures, demonstrating the effectiveness of the method.
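
The input construction described above, a linear projection of 2D joint coordinates into token embeddings followed by a standard multi-head-attention encoder, can be sketched in PyTorch as below; the number of joints, layer sizes and classification head are assumptions, and the temporal dimension is omitted for brevity.

import torch
import torch.nn as nn

class PoseEmotionTransformer(nn.Module):
    # Minimal sketch: 2D body joints -> linear token projection ->
    # Transformer encoder -> emotion logits.  Sizes are assumptions.
    def __init__(self, num_joints=18, d_model=128, num_classes=7,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.project = nn.Linear(2, d_model)            # (x, y) per joint -> token
        self.pos = nn.Parameter(torch.zeros(1, num_joints, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, joints):                          # joints: (B, J, 2)
        tokens = self.project(joints) + self.pos
        encoded = self.encoder(tokens)                  # (B, J, d_model)
        return self.head(encoded.mean(dim=1))           # pool over joints

model = PoseEmotionTransformer()
print(model(torch.rand(2, 18, 2)).shape)                # torch.Size([2, 7])
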
Download

Paper Nr: 268
Title:

Evaluation of Computer Vision-Based Person Detection on Low-Cost Embedded Systems

Authors:

Francesco Pasti and Nicola Bellotto

Abstract: Person detection applications based on computer vision techniques often rely on complex Convolutional Neural Networks that require powerful hardware in order to achieve good runtime performance. The work of this paper has been developed with the aim of implementing a safety system, based on computer vision algorithms, able to detect people in working environments using an embedded device. Possible applications for such safety systems include remote site monitoring and autonomous mobile robots in warehouses and industrial premises. Similar studies already exist in the literature, but they mostly rely on systems like the NVIDIA Jetson that, with a CUDA-enabled GPU, are able to provide satisfactory results. This, however, comes with a significant downside, as such devices are usually expensive and require significant power consumption. This paper instead considers various implementations of computer vision-based person detection on two power-efficient and inexpensive devices, namely the Raspberry Pi 3 and 4. In order to do so, some solutions based on off-the-shelf algorithms are first explored by reporting experimental results based on relevant performance metrics. Then, the paper presents a newly-created custom architecture, called eYOLO, that tries to solve some limitations of the previous systems. The experimental evaluation demonstrates the good performance of the proposed approach and suggests ways for further improvement.
Download

Paper Nr: 270
Title:

Triple-stream Deep Metric Learning of Great Ape Behavioural Actions

Authors:

Otto Brookes, Majid Mirmehdi, Hjalmar Kühl and Tilo Burghardt

Abstract: We propose the first metric learning system for the recognition of great ape behavioural actions. Our proposed triple stream embedding architecture works on camera trap videos taken directly in the wild and demonstrates that the utilisation of an explicit DensePose-Chimp body part segmentation stream effectively complements traditional RGB appearance and optical flow streams. We evaluate system variants with different feature fusion techniques and long-tail recognition approaches. Results and ablations show performance improvements of ~12% in top-1 accuracy over previous results achieved on the PanAf-500 dataset containing 180,000 manually annotated frames across nine behavioural actions. Furthermore, we provide a qualitative analysis of our findings and augment the metric learning system with long-tail recognition techniques showing that average per class accuracy -- critical in the domain -- can be improved by ~23% compared to the literature on that dataset. Finally, since our embedding spaces are constructed as metric, we provide first data-driven visualisations of the great ape behavioural action spaces revealing emerging geometry and topology. We hope that the work sparks further interest in this vital application area of computer vision for the benefit of endangered great apes. We provide all key source code and network weights alongside this publication.
Download

Paper Nr: 275
Title:

Efficient Deep Learning Ensemble for Skin Lesion Classification

Authors:

David D. Gaviria, Md K. Saker and Petia Radeva

Abstract: Vision Transformers (ViTs) are deep learning techniques that have been gaining in popularity in recent years. In this work, we study the performance of ViTs and Convolutional Neural Networks (CNNs) on skin lesion classification tasks, specifically melanoma diagnosis. We show that, regardless of the performance of both architectures, an ensemble of them can improve their generalization. We also present an adaptation of the Gram-OOD* method (detecting out-of-distribution (OOD) samples using Gram matrices) for skin lesion images. Moreover, the integration of super-convergence was critical to success in building models with strict computing and training time constraints. We evaluated our ensemble of ViTs and CNNs, demonstrating that generalization is enhanced, by placing first in the 2019 and third in the 2020 ISIC Challenge Live Leaderboards (available at https://challenge.isic-archive.com/leaderboards/live/).
Download

Paper Nr: 276
Title:

Linking Data Separation, Visual Separation, and Classifier Performance Using Pseudo-labeling by Contrastive Learning

Authors:

Bárbara C. Benato, Alexandre X. Falcão and Alexandru-Cristian Telea

Abstract: Lacking supervised data is an issue while training deep neural networks (DNNs), mainly when considering medical and biological data where supervision is expensive. Recently, Embedded Pseudo-Labeling (EPL) addressed this problem by using a non-linear projection (t-SNE) from a feature space of the DNN to a 2D space, followed by semi-supervised label propagation using a connectivity-based method (OPFSemi). We argue that the performance of the final classifier depends on the data separation present in the latent space and visual separation present in the projection. We address this by first proposing to use contrastive learning to produce the latent space for EPL by two methods (SimCLR and SupCon) and by their combination, and secondly by showing, via an extensive set of experiments, the aforementioned correlations between data separation, visual separation, and classifier performance. We demonstrate our results by the classification of five real-world challenging image datasets of human intestinal parasites with only 1% supervised samples.
Download

Paper Nr: 278
Title:

HaloAE: A Local Transformer Auto-Encoder for Anomaly Detection and Localization Based on HaloNet

Authors:

Emilie Mathian, Huidong Liu, Lynnette Fernandez-Cuesta, Dimitris Samaras, Matthieu Foll and Liming Chen

Abstract: Unsupervised anomaly detection and localization is a crucial task in many applications, e.g., defect detection in industry and cancer localization in medicine, and requires both local and global information, as enabled by the self-attention in Transformers. However, brute-force adaptation of the Transformer, e.g., ViT, suffers from two issues: 1) the high computation complexity, making it hard to deal with high-resolution images; and 2) patch-based tokens, which are inappropriate for pixel-level dense prediction tasks, e.g., anomaly localization, and ignore intra-patch interactions. We present HaloAE, the first auto-encoder based on a local 2D version of the Transformer with HaloNet, allowing intra-patch correlation computation with a receptive field covering 25% of the input image. HaloAE combines convolution and local 2D block-wise self-attention layers and performs anomaly detection and segmentation through a single model. Moreover, because the loss function is generally a weighted sum of several losses, we also introduce a novel dynamic weighting scheme to better optimize the learning of the model. The competitive results on the MVTec dataset suggest that vision models incorporating Transformers could benefit from a local computation of the self-attention operation and its very low computational cost, paving the way for applications to very large images.
Download

Paper Nr: 280
Title:

FInC Flow: Fast and Invertible k × k Convolutions for Normalizing Flows

Authors:

Aditya Kallappa, Sandeep Nagar and Girish Varma

Abstract: Invertible convolutions have been an essential element for building expressive normalizing flow-based generative models since their introduction in Glow. Several attempts have been made to design invertible k × k convolutions that are efficient in training and sampling passes. Though these attempts have improved the expressivity and sampling efficiency, they severely lag behind Glow, which uses only 1×1 convolutions, in terms of sampling time. Also, many of the approaches mask a large number of parameters of the underlying convolution, resulting in lower expressivity on a fixed run-time budget. We propose a k × k convolutional layer and Deep Normalizing Flow architecture which i) has a fast parallel inversion algorithm with running time O(n k^2), where n is the height and width of the input image and k is the kernel size; ii) masks the minimal amount of learnable parameters in a layer; and iii) gives better forward pass and sampling times compared to other k × k convolution-based models on real-world benchmarks. We provide an implementation of the proposed parallel algorithm for sampling using our invertible convolutions on GPUs. Benchmarks on the CIFAR-10, ImageNet, and CelebA datasets show comparable performance.
Download

Paper Nr: 284
Title:

Learning Less Generalizable Patterns for Better Test-Time Adaptation

Authors:

Thomas Duboudin, Emmanuel Dellandréa, Corentin Abgrall, Gilles Hénaff and Liming Chen

Abstract: Deep neural networks often fail to generalize outside of their training distribution, particularly when only a single data domain is available during training. While test-time adaptation has yielded encouraging results in this setting, we argue that, to reach further improvements, these approaches should be combined with training procedure modifications aiming to learn a more diverse set of patterns. Indeed, test-time adaptation methods usually have to rely on a limited representation because of the shortcut learning phenomenon: only a subset of the available predictive patterns is learned with standard training. In this paper, we first show that the combined use of existing training-time strategies and test-time batch normalization, a simple adaptation method, does not always improve upon test-time adaptation alone on the PACS benchmark. Furthermore, experiments on Office-Home show that very few training-time methods improve upon standard training, with or without test-time batch normalization. Therefore, we propose a novel approach that mitigates the shortcut learning behavior by having an additional classification branch learn less predictive and generalizable patterns. Our experiments show that our method improves upon the state-of-the-art results on both benchmarks and benefits the most from test-time batch normalization.
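
Test-time batch normalization, the simple adaptation method referred to above, is commonly implemented by letting BatchNorm layers use statistics computed from the current test batch instead of the source-domain running averages; a minimal PyTorch sketch of that reading follows.

import torch
import torch.nn as nn

def enable_test_time_bn(model):
    # Put only the BatchNorm layers in training mode so that, at inference,
    # normalization statistics come from the current test batch rather than
    # the source-domain running averages.
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()                      # use batch statistics
            m.track_running_stats = False  # and do not update the running buffers
    return model

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model = enable_test_time_bn(model)
with torch.no_grad():
    out = model(torch.rand(16, 3, 32, 32))   # stats are computed from this batch
print(out.shape)
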
Download

Short Papers
Paper Nr: 2
Title:

A Multi-Class Probabilistic Optimum-Path Forest

Authors:

Silas N. Fernandes, Leandro A. Passos, Danilo Jodas, Marco Akio, André N. Souza and João P. Papa

Abstract: The advent of machine learning has provided numerous benefits to humankind, impacting fields such as medicine, the military, and entertainment, to cite a few. In most cases, given some instances from a previously known domain, the intelligent algorithm is in charge of predicting a label that categorizes such samples in some learned context. Among several techniques capable of accomplishing such classification tasks, one may refer to Support Vector Machines, Neural Networks, or graph-based classifiers, such as the Optimum-Path Forest (OPF). Even though such a paradigm satisfies a wide variety of problems, others require both the predicted class label and the classifier's confidence, i.e., how sure the model is while attributing labels. Recently, an OPF-based variant was proposed to tackle this problem, i.e., the Probabilistic Optimum-Path Forest. Despite its satisfactory results over a considerable number of datasets, it was conceived to deal with binary classification only, thus falling short in the context of multi-class problems. Therefore, this paper proposes the Multi-Class Probabilistic Optimum-Path Forest, an extension designed to overcome the limitations observed in the standard Probabilistic OPF.
Download

Paper Nr: 3
Title:

Quantitative Analysis to Find the Optimum Scale Range for Object Representations in Remote Sensing Images

Authors:

Rasna A. Amit and C. K. Mohan

Abstract: Airport object surveillance using big data requires high temporal frequency remote sensing observations. However, the spatial heterogeneity and multi-scale, multi-resolution properties of images for airport surveillance tasks have led to severe data discrepancies. Consequently, artificial intelligence and deep learning algorithms struggle to achieve accurate detection and effective scaling of remote sensing information. The quantification of intra-pixel differences may be enhanced by employing non-linear estimation algorithms to reduce their impact. An alternate strategy is to define scales that help minimize spatial and intra-pixel variability for various image processing tasks. This paper aims to demonstrate the effect of scale and resolution on object representations for airport surveillance using remote sensing images. In our method, we introduce dynamic kernel-based representations that aid in adapting to the spatial variability and identifying the optimum scale range for object representations for seamless airport surveillance. Airport images are captured at different spatial resolutions and feature representations are learned using large Gaussian Mixture Models (GMMs). The object classification is done using a support vector machine and the optimum range is identified. Dynamic kernel GMMs can handle the disparities due to scale variations and image capturing by effectively preserving the local structure information, similarities, and changes in spatial contents globally for the same context. Our experiments indicate that the classification performance is better when both the first- and second-order statistics of the Gaussian Mixture Models are used.
Download

Paper Nr: 5
Title:

Mixing Augmentation and Knowledge-Based Techniques in Unsupervised Domain Adaptation for Segmentation of Edible Insect States

Authors:

Paweł Majewski, Piotr Lampa, Robert Burduk and Jacek Reiner

Abstract: Models for detecting edible insect states (live larvae, dead larvae, pupae) are a crucial component of large-scale edible insect monitoring systems. The problem of changing the nature of the data (domain shift) that occurs when implementing the system to new conditions results in a reduction in the effectiveness of previously developed models. Proposing methods for the unsupervised adaptation of models is necessary to reduce the adaptation time of the entire system to new breeding conditions. The study acquired images from three data sources characterized by different types of cameras and illumination and checked the inference quality of the model trained in the source domain on samples from the target domain. A hybrid approach based on mixing augmentation and knowledge-based techniques was proposed to adapt the model. The first stage of the proposed method based on object augmentation and synthetic image generation enabled an increase in average AP50 from 58.4 to 62.9. The second stage of the proposed method, based on knowledge-based filtering of target domain objects and synthetic image generation, enabled a further increase in average AP50 from 62.9 to 71.8. The strategy of mixing objects from the source domain and the target domain (AP50=71.8) when generating synthetic images proved to be much better than the strategy of using only objects from the target domain (AP50=65.5). The results show the great importance of augmentation and a priori knowledge when adapting models to a new domain.
Download

Paper Nr: 10
Title:

False Negative Reduction in Semantic Segmentation Under Domain Shift Using Depth Estimation

Authors:

Kira Maag and Matthias Rottmann

Abstract: State-of-the-art deep neural networks demonstrate outstanding performance in semantic segmentation. However, their performance is tied to the domain represented by the training data. Open-world scenarios cause inaccurate predictions, which is hazardous in safety-relevant applications like automated driving. In this work, we enhance semantic segmentation predictions using monocular depth estimation to improve segmentation by reducing the occurrence of non-detected objects in the presence of domain shift. To this end, we infer a depth heatmap via a modified segmentation network which generates foreground-background masks, operating in parallel to a given semantic segmentation network. Both segmentation masks are aggregated with a focus on foreground classes (here road users) to reduce false negatives. To also reduce the occurrence of false positives, we apply a pruning based on uncertainty estimates. Our approach is modular in the sense that it post-processes the output of any semantic segmentation network. In our experiments, we observe fewer non-detected objects of the most important classes and an enhanced generalization to other domains compared to the basic semantic segmentation prediction.
Download

Paper Nr: 28
Title:

Fast Eye Detector Using Siamese Network for NIR Partial Face Images

Authors:

Yuka Ogino, Yuho Shoji, Takahiro Toizumi, Ryoma Oami and Masato Tsukada

Abstract: This paper proposes a fast eye detection method that is based on a Siamese network for near infrared (NIR) partial face images. NIR partial face images do not include the whole face of a subject since they are captured using iris recognition systems with the constraint of frame rate and resolution. The iris recognition systems such as the iris on the move (IOTM) system require fast and accurate eye detection as a pre-process. Our goal is to design eye detection with high speed, high discrimination performance between left and right eyes, and high positional accuracy of eye center. Our method adopts a Siamese network and coarse to fine position estimation with a fast lightweight CNN backbone. The network outputs features of images and the similarity map indicating coarse position of an eye. A regression on a portion of a feature with high similarity refines the coarse position of the eye to obtain the fine position with high accuracy. We demonstrate the effectiveness of the proposed method by comparing it with conventional methods, including SOTA, in terms of the positional accuracy, the discrimination performance, and the processing speed. Our method achieves superior performance in speed.
Download

Paper Nr: 35
Title:

Understanding of Feature Representation in Convolutional Neural Networks and Vision Transformer

Authors:

Hiroaki Minoura, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Understanding a feature representation (e.g., object shape and texture) of an image is an important clue for image classification tasks using deep learning models, it is important to us humans. Transformer-based architectures such as Vision Transformer (ViT) have outperformed higher accuracy than Convolutional Neural Networks (CNNs) on such tasks. To capture a feature representation, ViT tends to focus on the object shape more than the classic CNNs as shown in prior work. Subsequently, the derivative methods based on self-attention and those not based on self-attention have also been proposed. In this paper, we investigate the feature representations captured by the derivative methods of ViT in an image classification task. Specifically, we investigate the following using a publicly available ImageNet pre-trained model, i ) a feature representation of either an object’s shape or texture using the derivative methods with the SIN dataset, ii ) a classification without relying on object texture using the edge image made by the edge detection network, and iii ) the robustness of a different feature representation with a common perturbation and corrupted image. Our results indicate that the network which focused more on shapes had an effect captured feature representations more accurately in almost all the experiments.
Download

Paper Nr: 40
Title:

1D-SalsaSAN: Semantic Segmentation of LiDAR Point Cloud with Self-Attention

Authors:

Takahiro Suzuki, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Semantic segmentation on the three-dimensional (3D) point-cloud data acquired from omnidirectional light detection and ranging (LiDAR) identifies static objects, such as roads, and dynamic objects such as vehicles and pedestrians. This enables us to recognize the environment in all directions around a vehicle, which is necessary for autonomous driving. Processing such data requires a huge amount of computation. Therefore, methods have been proposed for converting 3D point-cloud data into pseudo-images and executing semantic segmentation to increase the processing speed. With these methods, a large amount of point-cloud data is lost when converting 3D point-cloud data to pseudo-images, which tends to decrease the identification accuracy of small objects, such as pedestrians and traffic signs, that cover only a few pixels. We propose a semantic segmentation method that involves projection using Scan-Unfolding and a 1D self-attention block based on the standard self-attention block. As a result of an evaluation using SemanticKITTI, we confirmed that the proposed method improves the accuracy of semantic segmentation, contributes to the improvement of small-object identification accuracy, and is sufficient regarding processing speed. We also showed that the proposed method is fast enough for real-time processing.
Download

Paper Nr: 43
Title:

Monocular Depth Estimation for Tilted Images via Gravity Rectifier

Authors:

Yuki Saito, Hideo Saito and Vincent Frémont

Abstract: Monocular depth estimation is a challenging task in computer vision. Although many approaches using Convolutional Neural Networks (CNNs) have been proposed, most of them are trained on large-scale datasets mainly composed of gravity-aligned images. Therefore, conventional approaches fail to predict reliable depth for tilted images containing large pitch and roll camera rotations. To tackle this problem, we propose a novel refining method based on the distribution of gravity directions in the training sets. We designed a gravity rectifier that is learned to transform the gravity direction of a tilted image into a rectified one that matches the gravity-aligned training data distribution. For the evaluation, we employed public datasets and also created our own dataset composed of large pitch and roll camera movements. Our experiments showed that our approach successfully rectified the camera rotation and outperformed our baselines, achieving a 29% improvement in abs rel over the vanilla model. Additionally, our method had accuracy comparable to state-of-the-art monocular depth prediction approaches that consider camera rotation.
Download

Paper Nr: 52
Title:

How far Generated Data Can Impact Neural Networks Performance?

Authors:

Sayeh Gholipour Picha, Dawood Al Chanti and Alice Caplier

Abstract: The success of deep learning models depends on the size and quality of the dataset to solve certain tasks. Here, we explore how far generated data can aid real data in improving the performance of Neural Networks. In this work, we consider facial expression recognition since it requires challenging local data generation at the level of local regions such as the mouth and eyebrows, rather than simple augmentation. Generative Adversarial Networks (GANs) provide an alternative method for generating such local deformations, but they need further validation. To answer our question, we consider non-complex Convolutional Neural Network (CNN) based classifiers for recognizing Ekman emotions. For the data generation process, we consider generating facial expressions (FEs) by relying on two GANs. The first generates a random identity while the second imposes facial deformations on top of it. We consider training the CNN classifier using FEs from: real faces, GAN-generated faces, and finally a combination of real and GAN-generated faces. We determine an upper bound on the quantity of generated data to be mixed with the real data that contributes the most to enhancing FER accuracy. In our experiments, we find that adding 5 times more synthetic data than the real FEs dataset increases accuracy by 16%.
Download

Paper Nr: 53
Title:

Object Detection in Floor Plans for Automated VR Environment Generation

Authors:

Timothée Fréville, Charles Hamesse, Bênoit Pairet and Rob Haelterman

Abstract: The development of visually compelling Virtual Reality (VR) environments for serious games is a complex task. Most environments are designed using game engines such as Unity or Unreal Engine and require hours if not days of work. However, most of the important information of indoor environments can be represented by floor plans. These have been used in architecture for centuries as a fast and reliable way of depicting building configurations. Therefore, the idea of easing the creation of VR-ready environments using floor plans is of great interest. In this paper we propose an automated framework to detect and classify objects in floor plans using a neural network trained with a custom floor plan dataset generator. We evaluate our system on three floor plan datasets: ROBIN (labelled), PFG (our own Procedural Floor plan Generation method) and 100 labelled samples from the CubiCasa dataset.
Download

Paper Nr: 56
Title:

MixedTeacher: Knowledge Distillation for Fast Inference Textural Anomaly Detection

Authors:

Simon Thomine, Hichem Snoussi and Mahmoud Soua

Abstract: For a very long time, unsupervised learning for anomaly detection has been at the heart of image processing research and a stepping stone for high performance industrial automation process. With the emergence of CNN, several methods have been proposed such as Autoencoders, GAN, deep feature extraction, etc. In this paper, we propose a new method based on the promising concept of knowledge distillation which consists of training a network (the student) on normal samples while considering the output of a larger pretrained network (the teacher). The main contributions of this paper are twofold: First, a reduced student architecture with optimal layer selection is proposed, then a new Student-Teacher architecture with network bias reduction combining two teachers is proposed in order to jointly enhance the performance of anomaly detection and its localization accuracy. The proposed texture anomaly detector has an outstanding capability to detect defects in any texture and a fast inference time compared to the SOTA methods.
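
The student-teacher principle behind this family of detectors can be sketched as follows: a small student is trained to regress the features of a frozen pretrained teacher on defect-free images, and the per-pixel feature discrepancy serves as the anomaly map at test time. The backbone, layer choice and student width below are assumptions, and the second teacher of the proposed method is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

# Frozen teacher: an ImageNet-pretrained backbone truncated after an early block.
teacher = nn.Sequential(
    *list(resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).children())[:5]
).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Small student producing feature maps of the same shape (width is an assumption).
student = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1),
)

def distillation_loss(x):
    # Train the student on normal (defect-free) textures only; at test time the
    # same per-pixel feature discrepancy is used as the anomaly map.
    with torch.no_grad():
        t = F.normalize(teacher(x), dim=1)
    s = F.normalize(student(x), dim=1)
    return ((t - s) ** 2).mean()

print(distillation_loss(torch.rand(4, 3, 256, 256)))
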
Download

Paper Nr: 63
Title:

Impact of Vehicle Speed on Traffic Signs Missed by Drivers

Authors:

Farzan Heidari and Michael A. Bauer

Abstract: A driver’s recognition of traffic signs while driving is a pivotal indicator of a driver’s attention to critical environmental information and can be a key element in Advanced Driver Assistance Systems (ADAS). In this study, we look at the impact of driving speed on a driver’s attention to traffic signs by considering signs missed. We adopt a very strict definition of "missing" in this work where a sign is considered "missed" if it does not fall under the gaze of a driver. We employ an accurate algorithm to detect traffic sign objects and then estimate the driver’s visual attention area. By intersecting this area with objects identified as traffic signs, we can estimate the number of missed traffic sign objects while driving at different ranges of speeds. The experimental results show that the vehicle speed has a negative impact on drivers missing or seeing traffic sign objects.
Download

Paper Nr: 64
Title:

Transfer Learning for Word Spotting in Historical Arabic Documents Based Triplet-CNN

Authors:

Abir Fathallah, Mounim A. El-Yacoubi and Najoua E. Ben Amara

Abstract: With the increasing number of digitized historical documents, information processing has become a fundamental task to exploit the information contained in these documents. Thus, it is important to develop efficient tools to analyze and recognize them. One of these means is word spotting, which has lately emerged as an active research area in historical document analysis. Various techniques have been successfully proposed to enhance the performance of word spotting systems. In this paper, an enhanced word spotting approach for historical Arabic documents is proposed. It involves improving the learned feature representations that characterize word images. The proposed approach is mainly based on transfer learning. More precisely, it consists in building an embedding space for word image representations from an online-trained triplet CNN, while performing transfer learning by leveraging the varied knowledge acquired from two different domains. The first domain is Hebrew handwritten documents, the second is English historical documents. We investigate the impact of each domain on improving the representation of Arabic word images. As a final step, to perform word spotting, the query word image along with all the reference word images is projected into the embedding space, where they are matched according to their embedding vectors. We evaluate our method on the historical Arabic VML-HD dataset and show that it significantly outperforms the state-of-the-art methods.
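
The embedding-and-matching pipeline described above can be sketched with a generic triplet-trained CNN: word images are projected into an embedding space with a triplet margin loss, and spotting reduces to ranking reference words by distance to the query embedding. The network, margin and image sizes below are illustrative assumptions.

import torch
import torch.nn as nn

# Triplet training of a word-image embedding and nearest-neighbour spotting.
embed = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(), nn.Linear(32 * 16, 128),
)
triplet_loss = nn.TripletMarginLoss(margin=0.2)

anchor, positive, negative = (torch.rand(8, 1, 64, 128) for _ in range(3))
loss = triplet_loss(embed(anchor), embed(positive), embed(negative))
loss.backward()

# Spotting: project the query and all reference word images, then rank the
# references by distance in the embedding space.
with torch.no_grad():
    query = embed(torch.rand(1, 1, 64, 128))
    refs = embed(torch.rand(100, 1, 64, 128))
    ranking = torch.cdist(query, refs).squeeze(0).argsort()
print(ranking[:5])   # indices of the five closest reference words
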
Download

Paper Nr: 75
Title:

Estimating Distances Between People Using a Single Overhead Fisheye Camera with Application to Social-Distancing Oversight

Authors:

Zhangchi Lu, Mertcan Cokbas, Prakash Ishwar and Janusz Konrad

Abstract: Unobtrusive monitoring of distances between people indoors is a useful tool in the fight against pandemics. A natural resource to accomplish this are surveillance cameras. Unlike previous distance estimation methods, we use a single, overhead, fisheye camera with wide area coverage and propose two approaches. One method leverages a geometric model of the fisheye lens, whereas the other method uses a neural network to predict the 3D-world distance from people-locations in a fisheye image. For evaluation, we collected a first-of-its-kind dataset, Distance Estimation between People from Overhead Fisheye cameras (DEPOF), using a single fisheye camera, that comprises a wide range of distances between people (1–58ft) and is publicly available. The algorithms achieve 20-inch average distance error and 95% accuracy in detecting social-distance violations.
Download

Paper Nr: 77
Title:

Banana Ripeness Level Classification Using a Simple CNN Model Trained with Real and Synthetic Datasets

Authors:

Luis E. Chuquimarca, Boris X. Vintimilla and Sergio A. Velastin

Abstract: The level of ripeness is essential in determining the quality of bananas. To correctly estimate banana maturity, the metrics of international marketing standards need to be considered. However, the process of assessing the maturity of bananas at an industrial level is still carried out using manual methods. The use of CNN models is an attractive tool to solve the problem, but there is a limitation regarding the availability of sufficient data to train these models reliably. On the other hand, existing CNN models in the state of the art, trained with the available data, have reported acceptable accuracy in identifying banana maturity. For this reason, this work presents the generation of a robust dataset that combines real and synthetic data for different levels of banana ripeness. In addition, it proposes a simple CNN architecture that is trained with synthetic data and then, using the transfer learning technique, improved to classify real data, managing to determine the ripeness level of the banana. The proposed CNN model is evaluated against several architectures, varying hyper-parameter configurations and optimizers. The results show that the proposed CNN model reaches a high accuracy of 0.917 and a fast execution time.
Download

Paper Nr: 85
Title:

FedBID and FedDocs: A Dataset and System for Federated Document Analysis

Authors:

Daniel Perazzo, Thiago de Souza, Pietro Masur, Eduardo de Amorim, Pedro de Oliveira, Kelvin Cunha, Lucas Maggi, Francisco Simões, Veronica Teichrieb and Lucas Kirsten

Abstract: Data privacy has recently become one of the main concerns for society and machine learning researchers. The question of privacy led to research in privacy-aware machine learning and, amongst many other techniques, one solution gaining ground is federated learning. In this machine learning paradigm, data does not leave the user’s device, with training happening on it and aggregated in a remote server. In this work, we present, to our knowledge, the first federated dataset for document classification: FedBID. To demonstrate how this dataset can be used for evaluating different techniques, we also developed a system, FedDocs, for federated learning for document classification. We demonstrate the characteristics of our federated dataset, along with different types of distributions possible to be created with our dataset. Finally, we analyze our system, FedDocs, in our dataset, FedBID, in multiple different scenarios. We analyze a federated setting with balanced categories, a federated setting with unbalanced classes, and, finally, simulating a siloed federated training. We demonstrate that FedBID can be used to analyze a federated learning algorithm. Finally, we hope the FedBID dataset allows more research in federated document classification. The dataset is available in https://github.com/voxarlabs/FedBID.
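
A typical server-side aggregation step for such a federated setup is FedAvg-style weighted averaging of the client models' weights; whether FedDocs uses exactly this rule is an assumption, so the sketch below is only meant to illustrate the paradigm.

import copy
import torch
import torch.nn as nn

def fed_avg(client_states, client_sizes):
    # FedAvg-style aggregation: weighted average of client state_dicts, with
    # weights proportional to each client's local dataset size.
    total = float(sum(client_sizes))
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(s[key].float() * (n / total)
                       for s, n in zip(client_states, client_sizes))
    return avg

clients = [nn.Linear(16, 4) for _ in range(3)]     # stand-ins for locally trained models
new_global = fed_avg([c.state_dict() for c in clients], client_sizes=[120, 80, 200])
global_model = nn.Linear(16, 4)
global_model.load_state_dict(new_global)           # start of the next federated round
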
Download

Paper Nr: 113
Title:

Real-World Case Study of a Deep Learning Enhanced Elderly Person Fall Video-Detection System

Authors:

Amal El Kaid, Karim Baïna, Jamal Baïna and Vincent Barra

Abstract: Recent large and rapid growth in the healthcare sector has contributed to an increase in the elderly population and an increase in life expectancy. One of the important study topics in this field is the automatic fall detection system. Video cameras have recently been extensively employed for applications in surveillance, the home, and healthcare. Therefore, smart fall detection systems focus on image and video analysis techniques. In this work, we study an actual vision-based fall detection system. It produces satisfactory outcomes, but there is still room for improvement. The system has a very high recall rate and can detect all falls, but it lacks precision and frequently reports false positives (more than 99 percent). In fact, due to the optimum camera quality, several ordinary activities with specific movements, such as wheelchair mobility or the light changing in an empty room, can be mistaken for falls. To address this problem and increase precision, we propose a post-processing approach, hybridizing a CNN model and a Haar Cascade Classifier to determine whether to confirm or reject an alert that has been identified as a fall. The system's effectiveness increases while false positives are reduced.
Download

Paper Nr: 122
Title:

Human Fall Detection from Sequences of Skeleton Features using Vision Transformer

Authors:

Ali Raza, Muhammad H. Yousaf, Sergio A. Velastin and Serestina Viriri

Abstract: Detecting human falls is an exciting topic that can be approached in a number of ways, and several approaches have been suggested in recent years. These methods aim at determining whether a person is walking normally, standing, or falling, among other activities. The detection of falls in the elderly population is essential for preventing major medical consequences, and early intervention mitigates the effects of such accidents. However, this requires the medical team to be very vigilant and monitor people constantly, which is time-consuming, expensive, intrusive and not always accurate. In this paper, we propose an approach to automatically identify human fall activity using visual data in order to warn the appropriate caregivers and authorities in a timely manner. The proposed approach detects human falls using a vision transformer: a multi-headed transformer encoder model learns typical human behaviour based on skeletonized human data. The proposed method has been evaluated on the UR-Fall and UP-Fall datasets, reaching accuracies of 96.12% and 97.36%, respectively, using RP normalization and linear interpolation, which is comparable to state-of-the-art methods.
Download

Paper Nr: 134
Title:

Robust Semi-Supervised Anomaly Detection via Adversarially Learned Continuous Noise Corruption

Authors:

Jack W. Barker, Neelanjan Bhowmik, Yona Falinie A. Gaus and Toby P. Breckon

Abstract: Anomaly detection is the task of recognising novel samples which deviate significantly from pre-established normality. Abnormal classes are not present during training, meaning that models must learn effective representations solely from normal class data samples. Deep Autoencoders (AE) have been widely used for anomaly detection tasks but suffer from overfitting to a null identity function. To address this problem, we implement a training scheme applied to a Denoising Autoencoder (DAE) which introduces an efficient method of producing Adversarially Learned Continuous Noise (ALCN) to maximally corrupt the input globally prior to denoising. Prior methods have applied similar adversarial training approaches to increase the robustness of DAE; however, they exhibit limitations such as slow inference speed, reducing their real-world applicability, or produce generalised obfuscation that is more trivial to denoise. We show through rigorous evaluation that our ALCN method of regularisation during training improves AUC performance during inference while remaining efficient, both over classical leave-one-out novelty detection tasks with the variations 9 (normal) vs. 1 (abnormal) and 1 (normal) vs. 9 (abnormal) (MNIST AUCavg: 0.890 and 0.989; CIFAR-10 AUCavg: 0.670 and 0.742), and over challenging real-world anomaly detection tasks: industrial inspection (MVTEC-AD AUCavg: 0.780) and plant disease detection (Plant Village AUC: 0.770), when compared to prior approaches.
Download
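As a rough conceptual illustration of the idea described above (a noise generator trained adversarially against a denoising autoencoder), here is a toy training loop. The architectures, sizes, and loss arrangement are assumptions for illustration only and do not reproduce the authors' ALCN method.

```python
# Conceptual sketch of adversarially learned input corruption for a denoising
# autoencoder: the generator produces noise the autoencoder struggles to remove,
# and the autoencoder learns to denoise the adversarially corrupted input.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(          # toy DAE over flattened 28x28 images
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)
noise_gen = nn.Sequential(            # maps latent codes to an additive corruption
    nn.Linear(64, 784), nn.Tanh(),
)
opt_ae = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(noise_gen.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(x):                    # x: (batch, 784) normal-class samples
    z = torch.randn(x.size(0), 64)

    # Generator step: maximize the reconstruction error of the corrupted input.
    g_loss = -mse(autoencoder((x + noise_gen(z)).clamp(0, 1)), x)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # Autoencoder step: denoise the adversarially corrupted input back to x.
    noisy = (x + noise_gen(z).detach()).clamp(0, 1)
    ae_loss = mse(autoencoder(noisy), x)
    opt_ae.zero_grad(); ae_loss.backward(); opt_ae.step()
    return ae_loss.item()
```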

Paper Nr: 135
Title:

An End-to-End Multi-Task Learning Model for Image-based Table Recognition

Authors:

Nam T. Ly and Atsuhiro Takasu

Abstract: Image-based table recognition is a challenging task due to the diversity of table styles and the complexity of table structures. Most previous methods follow a non-end-to-end approach which divides the problem into two separate sub-problems, table structure recognition and cell-content recognition, and then attempts to solve each sub-problem independently using two separate systems. In this paper, we propose an end-to-end multi-task learning model for image-based table recognition. The proposed model consists of one shared encoder, one shared decoder, and three separate decoders used for learning the three sub-tasks of table recognition: table structure recognition, cell detection, and cell-content recognition. The whole system can be easily trained and run end-to-end. In the experiments, we evaluate the performance of the proposed model on two large-scale datasets: FinTabNet and PubTabNet. The experimental results show that the proposed model outperforms state-of-the-art methods on all benchmark datasets.
Download

Paper Nr: 138
Title:

Multi-Scale Feature Based Fashion Attribute Extraction Using Multi-Task Learning for e-Commerce Applications

Authors:

Viral Parekh and Karimulla Shaik

Abstract: Visual attribute extraction of products from their images is an essential component of e-commerce applications such as easy cataloging, catalog enrichment, and visual search. In general, product attributes are a mixture of coarse-grained and fine-grained classes and cover a mixture of small (for example, neck type or sleeve length of top-wear) and large (for example, the print pattern on apparel) regions of the product, which makes attribute extraction even more challenging. In spite of these challenges, it is important to extract the attributes with high accuracy and low latency. We therefore model attribute extraction as a classification problem with multi-task learning, where each attribute is a task. This paper proposes solutions to the above-mentioned challenges through multi-scale feature extraction using a Feature Pyramid Network (FPN) along with attention and feature fusion in a multi-task setup. We experimented incrementally with various ways of extracting multi-scale features. We use our in-house fashion category dataset and iMaterialist 2021 for visual attribute extraction to show the efficacy of our approaches. We observed, on average, a ∼4% improvement in the F1 scores of different product attributes on both datasets compared to the baseline.
Download

Paper Nr: 145
Title:

Domain Adaptive Pedestrian Detection Based on Semantic Concepts

Authors:

Patrick Feifel, Frank Bonarens and Frank Köster

Abstract: Pedestrian detection is a highly complex task due to the wide variety of pedestrian appearances and postures as well as environmental conditions. Building a sufficient real-world dataset is labor-intensive and costly. Thus, the application of synthetic data is promising, but deep neural networks show a lack of generalization when trained solely on synthetic data. In our work, we propose a novel method for concept-based domain adaptation for pedestrian detection (ConDA). In addition to the 2D bounding box prediction, an auxiliary body part segmentation exploits discriminative features of semantic concepts of pedestrians. Inspired by approaches to the inherent interpretability of DNNs, ConDA strengthens generalization by enforcing a high intra-class concentration and inter-class separation of extracted body part features in the latent space. We report performance results for various training strategies, feature extraction methods and backbones for ConDA on the real-world CityPersons dataset.
Download

Paper Nr: 147
Title:

A Robust Deep Learning-Based Video Watermarking Using Mosaic Generation

Authors:

Souha Mansour, Saoussen Ben Jabra and Ezzedine Zagrouba

Abstract: Recently, digital watermarking has benefited from the rise of deep learning and machine learning approaches. Even though effective deep learning-based watermarking techniques have been proposed for images, video still introduces extra difficulties, such as motion, temporal consistency, and spatial location. In this paper, a robust and imperceptible deep-learning-based video watermarking method based on a CNN architecture and mosaic generation is proposed. The proposed approach is decomposed into two main steps: mosaic generation and signature embedding. The latter includes four stages: pre-processing networks for both the obtained mosaic and the watermark, an embedding network, attack simulation, and an extraction network. The main purpose of mosaic generation is to create an image from the original video and to provide robustness against malicious attacks, particularly collusion attacks. A CNN architecture is used to embed the signature so as to optimize the trade-off between invisibility and robustness. The proposed solution outperforms both traditional and deep learning video watermarking methods, according to experimental evaluations on a variety of distortions.
Download

Paper Nr: 152
Title:

CrowdSim2: An Open Synthetic Benchmark for Object Detectors

Authors:

Paweł Foszner, Agnieszka Szczęsna, Luca Ciampi, Nicola Messina, Adam Cygan, Bartosz Bizoń, Michał Cogiel, Dominik Golba, Elżbieta Macioszek and Michał Staniszewski

Abstract: Data scarcity has become one of the main obstacles to developing supervised models based on Artificial Intelligence in Computer Vision. Indeed, Deep Learning-based models systematically struggle when applied in new scenarios never seen during training and may not be adequately tested in non-ordinary yet crucial real-world situations. This paper presents and publicly releases CrowdSim2, a new synthetic collection of images suitable for people and vehicle detection gathered from a simulator based on the Unity graphical engine. It consists of thousands of images gathered from various synthetic scenarios resembling the real world, where we varied some factors of interest, such as the weather conditions and the number of objects in the scenes. The labels are automatically collected and consist of bounding boxes that precisely localize objects belonging to the two object classes, leaving out humans from the annotation pipeline. We exploited this new benchmark as a testing ground for some state-of-the-art detectors, showing that our simulated scenarios can be a valuable tool for measuring their performances in a controlled environment.
Download

Paper Nr: 183
Title:

A Data Augmentation Strategy for Improving Age Estimation to Support CSEM Detection

Authors:

Deisy Chaves, Nancy Agarwal, Eduardo Fidalgo and Enrique Alegre

Abstract: Leveraging image-based age estimation to prevent the spread of Child Sexual Exploitation Material (CSEM) over the internet has not been investigated thoroughly by the research community. While deep learning methods are considered state-of-the-art for general age estimation, they perform poorly in predicting the age group of minors and older adults due to the few examples of these age groups in existing datasets. In this work, we present a data augmentation strategy, based on synthetic image generation and artificial facial occlusion, to improve the performance of age estimators trained on imbalanced data. We focus on modelling facial occlusion because CSEM criminals tend to cover certain parts of the victim, such as the eyes, to hide their identity. The proposed strategy is evaluated using the Soft Stagewise Regression Network (SSR-Net), a compact age estimator, and three publicly available datasets composed mainly of non-occluded images. In addition, we create the Synthetic Augmented with Occluded Faces (SAOF-15K) dataset to assess performance on eye- and mouth-occluded images. Results show that our strategy improves the performance of the evaluated age estimator.
Download

Paper Nr: 185
Title:

Shuffle Mixing: An Efficient Alternative to Self Attention

Authors:

Ryouichi Furukawa and Kazuhiro Hotta

Abstract: In this paper, we propose ShuffleFormer, which replaces the Transformer’s Self Attention with the proposed shuffle mixing. ShuffleFormer can be flexibly incorporated as the backbone for conventional visual recognition, precise prediction, and other tasks. Self Attention learns globally and dynamically, while shuffle mixing employs Depth-Wise Convolution to learn locally and statically. Depth-Wise Convolution does not consider the relationship between channels because the convolution is applied to each channel individually. Shuffle mixing therefore obtains information from different channels, without changing the computational cost, by inserting a spatial shift operation applied to components along the channel dimension. However, when the shift operation is used, the amount of spatial information obtained is less than with Depth-Wise Convolution. ShuffleFormer uses overlapped patch embedding with a kernel larger than the stride width to reduce the resolution, thereby compensating for this disadvantage of the shift operation by extracting more features in the spatial direction. We evaluated ShuffleFormer on ImageNet-1K image classification and ADE20K semantic segmentation. ShuffleFormer achieves superior results compared to the Swin Transformer. In particular, ShuffleFormer-Base/Light outperforms Swin-Base in accuracy at about two-thirds of the computational cost.
Download
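To make the described mechanism concrete, the following PyTorch block sketches one plausible reading of shuffle mixing: a channel-grouped spatial shift followed by a depth-wise convolution. The group count, shift directions, and kernel size are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a shuffle-mixing-style block: a channel-grouped spatial
# shift mixes information across channels at no extra cost, then a depth-wise
# convolution performs local, static spatial mixing.
import torch
import torch.nn as nn

class ShuffleMix(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        g = c // 4
        shifted = x.clone()
        # Shift four channel groups one pixel in four spatial directions.
        shifted[:, 0*g:1*g] = torch.roll(x[:, 0*g:1*g], shifts=1,  dims=2)
        shifted[:, 1*g:2*g] = torch.roll(x[:, 1*g:2*g], shifts=-1, dims=2)
        shifted[:, 2*g:3*g] = torch.roll(x[:, 2*g:3*g], shifts=1,  dims=3)
        shifted[:, 3*g:4*g] = torch.roll(x[:, 3*g:4*g], shifts=-1, dims=3)
        return self.dwconv(shifted)

x = torch.randn(1, 64, 56, 56)
print(ShuffleMix(64)(x).shape)                 # torch.Size([1, 64, 56, 56])
```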

Paper Nr: 186
Title:

Semantic Segmentation by Semi-Supervised Learning Using Time Series Constraint

Authors:

Takahiro Mano, Sota Kato and Kazuhiro Hotta

Abstract: In this paper, we propose a method to improve the accuracy of semantic segmentation when the number of training images is limited. When time-series information such as video is available, images that are close in the time series are expected to be similar to each other, so pseudo-labels can be assigned to those images easily and with high accuracy. In other words, if pseudo-labels are assigned to images in time-series order, high-accuracy pseudo-labels can be collected efficiently. As a result, segmentation accuracy can be improved even when the number of training images is limited. We evaluated our method on the CamVid dataset to confirm its effectiveness, and confirmed that the segmentation accuracy of the proposed method is much improved in comparison with a baseline without pseudo-labels.
Download
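A simplified sketch of the time-ordered pseudo-labelling idea described above follows. The helpers `train` and `predict_mask`, the number of rounds, and the data format are placeholders for illustration; the paper's exact procedure may differ.

```python
# Time-ordered pseudo-labelling: frames closest in time to already-labelled
# frames are pseudo-labelled first, since they are assumed to be the most similar.
def time_ordered_pseudo_labelling(model, labelled, unlabelled, train, predict_mask,
                                  rounds=3):
    """labelled: list of (time_index, image, mask); unlabelled: list of (time_index, image)."""
    for _ in range(rounds):
        train(model, labelled)                           # (re)train on current labels
        labelled_times = [t for t, _, _ in labelled]
        # Sort unlabelled frames by temporal distance to the nearest labelled frame.
        unlabelled.sort(key=lambda item: min(abs(item[0] - t) for t in labelled_times))
        take = max(1, len(unlabelled) // rounds)
        batch, unlabelled = unlabelled[:take], unlabelled[take:]
        # Frames close in time get pseudo-labels from the current model.
        labelled += [(t, img, predict_mask(model, img)) for t, img in batch]
        if not unlabelled:
            break
    return model, labelled
```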

Paper Nr: 189
Title:

Joint Training of Product Detection and Recognition Using Task-Specific Datasets

Authors:

Floris De Feyter and Toon Goedemé

Abstract: Training a single model jointly for detection and recognition is typically done with a dataset that is fully annotated, i.e., the annotations consist of boxes with class labels. In the case of retail product detection and recognition, however, developing such a dataset is very expensive due to the large variety of products. It would be much more cost-efficient and scalable if we could employ two task-specific datasets: one detection-only and one recognition-only dataset. Unfortunately, experiments indicate a significant drop in performance when models are trained on task-specific data. Due to the potential cost savings, we are convinced that more research should be done on this matter and, therefore, we propose a set of training procedures that allows us to carefully investigate the differences between training with fully-annotated vs. task-specific data. We demonstrate this on a product detection and recognition dataset and thereby reveal one of the core issues inherent to task-specific training. We hope that our results will motivate and inspire researchers to further investigate the problem of employing task-specific datasets to train joint detection and recognition models.
Download

Paper Nr: 191
Title:

The Effect of Covariate Shift and Network Training on Out-of-Distribution Detection

Authors:

Simon Mariani, Sander R. Klomp, Rob Romijnders and Peter H. N. de With

Abstract: The field of Out-of-Distribution (OOD) detection aims to separate OOD data from in-distribution (ID) data in order to make safe predictions. With the increasing application of Convolutional Neural Networks (CNNs) in sensitive environments such as autonomous driving and security, this field is bound to become indispensable in the future. Although the OOD detection field has made some progress in recent years, a fundamental understanding of the underlying phenomena enabling the separation of datasets remains lacking. We find that OOD detection relies heavily on the covariate shift of the data and not so much on the semantic shift, i.e. a CNN does not carry explicit semantic information and relies solely on differences in features. Although these features can be affected by the underlying semantics, this relation does not seem strong enough to rely on. Conversely, since the CNN training setup determines which features are learned, it is an important factor for OOD performance: we found that variations in model training can increase or decrease OOD detection performance. Through this insight, we obtain an increase in OOD detection performance on the common OOD detection benchmarks by changing the training procedure and using the simple Maximum Softmax Probability (MSP) model introduced by (Hendrycks and Gimpel, 2016). We hope to inspire others to look more closely into the fundamental principles underlying the separation of two datasets. The code for reproducing our results can be found at https://github.com/SimonMariani/OOD-detection.
Download
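For readers unfamiliar with the MSP baseline mentioned above, here is a minimal sketch of how the Maximum Softmax Probability score is typically computed; the classifier and threshold are placeholders, and the exact detection pipeline of the paper is not reproduced here.

```python
# Maximum Softmax Probability (MSP): inputs whose highest softmax probability
# falls below a threshold are flagged as out-of-distribution.
import torch
import torch.nn.functional as F

def msp_is_ood(classifier, x, threshold=0.5):
    """classifier: any model returning logits of shape (batch, num_classes)."""
    with torch.no_grad():
        probs = F.softmax(classifier(x), dim=1)
        msp = probs.max(dim=1).values            # confidence of the predicted class
    return msp < threshold                        # True -> flagged as OOD
```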

Paper Nr: 194
Title:

Improvement of Vision Transformer Using Word Patches

Authors:

Ayato Takama, Sota Kato, Satoshi Kamiya and Kazuhiro Hotta

Abstract: Vision Transformer achieves higher accuracy on image classification than conventional convolutional neural networks. However, Vision Transformer requires more training images than conventional networks. Since there is no clear concept of words in images, we create Visual Words by cropping training images and clustering the crops using K-means, as in bag-of-visual-words, and incorporate them into the Vision Transformer as “Word Patches” to improve accuracy. We also try trainable words instead of visual words obtained by clustering. Experiments were conducted to confirm the effectiveness of the proposed method. When the Word Patches are trainable parameters, the accuracy improved from 84.16% to 87.35% on the Food101 dataset.
Download
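The visual-words construction mentioned above is essentially crop-and-cluster. Below is a minimal sketch of that step; the patch size, number of crops, and word count are illustrative assumptions, and how the resulting centres are injected into the ViT is not shown.

```python
# Build "visual words" by cropping patches from training images and clustering
# them with K-means (bag-of-visual-words style). The cluster centres could then
# be supplied to a ViT as extra word patches.
import numpy as np
from sklearn.cluster import KMeans

def build_visual_words(images, patch_size=16, patches_per_image=20, num_words=64,
                       rng=np.random.default_rng(0)):
    """images: iterable of HxWx3 uint8 arrays; returns (num_words, patch_size*patch_size*3)."""
    crops = []
    for img in images:
        h, w = img.shape[:2]
        for _ in range(patches_per_image):
            y = rng.integers(0, h - patch_size + 1)
            x = rng.integers(0, w - patch_size + 1)
            crops.append(img[y:y + patch_size, x:x + patch_size].reshape(-1))
    crops = np.asarray(crops, dtype=np.float32) / 255.0
    kmeans = KMeans(n_clusters=num_words, n_init=10, random_state=0).fit(crops)
    return kmeans.cluster_centers_              # the visual "word patches"
```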

Paper Nr: 202
Title:

Algorithmic Fairness Applied to the Multi-Label Classification Problem

Authors:

Ana Paula S. Dantas, Gabriel Bianchin de Oliveira, Daiane Mendes de Oliveira, Helio Pedrini, Cid C. de Souza and Zanoni Dias

Abstract: In recent years, concern for algorithmic fairness has been increasing. Given that decision-making algorithms are intrinsically embedded in our lives, their biases become more harmful. To prevent a model from displaying bias, we consider the coverage of the training set to be an important factor. We define a problem called Fairer Coverage (FC) that aims to select the fairest training subset. We present a mathematical formulation for this problem and a protocol to translate a dataset into an instance of FC. We also present a case study applying our method to the Single Cell Classification Problem. Experiments showed that our method improves the overall quality of the classification while also increasing the classification quality for smaller, underrepresented classes in the dataset.
Download

Paper Nr: 220
Title:

VK-SITS: Variable Kernel Speed Invariant Time Surface for Event-Based Recognition

Authors:

Laure Acin, Pierre Jacob, Camille Simon-Chane and Aymeric Histace

Abstract: Event-based cameras are recent non-conventional sensors which offer a new perception of movement with low latency, high power efficiency, high dynamic range and high temporal resolution. However, event data is asynchronous and sparse, so standard machine learning and deep learning tools are not optimal for this data format. A first step of event-based processing often consists in generating image-like representations from events, such as time surfaces. Such event representations are usually proposed for specific applications, and the representations and learning algorithms are most often evaluated together. Furthermore, these methods are often evaluated in a non-rigorous way (i.e. by performing the validation on the testing set). We propose a generic event representation for multiple applications: a trainable extension of the Speed Invariant Time Surface, coined VK-SITS. This speed- and spatially-invariant framework is computationally fast and GPU-friendly. A second contribution is a new benchmark based on 10-fold cross-validation to better evaluate event-based representations on the DVS128 Gesture and N-Caltech101 recognition datasets. Our VK-SITS event-based representation improves the recognition performance of state-of-the-art methods.
Download

Paper Nr: 222
Title:

Synthetic Driver Image Generation for Human Pose-Related Tasks

Authors:

Romain Guesdon, Carlos Crispim-Junior and Laure T. Rodet

Abstract: Interest in driver monitoring has grown recently, especially in the context of autonomous vehicles. However, training deep neural networks for computer vision requires ever more images with significant diversity, which does not match the reality of the field. This lack of data prevents networks from being properly trained for certain complex tasks, such as human pose transfer, which aims to produce an image of a person in a target pose from another image of the same person. To tackle this problem, we propose a new synthetic dataset for pose-related tasks. Using a straightforward pipeline to increase the variety between images, we generate 200k images with a hundred human models in different cars, environments, lighting conditions, etc. We measure the quality of the images in our dataset and compare it with other datasets from the literature. We also train a network for human pose transfer in the synthetic domain using our dataset. Results show that our dataset matches the quality of existing datasets and that it can be used to properly train a network on a complex task. We make both the images with pose annotations and the generation scripts publicly available.
Download

Paper Nr: 241
Title:

Crane Spreader Pose Estimation from a Single View

Authors:

Maria Pateraki, Panagiotis Sapoutzoglou and Manolis Lourakis

Abstract: This paper presents a methodology for inferring the full 6D pose of a container crane spreader from a single image and reports on its application to real-world imagery. A learning-based approach is adopted that starts by constructing a photorealistically textured 3D model of the spreader. This model is then employed to generate a set of synthetic images that are used to train a state-of-the-art object detection method. Online operation establishes image-model correspondences, which are used to infer the spreader’s 6D pose. The performance of the approach is quantitatively evaluated through extensive experiments conducted with real images.
Download

Paper Nr: 263
Title:

WSAM: Visual Explanations from Style Augmentation as Adversarial Attacker and Their Influence in Image Classification

Authors:

Felipe Moreno-Vera, Edgar Medina and Jorge Poco

Abstract: Style augmentation is currently attracting attention because convolutional neural networks (CNNs) are strongly biased toward recognizing textures rather than shapes. Most existing styling methods either perform a low-fidelity style transfer or produce a weak style representation in the embedding vector. This paper outlines a style augmentation algorithm that uses stochastic sampling with noise addition to improve the randomization of a general linear transformation for style transfer. With our augmentation strategy, all models not only exhibit strong robustness against image stylizing but also outperform all previous methods, surpassing state-of-the-art performance on the STL-10 dataset. In addition, we present an analysis of the model interpretations under different style variations and report comprehensive experiments demonstrating the performance when applied to deep neural architectures in different training settings.
Download

Paper Nr: 267
Title:

Towards an Automatic System for Generating Synthetic and Representative Facial Data for Anonymization

Authors:

Natália C. Meira, Ricardo M. Santos, Mateus C. Silva, Eduardo S. Luz and Ricardo R. Oliveira

Abstract: Deep learning models based on autoencoders and generative adversarial networks (GANs) have enabled increasingly realistic face swapping. Surveillance cameras that detect people and faces to monitor human behavior are becoming more common, and training AI models for these detection and monitoring tasks requires large sets of facial data representing ethnic, gender, and age diversity. In this work, we propose using generative facial manipulation techniques to build a new, representative data augmentation set for deep learning training on face-related tasks. In the step presented here, we implemented one of the best-known face-swapping architectures to demonstrate an application for anonymizing personal data and generating synthetic data from images of drivers’ faces during their work activity. Our case study generated synthetic facial data from a driver at work, and the results were convincing in terms of facial replacement and preservation of the driver’s expression.
Download

Paper Nr: 269
Title:

FPCD: An Open Aerial VHR Dataset for Farm Pond Change Detection

Authors:

Chintan Tundia, Rajiv Kumar, Om Damani and G. Sivakumar

Abstract: Change detection for aerial imagery involves locating and identifying changes associated with the areas of interest between co-registered bi-temporal or multi-temporal images of a geographical location. Farm ponds are man-made structures belonging to the category of minor irrigation structures used to collect surface run-off water for future irrigation purposes. Detection of farm ponds from aerial imagery and their evolution over time helps in land surveying to analyze the agricultural shifts, policy implementation, seasonal effects and climate changes. In this paper, we introduce a publicly available object detection and instance segmentation (OD/IS) dataset for localizing farm ponds from aerial imagery. We also collected and annotated the bi-temporal data over a time-span of 14 years across 17 villages, resulting in a binary change detection dataset called Farm Pond Change Detection Dataset (FPCD). We have benchmarked and analyzed the performance of various object detection and instance segmentation methods on our OD/IS dataset and the change detection methods over the FPCD dataset. The datasets are publicly accessible at this page: https://huggingface.co/datasets/ctundia/FPCD.
Download

Paper Nr: 271
Title:

DEff-GAN: Diverse Attribute Transfer for Few-Shot Image Synthesis

Authors:

Rajiv Kumar and G. Sivakumar

Abstract: The requirement for large amounts of data is a major difficulty in training many GANs. Data-efficient GANs involve fitting a generator’s continuous target distribution with a limited discrete set of data samples, which is a difficult task. Single-image methods have focused on modelling the internal distribution of a single image and generating its samples. While single-image methods can synthesize image samples with diversity, they do not model multiple images or capture the inherent relationship possible between two images. Given only a handful of images, we are interested in generating samples and exploiting the commonalities in the input images. In this work, we extend the single-image GAN method to model multiple images for sample synthesis. We modify the discriminator with an auxiliary classifier branch, which helps to generate a wide variety of samples and to classify the input labels. Our Data-Efficient GAN (DEff-GAN) generates excellent results when similarities and correspondences can be drawn between the input images or classes.
Download

Paper Nr: 286
Title:

Towards Human-Interpretable Prototypes for Visual Assessment of Image Classification Models

Authors:

Poulami Sinhamahapatra, Lena Heidemann, Maureen Monnet and Karsten Roscher

Abstract: Explaining black-box Artificial Intelligence (AI) models is a cornerstone for trustworthy AI and a prerequisite for its use in safety-critical applications, so that AI models can reliably assist humans in critical decisions. However, instead of trying to explain our models post-hoc, we need models which are interpretable-by-design, built on a reasoning process similar to humans that exploits meaningful high-level concepts such as shapes, texture or object parts. Learning such concepts is often hindered by the need for explicit specification and annotation up front. Instead, prototype-based learning approaches such as ProtoPNet claim to discover visually meaningful prototypes in an unsupervised way. In this work, we propose a set of properties that such prototypes have to fulfill to enable human analysis, e.g. as part of a reliable model assessment case, and analyse existing methods in the light of these properties. In a ‘Guess who?’ game, we find that these prototypes still have a long way to go towards definitive explanations. We quantitatively validate our findings by conducting a user study indicating that many of the learnt prototypes are not considered useful for human understanding. We discuss the missing links in the existing methods and present a potential real-world application motivating the need to progress towards truly human-interpretable prototypes.
Download

Paper Nr: 287
Title:

Curriculum Learning for Compositional Visual Reasoning

Authors:

Wafa Aissa, Marin Ferecatu and Michel Crucianu

Abstract: Visual Question Answering (VQA) is a complex task requiring large datasets and expensive training. Neural Module Networks (NMN) first translate the question into a reasoning path, then follow that path to analyze the image and provide an answer. We propose an NMN method that relies on predefined cross-modal embeddings to “warm start” learning on the GQA dataset, and then focuses on Curriculum Learning (CL) as a way to improve training and make better use of the data. Several difficulty criteria are employed for defining CL methods. We show that with an appropriate choice of CL method, the cost of training and the amount of training data can be greatly reduced, with a limited impact on the final VQA accuracy. Furthermore, we introduce intermediate losses during training and find that this allows us to simplify the CL strategy.
Download

Paper Nr: 289
Title:

End-to-End Gaze Grounding of a Person Pictured from Behind

Authors:

Hayato Yumiya, Daisuke Deguchi, Yasutomo Kawanishi and Hiroshi Murase

Abstract: In this study, we address a novel problem of end-to-end gaze grounding, which estimates the area of the object at which a person in an image is gazing, focusing especially on images of people seen from behind. Existing methods usually first estimate facial information such as eye gaze and face orientation, and then estimate the area at which the target person is gazing; they do not work when a person is pictured from behind. We instead focus on the individual’s posture, a feature that can be obtained even from behind. Posture changes depending on where a person is looking, although this varies from person to person. We propose an end-to-end model designed to estimate the area at which a person is gazing from their 3D posture. To minimize differences between individuals, we also introduce a Posture Embedding Encoder Module as a metric learning module. To evaluate the proposed method, we constructed an experimental environment in which a person gazed at a certain object on a shelf, and built a dataset consisting of pairs of 3D skeletons and gazes. In an evaluation on this dataset, we confirmed that the proposed method can estimate the area at which a person is gazing from behind.
Download

Paper Nr: 291
Title:

Human Motion Prediction on the IKEA-ASM Dataset

Authors:

Mattias Billast, Kevin Mets, Tom De Schepper, José Oramas and Steven Latré

Abstract: Human motion prediction estimates future poses based on the preceding poses. It is a stepping stone toward industrial applications such as human-robot interaction and ergonomics indicators. The goal is to minimize the error in predicted joint positions on the IKEA-ASM dataset, which resembles assembly use cases with a high diversity of execution and background within the same action class. In this paper, we use the STS-GCN model to tackle 2D motion prediction and make various alterations to improve the performance of the model. First, we pre-processed the training dataset through filtering to remove outliers and inconsistencies, boosting performance by 31%. Second, we added object gaze information to give more context to the body motion of the subject, which lowers the error (MPJPE) to 10.1618 compared to 18.3462 without object gaze information; this increased performance indicates a correlation between object gaze and body motion. Lastly, the over-smoothing of the Graph Convolutional Network embeddings is reduced by limiting the number of layers, providing richer joint embeddings.
Download

Paper Nr: 6
Title:

Combining Metric Learning and Attention Heads for Accurate and Efficient Multilabel Image Classification

Authors:

Kirill Prokofiev and Vladislav Sovrasov

Abstract: Multi-label image classification predicts a set of labels from a given image. Unlike multi-class classification, where only one label per image is assigned, such a setup is applicable to a broader range of applications. In this work, we revisit two popular approaches to multi-label classification: transformer-based heads and label-relation graph processing branches. Although transformer-based heads are considered to achieve better results than graph-based branches, we argue that, with a proper training strategy, graph-based methods can demonstrate only a small accuracy drop while spending fewer computational resources on inference. In our training strategy, instead of Asymmetric Loss (ASL), the de-facto standard for multi-label classification, we introduce its metric learning modification. In each binary classification sub-problem it operates on L2-normalized feature vectors coming from a backbone and enforces the angles between the normalized representations of positive and negative samples to be as large as possible. This provides better discrimination ability than binary cross-entropy loss on unnormalized features. With the proposed loss and training strategy, we obtain SOTA results among single-modality methods on widespread multi-label classification benchmarks such as MS-COCO, PASCAL-VOC, NUS-WIDE and Visual Genome 500. The source code of our method is available as a part of the OpenVINO™ Training Extensions.
Download
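One possible reading of the loss described above is sketched below: each label gets a class embedding, both features and embeddings are L2-normalized, and a scaled cosine similarity is fed to a per-label binary loss. The scale value and the plain binary cross-entropy used here are illustrative simplifications, not the paper's exact loss.

```python
# Per-label binary classification on L2-normalized features and class embeddings:
# logits are scaled cosine similarities, so the loss acts on angles between
# the feature and each class representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedMultiLabelHead(nn.Module):
    def __init__(self, feat_dim, num_labels, scale=20.0):
        super().__init__()
        self.class_emb = nn.Parameter(torch.randn(num_labels, feat_dim))
        self.scale = scale

    def forward(self, features):                      # features: (B, feat_dim)
        f = F.normalize(features, dim=1)
        w = F.normalize(self.class_emb, dim=1)
        return self.scale * f @ w.t()                 # (B, num_labels) cosine logits

head = NormalizedMultiLabelHead(feat_dim=512, num_labels=80)
logits = head(torch.randn(4, 512))
targets = torch.randint(0, 2, (4, 80)).float()
loss = F.binary_cross_entropy_with_logits(logits, targets)
```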

Paper Nr: 23
Title:

Exploring Deep Learning Capabilities for Coastal Image Segmentation on Edge Devices

Authors:

Jonay Suárez-Ramírez, Alejandro Betancor-Del-Rosario, Daniel Santana-Cedrés and Nelson Monzón

Abstract: Artificial Intelligence (AI) has become a revolutionary tool in multiple fields in the last decade. The appearance of hardware with improved capabilities has paved the way to apply image processing based on Deep Neural Networks to more complex tasks with lower costs. Nevertheless, some environments, such as remote areas, require the use of edge devices. Consequently, the algorithms must be suited to platforms with more constrained resources. This is crucial in the development of AI systems in seaside zones. In our work, we compare a wide range of recent state-of-the-art Deep Learning models for Semantic Segmentation over edge devices. Such segmentation techniques provide a better scene understanding, in particular in complex areas, providing pixel-level detection and classification. In this regard, coastal environments represent a clear example, where more specific tasks can be performed from these approaches, such as littering detection, surveillance, and shoreline changes, among many others.
Download

Paper Nr: 39
Title:

Combining Two Adversarial Attacks Against Person Re-Identification Systems

Authors:

Eduardo O. Andrade, Igor B. Sampaio, Joris Guérin and José Viterbo

Abstract: The field of Person Re-Identification (Re-ID) has received much attention recently, driven by the progress of deep neural networks, especially for image classification. The problem of Re-ID consists in identifying individuals through images captured by surveillance cameras in different scenarios. Governments and companies are investing a lot of time and money in Re-ID systems for use in public safety and for identifying missing persons. However, several challenges remain for successfully implementing Re-ID, such as occlusions and light reflections in people’s images. In this work, we focus on adversarial attacks on Re-ID systems, which can be a critical threat to the performance of these systems. In particular, we explore combining adversarial attacks against Re-ID models, aiming to amplify the degradation of classification results. We conduct our experiments on three datasets: DukeMTMC-ReID, Market-1501, and CUHK03. We combine two types of adversarial attacks, P-FGSM and Deep Mis-Ranking, applied to two popular Re-ID models: IDE (ResNet-50) and AlignedReID. The best result demonstrates a decrease of 3.36% in the Rank-10 metric for AlignedReID applied to CUHK03. We also try using Dropout during inference as a defense method.
Download

Paper Nr: 45
Title:

Overcome Ethnic Discrimination with Unbiased Machine Learning for Facial Data Sets

Authors:

Michael Danner, Bakir Hadžić, Robert Radloff, Xueping Su, Leping Peng, Thomas Weber and Matthias Rätsch

Abstract: AI-based prediction and recommender systems are widely used in various industry sectors. However, general acceptance of AI-enabled systems is still largely uninvestigated. Therefore, we first conducted a survey with 559 respondents. The findings suggest that AI-enabled systems should be fair, transparent, consider personality traits and perform tasks efficiently. Secondly, we developed a system for the Facial Beauty Prediction (FBP) benchmark that automatically evaluates facial attractiveness. As our previous experiments have shown, these results are usually highly correlated with human ratings; consequently, they also reflect human bias in the annotations. An upcoming challenge for scientists is to provide training data and AI algorithms that can withstand such distorted information. In this work, we introduce AntiDiscriminationNet (ADN), a superior attractiveness prediction network. We propose a new method to generate an unbiased convolutional neural network (CNN) to improve the fairness of machine learning on facial datasets. To train unbiased networks, we generate synthetic images and weight the training data for anti-discrimination assessments across different ethnicities. Additionally, we introduce an approach with entropy penalty terms to reduce the bias of our CNN. Our research provides insights into how to train and build fair machine learning models for facial image analysis by minimising implicit biases. Our AntiDiscriminationNet outperforms all competitors in the FBP benchmark, achieving a Pearson correlation coefficient of PCC = 0.9601.
Download

Paper Nr: 57
Title:

Benchmarking Person Re-Identification Datasets and Approaches for Practical Real-World Implementations

Authors:

Jose Huaman, Felix O. Sumari H., Luigy Machaca, Esteban Clua and Joris Guérin

Abstract: Person Re-Identification (Re-ID) is receiving a lot of attention. Large datasets containing labeled images of various individuals have been released, and successful approaches were developed. However, when Re-ID models are deployed in new cities or environments, they face an important domain shift (ethnicity, clothing, weather, architecture, etc.), resulting in decreased performance. In addition, the whole frames of the video streams must be converted into cropped images of people using pedestrian detection models, which behave differently from the human annotators who built the training dataset. To better understand the extent of this issue, this paper introduces a complete methodology to evaluate Re-ID approaches and training datasets with respect to their suitability for unsupervised deployment for live operations. We benchmark four Re-ID approaches on three datasets, providing insight and guidelines that can help to design better Re-ID pipelines.
Download

Paper Nr: 59
Title:

Surface-Graph-Based 6DoF Object-Pose Estimation for Shrink-Wrapped Items Applicable to Mixed Depalletizing Robots

Authors:

Taiki Yano, Nobutaka Kimura and Kiyoto Ito

Abstract: We developed an object-recognition method that enables six-degrees-of-freedom (6DoF) pose and size estimation of shrink-wrapped items for use with a mixed depalletizing robot. Shrink-wrapped items consist of multiple products wrapped in transparent plastic wrap, the boundaries of which are unclear, making it difficult to identify the area of a single item to be picked. To solve this problem, we propose a surface-graph-based 6DoF object-pose estimation method. This method constructs a surface graph representing the connection of products by using their surfaces as graph nodes and determines the boundary of each shrink-wrapped item by detecting the homogeneity of the edge length, which corresponds to the distance between the centers of the products. We also developed a recognition-process flow that can be applied to various objects by appropriately switching between conventional box-shape object recognition and shrink-wrapped object recognition. We conducted an experiment to evaluate the proposed method, and the results indicate that it achieves an average recognition rate of more than 90%, which is higher than that of a conventional object-recognition method in a depalletizing work environment that includes shrink-wrapped items.
Download

Paper Nr: 79
Title:

Using Continual Learning on Edge Devices for Cost-Effective, Efficient License Plate Detection

Authors:

Reshawn Ramjattan, Rajeev Ratan, Shiva Ramoudith, Patrick Hosein and Daniele Mazzei

Abstract: Deep learning networks for license plate detection can produce exceptional results. However, the challenge lies in real-world use, where model performance suffers when exposed to new variations and distortions of images. Rain occlusion, low lighting, glare, motion blur and varying camera quality are a few among many possible data shifts that can occur. If portable edge devices are being used, then a change in the location or angle of the device also results in reduced performance. Continual learning (CL) aims to handle such shifts by helping models learn from new data without forgetting old knowledge. This is particularly useful for deep learning on edge devices, where resources are limited. GDumb is a simple CL method that achieves state-of-the-art performance. We explore the potential of continual learning for license plate detection through experiments using an adapted GDumb approach. Our data was collected for a license plate recognition system using edge devices and consists of images split into three categories by quality and distance. We evaluate the application with respect to data shifts, forward/backward transfer, accuracy and forgetting. Our results show that a CL approach under limited resources can attain results close to full retraining for our application.
Download
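For context, GDumb works by greedily maintaining a class-balanced memory buffer of past samples and retraining the model from scratch on that buffer when an answer is needed. Below is a compact, hedged sketch of such a buffer; the replacement policy and capacity are illustrative, and retraining details (which are application-specific) are omitted.

```python
# GDumb-style greedy, class-balanced memory buffer: incoming samples are stored
# while keeping classes balanced within a fixed budget; the detector would later
# be retrained from scratch on buffer.samples().
from collections import defaultdict
import random

class GreedyBalancedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.per_class = defaultdict(list)

    def __len__(self):
        return sum(len(v) for v in self.per_class.values())

    def add(self, sample, label):
        if len(self) < self.capacity:
            self.per_class[label].append(sample)
            return
        # Buffer full: only accept if this class is smaller than the largest one,
        # evicting a random sample from the currently largest class.
        largest = max(self.per_class, key=lambda k: len(self.per_class[k]))
        if len(self.per_class[label]) < len(self.per_class[largest]):
            victim = random.randrange(len(self.per_class[largest]))
            self.per_class[largest].pop(victim)
            self.per_class[label].append(sample)

    def samples(self):
        return [(s, c) for c, items in self.per_class.items() for s in items]
```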

Paper Nr: 91
Title:

An Experimental Consideration on Gait Spoofing

Authors:

Yuki Hirose, Kazuaki Nakamura, Naoko Nitta and Noboru Babaguchi

Abstract: Deep learning technologies have improved the performance of biometric systems but have also increased the risk of spoofing attacks against them. So far, many spoofing and anti-spoofing methods have been proposed for face and voice; for gait, however, only a limited number of studies focus on the spoofing risk. To examine the feasibility of gait spoofing, in this paper we attempt to generate a sequence of fake gait silhouettes that mimics a target person’s walking style from only a single photo of that person. A feature vector extracted from such a single photo does not have full information about the target person’s gait characteristics. To complement the information, we update the extracted feature so that it simultaneously contains various people’s characteristics, like a wolf sample. Inspired by the wolf sample, also called a “master” sample, which can simultaneously pass two or more verification systems like a master key, we call the proposed process “masterization”. After masterization, we decode the resulting feature vector into a gait silhouette sequence. In our experiment, the gait recognition accuracy with the generated fake silhouette sequences increased from 69% to 78% through masterization, which indicates an unignorable risk of gait spoofing.
Download

Paper Nr: 99
Title:

Subjective Baggage-Weight Estimation from Gait: Can You Estimate How Heavy the Person Feels?

Authors:

Masaya Mizuno, Yasutomo Kawanishi, Tomohiro Fujita, Daisuke Deguchi and Hiroshi Murase

Abstract: We propose a new computer vision problem of subjective baggage-weight estimation by defining the term subjective weight as how heavy the person feels. We propose a method named G2SW (Gait to Subjective Weight), which is based on the assumption that cues of the subjective weight appear in the human gait, described by a 3D skeleton sequence. The method uses 3D locations and velocities of body joints as input and estimates subjective weight using a Graph Convolutional Network. It also estimates human body weight as a sub-task based on the assumption that the strength of a person depends on body weight. For the evaluation, we built a dataset for subjective baggage-weight estimation, consisting of 3D skeleton sequences with subjective weight annotations. We confirmed that the subjective weight could be estimated from a human gait and also confirmed that the sub-task of body weight estimation pulls up the performance of the subjective weight estimation.
Download

Paper Nr: 114
Title:

Pyramid Swin Transformer: Different-Size Windows Swin Transformer for Image Classification and Object Detection

Authors:

Chenyu Wang, Toshio Endo, Takahiro Hirofuchi and Tsutomu Ikegami

Abstract: We present the Pyramid Swin Transformer for object detection and image classification, which takes advantage of more shifted-window operations and smaller windows of more varied sizes. We also add a Feature Pyramid Network for object detection, which produces excellent results. The architecture is implemented in four stages containing layers with different window sizes. We test our architecture on ImageNet classification and COCO detection: the Pyramid Swin Transformer achieves 85.4% accuracy on ImageNet classification and 54.3 box AP on COCO.
Download

Paper Nr: 129
Title:

Self-Modularized Transformer: Learn to Modularize Networks for Systematic Generalization

Authors:

Yuichi Kamata, Moyuru Yamada and Takayuki Okatani

Abstract: Visual Question Answering (VQA) is a task of answering questions about images that fundamentally requires systematic generalization capabilities, i.e., handling novel combinations of known visual attributes (e.g., color and shape) or visual sub-tasks (e.g., FILTER and COUNT). Recent research reports that Neural Module Networks (NMNs), which compose modules that tackle sub-tasks according to a given layout, are a promising approach to systematic generalization in VQA. However, their performance relies heavily on human-designed sub-tasks and layouts. Despite being crucial for training, most datasets do not contain these annotations. We propose the Self-Modularized Transformer (SMT), a novel Transformer-based NMN that concurrently learns to decompose the question into sub-tasks and to compose modules without such annotations, overcoming this important limitation of NMNs. SMT outperforms state-of-the-art NMNs and multi-modal Transformers on systematic generalization to novel combinations of sub-tasks in VQA.
Download

Paper Nr: 133
Title:

Fast and Reliable Template Matching Based on Effective Pixel Selection Using Color and Intensity Information

Authors:

Rina Tagami, Hiroki Kobayashi, Shuichi Akizuki and Manabu Hashimoto

Abstract: We propose a fast and reliable method for object detection using color and intensity information. The probability of the hue and pixel values (gray-level intensity values) of two-pixel pairs occurring in a template image is calculated, and only those pixel pairs with extremely low probability are selected for matching. Since these pixels are highly distinctive, matching is reliable and unaffected by surrounding disturbances, and since only a very small number of pixels is used, the matching speed is high. Moreover, the use of the two measures enables reliable matching regardless of an object’s color. In an experiment on real images, we achieved a recognition rate of 98% and a processing time of 80 msec using only 5% (684 pixels) of the template image. When only 0.5% (68 pixels) of the template image was used, the recognition rate was 80% and the processing time was 5.9 msec.
Download
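The pixel-pair selection step described above can be illustrated as follows: quantize hue and intensity, estimate how often each combination occurs for pixel pairs at a fixed offset, and keep the rarest pairs. The offset, bin counts, number of kept pairs, and the exact pairing of hue with intensity are simplifying assumptions, not the authors' algorithm.

```python
# Select distinctive pixel pairs from a template by keeping the pairs whose
# quantized (hue, intensity) combination has the lowest occurrence probability.
import numpy as np
import cv2

def select_rare_pixel_pairs(template_bgr, offset=(5, 0), bins=16, num_pairs=100):
    hsv = cv2.cvtColor(template_bgr, cv2.COLOR_BGR2HSV)
    hue = (hsv[..., 0].astype(int) * bins) // 180      # quantized hue (OpenCV hue is 0..179)
    val = (hsv[..., 2].astype(int) * bins) // 256      # quantized intensity
    dy, dx = offset
    h, w = hue.shape
    ys, xs = np.mgrid[0:h - dy, 0:w - dx]
    # Describe each pair by the hue of the first pixel and the intensity of the second.
    codes = (hue[ys, xs] * bins + val[ys + dy, xs + dx]).ravel()
    counts = np.bincount(codes, minlength=bins * bins)
    prob = counts[codes] / codes.size                   # occurrence probability per pair
    rare = np.argsort(prob)[:num_pairs]                 # keep the rarest pairs
    return np.stack([ys.ravel()[rare], xs.ravel()[rare]], axis=1)  # (row, col) of first pixel
```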

Paper Nr: 136
Title:

PanDepth: Joint Panoptic Segmentation and Depth Completion

Authors:

Juan P. Lagos and Esa Rahtu

Abstract: Understanding 3D environments semantically is pivotal in autonomous driving applications where multiple computer vision tasks are involved. Multi-task models provide different types of outputs for a given scene, yielding a more holistic representation while keeping the computational cost low. We propose a multi-task model for panoptic segmentation and depth completion using RGB images and sparse depth maps. Our model successfully predicts fully dense depth maps and performs semantic segmentation, instance segmentation, and panoptic segmentation for every input frame. Extensive experiments were done on the Virtual KITTI 2 dataset and we demonstrate that our model solves multiple tasks, without a significant increase in computational cost, while keeping high accuracy performance. Code is available at https://github.com/juanb09111/PanDepth.git.
Download

Paper Nr: 146
Title:

Environmental Information Extraction Based on YOLOv5-Object Detection in Videos Collected by Camera-Collars Installed on Migratory Caribou and Black Bears in Northern Quebec

Authors:

Jalila Filali, Denis Laurendeau and Steeve D. Côté

Abstract: With the rapid increase in the number of recorded videos, developing intelligent systems to analyze video content has become increasingly important. As part of a project within Sentinel North’s research program, we analyze videos collected using camera collars installed on caribou (Rangifer tarandus) and black bears (Ursus americanus) living in northern Quebec. Our objective is to extract valuable environmental information such as the weather, resources, and habitat where the animals live. In this paper, we propose an environmental information extraction approach based on YOLOv5 object detection in videos collected by camera collars installed on caribou and black bears in Northern Quebec. Our proposal first filters raw data and stabilizes the videos to build a wildlife video dataset for training and evaluating deep learning object detection. Second, it addresses existing difficulties in detecting objects by adopting the YOLOv5 model to incorporate enriched features and detect objects of different sizes, and it exploits and analyzes the object detection results to extract relevant information about the weather, resources, and habitat of the animals. Finally, it visualizes the object detection and statistical results through a GUI interface. The experimental results show that the YOLOv5m model was significantly better than the YOLOv5s model and can detect objects of different sizes. In addition, the obtained results show that our method can extract weather, habitat, and resource classes from stabilized videos and determine their percentage of appearance. Moreover, our proposed method can automatically provide statistics about environmental information for each stabilized video.
Download

Paper Nr: 160
Title:

Neural Architecture Search in the Context of Deep Multi-Task Learning

Authors:

Guilherme Gadelha, Herman Gomes and Leonardo Batista

Abstract: Multi-Task Learning (MTL) is a neural network design paradigm that aims to improve generalization while simultaneously solving multiple tasks. It has been successful in many application areas such as Natural Language Processing and Computer Vision. In an MTL neural network, there are shared task branches and task-specific branches. However, automatically deciding on the best locations and sizes of those branches based on the domain tasks remains an open question. With the aim of shedding light on this question, we designed a sequence of experiments involving single-task networks, multi-task networks, and networks created with a neural architecture search (NAS) strategy. In addition, we propose a competitive neural network architecture for a challenging use case: ICAO photograph conformance checking for the issuing of passports. We obtained the best results using a handcrafted MTL network, whose effectiveness is close to state-of-the-art methods. Furthermore, our experiments and analysis pave the way for a technique that automatically creates branches and groups similar tasks in an MTL network.
Download

Paper Nr: 206
Title:

A Novel 3D Face Reconstruction Model from a Multi-Image 2D Set

Authors:

Mohamed Dhouioui, Tarek Frikha, Hassen Drira and Mohamed Abid

Abstract: Recently, many researchers have focused on 3D face analysis and its applications and have put much effort into developing its methods. Even though 3D facial images provide a more accurate representation of the face, they are harder to acquire than 2D pictures. This is why wide efforts have been made to develop systems that reconstruct 3D face models from 2D images. However, 2D-to-3D face reconstruction is still not very advanced: it is computationally intensive and requires extensive exploration of the solution space to acquire accurate representations. In this paper, we present a 3D multi-image face reconstruction method built on top of a single-image reconstruction model. We propose a novel 3D face reconstruction approach based on two levels: first, a single-image 3D reconstruction CNN model that produces vectorial embeddings and generates a 3D face morphable model; and second, an unsupervised K-means model on top of the single-image reconstruction CNN model that optimizes its results by incorporating multi-image reconstruction. Thanks to the introduction of a hybrid loss function, we are able to train the model without ground-truth references. Furthermore, to our knowledge this is the first use of an unsupervised model alongside a weakly supervised one reaching such performance. Experiments show that our approach outperforms its counterparts in the literature in both single-image and multi-image reconstruction, which makes it promising for other applications.
Download

Paper Nr: 225
Title:

How to Train an Accurate and Efficient Object Detection Model on any Dataset

Authors:

Galina Zalesskaya, Bogna Bylicka and Eugene Liu

Abstract: The rapidly evolving industry demands high accuracy from models without the time-consuming and computationally expensive experiments required for fine-tuning. Moreover, a model and training pipeline that was once carefully optimized for a specific dataset rarely generalizes well to training on a different dataset. This makes it unrealistic to have carefully fine-tuned models for each use case. To solve this, we propose an alternative approach that also forms the backbone of the Intel® Geti™ platform: a dataset-agnostic template for object detection training, consisting of carefully chosen and pre-trained models together with a robust training pipeline for further training. Our solution works out of the box and provides a strong baseline on a wide range of datasets. It can be used on its own or as a starting point for further fine-tuning for specific use cases when needed. We obtained the dataset-agnostic templates by performing parallel training on a corpus of datasets and optimizing the choice of architectures and training tricks with respect to the average results over the whole corpus. We examined a number of architectures, taking into account the performance-accuracy trade-off. Consequently, we propose three finalists, VFNet, ATSS, and SSD, that can be deployed on CPU using the OpenVINO™ toolkit. The source code is available as a part of the OpenVINO™ Training Extensions.
Download

Paper Nr: 226
Title:

Real-Time Obstacle Detection using a Pillar-based Representation and a Parallel Architecture on the GPU from LiDAR Measurements

Authors:

Mircea P. Muresan, Robert Schlanger, Radu Danescu and Sergiu Nedevschi

Abstract: In contrast to image-based detection, objects detected from 3D LiDAR data can be localized more easily, and their shapes are identified more readily using depth information. However, the 3D LiDAR object detection task is more difficult due to factors such as the sparsity of the point clouds and the highly variable point density. State-of-the-art learning approaches can offer good results; however, they are limited by the data in the training set. Simple models work only in some environmental conditions or with specific object classes, while more complex models require long running times and increased computing resources, making them unsuitable for real-time applications that have multiple other processing modules. This paper presents a GPU-based approach for detecting the road surface and objects from 3D LiDAR data in real time. We first present a parallel architecture for processing 3D points. We then describe a novel road surface estimation approach, useful for separating ground and object points. Finally, an original object clustering algorithm based on pillars is presented. The proposed solution has been evaluated using the KITTI dataset and has also been tested in different environments using different LiDAR sensors and computing platforms to verify its robustness.
Download

Paper Nr: 230
Title:

Prediction of Shuttle Trajectory in Badminton Using Player's Position

Authors:

Yuka Nokihara, Ryosuke Hori, Ryo Hachiuma and Hideo Saito

Abstract: Data analysis in net sports, such as badminton, is becoming increasingly important. This research aims to analyze data so that players can gain an advantage in the fast rally development of badminton matches. We investigate the novel task of predicting future shuttle trajectories in badminton match videos and propose a method that uses shuttle and player position information. In an experiment, we detected players from match videos and trained a time-sequence model. The proposed method outperformed baseline methods that use only the shuttle position information as the input and other methods that use time-sequence models.
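A minimal sketch of the kind of time-sequence model described, assuming the per-frame input is the shuttle position plus two player positions (the paper's actual architecture and feature set may differ):

```python
import torch
import torch.nn as nn

class TrajectoryLSTM(nn.Module):
    """Predict the next shuttle position from a window of past shuttle
    (x, y) coordinates and two players' (x, y) positions."""

    def __init__(self, in_dim=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # next shuttle (x, y)

    def forward(self, seq):               # seq: (batch, time, 6)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])      # prediction from last time step

model = TrajectoryLSTM()
dummy = torch.randn(8, 20, 6)             # 8 rallies, 20 past frames each
pred = model(dummy)                        # (8, 2) predicted positions
```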
Download

Paper Nr: 244
Title:

Few-Shot Gaze Estimation via Gaze Transfer

Authors:

Nikolaos Poulopoulos and Emmanouil Z. Psarakis

Abstract: Precise gaze estimation constitutes a challenging problem in many computer vision applications due to many limitations related to the great variability of human eye shapes, facial expressions and orientations, as well as illumination variations and the presence of occlusions. Nowadays, the increasing adoption of deep neural networks requires great amounts of training data. However, the dependency on labeled data for gaze estimation constitutes a significant issue because such data are expensive to obtain and require a dedicated hardware setup. To address these issues, we introduce a few-shot learning approach which exploits a large amount of unlabeled data to disentangle the gaze feature and train a gaze estimator using only a few calibration samples. This is achieved by performing gaze transfer between image pairs that share similar eye appearance but different gaze information via the joint training of a gaze estimation and a gaze transfer network. Thus, the gaze estimation network learns to disentangle the gaze feature indirectly in order to perform the gaze transfer task precisely. Experiments on two publicly available datasets reveal promising results and enhanced accuracy against other few-shot gaze estimation methods.
Download

Paper Nr: 249
Title:

Application of Deep Learning to the Detection of Foreign Object Debris at Aerodromes’ Movement Area

Authors:

João Almeida, Gonçalo Cruz, Diogo Silva and Tiago Oliveira

Abstract: This work describes a low-cost and passive system installed on ground vehicles that detects Foreign Object Debris (FOD) at aerodromes’ movement area using neural networks. We created a dataset of images collected at an airfield to test our proposed solution, using three different electro-optical sensors capturing images in different wavelengths: i) visible, ii) near-infrared plus visible and iii) long-wave infrared. The first sensor captured 9,497 images, the second 5,858, and the third 10,388. Unlike other works in this field, our dataset is publicly available and was collected in accordance with our envisioned real-world application. We rely on image classification, object detection and image segmentation networks to find objects in the image. For the classifier and detector, we chose Xception and YOLOv3, respectively. For image segmentation, we tested several approaches based on Unet with backbone networks. The classification task achieved an AP of 77.92%, the detection achieved 37.49% mAP and the segmentation network achieved 26.9% mIoU.
Download

Paper Nr: 262
Title:

YCbCr Color Space as an Effective Solution to the Problem of Low Emotion Recognition Rate of Facial Expressions In-The-Wild

Authors:

Hadjer Boughanem, Haythem Ghazouani and Walid Barhoumi

Abstract: Facial expressions are natural and universal reactions of persons facing any situation, and they are strongly associated with human intentions and emotional states. In this framework, Facial Emotion Recognition (FER) aims to analyze and classify a given facial image into one of several emotion states. With the recent progress in computer vision, machine learning and deep learning techniques, it is possible to effectively recognize emotions from facial images. Nevertheless, FER in a wild situation is still a challenging task due to several circumstances and various challenging factors such as heterogeneous head poses, head motion, movement blur, age, gender, occlusions, skin color, and lighting condition changes. In this work, we propose a deep learning-based facial expression recognition method, using the complementarity between deep features extracted from three pre-trained convolutional neural networks. The proposed method focuses on the quality of features offered by the YCbCr color space and demonstrates that using this color space enhances emotion recognition accuracy when dealing with images taken under challenging conditions. The obtained results, on the SFEW 2.0 dataset captured in wild environments as well as on two other facial expression benchmarks, CK+ and JAFFE, show better performance compared to state-of-the-art methods.
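For reference, converting an image to the YCbCr space before feeding the pre-trained networks can be sketched with OpenCV as below; note that OpenCV names the space YCrCb and orders the chroma channels accordingly (the file name and normalisation are illustrative):

```python
import cv2
import numpy as np

bgr = cv2.imread("face.jpg")                      # OpenCV loads images as BGR
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)    # channels ordered Y, Cr, Cb
y, cr, cb = cv2.split(ycrcb)

# Rearrange to Y, Cb, Cr and normalise before feeding a pre-trained CNN.
ycbcr = np.stack([y, cb, cr], axis=-1).astype(np.float32) / 255.0
```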
Download

Paper Nr: 265
Title:

Applying Positional Encoding to Enhance Vision-Language Transformers

Authors:

Xuehao Liu, Sarah J. Delany and Susan McKeever

Abstract: Positional encoding is used in both natural language and computer vision transformers. It provides information on the sequence order and relative position of input tokens (such as words in a sentence) for higher performance. Unlike pure language and vision transformers, vision-language transformers do not currently exploit positional encoding schemes to enrich input information. We show that capturing the location information of visual features can help vision-language transformers improve their performance. We take Oscar, one of the state-of-the-art (SOTA) vision-language transformers, as an example transformer for implanting positional encoding. We use image captioning as a downstream task to test performance. We added two types of positional encoding into Oscar: DETR, as an absolute positional encoding approach, and iRPE, for relative positional encoding. With the same training protocol and data, both positional encodings improved the image captioning performance of Oscar by between 6.8% and 24.1% across the five image captioning evaluation criteria used.
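A hedged sketch of a DETR-style sinusoidal absolute positional encoding applied to normalised region coordinates, which could then be added to the visual tokens fed to a vision-language transformer (dimensions and integration point are assumptions, not Oscar's actual code):

```python
import torch

def sine_position_encoding(xy, num_feats=128, temperature=10000):
    """Sinusoidal encoding for normalised region centres.

    xy: (num_regions, 2) tensor of (x, y) in [0, 1].
    Returns (num_regions, 2 * num_feats) position embeddings that can be
    added to (or concatenated with) the region features.
    """
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = xy.unsqueeze(-1) / dim_t                                  # (R, 2, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(1)                                           # (R, 2 * num_feats)

centres = torch.rand(36, 2)           # 36 detected regions (illustrative)
pos_emb = sine_position_encoding(centres)
```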
Download

Paper Nr: 266
Title:

Brazilian Banknote Recognition Based on CNN for Blind People

Authors:

Odalisio S. Neto, Felipe G. Oliveira, João B. Cavalcanti and José S. Pio

Abstract: This paper presents an approach based on computer vision techniques for the recognition of Brazilian banknotes. The identification methods proposed by the Brazilian Central Bank are unsafe due to the intense damage banknotes suffer to their original state during daily use. These damages directly affect the recognition ability of the visually impaired. The proposed approach considers the second family of the Brazilian currency, the Real (plural Reais), covering notes of 2, 5, 10, 20, 50 and 100 Reais. The proposed strategy is composed of two main steps: i) Image Pre-Processing; and ii) Banknote Classification. In the first step, the images of Brazilian banknotes, acquired by smartphone cameras, are processed with a bilateral filter to reduce noise while preserving edges. In the banknote classification step, feature learning is performed, capturing the main features for banknote image classification, and a Convolutional Neural Network (CNN) is used to classify the note denomination (value). Experiments demonstrated the effectiveness and robustness of the proposed approach, achieving an accuracy of 99.103% on the proposed dataset of 6,365 images of real banknotes captured in different environments and illumination conditions.
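A minimal sketch of the pre-processing step with OpenCV's bilateral filter; the filter parameters and input size shown are illustrative, not the values used in the paper:

```python
import cv2

img = cv2.imread("banknote.jpg")
# Edge-preserving smoothing: diameter and sigma values are illustrative.
denoised = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)
# Resize to a typical CNN input resolution (assumed) before classification.
denoised = cv2.resize(denoised, (224, 224))
```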
Download

Area 4 - Applications and Services

Full Papers
Paper Nr: 26
Title:

CoDA-Few: Few Shot Domain Adaptation for Medical Image Semantic Segmentation

Authors:

Arthur A. Pinto, Jefersson D. Santos, Hugo Oliveira and Alexei Machado

Abstract: Due to ethical and legal concerns related to privacy, medical image datasets are often kept private, preventing invaluable annotations from being publicly available. However, data-driven models such as machine learning algorithms require large amounts of curated labeled data. This tension between ethical concerns regarding privacy and performance is one of the core limitations to the development of artificial intelligence solutions in medical imaging analysis. Aiming to mitigate this problem, we introduce a methodology based on few-shot domain adaptation capable of leveraging organ segmentation annotations from private datasets to segment previously unseen data. This strategy uses unsupervised image-to-image translation to transfer annotations from a confidential source dataset to a set of unseen public datasets. Experiments show that the proposed method achieves equivalent or better performance when compared with approaches that have access to the target data. The method’s effectiveness is evaluated in segmentation studies of the heart and lungs in X-ray datasets, often reaching Jaccard values larger than 90% for novel unseen image sets.
Download

Paper Nr: 30
Title:

Let’s Get the FACS Straight: Reconstructing Obstructed Facial Features

Authors:

Tim Büchner, Sven Sickert, Gerd F. Volk, Christoph Anders, Orlando Guntinas-Lichius and Joachim Denzler

Abstract: The human face is one of the most crucial parts in interhuman communication. Even when parts of the face are hidden or obstructed, the underlying facial movements can be understood. Machine learning approaches often fail in that regard due to the complexity of the facial structures. To alleviate this problem, a common approach is to fine-tune a model for such a specific application. However, this is computationally intensive and might have to be repeated for each desired analysis task. In this paper, we propose to reconstruct obstructed facial parts to avoid the task of repeated fine-tuning. As a result, existing facial analysis methods can be used without further changes with respect to the data. In our approach, the restoration of facial features is interpreted as a style transfer task between different recording setups. By using the CycleGAN architecture, the requirement of matched pairs, which is often hard to fulfill, can be eliminated. To prove the viability of our approach, we compare our reconstructions with real unobstructed recordings. We created a novel data set in which 36 test subjects were recorded both with and without 62 surface electromyography sensors attached to their faces. In our evaluation, we feature typical facial analysis tasks, like the computation of Facial Action Units and the detection of emotions. To further assess the quality of the restoration, we also compare perceptional distances. We show that scores similar to those of the videos without obstructing sensors can be achieved.
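A minimal sketch of the cycle-consistency term that lets CycleGAN-style training work without matched pairs; the generator networks and weighting below are hypothetical placeholders, and the adversarial terms of the full objective are omitted:

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency(G_ab, G_ba, real_a, real_b, lam=10.0):
    """CycleGAN-style cycle loss between two recording setups A and B.

    G_ab, G_ba are hypothetical generator networks mapping A->B and B->A.
    """
    fake_b = G_ab(real_a)
    fake_a = G_ba(real_b)
    rec_a = G_ba(fake_b)   # A -> B -> A should reproduce the input
    rec_b = G_ab(fake_a)   # B -> A -> B should reproduce the input
    return lam * (l1(rec_a, real_a) + l1(rec_b, real_b))
```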
Download

Paper Nr: 95
Title:

Extractive Text Summarization Using Generalized Additive Models with Interactions for Sentence Selection

Authors:

Vinícius C. da Silva, João Paulo Papa and Kelton Augusto P. da Costa

Abstract: Automatic Text Summarization (ATS) is becoming relevant with the growth of textual data; however, with the popularization of public large-scale datasets, some recent machine learning approaches have focused on dense models and architectures that, despite producing notable results, usually result in models that are difficult to interpret. Given the challenge behind interpretable learning-based text summarization and the importance it may have for evolving the current state of the ATS field, this work studies the application of two modern Generalized Additive Models with interactions, namely Explainable Boosting Machine and GAMI-Net, to the extractive summarization problem based on linguistic features and binary classification.
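A hedged sketch of fitting an Explainable Boosting Machine with pairwise interactions using the interpret library; the sentence-level features and labels below are placeholders, not the paper's feature set:

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

# X: one row per sentence with linguistic features (e.g. position, length,
# TF-ISF score); y: 1 if the sentence belongs to the reference summary.
X = np.random.rand(1000, 8)               # placeholder feature matrix
y = np.random.randint(0, 2, size=1000)    # placeholder labels

ebm = ExplainableBoostingClassifier(interactions=10)  # allow pairwise terms
ebm.fit(X, y)
scores = ebm.predict_proba(X)[:, 1]       # rank sentences by predicted score
```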
Download

Paper Nr: 103
Title:

ALiSNet: Accurate and Lightweight Human Segmentation Network for Fashion E-Commerce

Authors:

Amrollah Seifoddini, Koen Vernooij, Timon Künzle, Alessandro Canopoli, Malte Alf, Anna Volokitin and Reza Shirvany

Abstract: Accurately estimating human body shape from photos can enable innovative applications in fashion, from mass customization, to size and fit recommendations and virtual try-on. Body silhouettes calculated from user pictures are effective representations of the body shape for downstream tasks. Smartphones provide a convenient way for users to capture images of their body, and on-device image processing allows predicting body segmentation while protecting users’ privacy. Existing off-the-shelf methods for human segmentation are closed source and cannot be specialized for our application of body shape and measurement estimation. Therefore, we create a new segmentation model by simplifying Semantic FPN with PointRend, an existing accurate model. We finetune this model on a high-quality dataset of humans in a restricted set of poses relevant for our application. We obtain our final model, ALiSNet, with a size of 4MB and 97.6 ± 1.0% mIoU, compared to Apple Person Segmentation, which has an accuracy of 94.4 ± 5.7% mIoU on our dataset.
Download

Paper Nr: 234
Title:

IncludeVote: Development of an Assistive Technology Based on Computer Vision and Robotics for Application in the Brazilian Electoral Context

Authors:

Felipe S. Mendonça, João N. Teixeira and Marcondes D. Silva Júnior

Abstract: This work presents the development of an assistive technology based on computer vision and robotics, which allows users with disabilities to carry out the complete voting process without the need for assistance. The developed system consists of a HeadMouse associated with an auxiliary robotic-arm tool that contains an adapted interactive interface equivalent to that of the electronic voting machine. The HeadMouse was developed using computer vision techniques for face detection and the recognition of facial points. It uses the movements of the face and eyes to type votes through the adapted interface, while the robotic arm carries out the voting process. Tests showed that the developed system performed satisfactorily, allowing a user to carry out the entire voting process in 2 minutes and 28 seconds. The system has an average throughput of 1.16 bits/s for movements with the mouse cursor. The developed system can be used by people with motor disabilities as an assistive technology to aid the voting process, promoting social inclusion.
Download

Short Papers
Paper Nr: 15
Title:

Railway Switch Classification Using Deep Neural Networks

Authors:

Andrei-Robert Alexandrescu, Alexandru Manole and Laura Dioșan

Abstract: Railway switches are the mechanisms that slightly adjust the rail blades at the intersection of two rail tracks in order to allow trains to exchange their routes. Ensuring that the switches are correctly set is a critical task: if they are not, they may cause delays in train schedules or even loss of lives. In this paper we propose an approach for classifying switches using various deep learning architectures with a small number of parameters. We exploit various input modalities, including grayscale images, black-and-white binary masks and a concatenated representation consisting of both. The experiments are conducted on RailSem19, the most comprehensive dataset for the task of switch classification, using both fine-tuned models and models trained from scratch. The switch bounding boxes from the dataset are pre-processed by introducing three hyper-parameters over the boxes, improving the models' performance. We achieve an overall accuracy of up to 96% in a ternary multi-class classification setting, where our model is able to distinguish between images containing left, right or no switches at all. The results for the left and right switch classes are compared with two other existing approaches from the literature. We obtain competitive results using deep neural networks with considerably fewer learnable parameters than the ones from the literature.
Download

Paper Nr: 38
Title:

Multi-Phase Relaxation Labeling for Square Jigsaw Puzzle Solving

Authors:

Ben Vardi, Alessandro Torcinovich, Marina Khoroshiltseva, Marcello Pelillo and Ohad Ben-Shahar

Abstract: We present a novel method for solving square jigsaw puzzles based on global optimization. The method is fully automatic, assumes no prior information, and can handle puzzles with known or unknown piece orientation. At the core of the optimization process is nonlinear relaxation labeling, a well-founded approach for deducing global solutions from local constraints, but unlike the classical scheme here we propose a multi-phase approach that guarantees convergence to feasible puzzle solutions. Next to the algorithmic novelty, we also present a new compatibility function for the quantification of the affinity between adjacent puzzle pieces. Competitive results and the advantage of the multi-phase approach are demonstrated on standard datasets.
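For context, the classical (single-phase) relaxation labeling update on which such schemes build can be sketched as follows; the paper's multi-phase procedure and compatibility function are not reproduced here:

```python
import numpy as np

def relaxation_labeling_step(p, r):
    """One classical relaxation-labeling update.

    p: (n, m) current label probabilities for n pieces/locations and m labels.
    r: (n, m, n, m) pairwise compatibilities between label assignments.
    Returns the re-normalised probabilities after one support-weighted update.
    """
    q = np.einsum('iajb,jb->ia', r, p)          # support from the other objects
    p_new = p * q
    return p_new / p_new.sum(axis=1, keepdims=True)
```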
Download

Paper Nr: 42
Title:

Interactive Indoor Localization Based on Image Retrieval and Question Response

Authors:

Xinyun Li, Ryosuke Furuta, Go Irie, Yota Yamamoto and Yukinobu Taniguchi

Abstract: Due to the increasing complexity of indoor facilities such as shopping malls and train stations, there is a need for a new technology that can find the current location of the user of a smartphone or other device, since such facilities block the reception of GPS signals. Although many methods have been proposed for location estimation based on image retrieval, their accuracy is unreliable because many indoor areas look architecturally similar and few features are unique enough to offer unequivocal localization. Some methods increase the accuracy of location estimation by increasing the number of query images, but this increases the user's image capture burden. In this paper, we propose a method for accurately estimating the current indoor location based on question-response interaction with the user, without imposing greater image capture loads. Specifically, the proposal (i) generates questions using object detection and scene text detection, (ii) sequences the questions by minimizing conditional entropy, and (iii) filters candidate locations to find the current location based on the user's responses.
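A hedged sketch of step (ii), choosing the next question that minimises the expected entropy of the posterior over candidate locations; the data structures are illustrative, not the paper's implementation:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def pick_next_question(prior, answer_model):
    """Pick the question with the smallest expected posterior entropy.

    prior: (L,) probability over candidate locations.
    answer_model: dict question -> (L, A) matrix of P(answer | location).
    """
    best_q, best_h = None, np.inf
    for q, p_ans_given_loc in answer_model.items():
        p_ans = prior @ p_ans_given_loc                  # (A,) marginal over answers
        h = 0.0
        for a, pa in enumerate(p_ans):
            if pa == 0:
                continue
            post = prior * p_ans_given_loc[:, a] / pa    # Bayes update for answer a
            h += pa * entropy(post)
        if h < best_h:
            best_q, best_h = q, h
    return best_q
```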
Download

Paper Nr: 73
Title:

High-Level Workflow Interpreter for Real-Time Image Processing

Authors:

Roberto S. Maciel, João A. Nery and Daniel O. Dantas

Abstract: Medical imaging is used in clinics to support the diagnosis and treatment of diseases. Developing effective computer vision algorithms for image processing is a challenging task, requiring a significant amount of time invested in the prototyping phase. Workflow systems have become popular tools as they allow the development of algorithms as a collection of function blocks, which can be graphically linked to input and output pipelines. These systems help to improve the learning curve for beginning programmers. Other systems make programming easier and increase productivity through automatic code generation. VGLGUI is a graphical user interface for image processing that allows visual workflow programming for parallel image processing. It uses VisionGL functions for automatic wrapper code generation and optimization of image transfers between RAM and GPU. This article describes the high-level VGLGUI workflow interpreter and demonstrates the results of two image processing workflows.
Download

Paper Nr: 81
Title:

PG-3DVTON: Pose-Guided 3D Virtual Try-on Network

Authors:

Sanaz Sabzevari, Ali Ghadirzadeh, Mårten Björkman and Danica Kragic

Abstract: Virtual try-on (VTON) eliminates the need for in-store trying of garments by enabling shoppers to wear clothes digitally. For successful VTON, shoppers must encounter a try-on experience on par with in-store trying. We can improve the VTON experience by providing a complete picture of the garment using a 3D visual presentation in a variety of body postures. Prior VTON solutions show promising results in generating such 3D presentations but have never been evaluated in multi-pose settings. Multi-pose 3D VTON is particularly challenging as it often involves tedious 3D data collection to cover a wide variety of body postures. In this paper, we aim to develop a multi-pose 3D VTON that can be trained without the need to construct such a dataset. Our framework aligns in-shop clothes to the desired garment on the target pose by optimizing a consistency loss. We address the problem of generating fine details of clothes in different postures by incorporating multi-scale feature maps. Besides, we propose a coarse-to-fine architecture to remove artifacts inherent in 3D visual presentation. Our empirical results show that the proposed method is capable of generating 3D presentations in different body postures while outperforming existing methods in fitting fine details of the garment.
Download

Paper Nr: 93
Title:

Two-Model-Based Online Hand Gesture Recognition from Skeleton Data

Authors:

Zorana Doždor, Tomislav Hrkać and Zoran Kalafatić

Abstract: Hand gesture recognition from skeleton data has recently gained popularity due to the broad areas of application and the availability of adequate input devices. However, before utilising this technology in real-world conditions there are still many challenges left to overcome. A major challenge is robust gesture localization – estimating the beginning and the end of a gesture in online conditions. We propose an online gesture detection system based on two models – one for gesture localization and the other for gesture classification. This approach is tested and compared against the one-model approach often found in the literature. The system is evaluated on the recent SHREC challenge, which offers datasets for online gesture detection. Results show the benefits of distributing the tasks of localization and recognition instead of using one model for both. The proposed system obtains state-of-the-art results on the SHREC gesture detection dataset.
Download

Paper Nr: 101
Title:

Maritime Surveillance by Multiple Data Fusion: An Application Based on Deep Learning Object Detection, AIS Data and Geofencing

Authors:

Sergio Ballines-Barrera, Leopoldo López, Daniel Santana-Cedrés and Nelson Monzón

Abstract: Marine traffic represents one of the critical points in coastal monitoring. This task has been eased by the development of Automatic Identification Systems (AIS), which allow ship recognition. However, AIS technology is not mandatory for all vessels, so alternative techniques are needed to identify and track them. In this paper, we present the integration of several technologies. First, we perform ship detection by using different camera-based approaches, depending on the time of day (daytime or nighttime). From this detection, we estimate the vessel’s georeferenced position. Secondly, this estimation is combined with the information provided by AIS devices. We obtain a correspondence between the scene and the AIS data, and we also detect ships without VHF transmitters. Together with a geofencing technique, we introduce a solution that fuses data from different sources, providing useful information for decision-making regarding the presence of vessels in near-shore locations.
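A minimal geofencing sketch using Shapely's point-in-polygon test on an estimated or AIS-reported position; the zone coordinates are illustrative:

```python
from shapely.geometry import Point, Polygon

# Illustrative near-shore geofence defined by (longitude, latitude) vertices.
restricted_zone = Polygon([(-15.45, 28.14), (-15.42, 28.14),
                           (-15.42, 28.11), (-15.45, 28.11)])

def vessel_alert(lon, lat):
    """Return True when a vessel position falls inside the geofenced area."""
    return restricted_zone.contains(Point(lon, lat))

print(vessel_alert(-15.43, 28.12))  # True: inside the zone
```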
Download

Paper Nr: 106
Title:

Automatic Fracture Detection and Characterization in Borehole Images Using Deep Learning-Based Semantic Segmentation

Authors:

Andrei Baraian, Vili Kellokumpu, Räty Tomi and Leena Kallio

Abstract: Fracture analysis is one of the key investigations that needs to be carried out on borehole logs. Identifying fractures, as well as other similar features (like breakouts or foliations), is essential for characterizing the reservoir where the drilling took place. However, identifying and characterizing fractures from borehole images is a very time- and resource-consuming task that requires extensive knowledge from geological experts. For this reason, developing semi-automated or automated tools would facilitate and increase the productivity of fracture analysis, since even for one reservoir experts need to analyze and interpret hundreds of meters of borehole images. This paper presents a deep learning-based approach for automatic fracture detection and characterization in borehole images, relying on a state-of-the-art convolutional neural network for accurate semantic segmentation of fractures. The target images consist of color borehole images, as opposed to acoustic or drill-core images, and real-world data is used both for training the deep learning model and for testing the whole system. The system is evaluated using multiple metrics, and its final outputs are the parameters of the sinusoids that define the predicted fractures.
Download

Paper Nr: 118
Title:

TrichANet: An Attentive Network for Trichogramma Classification

Authors:

Agniv Chatterjee, Snehashis Majhi, Vincent Calcagno and François Brémond

Abstract: Trichogramma wasp classification has significant applications in agricultural research, owing to the massive production and use of these wasps as a bio-control agent in cropping. However, classifying these tiny species is challenging due to two factors: (i) detecting the tiny wasps, which are barely visible to the naked eye, and (ii) limited inter-species discriminative visual features. To combat this, we propose a robust method to detect and classify the wasps from high-resolution images. The proposed method is enabled by a trich detection module that can be plugged into any competitive object detector for improved wasp detection. Further, we propose a multi-scale attention block to encode the inter-species discriminative representation by exploiting the coarse- and fine-level morphological structure of the wasps for enhanced wasp classification. The proposed method, along with its two key modules, is validated on an in-house Trich dataset, and a classification performance gain of 4% compared to recently reported baseline approaches outlines the robustness of our method. The code is available at https://github.com/ac5113/TrichANet.
Download

Paper Nr: 123
Title:

Synthesis for Dataset Augmentation of H&E Stained Images with Semantic Segmentation Masks

Authors:

Peter Sakalik, Lukas Hudec, Marek Jakab, Vanda Benešová and Ondrej Fabian

Abstract: The automatic analysis of medical images with deep learning methods relies heavily on the amount and quality of annotated data. Most diagnostic processes start with the segmentation and classification of cells. The manual annotation of a sufficient amount of high-variability data is extremely time-consuming, and semi-automatic methods may introduce an error bias. Another research option is to use deep learning generative models to synthesize medical data with annotations as an extension to real datasets. Enhancing training with synthetic data has been shown to improve the robustness and generalization of models used in industrial problems. This paper presents a deep learning-based approach to generate synthetic stained histological images with corresponding multi-class annotated masks, evaluated on cell semantic segmentation. We train conditional generative adversarial networks to synthesize a 6-channel image. The six channels consist of the histological image and the annotations concerning the cell and organ type specified in the input. We evaluated the impact of the synthetic data on training with the standard UNet network. We observe quantitative and qualitative changes in segmentation results from models trained on different distributions of real and synthetic data in the training batch.
Download

Paper Nr: 193
Title:

Printed Packaging Authentication: Similarity Metric Learning for Rotogravure Manufacture Process Identification

Authors:

Tetiana Yemelianenko, Alain Trémeau and Iuliia Tkachenko

Abstract: The number of medicine counterfeits increases each year due to the accessibility of printing devices and the weak protection of medicine blister foils. Medicine blisters are often produced using the rotogravure printing process. In this paper, we address the problems of rotogravure press identification and printed support identification using similarity metric learning. Both identification problems are difficult, as the impact of the printing press or of the printing support is minimal; moreover, classical techniques (for example, Pearson correlation) cannot identify the rotogravure press or the printing support used for the packaging production. We show that similarity metric learning can easily identify both the press and the printing support used. Additionally, we explore the possibility of using the proposed approach for packaging authentication.
Download

Paper Nr: 207
Title:

EFL-Net: An Efficient Lightweight Neural Network Architecture for Retinal Vessel Segmentation

Authors:

Nasrin Akbari and Amirali Baniasadi

Abstract: Accurate segmentation of retinal vessels is crucial for the timely diagnosis and treatment of conditions like diabetes and hypertension, which can prevent blindness. Deep learning algorithms have been successful in segmenting retinal vessels, but they often require a large number of parameters and computations. To address this, we propose an efficient and fast lightweight network (EFL-Net) for retinal blood vessel segmentation. EFL-Net includes the ResNet branches shuffle block (RBS block) and the Dilated Separable Down block (DSD block) to extract features at various granularities and enhance the network receptive field, respectively. These blocks are lightweight and can be easily integrated into existing CNN models. The model also uses PixelShuffle as an upsampling layer in the decoder, which has a higher capacity for learning features than deconvolution and interpolation approaches. The model was tested on the DRIVE and CHASEDB1 datasets and achieved excellent results with fewer parameters compared to other networks such as LadderNet and DCU-Net. EFL-Net achieved F1 measures of 0.8351 and 0.8242 on the CHASEDB1 and DRIVE datasets, respectively, with 0.340 million parameters, compared to 1.5 million for LadderNet and 1 million for DCU-Net.
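A minimal PyTorch sketch of a PixelShuffle (sub-pixel convolution) upsampling block of the kind used in the decoder; the channel sizes are illustrative, not EFL-Net's actual configuration:

```python
import torch
import torch.nn as nn

class PixelShuffleUp(nn.Module):
    """2x upsampling via sub-pixel convolution instead of deconvolution
    or interpolation."""

    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

x = torch.randn(1, 64, 32, 32)
print(PixelShuffleUp(64, 32)(x).shape)   # torch.Size([1, 32, 64, 64])
```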
Download

Paper Nr: 257
Title:

Colonoscopic Polyp Detection with Deep Learning Assist

Authors:

Alexandre Neto, Diogo Couto, Miguel Coimbra and António Cunha

Abstract: Colorectal cancer is the third most common cancer and the second cause of cancer-related deaths in the world. Colonoscopic surveillance is extremely important to find cancer precursors such as adenomas or serrated polyps. Identifying small or flat polyps can be challenging during colonoscopy and is highly dependent on the colonoscopist’s skills. Deep learning algorithms can improve the polyp detection rate and consequently help reduce physician subjectiveness and operation errors. This study aims to compare the YOLO object detection architecture with self-attention models. In this study, the Kvasir-SEG polyp dataset, composed of 1000 annotated colonoscopy still images, was used to train (700 images) and validate (300 images) the performance of polyp detection algorithms. Well-established architectures such as YOLOv4 and different YOLOv5 models were compared with more recent algorithms that rely on self-attention mechanisms, namely the DETR model, to understand which technique can be more helpful and reliable in clinical practice. In the end, YOLOv5 proved to be the best-performing model for polyp detection with 0.81 mAP; however, DETR reached 0.80 mAP, showing its potential to reach similar performance to more well-established architectures.
Download

Paper Nr: 24
Title:

Detection of Microscopic Fungi and Yeast in Clinical Samples Using Fluorescence Microscopy and Deep Learning

Authors:

Jakub Paplhám, Vojtěch Franc and Daniela Lžičařová

Abstract: Early detection of yeast and filamentous fungi in clinical samples is critical in treating patients predisposed to severe infections caused by these organisms. The patients undergo regular screening, and the gathered samples are manually examined by trained personnel. This work uses deep neural networks to detect filamentous fungi and yeast in the clinical samples to simplify the work of the human operator by filtering out samples that are clearly negative and presenting the operator with only samples suspected of containing the contaminant. We propose data augmentation with Poisson inpainting and compare the model performance against expert and beginner-level humans. The method achieves human-level performance, theoretically reducing the amount of manual labor by 87%, given a true positive rate of 99% and incidence rate of 10%.
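A hedged sketch of a Poisson-blending style augmentation using OpenCV's seamlessClone, which pastes a contaminant patch into a clean field; the paper's Poisson inpainting procedure may differ, and the file names and paste location are illustrative:

```python
import cv2
import numpy as np

background = cv2.imread("negative_sample.png")   # clean microscopy field
patch = cv2.imread("fungus_patch.png")           # cropped contaminant patch
mask = 255 * np.ones(patch.shape, patch.dtype)   # blend the whole patch

h, w = background.shape[:2]
center = (w // 2, h // 2)                        # paste location (illustrative)

# Poisson (seamless) blending of the patch into the background image.
augmented = cv2.seamlessClone(patch, background, mask, center, cv2.NORMAL_CLONE)
```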
Download

Paper Nr: 71
Title:

AI-Powered Management of Identity Photos for Institutional Staff Directories

Authors:

Daniel Canedo, José Vieira, António Gonçalves and António R. Neves

Abstract: The recent developments in Deep Learning and Computer Vision algorithms allow the automation of several tasks which up until that point required the allocation of considerable human resources. One task that is lagging behind these developments is the management of identity photos for institutional staff directories, because it deals with sensitive information, namely the association of a photo with a person. The main objective of this work is to contribute to the automation of this process. This paper proposes several image processing algorithms to validate the submission of a new personal photo to the system, such as face detection, face recognition, face cropping, image quality assessment, head pose estimation, gaze estimation, blink detection, and sunglasses detection. These algorithms allow the verification of the submitted photo according to some predefined criteria. Generally, these criteria revolve around verifying that the face in the photo belongs to the person updating their photo, that the face is centered in the image, and that the photo has visually good quality, among others. A use case is presented based on the integration of the developed algorithms as a web service to be used by the image directory system of the University of Aveiro. The proposed service is called every time a collaborator tries to update their personal photo, and the result of the analysis determines whether the photo is valid and the personal profile is updated. The system is already in production and the results being obtained are very satisfactory, according to user feedback. Regarding the individual algorithms, the experimental results range from 92% to 100% accuracy, depending on the image processing algorithm being tested.
Download

Paper Nr: 88
Title:

FakeRecogna Anomaly: Fake News Detection in a New Brazilian Corpus

Authors:

Gabriel L. Garcia, Luis S. Afonso, Leandro A. Passos, Danilo S. Jodas, Kelton A. P. da Costa and João P. Papa

Abstract: The advances in technology have allowed digital content to be shared in a very short time and reach thousands of people. Fake news is one type of content shared among people, and it has a negative impact on our society. Therefore, its detection has become a research topic of great importance in the natural language processing and machine learning communities. Besides the techniques employed for detection, a good corpus is also important so that machine learning techniques can learn to differentiate between real and fake news. Corpora in Brazilian Portuguese can be found; however, they are either outdated or balanced, which does not reflect a real-life situation. This work presents a new, updated and imbalanced corpus for the detection of fake news, where detection can be treated as an anomaly detection problem. This work also evaluates the proposed corpus using classifiers designed for anomaly detection purposes.
Download

Paper Nr: 148
Title:

Investigating the Performance of Optimization Techniques on Deep Learning Models to Identify Dota2 Game Events

Authors:

Matheus P. Faria, Etienne S. Julia, Henrique C. Fernandes, Marcelo Zanchetta do Nascimento and Rita S. Julia

Abstract: Game logs are an important part of player experience analysis in the literature. They describe the major actions and events (related to the players or other elements) that affect the progress of a game. In most existing games (especially popular commercial games like FIFA, Dota2 and Valorant), access to these logs is typically restricted to the game’s developers. Deep Learning (DL) approaches have been proposed to perform game event classification from videos. However, retrieving relevant information about these game events (normally associated with actions performed by players) in real-time is still a challenge. Existing approaches require high computational power, which is an additional issue. In this sense, the present paper investigates a set of approaches that aim to reduce the computational cost of DL-based models - more specifically, Convolutional Neural Networks (CNN) based on Residual Nets architectures - through Genetic Algorithm and Bayesian Optimization. This investigation is carried out in the context of Dota2 game event classification. The comparative analysis showed that the models obtained herein achieved a classification performance as good as the state-of-the-art models for the Dota2 dataset, but with significantly fewer parameters. Thus, this work can help in the generation of optimized CNNs for real-time applications.
Download

Paper Nr: 161
Title:

Industrial Visual Defect Inspection of Electronic Components with Siamese Neural Network

Authors:

Warley Barbosa, Lucas Amaral, Tiago Vieira, Bruno Georgevich and Gustavo Melo

Abstract: We present a system focused on the visual inspection of Pin Through Hole (PTH) electronic components. The project was developed in partnership with a multinational Printed Circuit Board (PCB) manufacturing company, which requested a solution capable of operating adequately on unseen components not included in the initial image database used for model training. Traditionally, visual inspection has mostly been performed with pre-determined feature engineering, which is inadequate for a flexible solution. Hence, we used a one-shot-learning approach based on a Siamese Neural Network model trained on anchor-negative-positive triplets. Using a specifically designed web crawler, we collected a new and comprehensive database of electronic components, which is used in extensive experiments for hyperparameter tuning in the training and validation stages, achieving satisfactory performance. A web application is also presented, responsible for the management of operators, recipes, part numbers, etc. Hardware responsible for fixing the PCBs in place and a 4K camera were also developed and deployed in an industrial environment. The overall system is deployed in a PCB manufacturing plant and its functionality is demonstrated in a relevant scenario, reaching Technology Readiness Level (TRL) 6.
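A minimal sketch of triplet training with a shared embedding backbone, in the spirit of the described one-shot-learning setup; the backbone, embedding size and margin are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

# Shared embedding backbone (architecture assumed).
backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 128)

criterion = nn.TripletMarginLoss(margin=0.2)

def triplet_step(anchor, positive, negative):
    """One training step on a batch of anchor/positive/negative images."""
    za, zp, zn = backbone(anchor), backbone(positive), backbone(negative)
    return criterion(za, zp, zn)

loss = triplet_step(torch.randn(4, 3, 224, 224),
                    torch.randn(4, 3, 224, 224),
                    torch.randn(4, 3, 224, 224))
```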
Download

Paper Nr: 162
Title:

Finding Similar non-Collapsed Faces to Collapsed Faces Using Deep Learning Face Recognition

Authors:

Ashwinee Mehta, Maged Abdelaal, Moamen Sheba and Nic Herndon

Abstract: Face recognition is the ability to recognize a person’s face in a digital image. Common uses of face recognition include identity verification, automatically organizing raw photo libraries by person, tracking a specific person, counting unique people, and finding people with similar appearances. However, there is no systematic and accurate study on finding a similar non-collapsed face for a given collapsed face. In this paper we focus on the use case of finding people with similar appearances, which helps us find a similar non-collapsed face for a collapsed face for dental reconstruction. We used Python’s OpenCV for age and gender classification and face recognition for finding similar faces. Our results provide a set of similar images that can be used for reconstructing collapsed faces when creating dentures. Thus, with the help of a similar non-collapsed face, we can reconstruct a collapsed face for designing effective dentures.
Download

Paper Nr: 203
Title:

Towards a Robust Solution for the Supermarket Shelf Audit Problem

Authors:

Emmanuel F. Morán, Boris X. Vintimilla and Miguel A. Realpe

Abstract: The retail supermarket sector involves repetitive tasks performed through visual analysis by the store’s operators. Tasks such as checking the status of the shelves can comprise multiple sequential sub-tasks, each of which needs to be performed correctly. In recent years, there have been some attempts to create solutions for these tasks, but none constitutes a complete solution for retail. In this article, a first realistic approach to the supermarket shelf audit problem is proposed. For this, a workflow is presented that delivers a compliance level with respect to the store’s expected planogram.
Download

Paper Nr: 282
Title:

Handwriting Recognition in Down Syndrome Learners Using Deep Learning Methods

Authors:

Kirsty-Lee Walker and Tevin Moodley

Abstract: Handwriting is an essential skill for any learner to develop, as it can be seen as the gateway to further academic progression. The classification of handwriting in learners with Down syndrome is a relatively unexplored research area that has relied on manual techniques to monitor handwriting development. According to earlier studies, there is a gap in how Down syndrome learners receive feedback on handwriting assignments, which hinders their academic progression. This research paper employs three deep learning architectures, VGG16, InceptionV2, and Xception, as end-to-end methods to categorise handwriting as Down syndrome or non-Down syndrome. The InceptionV2 architecture correctly identifies an image with a model accuracy score of 99.62%. The results illustrate how the InceptionV2 architecture is able to accurately classify handwriting from learners with Down syndrome. This research paper advances the knowledge of which features differentiate a Down syndrome learner’s handwriting from a non-Down syndrome learner’s handwriting.
Download

Paper Nr: 288
Title:

Novel View Synthesis for Unseen Surgery Recordings

Authors:

Mana Masuda, Hideo Saito, Yoshifumi Takatsume and Hiroki Kajita

Abstract: Recording surgery in operating rooms is a crucial task for both medical education and evaluation of medical treatment. In this paper, we propose a method for visualizing surgical areas that are occluded by the heads or hands of medical professionals in various surgical scenes. To recover the occluded surgical areas, we utilize a surgery recording system equipped with multiple cameras embedded in the surgical lamp, with the aim of ensuring that at least one camera can capture the surgical area without occlusion. We propose the application of a transformer-based Neural Radiance Field (NeRF) model, originally proposed for normal scenes, to surgery scenes, and demonstrate through experimentation that it is feasible to generate occluded surgical areas. We believe this research has the potential to make our multi-camera recording system practical and useful for physicians.
Download

Area 5 - Motion, Tracking and Stereo Vision

Full Papers
Paper Nr: 25
Title:

Smoothed Normal Distribution Transform for Efficient Point Cloud Registration During Space Rendezvous

Authors:

Léo Renaut, Heike Frei and Andreas Nüchter

Abstract: Next to the iterative closest point (ICP) algorithm, the normal distribution transform (NDT) algorithm is becoming a second standard for 3D point cloud registration in mobile robotics. Both methods are effective; however, they require a sufficiently good initialization to converge successfully. In particular, the discontinuities in the NDT cost function can lead to difficulties when performing the optimization. In addition, when the size of the point clouds increases, performing the registration in real-time becomes challenging. This work introduces a Gaussian smoothing technique for the NDT map, which can be applied prior to the registration process. A kd-tree adaptation of the typical octree representation of NDT maps is also proposed. The performance of the modified smoothed NDT (S-NDT) algorithm for pairwise scan registration is assessed on two large-scale outdoor datasets and compared to the performance of a state-of-the-art ICP implementation. S-NDT is around four times faster and as robust as ICP while reaching similar precision. The algorithm is thereafter applied to the problem of LiDAR tracking of a spacecraft at close range in the context of space rendezvous, demonstrating its performance and applicability in real-time applications.
Download

Paper Nr: 229
Title:

On Computing Three-Dimensional Camera Motion from Optical Flow Detected in Two Consecutive Frames

Authors:

Norio Tagawa and Ming Yang

Abstract: This study deals with the problem of estimating camera motion from optical flow, i.e., the motion vectors between consecutive frames. The problem is formulated as a geometric fitting problem using the values of the depth map as nuisance parameters. It is a problem for which maximum likelihood estimation does not attain the Cramér–Rao lower bound, and it has long been known as the Neyman–Scott problem. One of the authors previously proposed an objective function for this problem that, when minimized, yields an estimator with less variance in the estimation error than that obtained by maximum likelihood estimation. The author also proposed linear and nonlinear optimization methods for minimizing the objective function. In this paper, we provide new knowledge about these methods and evaluate their effectiveness by examining which of them offer low estimation error and low computational cost in practice.
Download

Short Papers
Paper Nr: 60
Title:

DeNos22: A Pipeline to Learn Object Tracking Using Simulated Depth

Authors:

Dominik Penk, Maik Horn, Christoph Strohmeyer, Frank Bauer and Marc Stamminger

Abstract: We propose a novel pipeline to construct a learning based 6D object pose tracker, which is solely trained on synthetic depth images. The only required input is a (geometric) CAD model of the target object. Training data is synthesized by rendering stereo images of the CAD model, in front of a large variety of backgrounds generated by point-based re-renderings of prerecorded background scenes. Finally, depth from stereo is applied in order to mimic the behavior of depth sensors. The synthesized training input generalizes well to real-world scenes, but we further show how to improve real-world inference using robust estimators to counteract the errors introduced by the sim-to-real transfer. As a result, we show that our 6D pose trackers achieve state-of-the-art results without any annotated real-world data, solely based on a CAD-model of the target object.
Download

Paper Nr: 87
Title:

Flow-Based Visual-Inertial Odometry for Neuromorphic Vision Sensors Using non-Linear Optimization with Online Calibration

Authors:

Mahmoud Z. Khairallah, Abanob Soliman, Fabien Bonardi, David Roussel and Samia Bouchafa

Abstract: Neuromorphic vision sensors (also known as event-based cameras) operate according to detected variations in scene brightness intensity. Unlike conventional CCD/CMOS cameras, they provide information about the scene with a very high temporal resolution (on the order of microseconds) and a high dynamic range (exceeding 120 dB). These capabilities of neuromorphic vision sensors have led to their integration in various robotics applications such as visual odometry and SLAM. The way neuromorphic vision sensors trigger events is strongly consistent with the brightness constancy condition that describes optical flow. In this paper, we exploit optical flow information together with IMU readings to estimate a 6-DoF pose. Based on the proposed optical flow tracking method, we introduce an optimization scheme set up with a twist graph instead of a pose graph. Upon validation on high-quality simulated and real-world sequences, we show that our algorithm does not require any triangulation or key-frame selection and can be fine-tuned to meet real-time requirements according to the events’ frequency.
Download

Paper Nr: 187
Title:

3D Human Body Reconstruction from Head-Mounted Omnidirectional Camera and Light Sources

Authors:

Ritsuki Hasegawa, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a method for reconstructing the whole 3D shape of the human body from a single image taken by a head-mounted omnidirectional camera. In the image of a head-mounted camera, many parts of the human body are self-occluded, and it is very difficult to reconstruct the 3D shape of the human body including the invisible parts. The proposed method focuses on the shadows of the human body generated by the light sources in the scene and uses them to perform highly accurate 3D reconstruction of the whole human body, including the hidden parts.
Download

Paper Nr: 188
Title:

3D Reconstruction of Occluded Luminous Objects

Authors:

Akira Nagatsu, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a method for recovering the 3D shape and luminance distribution of an invisible object such as a human around a corner. The human body is a heat-generating object, so it does not emit visible light but emits far-infrared light. When a luminous object is around the corner, it cannot be observed directly, but the light emitted by the luminous object reflects on the floor or wall and reaches the observer. Since the luminous intensity of an object such as a human body surface is not uniform and unknown, its 3D reconstruction is not easy. In this paper, we propose a method to recover an occluded luminous object with non-uniform luminance distribution from changes in intensity patterns on the intermediate observation surface.
Download

Paper Nr: 221
Title:

System for 3D Acquisition and 3D Reconstruction Using Structured Light for Sewer Line Inspection

Authors:

Johannes Künzel, Darko Vehar, Rico Nestler, Karl-Heinz Franke, Anna Hilsmann and Peter Eisert

Abstract: The assessment of sewer pipe systems is a highly important, but at the same time cumbersome and error-prone task. We introduce an innovative system based on single-shot structured light modules that facilitates the detection and classification of spatial defects like jutting intrusions, spallings, or misaligned joints. This system creates highly accurate 3D measurements with sub-millimeter resolution of pipe surfaces and fuses them into a holistic 3D model. The benefit of such a holistic 3D model is twofold: on the one hand, it facilitates the accurate manual sewer pipe assessment, on the other, it simplifies the detection of defects in downstream automatic systems as it endows the input with highly accurate depth information. In this work, we provide an extensive overview of the system and give valuable insights into our design choices.
Download

Paper Nr: 242
Title:

3D Mapping of Indoor Parking Space Using Edge Consistency Census Transform Stereo Odometry

Authors:

Junesuk Lee and Soon-Yong Park

Abstract: In this paper, we propose a real-time 3D mapping system for indoor parking ramps and spaces. Visual odometry is calculated by applying the proposed Edge Consistency Census Transform (ECCT) stereo matching method. ECCT is robust to repeated patterns and reduces the drift errors in the direction normal to the ground caused by the Kanade-Lucas-Tomasi stereo matching of the VINS-FUSION algorithm. We propose a mobile mapping system that uses a stereo camera and a 2D lidar for dataset acquisition. The parking ramp and space dataset is obtained using this mobile mapping system and is reconstructed using the proposed system. For a quantitative comparison with the previous method, we report the error of the normal vector with respect to the ground of the parking space, and we also present 3D mapping results as qualitative results.
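For context, a basic census transform and Hamming-distance matching cost can be sketched as below; the edge-consistency extension that defines ECCT is not reproduced here, and border handling is simplified:

```python
import numpy as np

def census_transform(img, win=5):
    """Encode each pixel by comparing its neighbourhood to the centre value."""
    r = win // 2
    code = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            code = (code << np.uint64(1)) | (shifted < img).astype(np.uint64)
    return code

def hamming_cost(code_l, code_r):
    """Per-pixel Hamming distance between two census code maps."""
    diff = np.bitwise_xor(code_l, code_r)
    bits = np.unpackbits(diff[..., None].view(np.uint8), axis=-1)
    return bits.sum(axis=-1)
```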
Download

Paper Nr: 245
Title:

Real-Time Monitoring of Crowd Panic Based on Biometric and Spatiotemporal Data

Authors:

Ilias Lazarou, Anastasios L. Kesidis and Andreas Tsatsaris

Abstract: Panic is one of the most important indicators when it comes to Emergency Response Systems (ERS). Until now, panic events of any cause tend to be treated in a local manner based on traditional methods such as visual surveillance technologies and community engagement systems. This paper aims to present an approach for crowd panic event detection that takes advantage of wearable devices tracking real-time biometric data that are combined with location information. The real-time biometric and spatiotemporal nature of the data in the proposed approach is spatially unrestricted and information is flawlessly transmitted right from the source of the event, the human body. First, a machine learning classifier is demonstrated that successfully detects whether a subject has developed panic or not, based on its biometric and spatiotemporal data. Second, a real-time analysis model is proposed that uses the geospatial information of the labeled subjects to expose hidden patterns that possibly reveal crowd panic. The experimental results demonstrate the applicability of the proposed method in detecting and visualizing in real-time areas where an event of abnormal crowd behavior occurs.
Download

Paper Nr: 11
Title:

Upper Bound Tracker: A Multi-Animal Tracking Solution for Closed Laboratory Settings

Authors:

Alexander Dolokov, Niek Andresen, Katharina Hohlbaum, Christa Thöne-Reineke, Lars Lewejohann and Olaf Hellwich

Abstract: When tracking multiple identical objects or animals in video, many erroneous results are implausible right away because they ignore a fundamental truth about the scene: often the number of visible targets is bounded. This work introduces a multiple object pose estimation solution for the case where this upper bound is known. It dismisses all detections that would exceed the maximally permitted number and is able to re-identify an individual after an extended period of occlusion, including re-appearance in a different place. An example dataset with four freely interacting laboratory mice is additionally introduced, and the tracker's performance is demonstrated on it. The dataset contains various conditions, ranging from almost no opportunity for the mice to hide to a fairly cluttered environment. The approach is able to significantly reduce the occurrences of identity switches - the error when a known individual is suddenly identified as a different one - compared to other current solutions.
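A minimal sketch of the upper-bound rule, keeping at most the permitted number of detections per frame ranked by confidence; the detection structure is illustrative, and the re-identification logic is not shown:

```python
def enforce_upper_bound(detections, max_animals=4):
    """Keep at most `max_animals` detections per frame, ranked by confidence.

    detections: list of dicts with at least a 'score' key (illustrative
    structure; the full tracker additionally handles re-identification
    after occlusion).
    """
    ranked = sorted(detections, key=lambda d: d["score"], reverse=True)
    return ranked[:max_animals]
```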
Download

Paper Nr: 112
Title:

Multi-Camera 3D Pedestrian Tracking Using Graph Neural Networks

Authors:

Isabella de Andrade and João P. Lima

Abstract: Tracking the position of pedestrians over time through camera images is a rising computer vision research topic. In multi-camera settings, the research is even more recent. Many solutions use supervised neural networks to solve this problem, requiring much effort to annotate the data and time to train the network. This work aims to develop variations of pedestrian tracking algorithms that avoid the need for annotated data, and to compare the results obtained through accuracy metrics. Therefore, this work proposes an approach for tracking pedestrians in 3D space in multi-camera environments using the graph-inspired Message Passing Neural Network framework. We evaluated the solution using the WILDTRACK dataset and a generalizable detection method, reaching 77.1% MOTA when training with data obtained by a generalizable tracking algorithm, similar to current state-of-the-art accuracy. Moreover, our algorithm can track pedestrians at a rate of 40 fps, excluding detection time, which is twice as fast as the most accurate competing solution.
Download

Paper Nr: 231
Title:

Low-Cost 3D Reconstruction of Caves

Authors:

João M. Teixeira, Narjara Pimentel, Eder Barbier, Enrico Bernard, Veronica Teichrieb and Gimena Chaves

Abstract: Caves are spatially complex environments, frequently formed by different shapes and structures. Capturing a cave's spatial complexity is often necessary for different purposes – from geological to biological aspects – but difficult due to the challenging logistics, frequent absence of light, and the prohibitively expensive equipment usually required. Efficient and low-cost mapping systems could produce direct and indirect benefits for cave users and policy-makers, enabling everything from non-invasive research on fragile structures (like speleothems) to new forms of interactive experiences in tourism, for example. Here we present a low-cost solution that combines hardware and software to capture cave spatial information through RGB-D sensors and to later interpret the processed data. Our solution allows navigation in a 3D reconstructed cave and may be used to estimate volume and area information, frequently necessary for conservation or environmental licensing. We validated the proposed solution by partially reconstructing one cave in Northeastern Brazil. Although some challenges have to be overcome, our approach showed that it is possible to retrieve relevant information despite using low-cost RGB-D sensors.
Download