VISAPP 2020 Abstracts


Area 1 - Image and Video Formation, Preprocessing and Analysis

Full Papers
Paper Nr: 11
Title:

Trajectory Extraction and Deep Features for Classification of Liquid-gas Flow under the Context of Forced Oscillation

Authors:

Luong P. Nguyen, Julien Mille, Dominique Li, Donatello Conte and Nicolas Ragot

Abstract: Computer vision and deep learning techniques are increasingly applied to analyze experimental processes in engineering domains. In this paper, we propose a new dataset of liquid-gas flow videos captured from a mechanical model simulating a cooling gallery of an automobile engine, through forced oscillations. The analysis of this dataset is of interest to the fluid-mechanics field to validate the simulation environment. From a computer vision point of view, it provides a new dynamic texture dataset with challenging tasks, since liquid and gas keep changing constantly and the form of the liquid-gas flow is closely related to the external environment. In particular, predicting the rotation velocity of the engine corresponding to liquid-gas movements is a first step before precise analysis of flow patterns and of their trajectories. The paper also provides an experimental analysis showing that such rotation velocity can be hard to predict accurately. It can be achieved using deep learning approaches, but not with state-of-the-art methods dedicated to trajectory analysis. We also show that a preprocessing step with differences of Gaussians (DoG) over multiple scales as input to deep neural networks is mandatory to obtain satisfying results, up to 81.39% on the test set. This study opens an exploratory field for complex tasks in dynamic texture analysis, such as trajectory analysis of heterogeneous masses.
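
The multi-scale DoG preprocessing described above can be sketched in a few lines of Python; this is a minimal illustration (the scale values are assumed, not taken from the paper):

    import cv2
    import numpy as np

    def dog_stack(gray, sigmas=(1.0, 2.0, 4.0, 8.0)):
        # Blur at several scales, then take differences of adjacent scales;
        # the resulting maps are stacked as channels of the network input.
        gray = gray.astype(np.float32)
        blurred = [cv2.GaussianBlur(gray, (0, 0), s) for s in sigmas]
        dogs = [blurred[i] - blurred[i + 1] for i in range(len(blurred) - 1)]
        return np.stack(dogs, axis=-1)  # H x W x (len(sigmas) - 1)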

Paper Nr: 16
Title:

3DSAL: An Efficient 3D-CNN Architecture for Video Saliency Prediction

Authors:

Yasser D. Djilali, Mohamed Sayah, Kevin McGuinness and Noel E. O’Connor

Abstract: In this paper, we propose a novel 3D CNN architecture that enables us to train an effective video saliency prediction model. The model is designed to capture important motion information using multiple adjacent frames. Our model performs a cubic convolution on a set of consecutive frames to extract spatio-temporal features. This enables us to predict the saliency map for any given frame using past frames. We comprehensively investigate the performance of our model with respect to state-of-the-art video saliency models. Experimental results on three large-scale datasets, DHF1K, UCF-SPORTS and DAVIS, demonstrate the competitiveness of our approach.
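
A cubic (3D) convolution over a short clip of consecutive frames, the core operation described above, can be sketched in PyTorch; this is a generic illustration, not the 3DSAL architecture itself:

    import torch
    import torch.nn as nn

    class SpatioTemporalBlock(nn.Module):
        # One 3D (cubic) convolution over a clip of consecutive frames.
        def __init__(self, in_ch=3, out_ch=32):
            super().__init__()
            self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, clip):  # clip: (batch, channels, time, height, width)
            return self.relu(self.conv(clip))

    # 8 consecutive RGB frames of size 112x112:
    features = SpatioTemporalBlock()(torch.randn(1, 3, 8, 112, 112))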

Paper Nr: 60
Title:

Learn to See by Events: Color Frame Synthesis from Event and RGB Cameras

Authors:

Stefano Pini, Guido Borghi and Roberto Vezzani

Abstract: Event cameras are biologically-inspired sensors that gather the temporal evolution of the scene. They capture pixel-wise brightness variations and output a corresponding stream of asynchronous events. Despite having multiple advantages with respect to traditional cameras, their use is partially prevented by the limited applicability of traditional data processing and vision algorithms. To this end, we present a framework which exploits the output stream of event cameras to synthesize RGB frames, relying on an initial or periodic set of color key-frames and the sequence of intermediate events. Unlike existing work, we propose a deep learning-based frame synthesis method, consisting of an adversarial architecture combined with a recurrent module. Qualitative results and quantitative per-pixel, perceptual, and semantic evaluation on four public datasets confirm the quality of the synthesized images.

Paper Nr: 90
Title:

Dual Single Pixel Imaging in SWIR using Compressed Sensing

Authors:

Martin Oja, Sebastian Olsson, Carl Brännlund, Andreas Brorsson, David Bergström and David Gustafsson

Abstract: In this paper, we present a dual Single Pixel Camera (SPC) operating in the Short Wave InfraRed (SWIR) spectral range that reconstructs high resolution images from an ensemble of compressed measurements. The SWIR spectrum provides significant benefits in many applications due to its night vision capabilities and its ability to penetrate smoke and fog. Walsh-Hadamard matrices are used for generating pseudo-random measurements, which speeds up the reconstruction and enables reconstruction of high resolution images. Total variation regularization is used for finding a sparse solution in the gradient space. The detectors have been fitted with analog filters and amplification in order to capture scenes in low light. A number of outdoor scenes with varying illumination have been collected using the dual single pixel sensor. Visual inspection of the reconstructed SWIR images indicates that most scenes and objects can be identified at a lower subsampling ratio (SR) than with a single detector setup. The image quality is consistently better than with one detector, with similar results achieved with fewer samples or better results with the same number of samples. We also present measurements of moving objects in the scene and movements in the SPC unit, and compare the results between single and dual detectors.

Paper Nr: 98
Title:

Enhancing Deep Spectral Super-resolution from RGB Images by Enforcing the Metameric Constraint

Authors:

Tarek Stiebel, Philipp Seltsam and Dorit Merhof

Abstract: The task of spectral signal reconstruction from RGB images requires solving a heavily underconstrained set of equations. In recent work, deep learning has been applied to solve this inherently difficult problem. Based on a given training set of corresponding RGB images and spectral images, a neural network is trained to learn an optimal end-to-end mapping. However, in such an approach no additional knowledge is incorporated into the network's prediction. We propose and analyze methods for incorporating prior knowledge based on the idea that, when reprojecting any reconstructed spectrum into the camera RGB space, it must (ideally) be identical to the originally measured camera signal. It is therefore enforced that every reconstruction is at least a metamer of the ideal spectrum with respect to the observed signal and observer. This is the one major constraint that any reconstruction should fulfil to be physically plausible, yet it has been neglected so far.
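
The metameric constraint amounts to a simple reprojection penalty; a minimal PyTorch sketch, assuming a known camera sensitivity matrix S, could read:

    import torch

    def metamer_loss(spec_pred, rgb_meas, S):
        # spec_pred: (B, K) reconstructed spectra
        # rgb_meas:  (B, 3) measured camera signals
        # S:         (K, 3) camera spectral sensitivities (assumed known)
        rgb_reproj = spec_pred @ S  # reproject spectra into camera RGB space
        return torch.mean((rgb_reproj - rgb_meas) ** 2)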

Paper Nr: 126
Title:

Fast Local LUT Upsampling

Authors:

Hiroshi Tajima, Teppei Tsubokawa, Yoshihiro Maeda and Norishige Fukushima

Abstract: Edge-preserving filters have been used in various applications in image processing. As the number of pixels in digital cameras increases, the computational cost becomes higher, since the order of these filters depends on the image size. There are several acceleration approaches for edge-preserving filtering; however, most approaches reduce the dependency of the processing time on the filtering kernel size. In this paper, we propose a method to accelerate edge-preserving filters for high-resolution images. The method subsamples an input image and then performs the edge-preserving filtering on the subsampled image. Our method then upsamples the subsampled image using the high-resolution input image as guidance. For this upsampling, we generate per-pixel LUTs for high-precision upsampling. Experimental results show that the proposed method has higher performance than the conventional approaches.

Paper Nr: 136
Title:

Reflective Surface Reconstruction from Inverse Deflectometric Measurements

Authors:

Dominik Penk, Roman Sturm, Lars Seifert, Marc Stamminger and Günther Greiner

Abstract: Reconstructing reflective surfaces is a difficult task since most algorithms rely on photometric consistency between multiple views of the target object. However, specular reflections are highly view dependent and thus violate this assumption. Previous work therefore often incorporates additional information, like polarization or the distortion of a known pattern, to perform specular surface reconstruction. We present a novel analysis-by-synthesis approach that defines an optimization problem using samples directly on the reconstructed surface. Based on this framework we describe two different setups for reconstruction: one using a line laser to create a reflection pattern, and a second one that uses point measurements to provide ray-measurement correspondences, achieving improved accuracy.

Paper Nr: 180
Title:

Image Restoration using Plug-and-Play CNN MAP Denoisers

Authors:

Siavash Bigdeli, David Honzátko, Sabine Süsstrunk and L. A. Dunbar

Abstract: Plug-and-play denoisers can be used to perform generic image restoration tasks independent of the degradation type. These methods build on the fact that the Maximum a Posteriori (MAP) optimization can be solved using smaller sub-problems, including a MAP denoising optimization. We present the first end-to-end approach to MAP estimation for image denoising using deep neural networks. We show that our method is guaranteed to minimize the MAP denoising objective, which is then used in an optimization algorithm for generic image restoration. We provide a theoretical analysis of our approach and show the quantitative performance of our method in several experiments. Our experimental results show that the proposed method can achieve 70x faster performance compared to the state-of-the-art, while maintaining the theoretical perspective of MAP.
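
How a plug-and-play denoiser enters a generic restoration loop can be pictured with a half-quadratic-splitting-style sketch; this is a simplified illustration with assumed operators (degrade, degrade_T) and step sizes, not the authors' algorithm:

    import numpy as np

    def pnp_restore(y, degrade, degrade_T, denoise, rho=0.5, step=0.1, iters=30):
        # Alternate a gradient step on the data term with a denoising step
        # that plays the role of the (MAP) prior.
        x = degrade_T(y)
        z = x.copy()
        for _ in range(iters):
            # gradient step on ||degrade(x) - y||^2 + rho * ||x - z||^2
            x = x - step * (degrade_T(degrade(x) - y) + rho * (x - z))
            z = denoise(x)  # plug-and-play denoiser acts as the prior
        return x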

Paper Nr: 194
Title:

Configural Representation of Facial Action Units for Spontaneous Facial Expression Recognition in the Wild

Authors:

Nazil Perveen and Chalavadi K. Mohan

Abstract: In this paper, we propose an approach for spontaneous expression recognition in the wild using a configural representation of facial action units. Since not all configural features contribute to the formation of facial expressions, we consider configural features from only those facial regions where significant movement is observed. These chosen configural features are used to identify the relevant facial action units, which are combined to recognize facial expressions. Such combinational rules are also known as a coding system. However, since the existing coding systems incur significant overlap among facial action units across expressions, we propose to use a coding system based on subjective interpretation of the expressions to reduce the overlap between facial action units, which leads to better recognition performance when recognizing expressions. The proposed approach is evaluated for various facial expression recognition tasks on different datasets: (a) expression recognition in controlled environments on two benchmark datasets, CK+ and JAFFE, (b) spontaneous expression recognition on two wild datasets, SFEW and AFEW, (c) laughter localization on the MAHNOB laughter dataset, and (d) recognizing posed and spontaneous smiles on the UVA-NEMO smile dataset.

Paper Nr: 249
Title:

Two-step Multi-spectral Registration Via Key-point Detector and Gradient Similarity: Application to Agronomic Scenes for Proxy-sensing

Authors:

Jehan-Antoine Vayssade, Gawain Jones, Jean-Noel Paoli and Christelle Gee

Abstract: The potential of multi-spectral images is growing rapidly in precision agriculture and currently relies on the use of multi-sensor cameras. However, their development usually targets aerial applications, and their parameters are optimized for high-altitude acquisition by drone (UAV ≈ 50 meters) to ensure surface coverage and reduce technical problems. With the recent emergence of terrestrial robots (UGV), their use is being diverted to nearby agronomic applications, making it possible to explore new agronomic applications and to maximize the extraction of specific traits (spectral indices, shape, texture, etc.), which requires high spatial resolution. The problem with these cameras is that the sensors are not aligned and the manufacturers' methods are not suitable for close-field acquisition, resulting in offsets between spectral images and degrading the quality of the extractable information. We therefore need a solution to accurately align images under such conditions. In this study we propose a two-step method applied to the six-band Airphen multi-sensor camera with (i) affine correction using pre-calibrated matrices at different heights, where the closest transformation can be selected via the internal GPS, and (ii) perspective correction to refine the previous one, using key-point matching between enhanced gradients of each spectral band. Nine key-point detection algorithms (ORB, GFTT, AGAST, FAST, AKAZE, KAZE, BRISK, SURF, MSER) with three different parameter modalities were evaluated for speed and performance, and we also determined the best reference spectrum for each of them. The results show that GFTT is the most suitable method for key-point extraction using our enhanced gradients, and its best spectral reference was identified to be the band centered on 570 nm. Without any treatment the initial error is about 62 px; with our method, the remaining residual error is less than 1 px, whereas the manufacturer's method introduces distortions and loss of information, with an estimated residual error of approximately 12 px.
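
Step (ii) can be pictured with an OpenCV sketch along the following lines; this is an illustration only, with LK tracking standing in for the paper's key-point matching and all parameter values assumed:

    import cv2
    import numpy as np

    def refine_band_alignment(ref_band, mov_band):
        # Enhanced gradient magnitude of each spectral band.
        def grad_mag(im):
            gx = cv2.Sobel(im, cv2.CV_32F, 1, 0)
            gy = cv2.Sobel(im, cv2.CV_32F, 0, 1)
            return cv2.magnitude(gx, gy)

        g_ref = cv2.normalize(grad_mag(ref_band), None, 0, 255,
                              cv2.NORM_MINMAX).astype(np.uint8)
        g_mov = cv2.normalize(grad_mag(mov_band), None, 0, 255,
                              cv2.NORM_MINMAX).astype(np.uint8)
        # GFTT key-points on the reference gradient, matched into the other band.
        pts = cv2.goodFeaturesToTrack(g_ref, maxCorners=500,
                                      qualityLevel=0.01, minDistance=7)
        pts2, status, _ = cv2.calcOpticalFlowPyrLK(g_ref, g_mov, pts, None)
        ok = status.ravel() == 1
        H, _ = cv2.findHomography(pts2[ok], pts[ok], cv2.RANSAC, 3.0)
        return cv2.warpPerspective(mov_band, H, ref_band.shape[::-1])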

Paper Nr: 261
Title:

Thermal Image Super-resolution: A Novel Architecture and Dataset

Authors:

Rafael E. Rivadeneira, Angel D. Sappa and Boris X. Vintimilla

Abstract: This paper proposes a novel CycleGAN architecture for thermal image super-resolution, together with a large dataset consisting of thermal images at different resolutions. The dataset has been acquired using three thermal cameras at different resolutions, which acquire images of the same scenario at the same time. The thermal cameras are mounted in a rig that minimizes the baseline distance in order to ease the registration problem. The proposed architecture is based on ResNet6 as a generator and PatchGAN as a discriminator. The novelty of the proposed unsupervised super-resolution training (CycleGAN) is made possible by the existence of the aforementioned thermal images, i.e., images of the same scenario at different resolutions. The proposed approach is evaluated on the dataset and compared with classical bicubic interpolation. The dataset and the network are available.

Paper Nr: 268
Title:

A Least Squares based Groupwise Image Registration Technique

Authors:

Nefeli Lamprinou, Nikolaos Nikolikos and Emmanouil Z. Psarakis

Abstract: Compared with pairwise registration, groupwise registration is capable of handling a large-scale population of images simultaneously in an unbiased way. In this work we improve upon the state-of-the-art pixel-level, Least-Squares (LS) based groupwise image registration methods. Specifically, we propose a new iterative algorithm that outperforms, in terms of computational cost, a recently introduced LS-based iterative congealing scheme. Namely, we reuse the particle system introduced in that work and, by imposing that its "center of mass" remain motionless during each iteration of the minimization process, we define in closed form a sequence of "centroid" images whose limit is the unknown "mean" image, thus solving the groupwise problem at a reduced computational cost. Moreover, the registration technique is properly adapted through the use of Self Quotient Images (SQI) in order to become capable of solving the groupwise registration of multimodal images. Since the proposed congealing technique is invariant to the size of the image set, it can be used to successfully solve the problem on large image sets with low complexity. The performance of the proposed technique appears very good in a series of experiments on the groupwise registration of face, unimodal, and multimodal magnetic resonance image sets.

Paper Nr: 288
Title:

Recovering Raindrop Removal Images under Heavy Rain

Authors:

Kosuke Matsumoto, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a new method for removing raindrops in images under heavy rain. When we drive in heavy rain, the raindrops attached to the windshield form a film and our visibility degrades drastically. In such situations, the existing raindrop removal methods cannot recover clear images, since these methods assume that the background scene is visible through the gaps between the raindrops, which no longer happens in heavy rain. Thus, in this paper we propose a new method for recovering raindrop-free images under heavy rain from sequential images by using a conditional GAN. The results of our experiments on real and synthetic images show that the proposed method outperforms the state-of-the-art raindrop removal method.

Paper Nr: 295
Title:

Polygonal Meshes of Highly Noisy Images based on a New Symmetric Thinning Algorithm with Theoretical Guarantees

Authors:

Mohammed A. Siddiqui and Vitaliy Kurlin

Abstract: Microscopic images of vortex fields are important for understanding phase transitions in superconductors. These optical images include noise of high and variable intensity, and hence are manually processed to extract numerical data from the underlying meshes. Current thinning and skeletonization algorithms struggle to find connected meshes in these noisy images and often output edge pixels with numerous gaps and superfluous branching points. We have developed a new symmetric thinning algorithm to extract 1-pixel-wide skeletons with theoretical guarantees from such highly noisy images. The resulting skeleton is converted into a polygonal mesh that has only polygonal edges at sub-pixel resolution. The experiments on over 100 real and 6250 synthetic images establish the state-of-the-art in extracting optimal meshes from highly noisy images.

Short Papers
Paper Nr: 49
Title:

User-controllable Multi-texture Synthesis with Generative Adversarial Networks

Authors:

Aibek Alanov, Max Kochurov, Denis Volkhonskiy, Daniil Yashkov, Evgeny Burnaev and Dmitry Vetrov

Abstract: We propose a novel multi-texture synthesis model based on generative adversarial networks (GANs) with a user-controllable mechanism. The user control ability allows one to explicitly specify the texture which should be generated by the model. This property follows from using an encoder part which learns a latent representation for each texture from the dataset. To ensure dataset coverage, we use an adversarial loss function that penalizes incorrect reproductions of a given texture. In experiments, we show that our model can learn descriptive texture manifolds for large datasets and from raw data such as a collection of high-resolution photos. We also show that our unsupervised learning pipeline may help segmentation models. Moreover, we apply our method to produce 3D textures and show that it outperforms existing baselines.

Paper Nr: 81
Title:

Slag Removal Path Estimation by Slag Distribution and Deep Learning

Authors:

Junesuk Lee, Geon-Tae Ahn, Byoung-Ju Yun and Soon-Yong Park

Abstract: In the steel manufacturing process, a de-slagging machine is used to remove slag floating on molten metal in a ladle. In general, the temperature of the floating slag on the surface of the molten metal is above 1,500℃. The process of removing such slag at high temperatures is dangerous and is only performed by trained human operators. In this paper, we propose a deep learning method for estimating the slag removal path to automate the slag removal task. We propose the idea of a slag distribution image structure (SDIS) combined with a deep learning model to estimate the removal path in an environment in which the flow of molten metal cannot be controlled. The SDIS is given as the input to the proposed deep learning model, which we train by imitating the removal task of experienced operators. We use both quantitative and qualitative analyses to evaluate the accuracy of the proposed method against that of the experienced operators.

Paper Nr: 88
Title:

A Hierarchical Loss for Semantic Segmentation

Authors:

Bruce R. Muller and William P. Smith

Abstract: We exploit knowledge of class hierarchies to aid the training of semantic segmentation convolutional neural networks. We do not modify the architecture of the network itself, but rather propose to compute a loss that is a summation of classification losses at different levels of class abstraction. This allows the network to differentiate serious errors (the wrong superclass) from minor errors (correct superclass but incorrect fine-scale class) and to learn visual features that are shared between classes that belong to the same superclass. The method is straightforward to implement (we provide a PyTorch implementation that can be used with any existing semantic segmentation network) and we show that it yields performance improvements (faster convergence, better mean Intersection over Union) relative to training with a flat class hierarchy and the same network architecture. We provide results for the Helen facial and Mapillary Vistas road-scene segmentation datasets.
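
The summation of losses across abstraction levels can be sketched in PyTorch as below; this is a two-level illustration with an assumed fine-to-superclass lookup table, not the authors' released implementation:

    import torch
    import torch.nn.functional as F

    def hierarchical_loss(logits, target_fine, fine_to_super):
        # logits: (B, C_fine, H, W); target_fine: (B, H, W)
        # fine_to_super: (C_fine,) long tensor mapping each fine class
        # to its superclass index.
        loss_fine = F.cross_entropy(logits, target_fine)
        probs = logits.softmax(dim=1)
        n_super = int(fine_to_super.max()) + 1
        super_probs = torch.zeros(logits.size(0), n_super, *logits.shape[2:],
                                  device=logits.device)
        super_probs.index_add_(1, fine_to_super, probs)  # pool fine classes
        loss_super = F.nll_loss(torch.log(super_probs + 1e-8),
                                fine_to_super[target_fine])
        return loss_fine + loss_super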

Paper Nr: 138
Title:

Video Summarization through Total Variation, Deep Semi-supervised Autoencoder and Clustering Algorithms

Authors:

Eden Pereira da Silva, Eliaquim M. Ramos, Leandro Tavares da Silva, Jaime S. Cardoso and Gilson A. Giraldi

Abstract: Video summarization is an important tool considering the amount of data to analyze. Techniques in this area aim to yield synthetic and useful visual abstractions of video contents. Hence, in this paper we present a new summarization algorithm, based on image features, which is composed of the following steps: (i) process the query video using the cosine similarity metric and total variation smoothing to identify classes in the query sequence; (ii) with this result, build a labeled training set of frames; (iii) generate the unlabeled training set composed of samples from the video database; (iv) train a deep semi-supervised autoencoder; (v) compute K-means for each video separately, in the encoder space, with the number of clusters set as a percentage of the video size; (vi) select key-frames in the K-means clusters to define the summaries. In this methodology, the query video is used to incorporate prior knowledge into the whole process through the obtained labeled data. Step (iii) aims to include unknown patterns useful for the summarization process. We evaluate the methodology using some videos from the OPV video database and compare the performance of our algorithm with VSum. The results indicate that the pipeline succeeded in the summarization, presenting an F-score superior to that of VSum.
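
Steps (v) and (vi) amount to clustering in the encoder space; a minimal scikit-learn sketch, with the cluster ratio assumed, could look like:

    import numpy as np
    from sklearn.cluster import KMeans

    def pick_keyframes(codes, ratio=0.05):
        # codes: (n_frames, d) encoder outputs for one video.
        k = max(1, int(ratio * len(codes)))
        km = KMeans(n_clusters=k, n_init=10).fit(codes)
        # Key-frame = frame closest to each cluster centroid.
        keys = [int(np.argmin(np.linalg.norm(codes - c, axis=1)))
                for c in km.cluster_centers_]
        return sorted(set(keys))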

Paper Nr: 143
Title:

CNN-based Deblurring of Terahertz Images

Authors:

Marina Ljubenović, Shabab Bazrafkan, Jan De Beenhouwer and Jan Sijbers

Abstract: The past decade has seen a rapid development of terahertz (THz) technology and imaging. One way of doing THz imaging is measuring the transmittance of a THz beam through the object. Although THz imaging is a useful tool in many applications, several effects of a THz beam are not fully addressed in the literature, such as reflection and refraction losses and the effects of the THz beam shape. A THz beam has a non-zero waist and therefore introduces blurring in transmittance projection images, which is addressed in the current work. We start by introducing THz time-domain images that represent 3D hyperspectral cubes and the artefacts present in these images. Furthermore, we formulate the beam shape effects removal as a deblurring problem and propose a novel approach to tackle it by first denoising the hyperspectral cube, followed by a band-by-band deblurring step using convolutional neural networks (CNNs). To the best of our knowledge, this is the first time that a CNN is used to reduce THz beam shape effects. Experiments on simulated THz images show superior results for the proposed method compared to conventional model-based deblurring methods.

Paper Nr: 162
Title:

An Empirical Evaluation of Cross-scene Crowd Counting Performance

Authors:

Rita Delussu, Lorenzo Putzu and Giorgio Fumera

Abstract: Crowd counting and density estimation are useful but also challenging tasks in many video surveillance systems, especially in cross-scene settings with dense crowds, if the target scene significantly differs from the ones used for training. This also holds for methods based on Convolutional Neural Networks (CNNs), which have recently boosted the performance of crowd counting systems but nevertheless require massive amounts of annotated and representative training data. As a consequence, when training data is scarce or not representative of deployment scenarios, CNNs too may suffer from over-fitting to a different extent, and may hardly generalise to images coming from different scenes. In this work, we focus on real-world, challenging application scenarios where no annotated crowd images from a given target scene are available, and evaluate the cross-scene effectiveness of several regression-based state-of-the-art crowd counting methods, including CNN-based ones, through extensive cross-data set experiments. Our results show that some of the existing CNN-based approaches are capable of generalising to target scenes which differ from the ones used for training in their background or lighting conditions, whereas their effectiveness considerably degrades under different perspective and scale.

Paper Nr: 173
Title:

ValidNet: A Deep Learning Network for Validation of Surface Registration

Authors:

Joy Mazumder, Mohsen Zand, Sheikh Ziauddin and Michael Greenspan

Abstract: This paper proposes a novel deep learning architecture called ValidNet to automatically validate 3D surface registration algorithms for object recognition and pose determination tasks. The performance of many tasks such as object detection mainly depends on the applied registration algorithms, which themselves are susceptible to local minima. Revealing this tendency and verifying the success of registration algorithms is a difficult task. We treat this as a classification problem, and propose a two-class classifier to distinguish clearly between true positive and false positive instances. Our proposed ValidNet deploys a shared MLP architecture which works on the raw and unordered numeric data of scene and model points. This network is able to perform the two fundamental tasks of feature extraction and similarity matching using the powerful capability of deep neural networks. Experiments on a large synthetic dataset show that the proposed method can effectively be used in the automatic validation of registration.

Paper Nr: 196
Title:

Corner Detection in Manifold-valued Images and in Vector Fields

Authors:

Aleksei Shestov and Mikhail Kumskov

Abstract: This paper is devoted to the problem of corner detection in manifold-valued images and in vector fields on manifolds. Our solution is a generalization of the Harris corner detector (C. Harris, 1988). As in the grayscale case, our algorithm is based on an estimation of the self-similarity of a point neighborhood. We define the self-similarity for the general cases and obtain approximations of it by the action of a bilinear form. This form can be viewed as a generalization of the structure tensor (M. Kass, 1987). The generalized structure tensor is then used as usual in the corner detection procedure. Finally, we describe future experiments: the algorithm will be tested on a chemical compound classification task.

Paper Nr: 199
Title:

Latent-space Laplacian Pyramids for Adversarial Representation Learning with 3D Point Clouds

Authors:

Vage Egiazarian, Savva Ignatyev, Alexey Artemov, Oleg Voynov, Andrey Kravchenko, Youyi Zheng, Luiz Velho and Evgeny Burnaev

Abstract: Constructing high-quality generative models for 3D shapes is a fundamental task in computer vision with diverse applications in geometry processing, engineering, and design. Despite the recent progress in deep generative modelling, synthesis of finely detailed 3D surfaces, such as high-resolution point clouds, from scratch has not been achieved with existing learning-based approaches. In this work, we propose to employ the latent-space Laplacian pyramid representation within a hierarchical generative model for 3D point clouds. We combine the latent-space GAN and Laplacian GAN architectures proposed in recent years to form a multi-scale model capable of generating 3D point clouds at increasing levels of detail. Our initial evaluation demonstrates that our model outperforms the existing generative models for 3D point clouds, emphasizing the need for an in-depth comparative study on the topic of multi-stage generative learning with point clouds.

Paper Nr: 207
Title:

Assessing the Adequability of FFT-based Methods on Registration of UAV-Multispectral Images

Authors:

Jocival D. Dias Junior, André R. Backes, Maurício C. Escarpinati, Leandro P. Silva, Breno S. Costa and Marcelo F. Avelar

Abstract: Precision farming has greatly benefited from new technologies over the years. The use of multispectral and hyperspectral sensors coupled to Unmanned Aerial Vehicles (UAV) has enabled farms to monitor crops, improve the use of resources and reduce costs. Despite being widely used, multispectral images present a natural misalignment among the various spectra due to the use of different sensors, and the registration of these images is a complex process. In this paper, we address the problem of multispectral image registration and present a modification of the framework proposed by (Yasir et al., 2018). Our modification generalizes this framework, originally proposed to work with keypoint-based methods, so that spectral domain methods (e.g. Phase Correlation) can be used in the registration process with high accuracy and shorter execution time.

Paper Nr: 219
Title:

NMF vs. ICA for Light Source Separation under AC Illumination

Authors:

Ruri Oya, Ryo Matsuoka and Takahiro Okabe

Abstract: Artificial light sources powered by an electric grid change their intensities in response to the grid's alternating current (AC). Their flicker is usually too fast to notice with the naked eye, but can be captured by using cameras with short exposure time settings. In this paper, we propose a method for light source separation under AC illumination on the basis of Blind Source Separation (BSS). Specifically, we show that light source separation reduces to matrix factorization, since the input images of a scene illuminated by multiple AC light sources are represented by linear combinations of basis images, each of which is the image of the scene illuminated by only one of the light sources, with coefficients equal to the corresponding light source intensities. We then make use of Non-negative Matrix Factorization (NMF), because both the pixel values of the basis images and the intensities of the light sources are non-negative. We experimentally confirmed that our method using NMF works better than Independent Component Analysis (ICA), and studied the performance of our method under various conditions: varying exposure times and noise levels.
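
The factorization view can be sketched directly with scikit-learn; this is an illustration assuming the number of sources k is known:

    import numpy as np
    from sklearn.decomposition import NMF

    def separate_sources(frames, k=2):
        # frames: (T, H, W) short-exposure images under k flickering AC sources.
        T, H, W = frames.shape
        V = frames.reshape(T, H * W)           # one image per row
        model = NMF(n_components=k, init='nndsvda', max_iter=500)
        intensities = model.fit_transform(V)   # (T, k) per-frame intensities
        basis_images = model.components_.reshape(k, H, W)
        return basis_images, intensities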

Paper Nr: 229
Title:

Channel-wise Aggregation with Self-correction Mechanism for Multi-center Multi-Organ Nuclei Segmentation in Whole Slide Imaging

Authors:

Mohamed Abdel-Nasser, Adel Saleh and Domenec Puig

Abstract: In the field of computational pathology, there is an essential need for accurate nuclei segmentation methods for performing different studies, such as cancer grading and cancer subtype classification. The ambiguous boundaries between different cell nuclei and other objects of similar appearance, besides the overlapping and clumped nuclei, may yield noise in the ground truth masks. To improve the segmentation results of cell nuclei in histopathological images, in this paper we propose a new technique for aggregating the channel maps of semantic segmentation models. This technique is integrated with a self-correction learning mechanism that can handle noisy ground truth. We show that the proposed nuclei segmentation method gives promising results with images of different organs (e.g., breast, bladder, and colon) collected from medical centers that use devices from different manufacturers and different stains. Our method reaches a new state of the art: we achieve an AJI score of 0.735 on the Multi-Organ Nuclei Segmentation benchmark, which outperforms the previous closest approaches.

Paper Nr: 232
Title:

3D Video Spatiotemporal Multiple Description Coding Considering Region of Interest

Authors:

Ehsan Rahimi and Chris Joslin

Abstract: 3D video applications are becoming more popular with observers as the means to receive and display 3D videos have recently become more available. Therefore, the demand for more processing power and bandwidth to stream and display 3D multimedia services, in either wired or wireless networks, is increasing. Since channel failure has always been an integral part of communication between a receiver and transmitters, robust video streaming remains a hot topic for researchers. Making streams more robust against failure requires more redundancy, which in turn hurts coding and compression efficiency; there is thus a trade-off between coding efficiency and stream robustness. Among the different methods of reliable video streaming, this paper introduces a new reliable 3D video streaming scheme using hybrid multiple description coding. The proposed multiple description coding creates 3D video descriptions by identifying interesting objects of the scene. To this end, a map of the region of interest is first extracted from the depth-map image with an algorithm less complex than available machine learning algorithms. Having identified the region of interest, the proposed hybrid multiple description coding algorithm creates the descriptions for the color video using the advantages of both spatial and temporal multiple description coding: first, a non-identical decimation method assigns more bandwidth to the identified objects; second, background quality is improved with temporal information. In this way, the proposed method provides better visual performance, as the human eye is more sensitive to objects than to pixels, and the background is reconstructed with higher quality, since it usually has low movement and temporal information is a better choice for estimating the lost information. The objective test results verify that the proposed method outperforms previous methods.

Paper Nr: 241
Title:

Synthetic Ground Truth for Presegmentation of Known Objects for Effortless Pose Estimation

Authors:

Frederik Haarslev, William K. Juel, Norbert Krüger and Leon Bodenhagen

Abstract: We present a method for generating synthetic ground truth for training segmentation networks that presegment point clouds in pose estimation problems. Our method replaces global pose estimation algorithms such as RANSAC, which require manual fine-tuning, with a robust CNN, without having to hand-label segmentation masks for the given object. The data is generated by blending cropped images of the objects with arbitrary backgrounds. We test the method in two scenarios, and show that networks trained on the generated data segment the objects with high accuracy, allowing them to be used in a pose estimation pipeline.
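
The blending step can be pictured with a short NumPy sketch, assuming RGBA object crops whose alpha channel doubles as the segmentation label:

    import numpy as np

    def blend_example(obj_rgba, background):
        # Paste the cropped object at a random location; alpha gives the mask.
        h, w = obj_rgba.shape[:2]
        H, W = background.shape[:2]
        y = np.random.randint(0, H - h)
        x = np.random.randint(0, W - w)
        alpha = obj_rgba[..., 3:4].astype(np.float32) / 255.0
        roi = background[y:y + h, x:x + w].astype(np.float32)
        background[y:y + h, x:x + w] = (alpha * obj_rgba[..., :3]
                                        + (1.0 - alpha) * roi).astype(np.uint8)
        mask = np.zeros((H, W), np.uint8)
        mask[y:y + h, x:x + w] = (alpha[..., 0] > 0.5).astype(np.uint8)
        return background, mask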

Paper Nr: 242
Title:

Automatic Estimation of Sphere Centers from Images of Calibrated Cameras

Authors:

Levente Hajder, Tekla Tóth and Zoltán Pusztai

Abstract: Calibration of devices with different modalities is a key problem in robotic vision. Regular spatial objects, such as planes, are frequently used for this task. This paper deals with the automatic detection of ellipses in camera images, as well as with the estimation of the 3D positions of the spheres corresponding to the detected 2D ellipses. We propose two novel methods to (i) detect an ellipse in camera images and (ii) estimate the spatial location of the corresponding sphere if its size is known. The algorithms are tested both quantitatively and qualitatively. They are applied to calibrating the sensor system of autonomous cars equipped with digital cameras, depth sensors and LiDAR devices.

Paper Nr: 244
Title:

Transfer Learning from Synthetic Data in the Camera Pose Estimation Problem

Authors:

Jorge L. Charco, Angel D. Sappa, Boris X. Vintimilla and Henry O. Velesaca

Abstract: This paper presents a novel Siamese network architecture, as a variant of ResNet-50, to estimate the relative camera pose in multi-view environments. In order to improve the performance of the proposed model, a transfer learning strategy based on synthetic images obtained from a virtual world is considered. The transfer learning consists of first training the network using pairs of images from the virtual-world scenario under different conditions (i.e., weather, illumination, objects, buildings, etc.); then, the learned weights of the network are transferred to the real case, where images from real-world scenarios are considered. Experimental results and comparisons with the state of the art show both improvements in relative pose estimation accuracy using the proposed model, and further improvements when the transfer learning strategy (from synthetic-world data to real-world data) is used to tackle the training limitation caused by the reduced number of real-image pairs in most public data sets.

Paper Nr: 250
Title:

Cracking Biometric Authentication Cryptosystems

Authors:

Maryam Lafkih

Abstract: Biometric systems are becoming an alternative solution to replace traditional authentication systems. However, security and privacy concerns about these systems arise from the direct storage and possible misuse of biometric information. In order to overcome these problems, biometric cryptosystems have been proposed as a template protection solution improving confidentiality and security. Biometric cryptosystems present a secret key mechanism where a secret key is used to overlap biometric data. Several approaches using biometric cryptosystems have been proposed; however, few published works give a detailed analysis of these systems and their security. In this paper we give a rigorous discussion of biometric cryptosystems, taking into account their security evaluation. Besides, a conceptual framework of different attacks on biometric cryptosystems is proposed. In addition, several measures that can be exploited to decrease the probability of such attacks are also presented.

Paper Nr: 251
Title:

Objects Detection from Digitized Herbarium Specimen based on Improved YOLO V3

Authors:

Abdelaziz Triki, Bassem Bouaziz, Walid Mahdi and Jitendra Gaikwad

Abstract: Automatic measurement of functional trait data from digitized herbarium specimen images is of great interest, as traditionally scientists extract such information manually, which is time-consuming and prone to errors. One challenging task in the automated measurement of functional traits from specimen images is the existence of other objects such as the scale-bar, color palette, specimen label, envelopes, bar-code and stamp, which are mostly placed at different locations on the herbarium mounting sheet and require a special detection method. To detect all these objects automatically, we train a model based on an improved YOLO V3 full-regression deep neural network architecture, which has gained clear advantages in both speed and accuracy by capturing deep, high-level features. We made some improvements to adapt YOLO V3 to detecting objects in digitized herbarium specimen images. A new feature map scale is added to the existing scales to improve the detection of small targets. At the same time, we obtain the fourth detection layer with a 4× up-sampled layer instead of 2×, to get a higher-resolution feature map from a deeper level. The experimental results indicate that our model performed better, with an mAP-50 of 93.2% compared to the 90.1% mean IoU obtained by the original YOLO V3 model on the test set.

Paper Nr: 254
Title:

Compact Early Vision Signal Analyzers in Neuromorphic Technology

Authors:

Valentina Baruzzi, Giacomo Indiveri and Silvio P. Sabatini

Abstract: Reproducing the dynamics of biological neural systems using mixed-signal analog/digital neuromorphic circuits makes these systems ideal platforms for implementing low-power bio-inspired devices for a wide range of application domains. Despite these principled assets, neuromorphic system design has to cope with the limited resources presently available in hardware. Here, different spiking networks were designed, tested in simulation, and implemented on the neuromorphic processor DYNAP-SE, to obtain silicon neurons that are tuned to visual stimuli oriented at specific angles and with specific spatial frequencies, provided by the DVS event camera. Recurrent clustered inhibition was successfully tested on spiking neural networks, both in simulation and on the DYNAP-SE board, to obtain neurons with highly structured Gabor-like receptive fields (RFs); these neurons are characterized by tuning curves that are sharper than or at least comparable to the ones obtained using equivalent feed-forward schemes, but require a significantly lower number of synapses. The resulting harmonic signal description provided by the proposed neuromorphic circuit could potentially be used for a complete characterization of the 2D local structure of the visual signal in terms of phase relationships across all the available oriented channels.

Paper Nr: 255
Title:

A Contrario Elliptical Arc, Circular Arc and Line Segment Detection

Authors:

Boshra Rajaei and Rafael G. von Gioi

Abstract: In this paper, we propose a joint elliptical arc, circular arc, and line segment detector based on the a contrario statistical approach. Our method is an extension of the ELSDc method, recently proposed for line segment and elliptical arc detection. The main contribution is a more general geometrical model, which allows the joint evaluation of the best combination of elliptical arcs, circular arcs, and line segments that corresponds to a given contour. Different interpretations in terms of these elements are tried for the whole contour, instead of locally as it is done in ELSDc. In addition, several minor improvements were performed to the heuristic algorithm used to propose candidates. The performance of the proposed method is compared to the original one on synthetic and real images.

Paper Nr: 262
Title:

A Novel Dispersion Covariance-guided One-Class Support Vector Machines

Authors:

Soumaya Nheri, Riadh Ksantini, Mohamed-Bécha Kaâniche and Adel Bouhoula

Abstract: In order to handle spherically distributed data in a proper manner, we intend to exploit subclass information. In the one-class classification setting, many recently proposed methods try to incorporate subclass information into the standard optimization problem. We presume that we should minimize the within-class variance with respect to subclass information, instead of minimizing the global variance. The Covariance-guided One-Class Support Vector Machine (COSVM) emphasizes the low-variance directions of the training dataset, which results in higher accuracy. However, COSVM does not handle multi-modal target class data. More precisely, it does not take advantage of target class subclass information. Therefore, to reduce the dispersion of the target data with respect to newly obtained subclass information, we express the within-class dispersion and incorporate it into the optimization problem of the COSVM. Thus, we introduce a novel variant of the COSVM classifier, namely Dispersion COSVM, that exploits subclass information in the kernel space in order to jointly minimize the dispersion within and between subclasses and improve classification performance. A comparison of our method to contemporary one-class classifiers on numerous real data sets clearly demonstrates its superiority in terms of classification performance.

Paper Nr: 266
Title:

Different Modal Stereo: Simultaneous Estimation of Stereo Image Disparity and Modality Translation

Authors:

Ryota Tanaka, Fumihiko Sakaue and Jun Sato

Abstract: We propose a stereo matching method for image pairs of different modalities. In this method, input images are taken from different viewpoints by cameras of different modalities, e.g., an RGB camera and an IR camera. Our proposed method estimates the disparity between the two images and simultaneously translates the modality of the input images into the other modality. To achieve this simultaneous estimation, we utilize two networks: a single-image disparity estimation network and a modality translation network. Both are based on neural networks, and we train them simultaneously. In this training, we focus on several consistencies between the different modal images; through these consistencies, the two networks are effectively trained. Furthermore, we utilize image synthesis optimization based on a conditional GAN, and the optimization provides quite good results. Several experimental results on open databases show that the proposed method can estimate disparity and translate the modality even if the modalities of the input image pair are different.

Paper Nr: 270
Title:

Single-shot Acquisition of Cylindrical Mesostructure Normals using Diffuse Illumination

Authors:

Inseung Hwang, Daniel S. Jeon and Min H. Kim

Abstract: Capturing high-quality surface normals is critical for acquiring the surface geometry of mesostructures, such as hair and metal wires, with high resolution. Existing image-based acquisition methods have assumed a specific type of surface reflectance. The shape-from-shading approach, a.k.a. photometric stereo, makes use of the shading information from a point light, assuming that surfaces are perfectly diffuse. The shape-from-specularity approach captures specular reflection densely, assuming that surfaces are overly smooth. These existing methods often fail, however, due to the difference between the presumed and the actual reflectance of real-world objects. Also, these existing methods require multiple images with different light vectors. In this work, we present a single-shot normal acquisition method, designed especially for cylindrical mesostructures on a near-flat geometry. We leverage diffuse illumination to eliminate the reflectance assumption. We then propose a local shape-from-intensity approach combined with local orientation detection. We conducted several experiments with synthetic and real objects. Quantitative and qualitative results validate that our method can capture surface normals of cylindrical mesostructures with high accuracy.

Paper Nr: 276
Title:

CAR-CNN: A Deep Residual Convolutional Neural Network for Compression Artifact Removal in Video Surveillance Systems

Authors:

Miloud Aqqa and Shishir K. Shah

Abstract: Video compression algorithms are pervasively applied at the camera level prior to video transmission due to bandwidth constraints, thereby reducing the quality of video available for video analytics. The resulting artifacts may lead to decreased performance of some core applications in video surveillance systems, such as object detection. To remove such distortions during video decoding, it is required to recover the original video frames from distorted ones. To this end, we present a fully convolutional residual network for compression artifact removal (CAR-CNN) that requires no prior knowledge of the noise distribution and is trained using a novel, differentiable loss function. To provide a baseline, we also trained our model by optimizing the Structural Similarity (SSIM) and Mean Squared Error (MSE). We test CAR-CNN on self-collected data, and we show that it can be applied as a pre-processing step for the object detection task in practical, non-idealized applications where quality distortions may be present.

Paper Nr: 279
Title:

Towards Keypoint Guided Self-Supervised Depth Estimation

Authors:

Kristijan Bartol, David Bojanić, Tomislav Petković, Tomislav Pribanić and Yago D. Donoso

Abstract: This paper proposes to use keypoints as a self-supervision clue for learning depth map estimation from a collection of input images. As ground truth depth from real images is difficult to obtain, many unsupervised and self-supervised approaches to depth estimation have been proposed. Most of these unsupervised approaches use depth map and ego-motion estimations to reproject the pixels from the current image into the adjacent image from the image collection. Depth and ego-motion estimations are evaluated based on pixel intensity differences between the corresponding original and reprojected pixels. Instead of reprojecting the individual pixels, we propose to first select image keypoints in both images and then reproject and compare the corresponding keypoints of the two images. The keypoints should describe the distinctive image features well. By learning a deep model with and without the keypoint extraction technique, we show that using the keypoints improves the depth estimation learning. We also propose some future directions for keypoint-guided learning of structure-from-motion problems.
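
The keypoint reprojection at the heart of such self-supervision can be sketched as follows, assuming a pinhole model with known intrinsics K and a predicted relative pose T_rel:

    import torch

    def reproject_keypoints(kp, depth, K, K_inv, T_rel):
        # kp: (N, 2) keypoint pixel coords; depth: (N,) predicted depths;
        # K, K_inv: (3, 3) camera intrinsics; T_rel: (4, 4) relative pose.
        N = kp.shape[0]
        ones = torch.ones(N, 1)
        pix = torch.cat([kp, ones], dim=1)      # homogeneous pixels (N, 3)
        rays = (K_inv @ pix.T) * depth          # back-project to 3D (3, N)
        pts = torch.cat([rays, ones.T], dim=0)  # homogeneous points (4, N)
        cam2 = (T_rel @ pts)[:3]                # move into the adjacent view
        proj = K @ cam2
        return (proj[:2] / proj[2:]).T          # (N, 2) reprojected keypoints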

Paper Nr: 292
Title:

Efficient One-to-One Pair Matching for 2-D and 3-D Edge Detection Evaluation

Authors:

Samuel Smith and Ian Williams

Abstract: This paper introduces a novel, efficient method of obtaining one-to-one correspondence matching for fast, accurate performance evaluation of edge detectors. The proposed Efficient Pairing Strategy (EPS) overcomes the computational cost limitations of the Hungarian algorithm, enabling a fast and accurate evaluation of 3-D data and large 2-D data sets. In this work, the accuracy of the EPS method is measured against the optimal Hungarian method across a data set of 124240 images, and is shown to produce accurate results with a Pearson pairwise correlation coefficient of 0.99. Additionally, the efficiency of the EPS method is compared against the fast Closest Distance Match (CDM), the Cost Scaling Assignment (CSA), and the commonly applied Pratt Figure of Merit (PFOM) methods. Analysis shows the EPS and CSA methods both produce cost scaling accuracy comparable to the Hungarian algorithm. However, the EPS method outperforms the CSA method in computational efficiency, achieving linear computation time comparable to the efficient sub-optimal methods. More generally, we make recommendations for using one-to-one correspondence matching over other methods in order to produce reliable performance scores across 2-D and 3-D image data.
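
For contrast with the optimal Hungarian assignment, a simple greedy one-to-one matcher (a baseline sketch in the spirit of CDM, not the paper's EPS) can be written as:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def greedy_one_to_one(det, gt, max_dist=5.0):
        # Repeatedly match the globally closest unmatched pair of
        # detected / ground-truth edge pixels.
        cost = cdist(det, gt)
        pairs = []
        while np.isfinite(cost).any() and cost.min() <= max_dist:
            i, j = np.unravel_index(np.argmin(cost), cost.shape)
            pairs.append((i, j))
            cost[i, :] = np.inf  # each point may be matched only once
            cost[:, j] = np.inf
        return pairs

    # Optimal (Hungarian) reference for comparison:
    # rows, cols = linear_sum_assignment(cdist(det, gt))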

Paper Nr: 308
Title:

The Choice of Feature Representation in Small-Scale MobileNet-Based Imbalanced Image Recognition

Authors:

Michał Koziarski, Bogusław Cyganek and Kazimierz Wiatr

Abstract: Data imbalance remains one of the most widespread challenges in contemporary machine learning. The presence of imbalanced data can affect the learning ability of most traditional classification algorithms. One of the strategies for handling data imbalance is data-level algorithms that modify the original data distribution. However, despite the number of existing methods, most are ill-suited for handling image data. One possible solution to this problem is using alternative feature representations, such as high-level features extracted from the convolutional layers of a neural network. In this paper we experimentally evaluate the possibility of using both the high-level features and the original image representation on several popular benchmark datasets with artificially introduced data imbalance. We examine the impact of different data-level algorithms on both strategies, and base the classification on the MobileNet neural architecture. The achieved results indicate that despite their theoretical advantages, high-level features extracted from a pretrained neural network result in worse performance than end-to-end image classification.

Paper Nr: 17
Title:

Gender Classification using the Gaze Distributions of Observers on Privacy-protected Training Images

Authors:

Michiko Inoue, Masashi Nishiyama and Yoshio Iwai

Abstract: We propose a method for classifying the gender of pedestrians using a classifier trained on images in which the head region is manipulated for privacy protection. Recently, manipulated training images containing pedestrians have been required to protect the privacy of the pedestrians; in particular, the head regions of the training images are manipulated. However, the accuracy of gender classification decreases when privacy-protected training images are used directly. To overcome this issue, we aim to use the human visual ability to correctly discriminate males from females even when the head regions have been manipulated. We measure the gaze distributions of observers who view pedestrian images and use them to pre-process the gender classifiers. The experimental results show that our method using gaze distributions improved the accuracy of gender classification when the head regions of the training images were manipulated with masking, pixelization, and blur for privacy protection.

Paper Nr: 19
Title:

Cooperative Stereo-Zoom Matching for Disparity Computation

Authors:

Bo-Yang Zhuo and Huei-Yung Lin

Abstract: This paper investigates a stereo matching approach that incorporates zooming information. Conventional stereo vision algorithms take one pair of images for correspondence matching, while our proposed method adopts two zoom-lens cameras to acquire multiple stereo image pairs with zoom changes. These image sequences are able to provide more accurate results for stereo matching algorithms. The new framework makes the rectified images compliant with the zoom characteristics by defining the relationship between the left and right images. Our approach can be integrated with existing stereo matching algorithms under some requirements and adjustments. In the experiments, we test the proposed framework on the 2014 Middlebury benchmark dataset and our own zoom image dataset. The results demonstrate the improvement in disparity computation achieved by our technique.

Paper Nr: 21
Title:

Real-time Object Detection and Tracking in Mixed Reality using Microsoft HoloLens

Authors:

Alessandro Farasin, Francesco Peciarolo, Marco Grangetto, Elena Gianaria and Paolo Garza

Abstract: This paper presents a mixed reality system that, using the sensors mounted on the Microsoft HoloLens headset and a cloud service, acquires and processes data in real time to detect and track different kinds of objects, and finally superimposes geographically coherent holographic texts on the detected objects. Such a goal has been achieved in spite of the intrinsic headset hardware limitations by performing part of the overall computation in an edge/cloud environment. In particular, the heavier object detection algorithms, based on Deep Neural Networks (DNNs), are executed in the cloud. At the same time, we compensate for cloud transmission and computation latencies by running light scene detection and object tracking on board the headset. The proposed pipeline allows meeting the real-time constraint by exploiting at the same time the power of state-of-the-art DNNs and the potential of the Microsoft HoloLens. This paper presents the design choices and describes the original algorithmic steps we devised to achieve real-time tracking in mixed reality. Finally, the proposed system is experimentally validated.

Paper Nr: 24
Title:

Systematic Comparison of ORB-SLAM2 and LDSO based on Varying Simulated Environmental Factors

Authors:

Adam Kalisz, Tong Ling, Florian Particke, Christian Hofmann and Jörn Thielecke

Abstract: Although the number of outstanding but highly complex Visual SLAM systems published as open source has increased in recent years, they often lack a systematic evaluation of their weaknesses and failure cases. This work systematically discusses the key differences of two state-of-the-art Visual SLAM algorithms, the indirect ORB-SLAM2 and the direct LDSO, through extensive experiments in varying environments. The evaluation is principally focused on the trajectory accuracy and robustness of the algorithms in specific situations. However, details about individual components used for the estimation of trajectories in both systems are also presented. In order to investigate the crucial aspects, a custom dataset was created in the 3D modeling software Blender to acquire the data for all experiments. The experimental results demonstrate the strengths and weaknesses of the systems. In particular, this research contributes insight into: 1. the influence of moving objects in a usually static scene; 2. how both systems react to periodically changing scene lighting, both local and global; 3. the role of initialization in the resistance to dynamic changes in the scene.
Download

Paper Nr: 29
Title:

The Recipe for Some Invariant Numbers and for a New Projective Invariant Feature Descriptor

Authors:

Raphael S. Evangelista and Leandro F. Fernandes

Abstract: The Computer Vision literature provides a range of techniques designed to detect and describe local features in images. The applicability of these techniques in visual tasks is directly related to the invariance of each kind of descriptor to a group of geometric transformations. To the best of our knowledge, there is no local feature descriptor solely based on single intensity images that is invariant to projective transformations. We present how to use existing monomials invariant to similarity, affine, and projective transformations to compute invariant numbers from junctions' geometry. In addition, we present a new junction-based invariant number and use it to propose a new local feature descriptor invariant to projective transformations in digital images.
Download

Paper Nr: 33
Title:

End-to-End Denoising of Dark Burst Images using Recurrent Fully Convolutional Networks

Authors:

Lan Ma, Di Zhao, Songnan Li and Dahai Yu

Abstract: When taking photos in dim-light environments, due to the small amount of light entering the sensor, the captured images are usually extremely dark, contain a great deal of noise, and have colors that cannot reflect those of the real scene. Under this condition, traditional single-image denoising methods have always failed to be effective. One common idea is to take multiple frames of the same scene to enhance the signal-to-noise ratio. This paper proposes a recurrent fully convolutional network (RFCN) to process burst photos taken under extremely low-light conditions and to obtain denoised images with improved brightness. Our model maps raw burst images directly to sRGB outputs, either to produce a single best image or to generate a multi-frame denoised image sequence. This process has proven capable of accomplishing the low-level task of denoising as well as the high-level tasks of color correction and enhancement, all performed end-to-end by our network. Our method has achieved better results than state-of-the-art methods. In addition, we have applied the model trained on one type of camera, without fine-tuning, to photos captured by different cameras and have obtained similar end-to-end enhancements.
Download

Paper Nr: 34
Title:

An Audio-Visual based Feature Level Fusion Approach Applied to Deception Detection

Authors:

Safa Chebbi and Sofia Ben Jebara

Abstract: Due to increasing security and antiterrorism requirements, research activities in the field of deception detection have been receiving considerable attention. For this reason, many studies dealing with deception detection have been developed, varying in terms of approaches, modalities, features and learning algorithms. Despite the wide range of approaches proposed for this task, there is to date no universal and effective system capable of identifying deception with a high recognition rate. In this paper, a feature-level fusion approach combining audio and video modalities is proposed to build an automated system that can help in deciding between honesty and lying. Thus, a high-dimensional feature vector, combining verbal features (72 pitch-based ones) and nonverbal ones related to facial expressions and body gestures, is extracted. Then, a feature-level fusion is applied in order to select the most relevant ones. A special interest is given to mutual information-based criteria, which are well adapted to combining continuous and binary features. Simulation results on a realistic database of suspicious-person interrogations achieved 97% deception/truth classification accuracy using 19 mixed audio/video features, which outperforms the state-of-the-art results.
Download

Paper Nr: 48
Title:

Individual Avatar Skeletal based Animation Feedback for Assisted Motion Control

Authors:

Lars Lehmann, Christian Wiede and Gangolf Hirtz

Abstract: In medical training therapy (MTT), the precise execution of the training exercises is of decisive importance for the success of the therapy. Currently, a therapist has to treat up to 15 patients simultaneously on an outpatient basis. Recently, an assistance system that can assess both the quantity and the quality of movement was developed. A feedback system models target-oriented recommendations for actions and communicates them directly to the patient. A hardware-accelerated visualisation system using OpenGL and GLSL shaders was realised to animate a real-time rendered 3D mesh connected to an extracted motion skeleton. An avatar visualises the error by colouring the body regions with traffic-light colours. The individualisation of the underlying three-dimensional avatar increases the patients' willingness to participate in the exercises, which they perform autonomously without the supervision of therapists.
Download

Paper Nr: 68
Title:

Removing Reflection from In-vehicle Camera Image

Authors:

Keisuke Inoue, Fumihiko Sakaue and Jun Sato

Abstract: When taking images with an in-vehicle camera, objects in the vehicle are often reflected on the windshield due to sunlight, and they appear in the camera image. Since these reflections cause malfunctions of autonomous driving systems, it is very important to remove them from in-vehicle camera images. Thus, in this paper we propose a method for separating reflections from background scenes and for generating images without reflections. Unlike existing reflection removal methods, our method conducts the signal separation and the motion field computation simultaneously, so that we can separate images without using edge information. The efficiency of the proposed method is demonstrated by comparison with existing state-of-the-art methods.
Download

Paper Nr: 74
Title:

A Hierarchical Approach for Indoor Action Recognition from New Infrared Sensor Preserving Anonymity

Authors:

Félix Polla, Hélène Laurent and Bruno Emile

Abstract: This article addresses action recognition from infrared video footage for indoor installations. The sensor we use, developed within the CoCAPS project in which our work takes place, has some peculiarities that make the acquired images very different from those of visible imagery. In this context, we propose a hierarchical model that takes an image set as input, segments it, constructs the corresponding motion history image (MHI), and extracts and selects characteristics that are then used by three classifiers for activity recognition purposes. The proposed model presents promising results, notably compared to other models from the deep learning literature. The dataset, designed for the CoCAPS project in collaboration with industrial partners, targets office situations. Seven action classes are concerned, namely: no action, restlessness, sitting down, standing up, turning on a seat, slow walking, fast walking.
Download

Paper Nr: 80
Title:

Fast Scene Text Detection with RT-LoG Operator and CNN

Authors:

Dinh C. Nguyen, Mathieu Delalandre, Donatello Conte and The A. Pham

Abstract: Text detection in scene images is of particular importance for computer-based applications. Text detection methods must be robust against variabilities and deformations of text entities. In addition, to be embedded into mobile devices, the methods have to be time efficient. In this paper, a keypoint grouping method is proposed, first applying the real-time Laplacian of Gaussian operator (RT-LoG) to detect keypoints. These keypoints are grouped to produce character patterns. The patterns are then filtered with a CNN model before being aggregated into words. Performance evaluation is discussed on the ICDAR2017 RRC-MLT dataset and Challenge 4 of the ICDAR2015 dataset. The results are given in terms of detection accuracy and processing time against different end-to-end systems in the literature. Our system achieves among the strongest detection accuracies while running at approximately 15.6 frames per second at HD resolution on a regular CPU architecture, making it one of the best candidates in the literature for guaranteeing the trade-off between accuracy and speed.
Download

Paper Nr: 86
Title:

Regularization in Higher-order Photometric Stereo Inspection for Non-Lambertian Reflections

Authors:

Doris Antensteiner and Svorad Štolc

Abstract: In this paper we present and compare two regularized higher-order photometric stereo approaches for the reconstruction of varying albedos and surface normals of non-Lambertian materials. We evaluate two different higher-order polynomial methods, which we additionally regularize with Tikhonov's method. The reconstruction of surface properties is essential for a vast number of industrial applications, such as the identification of surface defects, the analysis of security features or the detection of forged documents. For the reconstruction of Lambertian objects, lower-order models can be used to achieve an accurate representation, while higher-order models allow non-Lambertian behaviors to be described accurately. Qualitative and quantitative results on a ground truth dataset as well as on real-world data show that the use of a regularized higher-order polynomial model can significantly improve the surface normal and albedo reconstructions.
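
For readers unfamiliar with regularized photometric stereo, the sketch below shows the classical Lambertian base case with Tikhonov regularization; the lighting directions and the regularization weight are illustrative assumptions, and the paper's higher-order polynomial models extend this linear formulation.

    import numpy as np

    def photometric_stereo_tikhonov(I, L, lam=1e-2):
        # I: (k, n) intensities of n pixels under k lights; L: (k, 3) light directions.
        # Tikhonov-regularized least squares: G = (L^T L + lam*I3)^(-1) L^T I
        A = L.T @ L + lam * np.eye(3)
        G = np.linalg.solve(A, L.T @ I)          # (3, n) albedo-scaled normals
        albedo = np.linalg.norm(G, axis=0)
        normals = G / np.maximum(albedo, 1e-12)  # unit surface normals
        return normals, albedo

    # Toy usage with 4 synthetic lights and 5 pixels of a flat patch
    L = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1], [-1, -1, 1]], float)
    L /= np.linalg.norm(L, axis=1, keepdims=True)
    I = np.clip(L @ np.array([[0.0], [0.0], [1.0]]), 0, None) * 0.8 * np.ones((1, 5))
    normals, albedo = photometric_stereo_tikhonov(I, L)
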
Download

Paper Nr: 94
Title:

Location Estimation of an Urban Scene using Computer Vision Techniques

Authors:

Paul Gordan, Hanniel Boros and Ion Giosan

Abstract: The process of adding geographical identification data to an image is called geotagging and is important for a range of applications, from tourism to law enforcement agencies. The most convenient way of adding location metadata to an image is GPS geotagging. This article presents an alternative way of adding approximate location metadata to an urban scene image by finding similar images in a dataset of geotagged images. The matching is done by extracting image features and descriptors and matching them. The dataset consists of geotagged 360° panoramic images. We explored three methods of matching the images, each one an iteration of the previous one. The first method used only feature detection and matching with AKAZE and FLANN; the second performed image segmentation to provide a mask for extracting features and descriptors only from buildings; and the third preprocessed the dataset to obtain better accuracy. We managed to improve the accuracy of the system by 25%. Following an in-depth analysis, we present the results as well as future improvements.
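
As a hedged illustration of the first method's building block (not the authors' exact pipeline), the following sketch detects AKAZE features and matches them with FLANN; since AKAZE descriptors are binary, an LSH index is used, and the ratio threshold is an assumption.

    import cv2

    def match_akaze_flann(img1, img2, ratio=0.7):
        akaze = cv2.AKAZE_create()
        kp1, des1 = akaze.detectAndCompute(img1, None)
        kp2, des2 = akaze.detectAndCompute(img2, None)
        # FLANN with an LSH index, suitable for binary descriptors
        index_params = dict(algorithm=6,  # FLANN_INDEX_LSH
                            table_number=6, key_size=12, multi_probe_level=1)
        matcher = cv2.FlannBasedMatcher(index_params, dict(checks=50))
        good = []
        for pair in matcher.knnMatch(des1, des2, k=2):
            # Lowe's ratio test keeps only distinctive matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good.append(pair[0])
        return kp1, kp2, good
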
Download

Paper Nr: 95
Title:

Image Time Series Classification based on a Planar Spatio-temporal Data Representation

Authors:

Mohamed Chelali, Camille Kurtz, Anne Puissant and Nicole Vincent

Abstract: Image time series such as MRI functional sequences or Satellite Image Time Series (SITS) provide valuable information for the automatic analysis of complex patterns through time. A major issue when analyzing such data is considering their temporal and spatial dimensions at the same time. In this article we present a novel data representation that makes image time series compatible with classical deep learning models, such as Convolutional Neural Networks (CNN). The proposed approach is based on a novel planar representation of image time series that converts 2D + t data into 2D images without losing too much spatial or temporal information. In doing so, CNNs can learn at the same time the parameters of 2D filters involving temporal and spatial knowledge. Preliminary results in the remote sensing domain highlight the ability of our approach to discriminate complex agricultural land-cover classes from SITS.
Download

Paper Nr: 101
Title:

Learning-based Material Classification in X-ray Security Images

Authors:

Benedykciuk Emil, Denkowski Marcin and Dmitruk Krzysztof

Abstract: Although a large number of papers have been published on material classification in X-ray images, relatively few of them study raw X-ray security images with regard to material classification. This paper considers the task of classifying materials into four main types of organics and metals in images obtained from a Dual-Energy X-ray (DEXA) security scanner. We adopt well-known machine learning methods and conduct experiments to examine the effects of various combinations of data and algorithms on the generalization of the material classification problem. The methods giving the best results (Random Forests and Support Vector Machines) were used to predict the material at every pixel of the testing image. The results motivate a novel segmentation scheme based on multi-scale patch classification. This paper also introduces a new, open dataset of X-ray images (MDD) of various materials. The database contains over one million samples, labelled and stored in their raw, original 16-bit depth form.
Download

Paper Nr: 113
Title:

Thyroid Ultrasound Images Classification using the Shearlet Coefficients and the Generic Fourier Descriptor

Authors:

Noura Aboudi and Nawres Khlifa

Abstract: To improve the classification accuracy of a thyroid ultrasound imaging computer-aided diagnosis (CAD) system based on feature extraction, we used the Shearlet Transform (ST) to extract texture features and the Generic Fourier Descriptor (GFD) to extract a shape feature descriptor based on contour information. The ST supplies a rotation-invariant descriptor at various scales. The GFD descriptor is autonomous, robust, and has no redundant features. Then, we applied a feature selection method to the extracted shearlet descriptor to improve the performance metrics. Finally, we used objective metrics (sensitivity, specificity, and accuracy) to validate the performance of the proposed method. Experimentally, we applied our method to a public dataset using Support Vector Machine (SVM) and Random Forest (RF) classifiers. The obtained results prove the superiority of the proposed method.
Download

Paper Nr: 118
Title:

Image-quality Improvement of Omnidirectional Free-viewpoint Images by Generative Adversarial Networks

Authors:

Oto Takeuchi, Hidehiko Shishido, Yoshinari Kameda, Hansung Kim and Itaru Kitahara

Abstract: This paper proposes a method to improve the quality of omnidirectional free-viewpoint images using generative adversarial networks (GAN). By estimating the 3D information of the capturing space while integrating the omnidirectional images taken from multiple viewpoints, it is possible to generate an arbitrary omnidirectional appearance. However, the image quality of free-viewpoint images deteriorates due to artifacts caused by 3D estimation errors and occlusion. We solve this problem by using GAN and, moreover, by focusing on projective geometry during training, we further improve image quality by converting the omnidirectional image into perspective-projection images.
Download

Paper Nr: 133
Title:

Skeleton-geodesic Distances for Shape Recognition: Efficient Computation by Continuous Skeleton

Authors:

Nikita Lomov

Abstract: We consider the problem of determining a distance between points of a planar shape that is informative and resistant to shape transformations, including flexible articulations. The proposed distance is defined as the length of the shortest path through the skeleton between the projections of the points onto the skeleton, and is called the skeleton-geodesic distance. To calculate the values of interest, a continuous medial representation of the polygonal shape is used. The method of calculating the distance is based on the following principle: first, calculate all skeleton-geodesic distances between pairs of "reference" points, which are the vertices of the skeleton, using traditional graph algorithms; then refine them by adding the distances from the points in question to the nearest reference points. This approach allows us to achieve computational efficiency and to derive analytical formulas for direct calculation. An analogue of shape context using skeleton-geodesic distances and angles between branches of the skeleton is proposed. Examples of using these descriptors in the task of recognizing flexible objects are presented, showing that the proposed distance often provides better performance than Euclidean or geodesic distances.
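
The two-stage principle can be illustrated with a small sketch (the graph below is a hypothetical skeleton, not one from the paper): shortest paths between reference vertices are precomputed once, and a query pair is then answered by adding the projection distances.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import dijkstra

    # Hypothetical skeleton graph: 5 reference vertices, branch lengths as weights
    rows = np.array([0, 1, 1, 2, 3])
    cols = np.array([1, 2, 3, 4, 4])
    lens = np.array([2.0, 1.5, 3.0, 2.2, 1.0])
    graph = csr_matrix((lens, (rows, cols)), shape=(5, 5))

    # Stage 1: all skeleton-geodesic distances between reference vertices
    ref_dist = dijkstra(graph, directed=False)

    # Stage 2: refine for arbitrary points by adding their distances
    # (d_i, d_j) to the nearest reference vertices i and j
    def skeleton_geodesic(i, j, d_i, d_j):
        return d_i + ref_dist[i, j] + d_j

    print(skeleton_geodesic(0, 4, 0.4, 0.6))  # 0.4 + 5.7 + 0.6 = 6.7
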
Download

Paper Nr: 152
Title:

Mirror Symmetry Detection in Digital Images

Authors:

L. Mestetskiy and A. Zhuravskaya

Abstract: This article proposes an approach to the recognition of symmetrical objects in digital images, based on the construction of a quantitative asymmetry measure for such objects. The object asymmetry measure is determined through the Fourier descriptor of the discrete sequence of object boundary points. A method has been developed for calculating the asymmetry measure and determining the most likely symmetry axis by minimizing the asymmetry measure. The proposed solution using the Fourier descriptor has quadratic complexity in the number of object boundary points. A practical assessment of the efficiency and effectiveness of the algorithm is obtained through computational experiments with silhouettes of aircraft in remote sensing images.
Download

Paper Nr: 156
Title:

Multi-pooled Inception Features for No-reference Video Quality Assessment

Authors:

Domonkos Varga

Abstract: Video quality assessment (VQA) is an important element of a broad spectrum of applications ranging from automatic video streaming to surveillance systems. Furthermore, the measurement of video quality requires an extensive investigation of image and video features. In this paper, we introduce a novel feature extraction method for no-reference video quality assessment (NR-VQA) relying on visual features extracted from multiple Inception modules of pretrained convolutional neural networks (CNN). First, we show a solution that incorporates both intermediate- and high-level deep representations from a CNN to predict the perceptual quality of digital videos. Second, we demonstrate that processing all frames of a video to be evaluated is unnecessary, and that examining only the so-called intra-frames saves computational time and improves performance significantly. The proposed architecture was trained and tested on the recently published KoNViD-1k database.
Download

Paper Nr: 158
Title:

Robust Perceptual Night Vision in Thermal Colorization

Authors:

Feras Almasri and Olivier Debeir

Abstract: Transforming a thermal infrared image into a robust perceptual colour visual image is an ill-posed problem due to the differences in their spectral domains and in the objects' representations. Objects appear in one spectrum but not necessarily in the other, and the thermal signature of a single object may have different colours in its visual representation. This makes a direct mapping from thermal to visual images impossible and necessitates a solution that preserves the texture captured in the thermal spectrum while predicting the possible colour of certain objects. In this work, a deep learning method is proposed to map the thermal signature from the thermal image's spectrum to a visual representation in their low-frequency space. A pan-sharpening method is then used to merge the predicted low-frequency representation with the high-frequency representation extracted from the thermal image. The proposed model generates colour values consistent with the visual ground truth when the object does not vary much in its appearance, and generates averaged grey values in other cases. Compared with the existing state of the art, the proposed method produces robust perceptual night-vision images that preserve object appearance and image context.
Download

Paper Nr: 159
Title:

Comparative Study of a Commercial Tracking Camera and ORB-SLAM2 for Person Localization

Authors:

Safa Ouerghi, Nicolas Ragot, Remi Boutteau and Xavier Savatier

Abstract: Localizing persons in industrial sites is a major concern for the development of the factory of the future. During the last years, developments have been made in several active research domains targeting the localization problem, among them the vision-based Simultaneous Localization and Mapping (SLAM) paradigm. This has led to multiple algorithms in this field, such as ORB-SLAM2, known to be the most complete method as it incorporates the majority of state-of-the-art techniques. Recently, new commercial and low-cost systems that can estimate 6-DOF motion have also emerged on the market. In particular, we refer here to the Intel RealSense T265, a standalone 6-DOF tracking sensor that runs a visual-inertial SLAM algorithm and, as claimed by Intel, accurately estimates the 6-DOF motion. In this paper, we present an evaluation of the Intel T265 tracking camera by comparing its localization performance to the ORB-SLAM2 algorithm. This benchmarking fits within a specific use case: person localization in an industrial site. The experiments were conducted in a platform equipped with a VICON motion capture system, whose physical structure is similar to one we could find in an industrial site. The VICON system is made of fifteen high-speed tracking cameras (100 Hz) providing highly accurate poses that were used as ground truth reference. The sequences were recorded using both an Intel RealSense D435 camera, whose stereo images were fed to ORB-SLAM2, and the Intel RealSense T265. The two sets of timestamped poses (from VICON and from the cameras) were aligned and then calibrated using a point set registration method. The Absolute Trajectory Error, the Relative Trajectory Error and the Euclidean Distance Error metrics were employed to benchmark the localization accuracy of ORB-SLAM2 and the T265. The results show a competitive accuracy of both systems for a handheld camera in an indoor industrial environment, with better reliability for the T265 tracking system.
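
As a sketch of how such trajectory metrics are typically computed (the alignment step here is a generic Kabsch rigid registration, assumed rather than taken from the paper), the Absolute Trajectory Error can be evaluated as an RMSE after aligning the estimate to the ground truth:

    import numpy as np

    def align_rigid(gt, est):
        # Least-squares rigid alignment (Kabsch) of est onto gt, both (n, 3)
        mu_g, mu_e = gt.mean(0), est.mean(0)
        H = (est - mu_e).T @ (gt - mu_g)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        return (R @ (est - mu_e).T).T + mu_g

    def ate_rmse(gt, est):
        # Absolute Trajectory Error: RMSE of pointwise distances after alignment
        aligned = align_rigid(gt, est)
        return np.sqrt(np.mean(np.sum((gt - aligned) ** 2, axis=1)))
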
Download

Paper Nr: 160
Title:

Investigating Synthetic Data Sets for Crowd Counting in Cross-scene Scenarios

Authors:

Rita Delussu, Lorenzo Putzu and Giorgio Fumera

Abstract: Crowd counting and density estimation are crucial functionalities in intelligent video surveillance systems, but they are also very challenging computer vision tasks in scenarios characterised by dense crowds, due to scale and perspective variations, overlapping and occlusions. Regression-based crowd counting models are used for dense crowd scenes, where pedestrian detection is infeasible. We focus on real-world, cross-scene application scenarios where no manually annotated images of the target scene are available for training regression models, and only images with different backgrounds and camera views can be used (e.g., from publicly available data sets), which can lead to low accuracy. To overcome this issue, we propose to build the training set using synthetic images of the target scene, which can be automatically annotated with no manual effort. This work provides a preliminary empirical evaluation of the effectiveness of the above solution. To this aim, we carry out experiments using real data sets as the target scenes (testing set) and different kinds of synthetically generated crowd images of the target scenes as training data. Our results show that synthetic training images can be effective, provided that their background, besides their perspective, closely reproduces that of the target scene.
Download

Paper Nr: 164
Title:

Ambient Lighting Generation for Flash Images with Guided Conditional Adversarial Networks

Authors:

José Chávez, Rensso Mora and Edward Cayllahua-Cahuina

Abstract: To cope with the challenges that low-light conditions produce in images, photographers tend to use the light provided by the camera flash to get better illumination. Nevertheless, harsh shadows and non-uniform illumination can arise from using a camera flash, especially in low-light conditions. Previous studies have focused on normalizing the lighting on flash images; however, to the best of our knowledge, no prior studies have examined the removal of sideways shadows, the reconstruction of overexposed areas, or the generation of synthetic ambient shadows and natural tones of scene objects. To provide more natural illumination on flash images and ensure high-frequency details, we propose a generative adversarial network in a guided conditional mode. We show that this approach not only generates natural illumination but also attenuates harsh shadows, while simultaneously generating synthetic ambient shadows. Our approach achieves promising results on a custom FAID dataset, outperforming our baseline studies. We also analyze the components of our proposal and how they affect the overall performance, and discuss opportunities for future work.
Download

Paper Nr: 178
Title:

Comparison of Binary Images based on Jaccard Measure using Symmetry Information

Authors:

Sofia Fedotova, Olesia Kushnir and Oleg Seredin

Abstract: A method for comparing binary raster images using information about the symmetry axes of the shapes is proposed, which makes it possible to take into account the translation, rotation and scaling of a pair of images. The symmetry axis of a figure is found by one of the previously developed methods: based on the skeleton representation of the figure (Kushnir et al., 2016), or by adjustment of the skeleton axis or exhaustive search (Kushnir et al., 2019). The Jaccard measure is used as the measure of similarity. Three comparison algorithms were developed. The paper demonstrates that using information about the symmetry of the shapes, together with a comparison principle as simple as the Jaccard measure, allows significant results to be obtained. The possibility of using this approach for image classification is also investigated. The algorithms were experimentally studied on the "Flavia" and "Butterflies" datasets.
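
The Jaccard measure itself is straightforward; below is a minimal sketch on boolean masks (the toy shapes and the vertical mirroring are illustrative, and the axis detection is the skeleton-based step cited in the abstract):

    import numpy as np

    def jaccard(a, b):
        # Jaccard similarity |A n B| / |A u B| of two boolean masks
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 1.0

    a = np.zeros((8, 8), bool); a[2:6, 2:5] = True
    b = a[:, ::-1]                       # reflection about a vertical axis
    print(jaccard(a, b), jaccard(a, a))  # 0.5 1.0
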
Download

Paper Nr: 197
Title:

Image-based Classification of Swiss Traditional Costumes using Contextual Features

Authors:

Artem Khatchatourov and Christoph Stamm

Abstract: In this work we propose a method for feature-based clothing recognition and demonstrate its applicability by performing image-based recognition of Swiss traditional costumes. We employ an estimation of a simplified human skeleton (a poselet) to extract visually indistinguishable but reproducible features. The descriptors of those features are constructed while accounting for possible displacement of clothes along the human body. The similarity metrics mean squared error and correlation coefficient are surveyed, and the color spaces YIQ and CIELAB are investigated for their ability to isolate scene brightness in a separate channel. We show that the model trained with mean squared error performs best in the CIELAB color space and achieves an F0.5-score of 0.77. Furthermore, we show that omitting the brightness channel produces less biased but overall poorer descriptors.
Download

Paper Nr: 204
Title:

Raindrop Removal in a Vehicle Camera Video Considering the Temporal Consistency for Driving Support

Authors:

Hiroki Inoue, Keisuke Doman, Jun Adachi and Yoshito Mekada

Abstract: This paper proposes a recursive framework for raindrop removal in vehicle camera video that considers temporal consistency. Raindrops attached to a vehicle camera lens may prevent a driver or a camera-based system from recognizing the traffic environment. This research aims to develop a framework for raindrop detection and removal in order to deal with such situations. The proposed method sequentially and recursively restores a raindrop-free video from the original one that may contain raindrops. The key concept of the proposed framework is that each output (restored) image is used as one of the input frames for the next restoration step in order to improve the restoration quality. In each restoration step, the proposed method first detects raindrops in each input video frame and then restores the raindrop regions based on optical flow. The method is designed around the assumption that optical flow can be calculated more accurately in the outer part of a raindrop region than in the inner part, where finding corresponding pixels is difficult. We confirmed through several preliminary and evaluation experiments that the proposed framework has the potential to improve restoration accuracy.
Download

Paper Nr: 213
Title:

Federated Learning on Distributed Medical Records for Detection of Lung Nodules

Authors:

Pragati Baheti, Mukul Sikka, K. V. Arya and R. Rajesh

Abstract: In this work, the concept of federated learning is applied to medical records of CT scan images for the detection of pulmonary lung nodules. Instead of the naive approach of accumulating the data at a central place, the authors decentralize the training by bringing the model to the data, thus maintaining differential privacy of the records. The training on distributed electronic medical records includes two models: detection of the location of nodules and its confirmation. The experiments were carried out on CT scan images from the LIDC dataset, and the results show that the proposed method outperforms existing methods in terms of detection accuracy.
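
A minimal sketch of the core aggregation idea, assuming a FedAvg-style scheme (the two-site setup and parameter shapes are hypothetical): each site trains locally, and only model parameters, weighted by local record counts, are averaged centrally.

    import numpy as np

    def federated_average(site_weights, site_sizes):
        # Average per-site model parameters, weighted by local data size;
        # raw patient records never leave the sites.
        total = float(sum(site_sizes))
        agg = [np.zeros_like(w) for w in site_weights[0]]
        for weights, size in zip(site_weights, site_sizes):
            for a, w in zip(agg, weights):
                a += (size / total) * w
        return agg

    # Two hypothetical sites sharing a tiny 2-parameter model
    w1 = [np.array([1.0, 2.0]), np.array([0.5])]
    w2 = [np.array([3.0, 4.0]), np.array([1.5])]
    print(federated_average([w1, w2], site_sizes=[100, 300]))  # [2.5 3.5], [1.25]
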
Download

Paper Nr: 214
Title:

Automatic Skin Lesion Segmentation based on Saliency and Color

Authors:

Giuliana Ramella

Abstract: Segmenting skin lesions in dermoscopic images is a key step in the automatic diagnosis of melanoma. In this framework, this paper presents a new algorithm that, after a pre-processing phase aimed at reducing the computational burden, removing artifacts and improving contrast, selects the skin lesion pixels in terms of their saliency and color. The method is tested on a publicly available dataset and is evaluated both qualitatively and quantitatively.
Download

Paper Nr: 245
Title:

Defect Detection using Deep Learning from Minimal Annotations

Authors:

Manpreet S. Minhas and John Zelek

Abstract: Visual defect assessment is an important task in infrastructure asset monitoring for detecting faults (e.g., road distresses, bridge cracks, etc.) and for recognizing and tracking distress. This is essential to decide on the best course of action, whether that be a minor or major repair or the status quo. Until now, this surveillance and annotation has typically been carried out manually by human operators because of the challenging nature of the task. However, manual inspection has several drawbacks, such as training time and cost, and human bias and subjectivity, among others. As a result, automation in visual defect detection has attracted a lot of attention, and deep learning approaches are encouraging the automation of this detection activity. The actual perceptual surveillance can be conducted with camera-equipped land vehicles or drones. The automatic defect detection task can be formulated as an anomaly detection problem in which samples that deviate from the normal or defect-free ones need to be identified. Recently, Convolutional Neural Networks (CNNs) have shown tremendous potential in image-related tasks and have outperformed traditional hand-crafted feature-based methods. However, CNNs require a large amount of labelled data, which is virtually unavailable in most practical applications, a major drawback. This paper proposes the application of network-based transfer learning using CNNs for the task of visual defect detection, which overcomes the challenge of training from a limited number of samples. The results obtained show that the proposed method achieves high performance from limited data samples, with average F1 score and AUROC values of 0.8914 and 0.9766 respectively. The number of training defect samples was as low as 20 images for the Fray category of the Magnetic Tile defect dataset.
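
Network-based transfer learning of this kind usually amounts to freezing a pretrained backbone and retraining a small head; the sketch below illustrates the pattern with an ImageNet ResNet-18, which is an assumed stand-in rather than the paper's backbone.

    import torch.nn as nn
    from torchvision import models

    # Reuse ImageNet features; retrain only the classification head
    model = models.resnet18(pretrained=True)
    for p in model.parameters():
        p.requires_grad = False                      # freeze the backbone
    model.fc = nn.Linear(model.fc.in_features, 2)    # defect / defect-free
    trainable = [p for p in model.parameters() if p.requires_grad]
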
Download

Paper Nr: 278
Title:

Scene Adaptive Structured Light 3D Imaging

Authors:

Tomislav Pribanic, Tomislav Petkovic, David Bojanic, Kristijan Bartol and Mohit Gupta

Abstract: A 3D structured light (SL) system is one powerful 3D imaging alternative, in the simplest case composed of a single camera and a single projector. The performance of 3D SL systems has been studied from many aspects, for example, accuracy and precision, robustness to various imaging factors, applicability to dynamic scene capture, and hardware and image processing complexity, to name but a few. In this work we consider the spatial projector-camera setup and its influence on the uncertainty of point depth reconstruction. In particular, we show how depth precision is to a great extent determined by the angle of pattern projection and the angle of imaging from the projector and the camera, respectively. For a fixed camera-projector configuration, those angles are scene dependent for various points in space. Consequently, the attainable depth precision will typically vary considerably across the reconstruction volume, which is not a desirable property. To that end, we study a scene-dependent 3D imaging approach in which we propose how to conveniently detect points with lower depth precision and how to influence the other factors of depth precision, in order to improve it in the scene parts where necessary.
Download

Paper Nr: 293
Title:

AutoPOSE: Large-scale Automotive Driver Head Pose and Gaze Dataset with Deep Head Orientation Baseline

Authors:

Mohamed Selim, Ahmet Firintepe, Alain Pagani and Didier Stricker

Abstract: In computer vision research, public datasets are crucial to objectively assess new algorithms. With the wide use of deep learning methods to solve computer vision problems, large-scale datasets are indispensable for proper network training. Various driver-centered analyses depend on accurate head pose and gaze estimation. In this paper, we present a new large-scale dataset, AutoPOSE. The dataset provides ∼1.1M IR images taken from the dashboard view and ∼315K images from a Kinect v2 (RGB, IR, depth) taken from the center mirror view. AutoPOSE's ground truth (head orientation and position) was acquired with a sub-millimeter accurate motion capture system. Moreover, we present a head orientation estimation baseline with a state-of-the-art method on our AutoPOSE dataset. We provide the dataset as a downloadable package on a public website.
Download

Paper Nr: 294
Title:

Application of U-Net and Auto-Encoder to the Road/Non-road Classification of Aerial Imagery in Urban Environments

Authors:

Amanda Spolti, Vitor C. Guizilini, Caio T. Mendes, Matheus D. Croce, André R. de Geus, Henrique C. Oliveira, André R. Backes and Jefferson R. Souza

Abstract: One of the challenges in extracting road networks from aerial images is the enormous variety of cartographic features interacting with each other. This paper presents a methodology to detect the road network from aerial images. The methodology applies a Deep Learning (DL) architecture named U-Net and, for comparison, a fully convolutional Auto-Encoder. High-resolution RGB images of an urban area were obtained from a conventional photogrammetric mission. The experiments show that both architectures achieve satisfactory results for detecting the road network while maintaining low inference times once the DL networks are trained.
Download

Paper Nr: 297
Title:

Pitching Classification and Habit Detection by V-Net

Authors:

Sota Kato and Kazuhiro Hotta

Abstract: In this paper, we propose a method that classifies pitching motions using deep learning and detects pitching habits. In image classification, there is a method called Grad-CAM for visualizing the locations related to the classification. However, it is difficult to apply Grad-CAM to conventional video classification methods using 3D convolution. To solve this problem, we propose a video classification method based on V-Net. By reconstructing the input video, it is possible to visualize the frames and locations related to the classification result based on Grad-CAM. In addition, we improved the classification accuracy in comparison with conventional methods using 3D convolution and reconstruction. Experimental results confirm the effectiveness of our method.
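
For context, a minimal 2D Grad-CAM sketch is given below (standard image-classification Grad-CAM, not the V-Net-based variant the paper proposes): the class-score gradients are spatially pooled and used to weight the activations of a chosen convolutional layer.

    import torch
    import torch.nn.functional as F

    def grad_cam(model, layer, x, class_idx):
        acts, grads = [], []
        h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
        h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
        score = model(x)[0, class_idx]    # class score for one input image
        model.zero_grad()
        score.backward()
        h1.remove(); h2.remove()
        w = grads[0].mean(dim=(2, 3), keepdim=True)  # spatially pooled gradients
        cam = F.relu((w * acts[0]).sum(dim=1))       # weighted activations
        return cam / (cam.max() + 1e-8)              # normalized heatmap
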
Download

Paper Nr: 298
Title:

Semantic Segmentation using Light Attention Mechanism

Authors:

Yuki Hiramatsu and Kazuhiro Hotta

Abstract: Semantic segmentation using convolutional neural networks (CNN) can be applied to various fields such as automated driving. Semantic segmentation is pixel-wise class classification, and various methods using CNNs have been proposed. We introduce a light attention mechanism into an encoder-decoder network. The network with the light attention mechanism attends to the features extracted during training, emphasizing, for each pixel, the features judged to be effective for training and suppressing the features judged to be irrelevant. As a result, training can focus on only the necessary features. We evaluated the proposed method on the CamVid dataset and obtained higher accuracy than conventional segmentation methods.
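
A lightweight per-pixel attention gate can be as simple as a learned sigmoid mask multiplied onto the features; the module below is an illustrative sketch under that assumption, not the authors' exact design.

    import torch
    import torch.nn as nn

    class LightAttention(nn.Module):
        # A 1x1 convolution followed by a sigmoid produces a [0, 1] map
        # that re-weights each feature at each pixel before decoding.
        def __init__(self, channels):
            super().__init__()
            self.gate = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, x):
            return x * torch.sigmoid(self.gate(x))

    feats = torch.randn(1, 64, 32, 32)
    out = LightAttention(64)(feats)   # emphasized / suppressed features
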
Download

Paper Nr: 302
Title:

Variability Evaluation of CNNs using Cross-validation on Viruses Images

Authors:

André R. de Geus, André R. Backes and Jefferson R. Souza

Abstract: Virus description and recognition is an essential issue in medicine. It helps researchers study virus attributes such as morphology, chemical composition, and modes of replication. Although it can be performed through visual inspection, it is a task highly dependent on a qualified expert. Therefore, the automation of this task has received great attention over the past few years. In this study, we applied transfer learning from pre-trained deep neural networks for virus species classification. Given that many image datasets do not specify fixed training and test sets, and to avoid any bias, we evaluated the impact of a cross-validation scheme on the classification accuracy. The experimental results achieved up to 89% classification accuracy, outperforming previous studies by 2.8%.
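
A sketch of the kind of evaluation protocol involved, assuming stratified k-fold cross-validation over extracted deep features (the feature dimensionality, class count and classifier below are placeholders):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X = np.random.rand(200, 512)   # placeholder deep features
    y = np.arange(200) % 4         # placeholder labels, 4 balanced classes
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(), X, y, cv=cv)
    print(scores.mean(), scores.std())  # accuracy and fold-to-fold variability
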
Download

Area 2 - Mobile and Egocentric Vision for Humans and Robots

Full Papers
Paper Nr: 200
Title:

Localizing Visitors in Natural Sites Exploiting Modality Attention on Egocentric Images and GPS Data

Authors:

Giovanni Pasqualino, Stefano Scafiti, Antonino Furnari and Giovanni M. Farinella

Abstract: Localizing the visitors of an outdoor natural site can be advantageous for studying their behavior as well as for providing them with information on where they are and what to visit in the site. Although GPS can generally be used to perform outdoor localization, we show that this kind of signal is not always accurate enough in real-case scenarios. On the contrary, localization based on egocentric images can be more accurate, but it generally results in more expensive computation. In this paper, we investigate how fusing image- and GPS-based predictions can achieve efficient and accurate localization of the visitors of a natural site. Specifically, we compare different fusion techniques, including a modality attention approach which is shown to provide the best performance. Results point out that the proposed technique achieves promising results, obtaining the performance of very deep models (e.g., DenseNet) with a less expensive architecture (e.g., SqueezeNet) that has a memory footprint of about 3MB and an inference speed of about 25ms.
Download

Short Papers
Paper Nr: 144
Title:

Regression-based 3D Hand Pose Estimation using Heatmaps

Authors:

Chaitanya Bandi and Ulrike Thomas

Abstract: 3D hand pose estimation is a challenging problem in human-machine interaction applications. We introduce a simple and effective approach for 3D hand pose estimation in grasping scenarios, taking advantage of a low-cost RGB-D camera. 3D hand pose estimation plays a major role in environments where objects are handed over between human and robot hands, to avoid collisions and to enable collaboration in shared workspaces. We employ Convolutional Neural Networks (CNNs) to solve this challenge; the idea of cascaded CNNs is very appropriate for real-time applications. In this paper, we introduce an architecture for direct regression of 3D normalized coordinates and a small-scale dataset for human-machine interaction applications. In the cascaded network, the first network narrows the search space, and the second network is then trained within the confined region to detect more accurate 2D heatmaps of joint locations. Finally, 3D normalized joints are regressed directly from RGB images, and depth maps can lift the normalized coordinates to camera coordinates.
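
The final lifting step typically reads out the heatmap maximum and back-projects it with the pinhole model; below is a minimal sketch under that assumption (the intrinsics and argmax decoding are generic, not the paper's exact procedure):

    import numpy as np

    def decode_joint(heatmap, depth, fx, fy, cx, cy):
        # 2D argmax of the joint heatmap ...
        v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        z = depth[v, u]
        # ... lifted to camera coordinates via the pinhole model
        return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
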
Download

Paper Nr: 150
Title:

CAD-based Learning for Egocentric Object Detection in Industrial Context

Authors:

Julia Cohen, Carlos Crispim-Junior, Céline Grange-Faivre and Laure Tougne

Abstract: Industries nowadays have an increasing need for real-time and accurate vision-based algorithms. Although the performance of object detection methods has improved considerably thanks to massive public datasets, instance detection in an industrial context must be approached differently, since annotated images are usually unavailable or rare. In addition, when the video stream comes from a head-mounted camera, we observe a lot of movement and blurred frames altering the image content. For this purpose, we propose a framework to generate a dataset of egocentric synthetic images using only CAD models of the objects of interest. To evaluate different strategies exploiting synthetic and real images, we train a Convolutional Neural Network (CNN) for the task of object detection in egocentric images. Results show that training a CNN on synthetic images that reproduce the characteristics of egocentric vision may perform as well as training on a set of real images, reducing, if not removing, the need to manually annotate a large quantity of images to achieve accurate performance.
Download

Paper Nr: 198
Title:

Monocular 3D Object Detection via Geometric Reasoning on Keypoints

Authors:

Ivan Barabanau, Alexey Artemov, Evgeny Burnaev and Vyacheslav Murashkin

Abstract: Monocular 3D object detection is well-known to be a challenging vision task due to the loss of depth information; attempts to recover depth using separate image-only approaches lead to unstable and noisy depth estimates, harming 3D detections. In this paper, we propose a novel keypoint-based approach for 3D object detection and localization from a single RGB image. We build our multi-branch model around 2D keypoint detection in images and complement it with a conceptually simple geometric reasoning method. Our network performs in an end-to-end manner, simultaneously and interdependently estimating 2D characteristics, such as 2D bounding boxes, keypoints, and orientation, along with full 3D pose in the scene. We fuse the outputs of distinct branches, applying a reprojection consistency loss during training. The experimental evaluation on the challenging KITTI dataset benchmark demonstrates that our network achieves state-of-the-art results among other monocular 3D detectors.
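
A reprojection consistency term of the kind described can be sketched as follows (the intrinsics-only projection and L1 penalty are assumptions; the paper's full loss combines this with the other branch outputs):

    import torch

    def reprojection_loss(points_3d, keypoints_2d, K):
        # Project predicted 3D keypoints (n, 3), in camera coordinates,
        # with intrinsics K (3, 3), then compare to the 2D keypoint branch.
        proj = points_3d @ K.t()
        proj = proj[:, :2] / proj[:, 2:3]       # perspective divide
        return torch.mean(torch.abs(proj - keypoints_2d))
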
Download

Paper Nr: 205
Title:

A First-person Database for Detecting Barriers for Pedestrians

Authors:

Zenonas Theodosiou, Harris Partaourides, Tolga Atun, Simoni Panayi and Andreas Lanitis

Abstract: Egocentric vision, which relates to the continuous interpretation of images captured by wearable cameras, is increasingly being utilized in several applications to enhance the quality of citizens' lives, especially for those with visual or motion impairments. The development of sophisticated egocentric computer vision techniques requires automatic analysis of large databases of first-person point-of-view visual data collected through wearable devices. In this paper, we present our initial findings regarding the use of wearable cameras for enhancing pedestrian safety while walking on city sidewalks. For this purpose, we create a first-person database with annotations of common barriers that may put pedestrians in danger. Furthermore, we derive a framework for collecting visual lifelogging data and define 24 different categories of sidewalk barriers. Our dataset consists of 1796 annotated images covering 1969 instances of barriers. The analysis of the dataset by means of object classification algorithms shows encouraging results for further study.
Download

Paper Nr: 238
Title:

Towards Visual Loop Detection in Underwater Robotics using a Deep Neural Network

Authors:

Antoni Burguera and Francisco Bonin-Font

Abstract: This paper constitutes a first step towards the use of Deep Neural Networks to detect underwater visual loops quickly and robustly. The proposed architecture is based on an autoencoder, replacing the decoder part with a set of fully connected layers. Thanks to this, it is possible to guide the training process by means of a global image descriptor built upon clusters of local SIFT features. After training, the NN builds two different descriptors of the input image. Both descriptors can be compared across different images to decide whether they are likely to close a loop. The experiments, performed in coastal areas of Mallorca (Spain), evaluate both descriptors, show the ability of the presented approach to detect loop candidates, and compare our proposal favourably to a previously existing method.
Download

Paper Nr: 273
Title:

Evaluation of 3D Vision Systems for Detection of Small Objects in Agricultural Environments

Authors:

Justin L. Louedec, Bo Li and Grzegorz Cielniak

Abstract: 3D information provides unique cues about the shape, localisation and relations between objects that are not found in standard 2D images. This information would be very beneficial in a large number of agricultural applications, such as fruit picking, yield monitoring, forecasting and phenotyping. In this paper, we conducted a study on the application of modern 3D sensing technology together with state-of-the-art machine learning algorithms for the segmentation and detection of strawberries growing in real farms. We evaluate the performance of two state-of-the-art 3D sensing technologies and showcase the differences between 2D and 3D networks trained on images and point clouds of strawberry plants and fruit. Our study highlights the limitations of current 3D vision systems for the detection of small objects in outdoor applications and sets out foundations for future work on 3D perception for challenging outdoor applications such as agriculture.
Download

Paper Nr: 313
Title:

3D Model-based 6D Object Pose Tracking on RGB Images using Particle Filtering and Heuristic Optimization

Authors:

Mateusz Majcher and Bogdan Kwolek

Abstract: We present an algorithm for tracking the 6D pose of an object in a sequence of RGB images acquired by a calibrated camera. The object of interest is segmented by a U-Net neural network, trained in advance to segment a set of objects from the background. The 6D pose of the object is estimated by projecting the 3D model onto the image and then matching the rendered object with the segmented one. The objective function is calculated using object silhouette and edge scores determined on the basis of the distance transform. A particle filter is used to estimate the posterior probability distribution. A k-means++ algorithm, which applies a sequentially random selection strategy according to the squared distance from the closest center already selected, is executed on the particles representing the multi-modal probability distribution. A particle swarm optimization is then used to find the modes of the probability distribution. Results achieved by the proposed algorithm are compared with those obtained by a particle filter and by particle swarm optimization.
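
The distance-transform-based edge score can be sketched as a Chamfer-style match (the toy edge maps below are illustrative): each rendered model edge pixel is charged the distance to the nearest observed image edge.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def edge_score(rendered_edges, observed_edges):
        # The distance transform of the inverted edge map gives, at every
        # pixel, the distance to the nearest observed edge.
        dt = distance_transform_edt(~observed_edges)
        return dt[rendered_edges].mean()   # lower means a better pose fit

    obs = np.zeros((10, 10), bool); obs[5, :] = True
    ren = np.zeros((10, 10), bool); ren[6, :] = True
    print(edge_score(ren, obs))            # 1.0: one pixel of misalignment
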
Download

Paper Nr: 42
Title:

Geo-localization using Ridgeline Features Extracted from 360-degree Images of Sand Dunes

Authors:

Shogo Fukuda, Shintaro Nakatani, Masashi Nishiyama and Yoshio Iwai

Abstract: We propose a method to extract features of sand-dune ridgelines using a 360-degree camera to improve the accuracy of geo-location estimation. It is difficult to estimate geo-locations in an outdoor environment with almost no texture, such as sand dunes. We focus on the ridgeline feature, which is the boundary between the ground region and the sky region. A 360-degree camera can quickly detect the ridgeline signal in all directions in a sand dune. Our method determines the current location by searching for the target signal nearest to the observed ridgeline signal, among target signals paired with their geo-locations. We evaluated the accuracy of this geo-localization method using synthesized images generated from a digital elevation model, as well as real 360-degree images collected in sand dunes. We confirmed that our method significantly outperformed an existing geo-localization method on both synthesized and real images.
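
The nearest-signal search can be pictured as a brute-force comparison over circular shifts (so the camera heading need not be known); the signals below are synthetic stand-ins, not the paper's data.

    import numpy as np

    def best_location(query, database):
        # Match a 360-degree ridgeline signal (one elevation per azimuth
        # bin) against geo-tagged target signals over all circular shifts.
        best, best_err = None, np.inf
        for idx, target in enumerate(database):
            for shift in range(len(query)):
                err = np.sum((np.roll(query, shift) - target) ** 2)
                if err < best_err:
                    best, best_err = idx, err
        return best, best_err

    az = np.linspace(0, 2 * np.pi, 360, endpoint=False)
    db = [np.sin(az), np.cos(2 * az)]                  # two known locations
    print(best_location(np.roll(np.sin(az), 45), db))  # -> (0, ~0.0)
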
Download

Paper Nr: 100
Title:

A Comparison of Visual Navigation Approaches based on Localization and Reinforcement Learning in Virtual and Real Environments

Authors:

Marco Rosano, Antonino Furnari, Luigi Gulino and Giovanni M. Farinella

Abstract: Visual navigation algorithms allow a mobile agent to sense the environment and autonomously find its way to reach a target (e.g. an object in the environment). While many recent approaches tackled this task using reinforcement learning, which neglects any prior knowledge about the environments, more classic approaches strongly rely on self-localization and path planning. In this study, we compare the performance of single-target and multi-target visual navigation approaches based on the reinforcement learning paradigm, and simple baselines which rely on image-based localization. Experiments performed on discrete-state environments of different sizes, comprised of both real and virtual images, show that the two paradigms tend to achieve complementary results, hence suggesting that a combination of the two approaches to visual navigation may be beneficial.
Download

Paper Nr: 253
Title:

Dynamic Detectors of Oriented Spatial Contrast from Isotropic Fixational Eye Movements

Authors:

Simone Testa, Giacomo Indiveri and Silvio P. Sabatini

Abstract: Good vision proficiency and a complex set of eye movements frequently coexist. Even during fixation, our eyes keep moving in a microscopic and erratic fashion, which keeps stationary scenes from fading perceptually by preventing retinal adaptation. We artificially replicate the functionalities of biological vision by exploiting this active strategy with an event-based camera. The resulting neuromorphic active system redistributes the low temporal frequency power of a static image into a range the sensor can detect and encode in the timing of events. A spectral analysis of its output attested to both the whitening and amplification effects already postulated in biology, depending on whether or not the stimulus contrast matched the 1/k falloff typical of natural images. Further evaluations revealed that the isotropic statistics of fixational eye movements are crucial for equalizing the response of the system to all possible stimulus orientations. Finally, the design of a biologically realistic spiking neural network allowed the detection of the stimulus' local orientation by anisotropic spatial summation of synchronous activity with both ON/OFF polarities.
Download

Area 3 - Image and Video Understanding

Full Papers
Paper Nr: 14
Title:

Using Local Refinements on 360 Stitching from Dual-fisheye Cameras

Authors:

Rafael Roberto, Daniel Perazzo, João P. Lima, Veronica Teichrieb, Jonysberg P. Quintino, Fabio Q. B. da Silva, Andre M. Santos and Helder Pinho

Abstract: Full panoramic images have several applications, ranging from virtual reality to 360° broadcasting. This visualization method is growing in popularity, especially after the popularization of dual-fisheye cameras, which are compact and easy-to-use 360° imaging devices, and of low-cost platforms that allow immersive experiences. However, low-quality registration and compositing, in which artifacts are noticeable in the stitching area, can harm the user experience. Although it is challenging to compose such images due to their narrow overlap area, recent works can provide good results when performing a global alignment. Nevertheless, they often cause artifacts, since global alignment is not able to address every aspect of an image. In this work, we present a stitching method that performs local refinements to improve the registration and compositing quality of 360° images and videos. It builds on a feature clustering approach for global alignment. The proposed technique applies seam estimation and rigid moving least squares to remove undesired artifacts locally. Finally, we evaluate both in order to select the better result using a seam evaluation metric. Experiments showed that our method reduced the stitching error by at least 42.56% for images and 49.45% for videos when compared with existing techniques. Moreover, it provided the best results on all tested images and on 94.52% of the video frames.
Download

Paper Nr: 18
Title:

Audio-guided Video Interpolation via Human Pose Features

Authors:

Takayuki Nakatsuka, Masatoshi Hamanaka and Shigeo Morishima

Abstract: This paper describes a method that generates in-between frames of two videos of a musical instrument being played. While image generation has achieved successful outcomes in recent years, there is ample scope for improvement in video generation. The keys to improving the quality of video generation are the high resolution and temporal coherence of videos. We met these requirements by using not only visual information but also aural information. The critical point of our method is the use of two-dimensional pose features to generate high-resolution in-between frames from the input audio. We constructed a deep neural network with a recurrent structure for inferring pose features from the input audio, and an encoder-decoder network for padding and generating video frames using the pose features. Our method, moreover, adopts a fusion approach of generating, padding, and retrieving video frames to improve the output video. Pose features played an essential role both in end-to-end training, thanks to their differentiable property, and in combining the generating, padding, and retrieving approaches. We conducted a user study and confirmed that the proposed method is effective in generating interpolated videos.
Download

Paper Nr: 20
Title:

Semantic Scene Completion from a Single 360-Degree Image and Depth Map

Authors:

Aloisio Dourado, Hansung Kim, Teofilo E. de Campos and Adrian Hilton

Abstract: We present a method for Semantic Scene Completion (SSC) of complete indoor scenes from a single 360° RGB image and corresponding depth map, using a Deep Convolutional Neural Network that takes advantage of existing datasets of synthetic and real RGB-D images for training. Recent works on SSC only perform occupancy prediction of small regions of the room covered by the field of view of the sensor in use, which implies the need for multiple images to cover the whole scene, making them inappropriate for dynamic scenes. Our approach uses only a single 360° image with its corresponding depth map to infer the occupancy and semantic labels of the whole room. Using a single image is important to allow predictions with no previous knowledge of the scene and enables extension to dynamic scene applications. We evaluated our method on two 360° image datasets: a high-quality 360° RGB-D dataset gathered with a Matterport sensor, and low-quality 360° RGB-D images generated with a pair of commercial 360° cameras and stereo matching. The experiments showed that the proposed pipeline performs SSC not only with Matterport cameras but also with more affordable 360° cameras, which adds a great number of potential applications, including immersive spatial audio reproduction, augmented reality, assistive computing and robotics.
Download

Paper Nr: 44
Title:

FootAndBall: Integrated Player and Ball Detector

Authors:

Jacek Komorowski, Grzegorz Kurzejamski and Grzegorz Sarwas

Abstract: The paper describes a deep neural network-based detector dedicated to ball and player detection in high-resolution, long-shot video recordings of soccer matches. The detector, dubbed FootAndBall, has an efficient fully convolutional architecture and can operate on an input video stream with arbitrary resolution. It produces a ball confidence map encoding the position of the detected ball, a player confidence map, and a player bounding box tensor encoding players' positions and bounding boxes. The network uses the Feature Pyramid Network design pattern, where lower-level features with higher spatial resolution are combined with higher-level features with a bigger receptive field. This improves the discriminability of small objects (the ball), as a larger visual context around the object of interest is taken into account for the classification. Due to its specialized design, the network has two orders of magnitude fewer parameters than a generic deep neural network-based object detector, such as SSD or YOLO. This allows real-time processing of high-resolution input video streams.
Download

Paper Nr: 50
Title:

Fine-tuning Siamese Networks to Assess Sport Gestures Quality

Authors:

Mégane Millan and Catherine Achard

Abstract: This paper presents an Action Quality Assessment (AQA) approach that learns to automatically score action realization from temporal sequences such as videos. To manage the small size of most databases capturing actions or gestures, we propose to use Siamese Networks. In the literature, Siamese Networks are widely used to rank action scores; their purpose is not to regress scores but to predict a value that respects the true score order, so that it can be used to rank actions according to their quality. For AQA, we need to predict real scores, as well as the difference between these scores and their range. Thus, we first introduce a new loss function to train Siamese Networks to regress score gaps. Once the Siamese network is trained, a branch of this network is extracted and fine-tuned for score prediction. We tested our approach on a public database, the AQA-7 dataset, composed of videos from 7 sports. Our results outperform the state of the art on the AQA task. Moreover, we show that the proposed method is also more efficient for action ranking.
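The abstract does not spell out the new loss in closed form; as a rough illustration, a score-gap regression loss for a Siamese pair might look as follows (a minimal PyTorch sketch; the function name and exact form are assumptions, not the authors' definition):

```python
import torch

def score_gap_loss(f_x1, f_x2, s1, s2):
    """Hypothetical Siamese loss: the difference between the two shared-weight
    branch outputs should regress the true score gap s1 - s2, not just its sign
    (as a ranking loss would)."""
    return torch.mean(((f_x1 - f_x2) - (s1 - s2)) ** 2)
```

After such training, one branch would be kept and fine-tuned on absolute scores, as the abstract describes.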
Download

Paper Nr: 56
Title:

Pre- and Post-processing Strategies for Generic Slice-wise Segmentation of Tomographic 3D Datasets Utilizing U-Net Deep Learning Models Trained for Specific Diagnostic Domains

Authors:

Gerald A. Zwettler, Werner Backfrieder and David R. Holmes III

Abstract: An automated and generally applicable method for segmentation is still a focus of medical image processing research. For several years, artificial intelligence methods have shown promising results, especially with widely available scalable Deep Learning libraries. In this work, a five-layer hybrid U-Net is developed for slice-by-slice segmentation of liver datasets. Training data are taken from the Medical Segmentation Decathlon database, providing 131 fully segmented volumes. A slice-oriented segmentation model is implemented utilizing deep learning algorithms, with adaptations for variable parenchyma shape along the stacking direction and similarities between adjacent slices. Both are transformed for coronal and sagittal views. The implementation runs on a GPU rack with TensorFlow and Keras. For a quantitative measure of segmentation accuracy, standardized volume and surface metrics are used. The results (DSC = 97.59, JI = 95.29, and NSD = 99.37) show proper segmentation, comparable to 3D U-Nets and other state-of-the-art methods. The development of a 2D slice-oriented segmentation is justified by its short training time, lower complexity, and therefore massively reduced memory consumption. This work manifests the high potential of AI methods for general use in medical segmentation, as a fully- or semi-automated tool supervised by the expert user.
Download

Paper Nr: 57
Title:

Estimation of Muscle Fascicle Orientation in Ultrasonic Images

Authors:

Regina Pohle-Fröhlich, Christoph Dalitz, Charlotte Richter, Tobias Hahnen, Benjamin Stäudle and Kirsten Albracht

Abstract: We compare four different algorithms for automatically estimating the muscle fascicle angle from ultrasonic images: the vesselness filter, the Radon transform, the projection profile method and the gray level co-occurrence matrix (GLCM). The algorithm results are compared to ground truth data generated by three different experts on 425 image frames from two videos recorded during different types of motion. The best agreement with the ground truth data was achieved by a combination of pre-processing with a vesselness filter and measuring the angle with the projection profile method. The robustness of the estimation is increased by applying the algorithms to subregions with high gradients and performing a LOESS fit through these estimates.
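For illustration, the projection profile idea can be sketched in a few lines: rotate the (pre-filtered) image over candidate angles and keep the angle that maximizes the variance of the row-wise projection profile. This is a generic sketch, not the authors' implementation; the angle range and step size are assumptions:

```python
import numpy as np
from scipy.ndimage import rotate

def projection_profile_angle(img, angles=np.arange(-40, 40.5, 0.5)):
    """Estimate the dominant line orientation: when the fascicles are
    rotated to horizontal, the row sums (projection profile) become
    maximally peaked, i.e. their variance is maximal."""
    best_angle, best_score = 0.0, -np.inf
    for a in angles:
        rot = rotate(img, a, reshape=False, order=1)
        profile = rot.sum(axis=1)          # row-wise projection profile
        score = profile.var()
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```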
Download

Paper Nr: 66
Title:

Deep Learning for Astronomical Object Classification: A Case Study

Authors:

Ana Martinazzo, Mateus Espadoto and Nina T. Hirata

Abstract: With the emergence of photometric surveys in astronomy came the challenge of processing and understanding an enormous amount of image data. In this paper, we systematically compare the performance of five popular ConvNet architectures when applied to three different image classification problems in astronomy, to determine which architecture works best for each problem. We show that a VGG-style architecture pre-trained on ImageNet yields the best results on all studied problems, even when compared to architectures which perform much better in the ImageNet competition.
Download

Paper Nr: 70
Title:

Semi-automatic Learning Framework Combining Object Detection and Background Subtraction

Authors:

Sugino N. Alejandro, Tsubasa Minematsu, Atsushi Shimada, Takashi Shibata, Rin-ichiro Taniguchi, Eiji Kaneko and Hiroyoshi Miyano

Abstract: Public datasets used to train modern object detection models do not contain all the object classes appearing in real-world surveillance scenes. Even if they appear, they might look vastly different. Therefore, object detectors deployed in the real world must accommodate unknown objects and adapt to the scene. We implemented a framework that combines background subtraction and unknown object detection to improve the pretrained detector’s performance, applying human intervention to review the detected objects in order to minimize the latent risk of introducing wrongly labeled samples into the training. The proposed system enhanced the original YOLOv3 object detector’s performance in almost all the metrics analyzed, and managed to incorporate new classes without losing previous training information.
Download

Paper Nr: 103
Title:

Anomaly Event Detection based on People Trajectories for Surveillance Videos

Authors:

Rensso M. Colque, Edward Cayllahua, Victor C. de Melo, Guillermo C. Chavez and William R. Schwartz

Abstract: In this work, we propose a novel approach to detect anomalous events in videos based on people's movements, which are represented through time as trajectories. Given a video scenario, we collect trajectories of normal behavior using people pose estimation techniques and employ a multi-tracking data association heuristic to smooth trajectories. We propose two distinct approaches to describe the trajectories, one based on a Convolutional Neural Network and the second based on a Recurrent Neural Network. We use these models to describe all trajectories, where anomalies are those that differ significantly from normal trajectories. Experimental results show that our model is comparable with state-of-the-art methods and also validates the idea of using trajectories as a resource to compute one type of useful information for understanding people's behavior; in this case, the existence of rare trajectories.
Download

Paper Nr: 146
Title:

Fuzzy Fusion for Two-stream Action Recognition

Authors:

Anderson E. Santos, Helena A. Maia, Marcos E. Souza, Marcelo B. Vieira and Helio Pedrini

Abstract: There are several aspects that may help in the characterization of an action being performed in a video, such as scene appearance and the estimated movement of the involved objects. Many works in the literature combine different aspects to recognize the actions, which has been shown to be superior to individual results. Just as important as the definition of representative and complementary aspects is the choice of good combination methods that exploit the strengths of each aspect. In this work, we propose a novel fusion strategy based on two fuzzy integral methods. This strategy is capable of generalizing other common operators; besides, it allows more combinations to be evaluated by having a distinct impact on linearly dependent sets. Our experiments show that the fuzzy fusion outperforms the most commonly used weighted average on the challenging UCF101 and HMDB51 datasets.
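The abstract does not name the two fuzzy integrals; the discrete Choquet integral is one common instance and illustrates how such a fusion generalizes weighted averaging (a sketch with an assumed two-stream fuzzy measure, not the paper's configuration):

```python
import numpy as np

def choquet_fusion(scores, measure):
    """Discrete Choquet integral of per-stream class scores.
    scores: array of shape (n_streams,); measure: dict mapping a frozenset
    of stream indices to its fuzzy-measure value, with the full set at 1.0."""
    order = np.argsort(scores)               # streams sorted by ascending score
    x = np.concatenate(([0.0], scores[order]))
    total = 0.0
    for i in range(1, len(x)):
        coalition = frozenset(order[i - 1:])  # streams scoring at least x[i]
        total += (x[i] - x[i - 1]) * measure[coalition]
    return total

# toy example: appearance (0) and motion (1) streams for one class
measure = {frozenset({0, 1}): 1.0, frozenset({0}): 0.6, frozenset({1}): 0.5}
print(choquet_fusion(np.array([0.8, 0.4]), measure))  # 0.64
```

When the measure is additive, this reduces to the plain weighted average, which is why the fuzzy integral can generalize the common operators mentioned in the abstract.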
Download

Paper Nr: 148
Title:

Coarse to Fine Vertebrae Localization and Segmentation with SpatialConfiguration-Net and U-Net

Authors:

Christian Payer, Darko Štern, Horst Bischof and Martin Urschler

Abstract: Localization and segmentation of vertebral bodies from spine CT volumes are crucial for pathological diagnosis, surgical planning, and postoperative assessment. However, fully automatic analysis of spine CT volumes is difficult due to the anatomical variation of pathologies, noise caused by screws and implants, and the large range of different fields of view. We propose a fully automatic coarse-to-fine approach for vertebrae localization and segmentation based on fully convolutional CNNs. In a three-step approach, first, a U-Net localizes the rough position of the spine. Then, the SpatialConfiguration-Net performs vertebrae localization and identification using heatmap regression. Finally, a U-Net performs binary segmentation of each identified vertebra at high resolution, before the individual predictions are merged into the resulting multi-label vertebrae segmentation. The evaluation shows the top performance of our approach, ranking first place and winning the MICCAI 2019 Large Scale Vertebrae Segmentation Challenge (VerSe 2019).
Download

Paper Nr: 151
Title:

Self-supervised Depth Estimation based on Feature Sharing and Consistency Constraints

Authors:

Julio Mendoza and Helio Pedrini

Abstract: In this work, we propose a self-supervised approach to depth estimation. Our method uses depth consistency to generate a soft visibility mask that reduces the error contribution of inconsistent regions produced by occlusions. In addition, we allow the pose network to take advantage of the depth network representations to produce more accurate results. The experiments are conducted on the KITTI 2015 dataset. We analyze the effect of each component on the performance of the model and demonstrate that the consistency constraint and feature sharing can effectively improve our results. We show that our method is competitive when compared to the state of the art.
Download

Paper Nr: 167
Title:

RaDE: A Rank-based Graph Embedding Approach

Authors:

Filipe Alves de Fernando, Daniel G. Pedronette, Gustavo José de Sousa, Lucas P. Valem and Ivan R. Guilherme

Abstract: Due to the possibility of capturing complex relationships between nodes, many applications benefit from being modeled as graphs. However, performance issues can be observed on large-scale networks, making it computationally unfeasible to process information in various scenarios. Graph Embedding methods are usually used for finding low-dimensional vector representations of graphs, preserving their original properties such as topological characteristics, affinity, and shared neighborhood between nodes. In this way, retrieval and machine learning techniques can be exploited to execute tasks such as classification, clustering, and link prediction. In this work, we propose RaDE (Rank Diffusion Embedding), an efficient and effective approach that considers rank-based graphs for learning a low-dimensional vector representation. The proposed approach was evaluated on 7 network datasets, including social, co-reference, textual and image networks, with different properties. Vector representations generated with RaDE achieved effective results in visualization and retrieval tasks when compared to vector representations generated by other recent related methods.
Download

Paper Nr: 176
Title:

3D Plant Growth Prediction via Image-to-Image Translation

Authors:

Tomohiro Hamamoto, Hideaki Uchiyama, Atsushi Shimada and Rin-ichiro Taniguchi

Abstract: This paper presents a method to predict three-dimensional (3D) plant growth from RGB-D images. Based on neural-network-based image translation and time-series prediction, we construct a system that predicts a future RGB-D image from several past RGB-D images. Since both RGB and depth images are incorporated into our system, the plant growth can be represented in 3D space. In the evaluation, the performance of our proposed network is investigated with a focus on clarifying the importance of each module in the network. We have verified how the prediction accuracy changes depending on the internal structure of our network.
Download

Paper Nr: 181
Title:

ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation

Authors:

Catherine Capellen, Max Schwarz and Sven Behnke

Abstract: 6D object pose estimation is a prerequisite for many applications. In recent years, monocular pose estimation has attracted much research interest because it does not need depth measurements. In this work, we introduce ConvPoseCNN, a fully convolutional architecture that avoids cutting out individual objects. Instead, we propose pixel-wise, dense prediction of both the translation and orientation components of the object pose, where the dense orientation is represented in quaternion form. We present different approaches for aggregating the dense orientation predictions, including averaging and clustering schemes. We evaluate ConvPoseCNN on the challenging YCB-Video dataset, where we show that the approach has far fewer parameters and trains faster than comparable methods without sacrificing accuracy. Furthermore, our results indicate that the dense orientation prediction implicitly learns to attend to trustworthy, occlusion-free, and feature-rich object regions.
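As one concrete aggregation scheme of the kind mentioned, dense quaternion predictions can be averaged via the principal eigenvector of their weighted outer-product matrix, which handles the q/-q sign ambiguity of quaternions (a generic sketch of this standard technique, not necessarily the exact scheme used in the paper):

```python
import numpy as np

def average_quaternions(quats, weights=None):
    """Weighted average of unit quaternions (rows of `quats`) as the
    principal eigenvector of the weighted outer-product accumulator.
    Because q and -q contribute identically to the outer product,
    the result is robust to the quaternion sign ambiguity."""
    if weights is None:
        weights = np.ones(len(quats))
    M = np.zeros((4, 4))
    for q, w in zip(quats, weights):
        q = q / np.linalg.norm(q)
        M += w * np.outer(q, q)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -1]   # eigenvector of the largest eigenvalue
```

Here the weights could, for instance, be per-pixel confidences, so that trustworthy regions dominate the aggregated orientation.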
Download

Paper Nr: 201
Title:

Quantitative Analysis of Facial Paralysis using GMM and Dynamic Kernels

Authors:

Nazil Perveen, Chalavadi K. Mohan and Yen W. Chen

Abstract: In this paper, a quantitative assessment for facial paralysis is proposed to detect and measure the different degrees of facial paralysis. Generally, difficulty in facial muscle movements determines the degree to which patients are affected by facial paralysis. In the proposed work, the movements of facial muscles are captured using spatio-temporal features, and the facial dynamics are learned using a large Gaussian mixture model (GMM). Also, to handle the multiple disparities that occur during facial muscle movements, dynamic kernels are used, which effectively preserve local structure information while handling the variation across the different degrees of facial paralysis. Dynamic kernels are known for handling variable-length data patterns efficiently, by mapping them onto a fixed-length pattern or by selecting a set of discriminative virtual features using multiple GMM statistics. These kernel representations are then classified using a support vector machine (SVM) for the final assessment. To show the efficacy of the proposed approach, we collected a video database of 39 facially paralyzed patients of different age groups and genders, captured from multiple angles (views), for robust assessment of the different degrees of facial paralysis. We employ and compare the trade-off between accuracy and computational load for three different categories of dynamic kernels, namely explicit-mapping-based, probability-based, and matching-based dynamic kernels. We show that the matching-based kernel, which has a very low computational load, achieves better classification performance (81.5%) than existing methods. Also, with higher-order statistics, the probability kernel involves more computational overhead but gives significantly higher classification performance (92.46%) than state-of-the-art methods.
Download

Paper Nr: 227
Title:

Improving Dense Crowd Counting Convolutional Neural Networks using Inverse k-Nearest Neighbor Maps and Multiscale Upsampling

Authors:

Greg Olmschenk, Hao Tang and Zhigang Zhu

Abstract: Gatherings of thousands to millions of people frequently occur for an enormous variety of events, and automated counting of these high-density crowds is useful for safety, management, and measuring the significance of an event. In this work, we show that the commonly accepted labeling scheme of crowd density maps for training deep neural networks is less effective than our alternative inverse k-nearest neighbor (ikNN) maps, even when used directly in existing state-of-the-art network structures. We also provide a new network architecture, MUD-ikNN, which uses multi-scale drop-in replacement upsampling via transposed convolutions to take full advantage of the provided ikNN labeling. This upsampling combined with the ikNN maps further improves crowd counting accuracy. Our new network architecture performs favorably in comparison with the state of the art. Moreover, our labeling and upsampling techniques are generally applicable to existing crowd counting architectures.
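A minimal sketch of how an inverse k-nearest-neighbor label map could be generated from annotated head positions; the exact normalization used in the paper may differ, so the 1/(1 + d) form below is an assumption:

```python
import numpy as np
from scipy.spatial import cKDTree

def iknn_map(head_points, shape, k=3):
    """Inverse k-nearest-neighbor label map: each pixel gets
    1 / (1 + mean distance to its k nearest annotated head points),
    giving a smooth full-image training target instead of a narrow
    Gaussian density map. head_points: (N, 2) array of (row, col)."""
    tree = cKDTree(head_points)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pixels = np.stack([ys.ravel(), xs.ravel()], axis=1)
    dists, _ = tree.query(pixels, k=min(k, len(head_points)))
    mean_d = dists.mean(axis=1) if dists.ndim > 1 else dists
    return (1.0 / (1.0 + mean_d)).reshape(shape)
```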
Download

Paper Nr: 237
Title:

Geometric Deep Learning on Skeleton Sequences for 2D/3D Action Recognition

Authors:

Rasha Friji, Hassen Drira and Faten Chaieb

Abstract: Deep Learning models, albeit successful on data defined on Euclidean domains, are so far constrained in many fields requiring data whose underlying structure is a non-Euclidean space, notably computer vision and imaging. The purpose of this paper is to build a geometry-aware deep learning architecture for skeleton-based action recognition. In this perspective, we propose a framework for non-Euclidean data classification based on 2D/3D skeleton sequences, specifically for Parkinson's disease classification and action recognition. As a baseline, we first design two Euclidean deep learning architectures that do not consider the Riemannian structure of the data. Then, we introduce new architectures that extend Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to non-Euclidean data. Experimental results show that our method outperforms state-of-the-art performance for 2D abnormal behavior classification and 3D human action recognition.
Download

Paper Nr: 306
Title:

Mini V-Net: Depth Estimation from Single Indoor-Outdoor Images using Strided-CNN

Authors:

Ahmed J. Afifi, Olaf Hellwich and Toufique A. Soomro

Abstract: Depth estimation plays a vital role in many computer vision tasks, including scene understanding and reconstruction. However, it is an ill-posed problem when it comes to estimating depth from a single view, due to the ambiguity and the lack of cues and prior knowledge. Solutions proposed so far estimate blurry depth images at low resolutions. Recently, Convolutional Neural Networks (CNNs) have been applied successfully to solve different computer vision tasks such as classification, detection, and segmentation. In this paper, we present a simple fully-convolutional encoder-decoder CNN for estimating depth images from a single RGB image at the same image resolution. For robustness, we optimize the network with a non-convex loss function that is robust to outliers. Our results show that a lightweight, simple model trained using a robust loss function outperforms or achieves results comparable with other methods, quantitatively and qualitatively, and produces better depth information of the scenes, with sharper object boundaries. Our model predicts the depth information in one shot, at the same resolution as the input and without any further post-processing steps.
Download

Short Papers
Paper Nr: 13
Title:

A CNN-based Feature Space for Semi-supervised Incremental Learning in Assisted Living Applications

Authors:

Tobias Scheck, Ana P. Grassi and Gangolf Hirtz

Abstract: A Convolutional Neural Network (CNN) is sometimes confronted with objects of changing appearance (new instances) that exceed its generalization capability. This requires the CNN to incorporate new knowledge, i.e., to learn incrementally. In this paper, we are concerned with this problem in the context of assisted living. We propose using the feature space that results from the training dataset to automatically label problematic images that could not be properly recognized by the CNN. The idea is to exploit the extra information in the feature space for a semi-supervised labeling and to employ the problematic images to improve the CNN’s classification model. Among other benefits, the resulting semi-supervised incremental learning process allows improving the classification accuracy of new instances by 40%, as illustrated by extensive experiments.
Download

Paper Nr: 27
Title:

Resources and End-to-End Neural Network Models for Arabic Image Captioning

Authors:

Obeida ElJundi, Mohamad Dhaybi, Kotaiba Mokadam, Hazem Hajj and Daniel Asmar

Abstract: Image Captioning (IC) is the process of automatically augmenting an image with semantically-laden descriptive text. While English IC has made remarkable strides forward in the past decade, very little work exists on IC for other languages. One possible solution to this problem is to bootstrap off existing English IC systems for image understanding, and then translate the outcome into the required language. Unfortunately, as this paper will show, translated IC is lacking due to the error accumulation of the two tasks, IC and translation. In this paper, we address the problem of image captioning in Arabic. We propose an end-to-end model that directly transcribes images into Arabic text. Due to the lack of Arabic resources, we develop an annotated dataset for Arabic image captioning (AIC). We also develop a base model for AIC that relies on text translation from English image captions. The two models are evaluated with the new dataset, and the results show the superiority of our end-to-end model.
Download

Paper Nr: 35
Title:

SSD-ML: Hierarchical Object Classification for Traffic Surveillance

Authors:

M. H. Zwemer, R. G. J. Wijnhoven and P. H. N. de With

Abstract: We propose a novel CNN detection system with hierarchical classification for traffic object surveillance. The detector is based on the Single-Shot multibox Detector (SSD) and inspired by the hierarchical classification used in the YOLO9000 detector. We separate localization and classification during training by introducing a novel loss term that handles hierarchical classification. This allows combining multiple datasets at different levels of detail with respect to the label definitions and improves localization performance with non-overlapping labels. We experiment with this novel traffic object detector and combine the public UA-DETRAC and MIO-TCD datasets and our newly introduced surveillance dataset with non-overlapping class definitions. The proposed SSD-ML detector obtains 96.4% mAP in localization performance, outperforming default SSD by 5.9%. For this improvement, we additionally introduce a specific hard-negative mining method. The effect of incrementally adding more datasets reveals that the best performance is obtained when training with all datasets combined (we use a separate test set). By adding hierarchical classification, the average classification performance increases by 1.4% to 78.6% mAP. This positive result is based on combining all datasets, although label inconsistencies occur in the additional training data. In addition, the final system can recognize the novel ‘van’ class that is not present in the original training data.
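To illustrate the hierarchical classification idea borrowed from YOLO9000, the probability of a fine label can be computed as a product of softmaxes over sibling groups along the label tree. This toy sketch uses hypothetical logits and a two-level tree, not the paper's actual label definitions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical two-level label tree: one softmax over coarse groups,
# one softmax per sibling group of fine labels; the fine-label
# probability is the product of conditionals along the path,
# in the spirit of YOLO9000's WordTree classification.
coarse_logits = np.array([2.0, 0.5])          # [vehicle, person]
vehicle_logits = np.array([1.2, 0.3, -0.5])   # [car, truck, van]

p_vehicle = softmax(coarse_logits)[0]
p_car_given_vehicle = softmax(vehicle_logits)[0]
print("P(car) =", p_vehicle * p_car_given_vehicle)
```

Such a factorization lets datasets labeled only at the coarse level ("vehicle") train the coarse softmax without penalizing the fine-level predictions, which is what makes combining datasets with non-overlapping label definitions possible.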
Download

Paper Nr: 52
Title:

A Deep Transfer Learning Framework for Pneumonia Detection from Chest X-ray Images

Authors:

Kh T. Islam, Sudanthi Wijewickrema, Aaron Collins and Stephen O’Leary

Abstract: Pneumonia occurs when the lungs are infected by a bacterial, viral, or fungal infection. Globally, it is the single largest infectious cause of child mortality. Early diagnosis and treatment of this disease are critical to avoid death, especially in infants. Traditionally, pneumonia diagnosis was performed by expert radiologists and/or doctors by analysing X-ray images of the chest. Automated diagnostic methods have been developed in recent years as an alternative to expert diagnosis. Deep learning-based image processing has been shown to be effective in the automated diagnosis of pneumonia. However, deep learning typically requires a large number of labelled samples for training, which are time-consuming and expensive to obtain in medical applications as they require the input of human experts. Transfer learning, where a model pretrained for a task on an existing labelled database is adapted to be reused for a different but related task, is a common workaround to this issue. Here, we explore the use of deep transfer learning to diagnose pneumonia using X-ray images of the chest. We demonstrate that using two individual pretrained models as feature extractors and training an artificial neural network on these features is an effective way to diagnose pneumonia. We also show through experiments that the proposed method outperforms similar existing methods with respect to accuracy and time.
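A minimal sketch of the described setup: two pretrained networks act as frozen feature extractors whose pooled features feed a small classifier head. The backbone choices here (ResNet-18 and DenseNet-121 from torchvision) are placeholders, as the abstract does not name the models used:

```python
import torch
import torch.nn as nn
from torchvision import models

# Two ImageNet-pretrained backbones as frozen feature extractors.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = nn.Identity()                    # exposes 512-d pooled features
densenet = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
densenet.classifier = nn.Identity()          # exposes 1024-d pooled features
resnet.eval(); densenet.eval()
for p in list(resnet.parameters()) + list(densenet.parameters()):
    p.requires_grad = False

# Small trainable ANN on the concatenated features.
head = nn.Sequential(nn.Linear(512 + 1024, 256), nn.ReLU(),
                     nn.Linear(256, 2))      # normal vs. pneumonia

x = torch.randn(4, 3, 224, 224)              # a batch of chest X-ray crops
with torch.no_grad():
    feats = torch.cat([resnet(x), densenet(x)], dim=1)
logits = head(feats)
```

Only the head is trained, which keeps the labelled-data requirement small, matching the motivation given in the abstract.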
Download

Paper Nr: 99
Title:

FSSSD: Fixed Scale SSD for Vehicle Detection

Authors:

Jiwon Jun, Hyunjeong Pak and Moongu Jeon

Abstract: Since surveillance cameras are commonly installed in high places, the objects in the captured images are relatively small. Detecting small objects is a hard issue for one-stage detectors, and their performance in surveillance systems is not good. Two-stage detectors work better, but their speed is too slow for real-time systems. To remedy these drawbacks, we propose an efficient method, named Fixed Scale SSD (FSSSD), which is an extension of SSD. The proposed method has three key points: high-resolution inputs to detect small objects, a lightweight backbone to speed up inference, and prediction blocks to enrich features. FSSSD achieves 63.7% AP at 16.7 FPS on the UA-DETRAC test dataset. The performance is similar to that of two-stage detectors, and the method is faster than any other one-stage method.
Download

Paper Nr: 107
Title:

MobText: A Compact Method for Scene Text Localization

Authors:

Luis L. Decker, Allan S. Pinto, Jose F. Campana, Manuel C. Neira, Andreza D. Santos, Jhonatas S. Conceição, Marcus A. Angeloni, Lin T. Li and Ricardo S. Torres

Abstract: Multiple research initiatives have been reported to yield highly effective results for the text detection problem. However, most of those solutions are very costly, which hampers their use in several applications that rely on devices with restricted processing power, like smartwatches and mobile phones. In this paper, we address this issue by investigating the use of efficient object detection networks for this problem. We propose the combination of two light architectures, MobileNetV2 and the Single Shot Detector (SSD), for the text detection problem. Experimental results on the ICDAR’11 and ICDAR’13 datasets demonstrate that our solution yields the best trade-off between effectiveness and efficiency and also achieves state-of-the-art results on the ICDAR’11 dataset with an f-measure of 96.09%.
Download

Paper Nr: 122
Title:

Monocular 3D Head Reconstruction via Prediction and Integration of Normal Vector Field

Authors:

Oussema Bouafif, Bogdan Khomutenko and Mohamed Daoudi

Abstract: Reconstructing the geometric structure of a face from a single input image is a challenging and active research area in computer vision. In this paper, we present a novel method for reconstructing 3D heads from an input image using a hybrid approach based on learning and geometric techniques. We introduce a deep neural network trained on synthetic data only, which predicts the map of normal vectors of the face surface from a single photo. Afterwards, using the network output, we recover the 3D facial geometry by means of weighted least squares. Through qualitative and quantitative evaluation tests, we show the accuracy and robustness of our proposed method. Our method does not require accurate alignment, thanks to the image-to-image translation network, and successfully recovers 3D geometry for real images, despite the fact that the model was trained only on synthetic data.
Download

Paper Nr: 127
Title:

Explaining Spatial Relation Detection using Layerwise Relevance Propagation

Authors:

Gabriel Farrugia and Adrian Muscat

Abstract: In computer vision, learning to detect relationships between objects is an important way to thoroughly understand images. Machine Learning models have been developed in this area. However, in critical scenarios where a simple decision is not enough, reasons to back up each decision are required and reliability comes into play. We investigate the role that geometric, language and depth features play in the task of predicting Spatial Relations by generating feature relevance measures using Layerwise Relevance Propagation. We carry out the evaluation of feature contributions on a per-class basis.
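For reference, the epsilon rule of Layerwise Relevance Propagation for a single dense layer redistributes the relevance of each output neuron to the inputs in proportion to their contributions. This is a generic numpy sketch of the standard rule, not the authors' code:

```python
import numpy as np

def lrp_epsilon_dense(x, W, b, relevance_out, eps=1e-6):
    """Epsilon rule of LRP for one dense layer y = W x + b.
    x: (in,), W: (out, in), b: (out,), relevance_out: (out,).
    Returns the relevance redistributed onto the inputs, (in,)."""
    z = W * x[np.newaxis, :]                 # contributions z_ij, shape (out, in)
    denom = z.sum(axis=1) + b                # pre-activations
    denom = denom + eps * np.sign(denom)     # stabilizer against division by ~0
    return (z / denom[:, np.newaxis] * relevance_out[:, np.newaxis]).sum(axis=0)
```

Applied layer by layer from the output back to the geometric, language and depth inputs, this yields the per-feature relevance measures the abstract refers to.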
Download

Paper Nr: 131
Title:

Encrypted Image Display based on Individual Visual Characteristics

Authors:

Ryota Niwa, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose an encoded-image display system based on light field decoding by the human visual system. We focus on individual differences in the characteristics of human visual systems, and we show that these differences lead to significant differences in the observation results. Based on this observation difference, we encode an image into a light field which can be decoded only by a human with a particular visual characteristic. To achieve the image encoding system, we construct a 5D light field display system. This 5D LF display controls the spectral distribution of the light rays as well as their positions and directions. By using the 5D LF display, we exploit the spectral sensitivity and optical characteristics of the human visual system. Several experimental results show that our proposed method can disturb the observation of the general audience and provide appropriate information to a target person.
Download

Paper Nr: 147
Title:

Manifold Learning-based Clustering Approach Applied to Anomaly Detection in Surveillance Videos

Authors:

Leonardo T. Lopes, Lucas P. Valem, Daniel G. Pedronette, Ivan R. Guilherme, João P. Papa, Marcos S. Santana and Danilo Colombo

Abstract: The huge increase in the amount of available multimedia data and the pressing need to organize it into different categories, especially in scenarios where no labels are available, make data clustering an essential task. In this work, we present a novel clustering method based on an unsupervised manifold learning algorithm, in which a more effective similarity measure is computed by the manifold learning and used for clustering purposes. The proposed approach is applied to anomaly detection in videos and used in combination with different background segmentation methods to improve their effectiveness. An experimental evaluation is conducted on three different image datasets and one video dataset. The obtained results indicate superior accuracy in most clustering tasks when compared to the baselines. The results also demonstrate that the clustering step can improve the results of background subtraction approaches in the majority of cases.
Download

Paper Nr: 165
Title:

CNN Hyperparameter Tuning Applied to Iris Liveness Detection

Authors:

Gabriela Y. Kimura, Diego R. Lucio, Alceu S. Britto Jr. and David Menotti

Abstract: The iris pattern has significantly improved the biometric recognition field due to its high level of stability and uniqueness. Such a physical feature has played an important role in security and other related areas. However, presentation attacks, also known as spoofing techniques, can be used to bypass the biometric system with artifacts such as printed images, artificial eyes, and textured contact lenses. To improve the security of these systems, many liveness detection methods have been proposed, and the first International Iris Liveness Detection competition was launched in 2013 to evaluate their effectiveness. In this paper, we propose a hyperparameter tuning of the CASIA algorithm, submitted by the Chinese Academy of Sciences to the third Iris Liveness Detection competition, in 2017. The proposed modifications yielded an overall improvement, with an 8.48% Attack Presentation Classification Error Rate (APCER) and a 0.18% Bonafide Presentation Classification Error Rate (BPCER) in the evaluation of the combined datasets. Other threshold values were evaluated in an attempt to reduce the trade-off between the APCER and the BPCER on the evaluated datasets, and this worked out successfully.
Download

Paper Nr: 182
Title:

A New Approach Combining Trained Single-view Networks with Multi-view Constraints for Robust Multi-view Object Detection and Labelling

Authors:

Yue Zhang, Adrian Hilton and Jean-Yves Guillemaut

Abstract: We propose a multi-view framework for joint object detection and labelling based on pairs of images. The proposed framework extends the single-view Mask R-CNN approach to multiple views without the need for additional training. Dedicated components are embedded into the framework to match objects across views by enforcing epipolar constraints, appearance feature similarity, and class coherence. The multi-view extension enables the proposed framework to detect objects which would otherwise be mis-detected in a classical Mask R-CNN approach, and achieves coherent object labelling across views. By avoiding the need for additional training, the approach effectively overcomes the current shortage of multi-view datasets. The proposed framework achieves high-quality results on a range of complex scenes, being able to output class, bounding box, mask and an additional label enforcing coherence across views. In the evaluation, we show qualitative and quantitative results on several challenging outdoor multi-view datasets and perform a comprehensive comparison to verify the advantages of the proposed method.
Download

Paper Nr: 188
Title:

A Method for Detecting Human-object Interaction based on Motion Distribution around Hand

Authors:

Tatsuhiro Tsukamoto, Toru Abe and Takuo Suganuma

Abstract: Detecting human-object interaction in video images is an important issue in many computer vision applications. Among the various types of human-object interaction, the type where a person is in the middle of moving an object with his/her hand is especially key to observing several critical events, such as stealing luggage or abandoning suspicious substances in public spaces. This paper proposes a novel method for detecting this type of human-object interaction. In the proposed method, an area surrounding each hand is set in the input video frames, and the motion distribution in every surrounding area is analyzed. Whether or not each hand moves an object is decided by whether or not its surrounding area contains regions where movements similar to those of the hand are concentrated. Since the proposed method does not need to explicitly extract object regions or recognize their relations to person regions, its effectiveness in detecting this human-object interaction, specifically hands that are in the middle of moving objects, is expected to hold for diverse situations, e.g., several persons individually moving unknown objects with their hands in a scene.
Download

Paper Nr: 211
Title:

A Multi-purpose RGB-D Dataset for Understanding Everyday Objects

Authors:

Shuichi Akizuki and Manabu Hashimoto

Abstract: This paper introduces our ongoing work on establishing a novel dataset for benchmarking multiple robot vision tasks that handle everyday objects. Our dataset is composed of 3D models, RGB-D input scenes and multi-type annotations. The 3D models are full-3D scan data of 100 everyday objects. The input scenes are over 54k RGB-D images that capture a table-top environment including randomly placed everyday objects. Our dataset also provides four types of annotation: bounding boxes, affordance labels, object class labels, and 6-degrees-of-freedom (6DoF) poses. These are labeled for all objects in an image. The annotations are easily assigned to images via an original 6DoF annotation tool with a simple graphical interface. We also report benchmarking results for modern object recognition algorithms.
Download

Paper Nr: 212
Title:

Dynamic Mode Decomposition via Dictionary Learning for Foreground Modeling in Videos

Authors:

Israr U. Haq, Keisuke Fujii and Yoshinobu Kawahara

Abstract: Accurate extraction of foregrounds in videos is one of the challenging problems in computer vision. In this study, we propose dynamic mode decomposition via dictionary learning (dl-DMD), which is applied to extract moving objects by separating the sequence of video frames into foreground and background information, with a dictionary learned using block patches on the video frames. Dynamic mode decomposition (DMD) decomposes spatiotemporal data into spatial modes, each of whose temporal behavior is characterized by a single frequency and growth/decay rate, and can thus split a video into foreground and background. In dl-DMD, DMD is applied to coefficient matrices estimated over a learned dictionary, which enables accurate estimation of dynamical information in videos. Owing to this scheme, dl-DMD can analyze the dynamics of the respective regions in a video based on the estimated amplitudes and temporal evolution over patches. The results on synthetic data show that dl-DMD outperforms the standard DMD and compressed DMD (cDMD) based methods. Also, an empirical performance evaluation on foreground extraction from videos using a publicly available dataset demonstrates the effectiveness of the proposed dl-DMD algorithm, which achieves performance comparable to that of state-of-the-art techniques in foreground extraction tasks.
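For context, exact DMD on a pair of snapshot matrices can be written compactly via the SVD; in dl-DMD the same decomposition would be applied to the sparse-coding coefficient matrices over the learned dictionary rather than to raw frames. A standard sketch, with the truncation rank r as a free parameter:

```python
import numpy as np

def dmd(X, Y, r):
    """Exact DMD: X = [x_0 .. x_{m-1}] and Y = [x_1 .. x_m] are
    snapshot matrices (columns are vectorized frames or coefficient
    vectors). Returns the DMD eigenvalues and spatial modes of rank r."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]
    A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
    eigvals, W = np.linalg.eig(A_tilde)
    modes = Y @ Vh.conj().T @ np.diag(1.0 / s) @ W
    return eigvals, modes

# The background is reconstructed from modes whose eigenvalue magnitude
# is close to 1 (near-zero temporal frequency); the remaining modes
# capture the moving foreground.
```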
Download

Paper Nr: 215
Title:

Anomaly Detection in Surveillance Videos by Future Appearance-motion Prediction

Authors:

Tuan-Hung Vu, Sebastien Ambellouis, Jacques Boonaert and Abdelmalik Taleb-Ahmed

Abstract: Anomaly detection in surveillance videos is the identification of rare events which produce features different from those of normal events. In this paper, we present a survey of the progress of anomaly detection techniques and introduce our proposed framework to tackle this very challenging objective. Our approach builds on recent state-of-the-art techniques and casts anomalous events as unexpected events in future frames. Our framework is flexible enough that almost all important modules can be replaced by existing state-of-the-art methods. The most popular solutions only use future predicted information as a constraint for training a convolutional encoder-decoder network to reconstruct frames, and take as score the difference between the original and reconstructed information. We propose a fully future-prediction-based framework that directly defines the feature as the difference between future predictions and ground-truth information. This feature can be fed into various types of learning models to assign anomaly labels. We present our experimental plan and argue that our framework’s performance will be competitive with state-of-the-art scores by presenting early promising results in feature extraction.
Download

Paper Nr: 235
Title:

Plant Species Identification using Discriminant Bag of Words (DBoW)

Authors:

Fiza Murtaza, Umber Saba, Muhammad H. Yousaf and Serestina Viriri

Abstract: Plant species identification is necessary for protecting biodiversity, which is declining rapidly throughout the world. This research work focuses on plant species identification against simple and complex backgrounds using computer vision techniques. Intra-class variability and inter-class similarity are the key challenges in a large plant species dataset. In this paper, multiple organs of plants, such as leaf, flower, stem and fruit, are classified using hand-crafted features for the identification of plant species. We propose a novel encoding scheme named Discriminant Bag of Words (DBoW) to identify multiple organs of plants. The proposed DBoW extracts class-specific codewords and assigns weights to the codewords in order to signify their discriminant power. We evaluated our proposed method on two publicly available datasets, Flavia and ImageClef. The experimental results achieved classification accuracy rates of 98% and 94% on the Flavia and ImageClef datasets, respectively.
Download

Paper Nr: 243
Title:

Proxy Embeddings for Face Identification among Multi-Pose Templates

Authors:

Weronika Gutfeter and Andrzej Pacut

Abstract: Many large-scale face identification systems operate on databases containing images showing heads in multiple poses (from frontal to full profile). However, as shown in this paper, off-the-shelf methods are not able to take advantage of this particular data structure. The main idea behind our work is to adapt methods proposed for multi-view and semi-3D object classification to the multi-pose face recognition problem. The proposed approach involves neural network training with proxy embeddings and building gallery templates out of aggregated samples. A benchmark testing scenario is proposed for the purpose of the problem, based on linked gallery and probe databases. The gallery database consists of multi-pose face images taken under controlled conditions, and the probe database contains samples of the in-the-wild type. Both databases must be linked, having at least partially common labels. Two variants of the proposed training procedure were tested, namely the neighbourhood component analysis with proxies (NCA-proxies) and the triplet margin loss with proxies (triplet-proxies). It is shown that the proposed methods perform better than models trained with cross-entropy loss and than off-the-shelf methods. Rank-1 accuracy was improved from 48.82% for the off-the-shelf baseline to 86.86% for NCA-proxies. In addition, the transfer of proxy points between two independently trained models is discussed, similarly to hyper-parameter transfer methodology. Proxy embedding transfer opens up the possibility of training two domain-specific networks, one for each of the two datasets in the identification scheme.
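As a rough illustration of proxy-based training, a commonly used simplified Proxy-NCA objective assigns one learnable proxy per identity and applies a softmax over negative squared distances. This is a sketch of the generic technique, not necessarily the exact variant used in the paper:

```python
import torch
import torch.nn.functional as F

def proxy_nca_loss(embeddings, labels, proxies):
    """Simplified Proxy-NCA: each sample is pulled toward its class
    proxy and pushed away from all other proxies.
    embeddings: (B, D), proxies: (C, D) learnable, labels: (B,) ints."""
    e = F.normalize(embeddings, dim=1)
    p = F.normalize(proxies, dim=1)
    dists = torch.cdist(e, p) ** 2            # (B, C) squared distances
    return F.cross_entropy(-dists, labels)    # softmax over negative distances
```

Because proxies stand in for whole identities, no pair or triplet mining is required, which is the usual motivation for proxy-based losses over plain NCA or triplet training.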
Download

Paper Nr: 260
Title:

Who Loves Virtue as much as He Loves Beauty?: Deep Learning based Estimator for Aesthetics of Portraits

Authors:

Tobias Gerlach, Michael Danner, Le P. Peng, Aidas Kaminickas, Wu Fei and Matthias Rätsch

Abstract: ”I have never seen one who loves virtue as much as he loves beauty,” Confucius once said. If beauty is more important than goodness, it becomes clear why people invest so much effort in their first impression. The aesthetics of faces has many aspects, and there is a strong correlation with all human characteristics, like age and gender. Often, research on aesthetics by social and ethics scientists lacks sufficient labelled data and the support of machine vision tools. In this position paper we propose the Aesthetic-Faces dataset, containing training data labelled by Chinese and German annotators. As a combination of three image subsets, the AF-dataset consists of European, Asian and African people. The research communities in machine learning, aesthetics and social ethics can benefit from our dataset and our toolbox. The toolbox provides many functions for machine learning with state-of-the-art CNNs and an Extreme-Gradient-Boosting regressor, as well as 3D Morphable Model technologies for face shape evaluation, and we discuss how to train an aesthetic estimator considering culture and ethics.
Download

Paper Nr: 267
Title:

Light Field Scattering in Participating Media

Authors:

Takuya Mokutani, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a representation of light scattering in participating media which can represent all orders of light scattering simply. To derive the model, we focus on the light field in the participating medium and show that a convolution of the light field can describe the attenuation and scattering of the light rays. By analyzing the convolution kernel, we derive a simple kernel that represents all orders of light scattering. We also introduce a method for estimating the characteristics of the participating medium based on our proposed model. Several experimental results show that our proposed model describes light scattering more appropriately than existing models.
Download

Paper Nr: 283
Title:

Automated Generation of Synthetic in-Car Dataset for Human Body Pose Detection

Authors:

João Borges, Bruno Oliveira, Helena Torres, Nelson Rodrigues, Sandro Queirós, Maximilian Shiller, Victor Coelho, Johannes Pallauf, José H. Brito, José Mendes and Jaime C. Fonseca

Abstract: In this paper, a toolchain for the generation of realistic synthetic images for human body pose detection in an in-car environment is proposed. The toolchain creates a customized synthetic environment comprising human models, car, and camera. Poses are automatically generated for each human, taking into account a per-joint-axis Gaussian distribution, constrained by anthropometric and range-of-motion measurements. Scene validation is done through collision detection. Rendering is focused on vision data, supporting time-of-flight (ToF) and RGB cameras and generating synthetic images from these sensors. Ground-truth data is then generated, comprising the car occupants’ body pose (2D/3D), as well as full-body RGB segmentation frames with different body part labels. We demonstrate the feasibility of using synthetic data, combined with real data, to train distinct machine learning algorithms, demonstrating the improvement in their accuracy for the in-car scenario.
Download

Paper Nr: 303
Title:

Towards Unsupervised Image Harmonisation

Authors:

Alan Dolhasz, Carlo Harvey and Ian Williams

Abstract: The field of image synthesis intrinsically relies on the process of image compositing. This process can be automatic or manual, and depends upon artistic intent. Compositing can introduce errors due to human-detectable differences in the general pixel-level transforms of the component elements of an image composite. We report on a pilot study evaluating a proof-of-concept automatic image composite harmonisation system consisting of a state-of-the-art deep harmonisation model and a perceptually-based composite luminance artifact detector. We evaluate the performance of both systems on a large dataset of 68,128 automatically generated image composites and find that, without any task-specific adaptations, the end-to-end system achieves results comparable to the baseline harmoniser fed with ground-truth composite masks. We discuss these findings in the context of extending this to an end-to-end, multi-task system.
Download

Paper Nr: 305
Title:

A Preliminary Study on the Automatic Visual based Identification of UAV Pilots from Counter UAVs

Authors:

Dario Cazzato, Claudio Cimarelli and Holger Voos

Abstract: Two typical Unmanned Aerial Vehicle (UAV) countermeasures involve the detection and tracking of the UAV position, as well as of the human pilot; both are of critical importance before taking any countermeasure, and they have already received strong attention from national security agencies in different countries. Recent advances in computer vision and artificial intelligence have already produced many visual detection systems running on an operating UAV, but they do not focus on the problem of detecting the pilot of another approaching unauthorized UAV. In this work, a first attempt at a fully autonomous pipeline to process images from a flying UAV in order to detect the pilot of an unauthorized UAV entering a no-fly zone is introduced. A challenging video sequence has been created by flying a UAV in an urban scenario, and it has been used for this preliminary evaluation. Experiments show very encouraging results in terms of recognition, and a complete dataset to evaluate artificial intelligence-based solutions will be prepared.
Download

Paper Nr: 307
Title:

3D Convolutional Neural Network for Falling Detection using Only Depth Information

Authors:

Sara Luengo Sánchez, Sergio de López Diz, David Fuentes-Jiménez, Cristina Losada-Gutiérrez, Marta Marrón-Romera and Ibrahim Sarker

Abstract: Nowadays, one of the major challenges global society is facing is population aging, which involves an increase in medical expenses. Since falls are the major cause of injuries for elderly people, the need for a low-cost fall detector has increased rapidly over the years. In this context, we propose a fall-detection system based on 3D Convolutional Neural Networks (3D-CNN). Because the system only uses depth information, obtained by an RGB-D sensor placed in an overhead position to avoid occlusions, it is a less invasive and intrusive fall-detection method for users than systems based on wearables. In addition, depth information preserves people’s privacy, since they cannot be identified from it. The 3D-CNN obtains spatial and temporal features from the depth data, which allows classifying users’ actions and detecting when a fall occurs. Since there are no other available datasets for action recognition using only depth data from a top-view camera, the authors have recorded and labeled the GOTPD3 dataset, which has been made available to the scientific community. Training and evaluation of the network have thus been carried out on the GOTPD3 dataset, and the achieved results validate the proposal.
Download

Paper Nr: 310
Title:

Hidden Markov Models for Pose Estimation

Authors:

László Czúni and Amr M. Nagy

Abstract: Estimating the pose of objects is essential for interacting with the real world in many applications such as robotics, augmented reality and autonomous driving. The key challenges in recognizing objects and their pose stem from the diversity of their visual appearance, in addition to the complexity of the environment, variations in illumination, and possible occlusions. We have previously shown that Hidden Markov Models (HMMs) can improve the recognition of objects, even with weak object classifiers, if orientation information is also utilized during the recognition process. In this paper we describe our first attempts at applying HMMs to improve the pose selection of elementary convolutional neural networks (CNNs).
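A minimal sketch of how an HMM forward (filtering) recursion can fuse per-frame CNN pose posteriors with a pose transition model, so that temporally inconsistent pose flips are suppressed. Treating the CNN outputs directly as state likelihoods is a simplifying assumption, not necessarily the authors' formulation:

```python
import numpy as np

def hmm_filter(cnn_posteriors, transition, prior):
    """Forward recursion over discrete pose states.
    cnn_posteriors: (T, S) per-frame CNN scores over S pose states,
    transition: (S, S) pose transition probabilities (row-stochastic),
    prior: (S,) initial state distribution. Returns (T, S) beliefs."""
    belief = prior * cnn_posteriors[0]
    belief /= belief.sum()
    beliefs = [belief]
    for t in range(1, len(cnn_posteriors)):
        predicted = transition.T @ belief          # propagate via motion model
        belief = predicted * cnn_posteriors[t]     # weigh in weak classifier evidence
        belief /= belief.sum()
        beliefs.append(belief)
    return np.array(beliefs)
```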
Download

Paper Nr: 23
Title:

Towards Deep People Detection using CNNs Trained on Synthetic Images

Authors:

Roberto Martín-López, David Fuentes-Jiménez, Sara Luengo-Sánchez, Cristina Losada-Gutiérrez, Marta Marrón-Romera and Carlos Luna

Abstract: In this work, we propose a people detection system that uses only depth information, provided by an RGB-D camera in a frontal position. The proposed solution is based on a Convolutional Neural Network (CNN) with an encoder-decoder architecture, formed by ResNet residual layers, which have been widely used in detection and classification tasks. The system takes a depth map as input, generated by a time-of-flight or a structured-light sensor. Its output is a probability map (with the same size as the input) where each detection is represented as a Gaussian function whose mean is the position of the person’s head. Once this probability map is generated, refinement techniques are applied in order to improve the detection precision. During the training process, only synthetic images generated with the software Blender were used, thus avoiding the need to acquire and label large image datasets. The described system has been evaluated using both synthetic and real images acquired with a Microsoft Kinect II camera. In addition, we have compared the obtained results with those of other state-of-the-art works, proving that the results are similar despite not having used real data during the training procedure.
Download

Paper Nr: 31
Title:

Multi-Branch Convolutional Descriptors for Content-based Remote Sensing Image Retrieval

Authors:

Raffaele Imbriaco, Tunc Alkanat, Egor Bondarev and Peter N. de With

Abstract: Content-based remote sensing image retrieval (CBRSIR) is an important problem in computer vision with many applications, including military, agriculture, and surveillance. In this study, inspired by recent developments in person re-identification, we design and fine-tune a multi-branch deep learning architecture that combines global and local features to obtain rich and discriminative image representations. Additionally, we propose a new evaluation strategy that fully separates the test and training sets and where new unseen data is used for querying, thereby emphasizing the generalization capability of retrieval systems. Extensive evaluations show that our method significantly outperforms existing approaches by up to 10.7% in mean precision@20 on popular CBRSIR datasets. Regarding the new evaluation strategy, our method attains excellent retrieval performance, yielding more than 95% precision@20 on the challenging PatternNet dataset.
Download

Paper Nr: 37
Title:

Robust Method for Detecting Defect in Images Printed on 3D Micro-textured Surfaces: Modified Multiple Paired Pixel Consistency

Authors:

Sheng Xiang, Shun’ichi Kaneko and Dong Liang

Abstract: Many conventional visual inspection methods suffer from problems such as shadowing when examining three-dimensional micro-textured surfaces or under illumination fluctuations. Thus, we propose a modified method, based on orientation codes and the consistency of multiple pixel pairs, to inspect defects in logotypes printed on three-dimensional micro-textured surfaces. The algorithm comprises a training stage and a detection stage. The aim of the training stage is to locate and pair supporting pixels that show change trends similar to those of a target pixel, and to create a statistical model for each pixel pair. Here, we introduce a modification that uses the chi-square test and skewness to increase the precision of the statistical model. The detection stage identifies whether the target pixel matches its model and judges whether it is defective. The results show the effectiveness of our proposed method for detecting defects in real product images.
Download

Paper Nr: 41
Title:

Vessel-speed Enforcement System by Multi-camera Detection and Re-identification

Authors:

H. G. J. Groot, M. H. Zwemer, R. G. J. Wijnhoven, Y. Bondarev and P. H. N. de With

Abstract: In crowded waterways, maritime traffic is bound to speed regulations for safety reasons. Although several speed measurement techniques exist for road traffic, no such systems are known for maritime traffic. In this paper, we introduce a novel vessel-speed enforcement system based on visual detection and re-identification between two cameras along a waterway. We introduce a newly captured Vessel-reID dataset containing 2,474 unique vessels. Our vessel detector is based on the Single Shot Multibox Detector and localizes vessels in each camera individually. Our re-identification algorithm, based on the TriNet model, matches vessels between the cameras. In general, vessels are detected over a large range of their in-view trajectory (over 92% and 95% for Cameras 1 and 2, respectively), which makes the re-identification experiments reliable. For re-identification, application-specific techniques, i.e. trajectory matching and time filtering, improve our baseline re-identification model (49.5% mAP) by over 20% mAP. In the final evaluation, we show that 77% (Rank-1 score) of the vessels are correctly re-identified in the other camera. This final result presents a feasible score for our novel vessel re-identification application. Moreover, our result could be further improved, as we have tested on new unseen data captured under different weather conditions.
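Once a vessel is re-identified in both cameras, the speed estimate itself is elementary: the known distance between the two camera gates divided by the travel time. The gate distance and timestamps below are purely illustrative, not values from the paper:

```python
# Hypothetical gate distance between the two camera views, in meters.
GATE_DISTANCE_M = 850.0

def vessel_speed_kmh(t_cam1_s, t_cam2_s, distance_m=GATE_DISTANCE_M):
    """Average speed from the two detection timestamps (seconds)."""
    return 3.6 * distance_m / (t_cam2_s - t_cam1_s)

print(vessel_speed_kmh(0.0, 240.0))  # 850 m in 4 minutes -> 12.75 km/h
```

This also makes clear why correct re-identification is the crux of the system: a mismatched vessel pair yields a wrong travel time and hence a wrong speed.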
Download

Paper Nr: 43
Title:

Semisupervised Classification of Anomalies Signatures in Electrical Wafer Sorting (EWS) Maps

Authors:

Luigi C. Viagrande, Filippo M. Milotta, Paola Giuffrè, Giuseppe Bruno, Daniele Vinciguerra and Giovanni Gallo

Abstract: We focus on a very specific kind of data from semiconductor manufacturing called Electrical Wafer Sorting (EWS) maps, which are generated during the wafer testing phase of semiconductor device fabrication. Yield detractors are identified by specific and characteristic anomaly signatures. Unfortunately, new anomaly signatures may appear among the huge number of EWS maps generated per day. Hence, it is unfeasible to define just a finite set of possible signatures, as this would not represent a real use-case scenario. Our goal is anomaly signature classification. For this purpose, we present a semisupervised approach combining hierarchical clustering, to create the starting knowledge base, and a supervised classifier trained by leveraging the clustering phase. Our dataset grows daily, and the classifier is dynamically updated to account for possible newly created clusters. Training a Convolutional Neural Network, we reached performance comparable with other state-of-the-art techniques, even though our method does not rely on any labeled dataset and can be updated daily. Our dataset is skewed, and the proposed method was proved to be rotation invariant. The proposed method can grant benefits such as a reduction of wafer test result review time, and improvements in the processes, yield, quality, and reliability of production using the information obtained during the clustering process.
Download

Paper Nr: 54
Title:

Facial Expression Recognition using the Bilinear Pooling

Authors:

Marwa Ben Jabra, Ramzi Guetari, Aladine Chetouani, Hedi Tabia and Nawres Khlifa

Abstract: Emotions color our lives and allow us to express the different facets of a personality. Among the expressions of the human body, facial expressions are the most representative of a person's state of mind. Several works have been devoted to them, and applications have already been developed. These applications, based on computer vision, nevertheless face limitations and difficulties related to the point of view, lighting, occlusions, etc. Artificial Neural Networks (ANNs) have been introduced to overcome some of these limitations. They give satisfactory results, but still have not solved all the problems, such as camera angle, head position, and occlusions. In this paper, we review the neural network models used in the field of facial emotion recognition. We also propose an architecture based on bilinear pooling in order to improve on the results obtained by previous works and to provide solutions to these recurring constraints. This technique greatly improves the results obtained by architectures based on conventional CNNs.
Download
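
As a rough illustration of bilinear pooling in general (not necessarily the authors' exact architecture): features from two CNN branches are combined by an outer product pooled over spatial locations, then passed through a signed square root and L2 normalization.

import torch

def bilinear_pool(fa, fb):
    # fa, fb: (batch, channels, height*width) feature maps from two CNN branches.
    b, c, n = fa.shape
    x = torch.bmm(fa, fb.transpose(1, 2)) / n            # outer product pooled over locations
    x = x.view(b, -1)
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-8)  # signed square root
    return torch.nn.functional.normalize(x, dim=1)       # L2 normalization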

Paper Nr: 59
Title:

Acquisition of Optimal Connection Patterns for Skeleton-based Action Recognition with Graph Convolutional Networks

Authors:

Katsutoshi Shiraki, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Action recognition from skeletons is gaining attention since skeleton data can be easily obtained from depth sensors and from highly accurate pose estimation methods such as OpenPose. Methods using graph convolutional networks (GCN) have been proposed for action recognition with skeletons as input. Among them, the spatial temporal GCN (ST-GCN) achieves higher accuracy by capturing skeletal data as spatial and temporal graphs. However, because ST-GCN defines human skeleton patterns in advance and then applies convolution, it cannot capture features that take into account the joint relationships specific to each action. The purpose of this work is to recognize actions considering the connection patterns specific to each action class. The optimal connection pattern is obtained by acquiring features of each action class through multitask learning and selecting edges on the basis of the values of a weight matrix indicating the importance of the edges. Experimental results show that the proposed method achieves higher classification accuracy than the conventional method. Moreover, we visualize the connection patterns obtained by the proposed method and show that it can obtain specific connection patterns for each action class.
Download
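
The edge-selection step above can be illustrated as follows: given a learned weight matrix over joint pairs indicating edge importance, keep only the strongest connections for an action class. The top-k selection below is a simplified sketch of that idea, with details of our own choosing.

import numpy as np

def select_edges(weight, k=5):
    # Return the k most important joint pairs from a learned (J, J) weight matrix.
    w = np.abs(weight.copy())
    np.fill_diagonal(w, 0.0)          # ignore self-connections
    idx = np.dstack(np.unravel_index(np.argsort(w, axis=None)[::-1], w.shape))[0]
    edges, seen = [], set()
    for i, j in idx:
        if (j, i) not in seen:        # treat the graph as undirected
            edges.append((int(i), int(j)))
            seen.add((i, j))
        if len(edges) == k:
            break
    return edges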

Paper Nr: 63
Title:

Clothing Category Classification using Common Models Adaptively Adjusted to Observation

Authors:

Jingyu Hu, Nobuyuki Kita and Yasuyo Kita

Abstract: This paper proposes a method for automatically classifying the category of clothing items by adaptively adjusting common models according to each observation. In previous work (Hu and Kita, 2015), we proposed a two-stage method for categorizing a clothing item using a dual-arm robot. First, to alleviate the effect of large physical deformations, the method reshaped the clothing item of interest into one of a small number of limited shapes using a fixed basic sequence of re-grasp actions. The resulting shape was then matched with shape potential images of each clothing category, each of which was configured by combining the contours of variously designed items of the same category. However, these shape potential images were too general to be highly discriminative. In this paper, we propose to configure highly discriminative shape potential images by adjusting them according to observation. Concretely, we restrict the contours used for the potential images according to simply observable information. Two series of experiments using various clothing items from five categories demonstrate the effectiveness of the proposed method.
Download

Paper Nr: 71
Title:

Deep Body-pose Estimation via Synthetic Depth Data: A Case Study

Authors:

Christopher Pramerdorfer and Martin Kampel

Abstract: Computer Vision research is nowadays largely data-driven due to the prevalence of deep learning. This is one reason why depth data have become less popular, as no datasets exist that are comparable to common color datasets in terms of size and quality. However, depth data have advantages in practical applications that involve people, in which case utilizing cameras raises privacy concerns. We consider one such application, namely 3D human pose estimation for a health care application, to study whether the lack of large depth datasets that represent this problem can be overcome via synthetic data, which aspects must be considered to ensure generalization, and how this compares to alternative approaches for obtaining training data. Furthermore, we compare the pose estimation performance of our method on depth data to that of state-of-the-art methods for color images and show that depth data is a suitable alternative to color images in this regard.
Download

Paper Nr: 97
Title:

Detecting Anomalous Regions from an Image based on Deep Captioning

Authors:

Yusuke Hatae, Qingpu Yang, Muhammad F. Fadjrimiratno, Yuanyuan Li, Tetsu Matsukawa and Einoshin Suzuki

Abstract: In this paper, we propose a one-class method for detecting anomalous regions in an image based on deep captioning. Such a method can be installed on an autonomous mobile robot, which reports anomalies from observation without any human supervision, and would interest a wide range of researchers, practitioners, and users. In addition to the image features used by conventional methods, our method exploits recent advances in deep captioning, which is based on deep neural networks trained on large-scale data of image-caption pairs, enabling anomaly detection at the semantic level. Incremental clustering is adopted so that the robot is able to model its observations as a set of clusters and report substantially new observations as anomalies. Extensive experiments using two real-world datasets demonstrate the superiority of our method in terms of recall, precision, F-measure, and AUC over the traditional approach. The experiments also show that our method exhibits an excellent learning curve and low threshold dependency.
Download
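
The incremental clustering component above can be sketched as a nearest-centroid scheme: an observation joins the closest cluster if it lies within a distance threshold, and otherwise starts a new cluster and is reported as an anomaly. The threshold and running-mean update below are our simplifications, not the paper's exact procedure.

import numpy as np

class IncrementalClusterer:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.centroids, self.counts = [], []

    def observe(self, feature):
        # Assign a feature vector to a cluster; return True if it is anomalous (new).
        if self.centroids:
            d = [np.linalg.norm(feature - c) for c in self.centroids]
            i = int(np.argmin(d))
            if d[i] < self.threshold:
                # Running-mean update of the matched centroid.
                self.counts[i] += 1
                self.centroids[i] += (feature - self.centroids[i]) / self.counts[i]
                return False
        self.centroids.append(feature.astype(float).copy())
        self.counts.append(1)
        return True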

Paper Nr: 116
Title:

Multi-stream Architecture with Symmetric Extended Visual Rhythms for Deep Learning Human Action Recognition

Authors:

Hemerson Tacon, André S. Brito, Hugo L. Chaves, Marcelo B. Vieira, Saulo M. Villela, Helena A. Maia, Darwin T. Concha and Helio Pedrini

Abstract: Despite the significant progress of Deep Learning models on the image classification task, they still need enhancements for the Human Action Recognition task. In this work, we propose to extract horizontal and vertical Visual Rhythms, as well as their data augmentations, as video features. The data augmentation is driven by crops extracted from the symmetric extension of the time dimension, preserving the video frame rate, which is essential to keep motion patterns. The crops provide a 2D representation of the video volume matching the fixed input size of a 2D Convolutional Neural Network. In addition, multiple crops with stride guarantee coverage of the entire video. We verified that the combination of horizontal and vertical directions leads to better results than previous methods. A multi-stream strategy combining RGB and Optical Flow information is modified to include the additional spatiotemporal streams: one for the horizontal Symmetrically Extended Visual Rhythm (SEVR) and another for the vertical one. Results show that our method achieves accuracy rates close to the state of the art on the challenging UCF101 and HMDB51 datasets. Furthermore, we assessed the impact of data augmentation methods for Human Action Recognition and verified an increase of 10% for the UCF101 dataset.
Download
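
On our reading, a visual rhythm collapses each frame to a 1D signature and stacks these signatures over time into a 2D image, and the symmetric extension mirrors the time axis so that fixed-size crops can be taken without altering the frame rate. A minimal sketch of both constructions (axis conventions are ours, not the authors'):

import numpy as np

def visual_rhythm(video, direction="horizontal"):
    # video: (T, H, W) grayscale frames.
    axis = 1 if direction == "horizontal" else 2
    # Average rows (or columns) of each frame -> one 1D signature per frame.
    return video.mean(axis=axis)            # shape (T, W) or (T, H)

def symmetric_extension(rhythm, length):
    # Mirror the time axis until the rhythm covers `length` rows, preserving frame rate.
    out = rhythm
    while out.shape[0] < length:
        out = np.concatenate([out, out[::-1]], axis=0)
    return out[:length]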

Paper Nr: 125
Title:

Time-unfolding Object Existence Detection in Low-quality Underwater Videos using Convolutional Neural Networks

Authors:

Helmut Tödtmann, Matthias Vahl, Uwe F. von Lukas and Torsten Ullrich

Abstract: Monitoring the environment for early recognition of changes is necessary for assessing the success of renaturation measures on a factual basis. It is also used in fisheries and livestock production for monitoring and quality assurance. The goal of the presented system is to count sea trout annually over the course of several months. Sea trout are detected with underwater camera systems triggered by motion sensors. Such a scenario generates many videos that would otherwise have to be evaluated manually. This article describes the techniques used to automate the image evaluation process. An effective method has been developed to classify videos and determine the times of occurrence of sea trout, while significantly reducing the annotation effort. A convolutional neural network has been trained via supervised learning. The underlying images are frame compositions automatically extracted from videos in which sea trout are to be detected. The accuracy of the resulting detection system reaches values of up to 97.7%.
Download

Paper Nr: 142
Title:

Detecting and Locating Boats using a PTZ Camera with Both Optical and Thermal Sensors

Authors:

Christoffer P. Simonsen, Frederik M. Thiesson, Øyvind Holtskog and Rikke Gade

Abstract: A harbor traffic monitoring system is necessary for most ports, yet current systems are often not able to detect and receive information from boats without transponders. In this paper, we propose a computer vision-based monitoring system utilizing the multi-modal properties of a PTZ (pan, tilt, zoom) camera with both an optical and a thermal sensor in order to detect boats under different lighting and weather conditions. In both domains, boats are detected using a YOLOv3 network pretrained on the COCO dataset and retrained, using transfer learning, on images of boats in the test environment. The boats are then positioned on the water using ray-casting. The system is able to detect boats with an average precision of 95.53% and 96.82% in the optical and thermal domains, respectively. Furthermore, it is also able to detect boats in low optical lighting conditions, without being trained with data from such conditions, with an average precision of 15.05% and 46.05% in the optical and thermal domains, respectively. The position estimator, based on a single camera, is able to determine the position of the boats with a mean error of 18.58 meters and a standard deviation of 17.97 meters.
Download
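
The ray-casting positioning step admits a compact generic formulation: back-project the detected boat's pixel through the calibrated camera and intersect the resulting ray with the water plane z = 0. The sketch below assumes a world-to-camera pose (R, t) and intrinsics K from calibration; it is not the authors' code.

import numpy as np

def raycast_to_water(pixel, K, R, t):
    # Intersect the viewing ray of an image pixel with the plane z = 0 (world frame).
    d = R.T @ np.linalg.solve(K, np.array([pixel[0], pixel[1], 1.0]))  # ray direction
    o = -R.T @ t                      # camera center in world coordinates
    s = -o[2] / d[2]                  # parameter where the ray meets z = 0
    return o + s * d                  # 3D point on the water surface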

Paper Nr: 149
Title:

Multi-stream Deep Networks for Vehicle Make and Model Recognition

Authors:

Mohamed E. Besbes, Yousri Kessentini and Hedi Tabia

Abstract: Vehicle recognition generally aims to classify vehicles based on make, model, and year of manufacture. It is a particularly hard problem due to the large number of classes and small inter-class variations. To handle this problem, recent state-of-the-art methods use Convolutional Neural Networks (CNNs). However, these methods have several limitations, since they extract unstructured vehicle features for the recognition task. In this paper, we propose a more structured feature extraction method by leveraging a robust multi-stream deep network architecture. We employ a novel dynamic combination technique to aggregate features of different vehicle parts with those of the entire image. This allows combining a global representation with local features. Our system, evaluated on publicly available datasets, is able to learn a highly discriminant representation and achieves state-of-the-art results.
Download

Paper Nr: 161
Title:

Texture-based 3D Face Recognition using Deep Neural Networks for Unconstrained Human-machine Interaction

Authors:

Michael Danner, Patrik Huber, Muhammad Awais, Zhen-Hua Feng, Josef Kittler and Matthias Raetsch

Abstract: 3D-assisted 2D face recognition involves the process of reconstructing 3D faces from 2D images and solving the problem of face recognition in 3D. To facilitate the use of deep neural networks, a 3D face, normally represented as a 3D mesh of vertices and its corresponding surface texture, is remapped to image-like square isomaps by a conformal mapping. Based on previous work, we assume that face recognition benefits more from texture. In this work, we focus on the surface texture and its discriminatory information content for recognition purposes. Our approach is to prepare a 3D mesh, the corresponding surface texture, and the original 2D image as triple input for the recognition network, to show that 3D data is useful for face recognition. Texture enhancement methods to control the texture fusion process are introduced, and we adapt data augmentation methods. Our results show that texture-map-based face recognition can not only compete with state-of-the-art systems under the same preconditions, but also outperform standard 2D methods from recent years.
Download

Paper Nr: 174
Title:

SOANets: Encoder-decoder based Skeleton Orientation Alignment Network for White Cane User Recognition from 2D Human Skeleton Sequence

Authors:

Naoki Nishida, Yasutomo Kawanishi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase and Jun Piao

Abstract: Recently, various facilities have been deployed to support visually impaired people. However, accidents caused by visual disabilities still occur. In this paper, to support visually impaired people in public areas, we aim to identify the presence of a white cane user from a surveillance camera by analyzing the temporal transition of a human skeleton, represented as 2D coordinates, in a pedestrian image sequence. Our previously proposed method aligns the skeletons to various orientations and identifies a white cane user from the corresponding sequences, relying on multiple classifiers, one for each orientation. That method employs an exemplar-based approach to perform the alignment; it heavily depends on the number of exemplars and consumes excessive memory. In this paper, we propose a method to align 2D skeleton representation sequences to various orientations using the proposed Skeleton Orientation Alignment Networks (SOANets), based on an encoder-decoder model. Using SOANets, we can obtain 2D skeleton representation sequences aligned to various orientations, extract richer skeleton features, and recognize white cane users accurately. Results of an evaluation experiment show that the proposed method improves the recognition rate by 16% compared to the previous exemplar-based method.
Download

Paper Nr: 177
Title:

Visual Descriptor Learning from Monocular Video

Authors:

Umashankar Deekshith, Nishit Gajjar, Max Schwarz and Sven Behnke

Abstract: Correspondence estimation is one of the most widely researched and yet only partially solved areas of computer vision, with many applications in tracking, mapping, and recognition of objects and environments. In this paper, we propose a novel way to estimate dense correspondence on an RGB image, where visual descriptors are learned from video examples by training a fully convolutional network. Most deep learning methods solve this by training the network with a large set of expensive labeled data or perform labeling through strong 3D generative models using RGB-D videos. Our method learns from RGB videos using a contrastive loss, where relative labeling is estimated from optical flow. We demonstrate the functionality in a quantitative analysis on rendered videos, where ground truth information is available. Not only does the method perform well on test data with the same background, it also generalizes to situations with a new background. The descriptors learned are unique, and the representations determined by the network are global. We further show the applicability of the method to real-world videos.
Download
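
The training signal described above can be sketched as a pixelwise contrastive loss: descriptors of pixels matched by optical flow are pulled together, while sampled non-matching pixels are pushed apart beyond a margin. Names, sampling, and the margin below are illustrative assumptions, not the authors' code.

import torch

def pixel_contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    # desc_a, desc_b: (D, H*W) descriptor maps of two frames (flattened spatially).
    # matches / non_matches: (2, N) index pairs (pixel in frame A, pixel in frame B).
    d_match = (desc_a[:, matches[0]] - desc_b[:, matches[1]]).norm(dim=0)
    d_non = (desc_a[:, non_matches[0]] - desc_b[:, non_matches[1]]).norm(dim=0)
    # Pull flow-matched pixels together, push others at least `margin` apart.
    return (d_match ** 2).mean() + (torch.relu(margin - d_non) ** 2).mean()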

Paper Nr: 217
Title:

Distributed Information Integration in Convolutional Neural Networks

Authors:

Dinesh Kumar and Dharmendra Sharma

Abstract: A large body of physiological findings suggests that the vision system understands a scene in terms of its local features, such as lines and curves. A notable computer algorithm that models such behaviour is the Convolutional Neural Network (CNN). Whilst recognising an object at various scales remains trivial for the human vision system, CNNs struggle to achieve the same behaviour. Recent physiological findings suggest two new paradigms. Firstly, the visual system uses both local and global features in its recognition function. Secondly, the brain uses a distributed processing architecture to learn information from multiple modalities. In this paper, we combine these paradigms and propose a distributed information integration model called D-Net to improve scale-invariant classification of images. We use a CNN to extract local features and, inspired by Google's INCEPTION model, develop a trainable method using filter pyramids, called Filter Pyramid Convolutions (FPC), to extract global features. D-Net locally processes CNN and FPC features, fuses the outcomes, and obtains a global estimate via the central processor. We test D-Net on the classification of scaled images on benchmark datasets. Our results show D-Net's potential effectiveness for the classification of scaled images.
Download

Paper Nr: 239
Title:

Classification of Histopathological Images using Scale-Invariant Feature Transform

Authors:

Andrzej Bukała, Bogusław Cyganek, Michał Koziarski, Bogdan Kwolek, Bogusław Olborski, Zbigniew Antosz, Jakub Swadźba and Piotr Sitkowski

Abstract: Over the years, the Scale-Invariant Feature Transform (SIFT) was a widely adopted method in image matching and classification tasks. However, due to recent advances in convolutional neural networks, the popularity of SIFT and other similar feature descriptors decreased significantly, leaving SIFT under-researched in some of the emerging applications. In this paper, we examine the suitability of SIFT feature descriptors in one such task, histopathological image classification. In the conducted experimental study, we investigate the usefulness of various variants of SIFT on the BreakHis Breast Cancer Histopathological Database. While colour is known to be significant in the case of human-performed analysis of histopathological images, SIFT variants using different colour spaces have not been thoroughly examined on this type of data before. The observed results indicate the effectiveness of selected SIFT variants, particularly Hue-SIFT, which outperformed the reference convolutional neural network ensemble on some of the considered magnifications, while simultaneously achieving lower variance. This demonstrates the importance of using different colour spaces in classification tasks with histopathological data and shows promise for diversifying classifier ensembles.
Download

Paper Nr: 263
Title:

DA-NET: Monocular Depth Estimation using Disparity Maps Awareness NETwork

Authors:

Antoine Billy and Pascal Desbarats

Abstract: Estimating depth from 2D images has become an active field of study in autonomous driving, scene reconstruction, 3D object recognition, segmentation, and detection. The best performing methods are based on Convolutional Neural Networks and, as the process of building an appropriate dataset requires a tremendous amount of work, almost all of them rely on the same benchmark to compete with each other: the KITTI benchmark. However, most of them use the ground truth generated by the LiDAR sensor, which produces very sparse depth maps with sometimes less than 5% of the image density, ignoring the second image that is provided for stereo estimation. Recent approaches have shown that using both input images, given in most depth estimation datasets, significantly improves the generated results. In line with this idea, we developed a very simple yet efficient model based on the U-NET architecture that uses both stereo images in the training process. We demonstrate the effectiveness of our approach and show high-quality results comparable to state-of-the-art methods on the KITTI benchmark.
Download

Paper Nr: 265
Title:

Mask-guided Image Classification with Siamese Networks

Authors:

Hiba Alqasir, Damien Muselet and Christophe Ducottet

Abstract: This paper deals with a CNN-based image classification task where the class of each image depends on a small detail in the image. Our original idea consists in providing a binary mask to the network so that it knows where the important information is located. This mask, as well as the color image, is provided as input to a siamese network. A contrastive loss function controls the projection of the network outputs into an embedding space, enforcing the extraction of image features at the location proposed by the mask. This solution is tested on a real application whose aim is to secure boarding on ski chairlifts by checking whether the safety bar of the carrier is open or closed. Each chairlift has its own safety bar masks (open and closed), and we propose to exploit this additional data to help the image classification between closed and open safety bars. We show that the use of a siamese network allows learning a single model that performs very well on 20 different chairlifts.
Download

Paper Nr: 289
Title:

Multimodal Dance Recognition

Authors:

Monika Wysoczańska and Tomasz Trzciński

Abstract: Video content analysis is still an emerging technology, and the majority of work in this area extends from the still image domain. Dance videos are especially difficult to analyse and recognise, as the performed human actions are highly dynamic. In this work, we introduce a multimodal approach to dance video recognition. Our proposed method combines visual and audio information, by fusing their representations, to improve classification accuracy. For the visual part, we focus on motion representation, as it is the key factor in distinguishing dance styles. For the audio representation, we put the emphasis on capturing long-term dependencies, such as tempo, which is a crucial dance discriminator. Finally, we fuse the two distinct modalities using a late fusion approach. We compare our model with the corresponding unimodal approaches through an exhaustive evaluation on the Let's Dance dataset. Our method yields significantly better results than each single-modality approach. The results presented in this work not only demonstrate the strength of integrating complementary sources of information in the recognition task, but also indicate the potential of applying multimodal approaches within specific research areas.
Download

Paper Nr: 296
Title:

Towards the Automatic Visual Monitoring of Electricity Pylons from Aerial Images

Authors:

Anicetus Odo, Stephen McKenna, David Flynn and Jan Vorstius

Abstract: Visual inspection of electricity transmission and distribution networks relies on flying a helicopter around energized high-voltage towers for image collection. The sensed data is taken offline and screened by skilled personnel for faults. This poses a high risk to the pilot and crew and is highly expensive and inefficient. This paper reviews work targeted at detecting components of electricity transmission and distribution lines, with attention to unmanned aerial vehicle (UAV) platforms. The potential of deep learning as the backbone of image data analysis is explored. For this, we used a new dataset of high-resolution aerial images of medium-to-low-voltage electricity towers. We demonstrate that reliable classification of towers is feasible using deep learning methods, with very good results.
Download

Area 4 - Applications and Services

Full Papers
Paper Nr: 26
Title:

Melanoma Detection System based on a Game Theory Model

Authors:

Djamila Dahmani, Slimane Larabi, Sihame Djelouah, Nafissa Benhebbadj and Mehdi Cheref

Abstract: We propose in this paper a new method for melanoma detection (the most dangerous form of skin cancer) based on the ABCD medical procedure. The ABCD features play a crucial role in the accuracy of diagnosis rates. However, the search for such distinctive data remains difficult because of the small variability in appearance between benign and cancerous skin lesions. To cope with this problem, each feature is calculated using different formulas. If all the formulas agree on the lesion classification, the lesion is classified according to this full agreement. Otherwise, for doubtful pigmented skin lesions, a game theory model is applied for the final decision. The game model proposed in our work treats the conflict as being between two agents (melanoma and non-melanoma). The different formulas applied in computing the features A, B, C, and D are the pure strategies. The sign of the game value in the mixed extension of the game allows the skin lesion to be classified correctly. The method was tested on two publicly available databases, PH2 and ISIC; the obtained results are promising.
Download

Paper Nr: 32
Title:

Towards Detecting Simultaneous Fear Emotion and Deception Behavior in Speech

Authors:

Safa Chebbi and Sofia Ben Jebara

Abstract: In this paper, we propose an approach to detect simultaneous fear emotion and deception behavior from speech analysis. The proposed methodology is the following. First, two separate classifiers to recognize fear and deception are built from adequate voice features using the K-Nearest Neighbors algorithm. Then, a decision-level fusion based on belief theory is applied to infer whether the studied emotion and behavior are detected simultaneously, as well as their degree of presence. The proposed approach is validated on fear/non-fear emotional and deception/non-deception databases separately. The separate classifiers reach accuracy rates of around 95% with 24 features for fear recognition and 75% with 8 features for deception detection.
Download

Paper Nr: 87
Title:

Improving Age Estimation in Minors and Young Adults with Occluded Faces to Fight Against Child Sexual Exploitation

Authors:

Deisy Chaves, Eduardo Fidalgo, Enrique Alegre, Francisco Jáñez-Martino and Rubel Biswas

Abstract: Accurate and fast age estimation is crucial in systems for detecting possible victims in Child Sexual Exploitation Materials. Age estimation obtains state-of-the-art results with deep learning. However, these models tend to perform poorly on minors and young adults because they are trained with unbalanced data and few examples. Furthermore, some Child Sexual Exploitation images present eye occlusion to hide the identity of the victims, which may also affect the performance of age estimators. In this work, we evaluate the performance of the Soft Stagewise Regression Network (SSR-Net), a compact age estimation model, with non-occluded and occluded face images. We propose an approach to improve age estimation in minors and young adults by using both types of facial images to create SSR-Net models. The proposed strategy builds robust age estimators that improve on SSR-Net models pre-trained on the IMDB and MORPH datasets, and on a Deep EXpectation model, reducing the Mean Absolute Error (MAE) from 7.26, 6.81, and 6.5, respectively, to 4.07 with our proposal.
Download

Paper Nr: 140
Title:

Multi-view Data Capture using Edge-synchronised Mobiles

Authors:

Matteo Bortolon, Paul Chippendale, Stefano Messelodi and Fabio Poiesi

Abstract: Multi-view data capture permits free-viewpoint video (FVV) content creation. To this end, several users must capture video streams, calibrated in both time and pose, framing the same object/scene, from different viewpoints. New-generation network architectures (e.g. 5G) promise lower latency and larger bandwidth connections supported by powerful edge computing, properties that seem ideal for reliable FVV capture. We have explored this possibility, aiming to remove the need for bespoke synchronisation hardware when capturing a scene from multiple viewpoints, making it possible through off-the-shelf mobiles. We propose a novel and scalable data capture architecture that exploits edge resources to synchronise and harvest frame captures. We have designed an edge computing unit that supervises the relaying of timing triggers to and from multiple mobiles, in addition to synchronising frame harvesting. We empirically show the benefits of our edge computing unit by analysing latencies and show the quality of 3D reconstruction outputs against an alternative and popular centralised solution based on Unity3D.
Download

Short Papers
Paper Nr: 9
Title:

Human Climbing and Bouldering Motion Analysis: A Survey on Sensors, Motion Capture, Analysis Algorithms, Recent Advances and Applications

Authors:

Julia Richter, Raul B. Beltrán, Guido Köstermeyer and Ulrich Heinkel

Abstract: Bouldering and climbing motion analysis are attracting increasing interest in scientific research. Although there are a number of studies dealing with climbing motion analysis, there is no comprehensive survey that exhaustively covers sensor technologies, approaches for motion capture, and algorithms for the analysis of climbing motions. To promote further advances in this field of research, there is an urgent need to unite the available information from different perspectives, such as the sensory, analytical, and application-specific points of view. Therefore, this survey conveys a general understanding of the available technologies, algorithms, and open questions in the field of climbing motion analysis. The survey is aimed not only at researchers with a technical background, but also addresses sports scientists and emphasises the use and advantages of vision-based approaches for climbing motion analysis.
Download

Paper Nr: 28
Title:

Stereoscopic Text-based CAPTCHA on Head-Mounted Displays

Authors:

Tadaaki Hosaka and Shinnosuke Furuya

Abstract: Text-based CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are widely used to prevent unauthorized access by bots. However, advances in image segmentation and character recognition techniques can be exploited for bot access; therefore, distorted characters that are difficult even for humans to recognize are often used. Thus, a new text-based CAPTCHA technology with anti-segmentation properties is required. In this study, we propose a CAPTCHA that uses stereoscopy based on binocular disparity. By generating the character area and its background with identical color patterns, it becomes impossible to extract the character regions when the left and right images are analyzed separately, which is a major advantage of our method. However, character regions can still be extracted by disparity estimation or by subtraction of the two images; to prevent such attacks, we intentionally add noise to the images. The parameters characterizing the amount of added noise are adjusted based on experiments with subjects wearing a head-mounted display to realize stereo vision. With optimal parameters, the recognition rate reaches 0.84; moreover, sufficient robustness against bot attacks is achieved.
Download

Paper Nr: 189
Title:

MAUI: Tele-assistance for Maintenance of Cyber-physical Systems

Authors:

Philipp Fleck, Fernando Reyes-Aviles, Christian Pirchheim, Clemens Arth and Dieter Schmalstieg

Abstract: In this paper, we present the maintenance assistance user interface (MAUI), a novel approach for providing tele-assistance to a worker charged with maintenance of a cyber-physical system. Such a system comprises both physical and digital interfaces, making it challenging for a worker to understand the required steps and to assess work progress. A remote expert can access the digital interfaces and provide the worker with timely information and advice in an augmented reality display. The remote expert has full control over the user interface of the worker in a manner comparable to remote desktop systems. The worker needs to perform all physical operations and retrieve physical information, such as reading physical labels or meters. Thus, worker and remote expert collaborate not only via shared audio, video or pointing, but also share control of the digital interface presented in the augmented reality space. We report results on two studies: The first study evaluates the benefits of our system against a condition with the same cyber-physical interface, but without tele-assistance. Results indicate significant benefits concerning speed, cognitive load and subjective comfort of the worker. The second study explores how interface designers use our system, leading to initial design guidelines for tele-presence interfaces like ours.
Download

Paper Nr: 192
Title:

Player Tracking using Multi-viewpoint Images in Basketball Analysis

Authors:

Shuji Tanikawa and Norio Tagawa

Abstract: In this study, we aim to realize automatic tracking of basketball players by resolving occlusions between players, an important issue in basketball video analysis, using multi-viewpoint images. Images taken with hand-held cameras are used to expand the scope of application to uses such as school club activities. By integrating the player tracking results from each camera image into a 2D map viewed from above the court, using a projective transformation, occlusions occurring in one camera are stably resolved using information from the other cameras. In addition, using OpenPose for player detection reduces the occlusion that occurs in each camera image before all camera images are integrated. We confirm the effectiveness of our method through experiments with real image sequences.
Download
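
The projective transformation onto a top-view court map can be illustrated with a homography estimated from known court landmarks; detected player foot points are then mapped into 2D court coordinates. The landmark coordinates below are placeholders, not values from the paper.

import cv2
import numpy as np

# Pixel coordinates of four court landmarks and their positions on the 2D map (metres).
image_pts = np.float32([[120, 310], [510, 300], [590, 460], [60, 470]])
court_pts = np.float32([[0, 0], [15, 0], [15, 14], [0, 14]])
H, _ = cv2.findHomography(image_pts, court_pts)

def to_court(foot_pixel):
    # Map a player's foot point from image pixels to court coordinates.
    p = cv2.perspectiveTransform(np.float32([[foot_pixel]]), H)
    return p[0, 0]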

Paper Nr: 209
Title:

Steganalysis of Semi-fragile Watermarking Systems Resistant to JPEG Compression

Authors:

Anna Egorova and Victor Fedoseev

Abstract: Recently, dozens of semi-fragile digital watermarking systems have been designed to protect JPEG images from unauthorized changes. Their principle is to embed an invisible protective watermark into the image. Such a watermark is destroyed by any image editing operation, except for JPEG compression with a quality level in a given range of values. Watermarking systems of this type have been assessed in terms of watermark extraction accuracy and the visual quality of the protected image. However, their steganographic security (i.e., robustness against detection of protective information traces by a third party) has not been sufficiently studied. Meanwhile, if an attacker detects the presence of a watermark in the image, he can obtain valuable information about the image protection technique used. This can let him develop a data modification method that alters the content of the protected image without destroying the embedded watermark. In this paper, we propose a specific attack to analyze the steganographic security of known semi-fragile watermarking algorithms for JPEG images. We also investigate the efficiency of the proposed attack. In addition, we propose an approach to counter the attack that can be applied in existing watermarking systems to strengthen their steganographic security.
Download

Paper Nr: 233
Title:

Surgery Recording without Occlusions by Multi-view Surgical Videos

Authors:

Tomohiro Shimizu, Kei Oishi, Ryo Hachiuma, Hiroki Kajita, Yoshihumi Takatsume and Hideo Saito

Abstract: Recording surgery is important for sharing operating techniques. In most surgical rooms, fixed surgical cameras are already installed, but it is almost impossible for them to capture the surgical field because of occlusion by the surgeon's head and body. In order to capture the surgical field, we propose installing multiple cameras in a surgical lamp system, so that at least one camera can capture the surgical field even when the surgeon's head and body occlude the other cameras. In this paper, we present a method for automatic viewpoint switching across multi-view surgical videos, so that the surgical field is always recorded. We employ learning-based object detection to automatically evaluate the visibility of the surgical field in the multi-view videos. In general, frequent camera switching degrades the video quality of view (QoV). Therefore, we apply Dijkstra's algorithm, widely used for the shortest path problem, as an optimization method. Our camera scheduling method ensures that camera switching is not performed within a specified minimum number of frames, so that the surgical field observed over the entire video is maximized.
Download
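
The scheduling problem above can be viewed as a shortest path over (frame, camera) nodes, where each node costs the negative visibility score of the surgical field and changing cameras adds a penalty. The sketch below solves the equivalent dynamic program over this trellis; the visibility scores and switch penalty are placeholders, and the paper's minimum-dwell constraint is omitted for brevity.

import numpy as np

def schedule_cameras(scores, switch_penalty=5.0):
    # scores: (T, C) visibility score of the surgical field per frame and camera.
    # Returns the camera index chosen for each frame.
    T, C = scores.shape
    cost = -scores[0].astype(float)
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        # Staying on a camera is free; switching costs `switch_penalty`.
        trans = cost[:, None] + switch_penalty * (1 - np.eye(C))
        back[t] = trans.argmin(axis=0)
        cost = trans.min(axis=0) - scores[t]
    # Trace the optimal path backwards.
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]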

Paper Nr: 300
Title:

Driving Video Prediction based on Motion Estimation of 3D Objects using a Stereo Camera System

Authors:

Takuya Umemura, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a method to synthesize future images in a driving scene using a stereo camera system fitted on vehicles. In this method, three-dimensional (3D) objects in a driving scenario, such as vehicles, buildings, and humans, are reconstructed by a stereo camera system. The reconstructed objects are separated by semantic image segmentation based on 2D image information. Furthermore, motion prediction using a Kalman filter is applied to each object. 3D objects in future scenes are rendered using this motion prediction. However, some regions, which are occluded in the input images, cannot be predicted. Therefore, an image inpainting technique is used for the occluded regions in the input image. Experimental results show that our proposed method can synthesize natural predicted images.
Download
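
The per-object motion prediction above can be illustrated with a textbook constant-velocity Kalman filter: future object states follow from repeated predict steps. The state layout and noise values below are our assumptions, not the authors' parametrization.

import numpy as np

dt = 0.1                                  # time step between frames (s)
F = np.eye(6)                             # state: [x, y, z, vx, vy, vz]
F[:3, 3:] = dt * np.eye(3)                # constant-velocity transition
Q = 1e-2 * np.eye(6)                      # process noise

def predict(x, P, steps=10):
    # Propagate state mean x and covariance P `steps` frames into the future.
    for _ in range(steps):
        x = F @ x
        P = F @ P @ F.T + Q
    return x, P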

Paper Nr: 4
Title:

CHAKEL-DB: Online Database for Handwriting Diacritic Arabic Character

Authors:

Houda Nakkach, Sofiene Haboubi and Hamid Amiri

Abstract: This paper presents an online database for handwritten Arabic characters, called "CHAKEL-DB", where "chakel" means diacritic in the Arabic language. The database contains 3,150 collected samples of alpha-numeral characters with diacritical marks. The data were collected from more than 68 writers of different origins, genders, and ages, and are available at the character and stroke levels. We built an elementary recognition system to test our database and to handle the large vocabulary and wide variation of styles in the collected data. "CHAKEL-DB" is made available to advance the handwriting recognition research field and to facilitate experiments and research. The database offers files in XML format.
Download

Paper Nr: 15
Title:

Detection, Counting and Maturity Assessment of Cherry Tomatoes using Multi-spectral Images and Machine Learning Techniques

Authors:

I-Tzu Chen and Huei-Yung Lin

Abstract: This paper presents an image-based approach for the yield estimation of cherry tomatoes. The objective is to assist farmers to quickly evaluate the amount of mature tomatoes which are ready to harvest. The proposed technique consists of machine learning based methods for detection, counting, and maturity assessment using multi-spectral images. A convolutional neural network is used for tomato detection from RGB images, followed by the maturity assessment using spectral image analysis with SVM classification. The multi-object tracking algorithm is incorporated to obtain a unique ID for each tomato to avoid double counting during the camera motion. Experiments carried out on the real scene images acquired in an orchard have demonstrated the effectiveness of the proposed method.
Download

Paper Nr: 73
Title:

Vehicle Detection and Classification in Aerial Images using Convolutional Neural Networks

Authors:

Chih-Yi Li and Huei-Yung Lin

Abstract: Due to the popularity of unmanned aerial vehicles, the acquisition of aerial images has become widely available. The aerial images have been used in many applications such as the investigation of roads, buildings, agriculture distribution, and land utilization, etc. In this paper, we propose a technique for vehicle detection and classification from aerial images based on the modification of Faster R-CNN framework. A new dataset for vehicle detection, VAID (Vehicle Aerial Imaging from Drone), is also introduced for public use. The images in the dataset are annotated with 7 common vehicle categories, including sedan, minibus, truck, pickup truck, bus, cement truck and trailer, for network training and testing. We compare the results of vehicle detection in aerial images with widely used network architectures and training datasets. The experiments demonstrate that the proposed method and dataset can achieve high vehicle detection and classification rates under various road and traffic conditions.
Download

Paper Nr: 76
Title:

PlanAR: Accurate and Stable 3D Positioning System via Interactive Plane Reconstruction for Handheld Augmented Reality

Authors:

Ami Miyake, Hideaki Uchiyama, Atsushi Shimada and Rin-ichiro Taniguchi

Abstract: This paper presents a ray-casting-based three-dimensional (3D) positioning system that interactively reconstructs scene structures for handheld augmented reality. The proposed system employs visual simultaneous localization and mapping (vSLAM) technology to acquire camera poses of a smartphone and sparse 3D feature points in an unknown scene. First, users specify a geometric shape region, such as a plane, in captured images while capturing a scene. This is performed by manually selecting some of the feature points generated by vSLAM in the region. Next, the system computes the shape parameter with the selected feature points so that the scene structure is reconstructed densely. Subsequently, users select the pixel of a target point in the scene at one camera view for 3D positioning. Finally, the system computes the intersection between the 3D ray computed with the selected pixel and the reconstructed scene structure to determine the 3D coordinates of the target point. Owing to the proposed interactive reconstruction, the scene structure can be estimated accurately and stably; therefore, 3D positioning will be accurate. Because the geometric shape used for the scene structure is a plane in this study, our system is referred to as PlanAR. In the evaluation, the performance of our system is compared statistically with an existing 3D positioning system to demonstrate the accuracy and stability of our system.
Download
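
The two geometric steps of the system above can be sketched directly: fit a plane to the user-selected vSLAM feature points, then intersect the target pixel's viewing ray with that plane. The least-squares plane fit below is our choice of estimator for illustration, not necessarily the paper's.

import numpy as np

def fit_plane(points):
    # Least-squares plane through 3D points; returns unit normal n and offset d (n.x = d).
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]                        # direction of smallest variance
    return n, float(n @ centroid)

def intersect_ray_plane(origin, direction, n, d):
    # 3D point where the ray origin + s*direction meets the plane n.x = d.
    s = (d - n @ origin) / (n @ direction)
    return origin + s * direction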

Paper Nr: 128
Title:

Acquisition Evaluation on Outdoor Scanning for Archaeological Artifact Digitalization

Authors:

Aufaclav K. Frisky, Adieyatna Fajri, Simon Brenner and Robert Sablatnig

Abstract: Archaeological archives are important assets because they provide primary information resources for research, particularly digital archives, which do not degrade. Instead of directly visiting a site, an archaeologist can examine and manipulate the data without harming the real object. However, choosing an efficient scanning scheme that yields detailed results is a challenging task. In this work, we present new sculpture models obtained in three different ways and assess them using two comparison approaches: a quantitative and a qualitative assessment. The quantitative comparison architecture provides a detailed assessment of the three different scanning mechanisms in two stages: point cloud and mesh comparison. This evaluation is intended to describe the differences between the unmodified data. Finally, a qualitative evaluation is performed by experts and practitioners, who explain the differences between the four produced models with respect to their needs in real applications.
Download

Paper Nr: 228
Title:

Pectoral Muscle Segmentation in Tomosynthesis Images using Geometry Information and Grey Wolf Optimizer

Authors:

Mohamed Abdel-Nasser, Francesc P. Solsona and Domenec Puig

Abstract: Digital breast tomosynthesis (DBT) is quickly replacing full-field digital mammography because it allows a more efficient breast cancer diagnostic workflow and yields a more confident interpretation. The visual characteristics of the pectoral muscle on mediolateral oblique (MLO) views may increase the false positive rate in computer-aided diagnosis systems. Therefore, the pectoral muscle should be extracted from MLO images before further analysis. Notably, most pectoral muscle segmentation methods have fixed parameter settings that may yield good results with some images and fail with others due to variations in breast density. In this paper, we propose a promising method to segment pectoral muscles in tomosynthesis images based on the geometric information of the pectoral muscle and a meta-heuristic optimization algorithm. Concretely, our method involves four steps: 1) a preprocessing step, 2) obtaining geometric information of the pectoral muscle, 3) selection of pectoral muscle pixels, and 4) finding the optimal parameters using the grey wolf optimizer (GWO). The GWO determines different parameter values for each input image, because they depend on the visual characteristics of the tomosynthesis images, which are highly related to breast density. The proposed method is evaluated on a set of tomosynthesis images, obtaining a Dice score of 0.823 and an IoU score of 0.726.
Download
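
For reference, the grey wolf optimizer moves each candidate solution towards the three current best wolves (alpha, beta, delta) using the standard GWO update equations. A compact sketch follows; the segmentation-specific fitness function used in the paper is not shown and must be supplied by the caller.

import numpy as np

def gwo(fitness, dim, n_wolves=20, iters=100, lo=0.0, hi=1.0):
    wolves = np.random.uniform(lo, hi, (n_wolves, dim))
    for it in range(iters):
        order = np.argsort([fitness(w) for w in wolves])
        alpha, beta, delta = wolves[order[:3]]
        a = 2 - 2 * it / iters                      # linearly decreasing coefficient
        for i in range(n_wolves):
            x = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = np.random.rand(dim), np.random.rand(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                x += leader - A * np.abs(C * leader - wolves[i])
            wolves[i] = np.clip(x / 3, lo, hi)      # average the three pulls
    return wolves[np.argsort([fitness(w) for w in wolves])[0]]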

Paper Nr: 256
Title:

Automatic Classification of Cervical Cell Patches based on Non-geometric Characteristics

Authors:

Douglas A. Isidoro, Cláudia M. Carneiro, Mariana T. Resende, Fátima S. Medeiros, Daniela M. Ushizima and Andrea C. Bianchi

Abstract: This work presents a proposal for the efficient classification of cervical cells based on non-geometric characteristics extracted from nuclear regions of interest. This approach is based on the hypothesis that the nuclei store much of the information about lesions, and that their areas remain visible even with a high level of cellular overlap, which is common in Pap smear images. Two-class and three-class classification systems were applied to a set of real cervical images using a supervised learning method. The results demonstrate high classification performance and high efficiency, both computational and biological, for applicability in realistic environments.
Download

Paper Nr: 291
Title:

Fully Connected Visual Words for the Classification of Skin Cancer Confocal Images

Authors:

Athanasios Kallipolitis, Alexandros Stratigos, Alexios Zarras and Ilias Maglogiannis

Abstract: Reflectance Confocal Microscopy (RCM) is an ancillary, non-invasive method for reviewing horizontal sections from areas of interest of the skin at high resolution. In this paper, we propose a method based on the Bag of Visual Words (BOVW) technique, coupled with a plain neural network, to classify the extracted information into discrete patterns of skin cancer types. The paper discusses the technical details of the implementation, while providing promising initial results that reach 90% accuracy. Automated classification of RCM images can lead to the establishment of a reliable procedure for the assessment of skin cancer cases and for the training of medical personnel through the quantization of image content. Moreover, early detection of benign tumours can significantly reduce the number of time- and resource-consuming biopsies.
Download
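
The BOVW representation underlying the method above can be sketched in a few lines: cluster local descriptors from a training set into a visual vocabulary, then describe each image by its normalized histogram of visual-word assignments. The vocabulary size is illustrative, not the paper's setting.

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=200):
    # Cluster local descriptors (e.g. SIFT) from the training set into visual words.
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def bovw_histogram(descriptors, vocabulary):
    # Histogram of visual-word occurrences for one image, L1-normalized.
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()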

Area 5 - Motion, Tracking and Stereo Vision

Full Papers
Paper Nr: 36
Title:

DynaLoc: Real-Time Camera Relocalization from a Single RGB Image in Dynamic Scenes based on an Adaptive Regression Forest

Authors:

Nam-Duong Duong, Amine Kacete, Catherine Soladie, Pierre-Yves Richard and Jérôme Royan

Abstract: Camera relocalization is an important component of localization systems, such as augmented reality or robotics, when camera tracking loss occurs. It uses models built from known information about a scene. However, these models cannot perform in a dynamic environment containing moving objects. In this paper, we propose an adaptive regression forest and apply it in DynaLoc, our real-time camera relocalization approach from a single RGB image in dynamic environments. The adaptive regression forest is able to fine-tune and continuously update itself from evolving data in real time. This is performed by updating the relevant subset of leaves that give uncertain predictions. Results of camera relocalization in dynamic scenes show that our method is able to handle a large number of moving objects, or a whole scene that gradually changes, achieving high accuracy while avoiding the accumulation of error. Moreover, our method achieves results as accurate as the best state-of-the-art methods on static-scene datasets.
Download

Paper Nr: 67
Title:

Recovering 3D Structure of Nonuniform Refractive Space

Authors:

Takahiro Higuchi, Fumihiko Sakaue and Jun Sato

Abstract: We present a novel method for recovering the whole 3D structure of a nonuniform refractive space. The refractive space may consist of a single nonuniform refractive medium, such as heated air, or multiple refractive media with uniform or nonuniform refractive indices. Unlike most existing methods for recovering transparent objects, our method does not limit the number of light refractions. Furthermore, our method can recover both gradual and abrupt changes of the refractive index in the space. For recovering the whole 3D structure of a nonuniform refractive space, we combine the ray equation from geometric optics with a sparse estimation of the 3D distribution. Testing showed that the proposed method can efficiently estimate the time-varying 3D distribution of the refractive index of heated air.
Download
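
For reference, the ray equation from geometric optics that the method combines with sparse estimation relates the light path r(s), parameterized by arc length s, to the spatially varying refractive index n(r):

\frac{d}{ds}\left( n(\mathbf{r}) \, \frac{d\mathbf{r}}{ds} \right) = \nabla n(\mathbf{r})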

Paper Nr: 92
Title:

Learning Geometrically Consistent Mesh Corrections

Authors:

Ștefan Săftescu and Paul Newman

Abstract: Building good 3D maps is a challenging and expensive task, which requires high-quality sensors and careful, time-consuming scanning. We seek to reduce the cost of building good reconstructions by correcting views of existing low-quality ones in a post-hoc fashion using learnt priors over surfaces and appearance. We train a convolutional neural network model to predict the difference in inverse-depth from varying viewpoints of two meshes – one of low quality that we wish to correct, and one of high-quality that we use as a reference. In contrast to previous work, we pay attention to the problem of excessive smoothing in corrected meshes. We address this with a suitable network architecture, and introduce a loss-weighting mechanism that emphasises edges in the prediction. Furthermore, smooth predictions result in geometrical inconsistencies. To deal with this issue, we present a loss function which penalises re-projection differences that are not due to occlusions. Our model reduces gross errors by 45.3%–77.5%, up to five times more than previous work.
Download

Paper Nr: 112
Title:

High-speed Imperceptible Structured Light Depth Mapping

Authors:

Avery Cole, Sheikh Ziauddin and Michael Greenspan

Abstract: A novel method is proposed to imperceptibly embed structured light patterns in projected content to extract a real time range image stream suitable for dynamic projection mapping applications. The method is based on a novel pattern injection approach that exploits the dithering sequence of modern Digital Micromirror Device projectors, so that patterns are injected at a frequency and intensity below the thresholds of human perception. A commercially available DLP projector is synchronized with camera capture at rates that allow a stream of grey code patterns to be imperceptibly projected and acquired to realize dense, imperceptible, real time, temporally encoded structured light. The method is deployed on a calibrated stereo procam system that has been rectified to facilitate fast correspondences from the extracted patterns, enabling depth triangulation. The bandwidth achieved imperceptibly is nearly 8 million points per second using a general purpose CPU which is comparable to, and exceeds some, hardware accelerated commercial structured light depth cameras.
Download

Paper Nr: 114
Title:

Filter Learning from Deep Descriptors of a Fully Convolutional Siamese Network for Tracking in Videos

Authors:

Hugo L. Chaves, Kevyn S. Ribeiro, André S. Brito, Hemerson Tacon, Marcelo B. Vieira, Augusto S. Cerqueira, Saulo M. Villela, Helena A. Maia, Darwin T. Concha and Helio Pedrini

Abstract: Siamese Neural Networks (SNNs) have attracted the attention of the Visual Object Tracking community due to their relatively low computational cost and high efficacy in comparing the similarity between a reference and a candidate object to track its trajectory in a video over time. However, a video tracker that relies purely on an SNN might suffer from drifting due to changes in the target object. We propose a framework that takes the changes of the target object into account through multiple time-based descriptors. To show its validity, we define long-term and short-term descriptors based on the first and the recent appearance of the object, respectively. These memories are combined into a final descriptor that serves as the actual tracking reference. To compute the short-term memory descriptor, we estimate a filter bank using a genetic algorithm strategy. The final method has a low computational cost, since it is applied through convolution operations during tracking. According to experiments performed on the widely used OTB50 dataset, our proposal improves the performance of an SNN dedicated to visual object tracking and is comparable to state-of-the-art methods.
Download

Paper Nr: 145
Title:

Domain Adaptation for Person Re-identification on New Unlabeled Data

Authors:

Tiago G. Pereira and Teofilo E. de Campos

Abstract: In a world where big data reigns and there is plenty of hardware prepared to gather huge amounts of unstructured data, data acquisition is no longer a problem. Surveillance cameras are ubiquitous, and they capture huge numbers of people walking across different scenes. However, extracting value from this data is challenging, especially for tasks that involve human images, such as face recognition and person re-identification. Annotation of this kind of data is a challenging and expensive task. In this work, we propose a domain adaptation workflow that allows CNNs trained on one domain to be applied to another domain without the need for new annotation of the target data. Our results show that domain adaptation techniques indeed improve the performance of the CNN when applied in the target domain.
Download

Paper Nr: 221
Title:

Pedestrian Tracking with Occlusion State Estimation

Authors:

Akihiro Enomura, Toru Abe and Takuo Suganuma

Abstract: Visual tracking of multiple pedestrians in video sequences is an important procedure for many computer vision applications. The tracking-by-detection approach is widely used for visual pedestrian tracking. This approach extracts pedestrian regions from each video frame and associates the extracted regions across frames as the same pedestrian according to the similarities of region features (e.g., position, appearance, and movement). When a pedestrian is temporarily occluded by a still obstacle in the scene, he/she disappears at one side of the obstacle in a certain frame and then reappears at the other side of it a few frames later. The occlusion state of the pedestrian, that is the space-time interval where the pedestrian is missing, varies with obstacle areas and pedestrian movements. Such an unknown occlusion state complicates the region association process for the same pedestrian and makes the pedestrian tracking difficult. To solve this difficulty and improve pedestrian tracking robustness, we propose a novel method for tracking pedestrians while estimating their occlusion states. Our method acquires obstacle areas by the pedestrian regions extracted from each frame, estimates the occlusion states from the acquired obstacle areas and pedestrian movements, and reflects the estimated occlusion states in the region association process.
Download

Short Papers
Paper Nr: 25
Title:

Iterative Color Equalization for Increased Applicability of Structured Light Reconstruction

Authors:

Torben Fetzer, Gerd Reis and Didier Stricker

Abstract: The field of 3D reconstruction is one of the most important areas in computer vision. It is not only of theoretical importance, but is also increasingly used in practice, be it in reverse engineering, quality control or robotics. A distinction is made between active and passive methods, depending on whether they are based on active interactions with the object or not. Due to the accuracy and density of the reconstructions obtained, the structured light approach, whenever applicable, is often the method of choice for industrial applications. Nevertheless, it is an active approach which, depending on material properties or coloration, can lead to problems and fail in certain situations. In this paper, a method based on the standard structured light approach is presented that significantly reduces the influence of the color of a scanned object. It improves the results obtained by repeated application in terms of accuracy, robustness and general applicability. Especially in high-precision reconstruction of small structures or high-contrast colored and specular objects, the technique shows its greatest potential. The advanced method requires neither pre-calibrated cameras or projectors nor information about the equipment. It is easy to implement and can be applied to any existing scanning setup.
Download

Paper Nr: 62
Title:

Metrics Performance Analysis of Optical Flow

Authors:

Taha Alhersh, Samir B. Belhaouari and Heiner Stuckenschmidt

Abstract: A significant amount of research has been conducted on optical flow estimation in previous decades. However, only a limited number of studies have addressed the performance analysis of optical flow. These evaluations have shortcomings, and a theoretical justification for using one approach over another is needed. In practice, design choices are often made based on unmotivated qualitative criteria or by trial and error. In this paper, novel optical flow performance metrics are proposed and evaluated alongside current metrics. Our empirical findings suggest using two new optical flow performance metrics, namely Normalized Euclidean Error (NEE) and Enhanced Normalized Euclidean Error version one (ENEE1), for optical flow performance evaluation with ground truth.
Download
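The abstract does not spell out the NEE and ENEE1 formulas. Purely for illustration, the sketch below computes the standard per-pixel endpoint error together with one plausible magnitude-normalized variant; the normalization shown is our assumption, not necessarily the paper's definition.

```python
import numpy as np

def endpoint_error(flow_est, flow_gt):
    # flows: arrays of shape (H, W, 2) holding (u, v) per pixel
    return np.linalg.norm(flow_est - flow_gt, axis=-1)

def normalized_euclidean_error(flow_est, flow_gt, eps=1e-6):
    """One plausible normalization: divide the per-pixel Euclidean
    error by the ground-truth flow magnitude, so large motions are
    not over-weighted. Illustrative guess, not the paper's NEE."""
    ee = endpoint_error(flow_est, flow_gt)
    mag = np.linalg.norm(flow_gt, axis=-1)
    return ee / (mag + eps)
```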

Paper Nr: 65
Title:

Patient Motion Compensation for Photogrammetric Registration

Authors:

Hardik Jain, Olaf Hellwich, Andreas Rose, Nicholas Norman, Dirk Mucha and Timo Krüger

Abstract: Photogrammetry has evolved into a non-invasive alternative for various medical applications, including co-registration of the patient at the time of a surgical operation with pre-surgically acquired data as well as with surgical instruments. In this setting, the body surface position regularly has to be determined in a global coordinate system with high accuracy. In this paper, we treat this task for multi-view monocular imagery capturing both the body surface and, e.g., reference markers. To fulfill the high accuracy requirements, the patient is not supposed to move while images are taken. An approach to relaxing this demanding situation is to measure small movements of the patient, e.g. with the help of an electromagnetic device, and to compensate for the measured motion prior to body surface triangulation. We present two approaches for motion compensation: disparity shift compensation and moving cameras compensation, both capable of achieving patient registration qualitatively equivalent to motion-free registration.
Download

Paper Nr: 203
Title:

360-Degree Autostereoscopic Display using Conical Mirror and Integral Photography Technology

Authors:

Nobuyuki Ikeya and Kazuhisa Yanaka

Abstract: We propose a new 360° autostereoscopic display that combines a conical mirror and integral photography technology. Our system is similar to the conventional holographic pyramid in that a 3D object appears to float near the center. However, the pyramid consists of four planes with visible borders, whereas the conical mirror has only a seamless curved surface. Therefore, a stereoscopic image can be observed from any angle. The object displayed in the cone is a CG character. It is pre-rendered every 0.5° to obtain 720 still images. One IP image is synthesized based on those still images. This system has the advantage that it can be manufactured at a relatively low cost. Moreover, high reliability can be expected because this display has no mechanical moving parts.
Download

Paper Nr: 224
Title:

Affine Transformation from Fundamental Matrix and Two Directions

Authors:

Nghia L. Minh and Levente Hajder

Abstract: Researchers have recently shown that affine transformations between corresponding patches of two images can be applied to 3D reconstruction, including the reconstruction of surface normals. However, the accurate estimation of affine transformations between image patches is very challenging. This paper proposes a novel method to estimate affine transformations from two directions when the epipolar geometry of the image pair is known. A reconstruction pipeline is also briefly proposed. As by-products, two proofs are given: the first establishes the relationship between affine transformations and the fundamental matrix, while the second shows how an optimal surface normal estimate can be obtained via the roots of a cubic polynomial. A visual debugger is also proposed to validate the estimated surface normals in real images.
Download

Paper Nr: 252
Title:

S3D-R2R: An Automatic Stereoscopic 3D Image Recomposition to Retargeting Method with Depth Modification

Authors:

Md. B. Islam, Chee O. Wong and Md. K. Islam

Abstract: Adapting a stereoscopic image to a target display device while minimizing the distortion of significant features and stereoscopic properties is a challenging problem. Conventional methods either fail to preserve the image context or are unable to improve the image aesthetics with improved depth perception in the retargeted images. In this paper, we present an automatic warping-based stereoscopic 3D image recomposition-to-retargeting method, S3D-R2R for short, that improves the stereo image composition in the retargeting results. Our S3D-R2R method resizes both images of the stereo pair using a global optimization algorithm that minimizes a set of aesthetic quality errors. These errors are formulated based on selected photographic composition rules, and the method modifies the depth perception. To improve the depth perception of the stereo pair, the disparity consistency is modified within the comfort disparity range. Experimental results show that our automatic method changes the position of the salient object in the target image scale and improves the depth perception within the comfort depth range. Empirical user studies indicate that our retargeting results receive more attention than those of state-of-the-art methods.
Download

Paper Nr: 287
Title:

Meta-parameters Exploration for Unsupervised Event-based Motion Analysis

Authors:

Veís Oudjail and Jean Martinet

Abstract: Being able to estimate motion features is an essential step in dynamic scene analysis. Optical flow typically quantifies the apparent motion of objects. Motion features can benefit from bio-inspired models of the mammalian retina, where ganglion cells show preferences for global patterns of direction, especially in the four cardinal translatory directions. We study the meta-parameters of a bio-inspired motion estimation model using event cameras, which are bio-inspired vision sensors that naturally capture the dynamics of a scene. The motion estimation model consists of an elementary Spiking Neural Network that learns the motion dynamics in an unsupervised way through Spike-Timing-Dependent Plasticity (STDP). After short simulation times, the model can successfully estimate directions without supervision. Among the advantages of such networks are their unsupervised and continuous learning capabilities, as well as their implementability on very low-power hardware. The model is tuned using a synthetic dataset generated for parameter estimation, made of various patterns moving in several directions. The parameter exploration shows that attention should be given to model tuning, yet the model is generally stable over meta-parameter changes.
Download
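The network details are left to the paper; as a generic illustration of the pair-based STDP rule that such spiking networks typically rely on (the constants below are assumed values, not the paper's):

```python
import numpy as np

# Pair-based STDP: potentiate when the presynaptic spike precedes
# the postsynaptic spike, depress otherwise.
A_PLUS, A_MINUS = 0.01, 0.012   # learning rates (assumed values)
TAU = 20.0                      # time constant in ms (assumed)

def stdp_update(w, t_pre, t_post, w_min=0.0, w_max=1.0):
    dt = t_post - t_pre
    if dt > 0:    # pre before post -> long-term potentiation
        w += A_PLUS * np.exp(-dt / TAU)
    else:         # post before pre -> long-term depression
        w -= A_MINUS * np.exp(dt / TAU)
    return np.clip(w, w_min, w_max)
```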

Paper Nr: 3
Title:

Comparison of the Optical Flow Quality for Video Denoising

Authors:

Nelson Monzón and Javier Sánchez

Abstract: Video denoising techniques need to account for the motion present in the scene. In the literature, many strategies guide their temporal filters along trajectories given by optical flow. However, the quality of these flows is rarely investigated; in fact, very few studies compare the behavior of denoising proposals across different optical flow algorithms. In that direction, we analyze several methods and their performance using a general pipeline that reduces noise by averaging along each pixel's trajectory. This ensures that the denoising strongly depends on the optical flow. We also analyze the behavior of the methods at occlusions and illumination changes. The pipeline incorporates a process to discard these effects, so that they do not affect the comparison metrics. We are led to propose a ranking of optical flow methods according to their effectiveness for video denoising, which mainly depends on their complexity.
Download
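As a rough sketch of the kind of pipeline described, the code below averages each pixel along its forward flow trajectory; the nearest-neighbor warping and window length are our simplifications, and the occlusion/illumination handling mentioned in the abstract is omitted.

```python
import numpy as np

def average_along_flow(frames, flows, t, radius=2):
    """Average pixel values along forward flow trajectories.
    frames[i]: (H, W) grayscale frame; flows[i]: (H, W, 2) flow
    field mapping frame i to frame i+1."""
    H, W = frames[t].shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    acc = frames[t].astype(np.float64).copy()
    cnt = np.ones((H, W))
    x, y = xs.copy(), ys.copy()
    for i in range(t, min(t + radius, len(frames) - 1)):
        xi = np.clip(np.rint(x).astype(int), 0, W - 1)
        yi = np.clip(np.rint(y).astype(int), 0, H - 1)
        x = x + flows[i][yi, xi, 0]   # follow the trajectory forward
        y = y + flows[i][yi, xi, 1]
        xj = np.clip(np.rint(x).astype(int), 0, W - 1)
        yj = np.clip(np.rint(y).astype(int), 0, H - 1)
        acc += frames[i + 1][yj, xj]  # accumulate along the path
        cnt += 1
    return acc / cnt
```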

Paper Nr: 39
Title:

Who Is Your Favourite Player? Specific Player Tracking in Soccer Broadcast

Authors:

Tatsuya Nakamura and Katsuto Nakajima

Abstract: In this paper, we propose a method to identify and track only a specific player among a number of players wearing the same jersey in the video sequence of a soccer broadcast, in order to make a summary video focusing on the plays of that player. In a soccer broadcast, tracking a specific player is not easy because many players of both teams come and go and move across the field. We therefore devised a method that overcomes this difficulty by combining multiple machine-learning techniques, such as deep neural networks. Our evaluation was conducted on nine players in three different positions and wearing three different jersey colors, and it shows that, although there is room for improvement in recall, our proposed method can successfully track specific players with a precision of over 90%.
Download

Paper Nr: 47
Title:

Simultaneous Visual Context-aware Path Prediction

Authors:

Haruka Iesaki, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Yasunori Ishii, Kazuki Kozuka and Ryota Fujimura

Abstract: Autonomous cars need to understand their surrounding environment to avoid accidents. Moving objects such as pedestrians and cyclists affect decisions about driving direction and behavior, and there is often more than one pedestrian present, so the number of people in the surrounding environment must be handled simultaneously. Path prediction therefore requires understanding the current state of the scene. To solve this problem, we propose a path prediction method that considers the motion context obtained from dashcams. Conventional methods take the surrounding environment and positions as input and output probability values. In contrast, our approach predicts probabilistic paths using visual information. Our method is an encoder-predictor model based on convolutional long short-term memory (ConvLSTM), which extracts visual information from object coordinates and images. We examine two types of input images and two types of model. The images relate to the pedestrian context and are made from cropped pedestrian positions with the background removed. The two model variants differ in whether the decoder inputs are fed back recursively, since future images cannot be obtained at prediction time. Our results show that the visual context carries useful information and provides better prediction results than using coordinates alone. Moreover, we show that our method easily extends to predicting multiple people simultaneously.
Download
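For illustration, a minimal ConvLSTM encoder-predictor of the kind described can be sketched with the Keras ConvLSTM2D layer; the layer sizes, input shape, and per-pixel probability output below are our assumptions, not the paper's architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder_predictor(T=8, H=64, W=64, C=3):
    """Encode T past frames and predict a per-pixel path
    probability map for the next step. Shapes are illustrative."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(T, H, W, C)),
        layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True),
        layers.ConvLSTM2D(32, 3, padding="same"),   # last hidden state
        layers.Conv2D(1, 1, activation="sigmoid"),  # probabilistic path map
    ])

model = build_encoder_predictor()
model.summary()
```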

Paper Nr: 79
Title:

Robustness Improvement in Optical Deformation Analysis by Matching a Motion Field to Stress Imposed on a Surface

Authors:

Jun Takada and Masahiko Ohta

Abstract: A denoising and compression method for motion field data is proposed to improve the robustness and efficiency of optical deformation analysis. The proposed method estimates the stress change over time imposed on a captured surface, based on displacements and strains derived from motion fields obtained by optical flow. The method then finds the best least-squares approximation of the motion components due to the stress time series from the motion time series at each coordinate. This process decomposes motion fields into stress and response vectors while removing disturbances. Experimental results confirm that the proposed method significantly reduces noise when visualizing crack opening displacements on a bridge beam under traffic loads, as well as the size of the motion field data.
Download
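A minimal numpy sketch of the least-squares step described, fitting each pixel's motion time series as a scaled response to a given stress time series; variable names and shapes are our assumptions.

```python
import numpy as np

def decompose_motion(motion, stress):
    """motion: (T, H, W, 2) motion time series per pixel;
    stress: (T,) estimated stress change over time.
    Returns the per-pixel response vector r minimizing
    sum_t ||motion[t] - stress[t] * r||^2 (closed form), plus
    the residual, which is treated as disturbance/noise."""
    s = stress.reshape(-1, 1, 1, 1)
    denom = np.sum(stress ** 2) + 1e-12
    r = np.sum(s * motion, axis=0) / denom     # (H, W, 2) response field
    residual = motion - s * r[None]            # disturbance component
    return r, residual
```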

Paper Nr: 84
Title:

Efficient Stereo Matching Method using Elimination of Lighting Factors under Radiometric Variation

Authors:

Yong-Jun Chang, Sojin Kim and Moongu Jeon

Abstract: Many stereo matching methods produce quite accurate depth estimates for images captured under identical lighting conditions. However, the lighting conditions of a stereo pair are rarely identical in real video shooting environments. Consequently, stereo matching, which estimates depth information by searching for corresponding points between two images, has difficulty obtaining accurate results in this case. Some algorithms have been proposed to overcome this problem and have shown good performance, but they require a large amount of computation and therefore suffer from poor matching efficiency. In this paper, we propose an efficient stereo matching method using a color formation model that takes into account exposure and illumination changes in the captured images. Our method transforms the input image into a radiometrically invariant image and applies a local binary patch, which is robust to lighting changes, to the transformed image, improving the matching speed under exposure and illumination changes.
Download
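The paper's exact patch descriptor is not given in the abstract. For illustration, a census-transform-style local binary patch is invariant to any monotonic exposure or illumination change; the window size below is an assumption.

```python
import numpy as np

def census_transform(img, r=2):
    """Encode each pixel by comparing its (2r+1)^2 - 1 neighbors
    against the center pixel; the resulting bit pattern is unchanged
    by any monotonic brightness change, which makes the subsequent
    matching radiometrically robust."""
    H, W = img.shape
    out = np.zeros((H, W), dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            out = (out << np.uint64(1)) | (shifted < img).astype(np.uint64)
    return out
```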

Paper Nr: 134
Title:

Generating a Consistent Global Map under Intermittent Mapping Conditions for Large-scale Vision-based Navigation

Authors:

Kazuki Nishiguchi, Walid Bousselham, Hideaki Uchiyama, Diego Thomas, Atsushi Shimada and Rin-ichiro Taniguchi

Abstract: Localization is the process of computing sensor poses based on vision technologies such as visual Simultaneous Localization And Mapping (vSLAM), and it can generally be applied to navigation systems. To achieve this, a global map is essential: the relocalization process requires a single consistent map represented in a unified coordinate system. However, a large-scale global map cannot be created in one pass, owing to insufficient visual features at some moments. This paper presents an interactive method to generate a consistent global map from intermittent maps created independently by vSLAM, via global reference points. First, vSLAM is applied to individual image sequences to create maps independently. At the same time, multiple reference points with known latitude and longitude are interactively recorded in each map. Then, the coordinate system of each individual map is converted into one with metric scale and axes unified through the reference points. Finally, the individual maps are merged into a single map based on the relative position of each origin. In the evaluation, we show map merging and relocalization results on our dataset to confirm the effectiveness of our method for navigation tasks. In addition, we report on our participation in a navigation competition in a practical environment.
Download
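The conversion of each map into a metric, unified coordinate system can be illustrated with a standard similarity (Umeyama-style) alignment onto the reference points; the function below is our sketch of that standard technique, not the authors' code.

```python
import numpy as np

def similarity_align(src, dst):
    """Find scale s, rotation R, translation t minimizing
    sum_i || s * R @ src_i + t - dst_i ||^2 over reference points.
    src: (N, 3) points in a vSLAM map's arbitrary frame;
    dst: (N, 3) the same points in the metric global frame."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    X, Y = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(Y.T @ X)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))    # avoid reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (X ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```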

Paper Nr: 168
Title:

Labelling of Continuous Dynamic Interactions with the Environment using a Dynamic Model Representation

Authors:

Juan C. Ramirez and Darius Burschka

Abstract: We propose an extension of a dynamic 3D model that allows a hierarchical labeling of continuous interactions in scenes. While most systems focus on labels for pure transportation tasks, we show how Atlas information attached to objects identified in the scene can be used to label not only transportation tasks but also physical interactions, such as writing, erasing a board, or tightening a screw. We analyze the dynamic motion observed by a camera system at different abstraction levels, ranging from simple motion primitives, over single physical actions, to complete processes. The associated observation time horizons range from the single turning motion of a screw tightened during a task, over the process of inserting screws, to the entire process of building a device. The complexity and the time horizon for possible predictions about actions in the scene increase with the abstraction level. We present the extension using the example of typical tasks observed by a camera, such as writing on and erasing a whiteboard.
Download

Paper Nr: 179
Title:

Real-time Surveillance based Crime Detection for Edge Devices

Authors:

Sai V. Venkatesh, Adithya P. Anand, Gokul S. S., Akshay Ramakrishnan and Vineeth Vijayaraghavan

Abstract: There is a growing use of surveillance cameras to maintain a log of events that can help identify criminal activities. However, the acquired footage must be monitored continuously, which increases labor costs and, more importantly, violates privacy. We therefore need decentralized surveillance systems that function autonomously in real time to reduce crime rates even further. In our work, we discuss an efficient deep-learning method for crime detection that can be used for on-device crime monitoring. By making the inferences on-device, we reduce latency and the cost overhead of collecting data in a centralized unit, and we counteract the lack of privacy. Using the concept of EarlyStopping–Multiple Instance Learning to provide low inference time, we build specialized models for crime detection using two real-world datasets in the domain. We apply sub-Nyquist sampling to the video and introduce a metric, η_comp, for evaluating the reduction of computation due to undersampling. On average, our models tested on a Raspberry Pi 3 Model B provide a 30% increase in accuracy over benchmarks, computational savings of 80.23%, and around 13 times lower inference times. This enables efficient and accurate real-time implementations for on-device crime detection on edge devices.
Download
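The abstract does not define η_comp. One hypothetical reading, purely for illustration, is the fraction of per-frame inference computation avoided by undersampling:

```python
def eta_comp(frames_total, frames_processed):
    """Hypothetical computation-reduction metric: the fraction of
    frame inferences avoided by sub-Nyquist undersampling. This
    definition is our assumption, not necessarily the paper's."""
    return 1.0 - frames_processed / frames_total

# e.g., processing every 5th frame of a 300-frame clip:
print(eta_comp(300, 60))  # 0.8 -> 80% of per-frame compute saved
```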

Paper Nr: 202
Title:

On the Fly Vehicle Modeling and Tracking with 2D-LiDAR Detector and Infrared Camera

Authors:

Kazuhiko Sumi, Kazunari Takagi, Tatsuya Oshiro, Takuya Matsumoto, Kazuyoshi Kitajima, Yoshifumi Hayakawa and Masayuki Yamamoto

Abstract: We propose a vehicle detection and tracking system that tracks vehicles from the rear using a 10-band infrared (IR) surveillance camera installed along the expressway. The main reason for using an infrared camera is to suppress the strong light reflections of the head and tail lights of vehicles on rainy nights. However, due to the lack of large IR traffic video datasets covering all types of vehicles, we cannot take advantage of recent machine learning advances. We therefore propose a rather straightforward approach: detect vehicles with a pair of 2D-LiDARs, then generate the image model of the vehicle to be tracked on the fly. We prototyped the system and evaluated it on normal traffic video taken on a highway, achieving a 94% tracking success rate at distances of 20 m to 70 m from the camera, with a mean localization error of less than 2 m at 70 m.
Download

Paper Nr: 258
Title:

Learning Effective Sparse Sampling Strategies using Deep Active Sensing

Authors:

Mehdi Stapleton, Dieter Schmalstieg, Clemens Arth and Thomas Gloor

Abstract: Registering a known model with noisy sample measurements is in general a difficult task owing to the problem of finding correspondences between the samples and points on the known model. General frameworks exist, such as variants of the classical iterative closest point (ICP) method, to iteratively refine correspondence estimates. However, these methods are prone to getting trapped in locally optimal configurations, which may be far from the true registration. The quality of the final registration depends strongly on the set of samples, and this dependence is most noticeable when the number of samples is relatively low (≈ 20). We consider sample selection in the context of active perception, i.e. an objective-driven decision-making process, to motivate our research and the construction of our system. We present a full environment for learning how to select the regions of the scene to sample and, in doing so, improve the accuracy and efficiency of the sampling process, so that a model can be registered with the scene quickly and accurately. This work has broad applicability, from geodesy to medical robotics, where the cost of taking a measurement is much higher than the cost of incremental changes to the pose of the equipment.
Download

Paper Nr: 259
Title:

Optical Flow Estimation using a Correlation Image Sensor based on FlowNet-based Neural Network

Authors:

Toru Kurihara and Jun Yu

Abstract: Optical flow estimation is one of the most challenging tasks in computer vision. In this paper, we aim to combine correlation images, which enable single-frame optical flow estimation, with deep neural networks. A correlation image sensor captures the temporal correlation between the incident light intensity and reference signals, and can thus effectively record intensity variations caused by object motion. We developed a FlowNetS-based neural network that takes correlation images as input. Our experimental results demonstrate that the proposed network successfully estimates optical flow.
Download

Paper Nr: 290
Title:

Segmentation and Visualization of Crowd Flows in Videos using Hybrid Force Model

Authors:

Shreetam Behera, Debi P. Dogra, Malay K. Bandyopadhyay and Partha P. Roy

Abstract: Understanding crowd phenomena is a challenging task. It can help monitor crowds to prevent unwanted incidents. Crowd flow is one of the most important such phenomena, describing the motion of people in crowded scenarios. Crowd flow analysis is popular among computer vision researchers since it can be used to describe the behavior of the crowd. In this paper, a hybrid model is proposed to understand the flows in densely crowded videos. The proposed method uses a Smoothed Particle Hydrodynamics (SPH)-based model guided by a Langevin-based force model to segment linear as well as non-linear flows in crowd gatherings. The SPH-based model identifies coherent motion groups, whose behavior is then analyzed using the Langevin-equation-guided force model to segment dominant flows. The proposed method, based on the hybrid force model, has been evaluated on public video datasets. We observe that the proposed hybrid scheme segments linear as well as non-linear flows with accuracy as high as 91.23%, which is 4-5% better than existing crowd flow segmentation algorithms. Our method's execution time is also better than that of existing techniques.
Download
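As a generic illustration of the Langevin-based force model component (not the authors' implementation), one Euler-Maruyama integration step of the Langevin equation; all parameters below are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_step(v, force, gamma=0.5, sigma=0.2, dt=0.1):
    """One Euler-Maruyama step of dv = (-gamma*v + F) dt + sigma dW:
    damped drift toward the driving force plus stochastic
    fluctuation. v, force: (N, 2) per-particle velocity and force."""
    noise = rng.normal(size=v.shape) * np.sqrt(dt)
    return v + (-gamma * v + force) * dt + sigma * noise
```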