
Keynote Lectures

Building Emotionally Intelligent AI: From Sensing to Synthesis
Daniel McDuff, Microsoft, United States

Reinventing Movies: How Do We Tell Stories in VR?
Diego Gutierrez, Universidad de Zaragoza, Spain

Robust Fitting of Multiple Models in Computer Vision
Jiri Matas, Czech Technical University in Prague, Faculty of Electrical Engineering, Czech Republic

A Fine-grained Perspective onto Object Interactions from First-person Views
Dima Damen, Computer Science, University of Bristol, United Kingdom


 

Building Emotionally Intelligent AI: From Sensing to Synthesis

Daniel McDuff
Microsoft
United States
 

Brief Bio
Daniel McDuff is a Researcher at Microsoft where he leads research and development of affective computing technology, with a focus on scalable tools to enable the automated recognition and analysis of emotions and physiology. He is also a visiting scientist at Brigham and Women’s Hospital in Boston, where he works on deploying these methods in primary care and surgical applications. Daniel completed his PhD in the Affective Computing Group at the MIT Media Lab in 2014 and holds a B.A. and a Master's degree from Cambridge University. Previously, Daniel was Director of Research at Affectiva and a post-doctoral research affiliate at the MIT Media Lab. During his PhD and at Affectiva he built state-of-the-art facial expression recognition software and led analysis of the world's largest database of facial expression videos. His work in machine learning, AR and affective computing has received nominations and awards from Popular Science magazine as one of the top inventions of 2011, South by Southwest Interactive (SXSWi), The Webby Awards, ESOMAR and the Center for Integration of Medicine and Innovative Technology (CIMIT). His projects have been reported in many publications, including The Times, The New York Times, The Wall Street Journal, BBC News, New Scientist, Scientific American and Forbes magazine. Daniel was named a 2015 WIRED Innovation Fellow and has spoken at TEDx Berlin and SXSW. He is a member of the ACM Future of Computing Academy.


Abstract
Emotions play an important role in our everyday lives. They influence memory, decision-making and well-being. In order to advance the fundamental understanding of human emotions, build smarter affective technology, and ultimately help people, we need to perform research in situ. Leveraging exciting advances in machine learning and computer vision, it is now possible to quantify emotional and physiological responses in new ways and on a large scale, using webcams and microphones in everyday environments. I will present novel methods for physiological and behavioral measurement via ubiquitous hardware and show data from the largest longitudinal data collection of its kind. Then I will present state-of-the-art approaches to emotion synthesis (both audio and visual) that can be used to create rich human-agent or robot interactions. Finally, I will show examples of new human-computer interfaces that leverage behavioral and physiological signals, including emotion-aware natural conversational systems.
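The talk touches on measuring physiology with ordinary webcams. As a purely illustrative sketch of the general idea behind camera-based pulse measurement (not the specific methods presented in the talk), one common starting point is to band-pass filter the mean green-channel intensity of a tracked face region over time; the function names, filter order and band limits below are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def pulse_signal_from_roi_means(green_means, fps=30.0, low_hz=0.7, high_hz=4.0):
    """Rough pulse waveform from per-frame mean green intensity of a face region.

    green_means: 1-D array with one value per video frame (mean green channel
    over a face region). The band limits roughly cover 42-240 beats per minute.
    """
    x = np.asarray(green_means, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)              # normalize
    b, a = butter(3, [low_hz, high_hz], btype="band", fs=fps)
    return filtfilt(b, a, x)                           # band-passed pulse estimate

def estimate_heart_rate_bpm(pulse, fps=30.0, low_hz=0.7, high_hz=4.0):
    """Dominant frequency within the pass band, converted to beats per minute."""
    freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fps)
    power = np.abs(np.fft.rfft(pulse)) ** 2
    band = (freqs >= low_hz) & (freqs <= high_hz)
    return 60.0 * freqs[band][np.argmax(power[band])]
```

In practice, published webcam-based methods combine colour channels, track the face, and handle motion and illumination changes; the sketch above only shows the signal-processing core under those simplifying assumptions.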



 

 

Reinventing Movies: How Do We Tell Stories in VR?

Diego Gutierrez
Universidad de Zaragoza
Spain
http://giga.cps.unizar.es/~diegog/
 

Brief Bio
Diego Gutierrez is a Professor at the Universidad de Zaragoza in Spain, where he leads the Graphics and Imaging Lab. His areas of interest include physically based global illumination, perception, computational imaging, and virtual reality. He has published many papers in top journals and conferences, and has served as Papers Chair of Eurographics (2018), the Rendering Symposium (2012), and the Symposium on Applied Perception (2011). He has been Editor-in-Chief of ACM Transactions on Applied Perception, and is currently an Associate Editor of four other journals. He has received many awards, including a Google Faculty Research Award in 2014 and an ERC Consolidator Grant in 2016.


Abstract
Traditional cinematography has relied for over a century on a well-established set of editing rules, called continuity editing, to create a sense of situational continuity. Despite massive changes in visual content across cuts, viewers generally have no trouble perceiving the discontinuous flow of information as a coherent set of events. However, Virtual Reality (VR) movies are intrinsically different from traditional movies in that the viewer controls the camera orientation at all times. As a consequence, common editing techniques that rely on camera orientation, zooms, etc., cannot be used. In this talk we will investigate key questions for understanding how well traditional movie editing carries over to VR, such as: Does the perception of continuity hold across edit boundaries? Under which conditions? Does viewers’ observational behavior change after cuts? We will make connections with recent cognition studies and event segmentation theory, which states that our brains segment continuous actions into a series of discrete, meaningful events. This theory may in principle explain why traditional movie editing works so well, and thus may hold the answers to redesigning movie cuts in VR as well. In addition, and related to the general question of how people explore immersive virtual environments, we will present the main insights of a second, recent study analyzing almost 2000 head and gaze trajectories as users explore stereoscopic omni-directional panoramas. We have made our database publicly available for other researchers.



 

 

Robust Fitting of Multiple Models in Computer Vision

Jiri Matas
Czech Technical University in Prague, Faculty of Electrical Engineering
Czech Republic
 

Brief Bio
Jiri Matas is a full professor at the Center for Machine Perception, Czech Technical University in Prague. He holds a PhD degree from the University of Surrey, UK (1995). He has published more than 200 papers in refereed journals and conferences. His publications have approximately 34000 citations registered in Google Scholar and 13000 in the Web of Science; his h-index is 65 (Google Scholar) and 43 (Clarivate Analytics Web of Science). He has received best paper prizes at the British Machine Vision Conference in 2002 and 2005, at the Asian Conference on Computer Vision in 2007, and at the International Conference on Document Analysis and Recognition in 2015. J. Matas has served in various roles at major international computer vision conferences (e.g. ICCV, CVPR, ICPR, NIPS, ECCV), co-chairing ECCV 2004, ECCV 2016 and CVPR 2007. He is on the editorial board of IJCV and was the Associate Editor-in-Chief of IEEE T. PAMI. He has served on the computer science panel of the ERC. His research interests include visual tracking, object recognition, image matching and retrieval, sequential pattern recognition, and RANSAC-type optimization methods.


Abstract
Many computer vision problems can be formulated as multi-class multi-instance fitting, where the input data is interpreted as a mixture of noise and observations originating from multiple instances of multiple model types: for example, lines and circles in edge maps; planes, cylinders and point clusters in 3D laser scans; or multiple homographies or fundamental matrices consistent with point correspondences in multiple views of a non-rigid scene. I will review properties of three popular data fitting methods: RANSAC, the Hough transform, and Isack and Boykov's PEARL, which disposes of the assumption of independent data errors. I will then present a novel method, called Multi-X, for general multi-class multi-instance model fitting. The proposed approach combines a random sampling strategy like RANSAC, global energy minimization using alpha-expansion like PEARL, and mode-seeking in the parameter domain like the Hough transform. Multi-X significantly outperforms the state of the art on standard datasets and runs in time approximately linear in the number of data points, an order of magnitude faster than available implementations of commonly used methods.
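Since the abstract describes Multi-X by contrast with RANSAC-style random sampling, here is a minimal sketch of the simpler, greedy sequential-RANSAC idea for recovering several line instances from noisy 2D points. It is included only to make the "random sampling plus inlier counting" ingredient concrete; all names and thresholds are assumptions, and this is not the Multi-X algorithm.

```python
import numpy as np

def fit_line(p, q):
    """Line through two points as normalized (a, b, c) with a*x + b*y + c = 0."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    c = -(a * x1 + b * y1)
    n = np.hypot(a, b)
    return np.array([a, b, c]) / n if n > 0 else None

def sequential_ransac_lines(points, max_models=3, iters=500,
                            inlier_thr=0.02, min_inliers=20, seed=0):
    """Greedily extract up to max_models lines; leftover points are treated as noise."""
    rng = np.random.default_rng(seed)
    remaining = np.asarray(points, dtype=float)
    models = []
    for _ in range(max_models):
        if len(remaining) < 2:
            break
        best_line, best_mask = None, None
        for _ in range(iters):
            i, j = rng.choice(len(remaining), size=2, replace=False)
            line = fit_line(remaining[i], remaining[j])
            if line is None:
                continue
            # point-to-line distances (line is normalized)
            d = np.abs(remaining @ line[:2] + line[2])
            mask = d < inlier_thr
            if best_mask is None or mask.sum() > best_mask.sum():
                best_line, best_mask = line, mask
        if best_mask is None or best_mask.sum() < min_inliers:
            break
        models.append(best_line)
        remaining = remaining[~best_mask]   # remove explained points and repeat
    return models
```

The greedy remove-and-repeat loop is exactly what joint formulations such as PEARL and Multi-X aim to improve on, by optimizing the assignment of all points to all instances (plus an outlier class) at once instead of one model at a time.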



 

 

A Fine-grained Perspective onto Object Interactions from First-person Views

Dima Damen
Computer Science, University of Bristol
United Kingdom
http://www.cs.bris.ac.uk/~damen
 

Brief Bio
Dima Damen is an Associate Professor in Computer Vision at the University of Bristol, United Kingdom. She received her PhD from the University of Leeds, UK (2009). Dima's research interests are in the automatic understanding of object interactions, actions and activities using wearable and static visual (and depth) sensors. She has contributed work on novel research questions including fine-grained object interaction recognition, understanding the completion of actions, skill determination from video, semantic ambiguities of actions, and the robustness of classifiers to actions' temporal boundaries. Her work is published in leading venues: CVPR, ECCV, ICCV, PAMI, IJCV, CVIU and BMVC. In 2018, she led the release of the largest dataset in first-person vision to date (EPIC-KITCHENS): 11.5M frames of non-scripted recordings with full ground truth. Dima co-chaired BMVC 2013, has been an area chair for BMVC (2014-2018), and is an associate editor of Pattern Recognition (2017-). She was selected as a Nokia Research collaborator in 2016, and as an Outstanding Reviewer at ICCV17, CVPR13 and CVPR12.


Abstract
Traditionally, action understanding has been limited to assigning one out of a pre-selected set of labels to a trimmed video sequence. This talk goes beyond traditional action recognition to fine-grained understanding of daily object interactions. The talk will discuss works that attempt to understand ‘when’ an object interaction takes place, including ‘when’ it can be considered completed; ‘which’ semantic labels can describe the interaction; ‘how’ the interaction can be described (or captioned); and ‘who’ is better when contrasting people performing the same interaction. The potential and limitations of current deep architectures for fine-grained object interaction understanding will be discussed. The talk will focus on the first-person viewpoint, captured using wearable cameras, as it offers a unique perspective onto objects during interactions.



 



 

