Buğra Tekin

I am a Research Scientist on the Meta Reality Labs XR Input team. Before that, I spent almost 4 years at the Microsoft Mixed Reality & AI group in Zürich. I received my Ph.D. degree from the Computer Vision Laboratory of EPFL under the supervision of Prof. Pascal Fua and Prof. Vincent Lepetit. I obtained my M.Sc. degree in Electrical Engineering from EPFL in 2013, and my B.Sc. degree in Electrical & Electronics Engineering from Bogazici University in 2011 with high honors. I also spent time at Microsoft Research as a research intern and at ETH Zürich as a visiting researcher. I received the Qualcomm Innovation Fellowship Europe in 2017.

Email  /  Google Scholar  /  LinkedIn


I'm interested in computer vision, machine learning, deep learning, image processing, and augmented reality. Much of my research is about semantically understanding humans and objects in the 3D world from camera images. In particular, I work on 2D/3D human pose estimation, motion capture, hand pose estimation, action recognition, 3D object detection, and 6D pose estimation. In the past, I have also worked on biomedical imaging.


Context-Aware Sequence Alignment using 4D Skeletal Augmentation
Taein Kwon, Bugra Tekin, Siyu Tang, Marc Pollefeys
Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)

We propose a skeletal self-supervised learning approach that uses alignment as a pretext task. Our approach to alignment relies on a context-aware attention model that incorporates spatial and temporal context within and across sequences and a contrastive learning formulation that relies on 4D skeletal augmentations. Pose data provides a valuable cue for alignment and downstream tasks, such as phase classification and phase progression, as it is robust to different camera angles and changes in the background, while being efficient for real-time processing.

Learning to Align Sequential Actions in the Wild
Weizhe Liu, Bugra Tekin, Huseyin Coskun, Vibhav Vineet, Pascal Fua, Marc Pollefeys
Computer Vision and Pattern Recognition (CVPR), 2022

We propose an approach to align sequential actions in the wild that involve diverse temporal variations. To this end, we present a new method to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency while allowing for variations in the order of actions. Our model accounts for both monotonic and non-monotonic sequences and handles background frames that should not be aligned. We demonstrate that our approach consistently outperforms the state of the art in self-supervised sequential action representation learning.

H2O: Two Hands Manipulating Objects for First Person Interaction Recognition
Taein Kwon, Bugra Tekin, Jan Stuehmer, Federica Bogo, Marc Pollefeys
International Conference on Computer Vision (ICCV), 2021

In this paper, we propose a method to collect a dataset of two hands manipulating objects for first-person interaction recognition. We provide a rich set of annotations including action labels, object classes, 3D left and right hand poses, 6D object poses, camera poses, and scene point clouds. We further propose the first method to jointly recognize the 3D poses of two hands manipulating objects and a novel topology-aware graph convolutional network for recognizing hand-object interactions.

Domain-Specific Priors and Meta Learning for Low-shot First-Person Action Recognition
Huseyin Coskun, Zeeshan Zia, Bugra Tekin, Federica Bogo, Nassir Navab, Federico Tombari, Harpreet Sawhney
Pattern Analysis and Machine Intelligence (PAMI), 2021

We develop an effective method for low-shot transfer learning for first-person action classification. We leverage independently trained local visual cues to learn representations that can be transferred from a source domain providing primitive action labels to a target domain with only a handful of examples.

Reconstructing and grounding narrated instructional videos in 3D
Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schoenberger, Bugra Tekin, Marc Pollefeys
arXiv preprint arXiv:2109.04409, 2021

We present a method for 3D reconstruction of instructional videos and for localizing the associated narrations in 3D. Our method is robust to differences in the appearance of objects depicted in the videos and is computationally efficient.

Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction
Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, Cordelia Schmid
Computer Vision and Pattern Recognition (CVPR), 2020

In this paper, we propose a new method for dense 3D reconstruction of hands and objects from monocular color images. We further present a self-supervised learning approach leveraging photo-consistency between sparsely supervised frames.

HoloLens 2 Research Mode as a Tool for Computer Vision Research
Dorin Ungureanu, Federica Bogo, Silvano Galliani, Pooja Sama, Xin Duan, Casey Meekhof, Jan Stuehmer, Thomas Cashman, Bugra Tekin, Johannes L. Schoenberger, Pawel Olszta, Marc Pollefeys
Tech Report, 2020

We present HoloLens 2 Research Mode, an API and a set of tools enabling access to the raw sensor streams. We provide an overview of the API and explain how it can be used to build mixed reality applications based on processing sensor data. We also show how to combine the Research Mode sensor data with the built-in eye and hand tracking capabilities provided by HoloLens 2.

Reconstructing Human Body Mesh from Point Clouds by Adversarial GP Network
Boyao Zhou, Jean-Sebastien Franco, Federica Bogo, Bugra Tekin, Edmond Boyer
Asian Conference on Computer Vision (ACCV), 2020

We study the problem of reconstructing a template-aligned human body mesh from unstructured point cloud data. We propose a dedicated human template matching process built on a point-based deep autoencoder architecture, in which the consistency of surface points is enforced and parameterized with a specialized Gaussian Process layer, and global consistency and generalization are enforced with adversarial training.

H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions
Bugra Tekin, Federica Bogo, Marc Pollefeys
Computer Vision and Pattern Recognition (CVPR), 2019 (Oral)

In this work, we propose, for the first time, a unified method to jointly recognize 3D hand and object poses, and their interactions from egocentric monocular color images. Our method jointly estimates the hand and object poses in 3D, models their interactions and recognizes the object and activity classes with a single feed-forward pass through a neural network.

Real-Time Seamless Single Shot 6D Object Pose Prediction
Bugra Tekin, Sudipta N. Sinha, Pascal Fua
Computer Vision and Pattern Recognition (CVPR), 2018
supplementary / code

We introduce a new deep learning architecture that naturally extends the single-shot 2D object detection paradigm to 6D object pose estimation. It demonstrates state-of-the-art accuracy with real-time performance and is at least 5 times faster than the existing methods (50 to 94 fps depending on the input resolution).

Learning Latent Representations of 3D Human Pose with Deep Neural Networks
Isinsu Katircioglu*, Bugra Tekin*, Mathieu Salzmann, Vincent Lepetit, Pascal Fua
International Journal of Computer Vision (IJCV), 2018

We propose an efficient Long Short-Term Memory (LSTM) network for enforcing consistency of 3D human pose predictions across temporal windows.

Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation
Bugra Tekin, Pablo Marquez-Neila, Mathieu Salzmann, Pascal Fua
International Conference on Computer Vision (ICCV), 2017
supplementary / code / project

We introduce an approach to learn where and how to fuse the streams of a two-stream convolutional neural network operating on different input modalities for 3D human pose estimation.

Fusing 2D Uncertainty and 3D Cues for Monocular Body Pose Estimation
Bugra Tekin, Pablo Marquez-Neila, Mathieu Salzmann, Pascal Fua
arXiv Preprint, arXiv:1611.05708, 2016

We propose to jointly model 2D uncertainty and leverage 3D image cues in a regression framework for reliable monocular 3D human pose estimation.

Structured Prediction of 3D Human Pose with Deep Neural Networks
Bugra Tekin*, Isinsu Katircioglu*, Mathieu Salzmann, Vincent Lepetit, Pascal Fua
British Machine Vision Conference (BMVC), 2016 (Oral)

We introduce a deep learning regression architecture for structured prediction of 3D human pose from monocular images. It relies on an overcomplete auto-encoder to learn a high-dimensional latent pose representation and to account for joint dependencies.

Direct Prediction of 3D Body Poses from Motion Compensated Sequences
Bugra Tekin, Artem Rozantsev, Vincent Lepetit, Pascal Fua
Computer Vision and Pattern Recognition (CVPR), 2016

We propose to predict the 3D human pose from a spatiotemporal volume of bounding boxes. We further propose a CNN-based motion compensation method that increases the stability and reliability of our 3D pose estimates.

Predicting People's 3D Poses from Short Sequences
Bugra Tekin, Xiaolu Sun, Xinchao Wang, Vincent Lepetit, Pascal Fua
arXiv Preprint, arXiv:1504.08200, 2015

We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Instead of computing candidate poses in individual frames and then linking them, as is often done, we regress directly from a spatio-temporal block of frames to a 3D pose in the central one.

Learning Separable Filters
Amos Sironi*, Bugra Tekin*, Roberto Rigamonti, Vincent Lepetit, Pascal Fua
Pattern Analysis and Machine Intelligence (PAMI), 2015
supplementary / code 2D / code 3D

We introduce an efficient approach to approximate a set of nonseparable convolutional filters by linear combinations of a smaller number of separable ones. We demonstrate that this greatly reduces the computational complexity at no cost in terms of performance for image recognition tasks with convolutional filters and CNNs.

Benefits of Consistency in Image Denoising with Steerable Wavelets
Bugra Tekin, Ulugbek Kamilov, Emrah Bostan, Michael Unser
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013 (Oral)

We propose a technique for improving the performance of L1-based image denoising in the steerable wavelet domain. Our technique, which we call consistency, refers to the fact that the solution obtained by the algorithm is constrained to the space spanned by the basis functions of the transform, which results in a certain norm equivalence between image-domain and wavelet-domain estimations.

(* indicates equal contribution)


Learning Robust Features and Latent Representations for Single View 3D Pose Estimation of Humans and Objects
Bugra Tekin
Ph.D. Thesis, September 2018

Learning Separable Filters with Shared Parts
Bugra Tekin
M.Sc. Thesis, June 2013


Method, System and Device for Direct Prediction of 3D Body Poses from Motion Compensated Sequence
Pascal Fua, Vincent Lepetit, Artem Rozantsev, Bugra Tekin
US Patent, Pub. No: US 2017-0316578 A1, Pub. Date: November 02, 2017


Deep Learning, TA, 2018

Computer Vision, TA, 2016, 2017

Numerical Methods for Visual Computing, TA, 2016

Programming (C/C++) / (Java), TA, 2013, 2015

Principles of Digital Communications, TA, 2013

Circuits and Systems I/II, TA, 2011, 2012, 2013
