Buğra Tekin

I have been a Research Scientist at Meta Reality Labs since 2022. Before that, I spent almost four years at the Microsoft Mixed Reality & AI Lab in Zürich. I received my PhD from the Computer Vision Laboratory of EPFL under the supervision of Prof. Pascal Fua and Prof. Vincent Lepetit. Before that, I obtained my MS degree in Electrical Engineering from EPFL in 2013 and my BS degree in Electrical & Electronics Engineering from Boğaziçi University in 2011 with high honors. I have also spent time at Microsoft Research as a research intern and at ETH Zürich as a visiting researcher. I am a recipient of the Qualcomm Innovation Fellowship Europe (2017).

Email  /  Google Scholar  /  LinkedIn

Research

I'm interested in computer vision, machine learning, deep learning, image processing, and augmented reality. Much of my research is about semantically understanding humans and objects in 3D from camera images. In particular, I work on 2D/3D human pose estimation, hand pose estimation, action recognition, human-object interaction, 3D object detection, and 6D pose estimation. In the past, I have also worked in biomedical imaging.

Publications

DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin
SIGGRAPH Asia, 2024

We introduce DiffH2O, a diffusion-based framework for synthesizing dexterous hand-object interactions. DiffH2O generates realistic hand-object motion from natural language, generalizes to unseen objects at test time, and enables fine-grained control over the motion through detailed textual descriptions.

CigTime: Corrective Instruction Generation Through Inverse Motion Editing
Qihang Fang, Chengcheng Tang, Bugra Tekin, Yanchao Yang
Neural Information Processing Systems (NeurIPS), 2024

We introduce a novel task and system for automated coaching and feedback on human motion, aimed at generating corrective instructions and guidance for body posture and movement during specific tasks.

FoundPose: Unseen Object Pose Estimation with Foundation Features
Evin Pınar Örnek, Yann Labbé, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, Tomas Hodan
European Conference on Computer Vision (ECCV), 2024

A method for 6D pose estimation of unseen rigid objects from a single RGB image without any object-specific training.

X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma
Computer Vision and Pattern Recognition (CVPR), 2024

A simple yet effective cross-modal adaptation framework for vision-language models (VLMs).

HoloAssist: An Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World
Xin Wang*, Taein Kwon*, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, Marc Pollefeys
International Conference on Computer Vision (ICCV), 2023

HoloAssist is a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment.

Context-Aware Sequence Alignment using 4D Skeletal Augmentation
Taein Kwon, Bugra Tekin, Siyu Tang, Marc Pollefeys
Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)

We propose a skeletal self-supervised learning approach that uses alignment as a pretext task. Our approach to alignment relies on a context-aware attention model that incorporates spatial and temporal context within and across sequences and a contrastive learning formulation that relies on 4D skeletal augmentations. Pose data provides a valuable cue for alignment and downstream tasks, such as phase classification and phase progression, as it is robust to different camera angles and changes in the background, while being efficient for real-time processing.

Learning to Align Sequential Actions in the Wild
Weizhe Liu, Bugra Tekin, Huseyin Coskun, Vibhav Vineet, Pascal Fua, Marc Pollefeys
Computer Vision and Pattern Recognition (CVPR), 2022

We propose an approach to align sequential actions in the wild that involve diverse temporal variations. To this end, we present a new method to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency, while allowing for variations in the order of actions. Our model accounts for both monotonic and non-monotonic sequences and handles background frames that should not be aligned. We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning.
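As a rough sketch of this idea, the snippet below computes an entropy-regularized optimal transport (Sinkhorn) plan between the frame embeddings of two sequences, with a simple quadratic prior that discourages couplings far from the rescaled diagonal. This is a minimal illustration rather than the paper's exact formulation; the function name and parameters are hypothetical.

```python
import numpy as np

def ot_align(feats_a, feats_b, prior_weight=1.0, eps=0.1, n_iters=200):
    """Soft-align two sequences of frame embeddings with entropy-regularized
    optimal transport, biased toward temporally consistent couplings."""
    # Cosine-distance cost between every pair of frames.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # shape (n, m)

    # Temporal prior: penalize transport far from the rescaled diagonal,
    # while the entropic relaxation still tolerates variations in order.
    n, m = cost.shape
    ti = np.linspace(0.0, 1.0, n)[:, None]
    tj = np.linspace(0.0, 1.0, m)[None, :]
    cost = cost + prior_weight * (ti - tj) ** 2

    # Sinkhorn iterations with uniform marginals.
    K = np.exp(-cost / eps)
    r = np.full(n, 1.0 / n)
    c = np.full(m, 1.0 / m)
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]        # soft alignment matrix
```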

H2O: Two Hands Manipulating Objects for First Person Interaction Recognition
Taein Kwon, Bugra Tekin, Jan Stuehmer, Federica Bogo, Marc Pollefeys
International Conference on Computer Vision (ICCV), 2021
project

In this paper, we propose a method for collecting a dataset of two hands manipulating objects for first-person interaction recognition. We provide a rich set of annotations, including action labels, object classes, 3D left and right hand poses, 6D object poses, camera poses, and scene point clouds. We further propose the first method to jointly recognize the 3D poses of two hands manipulating objects, as well as a novel topology-aware graph convolutional network for recognizing hand-object interactions.

Domain-Specific Priors and Meta Learning for Low-shot First-Person Action Recognition
Huseyin Coskun, Zeeshan Zia, Bugra Tekin, Federica Bogo, Nassir Navab, Federico Tombari, Harpreet Sawhney
Pattern Analysis and Machine Intelligence (PAMI), 2021

We develop an effective method for low-shot transfer learning for first-person action classification. We leverage independently trained local visual cues to learn representations that can be transferred from a source domain providing primitive action labels to a target domain with only a handful of examples.

Reconstructing and Grounding Narrated Instructional Videos in 3D
Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schoenberger, Bugra Tekin, Marc Pollefeys
arXiv preprint arXiv:2109.04409, 2021

We present a method for the 3D reconstruction of instructional videos and for localizing the associated narrations in 3D. Our method is robust to differences in the appearance of the objects depicted in the videos and is computationally efficient.

Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction
Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, Cordelia Schmid
Computer Vision and Pattern Recognition (CVPR), 2020
code

In this paper, we propose a new method for dense 3D reconstruction of hands and objects from monocular color images. We further present a self-supervised learning approach leveraging photo-consistency between sparsely supervised frames.

HoloLens 2 Research Mode as a Tool for Computer Vision Research
Dorin Ungureanu, Federica Bogo, Silvano Galliani, Pooja Sama, Xin Duan, Casey Meekhof, Jan Stuehmer, Thomas Cashman, Bugra Tekin, Johannes L. Schoenberger, Pawel Olszta, Marc Pollefeys
Tech Report, 2020
code

We present HoloLens 2 Research Mode, an API and a set of tools enabling access to the raw sensor streams. We provide an overview of the API and explain how it can be used to build mixed reality applications based on processing sensor data. We also show how to combine the Research Mode sensor data with the built-in eye and hand tracking capabilities provided by HoloLens 2.

Reconstructing Human Body Mesh from Point Clouds by Adversarial GP Network
Boyao Zhou, Jean-Sébastien Franco, Federica Bogo, Bugra Tekin, Edmond Boyer
Asian Conference on Computer Vision (ACCV), 2020

We study the problem of reconstructing a template-aligned mesh for human body estimation from unstructured point cloud data. We propose a dedicated human template matching process built on a point-based deep autoencoder architecture, in which the consistency of surface points is enforced and parameterized with a specialized Gaussian Process layer, and whose global consistency and generalization ability are enforced through adversarial training.

H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions
Bugra Tekin, Federica Bogo, Marc Pollefeys
Computer Vision and Pattern Recognition (CVPR), 2019 (Oral)

In this work, we propose, for the first time, a unified method to jointly recognize 3D hand and object poses, and their interactions from egocentric monocular color images. Our method jointly estimates the hand and object poses in 3D, models their interactions and recognizes the object and activity classes with a single feed-forward pass through a neural network.

Real-Time Seamless Single Shot 6D Object Pose Prediction
Bugra Tekin, Sudipta N. Sinha, Pascal Fua
Computer Vision and Pattern Recognition (CVPR), 2018
supplementary / code

We introduce a new deep learning architecture that naturally extends the single-shot 2D object detection paradigm to 6D object pose estimation. It achieves state-of-the-art accuracy with real-time performance and is at least five times faster than existing methods (50 to 94 fps, depending on the input resolution).
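Concretely, the network regresses the 2D image projections of the object's 3D bounding-box corners, and the 6D pose is then recovered with a PnP algorithm. Below is a minimal sketch of that final step using OpenCV, assuming the predicted corner locations and camera intrinsics are given; the helper name and the specific solver flag are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
import cv2  # opencv-python

def pose_from_corners(corners_2d, corners_3d, K):
    """Recover a 6D pose from predicted 2D projections of the object's
    3D bounding-box corners via PnP.

    corners_2d: (N, 2) predicted image coordinates, N >= 4
    corners_3d: (N, 3) corresponding corners in the object frame
    K:          (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,  # illustrative solver choice
    )
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)   # axis-angle to rotation matrix
    return R, tvec               # object-to-camera rotation and translation
```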

Learning Latent Representations of 3D Human Pose with Deep Neural Networks
Isinsu Katircioglu*, Bugra Tekin*, Mathieu Salzmann, Vincent Lepetit, Pascal Fua
International Journal of Computer Vision (IJCV), 2018

We propose an efficient Long Short-Term Memory (LSTM) network for enforcing the consistency of 3D human pose predictions across temporal windows.

Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation
Bugra Tekin, Pablo Marquez-Neila, Mathieu Salzmann, Pascal Fua
International Conference on Computer Vision (ICCV), 2017
supplementary / code / project

We introduce an approach to learn where and how to fuse the streams of a two-stream convolutional neural network operating on different input modalities for 3D human pose estimation.

Fusing 2D Uncertainty and 3D Cues for Monocular Body Pose Estimation
Bugra Tekin, Pablo Marquez-Neila, Mathieu Salzmann, Pascal Fua
arXiv preprint arXiv:1611.05708, 2016
project

We propose to jointly model 2D uncertainty and leverage 3D image cues in a regression framework for reliable monocular 3D human pose estimation.

Structured Prediction of 3D Human Pose with Deep Neural Networks
Bugra Tekin*, Isinsu Katircioglu*, Mathieu Salzmann, Vincent Lepetit, Pascal Fua
British Machine Vision Conference (BMVC), 2016 (Oral)

We introduce a deep learning regression architecture for structured prediction of 3D human pose from monocular images that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and to account for joint dependencies.

Direct Prediction of 3D Body Poses from Motion Compensated Sequences
Bugra Tekin, Artem Rozantsev, Vincent Lepetit, Pascal Fua
Computer Vision and Pattern Recognition (CVPR), 2016
project

We propose to predict the 3D human pose from a spatiotemporal volume of bounding boxes. We further propose a CNN-based motion compensation method that increases the stability and reliability of our 3D pose estimates.

Predicting People's 3D Poses from Short Sequences
Bugra Tekin, Xiaolu Sun, Xinchao Wang, Vincent Lepetit, Pascal Fua
arXiv preprint arXiv:1504.08200, 2015

We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Instead of computing candidate poses in individual frames and then linking them, as is often done, we regress directly from a spatio-temporal block of frames to a 3D pose in the central one.

Learning Separable Filters
Amos Sironi*, Bugra Tekin*, Roberto Rigamonti, Vincent Lepetit, Pascal Fua
Pattern Analysis and Machine Intelligence (PAMI), 2014
supplementary / code 2D / code 3D

We introduce an efficient approach to approximate a set of nonseparable convolutional filters by linear combinations of a smaller number of separable ones. We demonstrate that this greatly reduces the computational complexity at no cost in terms of performance for image recognition tasks with convolutional filters and CNNs.
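As a toy illustration of the underlying idea (the paper learns a shared separable basis across an entire filter bank rather than decomposing each filter independently), a truncated SVD gives the best approximation of a single 2D kernel by a sum of separable, rank-1 filters:

```python
import numpy as np

def separable_approximation(kernel, rank):
    """Approximate a 2D filter as a sum of `rank` separable filters
    via truncated SVD."""
    U, s, Vt = np.linalg.svd(kernel, full_matrices=False)
    # Each term s[k] * outer(U[:, k], Vt[k]) is separable: convolving
    # with it amounts to a 1D column pass followed by a 1D row pass.
    return sum(s[k] * np.outer(U[:, k], Vt[k]) for k in range(rank))

# Example: an oriented Gaussian-cosine kernel. This particular kernel
# is exactly rank 2, so the relative error printed below is near zero;
# generic learned filters need more separable terms.
y, x = np.mgrid[-7:8, -7:8]
kernel = np.exp(-(x**2 + y**2) / 18.0) * np.cos(0.6 * x + 0.4 * y)
approx = separable_approximation(kernel, rank=2)
print(np.linalg.norm(kernel - approx) / np.linalg.norm(kernel))
```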

Benefits of Consistency in Image Denoising with Steerable Wavelets
Bugra Tekin, Ulugbek Kamilov, Emrah Bostan, Michael Unser
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013 (Oral)

We propose a technique for improving the performance of L1-based image denoising in the steerable wavelet domain. Our technique, which we call consistency, refers to the fact that the solution obtained by the algorithm is constrained to the space spanned by the basis functions of the transform, which results in a certain norm equivalence between image-domain and wavelet-domain estimations.

(*: indicates equal contribution)

Theses

Learning Robust Features and Latent Representations for Single View 3D Pose Estimation of Humans and Objects
Bugra Tekin
Ph.D. Thesis, September 2018

Learning Separable Filters with Shared Parts
Bugra Tekin
M.Sc. Thesis, June 2013

Patents

Gesture recognition based on likelihood of interaction
US Patent App. 17/649,659

Multi-modal sensor based process tracking and guidance
US Patent App. 17/377,152

Action recognition
US Patent App. 17/155,013

Action classification based on manipulated object movement
US Patent 11,106,949

Predicting three-dimensional articulated and target object pose
US Patent 11,004,230

Spatially consistent representation of hand motion
US Patent App. 16/363,964

Method, System and Device for Direct Prediction of 3D Body Poses from Motion Compensated Sequence
US Patent App. 2017-0316578 A1

Teaching

Deep Learning, TA, 2018

Computer Vision, TA, 2016, 2017

Numerical Methods for Visual Computing, TA, 2016

Programming (C/C++, Java), TA, 2013, 2015

Principles of Digital Communications, TA, 2013

Circuits and Systems I/II, TA, 2011, 2012, 2013


pronunciation of my name, Buğra / website template from Jon Barron