ARCA - Analysis and Representation of Complex Activities in Videos

Prof. Dr. Juergen Gall

The goal of the ARCA project has been to automatically analyze human activities observed in videos, a capability that is the basis for novel applications. Such an analysis could be used to create short videos that summarize daily activities to support patients suffering from Alzheimer's disease. It could also be used in education, e.g., by providing a trainee in a hospital with a video analysis that shows whether the tasks have been executed correctly. The analysis of complex activities in videos, however, is very challenging: activities vary in duration from minutes to hours, involve interactions with several objects that change their appearance and shape, e.g., food during cooking, and are composed of many sub-activities, which can happen at the same time or in various orders.

While most recent work in action recognition has focused on developing better feature encoding techniques for classifying sub-activities in short video clips of a few seconds, this project aimed to develop a higher-level representation of complex activities to overcome the limitations of current approaches. This includes handling large variations in duration and recognizing and localizing complex activities in videos. A second objective of the project has been to learn a representation from videos that is not limited to a specific application but can be reused and adapted to a new setting. The third objective has been to synthesize human motion or poses from just a list of human actions or a textual description, demonstrating that the model can not only interpret data but also generate it.

Publications

Li S., Zhou Y., Yi J., and Gall J., Spatial-Temporal Consistency Network for Low-Latency Trajectory Forecasting (PDF, Supplementary Material), International Conference on Computer Vision (ICCV'21), To appear.

Behrmann N., Fayyaz M., Gall J., and Noroozi M., Long Short View Feature Decomposition via Contrastive Video Representation Learning (PDF, Supplementary Material), International Conference on Computer Vision (ICCV'21), To appear.

Biswas S. and Gall J., Multiple Instance Triplet Loss for Weakly Supervised Multi-Label Action Localisation of Interacting Persons (PDF), Understanding Social Behavior in Dyadic and Small Group Interactions Workshop, To appear.

Souri Y., Fayyaz M., Minciullo L., Francesca G., and Gall J., Fast Weakly Supervised Action Segmentation Using Mutual Consistency (PDF, Code), IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. ©IEEE

Li Z., Abu Farha Y., and Gall J., Temporal Action Segmentation from Timestamp Supervision (PDF, Supplementary Material, Code), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'21), To appear.

Fayyaz M., Bahrami E., Diba A., Noroozi M., Adeli E., van Gool L., and Gall J., 3D CNNs with Adaptive Temporal Feature Resolutions (PDF, Supplementary Material, Code), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'21), To appear.

Zatsarynna O., Abu Farha Y., and Gall J., Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos (PDF), IEEE Workshop on Precognition: Seeing through the Future, To appear.

Li S., Yi J., Abu Farha Y., and Gall J., Pose Refinement Graph Convolutional Network for Skeleton-based Action Recognition (PDF, Code), IEEE Robotics and Automation Letters (RA-L), Vol. 6, No. 2, 1028-1035, 2021. ©IEEE

Sushko V., Schönfeld E., Zhang D., Gall J., Schiele B., and Khoreva A., You Only Need Adversarial Supervision for Semantic Image Synthesis (PDF, Code), International Conference on Learning Representations (ICLR'21), 2021.

Behrmann N., Gall J., and Noroozi M., Unsupervised Video Representation Learning by Bidirectional Feature Prediction (PDF), Winter Conference on Applications of Computer Vision (WACV'21), 1669-1678, 2021. ©IEEE

Biswas S. and Gall J., Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting (PDF, Supplementary Material, Code), Asian Conference on Computer Vision (ACCV'20), Springer, LNCS 12626, 547-561, 2021. ©Springer-Verlag

Kwon O.-H., Tanke J., and Gall J., Recursive Bayesian Filtering for Multiple Human Pose Tracking from Multiple Cameras (PDF), Asian Conference on Computer Vision (ACCV'20), Springer, LNCS 12623, 438-453, 2021. ©Springer-Verlag

Li S., Abu Farha Y., Liu Y., Cheng M.-M., and Gall J., MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation (PDF, MS-TCN Code, MS-TCN++ Code), IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. ©IEEE

Abu Farha Y., Ke Q., Schiele B., and Gall J., Long-Term Anticipation of Activities with Cycle Consistency (PDF, Supplementary Material), DAGM German Conference on Pattern Recognition (GCPR'20), Springer, LNCS 12544, 159-173, 2021. ©Springer-Verlag

Zhang Y., Briq R., Tanke J., and Gall J., Adversarial Synthesis of Human Pose From Text (PDF, Supplementary Material), DAGM German Conference on Pattern Recognition (GCPR'20), Springer, LNCS 12544, 145-158, 2021. ©Springer-Verlag

Rafi U., Doering A., Leibe B., and Gall J., Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos (PDF, Supplementary Material), European Conference on Computer Vision (ECCV'20), Springer, LNCS 12365, 36-52, 2020. ©Springer-Verlag

Diba A., Fayyaz M., Sharma V., Paluri M., Gall J., Stiefelhagen R., and van Gool L., Large Scale Holistic Video Understanding (PDF, Supplementary Material, Data), European Conference on Computer Vision (ECCV'20), Springer, LNCS 12350, 593-610, 2020. ©Springer-Verlag

Fayyaz M. and Gall J., SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation (PDF, Code), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'20), 498-507, 2020. ©IEEE

Kuehne H., Richard A., and Gall J., A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation (PDF, Code), IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 42, No. 4, 765-779, 2020. ©IEEE

Panareda Busto P., Iqbal A., and Gall J., Open Set Domain Adaptation for Image and Action Recognition (PDF, Supplementary Material, Slides, Code), IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 42, No. 2, 413-429, 2020. ©IEEE

Ruiz A. H., Gall J., and Moreno-Noguer F., Human Motion Prediction via Spatio-Temporal Inpainting (PDF), International Conference on Computer Vision (ICCV'19), 7133-7142, 2019. ©IEEE

Abu Farha Y. and Gall J., Uncertainty-Aware Anticipation of Activities (PDF), International Workshop on Human Behaviour Understanding, 1197-1204, 2019. ©IEEE

Richard A., Iqbal A., and Gall J., Enhancing Temporal Action Localization with Transfer Learning from Action Recognition (PDF), Workshop and Challenge on Comprehensive Video Understanding in the Wild, 1533-1540, 2019. ©IEEE

Sawatzky J., Banerjee D., and Gall J., Harvesting Information from Captions for Weakly Supervised Semantic Segmentation (PDF), Workshop on Cross-Modal Learning in Real World, 4481-4490, 2019. ©IEEE

Iqbal A. and Gall J., Level Selector Network for Optimizing Accuracy-Specificity Trade-offs (PDF), International Workshop on Large Scale Holistic Video Understanding, 1466-1473, 2019. ©IEEE

Panareda Busto P. and Gall J., Joint Viewpoint and Keypoint Estimation with Real and Synthetic Data (PDF, Code, Supplementary Material), German Conference on Pattern Recognition (GCPR'19), Springer, LNCS 11824, 107-121, 2019. ©Springer-Verlag

Tanke J. and Gall J., Iterative Greedy Matching for 3D Human Pose Tracking from Multiple Views (PDF, Code), German Conference on Pattern Recognition (GCPR'19), Springer, LNCS 11824, 537-550, 2019. ©Springer-Verlag

Biswas S., Souri Y., and Gall J., Hierarchical Graph-RNNs for Action Detection of Multiple Activities (PDF), IEEE International Conference on Image Processing (ICIP'19), 1-5, 2019. ©IEEE

Thoker F. and Gall J., Cross-modal Knowledge Distillation for Action Recognition (PDF), IEEE International Conference on Image Processing (ICIP'19), 6-10, 2019. ©IEEE

Abu Farha Y. and Gall J., MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation (PDF, Code), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'19), 3570-3579, 2019. ©IEEE

Sawatzky J., Souri Y., Grund C., and Gall J., What Object Should I Use? - Task Driven Object Detection (PDF, Supplementary Material, Code/Data), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'19), 7597-7606, 2019. ©IEEE

Kukleva A., Kuehne H., Sener F., and Gall J., Unsupervised Learning of Action Classes with Continuous Temporal Embedding (PDF, Supplementary Material, Code), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'19), 12058-12066, 2019. ©IEEE

Sabokrou M., Pourreza M., Fayyaz M., Entezari R., Fathy M., Gall J., and Adeli E., AVID: Adversarial Visual Irregularity Detection (PDF, Code), Asian Conference on Computer Vision (ACCV'18), Springer, LNCS 11269, 169-184, 2018. ©Springer-Verlag

Briq R., Moeller M., and Gall J., Convolutional Simplex Projection Network for Weakly Supervised Semantic Segmentation (PDF, Code), British Machine Vision Conference (BMVC'18), 2018.

Doering A., Iqbal U., and Gall J., Joint Flow: Temporal Flow Fields for Multi Person Tracking (PDF), British Machine Vision Conference (BMVC'18), 2018.

Rafi U., Gall J., and Leibe B., Direct Shot Correspondence Matching (PDF), British Machine Vision Conference (BMVC'18), 2018.

Iqbal U., Molchanov P., Breuel T., Gall J., and Kautz J., Hand Pose Estimation via Latent 2.5D Heatmap Regression (PDF), European Conference on Computer Vision (ECCV'18), Springer, LNCS 11215, 125-143, 2018. ©Springer-Verlag

Diba A., Fayyaz M., Sharma V., Arzani M., Yousefzadeh R., Gall J., and van Gool L., Spatio-Temporal Channel Correlation Networks for Action Classification (PDF), European Conference on Computer Vision (ECCV'18), Springer, LNCS 11208, 299-315, 2018. ©Springer-Verlag

Iqbal U., Doering A., Yasin H., Krüger B., Weber A., and Gall J., A Dual-Source Approach for 3D Human Pose Estimation from Single Images (PDF, Code), Computer Vision and Image Understanding, Vol. 172, 37-49, Elsevier, 2018. ©Elsevier

Richard A., Kuehne H., Iqbal A., and Gall J., NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning (PDF, Code), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18), 7386-7395, 2018. ©IEEE

Abu Farha Y., Richard A., and Gall J., When will you do what? - Anticipating Temporal Occurrences of Activities (PDF, Code, Video), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18), 5343-5352, 2018. ©IEEE

Richard A., Kuehne H., and Gall J., Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints (PDF, Code), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18), 5987-5996, 2018. ©IEEE

Andriluka M., Iqbal U., Insafutdinov E., Pishchulin L., Milan A., Gall J., and Schiele B., PoseTrack: A Benchmark for Human Pose Estimation and Tracking (PDF, Data), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18), 5167-5176, 2018. ©IEEE

Biswas S. and Gall J., Structural Recurrent Neural Network (SRNN) for Group Activity Analysis (PDF), IEEE Winter Conference on Applications of Computer Vision (WACV'18), 1625-1632, 2018. ©IEEE

Kuehne H., Richard A., and Gall J., Weakly Supervised Learning of Actions from Transcripts (PDF), Computer Vision and Image Understanding, Special Issue on Language in Vision, Vol. 163, 78-89, Elsevier, 2017. ©Elsevier

Panareda Busto P. and Gall J., Open Set Domain Adaptation (PDF, Supplementary Material, Slides, Code), International Conference on Computer Vision (ICCV'17), 754-763, 2017. ©IEEE (Marr Prize Honorable Mention)

Iqbal A., Richard A., Kuehne H., and Gall J., Recurrent Residual Learning for Action Recognition (PDF), German Conference on Pattern Recognition (GCPR'17), Springer, LNCS 10496, 126-137, 2017. ©Springer-Verlag

Richard A., Kuehne H., and Gall J., Weakly Supervised Action Learning with RNN based Fine-to-Coarse Modeling (PDF, Code), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'17), 1273-1282, 2017. ©IEEE

Iqbal U., Milan A., and Gall J., PoseTrack: Joint Multi-Person Pose Estimation and Tracking (PDF, Data/Code, PoseTrack Challenge), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'17), 4654-4663, 2017. ©IEEE

Iqbal U., Garbade M., and Gall J., Pose for Action - Action for Pose (PDF, Code), IEEE International Conference on Automatic Face and Gesture Recognition (FG'17), 438-445, 2017. ©IEEE

Richard A. and Gall J., A Bag-of-Words Equivalent Recurrent Neural Network for Action Recognition (PDF, Code), Computer Vision and Image Understanding, Special Issue on Image and Video Understanding in Big Data, Vol. 156, 79-91, Elsevier, 2017. ©Elsevier

Iqbal U. and Gall J., Multi-Person Pose Estimation with Local Joint-to-Person Associations (PDF), International Workshop on Crowd Understanding, Springer, LNCS 9914, 627-642, 2016. ©Springer-Verlag

Rafi U., Kostrikov I., Gall J., and Leibe B., An Efficient Convolutional Network for Human Pose Estimation (PDF, Code), British Machine Vision Conference (BMVC'16), 2016.

Garbade M. and Gall J., Handcrafting vs Deep Learning: An Evaluation of NTraj+ Features for Pose Based Action Recognition (PDF), Workshop on New Challenges in Neural Computation and Machine Learning (NC2), 2016.

Software

Fast Weakly Supervised Action Segmentation Using Mutual Consistency

Temporal Action Segmentation from Timestamp Supervision

3D CNNs with Adaptive Temporal Feature Resolutions

Pose Refinement Graph Convolutional Network for Skeleton-based Action Recognition

You Only Need Adversarial Supervision for Semantic Image Synthesis

Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting

SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation

MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation

MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation

Joint Viewpoint and Keypoint Estimation with Real and Synthetic Data

Iterative Greedy Matching for 3D Human Pose Tracking from Multiple Views

What Object Should I Use? - Task Driven Object Detection

Unsupervised Learning of Action Classes with Continuous Temporal Embedding

AVID: Adversarial Visual Irregularity Detection

Convolutional Simplex Projection Network for Weakly Supervised Semantic Segmentation

NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning

When will you do what? - Anticipating Temporal Occurrences of Activities

Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints

Open Set Domain Adaptation

Weakly Supervised Action Learning with RNN based Fine-to-Coarse Modeling

PoseTrack: Joint Multi-Person Pose Estimation and Tracking

Pose for Action - Action for Pose

An Efficient Convolutional Network for Human Pose Estimation

Data

Large Scale Holistic Video Understanding

What Object Should I Use? - Task Driven Object Detection

PoseTrack Challenge

PoseTrack: Joint Multi-Person Pose Estimation and Tracking

Members

Principal Investigator:
Prof. Dr. Juergen Gall

Postdocs:
Hildegard Kühne
Umer Rafi

Ph.D. students:
Shi-Jie Li
Julian Tanke
Andreas Doering
Yazan Abu Farha
Rania Briq
Mohsen Fayyaz
Yaser Souri
Mian Ahsan Iqbal
Sovan Biswas
Fadime Sener
Alexander Richard
Umar Iqbal