Document Type



Doctor of Philosophy


Computer Science

First Adviser

Mooi C. Chuah


With technological advancement in embedded system design, powerful cameras have been embedded within smart phones, and wireless cameras can be easily deployed at street corners, traffic lights, big stadiums, train stations, etc. Besides, the growth of online media, surveillance, and mobile cameras have resulted in an explosion of videos being uploaded to social media sites such as Facebook and YouTube. The availability of such a vast volume of videos has attracted the computer vision community to conduct much research on human activity recognition since people are arguably the most interesting subjects of such videos. Automatic human activity recognition allows engineers and computer scientists to design smarter surveillance systems, semantically aware video indexes and also more natural human-computer interfaces. Despite the explosion of video data, the ability to automatically recognize and understand human activities is still rather limited. This is primarily due to multiple challenges inherent to the recognition task, namely large variability in human execution styles, the complexity of the visual stimuli in terms of camera motion, background clutter, viewpoint changes, etc., and the number of activities that can be recognized. In addition, the ability to predict future actions of objects based on past observed video frames is very useful. Therefore, in this thesis, we explore four designs to solve the problems we discussed earlier, namely

(1) A semantics-based deep learning model, namely SBGAR, is proposed to do group activity recognition. This model achieves higher accuracy and efficiency than existing group activity recognition methods.

(2) Despite its high accuracy, SBGAR has some limitations, namely (i) it requires a large dataset with caption information, (ii) activity recognition model is independent of the caption generation model and hence SBGAR may not perform well in some cases. To remove such limitations, we design ReHAR, a robust and efficient human activity recognition scheme. ReHAR can be used to recognize both single-person activities and group activities.

(3) In many application scenarios, merely knowing what the moving agents are doing is not sufficient. It also requires predictions of future trajectories of moving agents. Thus, we propose GRIP, a graph-based interaction-aware motion intent prediction scheme. The scheme uses a graph to represent the relationships between two objects, e.g., human joints or traffic agents, and predict the motion intents of all observed objects simultaneously.

(4) Action recognition and trajectory prediction schemes are typically deployed in resource-constrained devices. Thus, any technique that can accelerate the computation speed of our schemes is important. Hence, we propose a novel deep learning model decomposition method called DAC that is capable of factorizing an ordinary convolutional layer into two layers with much fewer parameters. DAC computes the corresponding weights for the newly generated layers directly from the weights of the original convolutional layer. Thus, no training (or fine-tuning) or any data is needed.