Abstract:
The amount of video content generated increases daily: three hundred hours of video content are uploaded
to YouTube every 60 seconds¹. There exists a need to sort, summarise, describe, categorise and retrieve
video data based on the content (i.e. the activities occurring in the video). Activity recognition (i.e.
automatically naming activities) is an important area for video analysis. Activity recognition has
applications in robotics, video surveillance, multimedia retrieval, behaviour analysis, disaster warning
systems and content-based browsing.
Automatically categorising activities given a video clip poses two main challenges, namely object
detection and motion learning. An activity recognition system must detect and localise the agent as
well as learn to categorise the action the agent is performing. This research hypothesises that learning
models incorporating spatial and temporal aspects from video data should outperform models that
learn only spatial or temporal features on activity recognition learning tasks. The above hypothesis is
investigated by developing two deep learning architectures for activity recognition that learn temporally
independent and dependent features respectively. [...] minima do not exist. A recurrent network (the structurally constrained gated recurrent unit, SCGRU) that adds contextual
feature learning to gated recurrent units (GRUs) is proposed. Adding contextual features stabilises the
hidden state of a GRU layer.
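
The abstract does not give the SCGRU equations, so the following is only an illustrative sketch of the general idea of adding a contextual feature to a GRU. It assumes a slowly changing context vector updated as an exponential moving average of the projected input, in the style of structurally constrained recurrent networks, which then feeds every GRU gate. The class name ContextGRUCell, the weight names B, W, U and P, the mixing constant alpha and the layer sizes are all hypothetical and are not taken from this work.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(*vs):
    # Element-wise sum of several equal-length vectors.
    return [sum(t) for t in zip(*vs)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ContextGRUCell:
    """Hypothetical cell: a standard GRU plus a slow-moving context vector.

    This is an assumed formulation for illustration, not the SCGRU
    defined by the authors.
    """

    def __init__(self, n_in, n_hid, n_ctx, alpha=0.95, seed=0):
        rng = random.Random(seed)
        self.alpha = alpha                      # context inertia (assumed fixed)
        self.B = rand_matrix(n_ctx, n_in, rng)  # input -> context projection
        # One weight set per gate: z (update), r (reset), h (candidate).
        self.W = {g: rand_matrix(n_hid, n_in, rng) for g in "zrh"}   # input
        self.U = {g: rand_matrix(n_hid, n_hid, rng) for g in "zrh"}  # hidden
        self.P = {g: rand_matrix(n_hid, n_ctx, rng) for g in "zrh"}  # context

    def _pre(self, gate, x, h, s):
        # Pre-activation: input, hidden and context contributions summed.
        return add(matvec(self.W[gate], x),
                   matvec(self.U[gate], h),
                   matvec(self.P[gate], s))

    def step(self, x, h, s):
        # Context update: exponential moving average of the projected input.
        bx = matvec(self.B, x)
        s = [(1 - self.alpha) * b + self.alpha * c for b, c in zip(bx, s)]
        # Standard GRU gates, each additionally conditioned on the context.
        z = [sigmoid(a) for a in self._pre("z", x, h, s)]
        r = [sigmoid(a) for a in self._pre("r", x, h, s)]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in self._pre("h", x, rh, s)]
        # Convex combination of the previous state and the candidate state.
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        return h, s


# Usage: run two toy "frame feature" vectors through the cell.
cell = ContextGRUCell(n_in=4, n_hid=3, n_ctx=2)
h, s = [0.0] * 3, [0.0] * 2
for frame in ([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]):
    h, s = cell.step(frame, h, s)
```

In this assumed formulation, a value of alpha close to 1 makes the context change slowly between frames, which is one plausible reading of how contextual features could steady the hidden state. The sketch does not reproduce the reported results.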
The approach taken to investigate activity recognition architectures in this research involved examining
the architectures on four benchmark datasets and analysing the results to 1) find the best model for
activity recognition, 2) examine the model’s ability to learn salient temporal features, and 3) examine
the model’s computational complexity. SCGRU-based models outperform GRU-based models on the
majority of the investigated activity recognition models and datasets.