Abstract:
Much of our interaction with the environment is physical. We use our bodies for nonverbal expression or to augment or emphasise verbal communication. In other cases we use our bodies to execute tasks such as walking or picking up an object. A human observer can easily recognise these activities. For example, it is the job of a security officer in a supermarket to observe people and check that articles are not stolen. If a person does steal, the security officer recognises the act and takes appropriate action. The problem addressed in this study is the automatic recognition of human gestures by means of video image analysis. For this purpose, a computer-based system with recognition capabilities similar to those of a human observer is investigated. The system uses cameras that correspond to the eyes and algorithms that resemble the abilities of the human visual system. Automatic gesture recognition is a complex problem, and the focus here is to develop algorithms that solve a subset of the problem: the recognition of simple gestures such as walking and the waving of arms.

The approach taken in this dissertation is to represent body shape in camera images with a simple model called a bounding box. This model has the appearance of a rectangle that encapsulates the extremities of the human body and resembles the coarse structure of body shape. From a representation point of view, the model is an abstraction of body pose. A gesture consists of a sequence of poses, and by employing pattern recognition techniques, a sequence of pose abstractions is recognised as a gesture.

Various aspects of the bounding box model are explored in this study. Perception experiments are conducted to gain a conceptual understanding of the behaviour of the model. Two- and three-dimensional spatial representations of the model are investigated with a neural network classifier, and the model's temporal properties are examined through the use of hidden Markov models. These aspects are tested using gesture recognition systems implemented for this purpose. The gesture vocabularies of these systems range from four to ten gestures, while recognition rates vary from 84.7% to 96.3%.
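As an illustrative aside, the bounding box representation can be computed directly from a binary silhouette of the body. The sketch below is a minimal example under assumed inputs: a thresholded foreground mask is taken as given, and the function name, the test silhouette, and the use of NumPy are illustrative choices, not details of the system described in this dissertation.

    import numpy as np

    def bounding_box(mask: np.ndarray) -> tuple[int, int, int, int]:
        """Return (top, bottom, left, right) of the smallest rectangle
        enclosing all foreground (non-zero) pixels in a binary silhouette."""
        rows = np.any(mask, axis=1)   # rows that contain foreground pixels
        cols = np.any(mask, axis=0)   # columns that contain foreground pixels
        top, bottom = np.where(rows)[0][[0, -1]]
        left, right = np.where(cols)[0][[0, -1]]
        return top, bottom, left, right

    # Hypothetical example: a crude silhouette of a figure with a raised arm.
    mask = np.zeros((8, 8), dtype=np.uint8)
    mask[2:7, 3] = 1   # torso and leg column
    mask[2, 3:6] = 1   # raised arm
    print(bounding_box(mask))  # (2, 6, 3, 5)

Tracking such a rectangle frame by frame yields a sequence of coarse pose abstractions, which is the kind of input a temporal classifier such as a hidden Markov model can then label as a gesture.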