WO2011159258A1

WO2011159258A1 - Method and system for classifying a user's action

Info

Publication number: WO2011159258A1
Application number: PCT/SG2011/000214
Authority: WO
Inventors: Susanto Rahardja; Farzam Farbiz; Miaolong Yuan
Original assignee: Agency For Science, Technology And Research
Priority date: 2010-06-16
Filing date: 2011-06-16
Publication date: 2011-12-22

Abstract

A method and system for classifying a user's action. The method comprises the steps of capturing video data of the user's action; generating an input signal indicative of the user's action based on the captured video data; and applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.

Description

METHOD AND SYSTE FOR CLASSIFYING A USER'S ACTION

FIELD OF INVENTION

The present; invention relates broadly to a method and system for classifying a user's action, and to a method and system for training a Restricted Coulomb Energy (RCE) neural network for classifying a users action.

BACKGROUND

Currently, in many video entertainment and training systems, it is desirable to provide simulated representations that match with respective user profiles or requirements. Jn one example, a user playing a basketball video game may want the character he is controlling to move the same way on the video screen as his favourite real-life player simply by mimicking said player's trade-mark moves. In another example, a user singing karaoke / dancing may want an interactive display of a singer/dancer partner that moves along with his movements. However, existing attempts at creating such a system have either been unsuccessful or produced unsatisfactory results.

In some existing approaches, verbal interaction [1 ,2] or sensor-based interactive tools [3, 4] are used for interacting with virtual humans. One such approach [2] shows a character-based interactive storytelling system using a speech-based interaction .tool. Another real-time interactive approach [1] allows musicians to interact with a synthetic character. The virtual character responses to musical inputs as if the musicians are conducting a live musical performance. Another approach [3] uses a Nintendo Wii Remote as an interaction tool to provide related feedback for interacting with virtual agents. Yet another approach [4] uses motion sensors for tracking the user's body movements in order to have full body interaction with the virtual characters. For example, to interact with virtual humans through body actions, several sensors are attached to the participant's body and used to control the motion of an avatar. However, in general, speech-based interactive tools may not be a good choice in noisy environments. Sensor-based interaction tools typically lack flexibility due to spatial constraints from the environments.

In contrast, vision-based methods provide a potential tool that can be used to interact with the virtual characters without any sensors attached on the user. Most existing vision-based methods use gesture recognition techniques to interact with the virtual characters [5], However, in some: applications, gesture is not able to provide sufficient information to naturally interact with virtual characters. Other computer vision-based, approaches focus on full body pose recovery or robust human action; recognition methods for the purpose of interacting with virtual humans [6,7].

For example, one vision-based approach [6] comprises an interactive dancing system in which the 2-dimensional (2D) location of the center of mass (CoM) of the silhouette is used to detect related dance actions. However, under fast changing and low lighting conditions, such as a night club, the CoM detector is prone to wrong tracking results. In another vision-based interface system [7], a human can control the animation of a virtual character. The system uses computer vision and machine learning techniques to recover the user's full body 3D pose from silhouettes obtained from three cameras, and the virtual character follows the user's pose in three dimensions. However, this method is also not suitable for the case where the lighting conditions are low and fast changing. Moreover, due to the overall system delay (from capturing the video to rendering the animation), it is not able to provide a face-to-face live interaction with the virtual character.

In other virtual dance synthesis techniques [8,9, 10,1 1], different methods are used to generate real-time dancing animation from motion capture data. Captured sequences are segmented into basic moves based on the analysis of e.g. the music rhythm and the motion beats, which are assembled into sequences using motion graphs, and the new dance is aligned to the beat structure of the music. In one such technique [10], motion beats are estimated from unlabeled dance motion captures. Basic movements, in which the motion beats are coincident with the known music rhythmic pattern (e.g. Waltz, and Tango), may be obtained. The basic movements are then grouped into prototype movements using clustering techniques. The transition probabilities among the prototype movements are established based on kinematic continuity and behavioral continuity. However, motion beats may not be used for obtaining basic movements from modern dance (e.g. club dance) due to the lack of standardized sets of basic movements.

A need therefore exists to provide a method and system for classifying a users action that seek to address at least some of the above problems.

References:

Robyn Taylor, Daniel Torres and Pierre Boulanger, Using music to interact with a virtual character, Proceedings of the International Conference of New Interfaces for Musical Expression, Vancouver, 2005.

Marc Cavazza, Fred Charles and Steven J. Mead, Interacting with virtual characters in interactive storytelling, AAMSS 2002, Bologna, Italy.

3. Aaron Kotranza, Kyle Johnsen, Juan Cendan, Bayard Miller, D. Scott Lind, and Benjamin Lok, Virtual multi-tools for hands and tool-based interaction with life-size virtual human agents, IEEE Symposium on 3D user Interfaces, 2009, Louisiana, USA.

4. Luc Emering, Ronan Boulic and Daniel Thaimann, Interacting with virtual humans through body actions, IEEE Computer Graphics & Applications,

^" 1998, 8-11.

5. Selim Balcisoy, Daniel Thaimann, interaction between Real and Virtual Humans in Augmented Reality, Proc. Computer Animation'97, IEEE CS Press, 1997, pp. 31-38.

6. Dennis Reidsma, Herwin van Welbergen, Ronald Poppe, Pieter Bos, and Anton Nijholt, Toward Bi-directional Dancing Interaction Proc. of 5th International Conference on Entertainment Computing, 1-12, 2006. 7. Liu Ren, Gregory Shakhnarovich, Jessica Hodgins, Hanspeter Pfister and Paul Viola, Learning Silhouette Features for Control of Human Motion, ACM Transactions on Graphics, Vol.24(4), October, 2005, pp1303-1331.

8. Shiratori, T., Nakazawa, A., Ikeuchi, K.: Rhythmic motion analysis using motion capture and musical information. In: Proc. of 2003 IEEE International Conference on Multisensor Fusion and Integration for intelligent Systems, (2003) 89-94.

9. Nakazawa, A., Nakaoka, S., Kudoh, S., Ikeuchi, K.: Digital archive of human dance motions. In: Proceedings of the International Conference on Virtual Systems and Multimedia (VSMM2O02).

10. Kim, T., II Park, S., Yong Shin, S.: Rhythmic-motion synthesis based on motion-beat analysis. ACM Transactions on Graphics 22(3) (2003) 392—4-01.

11. KOVAR, L, GLEICHER, M„ AND PIGHIN, F. 2002. Motion graphs.

Transactions on Graphics 21 , 3.

SUMMARY

In accordance with a first aspect of the present invention, there is provided a method for classifying a user's action, the method comprising the steps of:

capturing video data of the user's action;

generating an input signal indicative of the user's action based on the captured video data; and

applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.

The step of generating the input signal indicative of the user's action may comprise:

extracting a motion history image of the user's action at each frame of a video input signal; and calculating gradient information based on the extracted motion history image.

The gradient information may be calculated based on interior pixels of the motion history image. he input signal indicative of the user's action may comprise a motion vector generated based on the calculated gradient information.

A number of input layer cells of the RCE may be equal to a number of dimensions of the motion vector, and applying the input signal to the RCE may comprise applying respective values for each dimension.

A number of output layer cells of the RCE may be equal to a number of the basic actions, and classifying the user's action may comprise determining the output layer cell with a maximum conditional probability.

Calculating gradient information based on the extracted motion history image may comprise:

generating respective histograms of gradient directions for each of a plurality of overlapping regions representing different regions of the user's body; and

combining the histograms into a single histogram.

The basic actions may be selected from a group consisting of "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".

In accordance with a second aspect of the present invention, there is provided a system for classifying a user's action, comprising:

means for capturing video data of the user's action;

means for generating an input signal indicative of the user's action based on the captured video data; and

a Restricted Coulomb Energy (RCE) neural network classifier configured to receive the input signal for classifying the user's action into basic actions. The means for generating the input signal indicative of the user's action may extract a motion history image of the user's action at each frame of a video input signal; and calculate gradient information based on the extracted motion history image.

The gradient information may be calculated based on interior pixels of the motion history image.

The input signal indicative of the user's action may comprise a motion vector generated based on the calculated gradient information.

A number of input layer cells of the RCE may be equal to a number of dimensions of the motion vector, and the RCE may be configured to receive respective values for each dimension.

A number of output layer cells of the RCE may be equal to a number of the basic actions, and the RCE may determine the output layer cell with a maximum conditional probability for classifying the user's action.

The means for generating the input signal indicative of the users action may calculate the gradient information by generating respective histograms of gradient directions for each of a plurality of overlapping regions representing different regions of the user's body; and combining the histograms into a single histogram. The basic actions may be selected from a group consisting of "move left",

"move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".

In accordance with a third aspect of the present invention, there is provided a computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for classifying a user's action, the method comprising the steps of:

capturing video data of the user's action; generating an input signal indicative of the user's action based on the captured video data; and

applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions. in accordance with a fourth aspect of the present invention, there is provided a method for training an RCE neural network for classifying a user's action, the method comprising the steps of:

capturing training video data;

classifying the captured training video data into basic actions;

generating a training input signal based on the captured training video data for applying to the RCE neural network; and

training the. RCE neural network to recognize the training input signal as one of the classified basic actions.

In accordance with a fifth aspect of the present invention, there is provided a system for training an RCE neural network for classifying a user's action, the system comprising:

means for capturing training video data;

means for classifying the captured training video data into basic actions;

means for generating a training input signal based on the captured training video data for applying to the RCE neural network; and

means for training the RCE neural network to recognize the training input signal as one of the classified basic actions.

In accordance with a sixth aspect of the present invention, there is provided a computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for training an RCE neural network for classifying a user's action, the method comprising the steps of:

capturing training video data;

classifying the captured training video data into basic actions:

generating a training input signal based on the captured training video data for applying to the RCE neural network; and training the RCE neural network to recognize the training input signal as one of the classified basic actions.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:

Figure 1(a) shows a flow chart illustrating a training stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment. Figure 1 (b) shows a flow chart illustrating an online stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment.

Figure 2 shows a flow chart illustrating a method of action recognition in accordance with an example embodiment.

Figure 3 shows a schematic diagram illustrating the structure of an RCE neural network according to an example embodiment. Figures 4(a)-(c) show images illustrating extraction of a Motion History Image from a moving silhouette according to an example embodiment.

Figure 4(d) shows a schematic diagram illustrating overlapping windows for generating a motion vector according to an example embodiment.

Figure 4(e) shows a motion vector corresponding to the input image of Figure 4(a). Figure 5 shows a flow chart illustrating a method of activating a prototype cell in an RCE neural network according to an example embodiment.

Figure 6 shows a schematic diagram illustrating sorting the motion clips based on a cumulative distribution function.

Figure 7 shows a flow chart illustrating a method for classifying a user's action according to an example embodiment.

Figure 8 shows a flow chart illustrating a method for training a Restricted Coulomb Energy (RCE) neural network for classifying a user's action according to an example embodiment.

Figure 9 shows a schematic diagram illustrating a computer system suitable for implementing the methods and systems of the example embodiments.

DETAILED DESCRIPTION The example embodiments provide a method and system for recognizing actions of a human subject in a human computer interaction system, e.g. recognizing actions of the human subject in a vision-based interactive dancing simulation. An interactive output, e.g. a display, can be generated based on the recognized actions. Figure 1 (a) shows a flow chart 10OA illustrating a training stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment. At step 102a, input music feature is provided, e.g. for providing a time reference to training data. At step 102b, dancing motion data is captured, e.g. from training videos that are manually provided. At step 104, the dancing motion in the training videos is segmented into smaller motion clips based on the time reference obtained from the music. At step 106, the segmented motion clips are manually iabeied, i.e. the action class in each frame of a respective training motion clip is known. At step 108, the labeled motion clips are provided to a motion clip database. At step 110, features in a respective motion clip are extracted. At step 112, a Restricted Coulomb Energy (RCE) network is trained based on the extracted features and the labeled action classes (to be discussed in detail below).

Figure 1(b) shows a flow; chart 100B illustrating an online stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment. At step 122, incoming video frames of a dancing user are captured by a camera. At step 124, features in each of video frames are extracted. At step 126, the action of the person is recognized in real-time by the trained ^"RCE action classifier (described above with respect to Figure 1(a)). At step 208, one or more motion clips from the database is retrieved from the database based on the recognized action over a predetermined period, as separately provided by the dance music being played. -At step 210, the retrieved motion clips are rendered for displaying as the interactive output of the virtual dance partner. In the example embodiments, the virtual partner is displayed to the user in the form of e.g. a. 3-dimensional (3D) display. However, it will be appreciated that other forms of display, a 2-dimensional display, a holographic projection, etc may be used in alternate embodiments. This gives the user an impression of dancing together with the virtual partner. The selection process of the suitable motion clip in the example embodiments is based on a similarity measurement that affects the selection probability value of each clip, as discussed in detail below.

Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.

Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "scanning", "calculating", "determining", "replacing", "generating", "initializing", "outputting", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.

The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, ; or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.

In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.

Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer readable medium may aiso include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.

In the example embodiments, nine basic human actions are pre-defined, i.e. "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up". Every frame of every motion clip from the motion clip database is manually classified into one of these nine basic actions and grouped together into a vector called "motion vector" of that motion clip. These motion vectors are used when the system is in use or in operation for the probability- based selection process.

As described above, training videos from different users are recorded and manually classified, in the example embodiments, a Restricted Coulomb Energy (RCE) neural network is then employed: to learn these basic actions using the training video data. When the system is in use or in operation, the trained RCE neural network is used to classify the user's movement at each video frame.

Figure 2 shows a flow chart 200 illustrating a method of action recognition in accordance with an example embodiment. As can be seen from Figure 2, in the example embodiment, the method of action recognition comprises motion vector extraction, e.g. using motion history images (step 202), training a RCE neural network (step 204) and real-time or online action recognition, e.g. of a human, using the learned RCE neural network (step 206). Figure 3 shows a schematic diagram illustrating the structure of an RCE neural network 300 according to an example embodiment. The RCE neural network 300 comprises three layers of neuron cells, with a full set of connections between the first and second layers (i.e. input layer 310 and prototype layer 320), and a partial set of connections between the second and third layers (prototype layer 320 and output layer 330), as shown in Figure 3. The input layer 310 includes a plurality of input cells 312, which is indicative of the dimension of the input vector. The middle layer cells are called prototype cells 322. Each prototype cell 222 contains a motion vector that occurs in the training data, and each output cell 332 on the output layer 330 corresponds to a different action class label presented in the training data set. in the example embodiments, the prototype cells 322 comprise five parameters, i.e. class C, weight vector ύ) , cell threshold λ , pattern count n, and smoothing factor σ . The weight vector ω represents a set of weighted connections between the prototype cells 322 and each of the input layer cells 312. The cell threshold λ describes a hyper- spherical region of influence around the prototype cells 322. The size of the region of influence around the prototype cells 322 in the example embodiments can be adjusted. The pattern count n indicates the number of times that a prototype cell 322 has responded to the input motion vectors submitted to the RCE neural network 300. The smoothing factor σ represents a radial decaying coefficient of the hyper-spherical influence field. The output layer 330 comprises the action categories (e.g. the 9 basic actions). As shown in Figure 3, i the RCE neural network of the example embodiments, each prototype cell 322 connects to only one output cell 332. In the example embodiments, the training procedure of the RCE neural network

300 (Figure 3) makes use of three mechanisms: prototype cell commitment, threshold modification and pattern count increment.

First, given the training action video data, the motion vectors are extracted and the corresponding action class labels are manually associated/labelled.

In an example embodiment, a histogram of the gradients of the users motion at each frame as calculated from Motion History Images (MHI) is used to describe the user's actions. It will be appreciated that even under relatively low lighting conditions, MHI can be extracted effectively. Additionally, in order to improve reliability, the gradient information is calculated in the example embodiment by only examining the interior pixels of the MHI, e.g. pixels inside of the MHI.

Figures 4(a)-(c) show images illustrating extraction of a Motion History Image from a moving silhouette according to an example embodiment. Here, Figure 4(a) shows an input image frame. Figure 4(b) shows the MHI of the input image frame of Figure 4(a). Figure 4(c) shows the pixels of the valid orientations of the MHI. In some example embodiments, the motion histogram of the orientations of the MHI is obtained by quantizing the gradient directions from the MHI into multiples of e.g. 5 degrees. Thus, each motion vector has e.g. 72 dimensions when the MHI is considered a single region. In addition, to handle changes in scale, the motion histogram is normalized by the sum of all the vaiid motion orientation pixels in the example embodiment. The resulting motion vector is then applied as the input of the RCE neural network.

In another embodiment as illustrated by Figures 4(d)-(e), from the MHI, the gradients of motion, i.e. the directional information (rotation), of the MHI are calculated. In order to obtain the motion vector of different regions around the body, the MHI is "divided^' into nine regions (or windows). Figure -4(d) shows a schematic diagram " illustrating overlapping windows for generating a motion vector according to an example embodiment. Each regio (dark portions in Figure 4(d)) is represented by a histogram of the motion orientations. To generate the histogram for these regions, the gradient directions from the MHI are quantized into multiple of 30 degrees, resulting in a histogram with 12 bins. To handle changes in scale, these histograms are normalized by the sum of all the motion orientation pixels found in the gradient map in each region. By combing these 9 histograms, a 9x 12 = 108 dimensional vector is obtained in the example embodiment. The thus obtained motion vector, in the form of a motion histogram, is shown in Figure 4(e).

Next, prototype cells of the RCE network are activated. Figure 5 shows a flow chart 500 illustrating a method of activating a prototype cell in an RCE neural network according to an example embodiment. At step 502, for an input motion vector X belonging to a class C_k , a response from existing prototype cells is checked. If the input motion vector X does not trigger any response from the existing prototype cells, at step 504, a new prototype cell is created. At step 506, the new prototype cell / is connected to an output cell k representing the class C_k . At step 508, the current input motion vector is loaded as the weight vector of the new prototype cell . On the other hand, if the input motion vector X belonging to a class C_k triggers a response from an existing prototype cell / that belongs to the same class C_k , at step

510, the pattern count n of this existing prototype cell ^'/^' is incremented by one (e.g. n = n+1).

Alternatively, if the input motion vector X belonging to a class C_k triggers a response from an existing prototype cell /^' that does not belong to the same class C_k , at step 512, a radius of the hyper-spherical influence field of this prototype cell /^' is reduced according to:

(1)

./=!

where ω_ί = {ω_ϋ : j = 1,N_V} is the weight vector of the prototype cell i.

The above algorithm can be implemented in the example embodiments such that, in response to the input signal (i.e. motion vector X), the prototype cell / uses a radial basis function to determine a trigger signal d_j as shown in Equation (1). If the trigger signal d_t is less than or equal to the cell threshold λ , the prototype cell /^' becomes active to trigger its associated action class C. Otherwise, the prototype cell /^' does not respond to the input motion vector X. When a prototype cell is triggered, its pattern count n is incremented by one.

In the example embodiment, the output of the prototype cell / in the trained RCE neural network is a probability value based on a decaying exponential function as defined by:

Here, there are 9 cells on the output layer each belonging to a respective class ( k e [1,9] ). The output value of these cells in the output layer is calculated as follows:

and the con it ona proba ty s determined according to:

In the example embodiments, the output value with the maximum conditional probability indicates an optimal identification of the input signal with that action class.

With reference to the virtual dance partner generation, in the example embodiment, the motion vector of the incoming frame is extracted, in a similar manner as described above with respect to the training stage, in real-time and used as the input signal of the trained RCE neural network. The trained RCE neural network categorizes the input signal into one of the nine basic actions, as described above. For all frames in a predetermined period of e.g. one music bar, the occurrences for each of the nine basic actions are counted. Each of these nine numbers can be considered as a users requirement r_k and can be used for synthesizing the dance animation of the virtual character.

Additionally, based on the users recognized basic actions as well as input music features (e.g. music energy and drum groove pattern similarity), a selection probability is associated with each motion dip. in the example embodiment, the selection probability indicates how close the respective motion clip is able to meet the input requirements. For example, let M = {m, ,m.₂,..., m_N } be the motion clip database with N motion clips, and R = {r_t ,r₂ _ r_L ] be the L requirements for the next motion clip, assuming independence of the requirement factors in R, if the current playing motion clip is m the probability of selecting the next motion clip m is: in _f \ m, , R) = pirn,„r„ r, r_L) =† p(m , | m, , r_k ) (5) on a simila

The similarity function (©) in Equation (6) may be a bell-shaped function with the highest value of 1 when m and m_j are similar under the respective requirement^ .

; In: an example embodiment, after a selection probability score is determined for each motion clip, the motion clips are selected as described below:

The plurality of motion clips are weighted. based on their probability scores, such that a motion clip with a higher score has a higher chance of being selected. In an example embodiment, the motion clips are sorted in a cumulative distribution function (from 0 to 1) of the respective probability scores.

For example, assume clip / has a score of 0.6, clip j is has a score of 0.2,. clip k has a score of 0.15 and clip / has a score of 0.05. The clips can be sorted as shown in Figure 6, e.g. clip / is assigned portion 602, clip j is assigned portion 604, clip k is assigned portion 606 and clip / is assigned portion 608. As such, a clip with a higher probability score is assigned a larger portion on the cumulative distribution function.

A random number from 0 to 1 (i.e. within a range of the cumulative distribution function) is generated such that if the random number falls within the portion associated with a clip, the ciip is selected. For example, if the generated random number is from 0 to 0.6, clip / is selected. If the random number is from 0.6 to 0.8, ciip j is selected, and so on.

After selecting the related motion clip based on the recognized basic actions and other factors as described above/ the selected motion clip is rendered into e.g. a 3D display for outputting to the user. Figure 7 shows a flow chart 700 illustrating a method for classifying a user's action according to an example embodiment. At step 702, video data of the user's action is captured. At step 704, an input signal indicative of the user's action is generated based oh the captured video data. At step 706, the input signal is applied to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.

Figure 8 shows a flow chart 800 illustrating a method for training an RCE for classifying a user's action according to an example embodiment. At step 802, training video data is captured. At step 804, the captured training video data is classified into basic actions. At step 806, a training input signal is generated based on the captured training video data for applying to the RCE neural network. At step 808, the RCE neural network is trained to recognize the training input signal as one of the manually classified basic actions.

The method and system of the example embodiments can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiments.

The computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.

The computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).

The computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922. The computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.

The components of the computer module 902 typically communicate via an interconnected bus 828 and in a manner known to the person skilled in the relevant art.

The application program is typically supplied to the user of the computer system 900- encoded on a data storage medium such as a CD-ROM or flash memory; carrier and read utilising a corresponding data storage medium drive of a data storage device 930. The application program is read and controlled in its execution by -the processor 918 Intermediate storage of program data maybe accomplished using RAM 920. The method and system according to the various embodiments described are applicable to a wide range of entertainment systems, such as multiplayer online games, single player games, and virtual world simulations. The method and system according to the various embodiments are also applicable to edutainment and training systems. Also, the use of an RCE neural network as an action classifier according to the example embodiments can advantageously avoid a potential local minima problem, which may happen for some other neural networks, such as a Hopfield network or a feedback network with a back-propagation learning process. Negative samples are preferably not required for training the RCE-based classifiers of the example embodiments. Further advantages of using the RCE neural network as action classifier according to the example embodiments includes having a localized information representation (because the hidden layer of the RCE neural network encodes specific knowledge about the cluster at each node); and having the size of the influence region of each region node (i.e. prototype cell) adjustable by the giobal influence from all the other prototype cells. This mechanism may make the RCE neural network according to the example embodiments more precise even in the boundary portions of the class feature space; and the RCE according to the example embodiments can be trained oniine or during real-time operation without re-training using all existing training data. It will be .appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

Claims

1. A method for classifying a user's action, the method comprising the steps of:

capturing video data of the user's action;

generating an input signal indicative of the users action based on the captured video data; and

. applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.

2: The method as claimed in claim 1 , wherein the step of generating the input signal indicative of the user's action comprises:

extracting a motion history image of the user's action at each frame of a video input signal; and

.calculating gradient information based on the extracted motion history image.

3. The method as claimed in claim 2, wherein the gradient information is calculated based on interior pixels of the motion history image. 4. The method as claimed in claims 2 or 3, wherein the input signal indicative of the user's action comprises a motion vector generated based on the calculated gradient information.

5. The method as claimed in claim 4, wherein a number of input layer cells of the RCE is equal to a number of dimensions of the motion vector, and applying the input signal to the RCE comprises applying respective values for each dimension.

6. The method as claimed in any one of the preceding claims, wherein a number of output layer cells of the RCE is equal to a number of the basic actions, and classifying the user's action comprises determining the output layer cell with a maximum conditional probability.

7. The method as claimed in any one of claims 2 to 6, wherein calculating gradient information based on the extracted motion history image comprises:

generating respective histograms of gradient directions for each of a plurality of 5 overlapping regions representing different regions of the users body; and

combining the histograms into a single histogram.

8: The method as claimed any one of the preceding claims, wherein the i y _. basic actions are selected from a group consisting of "move left", "move right", "move0 forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".

9. A system for classifying a user's action, comprising:

means for capturing video data of the user's action;

5 means for generating an input signal indicative of the user's action based on the captured video data; and

a Restricted Coulomb Energy (RCE) neural network classifier configured to receive the input signal for classifying the user's action into basic actions, 0 10. The system as claimed in claim 9, wherein the means for generating the input signal indicative of the user's action extracts a motion history image of the user's action at each frame of a video input signal; and calculates gradient information based on the extracted motion history image. 5 1 . The system as claimed in claim 10, wherein the gradient information is calculated based on interior pixels of the motion history image.

12. The system as claimed in claims 10 or 11 , wherein the input signal indicative of the user's action comprises a motion vector generated based on the 0 calculated gradient information.

13. The system as claimed in claim 12, wherein a number of input layer cells of the RCE is equal to a number of dimensions of the motion vector, and wherein the RCE is configured to receive respective values for each dimension.

14. The system as claimed in any one of claims 9 to 13, wherein a number of output layer cells of the RCE is equal to a number of the basic actions, and wherein the RGE determines the output layer cell with a maximum conditional probability for classifying the ^"user's action.

15. The system as claimed in any one of claims 10 to 14, wherein the means for generating the input signal indicative of the user's action calculates the gradient information by generating respective histograms of gradient directions for each of a plurality of overlapping regions representing different regions of the user's body; and combining the histograms into a single histogram.

16. The system as claimed any one of claims 9 to 15, wherein the basic actions are selected from a group consisting of "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".

17. A computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for classifying a user's action, the method comprising the steps of:

capturing video data of the user's action;

18. A method for training an RCE neural network for classifying a user's action, the method comprising the steps of:

capturing training video data;

classifying the captured training video data into basic actions,

training the RCE neural network to recognize the training input signal as one of the classified basic actions.

19. A system for training an RCE neural network for classifying a users action, the system comprising:

means for capturing training video data;

means for classifying the captured training video data into basic actions; means for generating a training input signal based on the captured training video data for applying to the RCE neural network; and

means for training the RCE neural network to recognize the training input signal^' a$ one of the classified basic actions.

20. A computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for training an RCE neural network for classifying a user's action, the method comprising the steps of: capturing training video data;

classifying the captured training video data into basic actions;

generating a training input signal based on the captured training video data forapplying to the RCE neural network; and