WO2011159258A1 - Method and system for classifying a user's action - Google Patents

Method and system for classifying a user's action

Info

Publication number
WO2011159258A1
Authority
WO
WIPO (PCT)
Prior art keywords
action
user
input signal
training
rce
Prior art date
Application number
PCT/SG2011/000214
Other languages
French (fr)
Inventor
Susanto Rahardja
Farzam Farbiz
Miaolong Yuan
Original Assignee
Agency For Science, Technology And Research
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency For Science, Technology And Research filed Critical Agency For Science, Technology And Research
Publication of WO2011159258A1 publication Critical patent/WO2011159258A1/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 - Recognition of whole body movements, e.g. for sport training
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/20 - Input arrangements for video game devices
    • A63F 13/21 - Input arrangements for video game devices characterised by their sensors, purposes or types
    • A63F 13/213 - Input arrangements for video game devices characterised by their sensors, purposes or types comprising photodetecting means, e.g. cameras, photodiodes or infrared cells
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/80 - Special adaptations for executing a specific game genre or game mode
    • A63F 13/814 - Musical performances, e.g. by evaluating the player's ability to follow a notation
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 2300/00 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/10 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals
    • A63F 2300/1087 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals comprising photodetecting means, e.g. a camera
    • A63F 2300/1093 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals comprising photodetecting means, e.g. a camera using visible light
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 2300/00 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/60 - Methods for processing data by generating or executing the game program
    • A63F 2300/6045 - Methods for processing data by generating or executing the game program for mapping control signals received from the input arrangement into game commands

Definitions

  • the present invention relates broadly to a method and system for classifying a user's action, and to a method and system for training a Restricted Coulomb Energy (RCE) neural network for classifying a user's action.
  • a user playing a basketball video game may want the character he is controlling to move the same way on the video screen as his favourite real-life player simply by mimicking said player's trademark moves.
  • a user singing karaoke / dancing may want an interactive display of a singer/dancer partner that moves along with his movements.
  • existing attempts at creating such a system have either been unsuccessful or produced unsatisfactory results.
  • verbal interaction [1,2] or sensor-based interactive tools [3,4] are used for interacting with virtual humans.
  • One such approach [2] shows a character-based interactive storytelling system using a speech-based interaction tool.
  • Another real-time interactive approach [1] allows musicians to interact with a synthetic character. The virtual character responds to musical inputs as if the musicians were conducting a live musical performance.
  • Another approach [3] uses a Nintendo Wii Remote as an interaction tool to provide related feedback for interacting with virtual agents.
  • Yet another approach [4] uses motion sensors for tracking the user's body movements in order to have full body interaction with the virtual characters. For example, to interact with virtual humans through body actions, several sensors are attached to the participant's body and used to control the motion of an avatar.
  • speech-based interactive tools may not be a good choice in noisy environments.
  • Sensor-based interaction tools typically lack flexibility due to spatial constraints from the environments.
  • vision-based methods provide a potential tool that can be used to interact with the virtual characters without any sensors attached on the user.
  • Most existing vision-based methods use gesture recognition techniques to interact with the virtual characters [5]. However, in some applications, gestures are not able to provide sufficient information to interact naturally with virtual characters.
  • Other computer vision-based approaches focus on full body pose recovery or robust human action recognition methods for the purpose of interacting with virtual humans [6,7].
  • one vision-based approach [6] comprises an interactive dancing system in which the 2-dimensional (2D) location of the center of mass (CoM) of the silhouette is used to detect related dance actions.
  • the CoM detector is prone to wrong tracking results.
  • a human can control the animation of a virtual character.
  • the system uses computer vision and machine learning techniques to recover the user's full body 3D pose from silhouettes obtained from three cameras, and the virtual character follows the user's pose in three dimensions.
  • this method is also not suitable for the case where the lighting conditions are low and fast changing.
  • due to the overall system delay from capturing the video to rendering the animation), it is not able to provide a face-to-face live interaction with the virtual character.
  • a method for classifying a user's action comprising the steps of:
  • the step of generating the input signal indicative of the user's action may comprise:
  • the gradient information may be calculated based on interior pixels of the motion history image.
  • the input signal indicative of the user's action may comprise a motion vector generated based on the calculated gradient information.
  • a number of input layer cells of the RCE may be equal to a number of dimensions of the motion vector, and applying the input signal to the RCE may comprise applying respective values for each dimension.
  • a number of output layer cells of the RCE may be equal to a number of the basic actions, and classifying the user's action may comprise determining the output layer cell with a maximum conditional probability.
  • Calculating gradient information based on the extracted motion history image may comprise:
  • the basic actions may be selected from a group consisting of "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".
  • a system for classifying a user's action comprising:
  • a Restricted Coulomb Energy (RCE) neural network classifier configured to receive the input signal for classifying the user's action into basic actions.
  • the means for generating the input signal indicative of the user's action may extract a motion history image of the user's action at each frame of a video input signal; and calculate gradient information based on the extracted motion history image.
  • the gradient information may be calculated based on interior pixels of the motion history image.
  • the input signal indicative of the user's action may comprise a motion vector generated based on the calculated gradient information.
  • a number of input layer cells of the RCE may be equal to a number of dimensions of the motion vector, and the RCE may be configured to receive respective values for each dimension.
  • a number of output layer cells of the RCE may be equal to a number of the basic actions, and the RCE may determine the output layer cell with a maximum conditional probability for classifying the user's action.
  • the means for generating the input signal indicative of the user's action may calculate the gradient information by generating respective histograms of gradient directions for each of a plurality of overlapping regions representing different regions of the user's body; and combining the histograms into a single histogram.
  • the basic actions may be selected from a group consisting of "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".
  • a computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for classifying a user's action, the method comprising the steps of:
  • a system for training an RCE neural network for classifying a user's action comprising:
  • a computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for training an RCE neural network for classifying a user's action, the method comprising the steps of:
  • Figure 1(a) shows a flow chart illustrating a training stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment.
  • Figure 1(b) shows a flow chart illustrating an online stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment.
  • Figure 2 shows a flow chart illustrating a method of action recognition in accordance with an example embodiment.
  • Figure 3 shows a schematic diagram illustrating the structure of an RCE neural network according to an example embodiment.
  • Figures 4(a)-(c) show images illustrating extraction of a Motion History Image from a moving silhouette according to an example embodiment.
  • Figure 4(d) shows a schematic diagram illustrating overlapping windows for generating a motion vector according to an example embodiment.
  • Figure 4(e) shows a motion vector corresponding to the input image of Figure 4(a).
  • Figure 5 shows a flow chart illustrating a method of activating a prototype cell in an RCE neural network according to an example embodiment.
  • Figure 6 shows a schematic diagram illustrating sorting the motion clips based on a cumulative distribution function.
  • Figure 7 shows a flow chart illustrating a method for classifying a user's action according to an example embodiment.
  • Figure 8 shows a flow chart illustrating a method for training a Restricted Coulomb Energy (RCE) neural network for classifying a user's action according to an example embodiment.
  • Figure 9 shows a schematic diagram illustrating a computer system suitable for implementing the methods and systems of the example embodiments.
  • the example embodiments provide a method and system for recognizing actions of a human subject in a human computer interaction system, e.g. recognizing actions of the human subject in a vision-based interactive dancing simulation.
  • An interactive output e.g. a display, can be generated based on the recognized actions.
  • Figure 1(a) shows a flow chart 100A illustrating a training stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment.
  • an input music feature is provided, e.g. for providing a time reference to training data.
  • dancing motion data is captured, e.g. from training videos that are manually provided.
  • the dancing motion in the training videos is segmented into smaller motion clips based on the time reference obtained from the music.
  • the segmented motion clips are manually labeled, i.e. the action class in each frame of a respective training motion clip is known.
  • the labeled motion clips are provided to a motion clip database.
  • features in a respective motion clip are extracted.
  • a Restricted Coulomb Energy (RCE) network is trained based on the extracted features and the labeled action classes (to be discussed in detail below).
  • Figure 1(b) shows a flow chart 100B illustrating an online stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment.
  • incoming video frames of a dancing user are captured by a camera.
  • features in each of the video frames are extracted.
  • the action of the person is recognized in real-time by the trained RCE action classifier (described above with respect to Figure 1(a)).
  • one or more motion clips are retrieved from the database based on the recognized action over a predetermined period, as separately provided by the dance music being played.
  • the retrieved motion clips are rendered for displaying as the interactive output of the virtual dance partner.
  • the virtual partner is displayed to the user in the form of e.g. a 3-dimensional (3D) display.
  • other forms of display, e.g. a 2-dimensional display, a holographic projection, etc., may be used in alternate embodiments. This gives the user an impression of dancing together with the virtual partner.
  • the selection process of the suitable motion clip in the example embodiments is based on a similarity measurement that affects the selection probability value of each clip, as discussed in detail below.
  • the present specification also discloses apparatus for performing the operations of the methods.
  • Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer.
  • the algorithms and displays presented herein are not inherently related to any particular computer or other apparatus.
  • Various general purpose machines may be used with programs in accordance with the teachings herein.
  • the construction of more specialized apparatus to perform the required method steps may be appropriate.
  • the structure of a conventional general purpose computer will appear from the description below.
  • the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code.
  • the computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.
  • the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
  • the computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer.
  • the computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system.
  • the computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
  • nine basic human actions are pre-defined, i.e. "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".
  • Every frame of every motion clip from the motion clip database is manually classified into one of these nine basic actions and grouped together into a vector called "motion vector" of that motion clip.
  • motion vectors are used when the system is in use or in operation for the probability-based selection process.
  • a Restricted Coulomb Energy (RCE) neural network is then employed to learn these basic actions using the training video data.
  • the trained RCE neural network is used to classify the user's movement at each video frame.
  • Figure 2 shows a flow chart 200 illustrating a method of action recognition in accordance with an example embodiment.
  • the method of action recognition comprises motion vector extraction, e.g. using motion history images (step 202), training an RCE neural network (step 204) and real-time or online action recognition, e.g. of a human, using the learned RCE neural network (step 206).
  • Figure 3 shows a schematic diagram illustrating the structure of an RCE neural network 300 according to an example embodiment.
  • the RCE neural network 300 comprises three layers of neuron cells, with a full set of connections between the first and second layers (i.e. the input layer 310 and the prototype layer 320), and a partial set of connections between the second and third layers (the prototype layer 320 and the output layer 330).
  • the input layer 310 includes a plurality of input cells 312, which is indicative of the dimension of the input vector.
  • the middle layer cells are called prototype cells 322.
  • Each prototype cell 322 contains a motion vector that occurs in the training data, and each output cell 332 on the output layer 330 corresponds to a different action class label presented in the training data set.
  • the prototype cells 322 comprise five parameters, i.e. class C, weight vector ω, cell threshold λ, pattern count n, and smoothing factor σ.
  • the weight vector ω represents a set of weighted connections between the prototype cells 322 and each of the input layer cells 312.
  • the cell threshold λ describes a hyper-spherical region of influence around the prototype cells 322. The size of the region of influence around the prototype cells 322 in the example embodiments can be adjusted.
  • the pattern count n indicates the number of times that a prototype cell 322 has responded to the input motion vectors submitted to the RCE neural network 300.
  • the smoothing factor σ represents a radial decaying coefficient of the hyper-spherical influence field.
  • the output layer 330 comprises the action categories (e.g. the 9 basic actions). As shown in Figure 3, in the RCE neural network of the example embodiments, each prototype cell 322 connects to only one output cell 332.
  • the training procedure of the RCE neural network 300 (Figure 3) makes use of three mechanisms: prototype cell commitment, threshold modification and pattern count increment. A minimal code sketch of this structure follows below.
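For illustration, here is a minimal Python sketch of the three-layer structure just described. The parameter names mirror the text (class C, weight vector ω, cell threshold λ, pattern count n, smoothing factor σ); the Euclidean form of the trigger distance is an assumption, since the patent's Equation (1) is not reproduced in this text, and all class and function names are illustrative only.

```python
import numpy as np

class PrototypeCell:
    """A middle-layer (prototype) cell with the five parameters named in the text."""
    def __init__(self, label, weights, threshold, sigma):
        self.label = label                         # class C: index of its single output cell
        self.w = np.asarray(weights, dtype=float)  # weight vector: one weight per input cell
        self.lam = threshold                       # cell threshold: radius of the hyper-spherical influence field
        self.n = 0                                 # pattern count
        self.sigma = sigma                         # smoothing factor: radial decay coefficient

    def trigger(self, x):
        # Trigger signal d_j; a Euclidean radial-basis distance is assumed here.
        return float(np.linalg.norm(np.asarray(x, dtype=float) - self.w))

class RCENetwork:
    """Input layer size equals the motion vector dimension; output layer size
    equals the number of basic actions (nine in the example embodiments)."""
    def __init__(self, n_inputs, n_classes, init_threshold=0.5, sigma=0.1):
        self.n_inputs = n_inputs
        self.n_classes = n_classes
        self.init_threshold = init_threshold  # initial influence-field radius (assumed)
        self.sigma = sigma
        self.prototypes = []                  # prototype layer, grown during training
```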
  • the motion vectors are extracted and the corresponding action class labels are manually associated/labelled.
  • a histogram of the gradients of the user's motion at each frame, as calculated from Motion History Images (MHI), is used to describe the user's actions. It will be appreciated that even under relatively low lighting conditions, the MHI can be extracted effectively. Additionally, in order to improve reliability, the gradient information is calculated in the example embodiment by only examining the interior pixels of the MHI.
  • Figures 4(a)-(c) show images illustrating extraction of a Motion History Image from a moving silhouette according to an example embodiment.
  • Figure 4(a) shows an input image frame.
  • Figure 4(b) shows the MHI of the input image frame of Figure 4(a).
  • Figure 4(c) shows the pixels of the valid orientations of the MHI.
  • the motion histogram of the orientations of the MHI is obtained by quantizing the gradient directions from the MHI into multiples of e.g. 5 degrees.
  • each motion vector has e.g. 72 dimensions when the MHI is considered a single region.
  • the motion histogram is normalized by the sum of all the valid motion orientation pixels in the example embodiment.
  • the resulting motion vector is then applied as the input of the RCE neural network.
  • the gradients of motion, i.e. the directional information (rotation) of the MHI, are calculated.
  • the MHI is divided into nine regions (or windows).
  • Figure 4(d) shows a schematic diagram illustrating overlapping windows for generating a motion vector according to an example embodiment.
  • Each region (dark portions in Figure 4(d)) is represented by a histogram of the motion orientations.
  • the gradient directions from the MHI are quantized into multiples of 30 degrees, resulting in a histogram with 12 bins per region; combining the nine histograms yields a 9 x 12 = 108-dimensional motion vector. A code sketch of this extraction is given below.
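As a concrete illustration of the feature extraction described above, here is a minimal NumPy sketch. It assumes a binary silhouette per frame and a linear MHI decay, and it approximates the nine overlapping windows of Figure 4(d) with a simple 3x3 grid, since the exact window layout is not specified in this text; with a single region and 5-degree bins, the same routine would yield the 72-dimensional variant mentioned earlier.

```python
import numpy as np

def update_mhi(mhi, silhouette, tau=255.0, decay=16.0):
    """Motion History Image: pixels of the current silhouette are set to tau,
    all other pixels fade towards zero (linear decay assumed)."""
    return np.where(silhouette > 0, tau, np.maximum(mhi - decay, 0.0))

def motion_vector(mhi, n_regions=3, n_bins=12):
    """108-D motion vector: a 12-bin histogram of gradient directions
    (30-degree bins) per region, normalized by the valid pixels per region."""
    gy, gx = np.gradient(mhi)
    # keep only interior pixels with a valid motion orientation, per the text
    valid = np.hypot(gx, gy) > 1e-3
    valid[0, :] = valid[-1, :] = valid[:, 0] = valid[:, -1] = False
    angles = np.degrees(np.arctan2(gy, gx)) % 360.0
    h, w = mhi.shape
    hists = []
    for i in range(n_regions):
        for j in range(n_regions):
            win = (slice(i * h // n_regions, (i + 1) * h // n_regions),
                   slice(j * w // n_regions, (j + 1) * w // n_regions))
            a = angles[win][valid[win]]
            counts, _ = np.histogram(a, bins=n_bins, range=(0.0, 360.0))
            total = counts.sum()
            hists.append(counts / total if total > 0 else counts.astype(float))
    return np.concatenate(hists)  # 9 regions x 12 bins = 108 dimensions
```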
  • FIG. 5 shows a flow chart 500 illustrating a method of activating a prototype cell in an RCE neural network according to an example embodiment.
  • a response from existing prototype cells is checked. If the input motion vector X does not trigger any response from the existing prototype cells, at step 504, a new prototype cell is created.
  • the new prototype cell j is connected to an output cell k representing the class C_k.
  • the current input motion vector is loaded as the weight vector of the new prototype cell j.
  • if the input motion vector X belonging to a class C_k triggers a response from an existing prototype cell j that belongs to the same class C_k, the pattern count n of that prototype cell is incremented by one.
  • if the input motion vector X triggers a response from an existing prototype cell belonging to a different class, the radius of the hyper-spherical influence field of that prototype cell is reduced (threshold modification) so that the cell no longer responds to X.
  • the prototype cell j uses a radial basis function to determine a trigger signal d_j, as shown in Equation (1). If the trigger signal d_j is less than or equal to the cell threshold λ, the prototype cell j becomes active and triggers its associated action class C. Otherwise, the prototype cell j does not respond to the input motion vector X. When a prototype cell is triggered, its pattern count n is incremented by one. A code sketch of this activation procedure follows below.
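The following sketch continues the RCENetwork/PrototypeCell classes above and implements the three training mechanisms named earlier. Because the patent's Equation (1) and its threshold-update rule are not reproduced in this text, the standard RCE rules are assumed, and the shrink factor is illustrative.

```python
def train_step(net, x, label):
    """One training presentation: prototype cell commitment, threshold
    modification and pattern count increment (standard RCE rules assumed)."""
    x = np.asarray(x, dtype=float)
    any_response = False
    for p in net.prototypes:
        d = p.trigger(x)
        if d <= p.lam:               # the cell responds to this input
            any_response = True
            if p.label == label:
                p.n += 1             # pattern count increment
            else:
                # threshold modification: shrink the influence field so the
                # wrong-class input falls just outside it (assumed rule)
                p.lam = min(p.lam, 0.999 * d)
    if not any_response:
        # prototype cell commitment: a new cell loaded with the input motion
        # vector, connected to the output cell of class `label`
        net.prototypes.append(
            PrototypeCell(label, x, net.init_threshold, net.sigma))
```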
  • the output of the prototype cell j in the trained RCE neural network is a probability value based on a decaying exponential function of the trigger signal d_j.
  • the output value with the maximum conditional probability indicates an optimal identification of the input signal with that action class.
  • the motion vector of the incoming frame is extracted, in a similar manner as described above with respect to the training stage, in real-time and used as the input signal of the trained RCE neural network.
  • the trained RCE neural network categorizes the input signal into one of the nine basic actions, as described above. For all frames in a predetermined period of e.g. one music bar, the occurrences of each of the nine basic actions are counted. Each of these nine numbers can be considered as a user's requirement r_k and can be used for synthesizing the dance animation of the virtual character. A code sketch of this classification step follows below.
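A sketch of the online classification step, continuing the classes above. The decaying exponential is written as exp(-(d/σ)²), an assumed form since the patent's equation is not reproduced here; the per-bar counting follows the text.

```python
def classify(net, x):
    """Return the index of the output cell with the maximum conditional
    probability, or None if no prototype cell responds to the input."""
    x = np.asarray(x, dtype=float)
    scores = np.zeros(net.n_classes)
    for p in net.prototypes:
        d = p.trigger(x)
        if d <= p.lam:
            # decaying exponential output of an active prototype cell
            scores[p.label] = max(scores[p.label], np.exp(-(d / p.sigma) ** 2))
    return int(np.argmax(scores)) if scores.any() else None

def user_requirement(net, frame_vectors, n_actions=9):
    """Count the occurrences of each basic action over one predetermined
    period (e.g. one music bar); the nine counts are the requirements r_k."""
    r = np.zeros(n_actions)
    for x in frame_vectors:
        k = classify(net, x)
        if k is not None:
            r[k] += 1
    return r
```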
  • a selection probability is associated with each motion clip.
  • the similarity function in Equation (6) may be a bell-shaped function with the highest value of 1 when m and m_j are similar under the respective requirement.
  • the motion clips are selected as described below:
  • the plurality of motion clips are weighted based on their probability scores, such that a motion clip with a higher score has a higher chance of being selected.
  • the motion clips are sorted in a cumulative distribution function (from 0 to 1) of the respective probability scores.
  • clip i has a score of 0.6;
  • clip j has a score of 0.2;
  • clip k has a score of 0.15;
  • clip l has a score of 0.05.
  • the clips can be sorted as shown in Figure 6, e.g. clip i is assigned portion 602, clip j is assigned portion 604, clip k is assigned portion 606 and clip l is assigned portion 608. As such, a clip with a higher probability score is assigned a larger portion on the cumulative distribution function.
  • a random number from 0 to 1 (i.e. within the range of the cumulative distribution function) is generated such that if the random number falls within the portion associated with a clip, that clip is selected. For example, if the generated random number is from 0 to 0.6, clip i is selected; if the random number is from 0.6 to 0.8, clip j is selected, and so on. A code sketch of this selection scheme follows below.
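This selection is a standard roulette-wheel draw over a cumulative distribution. A self-contained sketch, using the example scores from the text (all names are illustrative):

```python
import random

def select_clip(clips, scores):
    """Probability-weighted clip selection: lay the (normalized) scores out
    along a cumulative distribution from 0 to 1 and pick the clip whose
    portion contains a uniform random number, so higher-scoring clips are
    chosen more often."""
    total = sum(scores)
    r = random.random()
    cumulative = 0.0
    for clip, s in zip(clips, scores):
        cumulative += s / total
        if r < cumulative:
            return clip
    return clips[-1]  # guard against floating-point round-off

# With the example scores above, clip i is selected about 60% of the time.
clips = ["clip i", "clip j", "clip k", "clip l"]
scores = [0.6, 0.2, 0.15, 0.05]
print(select_clip(clips, scores))
```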
  • Figure 7 shows a flow chart 700 illustrating a method for classifying a user's action according to an example embodiment.
  • video data of the user's action is captured.
  • an input signal indicative of the user's action is generated based on the captured video data.
  • the input signal is applied to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.
  • FIG 8 shows a flow chart 800 illustrating a method for training an RCE for classifying a user's action according to an example embodiment.
  • training video data is captured.
  • the captured training video data is classified into basic actions.
  • a training input signal is generated based on the captured training video data for applying to the RCE neural network.
  • the RCE neural network is trained to recognize the training input signal as one of the manually classified basic actions.
  • the method and system of the example embodiments can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiments.
  • the computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.
  • the computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
  • the computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922.
  • the computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.
  • the components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.
  • the application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier, and read utilising a corresponding data storage medium drive of a data storage device 930.
  • the application program is read and controlled in its execution by the processor 918. Intermediate storage of program data may be accomplished using RAM 920.
  • the method and system according to the various embodiments described are applicable to a wide range of entertainment systems, such as multiplayer online games, single player games, and virtual world simulations.
  • the method and system according to the various embodiments are also applicable to edutainment and training systems.
  • an RCE neural network as an action classifier according to the example embodiments can advantageously avoid a potential local minima problem, which may happen for some other neural networks, such as a Hopfield network or a feedback network with a back-propagation learning process. Negative samples are preferably not required for training the RCE-based classifiers of the example embodiments. Further advantages of using the RCE neural network as action classifier according to the example embodiments include having a localized information representation (because the hidden layer of the RCE neural network encodes specific knowledge about the cluster at each node); and having the size of the influence region of each region node (i.e. prototype cell) adjustable by the global influence from all the other prototype cells.
  • This mechanism may make the RCE neural network according to the example embodiments more precise even in the boundary portions of the class feature space; and the RCE neural network according to the example embodiments can be trained online or during real-time operation without re-training using all existing training data. It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

A method and system for classifying a user's action. The method comprises the steps of capturing video data of the user's action; generating an input signal indicative of the user's action based on the captured video data; and applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.

Description

METHOD AND SYSTEM FOR CLASSIFYING A USER'S ACTION
FIELD OF INVENTION
The present invention relates broadly to a method and system for classifying a user's action, and to a method and system for training a Restricted Coulomb Energy (RCE) neural network for classifying a user's action.
BACKGROUND
Currently, in many video entertainment and training systems, it is desirable to provide simulated representations that match with respective user profiles or requirements. In one example, a user playing a basketball video game may want the character he is controlling to move the same way on the video screen as his favourite real-life player simply by mimicking said player's trademark moves. In another example, a user singing karaoke or dancing may want an interactive display of a singer/dancer partner that moves along with his movements. However, existing attempts at creating such a system have either been unsuccessful or produced unsatisfactory results.
In some existing approaches, verbal interaction [1,2] or sensor-based interactive tools [3,4] are used for interacting with virtual humans. One such approach [2] shows a character-based interactive storytelling system using a speech-based interaction tool. Another real-time interactive approach [1] allows musicians to interact with a synthetic character. The virtual character responds to musical inputs as if the musicians were conducting a live musical performance. Another approach [3] uses a Nintendo Wii Remote as an interaction tool to provide related feedback for interacting with virtual agents. Yet another approach [4] uses motion sensors for tracking the user's body movements in order to have full body interaction with the virtual characters. For example, to interact with virtual humans through body actions, several sensors are attached to the participant's body and used to control the motion of an avatar. However, in general, speech-based interactive tools may not be a good choice in noisy environments. Sensor-based interaction tools typically lack flexibility due to spatial constraints from the environments.
In contrast, vision-based methods provide a potential tool that can be used to interact with the virtual characters without any sensors attached to the user. Most existing vision-based methods use gesture recognition techniques to interact with the virtual characters [5]. However, in some applications, gestures are not able to provide sufficient information to interact naturally with virtual characters. Other computer vision-based approaches focus on full body pose recovery or robust human action recognition methods for the purpose of interacting with virtual humans [6,7].
For example, one vision-based approach [6] comprises an interactive dancing system in which the 2-dimensional (2D) location of the center of mass (CoM) of the silhouette is used to detect related dance actions. However, under fast changing and low lighting conditions, such as a night club, the CoM detector is prone to wrong tracking results. In another vision-based interface system [7], a human can control the animation of a virtual character. The system uses computer vision and machine learning techniques to recover the user's full body 3D pose from silhouettes obtained from three cameras, and the virtual character follows the user's pose in three dimensions. However, this method is also not suitable for the case where the lighting conditions are low and fast changing. Moreover, due to the overall system delay (from capturing the video to rendering the animation), it is not able to provide a face-to-face live interaction with the virtual character.
In other virtual dance synthesis techniques [8,9,10,11], different methods are used to generate real-time dancing animation from motion capture data. Captured sequences are segmented into basic moves based on the analysis of e.g. the music rhythm and the motion beats, which are assembled into sequences using motion graphs, and the new dance is aligned to the beat structure of the music. In one such technique [10], motion beats are estimated from unlabeled dance motion captures. Basic movements, in which the motion beats are coincident with the known music rhythmic pattern (e.g. Waltz and Tango), may be obtained. The basic movements are then grouped into prototype movements using clustering techniques. The transition probabilities among the prototype movements are established based on kinematic continuity and behavioral continuity. However, motion beats may not be used for obtaining basic movements from modern dance (e.g. club dance) due to the lack of standardized sets of basic movements.
A need therefore exists to provide a method and system for classifying a user's action that seek to address at least some of the above problems.
References:
1. Robyn Taylor, Daniel Torres and Pierre Boulanger, Using music to interact with a virtual character, Proceedings of the International Conference on New Interfaces for Musical Expression, Vancouver, 2005.
2. Marc Cavazza, Fred Charles and Steven J. Mead, Interacting with virtual characters in interactive storytelling, AAMAS 2002, Bologna, Italy.
3. Aaron Kotranza, Kyle Johnsen, Juan Cendan, Bayard Miller, D. Scott Lind and Benjamin Lok, Virtual multi-tools for hands and tool-based interaction with life-size virtual human agents, IEEE Symposium on 3D User Interfaces, 2009, Louisiana, USA.
4. Luc Emering, Ronan Boulic and Daniel Thalmann, Interacting with virtual humans through body actions, IEEE Computer Graphics & Applications, 1998, 8-11.
5. Selim Balcisoy and Daniel Thalmann, Interaction between Real and Virtual Humans in Augmented Reality, Proc. Computer Animation '97, IEEE CS Press, 1997, pp. 31-38.
6. Dennis Reidsma, Herwin van Welbergen, Ronald Poppe, Pieter Bos and Anton Nijholt, Towards Bi-directional Dancing Interaction, Proc. of the 5th International Conference on Entertainment Computing, 1-12, 2006.
7. Liu Ren, Gregory Shakhnarovich, Jessica Hodgins, Hanspeter Pfister and Paul Viola, Learning Silhouette Features for Control of Human Motion, ACM Transactions on Graphics, Vol. 24(4), October 2005, pp. 1303-1331.
8. Shiratori, T., Nakazawa, A. and Ikeuchi, K., Rhythmic motion analysis using motion capture and musical information, Proc. of 2003 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, 2003, 89-94.
9. Nakazawa, A., Nakaoka, S., Kudoh, S. and Ikeuchi, K., Digital archive of human dance motions, Proceedings of the International Conference on Virtual Systems and Multimedia (VSMM 2002), 2002.
10. Kim, T., Il Park, S. and Yong Shin, S., Rhythmic-motion synthesis based on motion-beat analysis, ACM Transactions on Graphics 22(3), 2003, 392-401.
11. Kovar, L., Gleicher, M. and Pighin, F., Motion graphs, ACM Transactions on Graphics 21(3), 2002.
SUMMARY
In accordance with a first aspect of the present invention, there is provided a method for classifying a user's action, the method comprising the steps of:
capturing video data of the user's action;
generating an input signal indicative of the user's action based on the captured video data; and
applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.
The step of generating the input signal indicative of the user's action may comprise:
extracting a motion history image of the user's action at each frame of a video input signal; and calculating gradient information based on the extracted motion history image.
The gradient information may be calculated based on interior pixels of the motion history image.
The input signal indicative of the user's action may comprise a motion vector generated based on the calculated gradient information.
A number of input layer cells of the RCE may be equal to a number of dimensions of the motion vector, and applying the input signal to the RCE may comprise applying respective values for each dimension.
A number of output layer cells of the RCE may be equal to a number of the basic actions, and classifying the user's action may comprise determining the output layer cell with a maximum conditional probability.
Calculating gradient information based on the extracted motion history image may comprise:
generating respective histograms of gradient directions for each of a plurality of overlapping regions representing different regions of the user's body; and
combining the histograms into a single histogram.
The basic actions may be selected from a group consisting of "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".
In accordance with a second aspect of the present invention, there is provided a system for classifying a user's action, comprising:
means for capturing video data of the user's action;
means for generating an input signal indicative of the user's action based on the captured video data; and
a Restricted Coulomb Energy (RCE) neural network classifier configured to receive the input signal for classifying the user's action into basic actions. The means for generating the input signal indicative of the user's action may extract a motion history image of the user's action at each frame of a video input signal; and calculate gradient information based on the extracted motion history image.
The gradient information may be calculated based on interior pixels of the motion history image.
The input signal indicative of the user's action may comprise a motion vector generated based on the calculated gradient information.
A number of input layer cells of the RCE may be equal to a number of dimensions of the motion vector, and the RCE may be configured to receive respective values for each dimension.
A number of output layer cells of the RCE may be equal to a number of the basic actions, and the RCE may determine the output layer cell with a maximum conditional probability for classifying the user's action.
The means for generating the input signal indicative of the user's action may calculate the gradient information by generating respective histograms of gradient directions for each of a plurality of overlapping regions representing different regions of the user's body; and combining the histograms into a single histogram. The basic actions may be selected from a group consisting of "move left",
"move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".
In accordance with a third aspect of the present invention, there is provided a computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for classifying a user's action, the method comprising the steps of:
capturing video data of the user's action; generating an input signal indicative of the user's action based on the captured video data; and
applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.
In accordance with a fourth aspect of the present invention, there is provided a method for training an RCE neural network for classifying a user's action, the method comprising the steps of:
capturing training video data;
classifying the captured training video data into basic actions;
generating a training input signal based on the captured training video data for applying to the RCE neural network; and
training the RCE neural network to recognize the training input signal as one of the classified basic actions.
In accordance with a fifth aspect of the present invention, there is provided a system for training an RCE neural network for classifying a user's action, the system comprising:
means for capturing training video data;
means for classifying the captured training video data into basic actions;
means for generating a training input signal based on the captured training video data for applying to the RCE neural network; and
means for training the RCE neural network to recognize the training input signal as one of the classified basic actions.
In accordance with a sixth aspect of the present invention, there is provided a computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for training an RCE neural network for classifying a user's action, the method comprising the steps of:
capturing training video data;
classifying the captured training video data into basic actions;
generating a training input signal based on the captured training video data for applying to the RCE neural network; and training the RCE neural network to recognize the training input signal as one of the classified basic actions.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
Figure 1(a) shows a flow chart illustrating a training stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment. Figure 1(b) shows a flow chart illustrating an online stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment.
Figure 2 shows a flow chart illustrating a method of action recognition in accordance with an example embodiment.
Figure 3 shows a schematic diagram illustrating the structure of an RCE neural network according to an example embodiment. Figures 4(a)-(c) show images illustrating extraction of a Motion History Image from a moving silhouette according to an example embodiment.
Figure 4(d) shows a schematic diagram illustrating overlapping windows for generating a motion vector according to an example embodiment.
Figure 4(e) shows a motion vector corresponding to the input image of Figure 4(a). Figure 5 shows a flow chart illustrating a method of activating a prototype cell in an RCE neural network according to an example embodiment.
Figure 6 shows a schematic diagram illustrating sorting the motion clips based on a cumulative distribution function.
Figure 7 shows a flow chart illustrating a method for classifying a user's action according to an example embodiment.
Figure 8 shows a flow chart illustrating a method for training a Restricted Coulomb Energy (RCE) neural network for classifying a user's action according to an example embodiment.
Figure 9 shows a schematic diagram illustrating a computer system suitable for implementing the methods and systems of the example embodiments.
DETAILED DESCRIPTION
The example embodiments provide a method and system for recognizing actions of a human subject in a human computer interaction system, e.g. recognizing actions of the human subject in a vision-based interactive dancing simulation. An interactive output, e.g. a display, can be generated based on the recognized actions. Figure 1(a) shows a flow chart 100A illustrating a training stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment. At step 102a, an input music feature is provided, e.g. for providing a time reference to training data. At step 102b, dancing motion data is captured, e.g. from training videos that are manually provided. At step 104, the dancing motion in the training videos is segmented into smaller motion clips based on the time reference obtained from the music. At step 106, the segmented motion clips are manually labeled, i.e. the action class in each frame of a respective training motion clip is known. At step 108, the labeled motion clips are provided to a motion clip database. At step 110, features in a respective motion clip are extracted. At step 112, a Restricted Coulomb Energy (RCE) network is trained based on the extracted features and the labeled action classes (to be discussed in detail below).
Figure 1(b) shows a flow chart 100B illustrating an online stage of a method for generating an interactive output, here a virtual dance partner, according to an example embodiment. At step 122, incoming video frames of a dancing user are captured by a camera. At step 124, features in each of the video frames are extracted. At step 126, the action of the person is recognized in real-time by the trained RCE action classifier (described above with respect to Figure 1(a)). At step 208, one or more motion clips are retrieved from the database based on the recognized action over a predetermined period, as separately provided by the dance music being played. At step 210, the retrieved motion clips are rendered for displaying as the interactive output of the virtual dance partner. In the example embodiments, the virtual partner is displayed to the user in the form of e.g. a 3-dimensional (3D) display. However, it will be appreciated that other forms of display, e.g. a 2-dimensional display, a holographic projection, etc., may be used in alternate embodiments. This gives the user an impression of dancing together with the virtual partner. The selection process of the suitable motion clip in the example embodiments is based on a similarity measurement that affects the selection probability value of each clip, as discussed in detail below.
Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "scanning", "calculating", "determining", "replacing", "generating", "initializing", "outputting", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.
In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
In the example embodiments, nine basic human actions are pre-defined, i.e. "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up". Every frame of every motion clip from the motion clip database is manually classified into one of these nine basic actions and grouped together into a vector called the "motion vector" of that motion clip. These motion vectors are used when the system is in use or in operation for the probability-based selection process.
As described above, training videos from different users are recorded and manually classified. In the example embodiments, a Restricted Coulomb Energy (RCE) neural network is then employed to learn these basic actions using the training video data. When the system is in use or in operation, the trained RCE neural network is used to classify the user's movement at each video frame.
Figure 2 shows a flow chart 200 illustrating a method of action recognition in accordance with an example embodiment. As can be seen from Figure 2, in the example embodiment, the method of action recognition comprises motion vector extraction, e.g. using motion history images (step 202), training an RCE neural network (step 204) and real-time or online action recognition, e.g. of a human, using the learned RCE neural network (step 206). Figure 3 shows a schematic diagram illustrating the structure of an RCE neural network 300 according to an example embodiment. The RCE neural network 300 comprises three layers of neuron cells, with a full set of connections between the first and second layers (i.e. input layer 310 and prototype layer 320), and a partial set of connections between the second and third layers (prototype layer 320 and output layer 330), as shown in Figure 3. The input layer 310 includes a plurality of input cells 312, which is indicative of the dimension of the input vector. The middle layer cells are called prototype cells 322. Each prototype cell 322 contains a motion vector that occurs in the training data, and each output cell 332 on the output layer 330 corresponds to a different action class label presented in the training data set. In the example embodiments, the prototype cells 322 comprise five parameters, i.e. class C, weight vector ω, cell threshold λ, pattern count n, and smoothing factor σ. The weight vector ω represents a set of weighted connections between the prototype cells 322 and each of the input layer cells 312. The cell threshold λ describes a hyper-spherical region of influence around the prototype cells 322. The size of the region of influence around the prototype cells 322 in the example embodiments can be adjusted. The pattern count n indicates the number of times that a prototype cell 322 has responded to the input motion vectors submitted to the RCE neural network 300. The smoothing factor σ represents a radial decaying coefficient of the hyper-spherical influence field. The output layer 330 comprises the action categories (e.g. the 9 basic actions). As shown in Figure 3, in the RCE neural network of the example embodiments, each prototype cell 322 connects to only one output cell 332. In the example embodiments, the training procedure of the RCE neural network
300 (Figure 3) makes use of three mechanisms: prototype cell commitment, threshold modification and pattern count increment.
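By way of illustration, the five prototype cell parameters described above might be held in a record such as the following Python sketch; the class name, field names and the Euclidean distance helper are illustrative assumptions rather than part of the described embodiments.

```python
# Illustrative sketch only: a prototype cell holding the five parameters
# described above (class C, weight vector, threshold, count, smoothing).
from dataclasses import dataclass
import numpy as np

@dataclass
class PrototypeCell:
    cls: int               # action class C (0-based index of its output cell)
    weights: np.ndarray    # weight vector ω, one entry per input layer cell
    threshold: float       # cell threshold λ (radius of the influence field)
    count: int = 1         # pattern count n
    sigma: float = 1.0     # smoothing factor σ of the influence field

    def distance(self, x: np.ndarray) -> float:
        # Euclidean distance between an input motion vector and ω
        return float(np.linalg.norm(x - self.weights))
```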
First, given the training action video data, the motion vectors are extracted and the corresponding action class labels are manually associated/labelled.
In an example embodiment, a histogram of the gradients of the user's motion at each frame, as calculated from Motion History Images (MHI), is used to describe the user's actions. It will be appreciated that MHI can be extracted effectively even under relatively low lighting conditions. Additionally, in order to improve reliability, the gradient information is calculated in the example embodiment by examining only the interior pixels of the MHI, i.e. pixels inside the MHI region.
Figures 4(a)-(c) show images illustrating extraction of a Motion History Image from a moving silhouette according to an example embodiment. Here, Figure 4(a) shows an input image frame, Figure 4(b) shows the MHI of the input image frame of Figure 4(a), and Figure 4(c) shows the pixels of the valid orientations of the MHI. In some example embodiments, the motion histogram of the orientations of the MHI is obtained by quantizing the gradient directions from the MHI into multiples of e.g. 5 degrees. Thus, each motion vector has e.g. 72 dimensions when the MHI is considered a single region. In addition, to handle changes in scale, the motion histogram is normalized by the sum of all the valid motion orientation pixels in the example embodiment. The resulting motion vector is then applied as the input of the RCE neural network.
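As a rough sketch of this single-region extraction, assuming the MHI is available as a 2-D array, the orientation histogram might be computed as below; the function name, the use of NumPy gradients and the bin width parameter are assumptions, and the interior-pixel restriction described above is omitted for brevity.

```python
import numpy as np

def motion_vector_from_mhi(mhi: np.ndarray, bin_deg: float = 5.0) -> np.ndarray:
    """Quantize MHI gradient directions into bins of bin_deg degrees and
    return a scale-normalized orientation histogram (72 bins for 5 degrees)."""
    # Gradients of the MHI timestamp surface indicate local motion direction.
    gy, gx = np.gradient(mhi.astype(np.float64))
    valid = (gx != 0) | (gy != 0)                  # pixels with valid motion
    angles = (np.degrees(np.arctan2(gy[valid], gx[valid])) + 360.0) % 360.0
    n_bins = int(round(360.0 / bin_deg))
    hist, _ = np.histogram(angles, bins=n_bins, range=(0.0, 360.0))
    total = hist.sum()
    # Normalize by the number of valid motion orientation pixels.
    return hist / total if total > 0 else hist.astype(np.float64)
```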
In another embodiment as illustrated by Figures 4(d)-(e), from the MHI, the gradients of motion, i.e. the directional information (rotation), of the MHI are calculated. In order to obtain the motion vector of different regions around the body, the MHI is "divided" into nine regions (or windows). Figure 4(d) shows a schematic diagram illustrating overlapping windows for generating a motion vector according to an example embodiment. Each region (dark portions in Figure 4(d)) is represented by a histogram of the motion orientations. To generate the histogram for these regions, the gradient directions from the MHI are quantized into multiples of 30 degrees, resulting in a histogram with 12 bins. To handle changes in scale, these histograms are normalized by the sum of all the motion orientation pixels found in the gradient map in each region. By combining these 9 histograms, a 9 × 12 = 108 dimensional vector is obtained in the example embodiment. The thus obtained motion vector, in the form of a motion histogram, is shown in Figure 4(e).
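Reusing the hypothetical helper above, the 108-dimensional variant might be sketched as follows; a plain 3 × 3 tiling is assumed here in place of the overlapping windows of Figure 4(d), whose exact geometry is not specified in this passage.

```python
import numpy as np
# Assumes motion_vector_from_mhi from the earlier sketch is in scope.

def region_motion_vector(mhi: np.ndarray) -> np.ndarray:
    """Concatenate 12-bin orientation histograms (30-degree bins) from nine
    regions of the MHI into a 9 x 12 = 108-dimensional motion vector."""
    h, w = mhi.shape
    histograms = []
    for r in range(3):
        for c in range(3):
            window = mhi[r * h // 3:(r + 1) * h // 3,
                         c * w // 3:(c + 1) * w // 3]
            histograms.append(motion_vector_from_mhi(window, bin_deg=30.0))
    return np.concatenate(histograms)  # 9 regions x 12 bins = 108 dimensions
```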
Next, prototype cells of the RCE network are activated. Figure 5 shows a flow chart 500 illustrating a method of activating a prototype cell in an RCE neural network according to an example embodiment. At step 502, for an input motion vector X belonging to a class C_k, a response from existing prototype cells is checked. If the input motion vector X does not trigger any response from the existing prototype cells, at step 504, a new prototype cell i is created. At step 506, the new prototype cell i is connected to an output cell k representing the class C_k. At step 508, the current input motion vector is loaded as the weight vector of the new prototype cell i. On the other hand, if the input motion vector X belonging to a class C_k triggers a response from an existing prototype cell i that belongs to the same class C_k, at step 510, the pattern count n of this existing prototype cell i is incremented by one (e.g. n = n + 1).
Alternatively, if the input motion vector X belonging to a class C_k triggers a response from an existing prototype cell i that does not belong to the same class C_k, at step 512, the radius of the hyper-spherical influence field of this prototype cell i is reduced according to:
λ_i = d_i = √( Σ_{j=1}^{N_V} (x_j − ω_{ij})² )    (1)

where ω_i = {ω_{ij} : j = 1, ..., N_V} is the weight vector of the prototype cell i and N_V is the dimension of the input motion vector.
The above algorithm can be implemented in the example embodiments such that, in response to the input signal (i.e. motion vector X), the prototype cell i uses a radial basis function to determine a trigger signal d_i as shown in Equation (1). If the trigger signal d_i is less than or equal to the cell threshold λ_i, the prototype cell i becomes active to trigger its associated action class C. Otherwise, the prototype cell i does not respond to the input motion vector X. When a prototype cell is triggered, its pattern count n is incremented by one.
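A compact sketch of this training step, reusing the hypothetical PrototypeCell record from above, is given below; the function signature, the initial threshold and the small margin eps are assumptions.

```python
import numpy as np
# Assumes PrototypeCell from the earlier sketch is in scope.

def train_step(cells: list, x: np.ndarray, cls: int,
               init_threshold: float = 1.0, eps: float = 1e-6) -> None:
    """One RCE training pass over a labelled motion vector, using the three
    mechanisms above: commitment, threshold modification, count increment."""
    triggered = [c for c in cells if c.distance(x) <= c.threshold]
    if not triggered:
        # Prototype cell commitment (steps 504-508): the input vector is
        # loaded as the weight vector of a new cell for class cls.
        cells.append(PrototypeCell(cls=cls, weights=x.copy(),
                                   threshold=init_threshold))
        return
    for c in triggered:
        if c.cls == cls:
            c.count += 1          # pattern count increment (step 510)
        else:
            # Threshold modification (step 512, Equation (1)): shrink the
            # influence field so the wrong-class cell no longer covers x.
            c.threshold = c.distance(x) - eps
```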
In the example embodiment, the output of the prototype cell i in the trained RCE neural network is a probability value based on a decaying exponential function as defined by:

p_i(X) = exp( −d_i² / σ_i² )    (2)

where σ_i is the smoothing factor of the prototype cell i and d_i is the trigger signal of Equation (1). Here, there are 9 cells on the output layer, each belonging to a respective class (k ∈ [1, 9]). The output value of these cells in the output layer is calculated as follows:

O_k(X) = Σ_{i ∈ S_k} p_i(X)    (3)

where S_k denotes the set of prototype cells connected to the output cell k, and the conditional probability is determined according to:

P(C_k | X) = O_k(X) / Σ_{l=1}^{9} O_l(X)    (4)
In the example embodiments, the output value with the maximum conditional probability indicates an optimal identification of the input signal with that action class.
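Using Equations (2)-(4) above, and again reusing the hypothetical PrototypeCell record, the classification step might be sketched as follows; zero-based class indices are an assumption.

```python
import numpy as np
# Assumes PrototypeCell from the earlier sketch is in scope.

def classify(cells: list, x: np.ndarray, n_classes: int = 9) -> int:
    """Return the index of the action class with the maximum conditional
    probability, or -1 if no prototype cell responds to the input."""
    outputs = np.zeros(n_classes)
    for c in cells:
        d = c.distance(x)
        if d <= c.threshold:              # only active cells contribute
            # Decaying exponential output of Eq. (2), summed per class (Eq. (3))
            outputs[c.cls] += np.exp(-(d * d) / (c.sigma * c.sigma))
    total = outputs.sum()
    if total == 0.0:
        return -1
    probs = outputs / total               # conditional probabilities, Eq. (4)
    return int(np.argmax(probs))
```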
With reference to the virtual dance partner generation, in the example embodiment, the motion vector of the incoming frame is extracted in real-time, in a similar manner as described above with respect to the training stage, and used as the input signal of the trained RCE neural network. The trained RCE neural network categorizes the input signal into one of the nine basic actions, as described above. For all frames in a predetermined period of e.g. one music bar, the occurrences of each of the nine basic actions are counted. Each of these nine numbers can be considered as a user requirement r_k and can be used for synthesizing the dance animation of the virtual character.
Additionally, based on the user's recognized basic actions as well as input music features (e.g. music energy and drum groove pattern similarity), a selection probability is associated with each motion clip. In the example embodiment, the selection probability indicates how close the respective motion clip is able to meet the input requirements. For example, let M = {m_1, m_2, ..., m_N} be the motion clip database with N motion clips, and R = {r_1, r_2, ..., r_L} be the L requirements for the next motion clip. Assuming independence of the requirement factors in R, if the currently playing motion clip is m_i, the probability of selecting the next motion clip m_j is:

p(m_j | m_i, R) = p(m_j | m_i, r_1, r_2, ..., r_L) = Π_{k=1}^{L} p(m_j | m_i, r_k)    (5)

and each factor is determined based on a similarity function:

p(m_j | m_i, r_k) ∝ Θ(m_i, m_j, r_k)    (6)
The similarity function Θ in Equation (6) may be a bell-shaped function with the highest value of 1 when m_i and m_j are similar under the respective requirement r_k.
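A minimal sketch of Equations (5) and (6) is shown below; the Gaussian used as the bell-shaped similarity function, and the per-requirement feature values it compares, are assumptions for illustration only.

```python
import numpy as np

def gaussian_similarity(fi: float, fj: float, width: float = 1.0) -> float:
    # One plausible bell-shaped choice: equals 1 when the clip features match
    # under the given requirement, and decays smoothly as they diverge.
    return float(np.exp(-((fi - fj) ** 2) / (2.0 * width ** 2)))

def selection_probability(features_i, features_j) -> float:
    """Multiply the per-requirement factors of Equation (5), with each factor
    supplied by a bell-shaped similarity as in Equation (6)."""
    p = 1.0
    for fi, fj in zip(features_i, features_j):
        p *= gaussian_similarity(fi, fj)
    return p
```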
In an example embodiment, after a selection probability score is determined for each motion clip, the motion clips are selected as described below.
The plurality of motion clips are weighted based on their probability scores, such that a motion clip with a higher score has a higher chance of being selected. In an example embodiment, the motion clips are sorted in a cumulative distribution function (from 0 to 1) of the respective probability scores.
For example, assume clip i has a score of 0.6, clip j has a score of 0.2, clip k has a score of 0.15 and clip l has a score of 0.05. The clips can be sorted as shown in Figure 6, e.g. clip i is assigned portion 602, clip j is assigned portion 604, clip k is assigned portion 606 and clip l is assigned portion 608. As such, a clip with a higher probability score is assigned a larger portion on the cumulative distribution function.
A random number from 0 to 1 (i.e. within the range of the cumulative distribution function) is generated, such that if the random number falls within the portion associated with a clip, that clip is selected. For example, if the generated random number is from 0 to 0.6, clip i is selected; if the random number is from 0.6 to 0.8, clip j is selected, and so on.
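The weighted draw described above amounts to inverting a cumulative distribution, for example as in the following sketch, which reproduces the clip scores of the Figure 6 example.

```python
import random

def select_clip(scores: dict) -> str:
    """Pick a clip at random, weighted by its probability score, by walking
    the cumulative distribution until the random draw is covered."""
    u = random.random() * sum(scores.values())   # uniform draw over the CDF
    cumulative = 0.0
    for clip, score in scores.items():
        cumulative += score
        if u < cumulative:
            return clip
    return clip   # floating-point edge case: fall back to the last clip

# Figure 6 example: clip i (0.6), clip j (0.2), clip k (0.15), clip l (0.05)
chosen = select_clip({"i": 0.6, "j": 0.2, "k": 0.15, "l": 0.05})
```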
After selecting the related motion clip based on the recognized basic actions and other factors as described above, the selected motion clip is rendered into e.g. a 3D display for outputting to the user.

Figure 7 shows a flow chart 700 illustrating a method for classifying a user's action according to an example embodiment. At step 702, video data of the user's action is captured. At step 704, an input signal indicative of the user's action is generated based on the captured video data. At step 706, the input signal is applied to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.
Figure 8 shows a flow chart 800 illustrating a method for training an RCE neural network for classifying a user's action according to an example embodiment. At step 802, training video data is captured. At step 804, the captured training video data is manually classified into basic actions. At step 806, a training input signal is generated based on the captured training video data for applying to the RCE neural network. At step 808, the RCE neural network is trained to recognize the training input signal as one of the manually classified basic actions.
The method and system of the example embodiments can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiments.
The computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.
The computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
The computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922. The computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.
The components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.
The application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier, and read utilising a corresponding data storage medium drive of a data storage device 930. The application program is read and controlled in its execution by the processor 918. Intermediate storage of program data may be accomplished using RAM 920.

The method and system according to the various embodiments described are applicable to a wide range of entertainment systems, such as multiplayer online games, single player games, and virtual world simulations. The method and system according to the various embodiments are also applicable to edutainment and training systems.

Also, the use of an RCE neural network as an action classifier according to the example embodiments can advantageously avoid a potential local minima problem, which may happen for some other neural networks, such as a Hopfield network or a feedback network with a back-propagation learning process. Negative samples are preferably not required for training the RCE-based classifiers of the example embodiments. Further advantages of using the RCE neural network as action classifier according to the example embodiments include having a localized information representation (because the hidden layer of the RCE neural network encodes specific knowledge about the cluster at each node), and having the size of the influence region of each region node (i.e. prototype cell) adjustable by the global influence from all the other prototype cells. This mechanism may make the RCE neural network according to the example embodiments more precise even in the boundary portions of the class feature space. Moreover, the RCE neural network according to the example embodiments can be trained online or during real-time operation without re-training using all existing training data.

It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

Claims

1. A method for classifying a user's action, the method comprising the steps of:
capturing video data of the user's action;
generating an input signal indicative of the user's action based on the captured video data; and
applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.
2. The method as claimed in claim 1, wherein the step of generating the input signal indicative of the user's action comprises:
extracting a motion history image of the user's action at each frame of a video input signal; and
calculating gradient information based on the extracted motion history image.
3. The method as claimed in claim 2, wherein the gradient information is calculated based on interior pixels of the motion history image.

4. The method as claimed in claim 2 or 3, wherein the input signal indicative of the user's action comprises a motion vector generated based on the calculated gradient information.
5. The method as claimed in claim 4, wherein a number of input layer cells of the RCE is equal to a number of dimensions of the motion vector, and applying the input signal to the RCE comprises applying respective values for each dimension.
6. The method as claimed in any one of the preceding claims, wherein a number of output layer cells of the RCE is equal to a number of the basic actions, and classifying the user's action comprises determining the output layer cell with a maximum conditional probability.
7. The method as claimed in any one of claims 2 to 6, wherein calculating gradient information based on the extracted motion history image comprises:
generating respective histograms of gradient directions for each of a plurality of overlapping regions representing different regions of the user's body; and
combining the histograms into a single histogram.
8. The method as claimed in any one of the preceding claims, wherein the basic actions are selected from a group consisting of "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".
9. A system for classifying a user's action, comprising:
means for capturing video data of the user's action;
5 means for generating an input signal indicative of the user's action based on the captured video data; and
a Restricted Coulomb Energy (RCE) neural network classifier configured to receive the input signal for classifying the user's action into basic actions, 0 10. The system as claimed in claim 9, wherein the means for generating the input signal indicative of the user's action extracts a motion history image of the user's action at each frame of a video input signal; and calculates gradient information based on the extracted motion history image. 5 1 . The system as claimed in claim 10, wherein the gradient information is calculated based on interior pixels of the motion history image.
12. The system as claimed in claim 10 or 11, wherein the input signal indicative of the user's action comprises a motion vector generated based on the calculated gradient information.
13. The system as claimed in claim 12, wherein a number of input layer cells of the RCE is equal to a number of dimensions of the motion vector, and wherein the RCE is configured to receive respective values for each dimension.
14. The system as claimed in any one of claims 9 to 13, wherein a number of output layer cells of the RCE is equal to a number of the basic actions, and wherein the RCE determines the output layer cell with a maximum conditional probability for classifying the user's action.
15. The system as claimed in any one of claims 10 to 14, wherein the means for generating the input signal indicative of the user's action calculates the gradient information by generating respective histograms of gradient directions for each of a plurality of overlapping regions representing different regions of the user's body; and combining the histograms into a single histogram.
16. The system as claimed in any one of claims 9 to 15, wherein the basic actions are selected from a group consisting of "move left", "move right", "move forward", "move backward", "move upward", "move downward", "rotate left", "rotate right" and "hands up".
17. A computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for classifying a user's action, the method comprising the steps of:
capturing video data of the user's action;
generating an input signal indicative of the user's action based on the captured video data; and
applying the input signal to a Restricted Coulomb Energy (RCE) neural network classifier for classifying the user's action into basic actions.
18. A method for training an RCE neural network for classifying a user's action, the method comprising the steps of:
capturing training video data;
classifying the captured training video data into basic actions;
generating a training input signal based on the captured training video data for applying to the RCE neural network; and
training the RCE neural network to recognize the training input signal as one of the classified basic actions.
19. A system for training an RCE neural network for classifying a user's action, the system comprising:
means for capturing training video data;
means for classifying the captured training video data into basic actions;
means for generating a training input signal based on the captured training video data for applying to the RCE neural network; and
means for training the RCE neural network to recognize the training input signal as one of the classified basic actions.
20. A computer storage medium having stored thereon computer code means for instructing a computing device to execute a method for training an RCE neural network for classifying a user's action, the method comprising the steps of:
capturing training video data;
classifying the captured training video data into basic actions;
generating a training input signal based on the captured training video data for applying to the RCE neural network; and
training the RCE neural network to recognize the training input signal as one of the classified basic actions.
PCT/SG2011/000214 2010-06-16 2011-06-16 Method and system for classifying a user's action WO2011159258A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG201004251-3 2010-06-16
SG201004251 2010-06-16

Publications (1)

Publication Number Publication Date
WO2011159258A1 true WO2011159258A1 (en) 2011-12-22

Family

ID=45348448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2011/000214 WO2011159258A1 (en) 2010-06-16 2011-06-16 Method and system for classifying a user's action

Country Status (1)

Country Link
WO (1) WO2011159258A1 (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007053116A1 (en) * 2005-10-31 2007-05-10 National University Of Singapore Virtual interface system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JONSSON, H. ET AL.: "Vision-based segmentation of hand regions for purpose of tracking gestures", MASTER'S THESIS IN COMPUTING SCIENCE, 1 December 2008 (2008-12-01), DEPARTMENT OF COMPUTING SCIENCE SE-901 87 UMEA, SWEDEN *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9218545B2 (en) 2013-07-16 2015-12-22 National Taiwan University Of Science And Technology Method and system for human action recognition
TWI506461B (en) * 2013-07-16 2015-11-01 Univ Nat Taiwan Science Tech Method and system for human action recognition
WO2017187850A1 (en) * 2016-04-27 2017-11-02 株式会社セガゲームス Information processing device and program
JP2017196380A (en) * 2016-09-21 2017-11-02 株式会社セガゲームス Information processor and program
CN109906457A (en) * 2016-11-03 2019-06-18 三星电子株式会社 Data identification model constructs equipment and its constructs the method for data identification model and the method for data discrimination apparatus and its identification data
US11908176B2 (en) 2016-11-03 2024-02-20 Samsung Electronics Co., Ltd. Data recognition model construction apparatus and method for constructing data recognition model thereof, and data recognition apparatus and method for recognizing data thereof
JP2017196392A (en) * 2017-02-09 2017-11-02 株式会社セガゲームス Information processor and program
CN107862275A (en) * 2017-11-01 2018-03-30 电子科技大学 Human bodys' response model and its construction method and Human bodys' response method
KR102082999B1 (en) * 2018-09-14 2020-02-28 한국항공대학교산학협력단 RCE neural network learning apparatus and method thereof
KR20200080419A (en) * 2018-12-19 2020-07-07 한국항공대학교산학협력단 Hand gesture recognition method using artificial neural network and device thereof
KR102179999B1 (en) 2018-12-19 2020-11-17 한국항공대학교산학협력단 Hand gesture recognition method using artificial neural network and device thereof
CN110909621A (en) * 2019-10-30 2020-03-24 中国科学院自动化研究所南京人工智能芯片创新研究院 Body-building guidance system based on vision
KR20210103177A (en) * 2020-02-13 2021-08-23 한림대학교 산학협력단 Method for recognizing hand gesture based on RCE-NN
KR102338296B1 (en) * 2020-02-13 2021-12-09 한림대학교 산학협력단 Method for recognizing hand gesture based on RCE-NN


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11796070

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11796070

Country of ref document: EP

Kind code of ref document: A1