CN106778576B - Motion recognition method based on an SEHM feature map sequence
- Publication number: CN106778576B
- Application number: CN201611110573.5A
- Authority: CN (China)
- Prior art keywords: sehm, sequence, frame, action, diagram
- Prior art date: 2016-12-06
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06N3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06V20/41: Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
In the action recognition method provided by the invention, the proposed SEHM (segment energy history map) feature map serves as the low-level feature on which recognition is performed. By reasonably choosing parameters such as the time-slice length, computing the corresponding SEHM feature map sequence and feeding it to a neural network for prediction, both offline recognition and online recognition are achieved. Because the constructed SEHM feature map reflects the change of the overall posture over the course of an action, the information contained in that change is fully used and recognition accuracy improves. Meanwhile, the raw data are compressed when the SEHM feature maps are computed, so the complexity of the method and its hardware requirements are low, and online real-time action recognition can be achieved.
Description
Technical Field
The invention relates to the field of image recognition, in particular to an action recognition method based on an SEHM feature map sequence.
Background
With the development of camera sensor technology, camera definition has generally improved, and cameras now appear in far more scenes and in far greater numbers. Under the modern wave of the internet, a large amount of image and video data emerges in daily life, which in turn drives the development of image processing technology. As one branch of image processing, action recognition has been widely applied in many scenarios, including video surveillance, motion-sensing games, health care and social assistance. For example, in 2010 Microsoft introduced the Kinect, a somatosensory peripheral for the Xbox 360 that serves as a depth camera in console games, capturing the player's body movements for interaction with the game; developers can also build their own applications, such as virtual fitting, on the Windows platform using its development kit.
Despite these wide application scenarios, the development of action recognition faces many technical difficulties and constraints.
The first is the constraint of objective conditions. In a video image sequence, unavoidable hindering factors arise from the actual shooting situation: a person in the frame may be occluded by other objects (object occlusion); the camera is not always fixed, so the picture may shake (viewpoint jitter); the same person's appearance changes with light and shadow (lighting conditions); and different cameras differ greatly in picture definition depending on lens quality (resolution). These problems must be considered throughout the field of action recognition, and indeed image processing in general.
The second is the influence of subjective conditions. As the subject of action recognition, different people have their own definitions and understandings of the same action, and even the same person performs an action with slight variations. Concretely, when different people perform the same action, differences in its length, amplitude and pauses produce many differences across the whole image sequence. Beyond the performer's execution, people differ in body shape and build with age and sex, and the distance to the camera and the angle of facing it also create large differences between recorded actions. Each of these factors increases the diversity of the data. Meanwhile, to implement an action recognition algorithm and provide concrete interfaces and applications for different industries and scenarios, not only its accuracy but also other constraints, such as cost and real-time operation, must be considered.
An action recognition algorithm generally takes sensor output as raw input data and performs classification and judgment of actions through stages such as preprocessing, feature calculation and a classification model. Conventional methods generally use an ordinary RGB camera for input, but as various new sensors have appeared, more and more kinds are applied to action recognition, such as depth cameras, infrared cameras and acceleration sensors. New sensors bring new input data to action recognition methods, and a number of model fusion methods have emerged. Unlike a conventional RGB image, a depth map records for each pixel not a color value but its distance from the camera. Because of this distance information, research and algorithms based on depth maps have gained increasing attention.
Reference 1 discloses an action recognition method that takes a depth map as input and, according to its distance information, projects it onto the three orthogonal planes of a coordinate system: the front view, the side view and the top view. It proposes a new feature map, the depth energy map; the HOG features of the depth energy maps under the different views are then computed and fed into an SVM classifier for prediction. Because the method merges the whole depth video sequence directly into one depth energy map, it neither accounts for the overlapping and redundant information between the earlier and later parts of an action nor considers how the human posture changes over the whole action. In a video in which several different actions occur one after another, the energy maps of the individual actions cannot be accurately separated and generated, so the actions cannot be recognized (front-and-back multi-action video recognition); likewise, in online recognition the depth energy map cannot be synthesized because no end frame can be chosen, that is, the real-time requirement cannot be met.
Reference 2 discloses an action recognition method that likewise projects the depth map onto three coordinate planes and computes the corresponding depth energy maps, then introduces another feature operator, LBP, as a high-level feature. After the LBP features of the depth energy maps are computed, an improved extreme learning machine model performs the recognition. This method also condenses the whole video sequence into one depth energy map, ignores the internal relation between earlier and later postures, and cannot meet the requirements of front-and-back multi-action video recognition, online recognition and real-time operation.
Reference 3 discloses an action recognition method that projects the depth map onto three view maps. Unlike References 1 and 2, it computes not only a depth energy map representing the distance change over the whole video but also a history trail map of the depth-distance active region, taking the order in which postures appear into account; it further adds a static posture map and an average energy map to enrich the feature input. However, the method does not address the problem that earlier historical poses in the sequence are covered by later ones, so the first half of some actions is overwritten by the second half and much information is lost. Although the situation before and after a posture is considered to some extent, interference from redundant movements is not; and although the computation of the static region is added, only the absolute value of the motion energy is used, ignoring its positive and negative directions. Like References 1 and 2, Reference 3 cannot meet the requirements of front-and-back multi-action video recognition, online recognition and real-time operation.
Reference 1: yang, Xiaoodong, C.Zhang, and Y.L.Tian. "recording action received maps-based programs of oriented programs" ACMINETIONAL Conference on Multimedia 2012: 1057-.
Reference two: chen, Chen, R.Jafari, and N.Kehtarnavaz. "Action recognitions from Depth Sequences Using Depth Motion Maps-Based Local Binary patterns." Applications of Computer Vision IEEE 2015: 1092-.
Reference three: liang, Bin, and L.Zheng. "3D Motion Trail Model Based pyrad histograms of organized Gradient for Action Recognition." International conference on Pattern Recognition IEEE Computer Society,2014: 1952-.
Disclosure of Invention
To solve the above problems in the prior art, the invention provides an action recognition method based on an SEHM feature map sequence; the method supports both offline recognition and online recognition and has good real-time performance.
To achieve this purpose, the technical solution is as follows:
An action recognition method based on an SEHM feature map sequence comprises the following steps:
S1. For a depth map sequence spanning a selected time period of N frames in a video, project the depth map of each frame onto the three orthogonal planes of a coordinate system to obtain three orthogonal view maps: the front view, the side view and the top view;
S2. For the depth map sequence under each view map, compute the difference between every two adjacent frames as an energy map, each frame of which represents the distance change between the previous and the next frame; then split each energy map into three state maps according to its specific values and a set threshold: a forward-state binary map, a backward-state binary map, or a static binary map. Specifically:

$$EM_t^{v,i}(x,y)=\begin{cases}1, & i=1 \text{ and } E_t^v(x,y)>\varepsilon \\ 1, & i=2 \text{ and } E_t^v(x,y)<-\varepsilon \\ 1, & i=3 \text{ and } \left|E_t^v(x,y)\right|\le\varepsilon \\ 0, & \text{otherwise}\end{cases}$$

where $E_t^v=\mathrm{Map}^v_{t+1}-\mathrm{Map}^v_t$ is the energy map of the t-th frame under view map v; ε is the set threshold; |·| denotes the absolute value of the difference of the next frame minus the previous frame; i = 1, 2, 3 index the forward-state, backward-state and static binary maps respectively; the three state maps of the t-th frame are together represented by a three-channel matrix EM_t;
S3. After step S2 is executed, state map sequences under the three view maps are obtained; divide the N-frame state map sequence of each view map evenly into S time slices in temporal order, where S = N/K and K is the length of each time slice; for the state map sequence under each view map, select the state map sequence of one time slice at a time, from front to back, and compute its SEHM feature map:
S31. Assuming the state map sequence of the time slice selected for the p-th computation starts at frame (p-1)·K+1 of the N-frame state map sequence and ends at frame p·K, the SEHM feature map of this time slice is computed by the following formula together with step S32:

$$\mathrm{SEHM}_p=\max\left(\mathrm{SEHM}_p,\ EM_{(p-1)K+k}\cdot k\right)$$

where k has an initial value of 1 and SEHM_p is a three-channel matrix initialized to zero;

S32. Let k = k + 1 and execute the formula of step S31 again, until k > K; finally, output the normalized SEHM_p as the SEHM feature map of the p-th selected time slice;
S4. Steps S31 and S32 yield the SEHM feature map of every time slice under the three view maps;
S5. Fuse the SEHM feature maps of corresponding time slices under the three view maps to obtain one fused SEHM feature map per time slice;
S6. The fused SEHM feature maps of the time slices form an SEHM feature map sequence; input this sequence into a neural network, which outputs a probability vector P representing the likelihood of each action, and determine the action recognition result of the current N-frame depth map sequence according to the output probability vector P.
In the above solution, action recognition is performed on the basis of the SEHM feature map. By reasonably choosing parameters such as the time-slice length, computing the corresponding SEHM feature map sequence and feeding it to a neural network for prediction, both offline recognition and online recognition are achieved. Because the constructed SEHM feature map reflects the change of the overall posture over the course of an action, the information contained in that change is fully used and recognition accuracy improves. Meanwhile, the raw data are compressed when the SEHM feature maps are computed, so the complexity of the method and its hardware requirements are low, and online real-time action recognition can be achieved.
Preferably, SEHM feature maps are also computed over the full N-frame state map sequence under each of the three view maps, and the results are fused into a global SEHM feature map; in step S6, the global SEHM feature map together with the per-slice SEHM feature maps forms the SEHM feature map sequence input into the neural network for action recognition. This setting takes the action characteristics of the whole time period into account and further improves recognition accuracy.
Preferably, in step S1, the N-frame depth map sequence used for action recognition is selected through a sliding window, the sliding window having a stride value m that indicates the time length from the starting frame of one selected depth map sequence to the starting frame of the next. In this way a video segment yields several time periods of N frames, and the model gives a prediction for each of them.
Preferably, ε = 30.
Preferably, K = 10.
Preferably, N = 80.
Preferably, in step S5, the SEHM feature maps of corresponding time slices under the front view, the side view and the top view are fused in a ratio of 2:1:1.
Preferably, the neural network comprises a convolutional layer, a pooling layer, an LSTM layer, a fully connected layer and a Softmax layer;
wherein the convolutional layer and the pooling layer extract high-level features from the SEHM feature map sequence;
the LSTM layer performs context processing on the extracted high-level features of the feature map sequence and outputs high-level features that carry temporal information and give a better recognition effect;
the fully connected layer and the Softmax layer receive the high-level features output by the LSTM layer, or by the convolutional and pooling layers, and output a prediction probability vector P.
Preferably, the probability vector P comprises probabilities p_i, where p_i denotes the probability that the motion is recognized as action i;
the process of determining the action recognition result in step S6 is as follows:
set a threshold ρ between 0 and 1; if no action's probability in the probability vector P is greater than ρ, the action in the N-frame depth map sequence is regarded as a meaningless action; otherwise, output the action with the largest recognition probability as the recognition result.
Preferably, ρ = 0.5.
Compared with the prior art, the invention has the following beneficial effects:
The action recognition method performs recognition on the basis of the SEHM feature map. By reasonably choosing parameters such as the time-slice length, computing the corresponding SEHM feature map sequence and feeding it to a neural network for prediction, both offline recognition and online recognition are achieved. Because the constructed SEHM feature map reflects the change of the overall posture over the course of an action, the information contained in that change is fully used and recognition accuracy improves. Meanwhile, the raw data are compressed when the SEHM feature maps are computed, so the complexity of the method and its hardware requirements are low, and online real-time action recognition can be achieved.
Drawings
Fig. 1 is an exploded view of the SEHM feature map sequence of a hand-waving action.
Fig. 2 is a block diagram of the overall neural network with an LSTM layer used in the embodiment.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
Different people have their own definitions and understandings of the same action; the most obvious manifestation is the difference in action length. Even the same person performs the same action differently at different times for subjective reasons. Most existing methods simply merge the whole depth video sequence into one new feature map. This loses a large part of the spatio-temporal information in the sequence; postures with large overlap, such as one hand crossing in front of the body, lose a particularly large amount of information. To reduce this loss, the invention proposes the SEHM feature map (segment energy history map).
For a depth video sequence whose time period is N frames, the depth map of each frame is projected onto three orthogonal view maps (Map_f, Map_s, Map_t): the front view, the side view and the top view. Energy map calculation is then performed on the depth map sequence of each view: for each view's sequence, the invention computes the difference between every two adjacent frames (the next frame minus the previous frame) as an energy map, each of which represents the distance change between the two frames. According to the specific values of each energy map, it is divided by a threshold into binary maps of three states: forward, backward and static. The method comprises the following specific steps:
$$EM_t^{v,i}(x,y)=\begin{cases}1, & i=1 \text{ and } E_t^v(x,y)>\varepsilon \\ 1, & i=2 \text{ and } E_t^v(x,y)<-\varepsilon \\ 1, & i=3 \text{ and } \left|E_t^v(x,y)\right|\le\varepsilon \\ 0, & \text{otherwise}\end{cases}$$

where $E_t^v=\mathrm{Map}^v_{t+1}-\mathrm{Map}^v_t$ is the energy map of the t-th frame under view map v; ε is the set threshold; |·| denotes the absolute value of the difference of the next frame minus the previous frame; i = 1, 2, 3 index the forward-state, backward-state and static binary maps respectively; the three state maps of the t-th frame are together represented by a three-channel matrix EM_t.
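As a concrete illustration of this projection and state-map computation, the following is a minimal sketch in Python, assuming each depth frame is an H x W array of distances in millimeters; the function names, the 4000 mm depth range and the binary occupancy used for the side and top views are assumptions of this sketch rather than details fixed by the invention.

```python
import numpy as np

def project_three_views(depth, max_depth=4000):
    """Project one depth frame onto the front, side and top view planes."""
    h, w = depth.shape
    front = depth.astype(np.float32)              # front view: the depth map itself
    side = np.zeros((h, max_depth), np.float32)   # side view: (y, depth) plane
    top = np.zeros((max_depth, w), np.float32)    # top view: (depth, x) plane
    ys, xs = np.nonzero(depth)                    # pixels belonging to the subject
    ds = np.clip(depth[ys, xs].astype(int), 0, max_depth - 1)
    side[ys, ds] = 1
    top[ds, xs] = 1
    return front, side, top

def state_maps(prev, curr, eps=30):
    """Split the energy map of two adjacent view-map frames into the
    forward, backward and static binary maps of the three-channel EM_t."""
    energy = curr.astype(np.int32) - prev.astype(np.int32)  # next minus previous
    forward = (energy > eps).astype(np.uint8)
    backward = (energy < -eps).astype(np.uint8)
    static = (np.abs(energy) <= eps).astype(np.uint8)
    return np.stack([forward, backward, static], axis=-1)
```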
through calculation, energy map sequences under different viewing angles can be obtained. But the energy map cannot be applied directly to the neural network as input data because:
1. millions of data sets are often used for image recognition to obtain a good effect; the length of some simple actions generally has tens to hundreds of frames of pictures, and each main body is a person with similar appearance; compared with image recognition, motion recognition requires far more data sets to achieve similar effects than the former. Therefore, if each frame of data is used as input, a larger data set is required to obtain a more appreciable result when training the model.
2. Because the LSTM layer in the neural network needs to consider the context of all input sequences, if each frame in the video is taken as an input unit, the selected time period is appropriate but the calculation amount is large and the requirement on hardware is high; or the selected time period is too short to influence the training result of the model.
In summary, the invention performs appropriate compression and combination, i.e. SEHM feature map calculation, on the original depth sequence.
Once all the energy map sequences of each view in the current time period have been calculated, the per-frame energy maps can be synthesized into SEHM feature maps. To keep the algorithm real-time, the characteristics of the motion data must be weighed and suitable values of N and K chosen for the SEHM calculation. Meanwhile, to achieve front-and-back multi-action video recognition, online recognition and real-time operation, the video is divided into several time periods by a sliding window and each is recognized separately. For example, if a video is 120 frames long, the time-period length is 80 frames and the sliding window is 40 frames, SEHM feature map sequences are computed for frames 1 to 80 and frames 41 to 120 respectively, yielding the SEHM feature map sequences of two time periods; the neural network model then gives a recognition result for each period, enabling functions such as online recognition.
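A minimal sketch of this sliding-window segmentation, reproducing the 120-frame example above (the function name and the 1-based, inclusive indexing are choices of this illustration):

```python
def sliding_windows(num_frames, n=80, stride=40):
    """Yield the (start, end) frame range of each N-frame time period."""
    start = 1
    while start + n - 1 <= num_frames:
        yield start, start + n - 1
        start += stride

print(list(sliding_windows(120)))  # [(1, 80), (41, 120)] as in the example
```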
For an energy map sequence whose time period is N frames, the N-frame state map sequences of the three view maps are each divided evenly into S time slices in temporal order, where S = N/K and K is the length of each time slice; for the state map sequence under each view map, the state map sequence of one time slice is selected at a time, from front to back, and its SEHM feature map is computed:

S31. Assuming the state map sequence of the time slice selected for the p-th computation starts at frame (p-1)·K+1 of the N-frame state map sequence and ends at frame p·K, the SEHM feature map of this time slice is computed by the following formula together with step S32:

$$\mathrm{SEHM}_p=\max\left(\mathrm{SEHM}_p,\ EM_{(p-1)K+k}\cdot k\right)$$

where k has an initial value of 1 and SEHM_p is a three-channel matrix initialized to zero;

S32. Let k = k + 1 and execute the formula of step S31 again, until k > K; finally, output the normalized SEHM_p as the SEHM feature map of the p-th selected time slice.
after the feature maps for the time slices are computed, a global SEHM feature map for the entire time slice is similarly computed. The global SEHM feature map starts at the first frame of the time period and ends at the last frame. Through the above operations, the SEHM feature map is compressed and retains key pose information in the video. For general speed and complexity actions, N80 and K10 may be considered.
To obtain the final recognition result, the SEHM feature maps of the three views must be fused. Considering the neural network's ability to handle the local and global relationships within a picture, the invention combines the SEHM feature maps of each corresponding time slice, or the global SEHM feature maps, of the three views into a final SEHM feature map in a 2:1:1 ratio. The final SEHM feature maps are then passed to the neural network for feature extraction. Fig. 1 shows the composition of the combined final SEHM feature map sequence.
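The invention fixes only the 2:1:1 ratio; the exact layout of the combined map is shown in Fig. 1 but not restated in the text. The sketch below assumes the ratio is an area ratio, with the front view filling the left half of a 227 x 227 canvas (AlexNet's input size) and the side and top views stacked in the right half:

```python
import numpy as np
import cv2  # OpenCV, assumed available for resizing

def fuse_views(front, side, top, size=(227, 227)):
    """Tile the three per-view SEHM feature maps into one final map
    whose areas follow the front : side : top = 2 : 1 : 1 ratio."""
    h, w = size
    canvas = np.zeros((h, w, 3), np.float32)
    canvas[:, : w // 2] = cv2.resize(front, (w // 2, h))             # half the area
    canvas[: h // 2, w // 2 :] = cv2.resize(side, (w // 2, h // 2))  # one quarter
    canvas[h // 2 :, w // 2 :] = cv2.resize(top, (w // 2, h // 2))   # one quarter
    return canvas
```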
For a pattern recognition method, apart from extracting features from the raw data, the algorithm model is the most important part. Because the SEHM feature map sequence has been compressed and is ordered in time, a model that handles ordered input, such as an LSTM (long short-term memory) layer, can work well. LSTM has achieved great success in the natural language and speech domains, and in recent years has also been carried over to the image domain.
A deep neural network can be chosen to pre-train the model, since it performs better when the dataset is larger. AlexNet is an image recognition model for RGB images, and "person" is one of its recognition classes. Since the SEHM feature map is a three-channel feature map that clearly carries the contour features of the human body, retraining on SEHM feature maps with the parameters of AlexNet's convolution and pooling layers as initial values can be expected to give better results. Using the convolution and pooling structure of AlexNet as the front section of the network and attaching an LSTM layer to the back section both accelerates the training of the front section and improves accuracy. The overall structure of the neural network model is shown in Fig. 2.
As Fig. 2 shows, both the global SEHM feature map and the SEHM feature map sequence pass through the convolution and pooling layers to extract high-level features. The difference is that the SEHM feature map sequence, because it carries context information, is further processed by the LSTM layer to yield better high-level features, whereas the global SEHM feature map, whose information already covers the whole time period, does not need to pass through the LSTM layer. Finally, the high-level features are fed into the fully connected layer and the Softmax layer to obtain a probability vector P, in which each item p_i represents the probability of being judged as class i.
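A structural sketch of the Fig. 2 network in PyTorch; the layer sizes, the reduced two-block convolution stack standing in for AlexNet's convolution and pooling layers, and the use of the last LSTM output are illustrative assumptions, not parameters given by the invention:

```python
import torch
import torch.nn as nn

class SehmNet(nn.Module):
    def __init__(self, num_actions, feat_dim=256, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for the AlexNet conv/pool stack
            nn.Conv2d(3, 64, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(192, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden + feat_dim, num_actions)

    def forward(self, slice_seq, global_map):
        # slice_seq: (B, S, 3, H, W) fused per-slice SEHM maps;
        # global_map: (B, 3, H, W) fused global SEHM map.
        b, s = slice_seq.shape[:2]
        feats = self.backbone(slice_seq.flatten(0, 1)).view(b, s, -1)
        seq_feat = self.lstm(feats)[0][:, -1]  # sequence features pass the LSTM
        glob_feat = self.backbone(global_map)  # the global map skips the LSTM
        return torch.softmax(self.fc(torch.cat([seq_feat, glob_feat], 1)), dim=1)
```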
For the probability vector P of a certain time period, a threshold ρ between 0 and 1 may be defined: if no class probability p_i in the vector is greater than ρ, the action in that time period is regarded as meaningless; otherwise, the class with the largest p_i is taken as the predicted action.
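A sketch of this decision rule with the preferred ρ = 0.5 (the function name and the None return for a meaningless action are choices of this illustration):

```python
def decide(p, rho=0.5):
    """Return the index of the predicted action, or None if every
    probability in the vector is at most rho (meaningless action)."""
    best = max(range(len(p)), key=lambda i: p[i])
    return best if p[best] > rho else None

print(decide([0.1, 0.7, 0.2]))  # 1
print(decide([0.4, 0.3, 0.3]))  # None
```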
It should be understood that the above embodiments are merely examples given to illustrate the invention clearly and are not intended to limit its implementations. Other variations and modifications will be apparent to persons skilled in the art in light of the above description, and it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims.
Claims (8)
1. An action recognition method based on an SEHM feature map sequence, characterized by comprising the following steps:
S1. For a depth map sequence spanning a selected time period of N frames in a video, project the depth map of each frame onto the three orthogonal planes of a coordinate system to obtain three orthogonal view maps: the front view, the side view and the top view;
S2. For the depth map sequence under each view map, compute the difference between every two adjacent frames as an energy map, each frame of which represents the distance change between the previous and the next frame; then split each energy map into three state maps according to its specific values and a set threshold: a forward-state binary map, a backward-state binary map, or a static binary map, specifically:

$$EM_t^{v,i}(x,y)=\begin{cases}1, & i=1 \text{ and } E_t^v(x,y)>\varepsilon \\ 1, & i=2 \text{ and } E_t^v(x,y)<-\varepsilon \\ 1, & i=3 \text{ and } \left|E_t^v(x,y)\right|\le\varepsilon \\ 0, & \text{otherwise}\end{cases}$$

where $E_t^v=\mathrm{Map}^v_{t+1}-\mathrm{Map}^v_t$ is the energy map of the t-th frame under view map v; ε is the set threshold; |·| denotes the absolute value of the difference of the next frame minus the previous frame; i = 1, 2, 3 index the forward-state, backward-state and static binary maps respectively; the three state maps of the t-th frame are together represented by a three-channel matrix EM_t;
S3. After step S2 is executed, state map sequences under the three view maps are obtained; divide the N-frame state map sequence of each view map evenly into S time slices in temporal order, where S = N/K and K is the length of each time slice; for the state map sequence under each view map, select the state map sequence of one time slice at a time, from front to back, and compute its SEHM feature map:
S31. Assuming the state map sequence of the time slice selected for the p-th computation starts at frame (p-1)·K+1 of the N-frame state map sequence and ends at frame p·K, the SEHM feature map of this time slice is computed by the following formula together with step S32:

$$\mathrm{SEHM}_p=\max\left(\mathrm{SEHM}_p,\ EM_{(p-1)K+k}\cdot k\right)$$

where k has an initial value of 1 and SEHM_p is a three-channel matrix initialized to zero;

S32. Let k = k + 1 and execute the formula of step S31 again, until k > K; finally, output the normalized SEHM_p as the SEHM feature map of the p-th selected time slice;
S4. Steps S31 and S32 yield the SEHM feature map of every time slice under the three view maps;
S5. Fuse the SEHM feature maps of corresponding time slices under the three view maps to obtain one fused SEHM feature map per time slice;
S6. The fused SEHM feature maps of the time slices form an SEHM feature map sequence; input this sequence into a neural network, which outputs a probability vector P representing the likelihood of each action, and determine the action recognition result of the current N-frame depth map sequence according to the output probability vector P;
wherein SEHM feature maps are also computed over the full N-frame state map sequence under each of the three view maps and fused into a global SEHM feature map; in step S6, the global SEHM feature map together with the per-slice SEHM feature maps forms the SEHM feature map sequence that is input into the neural network for action recognition.
2. The method of claim 1 for action recognition based on an SEHM feature map sequence, wherein: in step S1, the N-frame depth map sequence used for action recognition is selected through a sliding window, the sliding window having a stride value m that indicates the time length from the starting frame of one selected depth map sequence to the starting frame of the next.
3. The method of claim 1 for action recognition based on an SEHM feature map sequence, wherein: ε = 30.
4. The method of claim 1 for action recognition based on an SEHM feature map sequence, wherein: K = 10.
5. The method of claim 1 for action recognition based on an SEHM feature map sequence, wherein: N = 80.
6. The method of claim 1 for action recognition based on an SEHM feature map sequence, wherein: in step S5, the SEHM feature maps of corresponding time slices under the front view, the side view and the top view are fused in a ratio of 2:1:1.
7. The method of claim 1 for action recognition based on an SEHM feature map sequence, wherein: the neural network comprises a convolutional layer, a pooling layer, an LSTM layer, a fully connected layer and a Softmax layer;
wherein the convolutional layer and the pooling layer extract high-level features from the SEHM feature map sequence;
the LSTM layer performs context processing on the extracted high-level features of the feature map sequence and outputs high-level features that carry temporal information and give a better recognition effect;
the fully connected layer and the Softmax layer receive the high-level features output by the LSTM layer, or by the convolutional and pooling layers, and output a prediction probability vector P.
8. The method of claim 1 for action recognition based on an SEHM feature map sequence, wherein: the probability vector P comprises probabilities p_i, where p_i denotes the probability that the motion is recognized as action i;
the process of determining the action recognition result in step S6 is as follows:
set a threshold ρ between 0 and 1; if no action's probability in the probability vector P is greater than ρ, the action in the N-frame depth map sequence is regarded as a meaningless action; otherwise, output the action with the largest recognition probability as the recognition result.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201611110573.5A CN106778576B (en) | 2016-12-06 | 2016-12-06 | Motion recognition method based on an SEHM feature map sequence
Publications (2)
Publication Number | Publication Date |
---|---|
CN106778576A CN106778576A (en) | 2017-05-31 |
CN106778576B true CN106778576B (en) | 2020-05-26 |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107944376A (en) * | 2017-11-20 | 2018-04-20 | 北京奇虎科技有限公司 | The recognition methods of video data real-time attitude and device, computing device |
CN110633004B (en) * | 2018-06-21 | 2023-05-26 | 杭州海康威视数字技术股份有限公司 | Interaction method, device and system based on human body posture estimation |
CN109002780B (en) * | 2018-07-02 | 2020-12-18 | 深圳码隆科技有限公司 | Shopping flow control method and device and user terminal |
CN110138681B (en) * | 2019-04-19 | 2021-01-22 | 上海交通大学 | Network flow identification method and device based on TCP message characteristics |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103886293A (en) * | 2014-03-21 | 2014-06-25 | 浙江大学 | Human body behavior recognition method based on history motion graph and R transformation |
CN104636725A (en) * | 2015-02-04 | 2015-05-20 | 华中科技大学 | Gesture recognition method based on depth image and gesture recognition system based on depth images |
CN105608421A (en) * | 2015-12-18 | 2016-05-25 | 中国科学院深圳先进技术研究院 | Human movement recognition method and device |
CN105631415A (en) * | 2015-12-25 | 2016-06-01 | 中通服公众信息产业股份有限公司 | Video pedestrian recognition method based on convolution neural network |
CN105740773A (en) * | 2016-01-25 | 2016-07-06 | 重庆理工大学 | Deep learning and multi-scale information based behavior identification method |
Non-Patent Citations (4)

Title
---
Edel, Marcus, et al. "Binarized-BLSTM-RNN Based Human Activity Recognition." 2016 International Conference on Indoor Positioning and Indoor Navigation (IPIN), 2016: 1-7.
Liang, Bin, and L. Zheng. "3D Motion Trail Model Based Pyramid Histograms of Oriented Gradient for Action Recognition." 2014 22nd International Conference on Pattern Recognition, 2014: 1952-1957.
Yang, Rui, et al. "DMM-Pyramid Based Deep Architectures for Action Recognition with Depth Cameras." ACCV 2014: 37-49.
Yang, Xiaodong, et al. "Recognizing Actions Using Depth Motion Maps-Based Histograms of Oriented Gradients." ACM International Conference on Multimedia (MM '12), 2012: 1057-1060.
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant