CN110135249A - Human behavior recognition method based on temporal attention mechanism and LSTM - Google Patents
Human behavior recognition method based on temporal attention mechanism and LSTM
- Publication number: CN110135249A
- Application number: CN201910271178.2A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
Abstract
The present invention provides a human behavior recognition method based on a temporal attention mechanism and LSTM, comprising the steps of: 1) acquiring video data from an RGB monocular vision sensor; 2) extracting 2D skeleton joint point data; 3) extracting combined structural features of the joint points; 4) constructing an LSTM long short-term memory network; 5) adding a temporal attention mechanism to the LSTM network; 6) performing human behavior recognition with a softmax classifier. The invention improves the generality and real-time performance of vision-based behavior recognition systems, as well as the recognition accuracy for complex actions.
Description
Technical field
The present invention relates to the technical field of human behavior recognition, and in particular to a human behavior recognition method based on a temporal attention mechanism and LSTM.
Background art
In recent years, human behavior recognition technology has found wide application in industry and daily life. On the one hand, the development of smart homes places higher demands on robots' recognition and understanding of human actions; on the other hand, the transition of industry toward intelligent manufacturing has made human behavior recognition widely used in fields such as human-computer interaction and human-robot collaboration with industrial robots. In addition, with the development of video media and the spread of visual sensors, human behavior recognition plays an important role in telemedicine, home monitoring, urban security surveillance, and similar areas. RGB+D video, because of the rich information it contains, has become a hot spot of current behavior recognition research.
At present, research on human behavior recognition mainly uses vision-based sensors and deep neural network methods, but it still faces the following problems:
1. Poor generality of depth vision sensors: although behavior recognition methods based on RGB+D video achieve high accuracy in laboratory settings, depth vision sensors suffer from poor real-time performance, low resolution, high cost, and short recognition range, making them difficult to popularize in real life.
2. Poor real-time performance of RGB video behavior recognition systems: video contains a large amount of information, which supplies sufficient usable cues for behavior recognition but also brings a large amount of redundancy, reducing the system's operating speed and causing long delays in practical applications.
3. Low recognition accuracy for complex backgrounds and complex actions: most current behavior recognition methods feed the video sequence into a deep neural network for feature extraction, but they ignore the different contributions that individual frames make to action classification. Lacking attention to key information, such systems see their recognition accuracy drop on complex actions.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a human behavior recognition method based on a temporal attention mechanism and LSTM, with higher recognition accuracy and stronger generality. It builds a deep neural network model based on an RGB monocular vision sensor to improve the generality of vision-based behavior recognition systems; it extracts 2D skeleton joint points from the RGB video stream and proposes a structural feature extraction method based on skeleton joint points, improving the processing speed, and thus the real-time performance, of the behavior recognition system by reducing video redundancy; and it proposes an LSTM (long short-term memory) model combined with a temporal attention mechanism to improve the accuracy of behavior recognition.
To achieve the above object, the technical solution provided by the present invention is a human behavior recognition method based on a temporal attention mechanism and LSTM, comprising the following steps:
1) acquire video data from an RGB monocular vision sensor;
2) extract 2D skeleton joint point data;
3) extract combined structural features of the joint points;
4) construct an LSTM long short-term memory network;
5) add a temporal attention mechanism to the LSTM network;
6) perform human behavior recognition with a softmax classifier.
In step 1), the video data of the RGB monocular vision sensor is acquired as follows:
1.1) install the RGB monocular vision sensor in the monitored area and acquire data in real time;
1.2) connect the server to the front-end codec and download the real-time video data through a streaming media protocol;
1.3) transfer the acquired video to the server's storage device over an iSCSI IP connection;
1.4) preprocess the acquired video data and send it to the joint point extraction module for processing.
In step 2), the 2D skeleton joint point data is extracted as follows:
2.1) split the video into segments of 10 seconds each;
2.2) after reading each image, resize it to 368*368;
2.3) call the OpenPose framework and feed the resized images into its CNN to extract part confidence maps and part affinity fields;
2.4) build a list storing the 18 joint points detected in each image;
2.5) use bipartite matching to find the part associations and connect the joint points into a complete human skeleton.
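The fixed-duration segmentation of step 2.1) can be sketched as follows; the frame rate of 30 fps is an assumption for illustration, since the patent only fixes the 10-second duration:

```python
def segment_frames(frames, fps=30, seconds=10):
    """Split a decoded frame sequence into fixed-length clips (step 2.1).
    fps=30 is an assumption; the patent only fixes the 10-second duration."""
    clip_len = fps * seconds
    n_clips = len(frames) // clip_len          # drop a trailing partial clip
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

# e.g. 650 frames at 30 fps yield two complete 10-second clips of 300 frames
clips = segment_frames(list(range(650)))
```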
In step 3), the combined structural features of the joint points are extracted as follows:
3.1) define the acquired 2D skeleton joint point coordinates as:
p_i(x, y)
3.2) define the set of extracted 2D skeleton joint points as the vector J, expressed as:
J = {p_1, p_2, ..., p_18}
3.3) normalize the bone vector between two joint points; the normalized vector is computed as:
B_i,j = (p_i - p_j) / ||p_i - p_j||
where p_i and p_j denote two adjacent joint points, and ||p_i - p_j|| is the Euclidean distance between them, computed as:
||p_i - p_j|| = sqrt((x_i - x_j)^2 + (y_i - y_j)^2)
3.4) compute the bone vector features, i.e., the bone vectors formed by connecting adjacent joint points; four upper-limb and four lower-limb bone vectors are selected for this embodiment, and according to the joint point definition rule, the bone vector feature set S is defined as:
S = {B_2,3, B_3,4, B_5,6, B_6,7, B_8,9, B_9,10, B_11,12, B_12,13}
3.5) compute the bone angle features, using the angle between the left wrist, left shoulder and left hip and the angle between the right wrist, right shoulder and right hip as skeletal space angles; θ_i,j is defined as the angle of joint points p_i and p_j projected onto the XY plane. The bone angle feature set θ is defined as:
θ = (θ_4,8, θ_2,8, θ_5,11, θ_7,11)
3.6) compute the bone length feature; bone length is selected as a bias term describing the global scale of the human skeleton, using the spine vectors, i.e., the distances from the neck node to the left-hip and right-hip nodes, as the bone length feature D:
D = D_1,8 + D_1,11
where, if joint point i is connected to joint point j,
D_i,j = ||p_i - p_j||
3.7) compute the combined structural feature of the skeleton joint points by linearly concatenating the bone vector features, bone angle features and bone length feature:
Feature = {S, θ, D}.
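The feature construction of steps 3.1) to 3.7) can be sketched in NumPy as follows. This is an illustrative reading, not the patent's code: the joint indices assume the 18-point OpenPose layout referenced in Fig. 2, and taking arctan2 of the projected bone is one plausible interpretation of the angle θ_i,j, whose exact formula the patent leaves to the drawings:

```python
import numpy as np

# Assumed joint-index pairs mirroring the sets S, theta and D in the text
BONE_PAIRS = [(2, 3), (3, 4), (5, 6), (6, 7),        # four upper-limb bones
              (8, 9), (9, 10), (11, 12), (12, 13)]   # four lower-limb bones
ANGLE_PAIRS = [(4, 8), (2, 8), (5, 11), (7, 11)]     # wrist/shoulder vs. hip
LENGTH_PAIRS = [(1, 8), (1, 11)]                     # neck to left/right hip

def structural_feature(joints: np.ndarray) -> np.ndarray:
    """joints: (18, 2) array of 2D joint coordinates p_i(x, y)."""
    # Normalized bone vectors B_ij = (p_i - p_j) / ||p_i - p_j||
    bones = []
    for i, j in BONE_PAIRS:
        v = joints[i] - joints[j]
        bones.append(v / (np.linalg.norm(v) + 1e-8))
    # Bone angles theta_ij: angle of the bone projected onto the XY plane
    angles = []
    for i, j in ANGLE_PAIRS:
        dx, dy = joints[j] - joints[i]
        angles.append(np.arctan2(dy, dx))
    # Bone length D = D_{1,8} + D_{1,11}: spine length as a global scale bias
    length = sum(np.linalg.norm(joints[i] - joints[j]) for i, j in LENGTH_PAIRS)
    # Linear concatenation Feature = {S, theta, D}
    return np.concatenate([np.ravel(bones), angles, [length]])
```

The result is a flat 21-dimensional vector (eight 2D bone vectors, four angles, one length) per frame, which is what the LSTM of step 4) would consume.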
In step 4), the LSTM long short-term memory network is constructed as follows:
Inside a hidden-layer unit, the topmost horizontal state line passes the hidden unit state from one time step to the next, involving only a few linear transformations;
The LSTM contains three "gate" structures: the input gate i_t, the forget gate f_t and the output gate o_t. Each gate consists of a sigmoid function and an element-wise multiplication, so that the hidden unit remembers as much useful information as possible and discards useless information;
The computation inside an LSTM hidden unit is as follows. In the forget gate, W_f denotes the forget weight of the input vector and b_f the forget bias:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
In the input gate, W_i denotes the update weight of the input vector and b_i the update bias:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C is the cell state of the hidden unit, updated as:
C_t = f_t * C_{t-1} + i_t * tanh(W_C · [h_{t-1}, x_t] + b_C)
In the output gate, W_o is the output weight of the input vector and b_o the output bias:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Finally the output h_t is computed:
h_t = o_t * tanh(C_t)
where x is the input, h is the output, h_{t-1} is the output at time t-1, and x_t is the input at time t.
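The gate equations above can be exercised with a minimal NumPy sketch of one LSTM step; the weight shapes and random initialization are illustrative assumptions, not the patent's trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step implementing the gate equations of step 4).
    W maps gate name to a weight matrix over the concatenation [h_{t-1}, x_t];
    b maps gate name to a bias vector."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])                        # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])                        # input gate
    C_t = f_t * C_prev + i_t * np.tanh(W['C'] @ z + b['C'])   # cell state
    o_t = sigmoid(W['o'] @ z + b['o'])                        # output gate
    h_t = o_t * np.tanh(C_t)                                  # hidden output
    return h_t, C_t

# Usage with random weights: hidden size 4, input size 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 7)) for k in 'fiCo'}
b = {k: np.zeros(4) for k in 'fiCo'}
h, C = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, b)
```

Because o_t lies in (0, 1) and tanh(C_t) in (-1, 1), each component of the output h_t is bounded in magnitude by 1, which is what keeps the recurrence numerically stable across long sequences.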
In step 5), the temporal attention mechanism is added to the LSTM network as follows:
5.1) input the context information c and the representation y_i of each part of the current data;
5.2) use a tanh layer to compute m_1, m_2, ..., m_n, aggregating each y_i with c; letting W_cm be the weight of c and W_ym the weight of y_i, m_i is computed as:
m_i = tanh(W_cm·c + W_ym·y_i)
5.3) compute the weight of each aggregated value with the softmax function:
s_i = exp(m_i) / Σ_j exp(m_j)
where s_i is the softmax value of m_i projected onto the learned direction, so the softmax can be regarded as the relevance obtained from the context c;
5.4) compute the weighted average of all y_i as the output value z, the weights expressing the relevance of each variable to the context c:
z = Σ_i s_i·y_i.
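Steps 5.1) to 5.4) amount to a small attention head over the per-frame representations. The sketch below is one plausible reading of the text (the dimensions and the learned projection direction w used to turn each vector m_i into a scalar score are assumptions):

```python
import numpy as np

def temporal_attention(Y, c, W_ym, W_cm, w):
    """Y: (n, d) frame representations y_i; c: (k,) context vector.
    m_i = tanh(W_cm c + W_ym y_i), projected onto a learned direction w,
    then s = softmax(scores) and z = sum_i s_i y_i (steps 5.2 to 5.4)."""
    M = np.tanh(Y @ W_ym.T + c @ W_cm.T)   # (n, a) aggregated vectors m_i
    scores = M @ w                          # project onto learned direction
    e = np.exp(scores - scores.max())       # numerically stable softmax
    s = e / e.sum()                         # attention weights s_i
    z = s @ Y                               # weighted average of the y_i
    return z, s

# Usage with random parameters: 6 frames, d=3, context k=2, hidden a=4
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))
z, s = temporal_attention(Y, rng.normal(size=2),
                          rng.normal(size=(4, 3)),
                          rng.normal(size=(4, 2)),
                          rng.normal(size=4))
```

The weights s_i sum to 1, so frames judged more relevant to the context c contribute more to the output z, which is the mechanism by which the network favors key frames.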
In step 6), classification is performed with a softmax regression classifier as follows:
6.1) construct the training dataset, using the Berkeley MHAD and UTD-MHAD multimodal human behavior recognition public datasets;
6.2) add a softmax classifier after the last layer of the temporal-attention LSTM model; the output of the last LSTM layer serves as the classifier's input, and the final classification model is obtained by training the classifier;
6.3) use the combined structural features of the 2D joint points extracted from the RGB video as input and classify them with the trained softmax classifier.
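The final classification layer of step 6) reduces to a linear map followed by a softmax over action classes. A minimal sketch, with hypothetical action labels and random weights standing in for the parameters that would be learned on Berkeley MHAD / UTD-MHAD:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical action labels; the real label set comes from the datasets
LABELS = ["wave", "jump", "sit"]

def classify(feature, W, b):
    """Apply the softmax layer to the last LSTM output and pick the action."""
    probs = softmax(W @ feature + b)
    return LABELS[int(np.argmax(probs))], probs

# Usage with an untrained layer: 8-D LSTM output, 3 classes
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), np.zeros(3)
label, probs = classify(rng.normal(size=8), W, b)
```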
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The behavior recognition method based on an RGB monocular vision sensor uses a behavior representation built on global features, which not only obtains rich motion information for recognizing complex actions, but also has lower cost and better generality than the RGB+D depth cameras commonly used in current behavior recognition research. Compared with wearable inertial sensors it obtains more complete information and is free of their restrictions on wearing position and captured motion, enabling behavior recognition technology to be popularized in real scenarios.
2. Extracting skeleton joint points from RGB video data not only extracts the skeleton motion information most useful for behavior classification but also removes a large amount of redundant information, reducing the storage space of the data and increasing the speed of behavior recognition. In addition, a combined structural feature extraction method for skeleton joint points is proposed, which removes the negative interference of complex backgrounds on human behavior recognition while providing a more effective feature representation of the raw skeleton joint points, thus improving recognition accuracy under complex backgrounds.
3. Performing human behavior recognition with a temporal attention mechanism and an LSTM model effectively solves the problem that deep neural networks, when extracting features automatically, assign equal importance to all time steps. The LSTM network extracts the relationships between video frames, and the temporal attention mechanism makes the network focus on the key frames that contribute most to behavior recognition, improving the recognition accuracy for complex actions.
Description of the drawings
Fig. 1 is the flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the definition rule of the skeleton joint points extracted from the RGB image.
Fig. 3 is a schematic diagram of the LSTM neuron structure.
Fig. 4 is a schematic diagram of the attention mechanism model.
Detailed description of the embodiments
The present invention will be described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
As shown in Figures 1 to 4, the human behavior recognition method based on a temporal attention mechanism and LSTM provided by this embodiment comprises the following steps:
1) Establish a video monitoring platform and acquire RGB video data with a low-cost monocular vision sensor, comprising the following steps:
1.1) install the RGB monocular vision sensor in the monitored area and acquire data in real time;
1.2) connect the server to the front-end codec and download the real-time video data through a streaming media protocol;
1.3) transfer the acquired video to the server's storage device over an iSCSI IP connection;
1.4) preprocess the acquired video data and send it to the joint point extraction module for processing.
2) Extract the 2D skeleton joint point data from the RGB video with the OpenPose model, comprising the following steps:
2.1) in this embodiment, to facilitate the extraction of skeleton joint points, split the video into segments of 10 seconds each;
2.2) in this embodiment, specify the input image size as 368*368;
2.3) call the OpenPose framework and feed the images into its CNN to extract part confidence maps and part affinity fields;
2.4) build a list storing the 18 joint points detected in each image;
2.5) use bipartite matching to find the part associations and connect the joint points into a complete human skeleton.
3) Normalize the joint points and compute their combined structural features, comprising the following steps:
3.1) in this embodiment, the definition rule of the 18 skeleton joint points is shown in Fig. 2; define the acquired 2D skeleton joint point coordinates as:
p_i(x, y)
3.2) the vector J contains the extracted 2D skeleton joint point set and is defined as:
J = {p_1, p_2, ..., p_18}
3.3) the bone vector between two joint points is normalized; the normalized vector is computed as:
B_i,j = (p_i - p_j) / ||p_i - p_j||
where p_i and p_j denote two adjacent joint points, and ||p_i - p_j|| is the Euclidean distance between them, computed as:
||p_i - p_j|| = sqrt((x_i - x_j)^2 + (y_i - y_j)^2)
3.4) in this embodiment, the bone vector features follow the principles of human anatomy: adjacent joint points are connected to form bone vectors, and four upper-limb and four lower-limb bone vectors are selected. Following the joint point definition rule shown in Fig. 2, the bone vector feature set S is defined as:
S = {B_2,3, B_3,4, B_5,6, B_6,7, B_8,9, B_9,10, B_11,12, B_12,13}
3.5) in this embodiment, the bone angle features use the angle between the left wrist, left shoulder and left hip and the angle between the right wrist, right shoulder and right hip as skeletal space angles; θ_i,j is defined as the angle of joint points p_i and p_j projected onto the XY plane. The bone angle feature set θ is defined as:
θ = (θ_4,8, θ_2,8, θ_5,11, θ_7,11)
3.6) in this embodiment, because human bodies differ individually, bone length is selected as a bias term describing the global scale of the human skeleton. The spine vectors are used, i.e., the distances from the neck node to the left-hip and right-hip nodes, as the bone length feature D:
D = D_1,8 + D_1,11
where, if joint point i is connected to joint point j,
D_i,j = ||p_i - p_j||
3.7) in this embodiment, the bone vector features, bone angle features and bone length feature are linearly concatenated to form the combined structural feature of the skeleton joint points:
Feature = {S, θ, D}
4) Construct the long short-term memory network LSTM, implemented as follows:
4.1) inside a hidden-layer unit, the topmost horizontal state line passes the hidden unit state from one time step to the next, involving only a few linear transformations, which helps keep the hidden unit state stable;
4.2) the LSTM contains three special "gate" structures: the input gate i_t, the forget gate f_t and the output gate o_t. Each gate consists of a sigmoid function and an element-wise multiplication, so that the hidden unit remembers as much useful information as possible and discards useless information, thereby solving the long-term dependency problem;
4.3) the computation inside an LSTM hidden unit is as follows. In the forget gate, W_f denotes the forget weight of the input vector and b_f the forget bias:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
In the input gate, W_i denotes the update weight of the input vector and b_i the update bias:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C is the cell state of the hidden unit, updated as:
C_t = f_t * C_{t-1} + i_t * tanh(W_C · [h_{t-1}, x_t] + b_C)
In the output gate, W_o is the output weight of the input vector and b_o the output bias:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Finally the output h_t is computed:
h_t = o_t * tanh(C_t)
where x is the input, h is the output, h_{t-1} is the output at time t-1, and x_t is the input at time t.
5) Add the temporal attention mechanism to the LSTM network to extract temporal features, implemented as follows:
5.1) input the context information c and the representation y_i of each part of the current data;
5.2) use a tanh layer to compute m_1, m_2, ..., m_n, aggregating each y_i with c; letting W_cm be the weight of c and W_ym the weight of y_i, m_i is computed as:
m_i = tanh(W_cm·c + W_ym·y_i)
5.3) compute the weight of each aggregated value with the softmax function:
s_i = exp(m_i) / Σ_j exp(m_j)
where s_i is the softmax value of m_i projected onto the learned direction, so the softmax can be regarded as the relevance obtained from the context c;
5.4) compute the weighted average of all y_i as the output value z, the weights expressing the relevance of each variable to the context c:
z = Σ_i s_i·y_i
6) Perform human behavior recognition with the softmax classifier, implemented as follows:
6.1) construct the training dataset, using the Berkeley MHAD and UTD-MHAD multimodal human behavior recognition public datasets;
6.2) add a softmax classifier after the last layer of the temporal-attention LSTM model; the output of the last LSTM layer serves as the classifier's input, and the final classification model is obtained by training the classifier;
6.3) use the combined structural features of the 2D joint points extracted from the RGB video in step 3) as input and classify them with the trained softmax classifier.
In conclusion the Human bodys' response method provided by the present invention based on time attention mechanism and LSTM, structure
The deep neural network model based on RGB monocular vision sensor is built, can be improved the Activity recognition system in view-based access control model
Universality;2D skeleton joint point is extracted using OpenPose Open Framework in rgb video, is proposed a kind of based on skeletal joint point
Structure feature extracting method, processing speed and the raising of Activity recognition system can be improved by reducing Video Redundancy information
Real-time;The LSTM model for proposing a kind of binding time attention mechanism, can be improved the accuracy rate of the identification to complex behavior.
In addition, technical method provided by the invention can also be extended to human body exception monitoring, video monitoring, smart home, identity authentication with
And the various fields such as motion analysis, there is extensive research significance, be worthy to be popularized.
In above-described embodiment, included modules are that function logic according to the invention is divided, but simultaneously
It is not limited to above-mentioned division, as long as corresponding functions can be realized, the protection scope that is not intended to restrict the invention.
The above is the preferable embodiment of the present invention, but embodiments of the present invention are not by the limit of above-described embodiment
System, other any changes, modifications, substitutions, combinations, simplifications made without departing from the spirit and principles of the present invention,
It should be equivalent substitute mode, be included within the scope of the present invention.
Claims (7)
1. A human behavior recognition method based on a temporal attention mechanism and LSTM, characterized by comprising the following steps:
1) acquiring video data from an RGB monocular vision sensor;
2) extracting 2D skeleton joint point data;
3) extracting combined structural features of the joint points;
4) constructing an LSTM long short-term memory network;
5) adding a temporal attention mechanism to the LSTM network;
6) performing human behavior recognition with a softmax classifier.
2. The human behavior recognition method based on a temporal attention mechanism and LSTM according to claim 1, characterized in that in step 1), acquiring the video data of the RGB monocular vision sensor comprises the following steps:
1.1) installing the RGB monocular vision sensor in the monitored area and acquiring data in real time;
1.2) connecting the server to the front-end codec and downloading the real-time video data through a streaming media protocol;
1.3) transferring the acquired video to the server's storage device over an iSCSI IP connection;
1.4) preprocessing the acquired video data and sending it to the joint point extraction module for processing.
3. The human behavior recognition method based on a temporal attention mechanism and LSTM according to claim 1, characterized in that in step 2), extracting the 2D skeleton joint point data comprises the following steps:
2.1) splitting the video into segments of 10 seconds each;
2.2) after reading each image, resizing it to 368*368;
2.3) calling the OpenPose framework and feeding the resized images into its CNN to extract part confidence maps and part affinity fields;
2.4) building a list storing the 18 joint points detected in each image;
2.5) using bipartite matching to find the part associations and connecting the joint points into a complete human skeleton.
4. The human behavior recognition method based on a temporal attention mechanism and LSTM according to claim 1, characterized in that in step 3), extracting the combined structural features of the joint points comprises the following steps:
3.1) defining the acquired 2D skeleton joint point coordinates as:
p_i(x, y)
3.2) defining the set of extracted 2D skeleton joint points as the vector J, expressed as:
J = {p_1, p_2, ..., p_18}
3.3) normalizing the bone vector between two joint points, the normalized vector being computed as:
B_i,j = (p_i - p_j) / ||p_i - p_j||
where p_i and p_j denote two adjacent joint points, and ||p_i - p_j|| is the Euclidean distance between them, computed as:
||p_i - p_j|| = sqrt((x_i - x_j)^2 + (y_i - y_j)^2)
3.4) computing the bone vector features, i.e., the bone vectors formed by connecting adjacent joint points; four upper-limb and four lower-limb bone vectors are selected, and according to the joint point definition rule, the bone vector feature set S is defined as:
S = {B_2,3, B_3,4, B_5,6, B_6,7, B_8,9, B_9,10, B_11,12, B_12,13}
3.5) computing the bone angle features, using the angle between the left wrist, left shoulder and left hip and the angle between the right wrist, right shoulder and right hip as skeletal space angles; θ_i,j is defined as the angle of joint points p_i and p_j projected onto the XY plane; the bone angle feature set θ is defined as:
θ = (θ_4,8, θ_2,8, θ_5,11, θ_7,11)
3.6) computing the bone length feature; bone length is selected as a bias term describing the global scale of the human skeleton, using the spine vectors, i.e., the distances from the neck node to the left-hip and right-hip nodes, as the bone length feature D:
D = D_1,8 + D_1,11
where, if joint point i is connected to joint point j,
D_i,j = ||p_i - p_j||
3.7) computing the combined structural feature of the skeleton joint points by linearly concatenating the bone vector features, bone angle features and bone length feature:
Feature = {S, θ, D}.
5. The human behavior recognition method based on a temporal attention mechanism and LSTM according to claim 1, characterized in that in step 4), the LSTM long short-term memory network is constructed as follows:
inside a hidden-layer unit, the topmost horizontal state line passes the hidden unit state from one time step to the next, involving only a few linear transformations;
the LSTM contains three "gate" structures: the input gate i_t, the forget gate f_t and the output gate o_t; each gate consists of a sigmoid function and an element-wise multiplication, so that the hidden unit remembers as much useful information as possible and discards useless information;
the computation inside an LSTM hidden unit is as follows: in the forget gate, W_f denotes the forget weight of the input vector and b_f the forget bias:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
in the input gate, W_i denotes the update weight of the input vector and b_i the update bias:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C is the cell state of the hidden unit, updated as:
C_t = f_t * C_{t-1} + i_t * tanh(W_C · [h_{t-1}, x_t] + b_C)
in the output gate, W_o is the output weight of the input vector and b_o the output bias:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
finally the output h_t is computed:
h_t = o_t * tanh(C_t)
where x is the input, h is the output, h_{t-1} is the output at time t-1, and x_t is the input at time t.
6. The human behavior recognition method based on the temporal attention mechanism and LSTM according to claim 1, characterized in that, in step 5), a temporal attention mechanism is added to the LSTM network, comprising the following steps:
5.1) Input the context information c and the representation y_i of some part of the current data;
5.2) Use a tanh layer to compute m_1, m_2, ..., m_n, aggregating each y_i with c. Let W_cm be the weight of c and W_ym the weight of y_i; then m_i is computed as:
m_i = tanh(W_cm·c + W_ym·y_i)
5.3) Compute each aggregated weight with the softmax function:
s_i = softmax(w^T·m_i)
where s_i is the softmax value of m_i projected onto the learned direction w, so the softmax can be regarded as the relevance obtained according to the context c;
5.4) Compute the weighted average of all y_i as the output value z; each weight indicates the relevance of the corresponding variable to the context c. z is computed as:
z = Σ_i s_i·y_i.
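Steps 5.1)-5.4) can be sketched as follows (a minimal NumPy sketch; the projection vector w that turns each m_i into a scalar score, and all dimensions and random weights, are illustrative assumptions):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def temporal_attention(c, ys, W_cm, W_ym, w):
    # m_i = tanh(W_cm c + W_ym y_i): aggregate each y_i with the context c
    ms = [np.tanh(W_cm @ c + W_ym @ y) for y in ys]
    # s_i: softmax over each m_i projected onto the learned direction w
    s = softmax(np.array([w @ m for m in ms]))
    # z = sum_i s_i y_i: weighted average of the y_i
    z = sum(s_i * y for s_i, y in zip(s, ys))
    return z, s

rng = np.random.default_rng(1)
d = 4
c = rng.standard_normal(d)
ys = [rng.standard_normal(d) for _ in range(5)]
W_cm = rng.standard_normal((d, d))
W_ym = rng.standard_normal((d, d))
w = rng.standard_normal(d)
z, s = temporal_attention(c, ys, W_cm, W_ym, w)
print(round(s.sum(), 6))  # 1.0
```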
7. The human behavior recognition method based on the temporal attention mechanism and LSTM according to claim 1, characterized in that, in step 6), classification is performed with a softmax regression classifier, comprising the following steps:
6.1) Construct the training dataset, using the Berkeley MHAD and UTD-MHAD multimodal public datasets for human behavior recognition;
6.2) Add a softmax classifier after the last layer of the LSTM model with the temporal attention mechanism; the output of the last LSTM layer serves as the classifier input, and the final classification model is obtained by training the classifier;
6.3) Use the coordinated structural features of the 2D joints extracted from RGB video as input, and classify them with the trained softmax classifier.
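The classification step can be sketched as follows (a minimal NumPy sketch; the random weights are placeholders for a trained classifier, and the class count of 11 is only an illustrative assumption based on the Berkeley MHAD action set):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def softmax_classify(z, W, b):
    # Class probabilities from the attention-LSTM output vector z
    probs = softmax(W @ z + b)
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(2)
n_classes, d = 11, 8  # illustrative: 11 action classes, feature size 8
W = rng.standard_normal((n_classes, d))
b = np.zeros(n_classes)
label, probs = softmax_classify(rng.standard_normal(d), W, b)
print(0 <= label < n_classes, round(probs.sum(), 6))  # True 1.0
```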
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271178.2A CN110135249B (en) | 2019-04-04 | 2019-04-04 | Human behavior identification method based on time attention mechanism and LSTM |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110135249A true CN110135249A (en) | 2019-08-16 |
CN110135249B CN110135249B (en) | 2021-07-20 |
Family
ID=67569411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910271178.2A Expired - Fee Related CN110135249B (en) | 2019-04-04 | 2019-04-04 | Human behavior identification method based on time attention mechanism and LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110135249B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104615983A (en) * | 2015-01-28 | 2015-05-13 | 中国科学院自动化研究所 | Behavior identification method based on recurrent neural network and human skeleton movement sequences |
CN108600701A (en) * | 2018-05-02 | 2018-09-28 | 广州飞宇智能科技有限公司 | A kind of monitoring system and method judging video behavior based on deep learning |
CN108764066A (en) * | 2018-05-08 | 2018-11-06 | 南京邮电大学 | A kind of express delivery sorting working specification detection method based on deep learning |
CN108776796A (en) * | 2018-06-26 | 2018-11-09 | 内江师范学院 | A kind of action identification method based on global spatio-temporal attention model |
CN108846332A (en) * | 2018-05-30 | 2018-11-20 | 西南交通大学 | A kind of railway drivers Activity recognition method based on CLSTA |
CN108875708A (en) * | 2018-07-18 | 2018-11-23 | 广东工业大学 | Behavior analysis method, device, equipment, system and storage medium based on video |
CN109508688A (en) * | 2018-11-26 | 2019-03-22 | 平安科技(深圳)有限公司 | Behavioral value method, terminal device and computer storage medium based on skeleton |
Non-Patent Citations (1)
Title |
---|
SIJIE SONG et al.: "An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data", arXiv:1611.06067v1 [cs.CV] * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021051579A1 (en) * | 2019-09-17 | 2021-03-25 | 平安科技(深圳)有限公司 | Body pose recognition method, system, and apparatus, and storage medium |
CN110705390A (en) * | 2019-09-17 | 2020-01-17 | 平安科技(深圳)有限公司 | Body posture recognition method and device based on LSTM and storage medium |
CN110781771A (en) * | 2019-10-08 | 2020-02-11 | 北京邮电大学 | Abnormal behavior real-time monitoring method based on deep learning |
CN111310655A (en) * | 2020-02-13 | 2020-06-19 | 蒋营国 | Human body action recognition method and system based on key frame and combined attention model |
CN111553229A (en) * | 2020-04-21 | 2020-08-18 | 清华大学 | Worker action identification method and device based on three-dimensional skeleton and LSTM |
CN111723667A (en) * | 2020-05-20 | 2020-09-29 | 同济大学 | Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device |
CN111368810A (en) * | 2020-05-26 | 2020-07-03 | 西南交通大学 | Sit-up detection system and method based on human body and skeleton key point identification |
CN111368810B (en) * | 2020-05-26 | 2020-08-25 | 西南交通大学 | Sit-up detection system and method based on human body and skeleton key point identification |
CN111860267A (en) * | 2020-07-13 | 2020-10-30 | 浙大城市学院 | Multichannel body-building movement identification method based on human body bone joint point positions |
CN111860267B (en) * | 2020-07-13 | 2022-06-14 | 浙大城市学院 | Multichannel body-building exercise identification method based on human body skeleton joint point positions |
CN112149613A (en) * | 2020-10-12 | 2020-12-29 | 萱闱(北京)生物科技有限公司 | Motion estimation evaluation method based on improved LSTM model |
CN112257845A (en) * | 2020-10-12 | 2021-01-22 | 萱闱(北京)生物科技有限公司 | Press action recognition method based on improved LSTM model |
CN112149613B (en) * | 2020-10-12 | 2024-01-05 | 萱闱(北京)生物科技有限公司 | Action pre-estimation evaluation method based on improved LSTM model |
CN112560582A (en) * | 2020-11-24 | 2021-03-26 | 超越科技股份有限公司 | Real-time abnormal behavior monitoring method based on LSTM |
CN112528891A (en) * | 2020-12-16 | 2021-03-19 | 重庆邮电大学 | Bidirectional LSTM-CNN video behavior identification method based on skeleton information |
CN114973403A (en) * | 2022-05-06 | 2022-08-30 | 广州紫为云科技有限公司 | Efficient behavior prediction method based on space-time dual-dimension feature depth network |
CN114973403B (en) * | 2022-05-06 | 2023-11-03 | 广州紫为云科技有限公司 | Behavior prediction method based on space-time double-dimension feature depth network |
Also Published As
Publication number | Publication date |
---|---|
CN110135249B (en) | 2021-07-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110135249A (en) | Human behavior recognition method based on time attention mechanism and LSTM | |
CN110135375B (en) | Multi-person attitude estimation method based on global information integration | |
CN109829436B (en) | Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network | |
He et al. | Visual recognition of traffic police gestures with convolutional pose machine and handcrafted features | |
CN110222580A (en) | A kind of manpower 3 d pose estimation method and device based on three-dimensional point cloud | |
CN111160294B (en) | Gait recognition method based on graph convolution network | |
CN113128424B (en) | Method for identifying action of graph convolution neural network based on attention mechanism | |
CN111444488A (en) | Identity authentication method based on dynamic gesture | |
CN110135277B (en) | Human behavior recognition method based on convolutional neural network | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
Neverova | Deep learning for human motion analysis | |
Sheu et al. | Improvement of human pose estimation and processing with the intensive feature consistency network | |
CN111611869B (en) | End-to-end monocular vision obstacle avoidance method based on serial deep neural network | |
CN117115911A (en) | Hypergraph learning action recognition system based on attention mechanism | |
CN117711066A (en) | Three-dimensional human body posture estimation method, device, equipment and medium | |
CN117576149A (en) | Single-target tracking method based on attention mechanism | |
Yang et al. | Human action recognition based on skeleton and convolutional neural network | |
Gadhiya et al. | Analysis of deep learning based pose estimation techniques for locating landmarks on human body parts | |
CN113469018B (en) | Multi-modal interactive behavior recognition method based on RGB and three-dimensional skeleton | |
Usman et al. | Skeleton-based motion prediction: A survey | |
Huang et al. | View-independent behavior analysis | |
Ramanathan et al. | Combining pose-invariant kinematic features and object context features for rgb-d action recognition | |
CN111178141B (en) | LSTM human body behavior identification method based on attention mechanism | |
CN115482481A (en) | Single-view three-dimensional human skeleton key point detection method, device, equipment and medium | |
Liang | Face recognition technology analysis based on deep learning algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||
CF01 | Termination of patent right due to non-payment of annual fee ||
Granted publication date: 2021-07-20