CN113239884A - Method for recognizing human body behaviors in elevator car - Google Patents

Method for recognizing human body behaviors in elevator car

Info

Publication number
CN113239884A
CN113239884A (application CN202110625850.0A)
Authority
CN
China
Prior art keywords
joint, skeleton, graph, matrix, points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110625850.0A
Other languages
Chinese (zh)
Inventor
邓海南
詹跃明
余晓毅
王文蝶
陈鸣利
肖渝
赵兴明
蔡润龙
杨利鸿
杨玲
郎仲宽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Tielian Intelligent Technology Co ltd
Chongqing Energy College
Original Assignee
Chongqing Tielian Intelligent Technology Co ltd
Chongqing Energy College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Tielian Intelligent Technology Co ltd, Chongqing Energy College
Priority to CN202110625850.0A
Publication of CN113239884A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a method for recognizing human behaviors in an elevator car, based on a behavior recognition method with a skeleton-adaptive and joint-enhanced graph convolution network. First, the connection relationships between the joint points are learned with a Gaussian function of embedded operations, and the joint structure is adaptively adjusted according to the input skeleton data. Second, a soft attention mechanism is introduced to measure the difference in each joint point's contribution, so as to enhance the feature expression of high-contribution joint points and weaken the interference from the features of low-contribution joint points. Finally, an end-to-end graph convolution network is constructed to learn the spatio-temporal co-occurrence features of the skeleton joint points. The method avoids the insufficient modeling of human joint motion in current algorithms and reduces interference. Verification on the public behavior recognition dataset MSR Action 3D shows that the proposed algorithm is feasible and has high recognition accuracy, providing another viable option for 3D human skeleton data modeling and human behavior recognition.

Description

Method for recognizing human body behaviors in elevator car
Technical Field
The invention belongs to the technical field of data identification processing in physics, and particularly relates to a human behavior identification method in an elevator car.
Background
At present, research and engineering development based on convolutional neural networks are increasingly common, as in CN110223706A, CN110222653A, and CN1109740695A, and skeleton-based behavior recognition with deep learning has been studied extensively. As mentioned in the above CN110222653A, human behavior recognition, which determines human behavior patterns through body motion analysis, is likewise an important subject in the field.
In research based on convolutional neural networks, the skeleton data are treated as a pseudo-image in vector-matrix form, and the spatial local features of the joint points are extracted. In research based on recurrent neural networks, the skeleton data are modeled as coordinate vectors to learn the high-level temporal dynamics of skeleton sequences, and a global context attention mechanism is introduced to selectively attend to the important joints in each frame of the skeleton sequence. To extract detailed features from graphs of arbitrary structure, graph convolution network (GCN) techniques have been applied, and the spatio-temporal graph convolution network (TS-GCN) performs spatio-temporal graph modeling of human skeleton data with good recognition results. However, relying on only one fixed skeleton graph is clearly not sufficient to express the flexible and variable joint movements of different behaviors performed by different individuals. Furthermore, from the perspective of human kinematics, the contribution of each skeleton joint's features to distinguishing an action is not uniform, and for some similar behaviors, too many low-contribution features may affect the final classification decision.
Disclosure of Invention
In view of the deficiencies in the prior art, the invention aims to provide a method for recognizing human behaviors in an elevator car that avoids the insufficient modeling of human joint motion in current algorithms, provides another viable option for 3D human skeleton data modeling, and achieves feasible and accurate technical effects.
In order to solve the technical problems, the invention adopts the following technical scheme:
The method for recognizing human behaviors in an elevator car comprises the following steps: 1) designing a human skeleton spatio-temporal topological graph, learning the connection relationships between the joint points with a Gaussian function of embedded operations, and adaptively adjusting the joint structure connections according to the input sample data;
2) introducing a soft attention mechanism to measure the differences between the joint points, enhance the feature expression of high-contribution joint points, weaken the interference from the features of low-contribution joint points, and extract highly discriminative joint features;
3) constructing an end-to-end graph convolution network to learn the spatio-temporal co-occurrence features of the human skeleton joint points.
Furthermore, the predefined human skeleton graph is formed by the joint points and the bone edges connecting every two joints according to the body structure. A complete human action comprises a skeleton sequence of T frames with N joints; the skeleton graph is represented by G = (V, E), where the set of joint points V = {v_ti | t = 1, ..., T; i = 1, ..., N} includes all joints in the skeleton sequence, and the set of bone edges is E = {E_S, E_T};
wherein the spatial edges E_S = {v_ti v_tj | (i, j) ∈ H} represent the connections of adjacent joint points within a frame of the skeleton sequence, H is the set of naturally connected human joints, and the temporal edges E_T = {v_ti v_(t+1)i} connect the same joint point between consecutive frames.
Further, in the spatial dimension, the graph convolution operation for feature extraction at any joint point v_ti in the skeleton graph is expressed as:

f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(v_{tj}) \cdot w(l_{ti}(v_{tj}))    (1)

where f_in and f_out are the input and extracted features of the joint point v_ti; B(v_ti) = {v_tj | d(v_tj, v_ti) ≤ D} is the neighbor set of v_ti, with D controlling the range of neighbor nodes taken; v_tj denotes a joint point directly connected to v_ti; Z_ti(v_tj) = |{v_tk | l_ti(v_tk) = l_ti(v_tj)}| is a normalization term; and w is the weight function of the neighboring joint points. In order to learn the differentiated features of each neighbor node, the neighbor nodes are divided into K sets according to human motion characteristics, and the label mapping l_ti(v_tj): B(v_ti) → {0, ..., K − 1} assigns each node v_tj a unique weight vector;
the graph convolution operation in the time domain is obtained by extending the graph convolution on the spatial domain, with the parameter Γ controlling the temporal range of the neighbor set, so that the neighbor set over the spatial and temporal dimensions is represented as:

B(v_{ti}) = \{ v_{qj} \mid d(v_{tj}, v_{ti}) \le D,\ |q - t| \le \lfloor \Gamma/2 \rfloor \}    (2)

the label mapping set corresponding to the neighbor nodes is l_ST(v_qj) = l_ti(v_tj) + (q − t + Γ/2) × K, where l_ti(v_tj) is the label mapping for the single-frame case of v_ti;
in the graph convolution network, after each joint point aggregates its own information and the information of its neighbor nodes, the information is propagated to the next joint point until all joint points in the skeleton sequence are traversed; the graph convolution formula is expressed as:

f_{out} = D^{-1/2} (A + I) D^{-1/2} f_{in} W    (3)

the feature map f is a tensor of dimension C × T × N, where C is the number of channels, T is the number of frames in the skeleton graph sequence, and N is the number of joints in a single-frame skeleton graph;
the connection relationship of the joints in the skeleton graph is represented by the N × N adjacency matrix A together with the identity matrix I, D is the degree matrix of the joint points, D^{-1/2}(A + I)D^{-1/2} represents the normalized skeleton graph structure, and W is the weight matrix learned in the graph convolution network. According to the neighbor-node partition strategy, the adjacency matrix A is decomposed into K matrices A_k, and the graph convolution formula is further expressed as:

f_{out} = \sum_{k=1}^{K} D_k^{-1/2} A_k D_k^{-1/2} f_{in} W_k    (4)
dynamically adjusting the connection relationships and connection strengths of the joint points according to the input skeleton data, the formula of the adaptive graph convolution layer is adjusted as:

f_{out} = \sum_{k=1}^{K} (A_k + B_k) f_{in} W_k    (5)

A_k is the original matrix representing the natural connections of the joints in the body, and B_k is the adaptive matrix. A normalized Gaussian function is embedded in the network layer to compute the similarity of two joint points in the skeleton graph, so as to measure their connection relationship, specifically:

f(v_i, v_j) = \frac{\exp(\Phi(v_i)^{T} \Psi(v_j))}{\sum_{j=1}^{N} \exp(\Phi(v_i)^{T} \Psi(v_j))}    (6)

where Φ(v_i) = W_Φ v_i and Ψ(v_j) = W_Ψ v_j are both embedding operations, and W_Φ and W_Ψ are the corresponding weight parameters;
two parallel branches perform 1 × 1 convolution operations through the two embedding functions Φ and Ψ; the output features of Φ and Ψ are each dimension-transformed, multiplied as matrices, and classified with the softmax function to obtain the adaptive matrix:

B_k = \mathrm{softmax}(f_{in}^{T} W_{\Phi k}^{T} W_{\Psi k} f_{in})    (7)

wherein each element B_k^{ij} is normalized to [0, 1].
Further, f_t containing the spatial structure features of each joint is extracted from the t-th frame skeleton graph; the joint features are first transformed through a fully connected layer, and all transformed joint features are aggregated to obtain the query feature:

q_t = \frac{1}{N} \sum_{i=1}^{N} \tanh(W f_{ti})    (8)

where W is a learnable weight matrix; the attention scores of all joint points in the skeleton graph are expressed as:

m_t = \mathrm{softmax}(W_s \tanh(W_f f_t + W_q q_t + b_{fq}) + b_s)    (9)

where W_s, W_f, W_q are all learnable weight matrices, b_fq and b_s are biases, and m_t = (m_t1, m_t2, ..., m_tN) represents the importance of the corresponding joint points in the t-th frame skeleton graph, its values normalized to [0, 1] by the softmax function; the attention layer outputs a skeleton graph with enhanced joint features, and the spatial feature of each joint point v_ti is expressed as:

\hat{f}_{ti} = f_{ti} \cdot (1 + m_{ti})    (10)
further, it is verifiable by public behavior recognition data set MSRAction 3D.
Compared with the prior art, the invention has the following beneficial effects:
the invention relates to a human behavior recognition method in an elevator car, which is based on a behavior recognition method of a framework self-adaptation and joint enhancement graph convolution network; firstly, learning the connection relation among all joint points by using a Gaussian function of embedded operation, and adaptively adjusting the joint structure according to input skeleton data; secondly, introducing a soft attention mechanism, and measuring the difference of the contribution of each joint point so as to enhance the feature expression of the joint points with high contribution and weaken the interference of the features of the joint points with low contribution; and finally, constructing an end-to-end graph convolution network to learn the spatio-temporal co-occurrence characteristics of the skeleton joint points. The problem of insufficient modeling of human joint motion in the current algorithm can be avoided, and the interference is reduced. The algorithm provided by verification on the MSR Action 3D of the public behavior recognition data set is feasible, the recognition accuracy is high, and another feasible choice is provided for 3D human body skeleton data modeling and human body behavior recognition. During the in-service use, use in the elevator car, can gather passenger information, provide the safety guarantee, supervise civilization and take advantage of the ladder.
Drawings
FIG. 1 is a block diagram of the overall framework of the skeleton-adaptive and joint-enhanced graph convolution network in the method according to an embodiment;
FIG. 2 is a schematic diagram of a human spatio-temporal skeleton graph represented by the MSRAction 3D skeleton in an embodiment;
FIG. 3 is a schematic diagram of the partition strategy according to the spatial structure of the MSRAction 3D skeleton graph in an embodiment;
FIG. 4 is a diagram of the skeleton-adaptive graph convolution layer in an embodiment;
FIG. 5 is a schematic view of the joint enhancement attention layer in an embodiment;
FIG. 6 is a sample diagram of some MSRAction 3D skeleton actions in an embodiment;
FIG. 7 is a graph of the variation of training loss values in an embodiment;
FIG. 8 is a graph of the variation of test recognition rate in an embodiment;
FIG. 9 is a diagram of the original adjacency matrix in an embodiment;
FIG. 10 is a diagram of the skeleton-adaptive adjacency matrix in an embodiment;
FIG. 11 is a schematic diagram of the confusion matrix for the baseline network TS-GCN in an embodiment;
FIG. 12 is a schematic diagram of the confusion matrix for the baseline network with the joint enhancement layer, TS-GCN + JE, in an embodiment;
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
The overall framework of the method for recognizing human behaviors in an elevator car of this specific embodiment is shown in fig. 1. Three skeleton-adaptive and joint-enhanced graph convolution network (SA + JE-GCN) layers are stacked; by exploiting the hierarchical characteristics of the network, high-level complex motion features can be learned, and the expressive power and performance of the network model are improved. In the SA + JE-GCN network, the skeleton adaptive layer and the joint attention enhancement layer learn the global and local features of the skeleton graph space, a temporal convolution layer extracts the temporal dynamic features of consecutive frames of the skeleton graph sequence, and a residual structure is introduced to prevent gradient explosion and gradient vanishing during model training and further improve network performance.
MSRAction 3D skeleton graph representation:
The raw skeleton data extracted from the depth map consist of the three-dimensional spatial coordinates of a series of body joint points. Graph convolution networks are based on convolution operations over topological graph structures; in the TS-GCN, a predefined human skeleton graph can be formed by the joint points and the bone edges connecting every two joints according to the body structure. A complete human action comprises a skeleton sequence of T frames with N joints; a skeleton graph may be represented by G = (V, E), where the set of joint points V = {v_ti | t = 1, ..., T; i = 1, ..., N} includes all joints in the skeleton sequence, and the set of bone edges is E = {E_S, E_T}, where the spatial edges E_S = {v_ti v_tj | (i, j) ∈ H} connect adjacent joint points within a frame of the skeleton sequence, H is the set of naturally connected human joints, and the temporal edges E_T = {v_ti v_(t+1)i} connect the same joint point between consecutive frames. Fig. 2 shows the spatio-temporal skeleton graph structure of a running action sequence, including the feature representation of the skeleton sequence in the spatial and temporal dimensions. The circles are joint points, and the two boxes respectively mark a spatial edge connecting a joint point with its adjacent node in a single-frame sequence and a temporal edge connecting the same joint point across consecutive frames.
Spatio-temporal graph convolution modeling:
Based on the predefined human spatio-temporal skeleton graph, the TS-GCN network stacks a nine-layer graph convolution network in a hierarchical structure to learn the complex, high-level spatio-temporal co-occurrence features of the skeleton graph data. In the spatial dimension, the graph convolution operation for feature extraction at any joint point v_ti in the skeleton graph is expressed as:

f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(v_{tj}) \cdot w(l_{ti}(v_{tj}))    (1)

where f_in and f_out are the input and extracted features of the joint point v_ti; B(v_ti) = {v_tj | d(v_tj, v_ti) ≤ D} denotes the neighbor set of v_ti, D controls the range of neighbor nodes taken, and D = 1 means that only the joint points v_tj directly connected to v_ti are taken. Z_ti(v_tj) = |{v_tk | l_ti(v_tk) = l_ti(v_tj)}| is a normalization term and w is the weight function of the neighboring joint points. In order to learn the differentiated features of each neighbor node, the neighbor nodes are divided into K sets according to human motion characteristics, and the label mapping l_ti(v_tj): B(v_ti) → {0, ..., K − 1} assigns each node v_tj a unique weight vector.
In this embodiment, the spatial-structure partitioning strategy of fig. 3 is adopted. With the No. 4 spine joint point as the gravity center of the human skeleton, the neighbor set can be divided into three subsets: the root node set (the joint point itself, e.g., No. 2), the centripetal set (nodes closer to the gravity center than the root node, e.g., No. 3), and the centrifugal set (nodes farther from the gravity center than the root node, e.g., No. 9).
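The three-subset partition can be sketched as below, assuming "closer to the gravity center" is measured by Euclidean distance of the joint coordinates; the toy 2-D coordinates and the small neighbor table are illustrative, not the MSRAction 3D joints:

```python
import numpy as np

# Toy 2-D joint coordinates and adjacency; the gravity center is taken as
# the mean of all joint coordinates (an assumption for this sketch).
coords = np.array([[0.0, 2.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
center = coords.mean(axis=0)

def partition(root, nbr):
    """Return 0 (root), 1 (centripetal) or 2 (centrifugal) for neighbor nbr."""
    if nbr == root:
        return 0
    d_root = np.linalg.norm(coords[root] - center)
    d_nbr = np.linalg.norm(coords[nbr] - center)
    return 1 if d_nbr < d_root else 2

labels = {(i, j): partition(i, j) for i in neighbors for j in neighbors[i] + [i]}
print(labels[(0, 0)], labels[(0, 1)], labels[(1, 0)])  # 0 1 2
```

Note the labeling is asymmetric: joint 1 is centripetal as seen from joint 0, while joint 0 is centrifugal as seen from joint 1, which is exactly why each of the K = 3 subsets gets its own weight vector.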
The graph convolution operation in the time domain can be obtained by extending the graph convolution on the spatial domain, with the parameter Γ controlling the temporal range of the neighbor set, so that the neighbor set over the spatial and temporal dimensions can be represented as:

B(v_{ti}) = \{ v_{qj} \mid d(v_{tj}, v_{ti}) \le D,\ |q - t| \le \lfloor \Gamma/2 \rfloor \}    (2)

Then the label mapping set corresponding to the neighbor nodes is l_ST(v_qj) = l_ti(v_tj) + (q − t + Γ/2) × K, where l_ti(v_tj) is the label mapping for the single-frame case of v_ti.
In the graph convolution network, after each joint point's own information is aggregated with its neighbor nodes' information, the information is propagated to the next joint point until all joint points in the skeleton sequence are traversed. Combined with the graph matrices that represent graph topology in graph theory, the graph convolution formula can be written as:

f_{out} = D^{-1/2} (A + I) D^{-1/2} f_{in} W    (3)

The feature map f is a tensor of dimension C × T × N, where C is the number of channels, T is the number of frames in the skeleton graph sequence, and N is the number of joints in a single-frame skeleton graph. The connection relationship of the joints in the skeleton graph is represented by the N × N adjacency matrix A together with the identity matrix I, D is the degree matrix of the joint points, D^{-1/2}(A + I)D^{-1/2} represents the normalized skeleton graph structure, and W is the weight matrix learned in the graph convolution network. According to the neighbor-node partition strategy, the adjacency matrix A is decomposed into K matrices A_k, and the graph convolution formula can be further expressed as:

f_{out} = \sum_{k=1}^{K} D_k^{-1/2} A_k D_k^{-1/2} f_{in} W_k    (4)
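The partitioned, normalized graph convolution of formula (4) can be sketched for a single frame as follows; random adjacency subsets and weights stand in for the learned A_k and W_k, so this is an illustrative sketch rather than the embodiment's trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C_in, C_out, K = 4, 3, 8, 3  # joints, in/out channels, partition subsets

# K random symmetric adjacency sub-matrices A_k with self-loops folded in
# (playing the role of the A + I term of formula (3)).
mask = rng.random((K, N, N)) < 0.5
A_k = np.clip((mask | mask.transpose(0, 2, 1)).astype(float) + np.eye(N), 0.0, 1.0)

def normalize(A):
    """D^{-1/2} A D^{-1/2}, with D the degree matrix of A."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

f_in = rng.standard_normal((N, C_in))      # one frame: N joints, C_in channels
W = rng.standard_normal((K, C_in, C_out))  # one weight matrix per subset

# Formula (4): sum over the K partition subsets.
f_out = sum(normalize(A_k[k]) @ f_in @ W[k] for k in range(K))
print(f_out.shape)  # (4, 8)
```

Stacking this operation over the T frames, plus the temporal convolution, gives the full spatio-temporal layer.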
Skeleton-adaptive graph convolution layer:
In graph convolution formula (3), the adjacency matrix A is an N × N weight matrix in which A_ij represents the connection relationship between joint points i and j, describing the input skeleton graph structure. However, the TS-GCN adopts only one skeleton topology graph following the natural connections of the human joints, and the skeleton graph input at every layer of the hierarchical network structure is fixed and unchangeable; this mode is not optimal for the task of learning the skeleton features of different kinds of behaviors. This embodiment therefore exploits the convolutional network's ability to extract high-level features to turn the original adjacency matrix A_k into a learnable adaptive matrix B_k, dynamically adjusting the connection relationships and connection strengths of the joint points according to the input skeleton data. The formula of the adaptive graph convolution layer can be adjusted as:

f_{out} = \sum_{k=1}^{K} (A_k + B_k) f_{in} W_k    (5)
adaptive matrix BkWith an original matrix A representing the natural connection of the joints in the bodykAnd adding the two joint points to replace a fixed predefined skeleton graph, and embedding a normalized Gaussian function in the network layer to calculate the similarity of the two joint points in the skeleton graph so as to measure the connection relation of the two joint points. The specific operation is as follows:
Figure BDA0003102094200000063
in the formula, phi (v)i)=WΦviAnd Ψ (v)j)=WΨvjAre all embedded operations, WΦAnd WΨIs the corresponding weight parameter.
It should be noted that a non-local neural network can establish the connection relationship between a single joint and the global joints of the skeleton graph through its non-local structure, performing convolution operations in an embedding space to update the states between nodes. The structure of the skeleton-adaptive graph convolution layer is shown in fig. 4: the light grey square parts (marked with four stars) indicate learnable parameters. The input f_in is feature data of dimension C_in × T × N, corresponding respectively to the number of input channels, the number of frames in the skeleton graph sequence, and the number of joints. Two parallel branches perform 1 × 1 convolution operations through the two embedding functions Φ and Ψ; the output features of Φ and Ψ are each dimension-transformed, multiplied as matrices, and classified with the softmax function to obtain the adaptive matrix:

B_k = \mathrm{softmax}(f_{in}^{T} W_{\Phi k}^{T} W_{\Psi k} f_{in})    (7)

wherein each element B_k^{ij} is normalized to [0, 1]. Compared with the predefined skeleton graph, the adaptive matrix can establish new connection relationships according to the input joint point data; it is not limited to the natural connections in the human body and is more flexible and variable. A residual structure is added to the adaptive layer, using a 1 × 1 convolution operation so that the input dimension is consistent with the output dimension, and the original lower-layer features are also retained in the output higher-layer features.
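Formula (7) can be sketched in plain NumPy as below, with the 1 × 1 convolutions realized as per-joint linear maps W_Φ, W_Ψ and the dimensions chosen as illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
C_in, T, N, C_e = 3, 2, 5, 4  # channels, frames, joints, embedding dim

f_in = rng.standard_normal((C_in, T, N))
W_phi = rng.standard_normal((C_e, C_in))  # a 1x1 conv is a per-joint linear map
W_psi = rng.standard_normal((C_e, C_in))

# Embed both branches, fold the embedding and time axes together, then take
# pairwise inner products and softmax-normalize each row so every element
# of B_k lies in [0, 1].
phi = np.einsum('ec,ctn->etn', W_phi, f_in).reshape(-1, N)  # (C_e*T, N)
psi = np.einsum('ec,ctn->etn', W_psi, f_in).reshape(-1, N)
B_k = softmax(phi.T @ psi, axis=1)                          # (N, N)

print(B_k.shape)  # (5, 5)
```

Unlike the fixed A_k, this B_k changes with each input sample, which is what lets the layer connect, say, hand and foot joints for a running sequence even though they are not naturally adjacent.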
Joint enhancement layer based on attention mechanism:
In the process of classifying the various behaviors of the human body with a graph convolution network, extracting high-level effective motion features is the key to improving recognition accuracy. The attention mechanism is an information processing mechanism modeled on human visual characteristics: it gives sufficient attention to important information, weakens interfering information, and improves the signal-to-noise ratio to achieve information enhancement. In skeleton-based behavior recognition, for behaviors such as clapping, waving, and shaking hands, the joint features on the arms are more important than the other joint features of the skeleton, and for some similar behaviors, too many interfering features can also affect the final classification result. The joint enhancement layer with the soft attention mechanism adaptively focuses on the key nodes in the skeleton graph and automatically computes the importance of each joint.
As shown in fig. 5, the attention layer with the joint enhancement function extracts f_t, containing the spatial structure features of each joint, from the t-th frame skeleton graph. The joint features are first transformed through a fully connected layer, and all transformed joint features are aggregated to obtain the query feature:

q_t = \frac{1}{N} \sum_{i=1}^{N} \tanh(W f_{ti})    (8)

where W is a learnable weight matrix. The attention scores of all joint points in the skeleton graph can be written as:

m_t = \mathrm{softmax}(W_s \tanh(W_f f_t + W_q q_t + b_{fq}) + b_s)    (9)

where W_s, W_f, W_q are all learnable weight matrices, b_fq and b_s are biases, and m_t = (m_t1, m_t2, ..., m_tN) represents the importance of the corresponding joint points in the t-th frame skeleton graph, its values normalized to [0, 1] by the softmax function.
Thus, the attention layer can output a skeleton graph with enhanced joint features, and the spatial feature of each joint point v_ti can be expressed as:

\hat{f}_{ti} = f_{ti} \cdot (1 + m_{ti})    (10)
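Formulas (8) to (10) can be sketched as follows. Mean aggregation for the query and the residual-style enhancement f_ti · (1 + m_ti) are assumptions where the original figures leave the exact form open; all dimensions are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
N, C, C_h = 5, 8, 6  # joints per frame, feature channels, hidden dim

f_t = rng.standard_normal((N, C))       # frame-t joint features
W = rng.standard_normal((C, C_h))       # query transform (formula 8)
W_f = rng.standard_normal((C, C_h))     # learnable weights of formula (9)
W_q = rng.standard_normal((C_h, C_h))
W_s = rng.standard_normal((C_h,))
b_fq = rng.standard_normal((C_h,))
b_s = 0.1

q_t = np.tanh(f_t @ W).mean(axis=0)            # aggregated query feature, (C_h,)
u = np.tanh(f_t @ W_f + q_t @ W_q + b_fq)      # (N, C_h)
m_t = softmax(u @ W_s + b_s)                   # per-joint attention in [0, 1]
f_enh = f_t * (1.0 + m_t[:, None])             # formula (10): enhanced features

print(m_t.shape, f_enh.shape)  # (5,) (5, 8)
```

The 1 + m_ti form keeps the original feature and adds an attention-weighted copy, so low-scoring joints are de-emphasized relative to high-scoring ones without being zeroed out.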
and (3) comparison, verification and analysis:
To verify the effectiveness of the behavior recognition model of this method, three groups of control experiments are carried out on the representative MSRAction 3D skeleton dataset, with the spatio-temporal graph convolution network (TS-GCN) as the baseline network, and the results are compared with other existing methods.
Dataset, MSRAction 3D: the MSRAction 3D dataset is based on the 3D coordinates of 20 skeleton joints obtained with a Kinect depth camera and is commonly used to verify the validity of human behavior recognition algorithms. For the specific joint names, refer to fig. 3. The dataset was performed by 10 subjects over the 20 action categories in Table 1; each action was repeated 2-3 times, giving 567 action sequence samples, with each action sequence 10-100 frames long.
TABLE 1 The 20 action classes of MSRAction 3D
high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pickup & throw
FIG. 6 shows skeleton samples of some of the actions. The actions in this dataset have very high similarity, and the number of captured frames differs greatly between actions, which poses a great challenge to the validation of the model. In the data sample preprocessing stage, 20 action sequences suffered severe data loss, so the total data for this experiment is 547 samples. A cross-validation protocol split by subject is used to test the performance of the model: subjects 1, 3, 5, 7, 9 are used for training, and subjects 2, 4, 6, 8, 10 for testing.
The experiment is based on the skeleton-adaptive and joint-enhanced graph convolution network shown in fig. 1. The baseline network is a stack of 3 spatio-temporal graph convolution (TS-GCN) layers, whose (input channels, output channels, stride) are (3, 32, 1), (32, 64, 2), (64, 128, 2), respectively. A stochastic gradient descent (SGD) optimization strategy with Nesterov momentum 0.9 is adopted, and the cross-entropy loss function is used to compute the back-propagated gradient error. The training period is 120 epochs, the initial learning rate is set to 0.1 and decays by a factor of 0.1 at epochs 50 and 80, the weight decay coefficient is 0.0001, dropout is 0.25, and both the training and testing batch sizes are 16. A skeleton adaptive layer (SA), a joint enhancement layer (JE), and both together (SA + JE) are added to the baseline network to test the performance of each component.
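The step decay described above (initial learning rate 0.1, multiplied by 0.1 at epochs 50 and 80 over 120 epochs) can be sketched as a plain function; treating it as a standard multi-step schedule is an assumption about the exact decay form:

```python
def lr_at(epoch, base_lr=0.1, milestones=(50, 80), gamma=0.1):
    """Learning rate at a given epoch under multi-step decay."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Schedule over the 120-epoch training period described above.
schedule = [lr_at(e) for e in range(120)]
print(schedule[0], schedule[49], schedule[50], schedule[80])
```

So epochs 0-49 train at 0.1, epochs 50-79 at 0.01, and epochs 80-119 at 0.001.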
Four groups of human behavior recognition experiments are carried out on the MSRAction 3D dataset, using the baseline network TS-GCN and the models trained with the two network layers, respectively, to verify the effectiveness of the method. Figs. 7 and 8 plot, over the training period (epochs), the loss values (Loss) of the four experiments during network training and the recognition accuracy (Acc) of the network tests, respectively. Fig. 7 is a grouped plot with the curves offset by 0.1; in it, the loss of the baseline network TS-GCN converges slowest, with a lowest loss value of 0.0313, whereas adding the skeleton adaptive layer and the joint enhancement layer achieves faster convergence, with a lowest loss value of 0.013, fitting the 3D skeleton data better.
The behavior recognition accuracies of the four experiments in fig. 8 become essentially stable by around 100 epochs; the recognition rate of the baseline network oscillates more strongly, while the networks with the skeleton adaptive layer and the joint enhancement layer are more stable. Table 2 gives the recognition accuracy of each of the four experimental tests.
TABLE 2 Comparison of recognition rates of the method on the MSRAction 3D dataset
Model | Recognition accuracy
TS-GCN (baseline) | 92.05%
TS-GCN + SA | 93.72%
TS-GCN + JE | 93.28%
TS-GCN + SA + JE | 95.36%
With the spatio-temporal graph convolution network TS-GCN as the baseline, the accuracy is 92.05%. Adding the skeleton adaptive layer (SA) or the joint enhancement layer (JE) improves the recognition accuracy by 1.67% and 1.23%, respectively, and fusing the two reaches a recognition rate of 95.36%, an improvement of 3.31% over the baseline TS-GCN. In detail, the skeleton adaptive layer adaptively adjusts the adjacency matrix expressing the skeleton graph structure.
The predefined adjacency matrix in fig. 9 shows that the connection relationship between the joint points is fixed and invariant, which cannot effectively capture the potential dependencies between the position and movement information of the joint points across different actions. For a simple action such as running, the cooperative dependency between joints in the foot and hand regions is important and cannot be neglected. The skeleton-adaptive adjacency matrix in fig. 10 is more flexible in form and, as the skeleton data change, adapts better to different behavior recognition tasks.
Similarly, the attention layer with joint enhancement assigns higher weights to the joint points that contribute strongly to a skeletal action and correspondingly reduces the weights of low-contribution nodes. Compared with fig. 11, fig. 12 shows that adding the joint-enhancement layer to the reference network increases the recognition rates of hand catch (HCh), draw X (DX), side boxing (SB), and pickup-and-throw (PT) by 0.17, 0.16, 0.22, and 0.15, respectively. Notably, the recognition rates of high throw (HT) and tennis serve (TSr) instead decrease by 0.17 and 0.12; these actions are misclassified as forward punch (FP) and hammer (H), which are extremely similar to them. Because the high-contribution joints of similar actions also resemble each other, the two pairs become confused in classification once the joint-enhancement layer is added. This suggests that for similar actions attention should be spread over more joint points, rather than concentrated on a few nodes while the information of the low-contribution nodes is discarded.
To verify the performance of the algorithm proposed in this embodiment more fully, it is further compared with current state-of-the-art (SOTA) behavior recognition methods on the MSR Action3D dataset; the comparison results are shown in table 3.
TABLE 3 Comparison of recognition rates with other methods on the MSR Action3D dataset
(Table 3 is rendered as an image in the original publication.)
The recognition accuracy of the graph convolution network based on skeleton adaptation and joint enhancement in this experiment is 95.36%, an improvement of 6.89% over the adaptive skeleton center-point method [17] and of 3.33% over the popular differential recurrent neural network algorithm, and it is superior to most existing behavior recognition methods. The feature-fusion method Features combination [23] achieves the highest recognition accuracy, but it uses more than thirty parameters and requires a large amount of time to tune their values manually in order to obtain the optimal recognition result, which increases the training difficulty of the model. The experimental results in table 3 show that the proposed algorithm is highly feasible for modeling 3D human skeleton data and is competitive with existing behavior recognition methods, providing a viable option for 3D human skeleton data modeling and human behavior recognition.
Finally, the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solutions without departing from their spirit and scope, and all such modifications are intended to be covered by the claims of the present invention.

Claims (5)

1. A method for recognizing human behavior in an elevator car, characterized by comprising: designing a human skeleton spatio-temporal topological graph, learning the connection relations among the joint points with an embedded Gaussian function, and adaptively adjusting the joint structure connections according to the input sample data;
introducing a soft attention mechanism to measure the differences between the joint points, enhancing the feature expression of high-contribution joint points and weakening the interference of low-contribution joint point features, so as to extract highly discriminative joint features;
and constructing an end-to-end graph convolution network to learn the spatio-temporal co-occurrence features of the human skeleton joint points.
2. The method for recognizing human behavior in an elevator car according to claim 1, characterized in that: the predefined human skeleton graph is formed by the joint points and the skeleton edges connecting pairs of joints according to the body structure; a complete body motion comprises a T-frame skeleton sequence with N joints, and the skeleton graph is denoted G = (V, E), where the joint point set V = {v_ti | t = 1, ..., T; i = 1, ..., N} contains all joints in the skeleton sequence, and the skeleton edge set is E = {E_S, E_T};
wherein E_S = {v_ti v_tj | (i, j) ∈ H} is the set of spatial edges connecting adjacent joint points within one frame of the skeleton sequence, H being the set of naturally connected human joint pairs, and E_T = {v_ti v_(t+1)i} is the set of temporal edges connecting the same joint point between consecutive frames.
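The spatio-temporal skeleton graph G = (V, E) described above can be built in a few lines. This is an illustrative sketch: the function name and the 0-based joint indexing are assumptions, and `bones` stands in for the set H of naturally connected joint pairs.

```python
def skeleton_graph(num_frames, bones):
    """Build the joint set V, spatial edges E_S (within a frame, along
    the bone pairs in H) and temporal edges E_T (same joint across
    consecutive frames) of the spatio-temporal skeleton graph."""
    joints = sorted({i for pair in bones for i in pair})
    V = [(t, i) for t in range(num_frames) for i in joints]
    E_S = [((t, i), (t, j)) for t in range(num_frames) for (i, j) in bones]
    E_T = [((t, i), (t + 1, i)) for t in range(num_frames - 1) for i in joints]
    return V, E_S, E_T
```

For a 3-frame sequence of a 3-joint chain this yields 9 nodes, 6 spatial edges, and 6 temporal edges.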
3. The method for recognizing human behavior in an elevator car according to claim 2, characterized in that: in the spatial dimension, the graph convolution operation for feature extraction at any joint point v_ti of the skeleton graph is expressed as:
f_out(v_ti) = Σ_{v_tj ∈ B(v_ti)} (1 / Z_ti(v_tj)) · f_in(v_tj) · w(l_ti(v_tj))    (1)
where f is the extracted feature of joint point v_ti; B(v_ti) = {v_tj | d(v_tj, v_ti) ≤ D} is the neighbor set of v_ti, D controls the range of neighbor nodes taken, and v_tj denotes the joint points directly connected with v_ti; Z_ti(v_tj) = |{v_tk | l_ti(v_tk) = l_ti(v_tj)}| is a normalization term, and w is the weight function of the neighboring joint points; to learn the differentiated features of each neighbor node, the neighbor nodes are divided into K sets according to human motion characteristics, and the label mapping l_ti(v_tj): B(v_ti) → {0, ..., K − 1} assigns a unique weight vector to each node v_tj;
the graph convolution operation in the time domain is obtained by extending the spatial graph convolution, with the parameter Γ controlling the temporal range of the neighbor set, so that the neighbor set over the spatial and temporal dimensions is expressed as:
B(v_qj) = {v_qj | d(v_tj, v_ti) ≤ D, |q − t| ≤ ⌊Γ/2⌋}    (2)
and the corresponding label mapping of the neighbor nodes is l_ST(v_qj) = l_ti(v_tj) + (q − t + Γ/2) × K, where l_ti(v_tj) is the single-frame label mapping at v_ti;
in the graph convolution network, after self-node information aggregation and neighbor-node information aggregation are performed at each joint point, the information is propagated to the next joint point until all joint points of the skeleton sequence have been traversed; the graph convolution formula is expressed as:
f_out = D^(−1/2) (A + I) D^(−1/2) f_in W    (3)
the feature map f is a tensor of dimension C × T × N, where C is the number of channels, T is the number of frames of the skeleton graph sequence, and N is the number of joints of a single-frame skeleton graph;
the connection relationship of the joints in the skeleton graph is represented by the N × N adjacency matrix A and the identity matrix I; D is the degree matrix of the joint points, D^(−1/2)(A + I)D^(−1/2) represents the normalized skeleton graph structure, and W is the weight matrix learned in the graph convolution network; according to the neighbor-node partition strategy, the adjacency matrix A is decomposed into K matrices A_k, and the graph convolution formula is further expressed as:
f_out = Σ_{k=1}^{K} D_k^(−1/2) A_k D_k^(−1/2) f_in W_k    (4)
the connection relationship and connection strength of the joint points are adjusted dynamically according to the input skeleton data, and the formula of the adaptive graph convolution layer becomes:
f_out = Σ_{k=1}^{K} (A_k + B_k) f_in W_k    (5)
where A_k is the original matrix representing the natural articulation of the joints of the body and B_k is the adaptive matrix; a normalized Gaussian function is embedded in the network layer to compute the similarity of two joint points of the skeleton graph, so as to measure their connection relationship, specifically:
f(v_i, v_j) = exp(Φ(v_i)^T Ψ(v_j)) / Σ_{j=1}^{N} exp(Φ(v_i)^T Ψ(v_j))    (6)
where Φ(v_i) = W_Φ v_i and Ψ(v_j) = W_Ψ v_j are both embedding operations, and W_Φ and W_Ψ are the corresponding weight parameters;
the two embedding functions Φ and Ψ are realized as two parallel 1 × 1 convolutions; their output features are dimension-transformed, matrix-multiplied, and normalized with a softmax function to obtain the adaptive matrix:
B_k = softmax(f_in^T W_Φk^T W_Ψk f_in)    (7)
wherein each element of B_k is normalized to [0, 1].
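A minimal sketch of the adaptive matrix of eq. (7), assuming plain matrix multiplies in place of the 1 × 1 convolutions (equivalent when features are per-joint column vectors). All names, shapes, and the row-wise softmax convention are illustrative assumptions.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def adaptive_matrix(f_in, W_phi, W_psi):
    """Adaptive adjacency B_k = softmax(f_in^T W_phi^T W_psi f_in):
    embed the (C x N) joint features with the two embedding weights,
    take dot-product similarities between every pair of joints, and
    row-normalize so each row sums to 1."""
    phi = W_phi @ f_in              # (d_e, N) embedded features
    psi = W_psi @ f_in              # (d_e, N)
    return softmax_rows(phi.T @ psi)  # (N, N), entries in [0, 1]
```

The softmax guarantees the normalization to [0, 1] stated above, and B_k changes with the input skeleton data, unlike the fixed A_k.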
4. The method for recognizing human behavior in an elevator car according to claim 3, characterized in that: the features f_t containing the spatial structure of each joint are taken from the t-th frame of the skeleton graph; the joint features are first transformed by a fully connected layer, and all transformed joint features are aggregated to obtain the query feature:
q_t = Σ_{i=1}^{N} W f_ti    (8)
where W is a learnable weight matrix; the attention scores of the joint points of the skeleton graph are expressed as:
m_t = softmax(W_s tanh(W_f f_t + W_q q_t + b_fq) + b_s)    (9)
where W_s, W_f, W_q are all learnable weight matrices, b_fq and b_s are biases, and m_t = (m_t1, m_t2, ..., m_tN) denotes the importance of the corresponding joint points of the t-th frame skeleton graph, normalized to [0, 1] by the softmax function; the attention layer outputs a skeleton graph with enhanced joint features, the enhanced feature of each joint point v_ti being its input feature weighted by the attention score m_ti.
5. The method for recognizing human behavior in an elevator car according to claim 4, characterized in that: 3D verification is performed on the public behavior recognition dataset MSR Action3D.
CN202110625850.0A 2021-06-04 2021-06-04 Method for recognizing human body behaviors in elevator car Withdrawn CN113239884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110625850.0A CN113239884A (en) 2021-06-04 2021-06-04 Method for recognizing human body behaviors in elevator car

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110625850.0A CN113239884A (en) 2021-06-04 2021-06-04 Method for recognizing human body behaviors in elevator car

Publications (1)

Publication Number Publication Date
CN113239884A true CN113239884A (en) 2021-08-10

Family

ID=77136825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110625850.0A Withdrawn CN113239884A (en) 2021-06-04 2021-06-04 Method for recognizing human body behaviors in elevator car

Country Status (1)

Country Link
CN (1) CN113239884A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688765A (en) * 2021-08-31 2021-11-23 南京信息工程大学 Attention mechanism-based action recognition method for adaptive graph convolution network
CN113688765B (en) * 2021-08-31 2023-06-27 南京信息工程大学 Action recognition method of self-adaptive graph rolling network based on attention mechanism
CN114613011A (en) * 2022-03-17 2022-06-10 东华大学 Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network
CN115984787A (en) * 2023-03-20 2023-04-18 齐鲁云商数字科技股份有限公司 Intelligent vehicle-mounted real-time alarm method for industrial brain public transport
CN116524601A (en) * 2023-06-21 2023-08-01 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot
CN116524601B (en) * 2023-06-21 2023-09-12 深圳市金大智能创新科技有限公司 Self-adaptive multi-stage human behavior recognition model for assisting in monitoring of pension robot

Similar Documents

Publication Publication Date Title
CN113239884A (en) Method for recognizing human body behaviors in elevator car
Cui et al. Learning dynamic relationships for 3d human motion prediction
Jonschkowski et al. Pves: Position-velocity encoders for unsupervised learning of structured state representations
Jordan et al. Hierarchies of adaptive experts
CN109919122A (en) A kind of timing behavioral value method based on 3D human body key point
CN111652124A (en) Construction method of human behavior recognition model based on graph convolution network
CN103514443B (en) A kind of single sample recognition of face transfer learning method based on LPP feature extraction
CN107437077A (en) A kind of method that rotation face based on generation confrontation network represents study
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN105787557A (en) Design method of deep nerve network structure for computer intelligent identification
CN109886072A (en) Face character categorizing system based on two-way Ladder structure
CN106778796A (en) Human motion recognition method and system based on hybrid cooperative model training
CN110826698A (en) Method for embedding and representing crowd moving mode through context-dependent graph
CN113688765B (en) Action recognition method of self-adaptive graph rolling network based on attention mechanism
CN107066951A (en) A kind of recognition methods of spontaneous expression of face and system
CN114998525A (en) Action identification method based on dynamic local-global graph convolutional neural network
CN113378656A (en) Action identification method and device based on self-adaptive graph convolution neural network
CN112990154A (en) Data processing method, computer equipment and readable storage medium
CN117373116A (en) Human body action detection method based on lightweight characteristic reservation of graph neural network
CN116229179A (en) Dual-relaxation image classification method based on width learning system
Cao et al. QMEDNet: A quaternion-based multi-order differential encoder–decoder model for 3D human motion prediction
Praditia Physics-informed neural networks for learning dynamic, distributed and uncertain systems
Faulkner et al. Dyna planning using a feature based generative model
Nandal et al. A Synergistic Framework Leveraging Autoencoders and Generative Adversarial Networks for the Synthesis of Computational Fluid Dynamics Results in Aerofoil Aerodynamics
CN111739168A (en) Large-scale three-dimensional face synthesis method with suppressed sample similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210810
