CN114582030A - Behavior recognition method based on service robot - Google Patents
Behavior recognition method based on service robot
- Publication number: CN114582030A
- Application number: CN202210484610.8A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045: Combinations of networks
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Learning methods
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application relates to a behavior recognition method based on a service robot, comprising the following specific steps: extracting human joint point sequences for 13 common behavior categories in the service robot application scene to form a training data set; preprocessing the training data set; weighting and optimizing the joint point data in combination with the actual application scene to output 17 main joint points; constructing a lightweight multi-scale aggregation space-time graph convolution deep learning neural network model from multi-scale space-time graph convolution and temporal convolution modules; training and testing the model on the data set; recognizing human behaviors in video images of the real scene to be recognized with the trained model; and having the service robot receive the human behavior recognition result and make a corresponding response. The invention can accurately recognize human behaviors in the scene and ensures the service quality of the service robot.
Description
Technical Field
The application relates to the technical field of human behavior recognition, in particular to a behavior recognition method based on a service robot.
Background
With the development of science and technology and deepening research into artificial intelligence, the application field of robotics is no longer limited to industrial robots but is spreading toward civilian and household use, and service robots are gradually entering people's daily lives. In recent years, service robots have become increasingly intelligent and rich in function, and are widely applied in cleaning, medical treatment, rescue, logistics, maintenance, security and other areas. Developing the service robot industry can relieve the pressure of caring for the elderly and disabled, improve quality of life, and promote the rapid development of civilian technology; it is a strategic measure for bringing the benefits of advanced scientific achievements to the public, so countries around the world attach great importance to this industry and invest substantial resources in its research and development. Although the core technologies of service robots are relatively mature, the complex external environment remains a great challenge for research on positioning and navigation, human-computer interaction, computer vision, reasoning tasks and the like. By algorithmically analyzing the video images captured by the service robot, the behavior of people in the scene can be judged and a response made. To recognize human behavior in video, information highly relevant to the target person's behavior must first be extracted from the video; key information is then obtained through algorithmic processing, and finally the key information is used to recognize the human behavior.
With the miniaturization, integration and intelligence of cameras and their flexible interfaces, a service robot can capture indoor scenes in real time through an onboard camera. Traditional feature extraction methods obtain high-dimensional visual features through spatio-temporal keypoint sampling, dense trajectory sampling, body part sampling and similar techniques, and predict behavior with classifiers such as SVM (Support Vector Machine) and RF (Random Forest). Deep learning methods instead learn features automatically and perform end-to-end feature extraction and recognition; in particular, applying graph convolution networks to the human skeleton avoids, as far as possible, the influence of complex background, shape, RGB color and other information on recognition accuracy. A keypoint recognition algorithm (such as OpenPose or MediaPipe) is applied to the captured video frames to obtain human keypoint sequence information, which is fed into the constructed multi-scale aggregation space-time graph convolution network model to obtain the behavior of the corresponding person, so that the service robot can respond accordingly (for example, when a person waves, the robot recognizes the motion and approaches the person).
In existing schemes, human skeleton behavior recognition methods based on graph convolution mostly treat the skeleton sequence as a series of disjoint graphs, extracting features with a graph convolution (GCN) module in the spatial dimension and a temporal convolution (TCN) module in the time dimension. In the complex working environment of a service robot, the recognition efficiency of behavior recognition models built from ordinary graph convolutions is not high, and wrong recognition causes wrong interaction by the robot, affecting the robot's service quality and the service object's experience. A lightweight human behavior recognition model is therefore urgently needed for service robots.
Disclosure of Invention
The embodiments of the application aim to provide a behavior recognition method based on a service robot, designing a lightweight graph-convolution human behavior recognition model that spans space-time relations, so that the overall recognition effect is ensured, false recognition of similar actions is reduced, and the quality of the service robot's remote visual interaction is improved.
In order to achieve the above purpose, the present application provides the following technical solutions:
the embodiment of the application provides a behavior identification method based on a service robot, which comprises the following specific steps:
s1, extracting human body joint point sequences of 13 behavior categories commonly used in the service robot application scene to form a training data set;
s2, preprocessing the training data set, firstly extracting key frames of the joint point sequence, and then optimizing the joint point data by combining with the actual application scene;
s3, for a video shot in a real scene, first performing keypoint estimation with the body_25 human pose estimation model in OpenPose to obtain 25 keypoint coordinates and confidence values, then filling missing keypoint values in the obtained data with a K-nearest-neighbour method, and finally weighting and optimizing the joint point data in combination with the actual application scene to output 17 main joint points;
s4, constructing a lightweight multi-scale aggregation space-time graph convolution deep learning neural network model from multi-scale space-time graph convolution and temporal convolution modules;
s5, training and testing the data set by using the constructed network model;
s6, identifying human body behaviors in the video image under the real scene to be identified by using the trained model;
and S7, the service robot receives the human behavior recognition result and responds correspondingly.
In step S1, the training data set is derived from the NTU RGB+D human behavior data set, from which 13 behavior categories are selected: drinking, picking up, throwing away, sitting down, standing up, jumping, shaking the head, falling, chest pain, waving, kicking, hugging and walking, 12324 skeleton files in total.
The step S2 of extracting key frames from the skeleton sequence includes:
On the premise that one frame is extracted every 30 frames from each video segment corresponding to the different behavior categories in the service robot application scene, 300 frames of data are retained as a training sample; videos shorter than 300 frames are re-extracted cyclically from the beginning. The number of persons in the joint data is then judged, and only joint data containing a single person is retained for training and validating the model.
The step S3 specifically includes:
s31, detecting person keypoints in video images of the real scene with the OpenPose human keypoint detection algorithm, obtaining the horizontal and vertical coordinate values (x, y) of the 25 skeletal joint points from the body_25 human joint labeling model, splicing the discrete joint points according to the physical connections of the human joints to form a human skeleton spatial topology model, and then splicing the spatial topology graphs of successive frames in time order to finally obtain the space-time graph of human skeleton structure change;
s32, for missed detection of a whole frame, defining joint points 0, 1 and 8 as main keypoints: if any of these three groups of data is missing from the joint data output for a video frame, the whole frame is judged to be a missed detection and the joint data of that video frame is deleted; for a frame missing only some keypoints, a 2-order K-nearest-neighbour fill is used, which requires no training or parameter estimation: the missing point is supplemented directly with the mean of the horizontal and vertical coordinate values (x, y) of the preceding and following frames.
The step S4 specifically includes:
s41, graph convolution calculation process: after the joint point coordinates are obtained, the joint points are taken as vertices and the natural connections between joints as bone edges, so the human skeleton is represented as a graph $G=(V,E)$. The $T$ frame skeleton graphs are arranged in time order and same-position joint points are connected, forming a space-time skeleton graph. The node set $V=\{v_{ti} \mid t=1,\dots,T;\ i=1,\dots,N\}$ is the set of all joint points in each skeleton graph, where $N$ is the number of joints per frame. The edge set $E$ is represented by two subsets: the first, $E_S=\{v_{ti}v_{tj} \mid (i,j)\in H\}$, represents the intra-frame skeleton connections of each frame, where $H$ is the set of naturally connected human joints; the second, $E_F=\{v_{ti}v_{(t+1)i}\}$, represents the connecting edges between same-position joint points in adjacent frames, with $i$ the serial number of the joint point. From the node set $V$ and the edge set $E$ an adjacency matrix $A$ is obtained, and the graph convolution is calculated as follows:
$$f_{out}=\sum_{k=1}^{K_v} W_k\,(f_{in} A_k)$$
where $f_{in}$ is the input, $f_{out}$ is the output, $A_k$ is the adjacency matrix, $W_k$ is a learnable weight, and $K_v$ is the spatial kernel size;
s42, adaptive graph convolution calculation process: as shown in the following formula, on the basis of the adjacency matrix $A_k$, two matrices $B_k$ and $C_k$ are added, where $B_k$ is a trainable weight matrix and $C_k$ learns a unique graph for each sample:
$$f_{out}=\sum_{k=1}^{K_v} W_k\,f_{in}\,(A_k+B_k+C_k)$$
s43, multi-scale space-time graph convolution calculation process: to better connect the spatial and temporal skeleton information, the $k$-hop adjacency matrices of the nodes are tiled into one large block matrix in which each node is directly connected to its corresponding neighbours on all frames, realizing hop connections between nodes; the calculation process follows the graph convolution formula above with $A_k$ replaced by the tiled multi-hop adjacency matrices;
s44, MS-GCN multi-scale space-time graph convolution module: the 1-hop through $K$-hop adjacency matrices of the input node information are extracted separately and the resulting $K$ matrices are finally spliced together, where $i$ is the serial number of the joint point, $v_i$ is the joint point coordinate, and $d(v_i,v_j)$ denotes the shortest hop distance between nodes; the $k$-hop adjacency matrix satisfies $[A_{(k)}]_{ij}=1$ if $d(v_i,v_j)=k$ and $0$ otherwise;
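The multi-hop extraction of step S44 can be sketched as follows. This is illustrative code, not part of the original disclosure; the function and variable names are assumptions. It builds the matrices $[A_{(k)}]_{ij}=1$ exactly when the shortest hop distance between joints $i$ and $j$ equals $k$, for a toy 4-joint chain skeleton:

```python
# Sketch (not the patent's code): building the k-hop adjacency matrices
# that a multi-scale graph convolution module splices together.
import numpy as np

def k_hop_adjacency(A, max_k):
    """Return [A_1 ... A_max_k], where A_k links joints exactly k hops apart."""
    n = A.shape[0]
    dist = np.full((n, n), np.inf)        # shortest hop distance d(v_i, v_j)
    np.fill_diagonal(dist, 0)
    reach = np.eye(n, dtype=bool)         # joints reached in <= k hops so far
    frontier = A.astype(bool)             # joints reachable in exactly k walks
    for k in range(1, max_k + 1):
        newly = frontier & ~reach         # first reached at hop distance k
        dist[newly] = k
        reach |= newly
        frontier = (frontier.astype(int) @ A.astype(int)) > 0
    return [np.where(dist == k, 1.0, 0.0) for k in range(1, max_k + 1)]

# toy 4-joint chain skeleton: 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
hops = k_hop_adjacency(A, 2)
```

In an MS-GCN module the matrices in `hops` would then be concatenated so one layer aggregates neighbours at several graph scales at once.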
s45, MS-TCN temporal dilation convolution module: a pointwise convolution adjusts the channel number of the input information and a temporal convolution kernel processes the integrated information; the convolved features are processed in a dilated (atrous) convolution manner, the extracted features are spliced together, and finally a pointwise convolution with stride 2 outputs the processed features;
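The core operation of the temporal branch above, a dilated convolution along one joint's coordinate sequence, can be sketched as follows (illustrative code, not part of the disclosure; the 3-tap kernel and names are assumptions):

```python
# Sketch: a dilated ("atrous") 1-D temporal convolution. Spacing the taps
# `dilation` frames apart widens the receptive field without extra weights.
def dilated_conv1d(x, kernel, dilation):
    """'Valid' dilated 1-D convolution over a frame sequence x."""
    span = (len(kernel) - 1) * dilation
    return [sum(k * x[t + j * dilation] for j, k in enumerate(kernel))
            for t in range(len(x) - span)]

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. one joint's x-coordinate per frame
taps = [1.0, 1.0, 1.0]               # toy 3-tap summing kernel
narrow = dilated_conv1d(x, taps, dilation=1)  # looks at 3 consecutive frames
wide   = dilated_conv1d(x, taps, dilation=2)  # same taps, 5-frame receptive field
```

Stacking such layers with growing dilation is what lets the MS-TCN branch attend to long-range temporal context cheaply.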
s46, lightweight multi-scale space-time graph convolutional network MS-SGTCN_S: to increase the robustness of the extracted features, two network branches perform inference on the input joint point data. The first branch consists of a convolution module, MS-GCN modules and a fully connected layer, with 4 MS-GCN modules in the middle extracting multi-scale spatio-temporal features, realized with different temporal and spatial sliding windows. The second branch consists of one MS-GCN module and two MS-TCN modules, using a long-range temporal module to strengthen the network's attention to contextual change of the joint points in the time dimension. The feature information obtained by the two branches is then sent together to an MS-TCN module, the features are spliced through a fully connected layer, and the category with the maximum probability after softmax classification is the predicted human behavior. To further improve the accuracy of the algorithm, a dual-stream network is designed to train on the joint point and skeleton sequences separately; confidence statistics are then taken over the predictions of the joint and skeleton streams, and the behavior with the higher confidence is the final output prediction.
In step S46, a dual-stream network is designed to train on the joint point and skeleton sequences separately; confidence statistics are then taken over the predictions of the joint and skeleton streams, and the human behavior with the higher confidence is the final output prediction value.
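The dual-stream confidence fusion described above can be sketched as follows (illustrative code, not part of the disclosure; names and scores are assumptions):

```python
# Sketch: keep the behavior label whose best stream confidence is highest,
# comparing the joint-stream and skeleton-stream softmax outputs.
def fuse_two_streams(joint_scores, bone_scores):
    """Each input maps behavior label -> softmax confidence."""
    best_joint = max(joint_scores, key=joint_scores.get)
    best_bone = max(bone_scores, key=bone_scores.get)
    if joint_scores[best_joint] >= bone_scores[best_bone]:
        return best_joint
    return best_bone

pred = fuse_two_streams({"waving": 0.71, "drinking": 0.29},
                        {"waving": 0.40, "standing up": 0.60})
```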
In step S7, to reduce the influence of the complex external environment on the working quality of the service robot, the robot is designed to respond to a behavior only after continuously receiving the corresponding behavior signal for more than 2 seconds; for dangerous behaviors, the service robot sends alarm information to remind a worker to handle the situation.
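The 2-second persistence rule of step S7 amounts to debouncing the per-frame recognition results; a minimal sketch (hypothetical class and names, not part of the disclosure):

```python
# Sketch: the robot acts only once the same behavior label has been
# received continuously for longer than the hold time (2 s in the patent).
class BehaviorDebouncer:
    def __init__(self, hold_seconds=2.0):
        self.hold = hold_seconds
        self.label = None    # label currently being observed
        self.since = None    # timestamp when that label first appeared

    def update(self, label, t):
        """Feed one recognition result at time t; return the label once it
        has persisted longer than the hold time, else None."""
        if label != self.label:
            self.label, self.since = label, t
            return None
        if t - self.since > self.hold:
            return label
        return None

d = BehaviorDebouncer()
out = [d.update(l, t) for t, l in [(0.0, "fall"), (1.0, "fall"),
                                   (2.0, "fall"), (2.5, "fall")]]
```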
Compared with the prior art, the invention has the beneficial effects that: the invention can accurately identify the human body behaviors in the scene, and ensures the service quality of the service robot.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a visualization diagram of a training data skeleton according to an embodiment of the present invention;
FIG. 3 is a body skeleton spatial topology model according to an embodiment of the present invention;
FIG. 4 is a time-space diagram illustrating the change of the skeleton structure of a human body according to an embodiment of the present invention;
FIG. 5 is a block diagram illustrating a MS-GCN multi-scale space-time graph convolution module according to an embodiment of the present invention;
FIG. 6 is a block diagram of an MS-TCN time-dilation convolution module in accordance with an embodiment of the present invention;
FIG. 7 is a multi-scale space-time graph convolution network according to an embodiment of the present invention;
FIG. 8 shows a test set RGB video image test result 1 according to an embodiment of the present invention;
FIG. 9 shows a test set RGB video image test result 2 according to an embodiment of the present invention;
fig. 10 is a result 1 of human behavior recognition in a real scene according to an embodiment of the present invention;
fig. 11 is a result 2 of human behavior recognition in a real scene according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As shown in fig. 1, an embodiment of the present application provides a behavior recognition method based on a service robot, including the following specific steps:
s1, extracting human body joint point sequences of 13 behavior categories commonly used in the service robot application scene to form a training data set;
s2, preprocessing the training data set, firstly extracting key frames of the joint point sequence, and then optimizing the joint point data by combining with an actual application scene;
s3, for a video shot in a real scene, first performing keypoint estimation with the body_25 human pose estimation model in OpenPose to obtain 25 keypoint coordinates and confidence values, then filling missing keypoint values in the obtained data with a K-nearest-neighbour method, and finally weighting and optimizing the joint point data in combination with the actual application scene to output 17 main joint points;
s4, constructing a lightweight multi-scale aggregation space-time graph convolution deep learning neural network model from multi-scale space-time graph convolution and temporal convolution modules;
s5, training and testing the data set by using the constructed network model;
s6, identifying human body behaviors in the video image under the real scene to be identified by using the trained model;
and S7, the service robot receives the human behavior recognition result and responds correspondingly.
In step S1, the training data is derived from the NTU RGB+D human behavior data set produced by Nanyang Technological University, Singapore, from which 13 daily and medical behaviors are selected: drinking, picking up, throwing away, sitting down, standing up, jumping, shaking the head, falling, chest pain, waving, kicking, hugging and walking, 12324 skeleton files in total.
In step S2, because the video durations corresponding to the different action categories differ, the original data is processed by interval sampling and cyclic repetition from the starting frame: on the premise that one frame is extracted every 30 frames from each video segment, 200 frames of data are retained as a training sample, and videos shorter than 200 frames are re-extracted cyclically from the beginning. The algorithm judges the number of persons in the joint point data and retains only joint data containing a single person for training and validating the model; specifically, the joint points are counted, and if the total number of joint points exceeds 25, it is judged that an interfering person appears in the data, and that joint point data is deleted.
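The interval-sampling and cyclic-repeat preprocessing above can be sketched as follows (illustrative helper, not part of the disclosure; the small `target` in the example is only for demonstration):

```python
# Sketch: keep one frame every `step` frames, then loop from the start
# until `target` frames are retained (assumes a non-empty clip).
def sample_frames(frames, step=30, target=200):
    kept = frames[::step]
    while len(kept) < target:              # shorter clips: repeat from the beginning
        kept += kept[:target - len(kept)]
    return kept[:target]

clip = list(range(90))                     # a 90-frame toy clip (frame indices)
out = sample_frames(clip, step=30, target=5)
```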
To further improve computational efficiency and be compatible with the keypoint data of the body_25 human pose estimation model in OpenPose, the 25 joint points in the original data set are subjected to weighted optimization to remove joint points that have little influence on recognizing the behavior of the service robot's service object, and the joint point data are recoded.
The node set of the training data set is represented by the following formula:
$$V=\{v_{ti} \mid t=1,\dots,T;\ i=1,\dots,25\}$$
where $V$ is the training data set node set and $v_{ti}$ is the coordinate value of joint point $i$ at time $t$; after the frame-extraction processing of the data set, $T$ is set to 200, and $i$ is the serial number of the joint point, 25 joint points in total.
The set of 17 joint points after weighted optimization is represented by the following formula:
$$V'=\{v'_{ti} \mid t=1,\dots,T;\ i=1,\dots,17\}$$
where $V'$ is the joint point set after weighted optimization and $v'_{ti}$ is the coordinate value of joint point $i$ at time $t$ after weighted optimization; as in the formula above, $T$ is set to a maximum of 200, and $i$ is the serial number of the joint point, 17 joint points in total.
The joint points numbered 1, 3, 4, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16 and 17 in $V'$ correspond respectively to the joint points numbered 1, 5, 6, 9, 10, 13, 14, 15, 16, 17, 18, 19, 20 and 21 in $V$.
The head joint points 3 and 4 in $V$ are integrated into joint point 2 of $V'$ by weighted optimization calculation; the left-hand joint points 7, 8, 22 and 23 in $V$ are integrated into joint point 5 by weighted optimization calculation; the right-hand joint points 11, 12, 24 and 25 in $V$ are integrated into joint point 8 by weighted optimization calculation. The weighted calculation formula is as follows:
$$v'=\sum_j \alpha_j\, v_j$$
where $v_j$ is from the joint point set $V$, $v'$ is from the joint point set $V'$, and $\alpha_j$ are the joint weighting optimization coefficients.
After recoding, the 17 joint points are finally output as the training set data; the resulting human skeleton topology structure is shown in fig. 2.
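The weighted merging of a joint group into a single joint can be sketched as follows. The equal weights below are an illustrative assumption: the patent defines weighting coefficients $\alpha_j$ but does not give their numerical values, and the helper names are not from the original.

```python
# Sketch: merge a group of joints into one weighted joint,
# e.g. head joints 3 and 4 -> new joint 2 (equal alphas assumed).
def merge_joints(points, group, alphas):
    """points: {joint_id: (x, y)}; returns the weighted (x, y) of the group."""
    assert abs(sum(alphas) - 1.0) < 1e-9   # coefficients should sum to 1
    x = sum(a * points[j][0] for j, a in zip(group, alphas))
    y = sum(a * points[j][1] for j, a in zip(group, alphas))
    return (x, y)

frame = {3: (0.0, 2.0), 4: (2.0, 4.0)}     # head joints of one frame
head = merge_joints(frame, group=[3, 4], alphas=[0.5, 0.5])
```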
The specific flow of step S3 is:
S31, detecting person keypoints in video images of the real scene with the OpenPose human keypoint detection algorithm, and obtaining the horizontal and vertical coordinate values (x, y) and confidence S of the 25 skeletal joint points from the body_25 human joint labeling model. The discrete joint points are spliced together according to the physical connections of the human joints to form a human skeleton spatial topology model, as in fig. 2; the spatial topology graphs of successive frames are then spliced together in time order to finally obtain the space-time graph of human skeleton structure change, as in fig. 3.
S32, owing to external factors such as lighting, occlusion and changes in person behavior, missed detections are difficult to avoid when estimating keypoints with the OpenPose human pose estimation algorithm. There are two cases: a whole frame is missed, or some keypoints within a frame are missed. For the first case, joint points 0, 1 and 8 are defined as main keypoints; if any of these three groups of data is missing in a frame, the whole frame is judged to be a missed detection and the frame's data is deleted. For the second case, a 2-order K-nearest-neighbour fill is used, taking the mean of the data in the frames before and after the point; this ensures reasonable filling at a small computational cost, since complete joint point data is closely tied to the accuracy of human behavior recognition.
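The frame-deletion and 2-order nearest-neighbour fill described in S32 can be sketched as follows (illustrative data structure and names, not part of the disclosure):

```python
# Sketch: drop frames missing any main keypoint (0, 1, 8), then fill a
# missing keypoint with the mean of the same joint in the adjacent frames.
MAIN = (0, 1, 8)

def clean_sequence(seq):
    """seq: list of {joint_id: (x, y) or None}; returns the cleaned list."""
    seq = [f for f in seq if all(f.get(j) is not None for j in MAIN)]
    for t in range(1, len(seq) - 1):
        for j, p in seq[t].items():
            if p is None and seq[t - 1].get(j) and seq[t + 1].get(j):
                (xa, ya), (xb, yb) = seq[t - 1][j], seq[t + 1][j]
                seq[t][j] = ((xa + xb) / 2, (ya + yb) / 2)
    return seq

frames = [{0: (0, 0), 1: (0, 1), 8: (0, 2), 4: (1.0, 1.0)},
          {0: (0, 0), 1: (0, 1), 8: (0, 2), 4: None},       # joint 4 missed
          {0: (0, 0), 1: (0, 1), 8: (0, 2), 4: (3.0, 5.0)}]
fixed = clean_sequence(frames)
```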
S33, as in the joint processing of step S2, the 25 joint points of fig. 3 are recoded by the weighted optimization algorithm and 17 joint points are output.
The set of 25 joint points output by the body_25 model of OpenPose human pose estimation is represented by the following formula:
$$V=\{v_{ti} \mid t\in[0,T];\ i=1,\dots,25\}$$
where $V$ is the node set and $v_{ti}$ is the joint point coordinate value taken at time $t$; since these data are used for testing, $T$ is the duration of the entire video segment, and $i$ is the serial number of the joint point, 25 joint points in total.
The joint points with serial numbers 1, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14 and 17 in the output set correspond respectively to joint points 8, 5, 6, 7, 2, 3, 4, 12, 13, 9, 10 and 1 in the body_25 set.
The head joint points 0, 15, 16, 17 and 18 in the body_25 set are integrated into joint point 2 through weighted optimization calculation; the left foot joint points 14 and 21 are integrated into joint point 11, and joint points 19 and 20 are integrated into joint point 12; the right foot joint points 11 and 24 are integrated into joint point 15, and joint points 22 and 23 are integrated into joint point 16. The weighted calculation formula is as follows:

v'_j = Σ_k w_k v_k

wherein v'_j is the integrated joint point, v_k is taken from the corresponding set of body_25 joint points, and w_k is the weighted optimization coefficient of the joint.
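The joint re-encoding described above can be sketched as follows; the direct correspondences follow the mapping stated in the text, while the uniform merge weights are an assumption, since the text does not specify the weighted optimization coefficients.

```python
import numpy as np

# Direct correspondences (17-point index -> body_25 index), as read from the text.
DIRECT = {1: 8, 3: 5, 4: 6, 5: 7, 6: 2, 7: 3, 8: 4,
          9: 12, 10: 13, 13: 9, 14: 10, 17: 1}

# Weighted merges from the text; uniform weights are an assumption here.
MERGE = {2: [0, 15, 16, 17, 18],   # head joints -> joint 2
         11: [14, 21],             # left ankle/heel -> joint 11
         12: [19, 20],             # left toes -> joint 12
         15: [11, 24],             # right ankle/heel -> joint 15
         16: [22, 23]}             # right toes -> joint 16

def remap_25_to_17(frame25):
    """frame25: (25, 2) array of body_25 (x, y) coordinates for one frame.
    Returns a dict mapping each of the 17 output joint indices to (x, y)."""
    out = {k: frame25[src] for k, src in DIRECT.items()}
    for k, srcs in MERGE.items():
        w = np.full(len(srcs), 1.0 / len(srcs))    # assumed uniform weights w_k
        out[k] = (w[:, None] * frame25[srcs]).sum(axis=0)
    return out
```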
The specific flow of step S4 is:
S41, graph convolution calculation process: after the coordinates of the joint points are obtained, the joint points are taken as the vertices V and the natural connections of the joint points as the skeleton edges E, so that the human skeleton can be represented as a graph G = (V, E). The T frame skeleton graphs are arranged in time sequence and the joint points at the same positions are connected to form a space-time skeleton graph. The node set V is the set of all joint points in each skeleton graph, and the calculation process of the graph convolution is as follows:

f_out = Λ^(-1/2) (A + I) Λ^(-1/2) f_in W

wherein f_in is the input, f_out is the output, A is the adjacency matrix, I is the identity matrix, Λ is the diagonal degree matrix of (A + I), and W is a learnable weight.
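A minimal sketch of one such spatial graph-convolution step, assuming the standard ST-GCN-style normalization Λ^(-1/2)(A + I)Λ^(-1/2) and a plain (N, C) feature matrix:

```python
import numpy as np

def graph_conv(f_in, A, W):
    """One spatial graph-convolution step:
    f_out = D^(-1/2) (A + I) D^(-1/2) f_in W.

    f_in: (N, C_in) node features, A: (N, N) adjacency, W: (C_in, C_out).
    """
    A_hat = A + np.eye(A.shape[0])                          # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # D^(-1/2)
    return d_inv_sqrt @ A_hat @ d_inv_sqrt @ f_in @ W
```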
And S42, adaptive graph convolution calculation process: because the fixed topology of graph convolution is not friendly to joint points that are not physically connected but strongly correlated, researchers have successively proposed adaptive graph convolution. The calculation process is shown in the following formula; on the basis of the original adjacency matrix A_k, two matrices B_k and C_k are newly added:

f_out = Σ_k W_k f_in (A_k + B_k + C_k)

B_k is a trainable weight, and no constraints such as normalization are imposed on it; that is, B_k is a parameter learned entirely from the data, which can indicate not only whether two nodes are connected but also the strength of the connection. The difference from ST-GCN is the fusion mode: ST-GCN uses multiplication, whereas addition is used here, which can produce associations that do not exist in the physical skeleton. C_k learns a unique graph for each sample and adopts the very classical Gaussian embedding function, so that the similarity between joints can be captured.
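The adaptive variant can be illustrated as follows; the (N, C) data layout, the single-subset form, and the row softmax used for the Gaussian-embedding similarity matrix C are simplifying assumptions for this sketch.

```python
import numpy as np

def adaptive_graph_conv(f_in, A, B, theta, phi, W):
    """Adaptive graph convolution sketch: f_out = (A + B + C) f_in W, where
    A is the physical adjacency, B is a freely trainable matrix, and C is a
    data-dependent similarity graph from a Gaussian (softmax) embedding."""
    e = (f_in @ theta) @ (f_in @ phi).T            # (N, N) pairwise similarity logits
    C = np.exp(e - e.max(axis=1, keepdims=True))   # numerically stable row softmax
    C = C / C.sum(axis=1, keepdims=True)
    return (A + B + C) @ f_in @ W
```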
S43, multi-scale space-time graph convolution calculation process: to better connect the spatial and temporal skeleton information, the k-hop adjacency matrices of the nodes are tiled to form a (τN × τN) block matrix in which each node is directly connected to its corresponding k-hop neighbor nodes on all frames, thereby realizing skip connections between nodes. The calculation process is as follows:

[A_(k)]_(i,j) = 1 if d(v_i, v_j) = k, and 0 otherwise

wherein d(v_i, v_j) is the shortest hop distance between nodes v_i and v_j.
s44, MS-GCN multi-scale space-time graph convolution module: respectively to the input nodeFirst of informationExtracting the jump adjacency matrix and finally extracting the jump adjacency matrixThe matrices are spliced together.
S45, MS-TCN temporal dilated convolution module: a 1 × 1 convolution is used to adjust the number of channels of the input information, and a convolution kernel processes the integrated information; the features after convolution are processed in a manner similar to dilated convolution and the extracted features are concatenated; finally a convolution with stride 2 is added to output the processed information, which also has a certain correcting effect on the extracted features.
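The multi-dilation temporal filtering idea behind the MS-TCN module can be sketched as follows; the depthwise single-kernel form and the dilation rates (1, 2, 3, 4) are illustrative assumptions.

```python
import numpy as np

def dilated_temporal_conv(x, w, dilation):
    """Depthwise 1-D temporal convolution with a given dilation rate.

    x: (T, C) sequence of per-frame features, w: (K,) temporal kernel
    shared across channels (a simplification for illustration).
    """
    T, _ = x.shape
    K = w.shape[0]
    pad = (K - 1) * dilation // 2                      # keep output length T
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for k in range(K):                                 # sum shifted, weighted copies
        out += w[k] * xp[k * dilation : k * dilation + T]
    return out

def ms_tcn_branches(x, w, dilations=(1, 2, 3, 4)):
    """Concatenate features extracted at several dilation rates, mirroring
    the 'process like dilated convolution, then concatenate' step."""
    return np.concatenate([dilated_temporal_conv(x, w, d) for d in dilations], axis=1)
```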
S46, lightweight multi-scale space-time graph convolutional network MS-SGTCN_S: in order to increase the robustness of the extracted features, two network branches are designed to perform inference on the input joint point data. The first network branch consists of a convolution module, MS-GCN modules and a fully connected layer, where 4 MS-GCN modules are used in the middle to extract multi-scale space-time features, realized with different temporal and spatial sliding windows. The second branch consists of an MS-GCN module and two MS-TCN modules, and a long-range temporal module is adopted to strengthen the network's attention to the contextual change of joint points in the time dimension. The feature information obtained by the two branches is then jointly sent to an MS-TCN module, the features are spliced together through a fully connected layer, and after processing by a softmax classifier the class with the maximum probability is the predicted human behavior. To further improve the accuracy of the algorithm, a dual-stream network is designed to train on the joint point and skeleton sequences respectively; confidence statistics are then performed on the prediction results of the joint-point and skeleton streams, and the human behavior with the higher confidence is taken as the final output prediction.
Step S5 trains and tests on the data set using the constructed network model.
The dual-stream behavior recognition model is implemented based on PyTorch and run under CUDA 11.1 on a 3080Ti GPU. A mini-batch stochastic gradient descent algorithm is used to learn the network parameters, with batch size 32, momentum 0.9 and an initial learning rate of 0.05; the learning rate is reduced at the 25th and 35th training iterations, and weight decay is set to 0.0005. The accuracy under the X-Sub and X-View split rules is shown in the following table:
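The training schedule above can be expressed as a small helper; the decay factor of 0.1 is an assumption, since the text only states that the learning rate is reduced at the 25th and 35th training iterations.

```python
def learning_rate(epoch, base_lr=0.05, decay_epochs=(25, 35), gamma=0.1):
    """Step schedule: start at 0.05 and reduce at the 25th and 35th epochs.
    The decay factor gamma = 0.1 is an assumed value."""
    lr = base_lr
    for e in decay_epochs:
        if epoch >= e:
            lr *= gamma
    return lr

# Optimizer settings stated in the text.
SGD_CONFIG = {"batch_size": 32, "momentum": 0.9,
              "initial_lr": 0.05, "weight_decay": 0.0005}
```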
In order to verify that the designed network reduces the misrecognition of similar human behaviors, the accuracy of each of the 13 human behaviors is counted separately, with the results shown in the following table:
the accuracy of recognizing three human behaviors with high similarity, namely drinking, shaking head and chest pain, still reaches more than 95%, and the designed lightweight multi-scale space-time diagram convolutional network can reduce the false recognition of similar human behaviors.
The results of testing the test set of RGB video images are shown in fig. 8 and 9 below;
The results of recognizing human behavior in the real scene are shown in fig. 10 and fig. 11. It can be seen that the multi-scale space-time graph convolution model constructed by the invention can quickly and effectively recognize human behavior in real scenes, guaranteeing the service quality of the service robot.
In order to reduce the influence of the complex external environment on the working quality of the service robot, the robot is designed to react to a behavior only after receiving the same behavior signal continuously for more than 2 seconds. For dangerous behaviors, the service robot sends alarm information to remind a worker to handle them. Combined with the positioning and obstacle-avoidance technologies of the service robot, when the robot receives a hand-waving action, for example, it can move to the customer to provide service; combined with face recognition technology, guest information can be registered and the guest guided to a designated seat.
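The 2-second confirmation rule can be sketched as a small debouncer; the class name and the clock interface are illustrative, not from the text.

```python
import time

class BehaviorDebouncer:
    """React only after the same behavior signal has been received
    continuously for more than `hold_s` seconds (2 s in the text)."""

    def __init__(self, hold_s=2.0):
        self.hold_s = hold_s
        self.current = None     # behavior currently being held
        self.since = None       # timestamp when it first appeared

    def update(self, behavior, now=None):
        """Feed one recognition result; returns the behavior to react to,
        or None while the signal is still unconfirmed."""
        now = time.monotonic() if now is None else now
        if behavior != self.current:
            self.current, self.since = behavior, now   # signal changed: restart clock
            return None
        if now - self.since > self.hold_s:
            return behavior                            # held long enough: react
        return None
```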
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (7)
1. A behavior recognition method based on a service robot, characterized by comprising the following specific steps:
s1, extracting human body joint point sequences of 13 behavior categories commonly used in the service robot application scene to form a training data set;
s2, preprocessing the training data set, firstly extracting key frames of the joint point sequence, and then optimizing the joint point data by combining with an actual application scene;
s3, for a video shot in a real scene, firstly performing key point estimation by adopting the body_25 human posture estimation model in OpenPose to obtain 25 key point coordinates and confidence coefficients, then filling key point vacancy values in the obtained key point data by a K-nearest-neighbor method, and finally performing weighted optimization on the joint point data in combination with the actual application scene to output 17 main joint points;
s4, constructing a lightweight multi-scale aggregation space-time map convolution deep learning neural network model by using a multi-scale space-time map convolution and time convolution module;
s5, training and testing the data set by using the constructed network model;
s6, identifying human body behaviors in the video image under the real scene to be identified by using the trained model;
and S7, the service robot receives the human behavior recognition result and responds correspondingly.
2. The behavior recognition method based on the service robot as claimed in claim 1, wherein the training data set in step S1 is derived from NTU-RGB + D human behavior data set, and 13 behavior categories are selected: drinking, picking up, throwing away, sitting down, standing up, jumping, shaking head, tumbling, chest pain, waving hands, kicking, hugging and walking, 12324 skeleton files in total.
3. The service robot-based behavior recognition method as claimed in claim 1, wherein the step S2 of performing key frame extraction on the skeleton sequence comprises:
each video segment corresponding to a behavior category in the service robot application scene is sampled at an interval of 30 frames, and 300 frames of data are retained as the training set; videos with fewer than 300 frames are repeatedly sampled from the beginning; the number of persons in the joint data is checked, and only joint data containing a single person is retained for training and validating the model.
4. The service robot-based behavior recognition method according to claim 1, wherein the step S3 specifically comprises:
s31, detecting the human key points in the video image of the real scene by using the OpenPose human key point detection algorithm model, obtaining the horizontal and vertical coordinate values (x, y) of the 25 skeletal joint points by using the body_25 human joint point labeling model, splicing the discrete joint points according to the physical connection of the human joint points to form a human skeleton space topological model, and then splicing the space topological graphs of the frames in time sequence to finally obtain a space-time graph of the human skeleton structure change;
s32, for the case of missed detection of whole-frame data, defining joint points 0, 1 and 8 as the main key points; if any of these three groups of data is missing in a certain frame of the output joint point data corresponding to the video image, judging that the whole frame of data is missed and deleting the joint point data corresponding to that video frame; for the case in which part of the key points of a certain frame are missing, adopting a 2nd-order K-nearest-neighbor method for filling, which requires no training or parameter estimation: the mean of the horizontal and vertical coordinate values (x, y) of the frames before and after that point is directly taken as the supplement.
5. The service robot-based behavior recognition method according to claim 1, wherein the step S4 specifically comprises:
s41, graph convolution calculation process: after obtaining the coordinates of the joint points, the human skeleton is represented as a graph G = (V, E) with the joint points as vertices and the natural connections of the joint points as skeleton edges; the T frame skeleton graphs are arranged in time sequence and the same-position joint points are connected to form a space-time skeleton graph; the node set V = {v_ti | t = 1, …, T; i = 1, …, N} is the set of all joint points in each skeleton graph, wherein N is the number of joints per frame; the edge set E is represented by two subsets, the first subset representing the intra-skeleton connections of each frame, denoted E_S = {v_ti v_tj | (i, j) ∈ H}, wherein H is the set of naturally connected human joints, and the second subset representing the connecting edges of same-position joint points between adjacent frames, denoted E_F = {v_ti v_(t+1)i}, with i as the serial number of the joint point; from the node set V and the edge set E an adjacency matrix A can be obtained, and the graph convolution is calculated as follows:

f_out = Σ_(k=1)^(K_v) W_k f_in (Λ_k^(-1/2) A_k Λ_k^(-1/2))

wherein f_in is the input, f_out is the output, A_k is the adjacency matrix, W_k is a learnable weight, and K_v is the spatial-dimension kernel size;
s42, adaptive graph convolution calculation process: as shown in the following formula, on the basis of A_k, two matrices B_k and C_k are newly added, wherein B_k is a trainable weight and C_k learns a unique graph for each sample:

f_out = Σ_(k=1)^(K_v) W_k f_in (A_k + B_k + C_k);
s43, multi-scale space-time graph convolution calculation process: to better connect the spatial and temporal skeleton information, the k-hop adjacency matrices of the nodes are tiled to form a (τN × τN) block matrix in which each node is directly connected to its corresponding k-hop neighbor nodes on all τ frames of the window, thereby realizing skip connections between nodes; the calculation process is as follows:

[A_(k)]_(i,j) = 1 if d(v_i, v_j) = k, and 0 otherwise;
s44, MS-GCN multi-scale space-time graph convolution module: for the input node information, the 1st to K-th hop adjacency matrices are extracted respectively, and finally the K extracted matrices are spliced together, wherein i is the serial number of the joint point, v_i is the coordinate of the joint point, and d(v_i, v_j) represents the shortest hop distance between nodes v_i and v_j;
s45, MS-TCN temporal dilated convolution module: a 1 × 1 convolution is used to adjust the number of channels of the input information, and a convolution kernel processes the integrated information; the features after convolution are processed in a manner similar to dilated convolution, the extracted features are concatenated, and finally a convolution with stride 2 is added to output the processed features;
s46, lightweight multi-scale space-time graph convolutional network MS-SGTCN_S: in order to increase the robustness of the extracted features, two network branches are designed to perform inference on the input joint point data, wherein the first network branch consists of a convolution module, MS-GCN modules and a fully connected layer, with 4 MS-GCN modules used in the middle to extract multi-scale space-time features, realized with different temporal and spatial sliding windows; the second branch consists of an MS-GCN module and two MS-TCN modules, and a long-range temporal module is adopted to strengthen the network's attention to the contextual change of joint points in the time dimension; the feature information obtained by the two branches is then jointly sent to an MS-TCN module, the features are spliced together through a fully connected layer, and after processing by a softmax classifier the class with the maximum probability is the predicted human behavior; to further improve the accuracy of the algorithm, a dual-stream network is designed to train on the joint point and skeleton sequences respectively, confidence statistics are then performed on the prediction results of the joint-point and skeleton streams, and the human behavior with the higher confidence is taken as the final output prediction value.
6. The behavior recognition method based on the service robot as claimed in claim 5, wherein in step S46, a dual-stream network is designed to train on the joint point and skeleton sequences, confidence statistics are then performed on the prediction results of the joint-point and skeleton streams, and the human behavior with the higher confidence is the final predicted value.
7. The behavior recognition method based on the service robot as claimed in claim 1, wherein in step S7, in order to reduce the influence of the complex external environment on the working quality of the service robot, the robot is designed to react to a behavior only after receiving the same behavior signal continuously for more than 2 seconds; for dangerous behaviors, the service robot sends alarm information to remind a worker to handle them.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210484610.8A CN114582030B (en) | 2022-05-06 | 2022-05-06 | Behavior recognition method based on service robot |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210484610.8A CN114582030B (en) | 2022-05-06 | 2022-05-06 | Behavior recognition method based on service robot |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114582030A true CN114582030A (en) | 2022-06-03 |
CN114582030B CN114582030B (en) | 2022-07-22 |
Family
ID=81769365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210484610.8A Active CN114582030B (en) | 2022-05-06 | 2022-05-06 | Behavior recognition method based on service robot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114582030B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114881179A (en) * | 2022-07-08 | 2022-08-09 | 济南大学 | Intelligent experiment method based on intention understanding |
CN115035596A (en) * | 2022-06-05 | 2022-09-09 | 东北石油大学 | Behavior detection method and apparatus, electronic device, and storage medium |
CN115586834A (en) * | 2022-11-03 | 2023-01-10 | 天津大学温州安全(应急)研究院 | Intelligent cardio-pulmonary resuscitation training system |
CN115810203A (en) * | 2022-12-19 | 2023-03-17 | 天翼爱音乐文化科技有限公司 | Obstacle avoidance identification method, system, electronic equipment and storage medium |
CN116386087A (en) * | 2023-03-31 | 2023-07-04 | 阿里巴巴(中国)有限公司 | Target object processing method and device |
CN116665312A (en) * | 2023-08-02 | 2023-08-29 | 烟台大学 | Man-machine cooperation method based on multi-scale graph convolution neural network |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109064471A (en) * | 2018-07-18 | 2018-12-21 | 中北大学 | A kind of three-dimensional point cloud model dividing method based on skeleton |
CN111652124A (en) * | 2020-06-02 | 2020-09-11 | 电子科技大学 | Construction method of human behavior recognition model based on graph convolution network |
CN112949569A (en) * | 2021-03-25 | 2021-06-11 | 南京邮电大学 | Effective extraction method of human body posture points for tumble analysis |
CN113657349A (en) * | 2021-09-01 | 2021-11-16 | 重庆邮电大学 | Human body behavior identification method based on multi-scale space-time graph convolutional neural network |
WO2022000420A1 (en) * | 2020-07-02 | 2022-01-06 | 浙江大学 | Human body action recognition method, human body action recognition system, and device |
CN114187653A (en) * | 2021-11-16 | 2022-03-15 | 复旦大学 | Behavior identification method based on multi-stream fusion graph convolution network |
CN114220176A (en) * | 2021-12-22 | 2022-03-22 | 南京华苏科技有限公司 | Human behavior recognition method based on deep learning |
CN114399648A (en) * | 2022-01-17 | 2022-04-26 | Oppo广东移动通信有限公司 | Behavior recognition method and apparatus, storage medium, and electronic device |
US20220138536A1 (en) * | 2020-10-29 | 2022-05-05 | Hong Kong Applied Science And Technology Research Institute Co., Ltd | Actional-structural self-attention graph convolutional network for action recognition |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109064471A (en) * | 2018-07-18 | 2018-12-21 | 中北大学 | A kind of three-dimensional point cloud model dividing method based on skeleton |
CN111652124A (en) * | 2020-06-02 | 2020-09-11 | 电子科技大学 | Construction method of human behavior recognition model based on graph convolution network |
WO2022000420A1 (en) * | 2020-07-02 | 2022-01-06 | 浙江大学 | Human body action recognition method, human body action recognition system, and device |
US20220138536A1 (en) * | 2020-10-29 | 2022-05-05 | Hong Kong Applied Science And Technology Research Institute Co., Ltd | Actional-structural self-attention graph convolutional network for action recognition |
CN112949569A (en) * | 2021-03-25 | 2021-06-11 | 南京邮电大学 | Effective extraction method of human body posture points for tumble analysis |
CN113657349A (en) * | 2021-09-01 | 2021-11-16 | 重庆邮电大学 | Human body behavior identification method based on multi-scale space-time graph convolutional neural network |
CN114187653A (en) * | 2021-11-16 | 2022-03-15 | 复旦大学 | Behavior identification method based on multi-stream fusion graph convolution network |
CN114220176A (en) * | 2021-12-22 | 2022-03-22 | 南京华苏科技有限公司 | Human behavior recognition method based on deep learning |
CN114399648A (en) * | 2022-01-17 | 2022-04-26 | Oppo广东移动通信有限公司 | Behavior recognition method and apparatus, storage medium, and electronic device |
Non-Patent Citations (3)
Title |
---|
LEI SHI et al.: "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition", arXiv:1805.07694v3 *
ZIYU LIU et al.: "Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition", IEEE *
ZHENG SHIYU: "Research on Human Action Recognition Based on Adaptive Spatio-Temporal Fusion Graph Convolutional Network", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115035596A (en) * | 2022-06-05 | 2022-09-09 | 东北石油大学 | Behavior detection method and apparatus, electronic device, and storage medium |
CN115035596B (en) * | 2022-06-05 | 2023-09-08 | 东北石油大学 | Behavior detection method and device, electronic equipment and storage medium |
CN114881179A (en) * | 2022-07-08 | 2022-08-09 | 济南大学 | Intelligent experiment method based on intention understanding |
CN114881179B (en) * | 2022-07-08 | 2022-09-06 | 济南大学 | Intelligent experiment method based on intention understanding |
CN115586834A (en) * | 2022-11-03 | 2023-01-10 | 天津大学温州安全(应急)研究院 | Intelligent cardio-pulmonary resuscitation training system |
CN115810203A (en) * | 2022-12-19 | 2023-03-17 | 天翼爱音乐文化科技有限公司 | Obstacle avoidance identification method, system, electronic equipment and storage medium |
CN115810203B (en) * | 2022-12-19 | 2024-05-10 | 天翼爱音乐文化科技有限公司 | Obstacle avoidance recognition method, system, electronic equipment and storage medium |
CN116386087A (en) * | 2023-03-31 | 2023-07-04 | 阿里巴巴(中国)有限公司 | Target object processing method and device |
CN116386087B (en) * | 2023-03-31 | 2024-01-09 | 阿里巴巴(中国)有限公司 | Target object processing method and device |
CN116665312A (en) * | 2023-08-02 | 2023-08-29 | 烟台大学 | Man-machine cooperation method based on multi-scale graph convolution neural network |
CN116665312B (en) * | 2023-08-02 | 2023-10-31 | 烟台大学 | Man-machine cooperation method based on multi-scale graph convolution neural network |
Also Published As
Publication number | Publication date |
---|---|
CN114582030B (en) | 2022-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114582030B (en) | Behavior recognition method based on service robot | |
CN109829436B (en) | Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network | |
CN107463949B (en) | Video action classification processing method and device | |
CN108537743B (en) | Face image enhancement method based on generation countermeasure network | |
CN110472554A (en) | Table tennis action identification method and system based on posture segmentation and crucial point feature | |
CN110472612B (en) | Human behavior recognition method and electronic equipment | |
CN110472604B (en) | Pedestrian and crowd behavior identification method based on video | |
CN110569795A (en) | Image identification method and device and related equipment | |
CN110414432A (en) | Training method, object identifying method and the corresponding device of Object identifying model | |
CN109685037B (en) | Real-time action recognition method and device and electronic equipment | |
CN111274916A (en) | Face recognition method and face recognition device | |
CN107256386A (en) | Human behavior analysis method based on deep learning | |
CN113128424B (en) | Method for identifying action of graph convolution neural network based on attention mechanism | |
CN110070029A (en) | A kind of gait recognition method and device | |
CN110765839B (en) | Multi-channel information fusion and artificial intelligence emotion monitoring method for visible light facial image | |
CN116343330A (en) | Abnormal behavior identification method for infrared-visible light image fusion | |
CN112115775A (en) | Smoking behavior detection method based on computer vision in monitoring scene | |
CN114529984A (en) | Bone action recognition method based on learnable PL-GCN and ECLSTM | |
CN113516005A (en) | Dance action evaluation system based on deep learning and attitude estimation | |
CN113312973A (en) | Method and system for extracting features of gesture recognition key points | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
CN113239885A (en) | Face detection and recognition method and system | |
CN117218709A (en) | Household old man real-time state monitoring method based on time deformable attention mechanism | |
CN111797705A (en) | Action recognition method based on character relation modeling | |
CN113963202A (en) | Skeleton point action recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |