CN114612938A - Dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion - Google Patents

Dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion

Info

Publication number
CN114612938A
Authority
CN
China
Prior art keywords
view
gesture
layer
sequence
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210276784.5A
Other languages
Chinese (zh)
Inventor
刘振宇
李劭晨
段桂芳
谭建荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210276784.5A priority Critical patent/CN114612938A/en
Publication of CN114612938A publication Critical patent/CN114612938A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion, comprising the following steps: segmenting an original dynamic multi-gesture sequence with a sliding-window-based detection method to obtain a plurality of single-gesture sequences; applying a three-dimensional spatial coordinate transformation to each single-gesture sequence to obtain the corresponding multi-view three-dimensional skeleton information; encoding each set of multi-view three-dimensional skeleton information to obtain a single-view total skeleton map for each of the views; inputting the single-view total skeleton map of each view into the corresponding branch convolutional neural network for feature extraction, feeding the extracted features into an aggregation network based on a view attention mechanism, and passing the aggregated features through a flattening layer and a fully connected layer in turn, the fully connected layer outputting the single-gesture classification result. The method can overcome the insufficient use of spatial information, the difficulty in recognizing complex gestures and the poor robustness of traditional single-view gesture recognition methods, and greatly improves recognition accuracy.

Description

Dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion
Technical Field
The invention belongs to the field of human-computer interaction, relates to a dynamic gesture recognition method, and particularly relates to a dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion.
Background
In the field of human-computer interaction, dynamic gestures, as a natural and efficient communication medium, have been applied in many areas such as robot teleoperation, virtual assembly and sign language recognition. As the most important link in intelligent gesture interaction, dynamic gesture recognition has attracted increasing attention from researchers. Existing dynamic gesture recognition methods can be divided, according to their input data, into image-based methods and methods based on hand skeleton data. Image-based methods are susceptible to illumination changes, which makes hand features difficult to extract. Methods based on three-dimensional skeleton data are robust to illumination and background changes and computationally inexpensive, and have therefore become the mainstream research direction.
Researchers in China and abroad have carried out a series of studies on gesture recognition based on three-dimensional skeleton data. Some researchers performed gesture classification using hand-crafted features; however, such features have poor discriminative power, and the resulting recognition rate is low. With the rapid development of deep neural networks, some researchers have used recurrent neural networks (RNN) and long short-term memory units (LSTM) to extract features from three-dimensional skeleton data, achieving better classification results. However, when such networks encounter longer sequences, the vanishing-gradient problem arises and the networks cannot be optimized further. Since convolutional neural networks (CNN) were introduced into image classification, many researchers have adopted CNNs to extract depth features from three-dimensional skeleton data for gesture classification, and certain progress has been made.
However, existing CNN-based gesture recognition methods using three-dimensional skeleton data all take a gesture sequence observed from a single view as input, ignoring the influence of the viewing angle, an important factor, on gesture recognition accuracy. The effect of the viewing angle on the recognition rate is mainly two-fold. On the one hand, observing a gesture sequence from different views captures its spatial structure better, and this spatial information helps describe the gesture sequence more comprehensively; on the other hand, some gestures that are difficult to recognize from one view may be much easier to recognize from another.
Disclosure of Invention
To solve the problems in the background art, the invention provides a dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion. By making full use of multi-view skeleton information, the invention can overcome the insufficient use of spatial information, the difficulty in recognizing complex gestures and the poor robustness of traditional single-view gesture recognition methods. The method is of great significance for improving the accuracy of dynamic gesture recognition, can be easily and effectively deployed in other classification tasks, and has wide application value.
The technical scheme of the invention mainly comprises the following steps:
S1: segmenting an original dynamic multi-gesture sequence with a sliding-window-based detection method to obtain a plurality of equal-length single-gesture sequences;
S2: performing a three-dimensional spatial coordinate transformation on each single-gesture sequence to obtain the corresponding multi-view three-dimensional skeleton information;
S3: encoding the current multi-view three-dimensional skeleton information to obtain a single-view total skeleton map for each of the views, the single-view total skeleton map of each view consisting of an X-coordinate skeleton map, a Y-coordinate skeleton map and a Z-coordinate skeleton map;
S4: inputting the single-view total skeleton map of each view into the corresponding branch convolutional neural network for feature extraction to obtain the corresponding single-view depth features, the single-view depth features together forming the multi-view depth features of the current single-gesture sequence;
S5: inputting the multi-view depth features of the current single-gesture sequence into an aggregation network based on a view attention mechanism for feature aggregation, generating the global features of the current single-gesture sequence;
S6: inputting the global features of the current single-gesture sequence into a flattening layer and a fully connected layer in turn for gesture classification, the fully connected layer outputting the classification result of the current single-gesture sequence;
S7: repeating S2-S6 to classify the remaining single-gesture sequences and obtain the corresponding gesture classification results.
The S1 specifically includes:
S11: performing sliding detection on the original dynamic multi-gesture sequence with a fixed-length sliding detection window, determining the start and end positions of each gesture, locating each gesture according to its start and end positions, and thereby dividing the original dynamic multi-gesture sequence into a plurality of sequences each containing a single gesture, used as the single-gesture sequences;
S12: adjusting the length of each single-gesture sequence with a two-dimensional linear interpolation method so that all single-gesture sequences have equal length.
In S2, each single-gesture sequence is transformed in three-dimensional space according to the following formulas:

$$s^{i}_{t,j} = R_z(\theta_i)\,[x_{t,j},\ y_{t,j},\ z_{t,j}]^{T} = [x^{i}_{t,j},\ y^{i}_{t,j},\ z^{i}_{t,j}]^{T}$$

$$R_z(\theta_i) = \begin{bmatrix} \cos\theta_i & -\sin\theta_i & 0 \\ \sin\theta_i & \cos\theta_i & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

where $s^{i}_{t,j}$ is the skeleton-point coordinate vector of the j-th skeleton point of the t-th frame under the i-th view in the current single-gesture sequence, and $x^{i}_{t,j}$, $y^{i}_{t,j}$, $z^{i}_{t,j}$ are its X, Y and Z coordinates; $i \in [1, N]$, with N the total number of views; $t \in [1, T]$, with T the total length of the current single-gesture sequence; $j \in [1, J]$, with J the total number of skeleton points; the superscript T denotes the transpose operation; $x_{t,j}$, $y_{t,j}$, $z_{t,j}$ are the X, Y and Z coordinates of the j-th skeleton point of the t-th frame in the current single-gesture sequence; $R_z(\theta_i)$ is the Z-axis rotation matrix; and $\theta_i$ is the rotation angle of the current single-gesture sequence about the Z axis for the i-th view.
In S3, the current multi-view three-dimensional skeleton information is split along the view dimension to obtain the three-dimensional skeleton information of each single view; based on each single view's three-dimensional skeleton information, the spatial order of the skeleton points is re-encoded according to the link relationships between the skeleton points to obtain the single-view total skeleton map of that view.
Each branch convolutional neural network in S4 has the same structure, comprising six convolutional layers, a dimension transform layer and four pooling layers.
The input of the branch convolutional neural network is fed into the first convolutional layer, the first convolutional layer is connected to the second convolutional layer, the second convolutional layer is connected to the dimension transform layer, and the dimension transform layer is connected to the fourth pooling layer through the third convolutional layer, the first pooling layer, the fourth convolutional layer, the second pooling layer, the fifth convolutional layer, the third pooling layer and the sixth convolutional layer in sequence; the output of the fourth pooling layer serves as the output of the branch convolutional neural network, which is the single-view depth feature of the current single-gesture sequence.
The multiple branch convolutional neural networks are trained in a parameter-sharing manner.
The aggregation network based on the view attention mechanism in S5 comprises a plurality of convolutional layers, an average pooling layer, a maximum pooling layer and an activation layer. The single-view depth features within the multi-view depth features of the current single-gesture sequence are each input into the corresponding convolutional layer for feature-dimension compression; the dimension-compressed features output by the convolutional layers are concatenated to obtain a mixed feature; the mixed feature is input into the average pooling layer and the maximum pooling layer for attention-weight computation, yielding an average view attention weight and a maximum view attention weight respectively; the two weights are summed element-wise and input into the activation layer, which outputs the view attention weights; finally, the view attention weights and the multi-view depth features of the current single-gesture sequence are fused by vector dot product into the global features, which are output.
In each branch of the multi-branch convolutional neural network, a first dropout layer is further arranged between the third convolutional layer and the first pooling layer, and a second dropout layer is further arranged between the fourth convolutional layer and the second pooling layer.
In each branch of the multi-branch convolutional neural network, a first LeakyReLU activation-function layer is further arranged between the fifth convolutional layer and the third pooling layer, a second LeakyReLU activation-function layer is further arranged between the sixth convolutional layer and the fourth pooling layer, and the output of the fourth pooling layer serves as the output of the branch convolutional neural network.
The invention has the beneficial effects that:
1. The model achieves high classification accuracy: using multi-view three-dimensional skeleton information as input provides rich spatial-structure information about the dynamic gesture. Compared with a single view, the increase in available information amounts to an information augmentation that helps classify gestures more accurately. The designed aggregation network based on the view attention mechanism fuses the multi-view depth features organically and generates more discriminative global information. Tests on a public dataset show that the method achieves higher accuracy than existing methods.
2. The model adapts well: by effectively using the spatial information of each view, both the coarse and the fine operating gestures commonly used in human-computer interaction can be classified accurately. The parameter-sharing mechanism adopted in the multi-branch convolutional neural network effectively reduces the number of parameters, so the model can be deployed on other classification tasks and adapts well.
In summary, by making full use of multi-view skeleton information, the proposed method can overcome the insufficient use of spatial information, the difficulty in recognizing complex gestures and the poor robustness of traditional single-view gesture recognition methods. The method can be easily and effectively deployed in other classification tasks and has wide application value.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a general framework of the method of the present invention;
FIG. 3 is a diagram of a branched convolutional neural network for extracting single-view features according to the present invention;
FIG. 4 is a diagram of a view attention mechanism based aggregation network architecture in accordance with the present invention;
FIG. 5 is a graph illustrating network convergence analysis during training according to the present invention;
FIG. 6 is a diagram illustrating confusion matrices for gesture classification results according to the present invention.
Detailed Description
The invention is further described in detail below with reference to the accompanying drawings and specific experiments performed on the public SHREC'17 Track Dataset.
The embodiment of the invention uses the public SHREC'17 Track Dataset for model training and testing. The dataset was acquired with an Intel RealSense depth camera and contains 2800 gesture sequences from 28 participants. It contains 28 gesture categories in total, divided into two major groups, single-finger gestures and whole-hand gestures, each group containing the same 14 common gestures. In this embodiment, the training set contains 1960 gesture sequences and the test set contains 840 gesture sequences.
As shown in fig. 1 and 2, the present invention includes the following steps:
S1: an original dynamic multi-gesture sequence is segmented with a sliding-window-based detection method to obtain a plurality of equal-length single-gesture sequences.
S1 specifically includes:
S11: sliding detection is performed on the original dynamic multi-gesture sequence with a fixed-length sliding detection window, the start and end positions of each gesture are determined, each gesture is located according to its start and end positions, and the original dynamic multi-gesture sequence is thereby divided into a plurality of sequences each containing a single gesture, used as the single-gesture sequences, which facilitates the subsequent recognition of individual gestures.
S12: to address the different numbers of frames of the gesture sequences in the original dataset, the length of each single-gesture sequence is adjusted with a two-dimensional linear interpolation method so that all single-gesture sequences have equal length, which makes feature extraction with a convolutional neural network convenient. In this embodiment, each single-gesture sequence has a length of 32 frames.
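For illustration, the following minimal Python sketch performs the two sub-steps on a skeleton sequence stored as a NumPy array of shape (frames, joints, 3). The motion-energy start/end criterion, the window length and the threshold are assumptions introduced for this sketch; the embodiment only specifies a fixed-length sliding detection window and two-dimensional linear interpolation to a common length of 32 frames.

```python
import numpy as np

def segment_gestures(seq, win=16, motion_thresh=0.01):
    """Split a multi-gesture skeleton sequence (frames, joints, 3) into
    single-gesture sub-sequences using a fixed-length sliding window (S11).

    The start/end criterion used here (windowed mean inter-frame joint displacement
    above `motion_thresh`) is an assumption; the embodiment only states that a
    fixed-length sliding window locates each gesture's start and end positions."""
    motion = np.linalg.norm(np.diff(seq, axis=0), axis=-1).mean(axis=-1)   # per-frame motion energy
    active = np.convolve(motion, np.ones(win) / win, mode="same") > motion_thresh

    gestures, start = [], None
    for f, on in enumerate(active):
        if on and start is None:
            start = f                          # gesture start detected
        elif not on and start is not None:
            gestures.append(seq[start:f + 1])  # gesture end detected
            start = None
    if start is not None:                      # gesture still active at the end of the sequence
        gestures.append(seq[start:])
    return gestures

def resample_length(gesture, target_len=32):
    """Linearly interpolate a (frames, joints, 3) gesture to a fixed length (S12)."""
    frames, joints, _ = gesture.shape
    src = np.linspace(0.0, 1.0, frames)
    dst = np.linspace(0.0, 1.0, target_len)
    flat = gesture.reshape(frames, -1)         # interpolate every coordinate channel over time
    out = np.stack([np.interp(dst, src, flat[:, c]) for c in range(flat.shape[1])], axis=1)
    return out.reshape(target_len, joints, 3)
```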
S2: in order to avoid the influence of the initial coordinate position on the gesture recognition, coordinate system conversion is performed on the coordinate data. The origin of coordinates of each sequence is set at the wrist joint point of the first frame.
Then, by adopting a method of annular uniform arrangement, N-8 virtual cameras are arranged around the gesture sequence, the interval angle between each two cameras is 45 degrees, and the spatial structure of the gesture sequence is observed from multiple angles.
Since the observation of the gesture sequence from multiple angles is equivalent to the direct rotation of the gesture sequence around a coordinate axis, the three-dimensional space coordinate transformation of each single-gesture sequence is carried out to obtain the corresponding multi-view three-dimensional skeleton information;
In S2, each single-gesture sequence is transformed in three-dimensional space according to the following formulas:

$$s^{i}_{t,j} = R_z(\theta_i)\,[x_{t,j},\ y_{t,j},\ z_{t,j}]^{T} = [x^{i}_{t,j},\ y^{i}_{t,j},\ z^{i}_{t,j}]^{T}$$

$$R_z(\theta_i) = \begin{bmatrix} \cos\theta_i & -\sin\theta_i & 0 \\ \sin\theta_i & \cos\theta_i & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

where $s^{i}_{t,j}$ is the skeleton-point coordinate vector of the j-th skeleton point of the t-th frame under the i-th view in the current single-gesture sequence, and $x^{i}_{t,j}$, $y^{i}_{t,j}$, $z^{i}_{t,j}$ are its X, Y and Z coordinates; $i \in [1, N]$, with N the total number of views; $t \in [1, T]$, with T the total length of the current single-gesture sequence; $j \in [1, J]$, with J the total number of skeleton points; the superscript T denotes the transpose operation; $x_{t,j}$, $y_{t,j}$, $z_{t,j}$ are the X, Y and Z coordinates of the j-th skeleton point of the t-th frame in the current single-gesture sequence; $R_z(\theta_i)$ is the Z-axis rotation matrix; and $\theta_i$ is the rotation angle of the current single-gesture sequence about the Z axis for the i-th view. This is equivalent to arranging a plurality of virtual cameras around the gesture sequence, so that its spatial structure is observed from multiple views. The X axis represents the motion direction of the gesture, the Y axis represents the viewing direction of the camera, and the Z axis represents the vertically upward direction perpendicular to the X-Y plane.
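A minimal sketch of this transformation follows, assuming the N = 8 virtual views spaced 45 degrees apart used in this embodiment; the wrist-joint index used for the origin shift depends on the dataset's skeleton definition and is an assumption here.

```python
import numpy as np

def normalize_origin(gesture, wrist_idx=0):
    """Shift the coordinate origin to the wrist joint of the first frame;
    the wrist joint index is dataset-dependent (0 is an assumption)."""
    return gesture - gesture[0, wrist_idx]

def multi_view_skeletons(gesture, num_views=8, step_deg=45.0):
    """Rotate a single-gesture sequence (T, J, 3) about the Z axis to obtain
    N virtual views, spaced 45 degrees apart in this embodiment."""
    views = []
    for i in range(num_views):
        theta = np.deg2rad(step_deg * i)       # rotation angle of the i-th virtual view
        rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0,            0.0,           1.0]])
        views.append(gesture @ rz.T)           # apply R_z to every joint of every frame
    return np.stack(views)                     # multi-view skeleton data, shape (N, T, J, 3)
```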
S3: the current multi-view three-dimensional skeleton information is encoded to obtain a single-view total skeleton map for each of the views, the single-view total skeleton map of each view consisting of an X-coordinate skeleton map, a Y-coordinate skeleton map and a Z-coordinate skeleton map. The dimensions of the single-view total skeleton map for each view are T × J × 3, where T and J are the total length of the sequence and the total number of skeleton points, respectively.
In S3, the current multi-view three-dimensional skeleton information is split along the view dimension to obtain the three-dimensional skeleton information of each single view; based on each single view's three-dimensional skeleton information, the spatial order of the skeleton points is re-encoded according to the link relationships between the skeleton points to obtain the single-view total skeleton map of that view.
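In code, the encoding amounts to arranging each view's coordinates as an image-like T × J × 3 tensor; in the sketch below the link-based joint re-ordering is left as an identity placeholder, because the exact order depends on the skeleton definition of the dataset.

```python
import numpy as np

# Placeholder joint order; the actual re-ordering follows the hand's link
# (parent-child) relationships and depends on the dataset's skeleton definition.
LINK_ORDER = list(range(22))                   # SHREC'17 provides 22 hand joints

def encode_skeleton_map(view_seq, order=LINK_ORDER):
    """Encode one view's sequence (T, J, 3) as a T x J x 3 single-view total
    skeleton map: three coordinate channels (X, Y, Z), with the joints arranged
    along the spatial axis in link order."""
    return view_seq[:, order, :].astype(np.float32)
```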
S4: the single-view total skeleton map of each view is input into the corresponding branch convolutional neural network for feature extraction to obtain the corresponding single-view depth features, which together form the multi-view depth features of the current single-gesture sequence.
As shown in FIG. 3, each branch convolutional neural network in S4 has the same structure, comprising six convolutional layers (Conv), a dimension transform layer and four pooling layers (Pooling).
The input of the branch convolutional neural network is fed into the first convolutional layer, which is connected to the second convolutional layer; the kernel sizes of the first and second convolutional layers are set to 1 × 1 and 3 × 1 respectively, where the trailing "1" indicates that the kernel spans a single position along the skeletal-joint dimension. The second convolutional layer is connected to the dimension transform layer, and the dimension transform layer is connected to the fourth pooling layer through the third convolutional layer, the first pooling layer, the fourth convolutional layer, the second pooling layer, the fifth convolutional layer, the third pooling layer and the sixth convolutional layer in sequence; the output of the fourth pooling layer serves as the output of the branch convolutional neural network, namely the single-view depth feature of the current single-gesture sequence. The first and second convolutional layers extract shallow features; the dimension transform layer exchanges the channel dimension and the skeletal-joint dimension through a matrix transpose; the third, fourth, fifth and sixth convolutional layers extract deep information; and the four pooling layers guarantee feature invariance.
In each branch of the multi-branch convolutional neural network, a first dropout layer is further arranged between the third convolutional layer and the first pooling layer, and a second dropout layer is further arranged between the fourth convolutional layer and the second pooling layer.
In each branch of the multi-branch convolutional neural network, a first LeakyReLU activation-function layer is further arranged between the fifth convolutional layer and the third pooling layer, a second LeakyReLU activation-function layer is further arranged between the sixth convolutional layer and the fourth pooling layer, and the output of the fourth pooling layer serves as the output of the branch convolutional neural network.
The multiple branch convolutional neural networks are trained in a parameter-sharing manner, which helps reduce the number of parameters and improves the training efficiency of the network.
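A PyTorch sketch of one branch network following the layer order described above is given below; the channel widths, pooling sizes, dropout rate and LeakyReLU slope are not specified in the patent and are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BranchCNN(nn.Module):
    """Sketch of one branch network: six convolutions, a dimension transform
    (transpose) layer, four pooling layers, two dropout layers and two LeakyReLU
    layers, in the order described above. Channel widths, pooling sizes, the
    dropout rate and the LeakyReLU slope are assumptions."""

    def __init__(self, num_joints=22, dropout=0.3):
        super().__init__()
        # Shallow feature extraction on (B, 3, T, J): kernels 1x1 and 3x1,
        # i.e. spanning a single position along the skeletal-joint dimension.
        self.conv1 = nn.Conv2d(3, 16, kernel_size=(1, 1))
        self.conv2 = nn.Conv2d(16, 16, kernel_size=(3, 1), padding=(1, 0))
        # Deep feature extraction after the channel and joint dimensions are
        # swapped, so the joint axis becomes the channel axis: (B, J, T, 16).
        self.conv3 = nn.Conv2d(num_joints, 32, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv6 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.drop1, self.drop2 = nn.Dropout2d(dropout), nn.Dropout2d(dropout)
        self.pool1, self.pool2, self.pool3, self.pool4 = (nn.MaxPool2d(2) for _ in range(4))
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):                          # x: (B, 3, T, J) single-view skeleton map
        x = self.conv2(self.conv1(x))              # shallow features, (B, 16, T, J)
        x = x.permute(0, 3, 2, 1)                  # dimension transform layer (matrix transpose)
        x = self.pool1(self.drop1(self.conv3(x)))
        x = self.pool2(self.drop2(self.conv4(x)))
        x = self.pool3(self.act(self.conv5(x)))
        x = self.pool4(self.act(self.conv6(x)))    # (B, 256, 2, 1) for a 32-frame, 22-joint input
        return x                                   # single-view depth feature
```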
S5: the multi-view depth features of the current single-gesture sequence are input into an aggregation network based on a view attention mechanism for feature aggregation, generating the global features of the current single-gesture sequence.
as shown in fig. 4, the view attention mechanism-based aggregation network in S5 includes a plurality of convolution layers, an average pooling layer, a maximum pooling layer, and an active layer, the convolution kernel of the convolution layers is 1 x 1, a plurality of single-view depth features in the multi-view depth features of the current single-gesture sequence are respectively input into corresponding convolution layers for feature dimension compression, the dimension compression features output by each convolution layer are spliced to obtain mixed features, the mixed features are respectively input into an average pooling layer and a maximum pooling layer for attention weight calculation to respectively obtain an average view attention weight and a maximum view attention weight, the average view attention weight and the maximum view attention weight are subjected to Element Summation (Element-wise Summation) and then input into an active layer, and the active layer outputs a view attention weight, and the view attention weight and the multi-view depth feature of the current single-gesture sequence are fused into a global feature in a vector dot product mode and output.
S6: the global features of the current single-gesture sequence are input into the flattening layer and the fully connected layer in turn for gesture classification, and the fully connected layer outputs the classification result of the current single-gesture sequence.
S7: S2-S6 are repeated to classify the remaining single-gesture sequences and obtain the corresponding gesture classification results.
In the training stage of the model, PyTorch is used as the development platform, and an NVIDIA 1060Ti GPU is used for parallel computation to accelerate training. The training set is divided into batches with a batch size of 64. During training, the cross-entropy function is used to compute the model loss, and the model parameters are optimized with the Adam algorithm with a learning rate of 0.001, for a total of 50 training epochs.
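Putting the pieces together, the sketch below assembles the shared branch networks, the view-attention fusion, the flattening layer and the fully connected classifier (S4-S6), and reproduces the stated training configuration (cross-entropy loss, Adam with learning rate 0.001, batch size 64, 50 epochs). It re-uses the BranchCNN and ViewAttentionFusion sketches above; the classifier input size, the data loader and the use of a single shared branch instance to realize parameter sharing are assumptions.

```python
import torch
import torch.nn as nn

class MultiViewGestureNet(nn.Module):
    """End-to-end sketch: weight-shared branch CNNs (S4), view-attention fusion (S5),
    flattening layer and fully connected classifier (S6). Re-uses the BranchCNN and
    ViewAttentionFusion sketches above; the classifier input size assumes the
    (256, 2, 1) branch output obtained for 32-frame, 22-joint skeleton maps."""

    def __init__(self, num_views=8, num_classes=28, num_joints=22):
        super().__init__()
        self.branch = BranchCNN(num_joints)            # one shared instance = parameter sharing
        self.fusion = ViewAttentionFusion(num_views, channels=256)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(256 * 2 * 1, num_classes)

    def forward(self, x):                              # x: (B, N, 3, T, J) multi-view skeleton maps
        feats = [self.branch(x[:, i]) for i in range(x.size(1))]
        return self.fc(self.flatten(self.fusion(feats)))

def train(model, train_loader, device="cuda", epochs=50, lr=0.001):
    """Training loop with the stated configuration: cross-entropy loss, Adam,
    learning rate 0.001, 50 epochs; `train_loader` is assumed to yield batches of
    64 multi-view skeleton-map tensors (B, N, 3, T, J) and integer class labels."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for maps, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(maps.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
```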
In the testing stage, 840 gesture sequences in the testing set are input into the trained model for testing, so that a gesture classification result and recognition accuracy are obtained.
To verify the effectiveness of the method, three different multi-view feature-fusion methods are compared. The first is element-wise addition, which adds the feature values of the individual view features and has the advantage of keeping the spatial position and the feature dimension of the features unchanged. The second is feature splicing (concatenation), which concatenates the multi-view features along the channel dimension and can add feature information while keeping the spatial-dimension features unchanged. The third is the max pooling (max_pooling) method, which takes the maximum value over the corresponding positions of the view features as the final feature value.
As shown in FIG. 5, the convergence of the networks during training is analyzed by comparing the proposed method with the three methods above. As can be seen from FIG. 5, method two improves on method one in both convergence speed and recognition accuracy, because feature concatenation preserves the original information of each view. Method three improves further on the first two, showing that max pooling better highlights the information at each position of the feature. Compared with all three methods, the proposed method achieves higher accuracy, faster convergence and less fluctuation, proving that it extracts the useful information in the multi-view features more accurately and is more robust.
To further verify the superiority of the method over traditional single-view approaches, several methods that describe gestures with hand-crafted features or extract gesture information with a recurrent neural network (RNN) or a convolutional neural network (CNN) are compared, including HON4D, HOG2, SoCJ, RNN and CNN. As can be seen from Table 1, the proposed method makes full use of the skeleton information of multiple views and clearly outperforms these methods in gesture recognition, which further verifies its effectiveness.
TABLE 1 Comparison of gesture recognition accuracy of different methods
(Table 1 appears as an image in the original publication; its numerical values are not reproduced here.)
As shown in FIG. 6, the recognition result of each gesture is displayed as a confusion matrix. The gestures recognized by the method comprise 14 common gestures: 1. Grab (G); 2. Tap (T); 3. Expand (E); 4. Pinch (P); 5. Rotation clockwise (R-CW); 6. Rotation counter-clockwise (R-CCW); 7. Swipe right (S-R); 8. Swipe left (S-L); 9. Swipe up (S-U); 10. Swipe down (S-D); 11. Swipe X (S-X); 12. Swipe V (S-V); 13. Swipe + (S-+); 14. Shake (Sh). For 11 of the gestures the recognition rate of the method exceeds 90%, and over all gestures the average recognition rate reaches 93.62%.
Using the proposed method, the SHREC'17 Track Dataset can be classified accurately. Meanwhile, by making full use of multi-view skeleton information, the dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion proposed by the invention can overcome the insufficient use of spatial information, the difficulty in recognizing complex gestures and the poor robustness of traditional single-view gesture recognition methods. The method can be easily and effectively deployed in other classification tasks and has wide application value.
The above example only presents the results of the invention on this dataset, but the specific implementation of the invention is not limited to this example. Any alternative that achieves a similar effect according to the principles and concepts of the invention shall be considered to fall within the protection scope of the invention.

Claims (9)

1. A dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion is characterized by comprising the following steps:
S1: segmenting an original dynamic multi-gesture sequence with a sliding-window-based detection method to obtain a plurality of equal-length single-gesture sequences;
S2: performing a three-dimensional spatial coordinate transformation on each single-gesture sequence to obtain the corresponding multi-view three-dimensional skeleton information;
S3: encoding the current multi-view three-dimensional skeleton information to obtain a single-view total skeleton map for each of a plurality of views, wherein the single-view total skeleton map of each view consists of an X-coordinate skeleton map, a Y-coordinate skeleton map and a Z-coordinate skeleton map;
S4: inputting the single-view total skeleton map of each view into the corresponding branch convolutional neural network for feature extraction to obtain the corresponding single-view depth features, the single-view depth features together forming the multi-view depth features of the current single-gesture sequence;
S5: inputting the multi-view depth features of the current single-gesture sequence into an aggregation network based on a view attention mechanism for feature aggregation to generate the global features of the current single-gesture sequence;
S6: inputting the global features of the current single-gesture sequence into a flattening layer and a fully connected layer in turn for gesture classification, the fully connected layer outputting the classification result of the current single-gesture sequence;
S7: repeating S2-S6 to classify the remaining single-gesture sequences and obtain the corresponding gesture classification results.
2. The method for dynamic gesture recognition based on multi-view three-dimensional skeletal information fusion according to claim 1, wherein the S1 specifically comprises:
S11: performing sliding detection on the original dynamic multi-gesture sequence with a fixed-length sliding detection window, determining the start and end positions of each gesture, locating each gesture according to its start and end positions, and thereby dividing the original dynamic multi-gesture sequence into a plurality of sequences each containing a single gesture, used as the single-gesture sequences;
S12: adjusting the length of each single-gesture sequence with a two-dimensional linear interpolation method so that all single-gesture sequences have equal length.
3. The method for dynamic gesture recognition based on multi-view three-dimensional skeleton information fusion of claim 1, wherein in S2, each single gesture sequence is transformed by the following formula:
$$s^{i}_{t,j} = R_z(\theta_i)\,[x_{t,j},\ y_{t,j},\ z_{t,j}]^{T} = [x^{i}_{t,j},\ y^{i}_{t,j},\ z^{i}_{t,j}]^{T}$$

$$R_z(\theta_i) = \begin{bmatrix} \cos\theta_i & -\sin\theta_i & 0 \\ \sin\theta_i & \cos\theta_i & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

wherein $s^{i}_{t,j}$ denotes the skeleton-point coordinate vector of the j-th skeleton point of the t-th frame under the i-th view in the current single-gesture sequence, and $x^{i}_{t,j}$, $y^{i}_{t,j}$ and $z^{i}_{t,j}$ denote its X coordinate, Y coordinate and Z coordinate; $i \in [1, N]$, N denoting the total number of views; $t \in [1, T]$, T denoting the total length of the current single-gesture sequence; $j \in [1, J]$, J denoting the total number of skeleton points; the superscript T denotes the transpose operation; $x_{t,j}$, $y_{t,j}$ and $z_{t,j}$ denote the X coordinate, Y coordinate and Z coordinate of the j-th skeleton point of the t-th frame in the current single-gesture sequence; $R_z(\theta_i)$ denotes the Z-axis rotation matrix; and $\theta_i$ denotes the rotation angle of the current single-gesture sequence about the Z axis for the i-th view.
4. The dynamic gesture recognition method based on multi-view three-dimensional bone information fusion of claim 1, wherein in S3, the current multi-view three-dimensional bone information is split in view dimension to obtain a plurality of single-view three-dimensional bone information, and based on the single-view three-dimensional bone information, the spatial order of the bone points is re-encoded according to the link relationship between the bone points to obtain a single-view total bone map of each view.
5. The method of claim 1, wherein each branch convolutional neural network in S4 has the same structure and comprises six convolutional layers, a dimension transform layer and four pooling layers,
the input of the branch convolutional neural network is fed into the first convolutional layer, the first convolutional layer is connected to the second convolutional layer, the second convolutional layer is connected to the dimension transform layer, the dimension transform layer is connected to the fourth pooling layer through the third convolutional layer, the first pooling layer, the fourth convolutional layer, the second pooling layer, the fifth convolutional layer, the third pooling layer and the sixth convolutional layer in sequence, the output of the fourth pooling layer serves as the output of the branch convolutional neural network, and the branch convolutional neural network outputs the single-view depth feature of the current single-gesture sequence.
6. The method for dynamic gesture recognition based on multi-view three-dimensional skeletal information fusion of claim 1, wherein the multi-branch convolutional neural networks are trained by parameter sharing.
7. The method according to claim 1, wherein the aggregation network based on the view attention mechanism in S5 comprises a plurality of convolutional layers, an average pooling layer, a maximum pooling layer and an activation layer, wherein the single-view depth features within the multi-view depth features of the current single-gesture sequence are each input into the corresponding convolutional layer for feature-dimension compression, the dimension-compressed features output by the convolutional layers are concatenated to obtain a mixed feature, the mixed feature is input into the average pooling layer and the maximum pooling layer for attention-weight computation to obtain an average view attention weight and a maximum view attention weight respectively, the average view attention weight and the maximum view attention weight are summed element-wise and input into the activation layer, the activation layer outputs the view attention weights, and the view attention weights and the multi-view depth features of the current single-gesture sequence are fused by vector dot product into the global features, which are output.
8. The dynamic gesture recognition method based on multi-view three-dimensional bone information fusion of claim 5, wherein in each multi-branch convolutional neural network, a first dropout layer is further arranged between a third convolutional layer and a first pooling layer, and a second dropout layer is further arranged between a fourth convolutional layer and a second pooling layer.
9. The dynamic gesture recognition method based on the multi-view three-dimensional skeletal information fusion of claim 5, characterized in that, in each multi-branch convolutional neural network, a first LeakyRelu activation function layer is further arranged between a fifth convolutional layer and a third pooling layer, a second LeakyRelu activation function layer is further arranged between a sixth convolutional layer and a fourth pooling layer, and the output of the fourth pooling layer is used as the output of the multi-branch convolutional neural network.
CN202210276784.5A 2022-03-21 2022-03-21 Dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion Pending CN114612938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210276784.5A CN114612938A (en) 2022-03-21 2022-03-21 Dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210276784.5A CN114612938A (en) 2022-03-21 2022-03-21 Dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion

Publications (1)

Publication Number Publication Date
CN114612938A true CN114612938A (en) 2022-06-10

Family

ID=81865193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210276784.5A Pending CN114612938A (en) 2022-03-21 2022-03-21 Dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion

Country Status (1)

Country Link
CN (1) CN114612938A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965788A (en) * 2023-01-12 2023-04-14 黑龙江工程学院 Point cloud semantic segmentation method based on multi-view image structural feature attention convolution
CN116152931A (en) * 2023-04-23 2023-05-23 深圳未来立体教育科技有限公司 Gesture recognition method and VR system
CN116152931B (en) * 2023-04-23 2023-07-07 深圳未来立体教育科技有限公司 Gesture recognition method and VR system

Similar Documents

Publication Publication Date Title
CN111339903B (en) Multi-person human body posture estimation method
Wang et al. SaliencyGAN: Deep learning semisupervised salient object detection in the fog of IoT
CN109902583B (en) Skeleton gesture recognition method based on bidirectional independent circulation neural network
CN108734194B (en) Virtual reality-oriented single-depth-map-based human body joint point identification method
CN110674741B (en) Gesture recognition method in machine vision based on double-channel feature fusion
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
CN114612938A (en) Dynamic gesture recognition method based on multi-view three-dimensional skeleton information fusion
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
CN113159232A (en) Three-dimensional target classification and segmentation method
CN110334584B (en) Gesture recognition method based on regional full convolution network
CN107067410B (en) Manifold regularization related filtering target tracking method based on augmented samples
CN112232184B (en) Multi-angle face recognition method based on deep learning and space conversion network
CN110490959B (en) Three-dimensional image processing method and device, virtual image generating method and electronic equipment
CN110223382A (en) Single-frame images free view-point method for reconstructing three-dimensional model based on deep learning
CN113269089A (en) Real-time gesture recognition method and system based on deep learning
Teng et al. Generative robotic grasping using depthwise separable convolution
Lee et al. Connectivity-based convolutional neural network for classifying point clouds
KR100956747B1 (en) Computer Architecture Combining Neural Network and Parallel Processor, and Processing Method Using It
Ling et al. Research on gesture recognition based on YOLOv5
Hu et al. Object pose estimation for robotic grasping based on multi-view keypoint detection
Chen et al. Pointformer: A dual perception attention-based network for point cloud classification
CN114782992A (en) Super-joint and multi-mode network and behavior identification method thereof
Zhil et al. One-shot Learning Classification and Recognition of Gesture Expression From the Egocentric Viewpoint in Intelligent Human-computer Interaction [J]
CN113536926A (en) Human body action recognition method based on distance vector and multi-angle self-adaptive network
Song et al. Spatial-aware dynamic lightweight self-supervised monocular depth estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination