CN117152838A - Gesture recognition method based on multi-core dynamic attention mechanism - Google Patents

Gesture recognition method based on multi-core dynamic attention mechanism Download PDF

Info

Publication number
CN117152838A
Authority
CN
China
Prior art keywords
gesture
depth
rgb
gestures
gesture recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311098247.7A
Other languages
Chinese (zh)
Inventor
齐静
马俐
崔振超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University
Original Assignee
Hebei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University filed Critical Hebei University
Priority to CN202311098247.7A priority Critical patent/CN117152838A/en
Publication of CN117152838A publication Critical patent/CN117152838A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a gesture recognition method based on a multi-core dynamic attention mechanism, which comprises the following steps: S1, constructing a gesture recognition model; S2, acquiring RGB images and depth images of gestures; S3, extracting gesture features; S4, detecting the position of the gesture; S5, recognizing the gesture; and S6, sending the gesture recognition result to the robot terminal in a message format. Multi-modal feature extraction with the multi-core dynamic attention mechanism allows the gesture features of the RGB image and of the depth image to be better extracted and fused, yielding a better gesture recognition result. With this method, static gestures captured by a RealSense depth camera can be recognized in real time, and a person can control the operation of a mobile manipulator robot in real time through gestures. The method improves gesture recognition performance, enhances the interaction experience, and is of positive significance for advancing human-computer interaction, virtual reality, smart home and related fields.

Description

Gesture recognition method based on multi-core dynamic attention mechanism
Technical Field
The invention relates to a target detection method, in particular to a gesture recognition method based on a multi-core dynamic attention mechanism.
Background
Gesture recognition technology is widely used in human-computer interaction, virtual reality, smart home and other fields. Through gesture recognition, a user can interact with a device using natural hand movements, providing a more intuitive and convenient mode of operation and control. Gesture recognition involves computer vision, pattern recognition, machine learning and related technical fields. With the growing use of mobile robots, gesture recognition, as a natural and intuitive interaction mode, has been widely applied to mobile robot control. Gesture recognition technology translates human motion instructions into control commands for a robot by analyzing and understanding the morphology and motion information of human gestures. Traditional gesture recognition methods are mainly based on image processing and machine learning algorithms, but they are limited by the diversity and complexity of gestures and have certain shortcomings in real-time performance and accuracy.
Conventional monocular gesture recognition methods generally comprise the following steps: acquisition of RGB images containing gestures, hand detection and segmentation in the RGB images, gesture feature extraction, and gesture recognition. This type of method has certain limitations. First, an RGB image provides only two-dimensional information and lacks perception of hand depth, so it is easily affected by factors such as illumination and occlusion. Second, the recognition of complex gestures, such as hand rotation and fine movements, is poor.
In multi-modal learning, conventional modality fusion often uses a static fusion approach. However, the contribution of each modality may change with the scene and the content; static fusion generally uses fixed weights and cannot adapt to these changes, which reduces the accuracy and robustness of the gesture recognition model and causes information loss. Complex correlations and dependencies may exist between different modalities, and traditional static fusion methods may capture only part of these relationships, which limits the representational capability of the gesture recognition model. In addition, multi-modal data is affected by many factors such as illumination, viewing angle and object size. Traditional fusion methods may not accommodate these complex and varying influences, causing the performance of gesture recognition models to fluctuate across environments and scenes.
Disclosure of Invention
The invention aims to provide a gesture recognition method based on a multi-core dynamic attention mechanism, so as to solve the problems that existing recognition methods are easily affected by interference and recognize complex gestures poorly.
The purpose of the invention is realized in the following way:
A gesture recognition method based on a multi-core dynamic attention mechanism comprises the following steps:
S1, constructing a gesture recognition model: a parallel dual-branch YOLOv5 network is used as the gesture recognition model; the YOLOv5 network comprises a backbone network for extracting gesture features and a detection head for predicting the position and the category of the gesture.
S2, acquiring RGB images and depth images of gestures: a depth camera is used to acquire RGB images and depth images of the operation control gestures performed by the controller.
S3, extracting gesture features: using the multi-core dynamic attention mechanism, the RGB image and the depth image of the gesture are input into the gesture recognition model through forward propagation, image features are extracted from the RGB image and from the depth image respectively, and the image features extracted from the RGB image are fused with those extracted from the depth image; the extracted image features contain the semantic information of the gesture.
S4, detecting the position of the gesture: the gesture in the extracted image features is predicted by the detection head of the gesture recognition model to obtain the position of the gesture frame, the category of the gesture frame and the confidence of the gesture frame.
S5, recognizing the gesture: the gesture frames are screened according to the obtained positions and confidences, overlapping gesture frames are eliminated with a non-maximum suppression (NMS) algorithm, and finally gesture frames are output whose recognition results include the position of the gesture, the category of the gesture and the confidence of the gesture.
S6, sending the recognition result of the gesture to the robot terminal in a message format.
Further, the multi-core dynamic attention mechanism of the gesture recognition model in step S3 works as follows: the multi-core dynamic attention convolution dynamically adjusts the convolution weights and bias parameters by using attention weights to compute a weighted average of the convolution parameters; the attention weights of the different modalities are calculated by an SE module, and these weights are then used to compute a weighted sum of the features of the different modalities to obtain the fused features.
In step S3, the multi-core dynamic attention mechanism is used to extract the image features of the gesture, so the RGB image and the depth image containing the gesture can be exploited effectively, and the weights are adjusted automatically during gesture recognition, allowing the feature information of each gesture to be better mined and used. By extracting and fusing features from the RGB image and the depth image, gestures can be described more comprehensively and accurately, thereby achieving accurate gesture recognition.
Further, the specific working mode of the multi-core dynamic attention mechanism of the gesture recognition model is as follows:
S3-1-1 K different weights are generated, each corresponding to one convolution kernel; the convolution of the input with each convolution kernel is expressed as:
output[k] = Conv2d(x, W[k]) + b[k]
where x is the input feature map, W[k] is the kth convolution kernel, b[k] is the kth bias, and output[k] is the output feature map obtained by convolving the input feature map x with the kth convolution kernel W[k] and adding the bias b[k].
S3-1-2 The output of each convolution kernel is multiplied by the corresponding attention weight.
S3-1-3 All weighted outputs are accumulated to form the final output:
output = Σ_{k=1..K} π[k]·output[k]
where π[k] is the kth weight obtained by the attention mechanism.
Further, the feature fusion of the gesture recognition model in step S3 works in the following manner:
S3-2-1 The RGB image of the gesture obtained by the gesture recognition model is RGB_in ∈ R^(C×H×W), and the depth image of the gesture obtained by the gesture recognition model is Depth_in ∈ R^(C×H×W). The RGB image and the depth image each pass through a gated convolution layer:
RGB′_in = σ(Conv1×1(RGB_in))·RGB_in
Depth′_in = σ(Conv1×1(Depth_in))·Depth_in
where σ denotes the activation function and Conv1×1(x) denotes a 1×1 convolution applied to the input x.
After the gating operation, adaptive average pooling (AAP) is applied to generate a cross-modal descriptor X = (X_1, ..., X_k, ..., X_2C):
X = AAP(RGB′_in || Depth′_in)
where || denotes channel-wise concatenation.
S3-2-2 The cross-modal attention vectors for the RGB and depth inputs are learned by:
W_rgb = σ(Conv1×1(ReLU(DM(X))))
W_depth = σ(Conv1×1(ReLU(DM(X))))
where W_rgb is the weight of the RGB gesture features, W_depth is the weight of the depth gesture features, DM(·) denotes the proposed multi-core dynamic attention module, and Conv1×1(x) denotes a 1×1 convolution.
S3-2-3 The gated gesture feature maps RGB′_in and Depth′_in are multiplied by the channel weights W_rgb and W_depth, respectively, to adjust or enhance the gesture features:
RGB_f = W_rgb·RGB′_in
Depth_f = W_depth·Depth′_in
S3-2-4 RGB_f and Depth_f are fused by convolution to obtain the RGB-D feature map RGBD_f:
RGBD_f = Conv1×1([RGB_f; Depth_f])
S3-2-5 Multi-core dynamic attention convolution is applied to the RGB-D feature map RGBD_f to obtain the attention weights a_rgb and a_depth of the RGB and depth features:
a_rgb = DM_rgb(RGBD_f)
a_depth = DM_depth(RGBD_f)
where DM_rgb(·) denotes multi-core dynamic attention convolution applied to the RGB features, DM_depth(·) denotes multi-core dynamic attention convolution applied to the depth features, a_rgb is the weight assigned to each position in the RGB gesture feature map, and a_depth is the weight assigned to each position in the depth gesture feature map.
S3-2-6 Softmax normalization is applied to the spatial attention weights of each modality to obtain the final spatial attention weights:
A_rgb = exp(a_rgb)/(exp(a_rgb)+exp(a_depth))
A_depth = exp(a_depth)/(exp(a_rgb)+exp(a_depth))
S3-2-7 The gated gesture feature maps RGB′_in and Depth′_in are multiplied by their corresponding attention weights and summed to obtain the fused gesture feature RGBD_out:
RGBD_out = A_rgb·RGB′_in + A_depth·Depth′_in
With this gesture recognition method, the weights of the RGB modality and the depth modality can be adjusted adaptively by the multi-core dynamic attention mechanism, so the feature information of each gesture can be better mined and exploited. Compared with traditional static fusion, dynamically aggregating multiple convolution kernels captures and models the complex interactions between the RGB and depth modalities more flexibly, improving the accuracy and robustness of the fused gesture features. By adaptively adjusting the convolution kernels and using an attention mechanism, the method can automatically perceive and adapt to changes in environment and scene, such as illumination changes and differences in object size, so it is more robust and less affected by lighting conditions. To facilitate deployment on a mobile robot, the invention selects the lightweight YOLOv5 network as the backbone for gesture recognition. This choice reduces the consumption of computing resources and allows the method to run efficiently on mobile robots. The proposed gesture recognition method therefore requires few computing resources, adapts well, and is very suitable for mobile robots.
By using the multi-core dynamic attention mechanism, the invention solves the problem of accurately extracting and fusing gesture features from RGB images and depth images. Introducing the multi-core dynamic attention mechanism makes it possible to adjust the weights of the RGB and depth modalities flexibly and to mine and exploit the feature information of each gesture more effectively, improving the accuracy and robustness of gesture recognition. The method adaptively adjusts the convolution kernels and introduces an attention mechanism to adapt to changes in environment and scene, such as illumination changes and differences in object size, so it is more robust and less affected by lighting conditions.
The robot gesture interaction system constructed according to the invention provides a convenient and reliable way to control a robot system. By building the robot gesture interaction databases, establishing the mapping relations between the databases, creating the static gesture data set and its labels, and receiving messages containing the gesture recognition results, intelligent gesture interaction is realized. This control mode lets a user interact with the robot naturally and intuitively through gestures, offering convenient and reliable control and further enhancing the user experience.
The invention performs gesture recognition by simultaneously acquiring an RGB image and a depth image containing the gesture. Feature extraction is a key link in gesture recognition, and its quality directly influences the recognition result; acquiring both image types means that gesture features must be extracted from the RGB image and from the depth image at the same time.
Multi-modal feature extraction with the multi-core dynamic attention mechanism allows the gesture features of the RGB image and of the depth image to be better extracted and fused, yielding a better gesture recognition result. With this method, static gestures captured by the RealSense depth camera can be recognized in real time, and a person can control the operation of the mobile manipulator robot in real time through gestures.
The method adopts a multi-core dynamic attention mechanism and performs multi-modal feature extraction and fusion on simultaneously acquired RGB and depth images containing gestures, so gesture recognition achieves better accuracy and real-time performance. It improves the gesture recognition effect, enhances the interaction experience, provides a more intuitive and convenient mode of operation, and is of positive significance for advancing human-computer interaction, virtual reality, smart home and related fields.
Drawings
FIG. 1 is a general architecture diagram of a gesture recognition method of the present invention.
FIG. 2 is a mapping of static gestures to motion patterns, tools/modes of operation, user intent.
FIG. 3 is a block diagram of a gesture recognition model.
FIG. 4 shows the structures of the modules: (a) the Focus module, (b) the CBS module, (c) the C3 module, and (d) the SPP module.
Fig. 5 is a block diagram of a feature fusion module.
FIG. 6 is a block diagram of a multi-core dynamic attention mechanism.
FIG. 7 is a static gesture recognition flow chart.
Detailed Description
The invention is intended for a mobile manipulator robot, in particular a robot that performs explosive ordnance disposal (bomb-disposal) tasks. When handling crises involving explosives in public places such as subways, shopping malls and airports, wireless signal shielding measures often have to be applied on site to avoid triggering an explosion. In this scenario, remotely controlling the robot with gestures is a better option than conventional tethered or wireless-controller schemes, because gesture control offers more natural interaction, is more convenient to use, and adapts better to a working environment in which network signals are shielded.
As shown in fig. 1, the invention mainly involves three aspects: (1) the mapping relation between the operation instructions of the mobile robot and static gestures; (2) a static gesture recognition method based on the multi-core dynamic attention mechanism; and (3) human-robot interaction between the gesture performer and the mobile manipulator robot. When the gesture performer makes a preset static gesture, the gesture recognition method recognizes the meaning of the gesture and sends it to the robot side as a message. After receiving the message, the robot compares the gesture information in the message, looks up the operation instruction corresponding to the gesture in the mapping relation, and finally executes the corresponding operation.
The invention is based on a robot gesture interaction system, and the construction of the robot gesture interaction system mainly comprises the following aspects:
1. and establishing a robot gesture interaction database.
The robot gesture interaction database comprises four parts: a motion pattern database, a tool/operation pattern database, a user intent database, and a static gesture database. The purpose of these databases is to record information such as robot movements, operation modes, user intentions, and static gestures.
The motion pattern database records the robot's motion modes, including forward-left, forward, forward-right, backward-left, backward, backward-right, arm up, arm down, arm forward-up, arm forward-down, chassis turn left and chassis turn right.
The tool/operation mode database records the operation modes of the end effector of the robotic arm; for the wheeled mobile manipulator robot the end effector is a gripper, so the opening and closing of the gripper are also recorded in this database.
The user intent database contains three expressions of user intent: yes, no, and uncertain (please repeat). These expressions are used for information exchange and mutual understanding between the user and the mobile manipulator robot.
The static gesture database records 17 different static gestures through which a user can interact with the robot. FIG. 2 shows which gesture is used for each specific operation.
2. Constructing the mapping relations between the databases.
The mapping relation between the databases comprises the mapping relation between the static gestures and the user intention, the mapping relation between the static gestures and the tool/operation mode and the mapping relation between the static gestures and the robot motion mode.
In the robot gesture interaction system, mapping relations exist among databases, and the purpose of the mapping relations is to achieve conversion and matching between static gestures and a robot motion mode, static gestures and a tool/operation mode and between the static gestures and user intentions.
First, the mapping relation between static gestures and user intent is established: by recognizing a static gesture, the robot gesture interaction system can judge whether the user's intent is yes, no, or uncertain (please repeat). Second, the mapping relation between static gestures and the tool/operation mode is established: by recognizing a static gesture, the system can determine the specific operation the user wishes to perform, such as opening or closing the gripper. Finally, the mapping relation between static gestures and the robot motion mode is established: by recognizing a static gesture, the system can determine the motion the robot should take, such as moving forward, moving backward or turning. With these mapping relations in place, the robot gesture interaction system can carry out the corresponding operation and interaction accurately according to the user's gesture instructions. FIG. 2 illustrates the overall mapping.
3. A static gesture dataset is established.
Various preset static gestures are photographed by calling the camera through functions in the pyrealsense2 package, and the RGB image and the depth image of each static gesture are acquired simultaneously to generate the static gesture data set.
To train the gesture recognition model, an Intel RealSense D455 depth camera is used to capture the static gesture data set. The intrinsic and extrinsic parameters of the camera are calibrated with the calibration method provided by the RealSense SDK to ensure data accuracy.
At distances of 0.4 m, 0.8 m and 1.2 m, no fewer than 300, 300 and 200 groups of gesture pictures are acquired, respectively. Each group includes an RGB image and a depth image at a resolution of 640 × 480 pixels. To ensure the diversity and robustness of the data, each gesture is recorded by several different participants in a laboratory environment, and left-hand and right-hand gestures are treated as the same gesture category. Finally, 6232 gesture pictures are selected from the static gesture data set as the training set, 5284 as the validation set and 5074 as the test set. These subsets constitute the static gesture data used to train and evaluate the gesture recognition model.
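For reference, the sketch below shows one way the paired 640 × 480 RGB and depth frames could be captured with the pyrealsense2 package mentioned above; the file names, the choice to align depth to the color stream, and the saving format are illustrative assumptions rather than details specified by the patent.

import numpy as np
import cv2
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # align the depth frame to the color frame

try:
    frames = align.process(pipeline.wait_for_frames())
    color_frame = frames.get_color_frame()
    depth_frame = frames.get_depth_frame()
    color_image = np.asanyarray(color_frame.get_data())   # (480, 640, 3) uint8 BGR
    depth_image = np.asanyarray(depth_frame.get_data())   # (480, 640) uint16 depth units (typically mm)
    cv2.imwrite("gesture_0001_rgb.png", color_image)       # hypothetical file names
    cv2.imwrite("gesture_0001_depth.png", depth_image)     # 16-bit PNG keeps raw depth values
finally:
    pipeline.stop()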
4. Labeling of a static gesture dataset.
The RGB image and the depth image of every gesture picture in the static gesture data set are annotated with the location where the gesture appears and the category of the gesture, producing a computer file that contains the annotation information.
The static gesture data set must be labeled in order to train the gesture recognition model. Labeling covers the location where the gesture appears and the gesture category. For the location, a labeling tool (such as labelme) is used to select the region of the image where the gesture appears and to generate the corresponding annotation file; a rectangular box marks the gesture position for subsequent processing. For the category, a label is created for each gesture class and associated with the gesture location, yielding a data set file that contains the annotation information.
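As an illustration of this labeling pipeline, the snippet below converts one labelme rectangle annotation into a YOLO-style label file; the class list and output layout are hypothetical and would have to match the 17 gesture categories actually defined in the data set.

import json
from pathlib import Path

CLASSES = [f"gesture_{i}" for i in range(17)]   # placeholder names for the 17 static gestures

def labelme_to_yolo(json_path: str, out_dir: str) -> None:
    """Convert a labelme rectangle annotation file to a YOLO-format .txt label."""
    data = json.loads(Path(json_path).read_text())
    w, h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        if shape.get("shape_type") != "rectangle":
            continue
        (x1, y1), (x2, y2) = shape["points"]
        cls = CLASSES.index(shape["label"])
        xc = (x1 + x2) / 2.0 / w            # normalized box center
        yc = (y1 + y2) / 2.0 / h
        bw = abs(x2 - x1) / w               # normalized box size
        bh = abs(y2 - y1) / h
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    out = Path(out_dir) / (Path(json_path).stem + ".txt")
    out.write_text("\n".join(lines))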
5. Receiving the gesture recognition result message and performing the corresponding operation according to the mapping relation.
After the robot gesture interaction system receives the recognition result sent by the gesture recognition model, it compares the gesture information in the result and, on the terminal of the wheeled mobile manipulator robot, looks up the operation instruction corresponding to the gesture in the mapping relation. According to the mapping relation, the robot gesture interaction system then performs the operation corresponding to the gesture, such as a robot movement, a switch of the tool/operation mode, or an interpretation of the user's intent.
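A minimal sketch of how this terminal-side lookup could be organized is given below; the gesture names and command strings are placeholders, since the concrete mapping is defined by FIG. 2 rather than by this text.

# Hypothetical mapping from recognized gesture labels to robot operations.
GESTURE_TO_COMMAND = {
    "gesture_forward": ("motion", "move_forward"),
    "gesture_backward": ("motion", "move_backward"),
    "gesture_open_gripper": ("tool", "open_gripper"),
    "gesture_close_gripper": ("tool", "close_gripper"),
    "gesture_yes": ("intent", "yes"),
    "gesture_no": ("intent", "no"),
}

def dispatch(gesture_label: str, confidence: float, threshold: float = 0.5):
    """Return the (category, command) mapped to a recognized gesture,
    or None when the confidence is too low or the gesture is unmapped."""
    if confidence < threshold:
        return None
    return GESTURE_TO_COMMAND.get(gesture_label)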
Through the robot gesture interaction system constructed by the above steps, the user can control the mobile manipulator robot naturally and intuitively.
As shown in fig. 7, the gesture recognition method based on the multi-core dynamic attention mechanism of the present invention includes the following steps:
s1, constructing a gesture recognition model.
A parallel dual-branch YOLOv5 network is adopted as the gesture recognition model. As shown in fig. 3, the YOLOv5 network comprises a backbone network and a detection head. The backbone network is mainly responsible for extracting gesture features, while the detection head predicts the location and category of the gesture. Key modules of the YOLOv5 network include the Focus module, the CBS module, the C3 module and the SPP (spatial pyramid pooling) module; their structures are shown in FIG. 4.
The Focus module slices the input image so that the number of channels is expanded to four times the original, and then obtains a down-sampled feature map with a single convolution. This down-sampling not only reduces the number of model parameters but also speeds up inference. The CBS module is the basic building block of the YOLOv5 network and combines a two-dimensional convolution, batch normalization and the SiLU activation function. The C3 module is a key component of the YOLOv5 backbone; it is composed of several CBS modules and forms a residual connection. The SPP module applies max-pooling kernels of different sizes to the feature map to handle inconsistent input sizes and changes in object scale.
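For orientation, a minimal PyTorch sketch of the CBS and Focus blocks described above is shown below; channel counts and kernel sizes are illustrative and follow the public YOLOv5 design rather than values fixed by the patent.

import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv2d + BatchNorm + SiLU, the basic YOLOv5 building block."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    """Slice the input into four spatially interleaved maps (4x channels,
    half resolution), then apply one CBS convolution."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = CBS(c_in * 4, c_out, k, 1)

    def forward(self, x):
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                                    x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1))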
The backbone network extracts image features at different scales, including low-level features (such as texture and edges) from shallow layers and high-level semantic features from deep layers. As shown in FIG. 5, the invention improves the backbone by adding a feature fusion module after each C3 module in the backbone part of YOLOv5, so that RGB gesture features and depth gesture features can be better extracted and fused.
The feature fusion module is used for combining RGB gesture features and depth gesture features to obtain richer and more accurate feature representation. After each C3 module, the RGB and depth gesture features are concatenated and further fused through a series of convolution operations. Therefore, the gesture recognition model can better utilize complementarity between RGB and depth gesture information, and the gesture detection and recognition performance is improved.
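The sketch below illustrates one way such a dual-branch stage could be wired; how the fused output is routed onward is not specified here, so the code simply returns the per-modality features together with the fused map, and the C3 and fusion modules are assumed to be defined elsewhere.

import torch.nn as nn

class DualBranchStage(nn.Module):
    """One stage of the parallel backbone: a C3 block per modality followed by
    a feature-fusion module; the concrete C3 and fusion modules are passed in."""
    def __init__(self, c3_rgb: nn.Module, c3_depth: nn.Module, fusion: nn.Module):
        super().__init__()
        self.c3_rgb, self.c3_depth, self.fusion = c3_rgb, c3_depth, fusion

    def forward(self, rgb_feat, depth_feat):
        rgb_feat = self.c3_rgb(rgb_feat)
        depth_feat = self.c3_depth(depth_feat)
        fused = self.fusion(rgb_feat, depth_feat)   # fused RGB-D features for the next stage / head
        return rgb_feat, depth_feat, fused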
S2, acquiring RGB images and depth images of the gestures.
In order to acquire data containing gesture-related information, the invention adopts a RealSense depth camera to acquire RGB images and depth images of gestures simultaneously. In this way, rich gesture visual information and depth information can be obtained to aid in the gesture recognition process.
S3, extracting gesture features.
Using the multi-core dynamic attention mechanism, the RGB image and the depth image of the gesture are fed into the gesture recognition model through forward propagation to extract image features from each of them; the features extracted from the RGB image and those extracted from the depth image are then fused so that the gesture is described and represented more completely. The extracted image features contain the semantic information of the gesture.
S3-1 construction of a multi-core dynamic attention mechanism.
The multi-core dynamic attention convolution dynamically adjusts the convolution weights and bias parameters using attention weights: the attention weights of the different modalities are computed by the SE module, and the features of the different modalities are then weighted and summed with these weights to obtain the fused features.
In fig. 6, the part inside the dashed box is the modified SE attention mechanism used to obtain the weight π[k] of each convolution kernel. It operates as follows: first, the K convolution kernels are multiplied element-wise by their corresponding weights to obtain weighted convolution kernels; then the weighted kernels are concatenated, and the result is processed by batch normalization and a ReLU activation function; finally, the improved feature map is output. The specific operation is as follows:
S3-1-1 K different weights are generated, each corresponding to one convolution kernel; the convolution of the input with each convolution kernel can be expressed as:
output[k] = Conv2d(x, W[k]) + b[k]
where x is the input feature map, W[k] is the kth convolution kernel, b[k] is the kth bias, and output[k] is the output feature map obtained by convolving the input feature map x with the kth convolution kernel W[k] and adding the bias b[k].
S3-1-2 The output of each convolution kernel is multiplied by the corresponding attention weight.
S3-1-3 All weighted outputs are accumulated to form the final output:
output = Σ_{k=1..K} π[k]·output[k]
where π[k] is the kth weight obtained by the attention mechanism. Compared with a standard convolution layer, dynamic convolution lets the gesture recognition model adapt its parameters to the characteristics of each input sample, which improves the model's performance.
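A compact PyTorch sketch of such a multi-kernel dynamic convolution is given below; the SE-style gate (pooling followed by two linear layers and a softmax) and the default of K = 4 kernels are assumptions, and the kernels are weighted before the convolution because that is mathematically equivalent to weighting the per-kernel outputs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Multi-kernel dynamic convolution: K parallel kernels whose weights and
    biases are combined with sample-dependent attention weights pi[k]."""
    def __init__(self, c_in, c_out, k=3, K=4, reduction=4):
        super().__init__()
        self.K, self.k, self.c_in, self.c_out = K, k, c_in, c_out
        self.weight = nn.Parameter(torch.randn(K, c_out, c_in, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, c_out))
        self.attn = nn.Sequential(                        # SE-style gate producing pi[k]
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c_in, max(c_in // reduction, 4)), nn.ReLU(inplace=True),
            nn.Linear(max(c_in // reduction, 4), K),
        )

    def forward(self, x):
        b = x.size(0)
        pi = F.softmax(self.attn(x), dim=1)                    # (B, K) per-sample kernel weights
        w = torch.einsum("bk,koihw->boihw", pi, self.weight)   # per-sample aggregated kernels
        bias = torch.einsum("bk,ko->bo", pi, self.bias)        # per-sample aggregated biases
        # grouped-convolution trick: fold the batch into the channel dimension
        x = x.reshape(1, b * self.c_in, *x.shape[2:])
        w = w.reshape(b * self.c_out, self.c_in, self.k, self.k)
        out = F.conv2d(x, w, bias.reshape(-1), padding=self.k // 2, groups=b)
        return out.reshape(b, self.c_out, *out.shape[2:])

As a quick shape check, DynamicConv2d(64, 128)(torch.randn(2, 64, 40, 40)) would return a tensor of shape (2, 128, 40, 40).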
And S3-2, constructing a feature fusion module.
As shown in fig. 6, the fusion module first fuses and refines the RGB and depth features in the spatial dimension during the feature fusion stage, so the gesture recognition model can capture and exploit the long-range spatial dependencies between the two modalities. Through this fusion and refinement, the model better understands the distribution and morphological changes of gestures in the image. Next, features are fused in the channel dimension in an attention-guided feature fusion stage. This step lets the model capture and exploit the cross-channel contextual dependencies between the RGB and depth modalities. The feature fusion module extracts important channel information through an attention mechanism, strengthening the model's perception of the correlations between modalities. It works as follows:
S3-2-1 The RGB image of the gesture obtained by the gesture recognition model is RGB_in ∈ R^(C×H×W), and the depth image of the gesture is Depth_in ∈ R^(C×H×W). The RGB image and the depth image each pass through a gated convolution layer:
RGB′_in = σ(Conv1×1(RGB_in))·RGB_in
Depth′_in = σ(Conv1×1(Depth_in))·Depth_in
where σ denotes the activation function and Conv1×1(x) denotes a 1×1 convolution applied to the input x.
After the gating operation, adaptive average pooling (AAP) is applied to generate a cross-modal descriptor X = (X_1, ..., X_k, ..., X_2C):
X = AAP(RGB′_in || Depth′_in)
where || denotes channel-wise concatenation.
S3-2-2 The cross-modal attention vectors for the RGB and depth inputs are learned by:
W_rgb = σ(Conv1×1(ReLU(DM(X))))
W_depth = σ(Conv1×1(ReLU(DM(X))))
where W_rgb is the weight of the RGB gesture features, W_depth is the weight of the depth gesture features, DM(·) denotes the proposed multi-core dynamic attention module, and Conv1×1(x) denotes a 1×1 convolution.
S3-2-3 The gated gesture feature maps RGB′_in and Depth′_in are multiplied by the channel weights W_rgb and W_depth, respectively, to adjust or enhance the gesture features:
RGB_f = W_rgb·RGB′_in
Depth_f = W_depth·Depth′_in
This ensures that important features are obtained from both inputs and can be adjusted as the gesture recognition model requires.
Multiplying the gesture feature maps (RGB and depth) element-wise by the corresponding channel weights weights the individual channels. The purpose of this weighting is to highlight or suppress the information of particular channels and thereby adjust or enhance the gesture features; it introduces a correlation, or weight relationship, between the channels of the feature map. By choosing and adjusting the weight values appropriately, the representation of the gesture feature map can be changed so that the features of some channels become more prominent while those of other channels are weakened or ignored. Weighting the channels therefore tunes how strongly the information of each channel contributes to the feature representation, which helps extract more discriminative and important features and provides more accurate and effective input for subsequent gesture recognition or other tasks.
S3-2-4 RGB_f and Depth_f are fused by convolution to obtain the RGB-D feature map RGBD_f:
RGBD_f = Conv1×1([RGB_f; Depth_f])
In this way, the RGB image and the depth image containing the gesture yield a unified gesture feature descriptor RGBD_f.
S3-2-5 Multi-core dynamic attention convolution is applied to the RGB-D feature map RGBD_f to obtain the attention weights a_rgb and a_depth of the RGB and depth features:
a_rgb = DM_rgb(RGBD_f)
a_depth = DM_depth(RGBD_f)
where DM_rgb(·) denotes multi-core dynamic attention convolution applied to the RGB features, DM_depth(·) denotes multi-core dynamic attention convolution applied to the depth features, a_rgb is the weight assigned to each position in the RGB gesture feature map, and a_depth is the weight assigned to each position in the depth gesture feature map.
S3-2-6 Softmax normalization is applied to the spatial attention weights of the two modalities to obtain the final spatial attention weights:
A_rgb = exp(a_rgb)/(exp(a_rgb)+exp(a_depth))
A_depth = exp(a_depth)/(exp(a_rgb)+exp(a_depth))
S3-2-7 The gated gesture feature maps RGB′_in and Depth′_in are multiplied by their corresponding attention weights and summed to obtain the fused gesture feature RGBD_out:
RGBD_out = A_rgb·RGB′_in + A_depth·Depth′_in
The fused gesture feature RGBD_out is then fed into the subsequent part of the gesture recognition model.
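Putting steps S3-2-1 to S3-2-7 together, one possible PyTorch sketch of the fusion module is shown below; it reuses the DynamicConv2d sketch given earlier for the DM modules, assumes a sigmoid gate and particular layer widths, and is only an approximation of the structure shown in the figures, not the definitive implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Sketch of the fusion of steps S3-2-1 ... S3-2-7 for C-channel inputs."""
    def __init__(self, c):
        super().__init__()
        self.gate_rgb = nn.Conv2d(c, c, 1)
        self.gate_depth = nn.Conv2d(c, c, 1)
        self.dm_channel = DynamicConv2d(2 * c, 2 * c, k=1)   # DM(X) on the cross-modal descriptor
        self.fc_rgb = nn.Conv2d(2 * c, c, 1)
        self.fc_depth = nn.Conv2d(2 * c, c, 1)
        self.fuse = nn.Conv2d(2 * c, c, 1)
        self.dm_rgb = DynamicConv2d(c, 1, k=3)                # spatial attention logits per modality
        self.dm_depth = DynamicConv2d(c, 1, k=3)

    def forward(self, rgb, depth):
        # S3-2-1: gated convolution, then adaptive average pooling of the concatenation
        rgb_g = torch.sigmoid(self.gate_rgb(rgb)) * rgb
        depth_g = torch.sigmoid(self.gate_depth(depth)) * depth
        x = F.adaptive_avg_pool2d(torch.cat([rgb_g, depth_g], dim=1), 1)   # (B, 2C, 1, 1)
        # S3-2-2: channel attention vectors W_rgb, W_depth
        w = F.relu(self.dm_channel(x))
        w_rgb = torch.sigmoid(self.fc_rgb(w))
        w_depth = torch.sigmoid(self.fc_depth(w))
        # S3-2-3: channel re-weighting of the gated features
        rgb_f = w_rgb * rgb_g
        depth_f = w_depth * depth_g
        # S3-2-4: 1x1 convolution over the concatenation gives RGBD_f
        rgbd = self.fuse(torch.cat([rgb_f, depth_f], dim=1))
        # S3-2-5: per-modality spatial attention logits from RGBD_f
        a_rgb = self.dm_rgb(rgbd)
        a_depth = self.dm_depth(rgbd)
        # S3-2-6: softmax over the two modalities at every spatial position
        attn = torch.softmax(torch.cat([a_rgb, a_depth], dim=1), dim=1)
        # S3-2-7: weighted sum of the gated features gives RGBD_out
        return attn[:, 0:1] * rgb_g + attn[:, 1:2] * depth_g

Deriving both spatial attention maps from the shared RGBD_f keeps each modality's weighting conditioned on the fused evidence, which is the behaviour steps S3-2-5 to S3-2-7 describe.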
Through the design and improvement of the fusion module, the invention can effectively fuse RGB and depth gesture features and make full use of the spatial and channel correlations between them. In this way the network understands the gesture data more completely, and the performance and accuracy of the gesture recognition task improve. The feature fusion module gives the network richer representational capacity and further optimizes the gesture recognition model; besides improving gesture recognition, it offers an effective solution for multi-modal data fusion and analysis.
S4, detecting the position of the gesture.
The extracted gesture features are passed to the detection head of the gesture recognition model, which predicts the precise position of the gesture frame and the corresponding gesture category. By analyzing and decoding the detection head output, the robot gesture interaction system obtains the exact position of the gesture in the image and identifies the specific category to which it belongs.
S5, recognizing gestures.
Further processing and screening are carried out according to the predicted gesture frame positions and the corresponding confidence scores:
S5-1 The system sorts all detected gesture frames by confidence score and keeps the frames with higher confidence.
S5-2 The system uses non-maximum suppression (NMS) to eliminate overlapping frames, retaining only the most representative and accurate gesture frames.
S5-3 The output gesture frame contains the exact position of the gesture, the corresponding category label, the confidence and other details.
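A minimal post-processing sketch for the gesture frames (bounding boxes) using torchvision's NMS operator is shown below; the confidence and IoU thresholds are illustrative defaults rather than values prescribed by the patent.

import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor, labels: torch.Tensor,
                conf_thres: float = 0.25, iou_thres: float = 0.45):
    """Confidence filtering followed by class-agnostic NMS on (N, 4) xyxy boxes."""
    keep = scores >= conf_thres                        # S5-1: drop low-confidence frames
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    order = nms(boxes, scores, iou_thres)              # S5-2: suppress overlapping frames
    return boxes[order], scores[order], labels[order]  # S5-3: position, category, confidence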
And S6, sending the recognition result of the gesture to the robot terminal in a message format.
To facilitate subsequent applications and system integration, the system sends the gesture recognition result in the form of a message. In this way the recognized gesture can be delivered to other systems or applications for integration and interactive communication.
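As an example of this message-sending step, the sketch below publishes the recognition result as a JSON string on a ROS topic; the use of ROS, the topic name and the message layout are assumptions, since the patent only states that the result is sent in a message format.

import json
import rospy
from std_msgs.msg import String

rospy.init_node("gesture_recognizer")
pub = rospy.Publisher("/gesture_recognition/result", String, queue_size=10)  # hypothetical topic

def publish_result(label: str, confidence: float, box_xyxy):
    """Serialize one recognition result and publish it to the robot side."""
    msg = json.dumps({"gesture": label,
                      "confidence": float(confidence),
                      "box": [float(v) for v in box_xyxy]})
    pub.publish(String(data=msg))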

Claims (4)

1. A gesture recognition method based on a multi-core dynamic attention mechanism, characterized by comprising the following steps:
S1, constructing a gesture recognition model: using a parallel dual-branch YOLOv5 network as the gesture recognition model; the YOLOv5 network comprises a backbone network for extracting gesture features and a detection head for predicting the position and the category of the gesture;
S2, acquiring RGB images and depth images of gestures: using a depth camera to acquire RGB images and depth images of the operation control gestures performed by the controller;
S3, extracting gesture features: using the multi-core dynamic attention mechanism, inputting the RGB image and the depth image of the gesture into the gesture recognition model through forward propagation, extracting image features from the RGB image and from the depth image respectively, and fusing the image features extracted from the RGB image with those extracted from the depth image; the extracted image features contain semantic information of the gesture;
S4, detecting the position of the gesture: predicting the gesture in the extracted image features through the detection head of the gesture recognition model to obtain the position of the gesture frame, the category of the gesture frame and the confidence of the gesture frame;
S5, recognizing the gesture: screening the gesture frames according to the obtained positions and confidences, eliminating overlapping gesture frames with a non-maximum suppression (NMS) algorithm, and finally outputting a recognition result comprising the position of the gesture, the category of the gesture and the confidence of the gesture in the gesture frame;
and S6, sending the recognition result of the gesture to the robot terminal in a message format.
2. The gesture recognition method according to claim 1, wherein the multi-core dynamic attention mechanism of the gesture recognition model in step S3 operates as follows: the multi-core dynamic attention convolution dynamically adjusts the convolution weights and bias parameters by using attention weights to compute a weighted average of the convolution parameters; the attention weights of the different modalities are calculated by an SE module, and these weights are then used to compute a weighted sum of the features of the different modalities to obtain the fused features.
3. The gesture recognition method according to claim 2, wherein the specific working mode of the multi-core dynamic attention mechanism of the gesture recognition model is:
S3-1-1, generating K different weights, each corresponding to one convolution kernel, the convolution of the input with each convolution kernel being expressed as:
output[k] = Conv2d(x, W[k]) + b[k]
wherein x is the input feature map, W[k] is the kth convolution kernel, b[k] is the kth bias, and output[k] is the output feature map obtained by convolving the input feature map x with the kth convolution kernel W[k] and adding the bias b[k];
S3-1-2, multiplying the output of each convolution kernel by the corresponding weight;
S3-1-3, accumulating all weighted outputs to form the final output:
output = Σ_{k=1..K} π[k]·output[k]
where π[k] is the kth weight obtained by the attention mechanism.
4. The gesture recognition method according to claim 1, wherein the feature fusion of the gesture recognition model in step S3 works in the following manner:
S3-2-1, the RGB image of the gesture obtained by the gesture recognition model is RGB_in ∈ R^(C×H×W), and the depth image of the gesture obtained by the gesture recognition model is Depth_in ∈ R^(C×H×W); the RGB image and the depth image each pass through a gated convolution layer:
RGB′_in = σ(Conv1×1(RGB_in))·RGB_in
Depth′_in = σ(Conv1×1(Depth_in))·Depth_in
wherein σ denotes the activation function and Conv1×1(x) denotes a 1×1 convolution applied to the input x;
after the gating operation, adaptive average pooling (AAP) is applied to generate a cross-modal descriptor X = (X_1, ..., X_k, ..., X_2C):
X = AAP(RGB′_in || Depth′_in)
wherein || denotes channel-wise concatenation;
S3-2-2, the cross-modal attention vectors for the RGB and depth inputs are learned by:
W_rgb = σ(Conv1×1(ReLU(DM(X))))
W_depth = σ(Conv1×1(ReLU(DM(X))))
wherein W_rgb is the weight of the RGB gesture features, W_depth is the weight of the depth gesture features, DM(·) denotes the proposed multi-core dynamic attention module, and Conv1×1(x) denotes a 1×1 convolution;
S3-2-3, the gated gesture feature maps RGB′_in and Depth′_in are multiplied by the channel weights W_rgb and W_depth, respectively, to adjust or enhance the gesture features:
RGB_f = W_rgb·RGB′_in
Depth_f = W_depth·Depth′_in
S3-2-4, RGB_f and Depth_f are fused by convolution to obtain the RGB-D feature map RGBD_f:
RGBD_f = Conv1×1([RGB_f; Depth_f]);
S3-2-5, multi-core dynamic attention convolution is applied to the RGB-D feature map RGBD_f to obtain the attention weights a_rgb and a_depth of the RGB and depth features:
a_rgb = DM_rgb(RGBD_f)
a_depth = DM_depth(RGBD_f)
wherein DM_rgb(·) denotes multi-core dynamic attention convolution applied to the RGB features, DM_depth(·) denotes multi-core dynamic attention convolution applied to the depth features, a_rgb is the weight assigned to each position in the RGB gesture feature map, and a_depth is the weight assigned to each position in the depth gesture feature map;
S3-2-6, softmax normalization is applied to the spatial attention weights of each modality to obtain the final spatial attention weights:
A_rgb = exp(a_rgb)/(exp(a_rgb)+exp(a_depth))
A_depth = exp(a_depth)/(exp(a_rgb)+exp(a_depth))
S3-2-7, the gated gesture feature maps RGB′_in and Depth′_in are multiplied by their corresponding attention weights and summed to obtain the fused gesture feature RGBD_out:
RGBD_out = A_rgb·RGB′_in + A_depth·Depth′_in
CN202311098247.7A 2023-08-29 2023-08-29 Gesture recognition method based on multi-core dynamic attention mechanism Pending CN117152838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311098247.7A CN117152838A (en) 2023-08-29 2023-08-29 Gesture recognition method based on multi-core dynamic attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311098247.7A CN117152838A (en) 2023-08-29 2023-08-29 Gesture recognition method based on multi-core dynamic attention mechanism

Publications (1)

Publication Number Publication Date
CN117152838A true CN117152838A (en) 2023-12-01

Family

ID=88886127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311098247.7A Pending CN117152838A (en) 2023-08-29 2023-08-29 Gesture recognition method based on multi-core dynamic attention mechanism

Country Status (1)

Country Link
CN (1) CN117152838A (en)

Similar Documents

Publication Publication Date Title
Kurata et al. The hand mouse: GMM hand-color classification and mean shift tracking
Chen et al. Repetitive assembly action recognition based on object detection and pose estimation
US8442269B2 (en) Method and apparatus for tracking target object
Gao et al. Dynamic hand gesture recognition based on 3D hand pose estimation for human–robot interaction
CN103098076B (en) Gesture recognition system for TV control
CN108734194B (en) Virtual reality-oriented single-depth-map-based human body joint point identification method
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
Sincan et al. Using motion history images with 3d convolutional networks in isolated sign language recognition
US9734435B2 (en) Recognition of hand poses by classification using discrete values
CN108073851B (en) Grabbing gesture recognition method and device and electronic equipment
CN110674741A (en) Machine vision gesture recognition method based on dual-channel feature fusion
CN111444764A (en) Gesture recognition method based on depth residual error network
Liu et al. Hand gesture recognition based on single-shot multibox detector deep learning
CN110751097A (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN113221663A (en) Real-time sign language intelligent identification method, device and system
CN115328319B (en) Intelligent control method and device based on light-weight gesture recognition
Cao et al. Real-time gesture recognition based on feature recalibration network with multi-scale information
Kabir et al. A novel dynamic hand gesture and movement trajectory recognition model for non-touch HRI interface
CN117252928B (en) Visual image positioning system for modular intelligent assembly of electronic products
CN114067273A (en) Night airport terminal thermal imaging remarkable human body segmentation detection method
Hoang et al. Grasp Configuration Synthesis from 3D Point Clouds with Attention Mechanism
CN112199994B (en) Method and device for detecting interaction of3D hand and unknown object in RGB video in real time
CN117152838A (en) Gesture recognition method based on multi-core dynamic attention mechanism
CN114821777A (en) Gesture detection method, device, equipment and storage medium
CN114581535A (en) Method, device, storage medium and equipment for marking key points of user bones in image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination