WO2024108377A1 - Multimodal multi-task workshop target recognition method - Google Patents
Multimodal multi-task workshop target recognition method
- Publication number
- WO2024108377A1 (PCT application PCT/CN2022/133437)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- decoding
- task
- workshop
- output
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 25
- 238000001514 detection method Methods 0.000 claims abstract description 61
- 230000011218 segmentation Effects 0.000 claims abstract description 59
- 230000004927 fusion Effects 0.000 claims abstract description 46
- 238000012549 training Methods 0.000 claims abstract description 6
- 239000013598 vector Substances 0.000 claims description 46
- 230000008569 process Effects 0.000 claims description 10
- 238000011176 pooling Methods 0.000 claims description 5
- 239000000463 material Substances 0.000 claims description 3
- 238000007500 overflow downdraw method Methods 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 239000003086 colorant Substances 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 5
- 238000010586 diagram Methods 0.000 description 3
- 238000000605 extraction Methods 0.000 description 2
- 238000003709 image segmentation Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000008570 general process Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- the invention relates to an image processing method, and in particular to a multi-modal multi-task workshop target recognition method.
- the existing workshop scene target recognition network mainly adopts the form of a single backbone network, which uses the backbone network to extract features from RGB images and passes them into the decoding network to predict the final result. Its structure is shown in Figure 1 of the attached drawings of the specification. Therefore, the existing workshop scene target recognition technology mainly adopts a single task mode, which uses the features passed in by the backbone network to perform single task reasoning.
- the existing workshop scene target recognition technology mainly adopts a single modality, that is, only RGB modal features are used for scene target recognition.
- the purpose of the present invention is to provide a multimodal and multi-task workshop target recognition method that solves the above-mentioned problems, accurately identifies targets with similar colors in workshop scenes, and performs the target detection task and the instance segmentation task in parallel in workshop scenes.
- a multi-modal multi-task workshop target recognition method comprising the following steps
- Determining categories of targets wherein the categories of targets include workers, lathes, and material transport robots;
- a set of annotated color images and depth images is used as a data sample
- the multi-modal multi-task workshop target recognition network includes an encoding module and a decoding module
- the encoding module includes two ResNet50 backbone networks, and each ResNet50 backbone network is divided into five stages from the input end to the output end, namely the first stage to the fifth stage, which correspondingly output the first feature vector to the fifth feature vector;
- Two ResNet50 backbone networks input the annotated color image and the annotated depth image respectively.
- a fusion module is set between the two second stages, the two third stages, the two fourth stages, and the two fifth stages. From front to back, they are the first to the fourth fusion modules. Among them, the input ends of the first three fusion modules are connected to the output ends of the two previous stages, and the output ends are divided into two paths. After being added with the output ends of the two previous stages, they are sent to the two next stages.
- the fourth fusion module has its input connected to the outputs of the two fifth stages, and its output is divided into two paths and sent to the decoding module;
- the fusion module is used to perform feature fusion on the two input feature vectors and output them;
- the decoding module is used to perform target detection and instance segmentation on the output of the encoding module, and output target detection results and instance segmentation results;
- a set of color images and depth images to be tested in the workshop are obtained and sent to the multimodal multi-task workshop target recognition network, and the target prediction box and predicted instance mask corresponding to the target are output respectively.
- the fusion method of the fusion module is:
- the color image is RGB C×H×W
- the depth image is Depth C×H×W
- C, H, and W are the number of channels, height, and width of the corresponding image respectively;
- Fi is the sub-feature vector corresponding to Xi
- the decoding module comprises a first decoding branch for target detection and a second decoding branch for instance segmentation;
- the first decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and a target detection head arranged in sequence, wherein the first three layers each perform an upsampling operation and each output a target detection feature, the target detection feature output by the third decoding layer is 1/4 the size of the color image, and the target detection head first upsamples the output of the third decoding layer and then predicts the prediction boxes to obtain the target detection result;
- the second decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and an instance segmentation head arranged in sequence, wherein the first three layers each perform an upsampling operation and each output an instance segmentation feature, the instance segmentation feature output by the third decoding layer is 1/4 the size of the color image, and the instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result;
- a feature sharing module is provided between the two first decoding layers
- the feature sharing module is used to input the target detection feature and the instance segmentation feature, splice them according to the channel dimension to obtain a second splicing feature, and then divide the second splicing feature into two paths, one path is spatially pooled to obtain a spatial attention vector, and the other path is channel pooled to obtain a channel attention vector;
- the spatial attention vector and the channel attention vector are respectively element-wise multiplied with the second splicing feature, and then the results of the element-wise product are added to obtain the processed second splicing feature;
- the processed second concatenated features are split according to the channel dimension to obtain processed target detection features and processed instance segmentation features, and are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and then sent to the next process by the two first decoding layers;
- a feature sharing module is also provided between the two second decoding layers and between the two third decoding layers.
- the depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640x480.
- the Resnet50 backbone network consists of five stages, namely the first stage to the fifth stage, also known as layer0, layer1, layer2, layer3, layer4.
- the first stage is layer0, which does not contain residual blocks. It mainly performs convolution, regularization, activation function, and maximum pooling calculations on the input, while the remaining four stages all contain residual blocks.
- Each stage will correspond to the output feature map or feature vector.
- the present invention adopts two Resnet50 backbone networks and adds four fusion modules between them. This is because target scales in the workshop scene differ greatly: using convolution kernels of a single size on the feature map may ignore the details of small targets, or the receptive field of the kernel may fail to capture all the information of large targets. The design therefore draws on the idea of the ESPANet network: the input features are split into multiple sub-feature blocks along the channel dimension, and convolution kernels of different sizes are then applied to these sub-feature blocks for feature extraction to obtain the attention vectors.
- in the encoding module, two Resnet50 backbone networks are used, and four fusion modules are added.
- the second to fifth stages of the Resnet50 backbone network are the four downsampling stages. After feature extraction in each of these stages, the two extracted features are fused and corrected by the fusion module. After the color-image features and depth-image features are input, this module uses channel attention to highlight the representative features of each modality while suppressing the noise contained in the data.
- the fusion module splits the input features into multiple sub-feature blocks according to the channel dimension, and then uses convolution kernels of different sizes to extract features from these sub-feature blocks, thereby having better adaptability to multi-scale targets.
- the two branches are inferred in parallel to achieve the target detection task and instance segmentation task of the scene target at the same time.
- the target detection features and instance segmentation features are passed to the feature sharing module at the first decoding layer, the second decoding layer, and the third decoding layer, respectively, to achieve mutual complementary optimization between tasks.
- the decoding module also sets up three feature sharing modules between the first decoding branch and the second decoding branch.
- the target detection features and instance segmentation features are first concatenated according to the channel dimension, and the concatenated features are pooled in the spatial dimension and the channel dimension to obtain the spatial attention vector and the channel attention vector.
- the spatial attention vector and the channel attention vector are used to highlight the representative features at the spatial level and the channel level, respectively, while suppressing the noise.
- the highlighted features at the channel level and the spatial level are merged by element-wise addition, and split according to the modality to complete the sharing of the two features.
- the present invention proposes a new backbone network, uses the attention mechanism for feature fusion, proposes a multi-task network for instance segmentation and target detection at the same time, and designs a feature sharing module to realize information sharing between the target detection decoding branch and the instance segmentation decoding branch.
- the present invention has good recognition accuracy for color-similar targets in workshop scenes, can realize instance segmentation and target detection in the same scene, and the accuracy rate of target detection tasks in workshop scenes reaches 87%, and the accuracy rate of instance segmentation tasks reaches 81%.
- Fig. 1 is a flow chart of the present invention
- Figure 2 is a schematic diagram of a fusion module
- Fig. 3 is a schematic diagram of a decoding module
- FIG4 is a schematic diagram of a feature sharing module.
- Embodiment 1 Referring to FIG. 1 to FIG. 4 , a multi-modal multi-task workshop target recognition method comprises the following steps;
- Determining categories of targets wherein the categories of targets include workers, lathes, and material transport robots;
- a set of annotated color images and depth images is used as a data sample
- the multi-modal multi-task workshop target recognition network includes an encoding module and a decoding module
- the encoding module includes two ResNet50 backbone networks, and each ResNet50 backbone network is divided into five stages from the input end to the output end, namely the first stage to the fifth stage, which correspondingly output the first feature vector to the fifth feature vector;
- Two ResNet50 backbone networks input the annotated color image and the annotated depth image respectively.
- a fusion module is set between the two second stages, the two third stages, the two fourth stages, and the two fifth stages. From front to back, they are the first to the fourth fusion modules. Among them, the input ends of the first three fusion modules are connected to the output ends of the two previous stages, and the output ends are divided into two paths. After being added with the output ends of the two previous stages, they are sent to the two next stages.
- the fourth fusion module has its input connected to the outputs of the two fifth stages, and its output is divided into two paths and sent to the decoding module;
- the fusion module is used to perform feature fusion on the two input feature vectors and output them;
- the decoding module is used to perform target detection and instance segmentation on the output of the encoding module, and output target detection results and instance segmentation results;
- a set of color images and depth images to be tested in the workshop are obtained and sent to the multimodal multi-task workshop target recognition network, and the target prediction box and predicted instance mask corresponding to the target are output respectively.
- the fusion method of the fusion module is:
- the color image is RGB C×H×W
- the depth image is Depth C×H×W
- C, H, and W are the number of channels, height, and width of the corresponding image respectively;
- Fi is the sub-feature vector corresponding to Xi
- the decoding module includes a first decoding branch for object detection and a second decoding branch for instance segmentation;
- the first decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and a target detection head arranged in sequence, wherein the first three layers each perform an upsampling operation and each output a target detection feature, the target detection feature output by the third decoding layer is 1/4 the size of the color image, and the target detection head first upsamples the output of the third decoding layer and then predicts the prediction boxes to obtain the target detection result;
- the second decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and an instance segmentation head arranged in sequence, wherein the first three layers each perform an upsampling operation and each output an instance segmentation feature, the instance segmentation feature output by the third decoding layer is 1/4 the size of the color image, and the instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result;
- a feature sharing module is provided between the two first decoding layers
- the feature sharing module is used to input the target detection feature and the instance segmentation feature, splice them according to the channel dimension to obtain a second splicing feature, and then divide the second splicing feature into two paths, one path is spatially pooled to obtain a spatial attention vector, and the other path is channel pooled to obtain a channel attention vector;
- the spatial attention vector and the channel attention vector are respectively element-wise multiplied with the second splicing feature, and then the results of the element-wise product are added to obtain the processed second splicing feature;
- the processed second concatenated features are split according to the channel dimension to obtain processed target detection features and processed instance segmentation features, and are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and then sent to the next process by the two first decoding layers;
- a feature sharing module is also provided between the two second decoding layers and between the two third decoding layers.
- the depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640x480.
- the input end is connected to the output end of the two previous stages, and the output end is divided into two paths, which are added to the output ends of the two previous stages and then sent to the two next stages.
- a fusion module is provided between the two second stages.
- the input end of the fusion module is connected to the output ends of the two second stages.
- the output end of the fusion module is divided into two paths, which are added to the output ends of the two second stages and then sent to the two third stages.
- a fusion module is also provided between the two third stages.
- the input end of the fusion module is connected to the output ends of the two third stages.
- the output end of the fusion module is divided into two paths, which are respectively added with the output ends of the two third stages and then sent to the two fourth stages.
- a fusion module is also provided between the two fourth stages.
- the input end of the fusion module is connected to the output ends of the two fourth stages.
- the output end of the fusion module is divided into two paths, which are respectively added with the output ends of the two fourth stages and then sent to the two fifth stages.
- the multimodal multi-task workshop target recognition network is trained using the acquired real target detection labels and instance segmentation labels. Due to the use of multi-task learning methods, two types of loss values, target detection and image segmentation, will be generated. At the same time, due to the differences between tasks, the prediction outputs of each task will have the characteristics of homoscedastic uncertainty. For this reason, a multi-task learning loss function method is used to simultaneously learn regression and classification problems of different scales and quantities.
- the multi-task joint loss function is defined to satisfy the following formula.
- L(W, σ1, σ2) represents the joint loss function of the two tasks, where L1(W) = ||y1 - f_W(x)||^2 represents the loss value of the regression task and L2(W) = -logSoftmax(y2, f_W(x)) represents the loss of the classification task, where y1, y2 are the true label values, f_W(x) is the network prediction value, and σ1, σ2 are the noise scalars output by the two task branches respectively.
- the general process of the present invention is:
- Image A is sent to a ResNet50 backbone network
- image B is sent to another ResNet50 backbone network. See Figure 1.
- the second feature vector A2 corresponding to image A and the second feature vector B2 corresponding to image B are output;
- A2 and B2 are sent to the fusion module for processing, and after steps (2.1)-(2.6), the fused features are obtained; the fused features are divided into two paths, and added with A2 and B2 respectively, and the two added features A2’ and B2’ are obtained and sent to the next stage, that is, the third stage of the two ResNet50 backbone networks.
- the fourth fusion module finally outputs a fusion feature map, which we call a multimodal fusion feature map; at this time, the encoding module is finished;
- the multimodal fusion feature map is divided into two paths, which are respectively sent to the first decoding branch and the second decoding branch of the decoding module.
- the target detection result is output through the first decoding branch
- the instance segmentation result is output through the second decoding branch.
- a feature sharing module between the two branches.
- the workflow of the feature sharing module between the two first decoding layers is as follows: the target detection feature and the instance segmentation feature are first spliced according to the channel dimension, and the spliced features are subjected to feature pooling operations in the spatial dimension and the channel dimension respectively to obtain the spatial attention vector and the channel attention vector.
- the spatial attention vector and the channel attention vector are used to highlight the representative features of the features at the spatial level and the channel level, respectively, while suppressing the noise.
- the highlighted features at the channel level and the spatial level are merged by element addition, and split according to the modality to complete the sharing of the two features.
- the present invention sets three feature sharing modules in the decoding module, and feature sharing is performed once between each decoding layer.
- the present invention has good recognition accuracy for targets with similar colors in workshop scenes, can realize instance segmentation and target detection in the same scene, and the accuracy rate of target detection tasks in workshop scenes reaches 87%, and the accuracy rate of instance segmentation tasks reaches 81%.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
Disclosed is a multimodal multi-task workshop target recognition method, comprising: constructing a sample data set, wherein a data sample therein contains a set consisting of a color image and a depth image, and annotation is performed at a target detection level and an instance segmentation level; constructing a multimodal multi-task workshop target recognition network; training the multimodal multi-task workshop target recognition network; and performing task recognition using the multimodal multi-task workshop target recognition network. In the multimodal multi-task workshop target recognition network constructed in the present invention, an encoding portion uses two ResNet50 backbone networks, and four fusion modules are configured between the two ResNet50 backbone networks, and a decoding portion uses two task branches, and three feature sharing modules are configured between the two task branches. The present invention achieves good recognition accuracy for targets having similar colors in a workshop scenario, and enables instance segmentation and target detection in the same scenario.
Description
The invention relates to an image processing method, and in particular to a multi-modal multi-task workshop target recognition method.
The existing workshop scene target recognition network mainly adopts the form of a single backbone network, which uses the backbone network to extract features from RGB images and passes them into the decoding network to predict the final result. Its structure is shown in Figure 1 of the attached drawings of the specification. The existing workshop scene target recognition technology therefore mainly adopts a single-task mode, which uses the features passed in by the backbone network to perform single-task reasoning.
This has the following defects:
1. The existing workshop scene target recognition technology mainly uses a single modality, that is, only RGB modal features are used for scene target recognition. However, workshop scenes contain a large number of machine tool targets with similar colors and shapes, which are difficult to distinguish for a recognition network that relies on the RGB modality alone.
2. The existing workshop scene target recognition technology mainly adopts a single-task mode and cannot perform the target detection and instance segmentation tasks in the workshop scene at the same time. If the two tasks must run simultaneously, two networks have to be inferred at the same time, which is difficult to satisfy under the low computing resource conditions of a workshop.
Summary of the invention
The purpose of the present invention is to provide a multi-modal multi-task workshop target recognition method that solves the above-mentioned problems, accurately identifies targets with similar colors in workshop scenes, and performs the target detection task and the instance segmentation task in parallel in workshop scenes.
In order to achieve the above object, the technical solution adopted by the present invention is as follows: a multi-modal multi-task workshop target recognition method, comprising the following steps:
(1) Construct a sample data set.
A depth camera is used to capture the workshop site, and each capture produces a corresponding pair of color and depth images.
The categories of targets are determined; the categories of targets include workers, lathes and material transport robots.
The color images and depth images are annotated at the target detection level and the instance segmentation level to obtain the ground-truth boxes for target detection and the ground-truth instance masks for instance segmentation.
An annotated pair of color and depth images is used as one data sample.
(2) Construct the multi-modal multi-task workshop target recognition network.
The multi-modal multi-task workshop target recognition network includes an encoding module and a decoding module.
The encoding module includes two ResNet50 backbone networks; each ResNet50 backbone network is divided into five stages from the input end to the output end, namely the first stage to the fifth stage, which correspondingly output the first feature vector to the fifth feature vector.
The two ResNet50 backbone networks take the annotated color image and the annotated depth image as input, respectively. A fusion module is set between the two second stages, the two third stages, the two fourth stages and the two fifth stages, namely the first to the fourth fusion modules from front to back. The input ends of the first three fusion modules are connected to the output ends of the two preceding stages; their outputs are divided into two paths, added to the outputs of the two preceding stages, and then sent into the two following stages.
The input end of the fourth fusion module is connected to the output ends of the two fifth stages, and its output is divided into two paths and sent into the decoding module.
The fusion module is used to perform feature fusion on the two input feature vectors and output the result.
The decoding module is used to perform target detection and instance segmentation on the output of the encoding module and to output the target detection result and the instance segmentation result.
(3) Train the multi-modal multi-task workshop target recognition network.
The data samples in the sample data set are input into the multi-modal multi-task workshop target recognition network for target detection and instance segmentation. In the first decoding branch, the ground-truth box corresponding to the target in the data sample is taken as the expected output; in the second decoding branch, the ground-truth instance mask of the data sample is taken as the expected output. Training continues until the model converges.
(4) Task recognition with the multi-modal multi-task workshop target recognition network.
A pair of color and depth images to be tested in the workshop is acquired and sent into the multi-modal multi-task workshop target recognition network, which outputs the target prediction boxes and the predicted instance masks of the targets.
Preferably, the fusion method of the fusion module is as follows:
(2.1) The color image is RGB C×H×W and the depth image is Depth C×H×W, where C, H and W are the number of channels, the height and the width of the corresponding image respectively.
(2.2) The color image and the depth image are concatenated along the channel dimension to generate the first concatenated feature RGBD 2C×H×W, which is then split into S sub-feature blocks along the channel dimension, denoted X_0 to X_{S-1}.
(2.3) A convolution with a different kernel size is applied to each sub-feature block to obtain the sub-feature vectors, where the i-th sub-feature block X_i is convolved according to the following formula:
F_i = Conv_i(X_i),
where F_i is the sub-feature vector corresponding to X_i, Conv denotes the convolution operation, and i = 0 to S-1.
(2.4) A global average pooling operation is performed on the S sub-feature vectors, reducing them to a size of S×1×1 and obtaining S weight vectors.
(2.5) The S weight vectors are normalized respectively to obtain S attention vectors.
(2.6) Finally, the obtained attention vectors are multiplied element-wise with the first concatenated feature RGBD to obtain the fused feature.
Preferably, the decoding module includes a first decoding branch for target detection and a second decoding branch for instance segmentation.
The first decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and a target detection head arranged in sequence. The first three layers each perform an upsampling operation and each output a target detection feature, and the target detection feature output by the third decoding layer is 1/4 the size of the color image. The target detection head first upsamples the output of the third decoding layer and then predicts the prediction boxes to obtain the target detection result.
The second decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and an instance segmentation head arranged in sequence. The first three layers each perform an upsampling operation and each output an instance segmentation feature, and the instance segmentation feature output by the third decoding layer is 1/4 the size of the color image. The instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result.
A feature sharing module is provided between the two first decoding layers.
The feature sharing module takes the target detection feature and the instance segmentation feature as input and concatenates them along the channel dimension to obtain the second concatenated feature. The second concatenated feature is then divided into two paths: one path is spatially pooled to obtain a spatial attention vector, and the other path is channel pooled to obtain a channel attention vector.
The spatial attention vector and the channel attention vector are each multiplied element-wise with the second concatenated feature, and the two element-wise products are added to obtain the processed second concatenated feature.
The processed second concatenated feature is split along the channel dimension to obtain the processed target detection feature and the processed instance segmentation feature, which are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and then passed on to the next stage by the two first decoding layers.
Feature sharing modules are also provided between the two second decoding layers and between the two third decoding layers.
Preferably, the depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640x480.
About the Resnet50 backbone network: the Resnet50 backbone network consists of five stages, namely the first stage to the fifth stage, also known as layer0, layer1, layer2, layer3 and layer4. The first stage, layer0, does not contain residual blocks; it mainly applies convolution, regularization, an activation function and max pooling to the input, while the remaining four stages all contain residual blocks. Each stage outputs a corresponding feature map or feature vector.
In the encoding module, the present invention adopts two Resnet50 backbone networks and adds four fusion modules between them. This is because target scales in the workshop scene differ greatly: using convolution kernels of a single size on the feature map may ignore the details of small targets, or the receptive field of the kernel may fail to capture all the information of large targets. The design therefore draws on the idea of the ESPANet network: the input features are split into multiple sub-feature blocks along the channel dimension, and convolution kernels of different sizes are then applied to these sub-feature blocks for feature extraction to obtain the attention vectors.
Compared with the prior art, the advantages of the present invention are as follows.
In the encoding module, two Resnet50 backbone networks are used and four fusion modules are added. The second to fifth stages of the Resnet50 backbone network are the four downsampling stages; after feature extraction in each of these stages, the two extracted features are fused and corrected by the fusion module. After the color-image features and depth-image features are input, this module uses channel attention to highlight the representative features of each modality while suppressing the noise contained in the data.
The fusion module splits the input features into multiple sub-feature blocks along the channel dimension and then extracts features from these sub-feature blocks with convolution kernels of different sizes, which gives better adaptability to multi-scale targets.
In the decoding module, the two branches are inferred in parallel, so the target detection task and the instance segmentation task for scene targets are carried out at the same time. In the decoding stage of each branch, the target detection features and instance segmentation features are passed into the feature sharing modules at the first, second and third decoding layers, achieving complementary optimization between the tasks.
The decoding module also sets three feature sharing modules between the first decoding branch and the second decoding branch. The target detection features and instance segmentation features are first concatenated along the channel dimension, and feature pooling operations are applied to the concatenated feature in the spatial dimension and the channel dimension to obtain the spatial attention vector and the channel attention vector. These two attention vectors are then used to highlight the representative features at the spatial level and the channel level respectively while suppressing noise. Finally, the highlighted features at the channel level and the spatial level are merged by element-wise addition and split according to modality to complete the sharing of the two features.
Because a multi-task learning method is adopted, two types of loss values are produced, one for target detection and one for image segmentation. At the same time, because of the differences between the tasks, the prediction output of each task exhibits homoscedastic uncertainty. For this reason, a multi-task learning loss function is used to simultaneously learn regression and classification problems of different scales and quantities.
In summary, the present invention proposes a new backbone network, uses an attention mechanism for feature fusion, proposes a multi-task network that performs instance segmentation and target detection at the same time, and designs a feature sharing module to share information between the target detection decoding branch and the instance segmentation decoding branch. The present invention achieves good recognition accuracy for targets with similar colors in workshop scenes, realizes instance segmentation and target detection in the same scene, and reaches an accuracy of 87% for the target detection task and 81% for the instance segmentation task in workshop scenes.
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the fusion module;
Fig. 3 is a schematic diagram of the decoding module;
Fig. 4 is a schematic diagram of the feature sharing module.
The present invention will be further described below in conjunction with the accompanying drawings.
Embodiment 1: Referring to Fig. 1 to Fig. 4, a multi-modal multi-task workshop target recognition method comprises the following steps:
(1) Construct a sample data set.
A depth camera is used to capture the workshop site, and each capture produces a corresponding pair of color and depth images.
The categories of targets are determined; the categories of targets include workers, lathes and material transport robots.
The color images and depth images are annotated at the target detection level and the instance segmentation level to obtain the ground-truth boxes for target detection and the ground-truth instance masks for instance segmentation.
An annotated pair of color and depth images is used as one data sample.
(2) Construct the multi-modal multi-task workshop target recognition network.
The multi-modal multi-task workshop target recognition network includes an encoding module and a decoding module.
The encoding module includes two ResNet50 backbone networks; each ResNet50 backbone network is divided into five stages from the input end to the output end, namely the first stage to the fifth stage, which correspondingly output the first feature vector to the fifth feature vector.
The two ResNet50 backbone networks take the annotated color image and the annotated depth image as input, respectively. A fusion module is set between the two second stages, the two third stages, the two fourth stages and the two fifth stages, namely the first to the fourth fusion modules from front to back. The input ends of the first three fusion modules are connected to the output ends of the two preceding stages; their outputs are divided into two paths, added to the outputs of the two preceding stages, and then sent into the two following stages.
The input end of the fourth fusion module is connected to the output ends of the two fifth stages, and its output is divided into two paths and sent into the decoding module.
The fusion module is used to perform feature fusion on the two input feature vectors and output the result.
The decoding module is used to perform target detection and instance segmentation on the output of the encoding module and to output the target detection result and the instance segmentation result.
(3) Train the multi-modal multi-task workshop target recognition network.
The data samples in the sample data set are input into the multi-modal multi-task workshop target recognition network for target detection and instance segmentation. In the first decoding branch, the ground-truth box corresponding to the target in the data sample is taken as the expected output; in the second decoding branch, the ground-truth instance mask of the data sample is taken as the expected output. Training continues until the model converges.
(4) Task recognition with the multi-modal multi-task workshop target recognition network.
A pair of color and depth images to be tested in the workshop is acquired and sent into the multi-modal multi-task workshop target recognition network, which outputs the target prediction boxes and the predicted instance masks of the targets.
In this embodiment, the fusion method of the fusion module is as follows:
(2.1) The color image is RGB C×H×W and the depth image is Depth C×H×W, where C, H and W are the number of channels, the height and the width of the corresponding image respectively.
(2.2) The color image and the depth image are concatenated along the channel dimension to generate the first concatenated feature RGBD 2C×H×W, which is then split into S sub-feature blocks along the channel dimension, denoted X_0 to X_{S-1}.
(2.3) A convolution with a different kernel size is applied to each sub-feature block to obtain the sub-feature vectors, where the i-th sub-feature block X_i is convolved according to the following formula:
F_i = Conv_i(X_i),
where F_i is the sub-feature vector corresponding to X_i, Conv denotes the convolution operation, and i = 0 to S-1.
(2.4) A global average pooling operation is performed on the S sub-feature vectors, reducing them to a size of S×1×1 and obtaining S weight vectors.
(2.5) The S weight vectors are normalized respectively to obtain S attention vectors.
(2.6) Finally, the obtained attention vectors are multiplied element-wise with the first concatenated feature RGBD to obtain the fused feature.
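For illustration only, the following PyTorch sketch shows one possible implementation of steps (2.1)-(2.6). The module name, the choice of S = 4 and the kernel sizes 3/5/7/9 are assumptions made for the example, and softmax is used as the normalization in step (2.5); the patent text only specifies that the sub-feature blocks are convolved with kernels of different sizes and that each weight vector is normalized into an attention vector.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Illustrative sketch of the fusion module in steps (2.1)-(2.6):
    concatenate the RGB and depth features, split the result into S
    sub-feature blocks along the channel dimension, convolve each block
    with a different kernel size, turn each block into a weight vector by
    global average pooling, normalize the weight vectors into attention
    vectors, and reweight the concatenated feature element-wise."""

    def __init__(self, channels, s=4, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert (2 * channels) % s == 0, "2C must be divisible by S"
        self.s = s
        block_ch = 2 * channels // s
        # (2.3) one convolution per sub-feature block, each with its own kernel size
        self.convs = nn.ModuleList(
            [nn.Conv2d(block_ch, block_ch, k, padding=k // 2) for k in kernel_sizes[:s]]
        )
        self.gap = nn.AdaptiveAvgPool2d(1)  # (2.4) global average pooling

    def forward(self, rgb_feat, depth_feat):
        # (2.2) concatenate along the channel dimension and split into S blocks
        rgbd = torch.cat([rgb_feat, depth_feat], dim=1)            # N x 2C x H x W
        blocks = torch.chunk(rgbd, self.s, dim=1)                  # X_0 ... X_{S-1}
        feats = [conv(x) for conv, x in zip(self.convs, blocks)]   # F_i = Conv_i(X_i)
        weights = [self.gap(f) for f in feats]                     # (2.4) S weight vectors
        # (2.5) normalize each weight vector separately (softmax is an assumption)
        att = torch.cat([torch.softmax(w, dim=1) for w in weights], dim=1)  # N x 2C x 1 x 1
        # (2.6) element-wise product with the first concatenated feature
        return rgbd * att
```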
The decoding module includes a first decoding branch for target detection and a second decoding branch for instance segmentation.
The first decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and a target detection head arranged in sequence. The first three layers each perform an upsampling operation and each output a target detection feature, and the target detection feature output by the third decoding layer is 1/4 the size of the color image. The target detection head first upsamples the output of the third decoding layer and then predicts the prediction boxes to obtain the target detection result.
The second decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and an instance segmentation head arranged in sequence. The first three layers each perform an upsampling operation and each output an instance segmentation feature, and the instance segmentation feature output by the third decoding layer is 1/4 the size of the color image. The instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result.
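As an illustration of the branch structure described above, the sketch below models one decoding branch as three upsampling decoding layers followed by a task head; the layer widths, normalization layers and the form of the head are assumptions for the example, since the text only fixes the number of layers, the upsampling behaviour and the 1/4 output scale.

```python
import torch.nn as nn

class DecodeBranch(nn.Module):
    """Illustrative sketch of one decoding branch: three decoding layers,
    each doubling the spatial resolution, whose third output is 1/4 the
    size of the input image, followed by a head that upsamples once more
    and makes dense predictions (boxes or masks, depending on the branch)."""

    def __init__(self, in_ch=4096, out_ch=64, num_outputs=3):
        super().__init__()
        chs = [in_ch, in_ch // 4, in_ch // 16, out_ch]
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(chs[i], chs[i + 1], 3, padding=1),
                nn.BatchNorm2d(chs[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(3)
        ])
        # head: one more x4 upsampling back to the input resolution, then prediction
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(out_ch, num_outputs, 1),
        )

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)  # exposed so the feature sharing modules can mix the two branches
        return self.head(x), feats
```

In a full model the two branches would be stepped layer by layer so that the feature sharing modules described next can exchange the intermediate features after each decoding layer.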
A feature sharing module is provided between the two first decoding layers.
The feature sharing module takes the target detection feature and the instance segmentation feature as input and concatenates them along the channel dimension to obtain the second concatenated feature. The second concatenated feature is then divided into two paths: one path is spatially pooled to obtain a spatial attention vector, and the other path is channel pooled to obtain a channel attention vector.
The spatial attention vector and the channel attention vector are each multiplied element-wise with the second concatenated feature, and the two element-wise products are added to obtain the processed second concatenated feature.
The processed second concatenated feature is split along the channel dimension to obtain the processed target detection feature and the processed instance segmentation feature, which are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and then passed on to the next stage by the two first decoding layers.
Feature sharing modules are also provided between the two second decoding layers and between the two third decoding layers.
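One possible reading of the feature sharing module is sketched below. The exact pooling operators, the sigmoid and the 1x1 convolutions that turn the pooled statistics into attention weights are assumptions; the text only states that the concatenated feature is pooled in the spatial and channel dimensions, reweighted by the two attention vectors, summed and split back per task.

```python
import torch
import torch.nn as nn

class FeatureSharing(nn.Module):
    """Illustrative sketch of the feature sharing module: concatenate the
    target detection feature and the instance segmentation feature along the
    channel dimension, derive a channel attention vector and a spatial
    attention map by pooling, reweight the concatenated feature with each,
    merge the two reweighted results by element-wise addition and split the
    result back into the two task features."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions mapping pooled statistics to attention weights (assumption)
        self.channel_fc = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.spatial_fc = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, det_feat, seg_feat):
        fused = torch.cat([det_feat, seg_feat], dim=1)   # second concatenated feature
        # pooling over the spatial dimensions -> channel attention vector (N x 2C x 1 x 1)
        ch_att = torch.sigmoid(self.channel_fc(fused.mean(dim=(2, 3), keepdim=True)))
        # pooling over the channel dimension -> spatial attention map (N x 1 x H x W)
        sp_att = torch.sigmoid(self.spatial_fc(fused.mean(dim=1, keepdim=True)))
        # reweight, merge by element-wise addition, then split back per task
        shared = fused * ch_att + fused * sp_att
        det_out, seg_out = torch.chunk(shared, 2, dim=1)
        return det_out, seg_out
```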
The depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640x480.
Regarding the fusion modules, referring to Fig. 1, the input ends of the first three fusion modules are connected to the output ends of the two preceding stages; their outputs are divided into two paths, added to the outputs of the two preceding stages, and then sent into the two following stages. That is:
A fusion module is provided between the two second stages; its input end is connected to the output ends of the two second stages, and its output is divided into two paths which are added to the outputs of the two second stages respectively and then sent into the two third stages.
A fusion module is also provided between the two third stages; its input end is connected to the output ends of the two third stages, and its output is divided into two paths which are added to the outputs of the two third stages respectively and then sent into the two fourth stages.
A fusion module is also provided between the two fourth stages; its input end is connected to the output ends of the two fourth stages, and its output is divided into two paths which are added to the outputs of the two fourth stages respectively and then sent into the two fifth stages.
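The stage-by-stage wiring can be summarised in code as follows. The sketch reuses the MultiScaleFusion example above; treating "the output is divided into two paths" as splitting the 2C-channel fused feature into two C-channel halves (one added to each backbone stream) is our interpretation, and the depth image is assumed to be replicated to three channels so that a standard ResNet50 stem can consume it.

```python
import torch
import torch.nn as nn
import torchvision

class DualResNet50Encoder(nn.Module):
    """Illustrative sketch of the encoding module: two ResNet50 backbones,
    one for the color image and one for the depth image, with a fusion
    module after each of stages 2-5. The outputs of the first three fusion
    modules are added back to both streams; the fourth fusion module
    produces the multimodal fusion feature map sent to the decoder."""

    def __init__(self):
        super().__init__()
        def make_stages():
            r = torchvision.models.resnet50(weights=None)
            layer0 = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)  # stage 1
            return nn.ModuleList([layer0, r.layer1, r.layer2, r.layer3, r.layer4])
        self.rgb_stages = make_stages()
        self.depth_stages = make_stages()
        # one fusion module per pair of stages 2-5 (per-stream ResNet50 channel widths)
        self.fusions = nn.ModuleList([MultiScaleFusion(c) for c in (256, 512, 1024, 2048)])

    def forward(self, rgb, depth):
        a = self.rgb_stages[0](rgb)      # stage 1 of the color stream
        b = self.depth_stages[0](depth)  # stage 1 of the depth stream
        fused = None
        for i in range(1, 5):            # stages 2-5
            a = self.rgb_stages[i](a)
            b = self.depth_stages[i](b)
            fused = self.fusions[i - 1](a, b)       # N x 2C x H x W
            if i < 4:                               # first three fusion modules only
                fa, fb = torch.chunk(fused, 2, dim=1)
                a, b = a + fa, b + fb               # added back, sent to the next stages
        return fused                                # multimodal fusion feature map
```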
During network training, the multi-modal multi-task workshop target recognition network is trained using the acquired ground-truth target detection labels and instance segmentation labels. Because a multi-task learning method is adopted, two types of loss values are produced, one for target detection and one for image segmentation. At the same time, because of the differences between the tasks, the prediction output of each task exhibits homoscedastic uncertainty. For this reason, a multi-task learning loss function is used to simultaneously learn regression and classification problems of different scales and quantities.
The multi-task joint loss function is defined to satisfy the following formula.
L(W, σ1, σ2) denotes the joint loss function of the two tasks, where L1(W) = ||y1 - f_W(x)||^2 denotes the loss value of the regression task and L2(W) = -logSoftmax(y2, f_W(x)) denotes the loss of the classification task; y1 and y2 are the true label values, f_W(x) is the network prediction value, and σ1 and σ2 are the noise scalars output by the two task branches respectively.
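The combined form of the joint loss is not reproduced in this text. As an illustration, the sketch below pairs the two task losses defined above with the standard homoscedastic-uncertainty weighting (learning log σ² for numerical stability); the exact weighting terms are therefore an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskLoss(nn.Module):
    """Illustrative sketch of an uncertainty-weighted joint loss for the
    regression (detection) and classification tasks. L1(W) = ||y1 - f_W(x)||^2
    and L2(W) = -logSoftmax(y2, f_W(x)) follow the definitions in the text;
    the way sigma_1 and sigma_2 weight them is an assumption based on the
    usual homoscedastic-uncertainty formulation."""

    def __init__(self):
        super().__init__()
        self.log_var1 = nn.Parameter(torch.zeros(1))  # log(sigma_1^2)
        self.log_var2 = nn.Parameter(torch.zeros(1))  # log(sigma_2^2)

    def forward(self, box_pred, box_true, cls_logits, cls_true):
        l1 = torch.mean((box_true - box_pred) ** 2)   # regression loss L1(W)
        l2 = F.cross_entropy(cls_logits, cls_true)    # classification loss L2(W)
        return (0.5 * torch.exp(-self.log_var1) * l1 + 0.5 * self.log_var1
                + torch.exp(-self.log_var2) * l2 + 0.5 * self.log_var2)
```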
本发明的大致流程为:The general process of the present invention is:
对一数据样本,我们将彩色图片标记为图片A,深度图片标记为图片B,图片A送入一个ResNet50主干网络,图片B送入另一个ResNet50主干网络中,参见图1,别经第一阶段、第二阶段后,输出图片A对应的第二特征向量A2,和图片B对应的第二特征向量B2;For a data sample, we mark the color image as image A and the depth image as image B. Image A is sent to a ResNet50 backbone network, and image B is sent to another ResNet50 backbone network. See Figure 1. After the first and second stages, the second eigenvector A2 corresponding to image A and the second eigenvector B2 corresponding to image B are output;
A2、B2送入融合模块中进行处理,经步骤(2.1)-(2.6),得到融合后特征;将融合后特征分为两路,分别与A2、B2进行加和后,得到两个加和后的特征A2’、B2’送入下一阶段,也就是两个ResNet50主干网络的第三阶段中。A2 and B2 are sent to the fusion module for processing, and after steps (2.1)-(2.6), the fused features are obtained; the fused features are divided into two paths, and added with A2 and B2 respectively, and the two added features A2’ and B2’ are obtained and sent to the next stage, that is, the third stage of the two ResNet50 backbone networks.
同理,两个第三阶段、第四阶段、第五阶段间都设有融合模块,按照上述融合模块的操作流程,最终经第四个融合模块,输出一融合特征图,我们称之为多模态融合特征图;此时编码模块工作结束;Similarly, there are fusion modules between the two third stages, the fourth stage, and the fifth stage. According to the operation process of the above fusion modules, the fourth fusion module finally outputs a fusion feature map, which we call a multimodal fusion feature map; at this time, the encoding module is finished;
The multimodal fusion feature map is split into two paths, which are fed into the first decoding branch and the second decoding branch of the decoding module. The first decoding branch outputs the target detection result and the second decoding branch outputs the instance segmentation result. In addition, feature sharing modules are placed between the two branches. The feature sharing module between the two first decoding layers works as follows: the target detection feature and the instance segmentation feature are first concatenated along the channel dimension, and feature pooling is applied to the concatenated feature in the spatial dimension and in the channel dimension to obtain a spatial attention vector and a channel attention vector. The two attention vectors are then used to emphasize the representative parts of the feature at the spatial level and the channel level respectively, while suppressing noise. Finally, the channel-level and spatial-level emphasized features are merged by element-wise addition and split by task to complete the sharing of the two features. The present invention places three feature sharing modules in the decoding module, so feature sharing is performed once between each pair of decoding layers.
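A minimal PyTorch sketch of such a feature sharing module is shown below; the text does not specify how the pooled statistics are turned into attention vectors, so the small convolutions, the sigmoid activations, and the mapping of "spatial pooling" and "channel pooling" onto mean operations are assumptions.

```python
import torch
import torch.nn as nn

class FeatureSharingModule(nn.Module):
    """Share information between the detection and segmentation branches:
    concatenate along channels, derive channel and spatial attention vectors
    by pooling, emphasize the concatenated feature with each, merge the two
    emphasized features by element-wise addition, and split back per task."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        c = 2 * channels
        # Channel attention from spatially pooled statistics (assumed layers).
        self.channel_attn = nn.Sequential(
            nn.Conv2d(c, c // reduction, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, kernel_size=1), nn.Sigmoid())
        # Spatial attention from channel-wise pooled statistics (assumed layers).
        self.spatial_attn = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, det_feat: torch.Tensor, seg_feat: torch.Tensor):
        x = torch.cat([det_feat, seg_feat], dim=1)                 # concat along channels
        ca = self.channel_attn(x.mean(dim=(2, 3), keepdim=True))   # channel attention vector
        sa = self.spatial_attn(x.mean(dim=1, keepdim=True))        # spatial attention vector
        shared = x * ca + x * sa                                   # emphasize, then element-wise sum
        det_out, seg_out = torch.chunk(shared, 2, dim=1)           # split back by task
        return det_out, seg_out
```

One such module would sit between the two first decoding layers, one between the two second decoding layers, and one between the two third decoding layers, as stated above.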
During training, the ground-truth boxes for target detection and the ground-truth instance masks for instance segmentation are used as the expected outputs, and the predicted boxes and predicted instance masks are corrected against them.
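The sketch below shows how the two task losses could be combined with learnable uncertainty weights at this training step; it follows the common log-variance parameterization and is not an implementation prescribed by the original.

```python
import torch
import torch.nn as nn

class MultiTaskUncertaintyLoss(nn.Module):
    """Combine the detection (regression) loss and the segmentation
    (classification) loss with learnable homoscedastic-uncertainty weights."""

    def __init__(self):
        super().__init__()
        # Learn log(sigma^2) for each task for numerical stability.
        self.log_var_det = nn.Parameter(torch.zeros(()))
        self.log_var_seg = nn.Parameter(torch.zeros(()))

    def forward(self, det_loss: torch.Tensor, seg_loss: torch.Tensor) -> torch.Tensor:
        weighted_det = 0.5 * torch.exp(-self.log_var_det) * det_loss + 0.5 * self.log_var_det
        weighted_seg = torch.exp(-self.log_var_seg) * seg_loss + 0.5 * self.log_var_seg
        return weighted_det + weighted_seg
```

At each training step, the detection and segmentation losses computed from the two branches are passed to this module and the returned scalar is back-propagated together with the network parameters.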
The present invention achieves good recognition accuracy for similarly colored targets in workshop scenes and can perform instance segmentation and target detection in the same scene; in workshop scenes, the target detection task reaches an accuracy of 87% and the instance segmentation task reaches an accuracy of 81%.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (4)
- A multimodal multi-task workshop target recognition method, characterized by comprising the following steps:
(1) Constructing a sample data set:
a depth camera is used to photograph the workshop scene, each shot yielding a corresponding pair of a color image and a depth image;
the categories of targets are determined, the categories of targets including workers, lathes, and material transport robots;
annotation at the target detection level and the instance segmentation level is performed on the color images and depth images to obtain ground-truth boxes for target detection and ground-truth instance masks for instance segmentation;
an annotated pair of a color image and a depth image is taken as one data sample;
(2) Constructing a multimodal multi-task workshop target recognition network:
the multimodal multi-task workshop target recognition network comprises an encoding module and a decoding module;
the encoding module comprises two ResNet50 backbone networks; each ResNet50 backbone network is divided, from input end to output end, into five stages, namely a first stage to a fifth stage, which correspondingly output a first feature vector to a fifth feature vector;
the two ResNet50 backbone networks receive the annotated color image and the annotated depth image respectively; a fusion module is provided between the two second stages, between the two third stages, between the two fourth stages, and between the two fifth stages, namely a first to a fourth fusion module from front to back; for each of the first three fusion modules, the input end is connected to the output ends of the two preceding stages, and the output is split into two paths which are added to the outputs of the two preceding stages respectively and then sent to the two following stages;
the fourth fusion module has its input end connected to the output ends of the two fifth stages, and its output is split into two paths and sent to the decoding module;
the fusion module is used to perform feature fusion on the two input feature vectors and output the result;
the decoding module is used to perform target detection and instance segmentation on the output of the encoding module and to output a target detection result and an instance segmentation result;
(3) Training the multimodal multi-task workshop target recognition network:
the data samples in the sample data set are input into the multimodal multi-task workshop target recognition network to perform target detection and instance segmentation; in the first decoding branch, the ground-truth box corresponding to the target in the data sample is taken as the expected output, and in the second decoding branch, the ground-truth instance mask of the instance segmentation in the data sample is taken as the expected output; training proceeds until the model converges;
(4) Task recognition with the multimodal multi-task workshop target recognition network:
a pair of a color image and a depth image to be tested in the workshop is acquired and fed into the multimodal multi-task workshop target recognition network, which outputs the predicted target box and the predicted instance mask of the targets therein.
- The multimodal multi-task workshop target recognition method according to claim 1, characterized in that the fusion method of the fusion module is:
(2.1) the color image is RGB^(C×H×W) and the depth image is Depth^(C×H×W), where C, H, and W are the number of channels, the height, and the width of the corresponding image respectively;
(2.2) the color image and the depth image are concatenated along the channel dimension to generate a first concatenated feature RGBD^(2C×H×W), which is then split along the channel dimension into S sub-feature blocks, denoted X_0 to X_(S-1), the dimension of each sub-feature block being (2C/S)×H×W;
(2.3) a convolution with a different kernel size is applied to each sub-feature block to obtain a sub-feature vector, where the i-th sub-feature block X_i is convolved according to the following formula:
F_i = Conv_i(X_i),
where F_i is the sub-feature vector corresponding to X_i, Conv denotes a convolution operation, and i = 0 to S−1;
(2.4) a global average pooling operation is performed on the S sub-feature vectors, reducing them to a size of S×1×1 and yielding S weight vectors;
(2.5) the S weight vectors are each normalized to obtain S attention vectors;
(2.6) finally, the obtained attention vectors are multiplied element-wise with the first concatenated feature RGBD to obtain the fused feature.
- The multimodal multi-task workshop target recognition method according to claim 1, characterized in that the decoding module comprises a first decoding branch for target detection and a second decoding branch for instance segmentation;
the first decoding branch comprises a first decoding layer, a second decoding layer, a third decoding layer, and a target detection head arranged in sequence, wherein the first three layers perform upsampling operations and each outputs a target detection feature, the size of the target detection feature output by the third decoding layer being 1/4 of the color image; the target detection head first upsamples the output of the third decoding layer and then predicts the bounding boxes to obtain the target detection result;
the second decoding branch comprises a first decoding layer, a second decoding layer, a third decoding layer, and an instance segmentation head arranged in sequence, wherein the first three layers perform upsampling operations and each outputs an instance segmentation feature, the size of the instance segmentation feature output by the third decoding layer being 1/4 of the color image; the instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result;
a feature sharing module is provided between the two first decoding layers;
the feature sharing module receives a target detection feature and an instance segmentation feature, concatenates them along the channel dimension to obtain a second concatenated feature, and then splits the second concatenated feature into two paths, one of which is spatially pooled to obtain a spatial attention vector and the other of which is channel pooled to obtain a channel attention vector;
the spatial attention vector and the channel attention vector are each multiplied element-wise with the second concatenated feature, and the results of the element-wise products are added to obtain a processed second concatenated feature;
the processed second concatenated feature is split along the channel dimension to obtain a processed target detection feature and a processed instance segmentation feature, which are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and are then passed by the two first decoding layers to the next step of the process;
feature sharing modules are likewise provided between the two second decoding layers and between the two third decoding layers.
- The multimodal multi-task workshop target recognition method according to claim 1, characterized in that the depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640×480.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/133437 WO2024108377A1 (en) | 2022-11-22 | 2022-11-22 | Multimodal multi-task workshop target recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024108377A1 true WO2024108377A1 (en) | 2024-05-30 |
Family
ID=91194928
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/133437 WO2024108377A1 (en) | 2022-11-22 | 2022-11-22 | Multimodal multi-task workshop target recognition method |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024108377A1 (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220254150A1 (en) * | 2021-02-05 | 2022-08-11 | Salesforce.Com, Inc. | Exceeding the limits of visual-linguistic multi-task learning |
CN112966644A (en) * | 2021-03-24 | 2021-06-15 | 中国科学院计算技术研究所 | Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof |
CN114494276A (en) * | 2022-04-18 | 2022-05-13 | 成都理工大学 | Two-stage multi-modal three-dimensional instance segmentation method |
CN114821014A (en) * | 2022-05-17 | 2022-07-29 | 湖南大学 | Multi-mode and counterstudy-based multi-task target detection and identification method and device |
Non-Patent Citations (1)
Title |
---|
TANG ZAIZUO, CHEN GUANGZHU; HAN YINHE; LIAO XIAOJUAN; RU QINGJUN; WU YUANYUAN: "Bi-stage multi-modal 3D instance segmentation method for production workshop scene", ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE., PINERIDGE PRESS, SWANSEA., GB, vol. 112, 1 June 2022 (2022-06-01), GB , pages 104858, XP093173111, ISSN: 0952-1976, DOI: 10.1016/j.engappai.2022.104858 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Thakkar et al. | Part-based graph convolutional network for action recognition | |
CN109241895B (en) | Dense crowd counting method and device | |
WO2020228446A1 (en) | Model training method and apparatus, and terminal and storage medium | |
US11328172B2 (en) | Method for fine-grained sketch-based scene image retrieval | |
CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
CN110059598B (en) | Long-term fast-slow network fusion behavior identification method based on attitude joint points | |
US20170032222A1 (en) | Cross-trained convolutional neural networks using multimodal images | |
CN110674741A (en) | Machine vision gesture recognition method based on dual-channel feature fusion | |
WO2023185494A1 (en) | Point cloud data identification method and apparatus, electronic device, and storage medium | |
WO2022242122A1 (en) | Video optimization method and apparatus, terminal device, and storage medium | |
WO2022152104A1 (en) | Action recognition model training method and device, and action recognition method and device | |
CN108133235A (en) | A kind of pedestrian detection method based on neural network Analysis On Multi-scale Features figure | |
CN110852295A (en) | Video behavior identification method based on multitask supervised learning | |
CN110310305A (en) | A kind of method for tracking target and device based on BSSD detection and Kalman filtering | |
CN107948586A (en) | Trans-regional moving target detecting method and device based on video-splicing | |
WO2019117393A1 (en) | Learning apparatus and method for depth information generation, depth information generation apparatus and method, and recording medium related thereto | |
CN114863407A (en) | Multi-task cold start target detection method based on visual language depth fusion | |
CN113920378B (en) | Bupleurum seed identification method based on attention mechanism | |
CN111476190A (en) | Target detection method, apparatus and storage medium for unmanned driving | |
CN115049833A (en) | Point cloud component segmentation method based on local feature enhancement and similarity measurement | |
CN107274425A (en) | A kind of color image segmentation method and device based on Pulse Coupled Neural Network | |
WO2024108377A1 (en) | Multimodal multi-task workshop target recognition method | |
CN116468902A (en) | Image processing method, device and non-volatile computer readable storage medium | |
CN115222768A (en) | Method and device for positioning tracking object in video, electronic equipment and storage medium | |
CN115457365A (en) | Model interpretation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22966077; Country of ref document: EP; Kind code of ref document: A1 |