WO2024108377A1 - Multimodal multi-task workshop target recognition method - Google Patents
Multimodal multi-task workshop target recognition method
- Publication number
- WO2024108377A1 (PCT application PCT/CN2022/133437)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- decoding
- task
- workshop
- output
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 25
- 238000001514 detection method Methods 0.000 claims abstract description 61
- 230000011218 segmentation Effects 0.000 claims abstract description 59
- 230000004927 fusion Effects 0.000 claims abstract description 46
- 238000012549 training Methods 0.000 claims abstract description 6
- 239000013598 vector Substances 0.000 claims description 46
- 230000008569 process Effects 0.000 claims description 10
- 238000011176 pooling Methods 0.000 claims description 5
- 239000000463 material Substances 0.000 claims description 3
- 238000007500 overflow downdraw method Methods 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 239000003086 colorant Substances 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 5
- 238000010586 diagram Methods 0.000 description 3
- 238000000605 extraction Methods 0.000 description 2
- 238000003709 image segmentation Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000008570 general process Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- the invention relates to an image processing method, and in particular to a multi-modal multi-task workshop target recognition method.
- the existing workshop scene target recognition network mainly adopts the form of a single backbone network, which uses the backbone network to extract features from RGB images and passes them into the decoding network to predict the final result. Its structure is shown in Figure 1 of the attached drawings of the specification. Therefore, the existing workshop scene target recognition technology mainly adopts a single task mode, which uses the features passed in by the backbone network to perform single task reasoning.
- the existing workshop scene target recognition technology mainly adopts a single modality, that is, only RGB modal features are used for scene target recognition.
- the purpose of the present invention is to provide a multimodal and multi-task workshop target recognition method that solves the above-mentioned problems, accurately identifies targets with similar colors in workshop scenes, and performs the target detection task and the instance segmentation task in parallel in workshop scenes.
- a multi-modal multi-task workshop target recognition method comprising the following steps
- Determining categories of targets wherein the categories of targets include workers, lathes, and material transport robots;
- a set of annotated color images and depth images is used as a data sample
- the multi-modal multi-task workshop target recognition network includes an encoding module and a decoding module
- the encoding module includes two ResNet50 backbone networks, and each ResNet50 backbone network is divided into five stages from the input end to the output end, namely the first stage to the fifth stage, which correspondingly output the first feature vector to the fifth feature vector;
- Two ResNet50 backbone networks input the annotated color image and the annotated depth image respectively.
- a fusion module is set between the two second stages, the two third stages, the two fourth stages, and the two fifth stages. From front to back, they are the first to the fourth fusion modules. Among them, the input ends of the first three fusion modules are connected to the output ends of the two previous stages, and the output ends are divided into two paths. After being added with the output ends of the two previous stages, they are sent to the two next stages.
- the fourth fusion module has its input connected to the outputs of the two fifth stages, and its output is divided into two paths and sent to the decoding module;
- the fusion module is used to perform feature fusion on the two input feature vectors and output them;
- the decoding module is used to perform target detection and instance segmentation on the output of the encoding module, and output target detection results and instance segmentation results;
- a set of color images and depth images to be tested in the workshop are obtained and sent to the multimodal multi-task workshop target recognition network, and the target prediction box and predicted instance mask corresponding to the target are output respectively.
- the fusion method of the fusion module is:
- the color image is RGB C×H×W
- the depth image is Depth C×H×W
- C, H, and W are the number of channels, height, and width of the corresponding image respectively;
- Fi is the sub-feature vector corresponding to Xi
- the decoding module comprises a first decoding branch for target detection and a second decoding branch for instance segmentation;
- the first decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and a target detection head arranged in sequence, wherein the first three layers each perform an upsampling operation and each output a target detection feature, the target detection feature output by the third decoding layer is 1/4 the size of the color image, and the target detection head first upsamples the output of the third decoding layer and then predicts the prediction boxes to obtain the target detection result;
- the second decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and an instance segmentation head arranged in sequence, wherein the first three layers each perform an upsampling operation and each output an instance segmentation feature, the instance segmentation feature output by the third decoding layer is 1/4 the size of the color image, and the instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result;
- a feature sharing module is provided between the two first decoding layers
- the feature sharing module is used to input the target detection feature and the instance segmentation feature, splice them according to the channel dimension to obtain a second splicing feature, and then divide the second splicing feature into two paths, one path is spatially pooled to obtain a spatial attention vector, and the other path is channel pooled to obtain a channel attention vector;
- the spatial attention vector and the channel attention vector are respectively element-wise multiplied with the second splicing feature, and then the results of the element-wise product are added to obtain the processed second splicing feature;
- the processed second concatenated features are split according to the channel dimension to obtain processed target detection features and processed instance segmentation features, and are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and then sent to the next process by the two first decoding layers;
- a feature sharing module is also provided between the two second decoding layers and between the two third decoding layers.
- the depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640x480.
- the Resnet50 backbone network consists of five stages, namely the first stage to the fifth stage, also known as layer0, layer1, layer2, layer3, layer4.
- the first stage is layer0, which does not contain residual blocks. It mainly performs convolution, regularization, activation function, and maximum pooling calculations on the input, while the remaining four stages all contain residual blocks.
- Each stage will correspond to the output feature map or feature vector.
- the present invention adopts two Resnet50 backbone networks and adds four fusion modules between them. This is because target scales in the workshop scene differ greatly: using convolution kernels of a single size on the feature map may ignore the details of small targets, or the receptive field of the kernel may fail to capture all the information of large targets. The design therefore draws on the idea of the ESPANet network: the input features are split into multiple sub-feature blocks along the channel dimension, and convolution kernels of different sizes are then applied to these sub-feature blocks for feature extraction to obtain the attention vectors.
- in the encoding module, two Resnet50 backbone networks are used, and four fusion modules are added.
- the second to fifth stages of the Resnet50 backbone network are the four downsampling stages. After feature extraction in each of these stages, the two extracted features are fused and corrected by the fusion module. After the color-image features and depth-image features are input, this module uses channel attention to highlight the representative features of each modality while suppressing the noise contained in the data.
- the fusion module splits the input features into multiple sub-feature blocks according to the channel dimension, and then uses convolution kernels of different sizes to extract features from these sub-feature blocks, thereby having better adaptability to multi-scale targets.
- the two branches are inferred in parallel to achieve the target detection task and instance segmentation task of the scene target at the same time.
- the target detection features and instance segmentation features are passed to the feature sharing module at the first decoding layer, the second decoding layer, and the third decoding layer, respectively, to achieve mutual complementary optimization between tasks.
- the decoding module also sets up three feature sharing modules between the first decoding branch and the second decoding branch.
- the target detection features and instance segmentation features are first concatenated according to the channel dimension, and the concatenated features are pooled in the spatial dimension and the channel dimension to obtain the spatial attention vector and the channel attention vector.
- the spatial attention vector and the channel attention vector are used to highlight the representative features at the spatial level and the channel level, respectively, while suppressing the noise.
- the highlighted features at the channel level and the spatial level are merged by element-wise addition, and split according to the modality to complete the sharing of the two features.
- the present invention proposes a new backbone network, uses the attention mechanism for feature fusion, proposes a multi-task network for instance segmentation and target detection at the same time, and designs a feature sharing module to realize information sharing between the target detection decoding branch and the instance segmentation decoding branch.
- the present invention has good recognition accuracy for color-similar targets in workshop scenes, can realize instance segmentation and target detection in the same scene, and the accuracy rate of target detection tasks in workshop scenes reaches 87%, and the accuracy rate of instance segmentation tasks reaches 81%.
- Fig. 1 is a flow chart of the present invention
- Figure 2 is a schematic diagram of a fusion module
- Fig. 3 is a schematic diagram of a decoding module
- FIG4 is a schematic diagram of a feature sharing module.
- Embodiment 1 Referring to FIG. 1 to FIG. 4 , a multi-modal multi-task workshop target recognition method comprises the following steps;
- Determining categories of targets wherein the categories of targets include workers, lathes, and material transport robots;
- a set of annotated color images and depth images is used as a data sample
- the multi-modal multi-task workshop target recognition network includes an encoding module and a decoding module
- the encoding module includes two ResNet50 backbone networks, and each ResNet50 backbone network is divided into five stages from the input end to the output end, namely the first stage to the fifth stage, which correspondingly output the first feature vector to the fifth feature vector;
- Two ResNet50 backbone networks input the annotated color image and the annotated depth image respectively.
- a fusion module is set between the two second stages, the two third stages, the two fourth stages, and the two fifth stages. From front to back, they are the first to the fourth fusion modules. Among them, the input ends of the first three fusion modules are connected to the output ends of the two previous stages, and the output ends are divided into two paths. After being added with the output ends of the two previous stages, they are sent to the two next stages.
- the fourth fusion module has its input connected to the outputs of the two fifth stages, and its output is divided into two paths and sent to the decoding module;
- the fusion module is used to perform feature fusion on the two input feature vectors and output them;
- the decoding module is used to perform target detection and instance segmentation on the output of the encoding module, and output target detection results and instance segmentation results;
- a set of color images and depth images to be tested in the workshop are obtained and sent to the multimodal multi-task workshop target recognition network, and the target prediction box and predicted instance mask corresponding to the target are output respectively.
- the fusion method of the fusion module is:
- the color image is RGB C×H×W
- the depth image is Depth C×H×W
- C, H, and W are the number of channels, height, and width of the corresponding image respectively;
- Fi is the sub-feature vector corresponding to Xi
- the decoding module includes a first decoding branch for object detection and a second decoding branch for instance segmentation;
- the first decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and a target detection head arranged in sequence, wherein the first three layers each perform an upsampling operation and each output a target detection feature, the target detection feature output by the third decoding layer is 1/4 the size of the color image, and the target detection head first upsamples the output of the third decoding layer and then predicts the prediction boxes to obtain the target detection result;
- the second decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and an instance segmentation head arranged in sequence, wherein the first three layers each perform an upsampling operation and each output an instance segmentation feature, the instance segmentation feature output by the third decoding layer is 1/4 the size of the color image, and the instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result;
- a feature sharing module is provided between the two first decoding layers
- the feature sharing module is used to input the target detection feature and the instance segmentation feature, splice them according to the channel dimension to obtain a second splicing feature, and then divide the second splicing feature into two paths, one path is spatially pooled to obtain a spatial attention vector, and the other path is channel pooled to obtain a channel attention vector;
- the spatial attention vector and the channel attention vector are respectively element-wise multiplied with the second splicing feature, and then the results of the element-wise product are added to obtain the processed second splicing feature;
- the processed second concatenated features are split according to the channel dimension to obtain processed target detection features and processed instance segmentation features, and are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and then sent to the next process by the two first decoding layers;
- a feature sharing module is also provided between the two second decoding layers and between the two third decoding layers.
- the depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640x480.
- the input end is connected to the output end of the two previous stages, and the output end is divided into two paths, which are added to the output ends of the two previous stages and then sent to the two next stages.
- a fusion module is provided between the two second stages.
- the input end of the fusion module is connected to the output ends of the two second stages.
- the output end of the fusion module is divided into two paths, which are added to the output ends of the two second stages and then sent to the two third stages.
- a fusion module is also provided between the two third stages.
- the input end of the fusion module is connected to the output ends of the two third stages.
- the output end of the fusion module is divided into two paths, which are respectively added with the output ends of the two third stages and then sent to the two fourth stages.
- a fusion module is also provided between the two fourth stages.
- the input end of the fusion module is connected to the output ends of the two fourth stages.
- the output end of the fusion module is divided into two paths, which are respectively added with the output ends of the two fourth stages and then sent to the two fifth stages.
- the multimodal multi-task workshop target recognition network is trained using the acquired real target detection labels and instance segmentation labels. Due to the use of multi-task learning methods, two types of loss values, target detection and image segmentation, will be generated. At the same time, due to the differences between tasks, the prediction outputs of each task will have the characteristics of homoscedastic uncertainty. For this reason, a multi-task learning loss function method is used to simultaneously learn regression and classification problems of different scales and quantities.
- the multi-task joint loss function is defined to satisfy the following formula.
- L(W, σ1, σ2) represents the joint loss function of the two tasks, where L1(W) = ||y1 - f_W(x)||^2 represents the loss value of the regression task and L2(W) = -logSoftmax(y2, f_W(x)) represents the loss of the classification task, where y1, y2 are the true label values, f_W(x) is the network prediction value, and σ1, σ2 are the noise scalars output by the two task branches respectively.
- the general process of the present invention is:
- Image A is sent to a ResNet50 backbone network
- image B is sent to another ResNet50 backbone network. See Figure 1.
- the second feature vector A2 corresponding to image A and the second feature vector B2 corresponding to image B are output;
- A2 and B2 are sent to the fusion module for processing, and after steps (2.1)-(2.6), the fused features are obtained; the fused features are divided into two paths, and added with A2 and B2 respectively, and the two added features A2’ and B2’ are obtained and sent to the next stage, that is, the third stage of the two ResNet50 backbone networks.
- the fourth fusion module finally outputs a fusion feature map, which we call a multimodal fusion feature map; at this time, the encoding module is finished;
- the multimodal fusion feature map is divided into two paths, which are respectively sent to the first decoding branch and the second decoding branch of the decoding module.
- the target detection result is output through the first decoding branch
- the instance segmentation result is output through the second decoding branch.
- a feature sharing module between the two branches.
- the workflow of the feature sharing module between the two first decoding layers is as follows: the target detection feature and the instance segmentation feature are first spliced according to the channel dimension, and the spliced features are subjected to feature pooling operations in the spatial dimension and the channel dimension respectively to obtain the spatial attention vector and the channel attention vector.
- the spatial attention vector and the channel attention vector are used to highlight the representative features of the features at the spatial level and the channel level, respectively, while suppressing the noise.
- the highlighted features at the channel level and the spatial level are merged by element addition, and split according to the modality to complete the sharing of the two features.
- the present invention sets three feature sharing modules in the decoding module, and feature sharing is performed once between each decoding layer.
- the present invention has good recognition accuracy for targets with similar colors in workshop scenes, can realize instance segmentation and target detection in the same scene, and the accuracy rate of target detection tasks in workshop scenes reaches 87%, and the accuracy rate of instance segmentation tasks reaches 81%.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
Disclosed is a multimodal multi-task workshop target recognition method, comprising: constructing a sample data set, wherein a data sample therein contains a set consisting of a color image and a depth image, and annotation is performed at a target detection level and an instance segmentation level; constructing a multimodal multi-task workshop target recognition network; training the multimodal multi-task workshop target recognition network; and performing task recognition using the multimodal multi-task workshop target recognition network. In the multimodal multi-task workshop target recognition network constructed in the present invention, an encoding portion uses two ResNet50 backbone networks, and four fusion modules are configured between the two ResNet50 backbone networks, and a decoding portion uses two task branches, and three feature sharing modules are configured between the two task branches. The present invention achieves good recognition accuracy for targets having similar colors in a workshop scenario, and enables instance segmentation and target detection in the same scenario.
Description
The invention relates to an image processing method, and in particular to a multi-modal multi-task workshop target recognition method.
The existing workshop scene target recognition network mainly adopts the form of a single backbone network, which uses the backbone network to extract features from RGB images and passes them into the decoding network to predict the final result. Its structure is shown in Figure 1 of the attached drawings of the specification. The existing workshop scene target recognition technology therefore mainly adopts a single-task mode, which uses the features passed in by the backbone network to perform single-task reasoning.
This has the following defects:
1. The existing workshop scene target recognition technology mainly uses a single modality, that is, only RGB modal features are used for scene target recognition. However, workshop scenes contain a large number of machine tool targets with similar colors and shapes, which are difficult to distinguish for a recognition network that relies on the RGB modality alone.
2. The existing workshop scene target recognition technology mainly adopts a single-task mode and cannot perform the target detection and instance segmentation tasks in the workshop scene at the same time. If the two tasks must run simultaneously, two networks have to be inferred at the same time, which is difficult to satisfy under the low computing resource conditions of a workshop.
Summary of the invention
The purpose of the present invention is to provide a multi-modal multi-task workshop target recognition method that solves the above-mentioned problems, accurately identifies targets with similar colors in workshop scenes, and performs the target detection task and the instance segmentation task in parallel in workshop scenes.
In order to achieve the above object, the technical solution adopted by the present invention is as follows: a multi-modal multi-task workshop target recognition method, comprising the following steps:
(1) Construct a sample data set.
A depth camera is used to capture the workshop site, and each capture produces a corresponding pair of color and depth images.
The categories of targets are determined; the categories of targets include workers, lathes and material transport robots.
The color images and depth images are annotated at the target detection level and the instance segmentation level to obtain the ground-truth boxes for target detection and the ground-truth instance masks for instance segmentation.
An annotated pair of color and depth images is used as one data sample.
(2) Construct the multi-modal multi-task workshop target recognition network.
The multi-modal multi-task workshop target recognition network includes an encoding module and a decoding module.
The encoding module includes two ResNet50 backbone networks; each ResNet50 backbone network is divided into five stages from the input end to the output end, namely the first stage to the fifth stage, which correspondingly output the first feature vector to the fifth feature vector.
The two ResNet50 backbone networks take the annotated color image and the annotated depth image as input, respectively. A fusion module is set between the two second stages, the two third stages, the two fourth stages and the two fifth stages, namely the first to the fourth fusion modules from front to back. The input ends of the first three fusion modules are connected to the output ends of the two preceding stages; their outputs are divided into two paths, added to the outputs of the two preceding stages, and then sent into the two following stages.
The input end of the fourth fusion module is connected to the output ends of the two fifth stages, and its output is divided into two paths and sent into the decoding module.
The fusion module is used to perform feature fusion on the two input feature vectors and output the result.
The decoding module is used to perform target detection and instance segmentation on the output of the encoding module and to output the target detection result and the instance segmentation result.
(3) Train the multi-modal multi-task workshop target recognition network.
The data samples in the sample data set are input into the multi-modal multi-task workshop target recognition network for target detection and instance segmentation. In the first decoding branch, the ground-truth box corresponding to the target in the data sample is taken as the expected output; in the second decoding branch, the ground-truth instance mask of the data sample is taken as the expected output. Training continues until the model converges.
(4) Task recognition with the multi-modal multi-task workshop target recognition network.
A pair of color and depth images to be tested in the workshop is acquired and sent into the multi-modal multi-task workshop target recognition network, which outputs the target prediction boxes and the predicted instance masks of the targets.
Preferably, the fusion method of the fusion module is as follows:
(2.1) The color image is RGB C×H×W and the depth image is Depth C×H×W, where C, H and W are the number of channels, the height and the width of the corresponding image respectively.
(2.2) The color image and the depth image are concatenated along the channel dimension to generate the first concatenated feature RGBD 2C×H×W, which is then split into S sub-feature blocks along the channel dimension, denoted X_0 to X_{S-1}.
(2.3) A convolution with a different kernel size is applied to each sub-feature block to obtain the sub-feature vectors, where the i-th sub-feature block X_i is convolved according to the following formula:
F_i = Conv_i(X_i),
where F_i is the sub-feature vector corresponding to X_i, Conv denotes the convolution operation, and i = 0 to S-1.
(2.4) A global average pooling operation is performed on the S sub-feature vectors, reducing them to a size of S×1×1 and obtaining S weight vectors.
(2.5) The S weight vectors are normalized respectively to obtain S attention vectors.
(2.6) Finally, the obtained attention vectors are multiplied element-wise with the first concatenated feature RGBD to obtain the fused feature.
Preferably, the decoding module includes a first decoding branch for target detection and a second decoding branch for instance segmentation.
The first decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and a target detection head arranged in sequence. The first three layers each perform an upsampling operation and each output a target detection feature, and the target detection feature output by the third decoding layer is 1/4 the size of the color image. The target detection head first upsamples the output of the third decoding layer and then predicts the prediction boxes to obtain the target detection result.
The second decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and an instance segmentation head arranged in sequence. The first three layers each perform an upsampling operation and each output an instance segmentation feature, and the instance segmentation feature output by the third decoding layer is 1/4 the size of the color image. The instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result.
A feature sharing module is provided between the two first decoding layers.
The feature sharing module takes the target detection feature and the instance segmentation feature as input and concatenates them along the channel dimension to obtain the second concatenated feature. The second concatenated feature is then divided into two paths: one path is spatially pooled to obtain a spatial attention vector, and the other path is channel pooled to obtain a channel attention vector.
The spatial attention vector and the channel attention vector are each multiplied element-wise with the second concatenated feature, and the two element-wise products are added to obtain the processed second concatenated feature.
The processed second concatenated feature is split along the channel dimension to obtain the processed target detection feature and the processed instance segmentation feature, which are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and then passed on to the next stage by the two first decoding layers.
Feature sharing modules are also provided between the two second decoding layers and between the two third decoding layers.
Preferably, the depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640x480.
About the Resnet50 backbone network: the Resnet50 backbone network consists of five stages, namely the first stage to the fifth stage, also known as layer0, layer1, layer2, layer3 and layer4. The first stage, layer0, does not contain residual blocks; it mainly applies convolution, regularization, an activation function and max pooling to the input, while the remaining four stages all contain residual blocks. Each stage outputs a corresponding feature map or feature vector.
In the encoding module, the present invention adopts two Resnet50 backbone networks and adds four fusion modules between them. This is because target scales in the workshop scene differ greatly: using convolution kernels of a single size on the feature map may ignore the details of small targets, or the receptive field of the kernel may fail to capture all the information of large targets. The design therefore draws on the idea of the ESPANet network: the input features are split into multiple sub-feature blocks along the channel dimension, and convolution kernels of different sizes are then applied to these sub-feature blocks for feature extraction to obtain the attention vectors.
Compared with the prior art, the advantages of the present invention are as follows.
In the encoding module, two Resnet50 backbone networks are used and four fusion modules are added. The second to fifth stages of the Resnet50 backbone network are the four downsampling stages; after feature extraction in each of these stages, the two extracted features are fused and corrected by the fusion module. After the color-image features and depth-image features are input, this module uses channel attention to highlight the representative features of each modality while suppressing the noise contained in the data.
The fusion module splits the input features into multiple sub-feature blocks along the channel dimension and then extracts features from these sub-feature blocks with convolution kernels of different sizes, which gives better adaptability to multi-scale targets.
In the decoding module, the two branches are inferred in parallel, so the target detection task and the instance segmentation task for scene targets are carried out at the same time. In the decoding stage of each branch, the target detection features and instance segmentation features are passed into the feature sharing modules at the first, second and third decoding layers, achieving complementary optimization between the tasks.
The decoding module also sets three feature sharing modules between the first decoding branch and the second decoding branch. The target detection features and instance segmentation features are first concatenated along the channel dimension, and feature pooling operations are applied to the concatenated feature in the spatial dimension and the channel dimension to obtain the spatial attention vector and the channel attention vector. These two attention vectors are then used to highlight the representative features at the spatial level and the channel level respectively while suppressing noise. Finally, the highlighted features at the channel level and the spatial level are merged by element-wise addition and split according to modality to complete the sharing of the two features.
Because a multi-task learning method is adopted, two types of loss values are produced, one for target detection and one for image segmentation. At the same time, because of the differences between the tasks, the prediction output of each task exhibits homoscedastic uncertainty. For this reason, a multi-task learning loss function is used to simultaneously learn regression and classification problems of different scales and quantities.
In summary, the present invention proposes a new backbone network, uses an attention mechanism for feature fusion, proposes a multi-task network that performs instance segmentation and target detection at the same time, and designs a feature sharing module to share information between the target detection decoding branch and the instance segmentation decoding branch. The present invention achieves good recognition accuracy for targets with similar colors in workshop scenes, realizes instance segmentation and target detection in the same scene, and reaches an accuracy of 87% for the target detection task and 81% for the instance segmentation task in workshop scenes.
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the fusion module;
Fig. 3 is a schematic diagram of the decoding module;
Fig. 4 is a schematic diagram of the feature sharing module.
The present invention will be further described below in conjunction with the accompanying drawings.
Embodiment 1: Referring to Fig. 1 to Fig. 4, a multi-modal multi-task workshop target recognition method comprises the following steps:
(1) Construct a sample data set.
A depth camera is used to capture the workshop site, and each capture produces a corresponding pair of color and depth images.
The categories of targets are determined; the categories of targets include workers, lathes and material transport robots.
The color images and depth images are annotated at the target detection level and the instance segmentation level to obtain the ground-truth boxes for target detection and the ground-truth instance masks for instance segmentation.
An annotated pair of color and depth images is used as one data sample.
(2) Construct the multi-modal multi-task workshop target recognition network.
The multi-modal multi-task workshop target recognition network includes an encoding module and a decoding module.
The encoding module includes two ResNet50 backbone networks; each ResNet50 backbone network is divided into five stages from the input end to the output end, namely the first stage to the fifth stage, which correspondingly output the first feature vector to the fifth feature vector.
The two ResNet50 backbone networks take the annotated color image and the annotated depth image as input, respectively. A fusion module is set between the two second stages, the two third stages, the two fourth stages and the two fifth stages, namely the first to the fourth fusion modules from front to back. The input ends of the first three fusion modules are connected to the output ends of the two preceding stages; their outputs are divided into two paths, added to the outputs of the two preceding stages, and then sent into the two following stages.
The input end of the fourth fusion module is connected to the output ends of the two fifth stages, and its output is divided into two paths and sent into the decoding module.
The fusion module is used to perform feature fusion on the two input feature vectors and output the result.
The decoding module is used to perform target detection and instance segmentation on the output of the encoding module and to output the target detection result and the instance segmentation result.
(3) Train the multi-modal multi-task workshop target recognition network.
The data samples in the sample data set are input into the multi-modal multi-task workshop target recognition network for target detection and instance segmentation. In the first decoding branch, the ground-truth box corresponding to the target in the data sample is taken as the expected output; in the second decoding branch, the ground-truth instance mask of the data sample is taken as the expected output. Training continues until the model converges.
(4) Task recognition with the multi-modal multi-task workshop target recognition network.
A pair of color and depth images to be tested in the workshop is acquired and sent into the multi-modal multi-task workshop target recognition network, which outputs the target prediction boxes and the predicted instance masks of the targets.
In this embodiment, the fusion method of the fusion module is as follows:
(2.1) The color image is RGB C×H×W and the depth image is Depth C×H×W, where C, H and W are the number of channels, the height and the width of the corresponding image respectively.
(2.2) The color image and the depth image are concatenated along the channel dimension to generate the first concatenated feature RGBD 2C×H×W, which is then split into S sub-feature blocks along the channel dimension, denoted X_0 to X_{S-1}.
(2.3) A convolution with a different kernel size is applied to each sub-feature block to obtain the sub-feature vectors, where the i-th sub-feature block X_i is convolved according to the following formula:
F_i = Conv_i(X_i),
where F_i is the sub-feature vector corresponding to X_i, Conv denotes the convolution operation, and i = 0 to S-1.
(2.4) A global average pooling operation is performed on the S sub-feature vectors, reducing them to a size of S×1×1 and obtaining S weight vectors.
(2.5) The S weight vectors are normalized respectively to obtain S attention vectors.
(2.6) Finally, the obtained attention vectors are multiplied element-wise with the first concatenated feature RGBD to obtain the fused feature.
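For illustration only, the following PyTorch sketch shows one possible implementation of steps (2.1)-(2.6). The module name, the choice of S = 4 and the kernel sizes 3/5/7/9 are assumptions made for the example, and softmax is used as the normalization in step (2.5); the patent text only specifies that the sub-feature blocks are convolved with kernels of different sizes and that each weight vector is normalized into an attention vector.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Illustrative sketch of the fusion module in steps (2.1)-(2.6):
    concatenate the RGB and depth features, split the result into S
    sub-feature blocks along the channel dimension, convolve each block
    with a different kernel size, turn each block into a weight vector by
    global average pooling, normalize the weight vectors into attention
    vectors, and reweight the concatenated feature element-wise."""

    def __init__(self, channels, s=4, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert (2 * channels) % s == 0, "2C must be divisible by S"
        self.s = s
        block_ch = 2 * channels // s
        # (2.3) one convolution per sub-feature block, each with its own kernel size
        self.convs = nn.ModuleList(
            [nn.Conv2d(block_ch, block_ch, k, padding=k // 2) for k in kernel_sizes[:s]]
        )
        self.gap = nn.AdaptiveAvgPool2d(1)  # (2.4) global average pooling

    def forward(self, rgb_feat, depth_feat):
        # (2.2) concatenate along the channel dimension and split into S blocks
        rgbd = torch.cat([rgb_feat, depth_feat], dim=1)            # N x 2C x H x W
        blocks = torch.chunk(rgbd, self.s, dim=1)                  # X_0 ... X_{S-1}
        feats = [conv(x) for conv, x in zip(self.convs, blocks)]   # F_i = Conv_i(X_i)
        weights = [self.gap(f) for f in feats]                     # (2.4) S weight vectors
        # (2.5) normalize each weight vector separately (softmax is an assumption)
        att = torch.cat([torch.softmax(w, dim=1) for w in weights], dim=1)  # N x 2C x 1 x 1
        # (2.6) element-wise product with the first concatenated feature
        return rgbd * att
```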
The decoding module includes a first decoding branch for target detection and a second decoding branch for instance segmentation.
The first decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and a target detection head arranged in sequence. The first three layers each perform an upsampling operation and each output a target detection feature, and the target detection feature output by the third decoding layer is 1/4 the size of the color image. The target detection head first upsamples the output of the third decoding layer and then predicts the prediction boxes to obtain the target detection result.
The second decoding branch includes a first decoding layer, a second decoding layer, a third decoding layer and an instance segmentation head arranged in sequence. The first three layers each perform an upsampling operation and each output an instance segmentation feature, and the instance segmentation feature output by the third decoding layer is 1/4 the size of the color image. The instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result.
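As an illustration of the branch structure described above, the sketch below models one decoding branch as three upsampling decoding layers followed by a task head; the layer widths, normalization layers and the form of the head are assumptions for the example, since the text only fixes the number of layers, the upsampling behaviour and the 1/4 output scale.

```python
import torch.nn as nn

class DecodeBranch(nn.Module):
    """Illustrative sketch of one decoding branch: three decoding layers,
    each doubling the spatial resolution, whose third output is 1/4 the
    size of the input image, followed by a head that upsamples once more
    and makes dense predictions (boxes or masks, depending on the branch)."""

    def __init__(self, in_ch=4096, out_ch=64, num_outputs=3):
        super().__init__()
        chs = [in_ch, in_ch // 4, in_ch // 16, out_ch]
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(chs[i], chs[i + 1], 3, padding=1),
                nn.BatchNorm2d(chs[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(3)
        ])
        # head: one more x4 upsampling back to the input resolution, then prediction
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(out_ch, num_outputs, 1),
        )

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)  # exposed so the feature sharing modules can mix the two branches
        return self.head(x), feats
```

In a full model the two branches would be stepped layer by layer so that the feature sharing modules described next can exchange the intermediate features after each decoding layer.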
A feature sharing module is provided between the two first decoding layers.
The feature sharing module takes the target detection feature and the instance segmentation feature as input and concatenates them along the channel dimension to obtain the second concatenated feature. The second concatenated feature is then divided into two paths: one path is spatially pooled to obtain a spatial attention vector, and the other path is channel pooled to obtain a channel attention vector.
The spatial attention vector and the channel attention vector are each multiplied element-wise with the second concatenated feature, and the two element-wise products are added to obtain the processed second concatenated feature.
The processed second concatenated feature is split along the channel dimension to obtain the processed target detection feature and the processed instance segmentation feature, which are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and then passed on to the next stage by the two first decoding layers.
Feature sharing modules are also provided between the two second decoding layers and between the two third decoding layers.
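One possible reading of the feature sharing module is sketched below. The exact pooling operators, the sigmoid and the 1x1 convolutions that turn the pooled statistics into attention weights are assumptions; the text only states that the concatenated feature is pooled in the spatial and channel dimensions, reweighted by the two attention vectors, summed and split back per task.

```python
import torch
import torch.nn as nn

class FeatureSharing(nn.Module):
    """Illustrative sketch of the feature sharing module: concatenate the
    target detection feature and the instance segmentation feature along the
    channel dimension, derive a channel attention vector and a spatial
    attention map by pooling, reweight the concatenated feature with each,
    merge the two reweighted results by element-wise addition and split the
    result back into the two task features."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions mapping pooled statistics to attention weights (assumption)
        self.channel_fc = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.spatial_fc = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, det_feat, seg_feat):
        fused = torch.cat([det_feat, seg_feat], dim=1)   # second concatenated feature
        # pooling over the spatial dimensions -> channel attention vector (N x 2C x 1 x 1)
        ch_att = torch.sigmoid(self.channel_fc(fused.mean(dim=(2, 3), keepdim=True)))
        # pooling over the channel dimension -> spatial attention map (N x 1 x H x W)
        sp_att = torch.sigmoid(self.spatial_fc(fused.mean(dim=1, keepdim=True)))
        # reweight, merge by element-wise addition, then split back per task
        shared = fused * ch_att + fused * sp_att
        det_out, seg_out = torch.chunk(shared, 2, dim=1)
        return det_out, seg_out
```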
The depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640x480.
Regarding the fusion modules, referring to Fig. 1, the input ends of the first three fusion modules are connected to the output ends of the two preceding stages; their outputs are divided into two paths, added to the outputs of the two preceding stages, and then sent into the two following stages. That is:
A fusion module is provided between the two second stages; its input end is connected to the output ends of the two second stages, and its output is divided into two paths which are added to the outputs of the two second stages respectively and then sent into the two third stages.
A fusion module is also provided between the two third stages; its input end is connected to the output ends of the two third stages, and its output is divided into two paths which are added to the outputs of the two third stages respectively and then sent into the two fourth stages.
A fusion module is also provided between the two fourth stages; its input end is connected to the output ends of the two fourth stages, and its output is divided into two paths which are added to the outputs of the two fourth stages respectively and then sent into the two fifth stages.
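The stage-by-stage wiring can be summarised in code as follows. The sketch reuses the MultiScaleFusion example above; treating "the output is divided into two paths" as splitting the 2C-channel fused feature into two C-channel halves (one added to each backbone stream) is our interpretation, and the depth image is assumed to be replicated to three channels so that a standard ResNet50 stem can consume it.

```python
import torch
import torch.nn as nn
import torchvision

class DualResNet50Encoder(nn.Module):
    """Illustrative sketch of the encoding module: two ResNet50 backbones,
    one for the color image and one for the depth image, with a fusion
    module after each of stages 2-5. The outputs of the first three fusion
    modules are added back to both streams; the fourth fusion module
    produces the multimodal fusion feature map sent to the decoder."""

    def __init__(self):
        super().__init__()
        def make_stages():
            r = torchvision.models.resnet50(weights=None)
            layer0 = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)  # stage 1
            return nn.ModuleList([layer0, r.layer1, r.layer2, r.layer3, r.layer4])
        self.rgb_stages = make_stages()
        self.depth_stages = make_stages()
        # one fusion module per pair of stages 2-5 (per-stream ResNet50 channel widths)
        self.fusions = nn.ModuleList([MultiScaleFusion(c) for c in (256, 512, 1024, 2048)])

    def forward(self, rgb, depth):
        a = self.rgb_stages[0](rgb)      # stage 1 of the color stream
        b = self.depth_stages[0](depth)  # stage 1 of the depth stream
        fused = None
        for i in range(1, 5):            # stages 2-5
            a = self.rgb_stages[i](a)
            b = self.depth_stages[i](b)
            fused = self.fusions[i - 1](a, b)       # N x 2C x H x W
            if i < 4:                               # first three fusion modules only
                fa, fb = torch.chunk(fused, 2, dim=1)
                a, b = a + fa, b + fb               # added back, sent to the next stages
        return fused                                # multimodal fusion feature map
```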
During network training, the multi-modal multi-task workshop target recognition network is trained using the acquired ground-truth target detection labels and instance segmentation labels. Because a multi-task learning method is adopted, two types of loss values are produced, one for target detection and one for image segmentation. At the same time, because of the differences between the tasks, the prediction output of each task exhibits homoscedastic uncertainty. For this reason, a multi-task learning loss function is used to simultaneously learn regression and classification problems of different scales and quantities.
The multi-task joint loss function is defined to satisfy the following formula.
L(W, σ1, σ2) denotes the joint loss function of the two tasks, where L1(W) = ||y1 - f_W(x)||^2 denotes the loss value of the regression task and L2(W) = -logSoftmax(y2, f_W(x)) denotes the loss of the classification task; y1 and y2 are the true label values, f_W(x) is the network prediction value, and σ1 and σ2 are the noise scalars output by the two task branches respectively.
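The combined form of the joint loss is not reproduced in this text. As an illustration, the sketch below pairs the two task losses defined above with the standard homoscedastic-uncertainty weighting (learning log σ² for numerical stability); the exact weighting terms are therefore an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskLoss(nn.Module):
    """Illustrative sketch of an uncertainty-weighted joint loss for the
    regression (detection) and classification tasks. L1(W) = ||y1 - f_W(x)||^2
    and L2(W) = -logSoftmax(y2, f_W(x)) follow the definitions in the text;
    the way sigma_1 and sigma_2 weight them is an assumption based on the
    usual homoscedastic-uncertainty formulation."""

    def __init__(self):
        super().__init__()
        self.log_var1 = nn.Parameter(torch.zeros(1))  # log(sigma_1^2)
        self.log_var2 = nn.Parameter(torch.zeros(1))  # log(sigma_2^2)

    def forward(self, box_pred, box_true, cls_logits, cls_true):
        l1 = torch.mean((box_true - box_pred) ** 2)   # regression loss L1(W)
        l2 = F.cross_entropy(cls_logits, cls_true)    # classification loss L2(W)
        return (0.5 * torch.exp(-self.log_var1) * l1 + 0.5 * self.log_var1
                + torch.exp(-self.log_var2) * l2 + 0.5 * self.log_var2)
```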
本发明的大致流程为:The general process of the present invention is:
对一数据样本,我们将彩色图片标记为图片A,深度图片标记为图片B,图片A送入一个ResNet50主干网络,图片B送入另一个ResNet50主干网络中,参见图1,别经第一阶段、第二阶段后,输出图片A对应的第二特征向量A2,和图片B对应的第二特征向量B2;For a data sample, we mark the color image as image A and the depth image as image B. Image A is sent to a ResNet50 backbone network, and image B is sent to another ResNet50 backbone network. See Figure 1. After the first and second stages, the second eigenvector A2 corresponding to image A and the second eigenvector B2 corresponding to image B are output;
A2、B2送入融合模块中进行处理,经步骤(2.1)-(2.6),得到融合后特征;将融合后特征分为两路,分别与A2、B2进行加和后,得到两个加和后的特征A2’、B2’送入下一阶段,也就是两个ResNet50主干网络的第三阶段中。A2 and B2 are sent to the fusion module for processing, and after steps (2.1)-(2.6), the fused features are obtained; the fused features are divided into two paths, and added with A2 and B2 respectively, and the two added features A2’ and B2’ are obtained and sent to the next stage, that is, the third stage of the two ResNet50 backbone networks.
同理,两个第三阶段、第四阶段、第五阶段间都设有融合模块,按照上述融合模块的操作流程,最终经第四个融合模块,输出一融合特征图,我们称之为多模态融合特征图;此时编码模块工作结束;Similarly, there are fusion modules between the two third stages, the fourth stage, and the fifth stage. According to the operation process of the above fusion modules, the fourth fusion module finally outputs a fusion feature map, which we call a multimodal fusion feature map; at this time, the encoding module is finished;
The multimodal fusion feature map is split into two paths, which are fed into the first decoding branch and the second decoding branch of the decoding module. The first decoding branch outputs the target detection result and the second decoding branch outputs the instance segmentation result. In addition, feature sharing modules are placed between the two branches. The feature sharing module between the two first decoding layers works as follows: the target detection feature and the instance segmentation feature are first concatenated along the channel dimension, and feature pooling is applied to the concatenated feature in the spatial dimension and in the channel dimension to obtain a spatial attention vector and a channel attention vector. The two attention vectors are then used to emphasize the representative parts of the feature at the spatial level and the channel level respectively, while suppressing noise. Finally, the channel-level and spatial-level emphasized features are merged by element-wise addition and split by task to complete the sharing of the two features. The present invention places three feature sharing modules in the decoding module, so feature sharing is performed once between each pair of decoding layers.
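A minimal PyTorch sketch of such a feature sharing module is shown below; the text does not specify how the pooled statistics are turned into attention vectors, so the small convolutions, the sigmoid activations, and the mapping of "spatial pooling" and "channel pooling" onto mean operations are assumptions.

```python
import torch
import torch.nn as nn

class FeatureSharingModule(nn.Module):
    """Share information between the detection and segmentation branches:
    concatenate along channels, derive channel and spatial attention vectors
    by pooling, emphasize the concatenated feature with each, merge the two
    emphasized features by element-wise addition, and split back per task."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        c = 2 * channels
        # Channel attention from spatially pooled statistics (assumed layers).
        self.channel_attn = nn.Sequential(
            nn.Conv2d(c, c // reduction, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, kernel_size=1), nn.Sigmoid())
        # Spatial attention from channel-wise pooled statistics (assumed layers).
        self.spatial_attn = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, det_feat: torch.Tensor, seg_feat: torch.Tensor):
        x = torch.cat([det_feat, seg_feat], dim=1)                 # concat along channels
        ca = self.channel_attn(x.mean(dim=(2, 3), keepdim=True))   # channel attention vector
        sa = self.spatial_attn(x.mean(dim=1, keepdim=True))        # spatial attention vector
        shared = x * ca + x * sa                                   # emphasize, then element-wise sum
        det_out, seg_out = torch.chunk(shared, 2, dim=1)           # split back by task
        return det_out, seg_out
```

One such module would sit between the two first decoding layers, one between the two second decoding layers, and one between the two third decoding layers, as stated above.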
During training, the ground-truth boxes for target detection and the ground-truth instance masks for instance segmentation are used as the expected outputs, and the predicted boxes and predicted instance masks are corrected against them.
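The sketch below shows how the two task losses could be combined with learnable uncertainty weights at this training step; it follows the common log-variance parameterization and is not an implementation prescribed by the original.

```python
import torch
import torch.nn as nn

class MultiTaskUncertaintyLoss(nn.Module):
    """Combine the detection (regression) loss and the segmentation
    (classification) loss with learnable homoscedastic-uncertainty weights."""

    def __init__(self):
        super().__init__()
        # Learn log(sigma^2) for each task for numerical stability.
        self.log_var_det = nn.Parameter(torch.zeros(()))
        self.log_var_seg = nn.Parameter(torch.zeros(()))

    def forward(self, det_loss: torch.Tensor, seg_loss: torch.Tensor) -> torch.Tensor:
        weighted_det = 0.5 * torch.exp(-self.log_var_det) * det_loss + 0.5 * self.log_var_det
        weighted_seg = torch.exp(-self.log_var_seg) * seg_loss + 0.5 * self.log_var_seg
        return weighted_det + weighted_seg
```

At each training step, the detection and segmentation losses computed from the two branches are passed to this module and the returned scalar is back-propagated together with the network parameters.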
The present invention achieves good recognition accuracy for similarly colored targets in workshop scenes and can perform instance segmentation and target detection in the same scene; in workshop scenes, the target detection task reaches an accuracy of 87% and the instance segmentation task reaches an accuracy of 81%.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (4)
- A multimodal multi-task workshop target recognition method, characterized by comprising the following steps:
(1) Constructing a sample data set:
a depth camera is used to photograph the workshop scene, each shot yielding a corresponding pair of a color image and a depth image;
the categories of targets are determined, the categories of targets including workers, lathes, and material transport robots;
annotation at the target detection level and the instance segmentation level is performed on the color images and depth images to obtain ground-truth boxes for target detection and ground-truth instance masks for instance segmentation;
an annotated pair of a color image and a depth image is taken as one data sample;
(2) Constructing a multimodal multi-task workshop target recognition network:
the multimodal multi-task workshop target recognition network comprises an encoding module and a decoding module;
the encoding module comprises two ResNet50 backbone networks; each ResNet50 backbone network is divided, from input end to output end, into five stages, namely a first stage to a fifth stage, which correspondingly output a first feature vector to a fifth feature vector;
the two ResNet50 backbone networks receive the annotated color image and the annotated depth image respectively; a fusion module is provided between the two second stages, between the two third stages, between the two fourth stages, and between the two fifth stages, namely a first to a fourth fusion module from front to back; for each of the first three fusion modules, the input end is connected to the output ends of the two preceding stages, and the output is split into two paths which are added to the outputs of the two preceding stages respectively and then sent to the two following stages;
the fourth fusion module has its input end connected to the output ends of the two fifth stages, and its output is split into two paths and sent to the decoding module;
the fusion module is used to perform feature fusion on the two input feature vectors and output the result;
the decoding module is used to perform target detection and instance segmentation on the output of the encoding module and to output a target detection result and an instance segmentation result;
(3) Training the multimodal multi-task workshop target recognition network:
the data samples in the sample data set are input into the multimodal multi-task workshop target recognition network to perform target detection and instance segmentation; in the first decoding branch, the ground-truth box corresponding to the target in the data sample is taken as the expected output, and in the second decoding branch, the ground-truth instance mask of the instance segmentation in the data sample is taken as the expected output; training proceeds until the model converges;
(4) Task recognition with the multimodal multi-task workshop target recognition network:
a pair of a color image and a depth image to be tested in the workshop is acquired and fed into the multimodal multi-task workshop target recognition network, which outputs the predicted target box and the predicted instance mask of the targets therein.
- The multimodal multi-task workshop target recognition method according to claim 1, characterized in that the fusion method of the fusion module is:
(2.1) the color image is RGB^(C×H×W) and the depth image is Depth^(C×H×W), where C, H, and W are the number of channels, the height, and the width of the corresponding image respectively;
(2.2) the color image and the depth image are concatenated along the channel dimension to generate a first concatenated feature RGBD^(2C×H×W), which is then split along the channel dimension into S sub-feature blocks, denoted X_0 to X_(S-1), the dimension of each sub-feature block being (2C/S)×H×W;
(2.3) a convolution with a different kernel size is applied to each sub-feature block to obtain a sub-feature vector, where the i-th sub-feature block X_i is convolved according to the following formula:
F_i = Conv_i(X_i),
where F_i is the sub-feature vector corresponding to X_i, Conv denotes a convolution operation, and i = 0 to S−1;
(2.4) a global average pooling operation is performed on the S sub-feature vectors, reducing them to a size of S×1×1 and yielding S weight vectors;
(2.5) the S weight vectors are each normalized to obtain S attention vectors;
(2.6) finally, the obtained attention vectors are multiplied element-wise with the first concatenated feature RGBD to obtain the fused feature.
- The multimodal multi-task workshop target recognition method according to claim 1, characterized in that the decoding module comprises a first decoding branch for target detection and a second decoding branch for instance segmentation;
the first decoding branch comprises a first decoding layer, a second decoding layer, a third decoding layer, and a target detection head arranged in sequence, wherein the first three layers perform upsampling operations and each outputs a target detection feature, the size of the target detection feature output by the third decoding layer being 1/4 of the color image; the target detection head first upsamples the output of the third decoding layer and then predicts the bounding boxes to obtain the target detection result;
the second decoding branch comprises a first decoding layer, a second decoding layer, a third decoding layer, and an instance segmentation head arranged in sequence, wherein the first three layers perform upsampling operations and each outputs an instance segmentation feature, the size of the instance segmentation feature output by the third decoding layer being 1/4 of the color image; the instance segmentation head first upsamples the instance segmentation feature output by the third decoding layer and then predicts the instance masks to obtain the instance segmentation result;
a feature sharing module is provided between the two first decoding layers;
the feature sharing module receives a target detection feature and an instance segmentation feature, concatenates them along the channel dimension to obtain a second concatenated feature, and then splits the second concatenated feature into two paths, one of which is spatially pooled to obtain a spatial attention vector and the other of which is channel pooled to obtain a channel attention vector;
the spatial attention vector and the channel attention vector are each multiplied element-wise with the second concatenated feature, and the results of the element-wise products are added to obtain a processed second concatenated feature;
the processed second concatenated feature is split along the channel dimension to obtain a processed target detection feature and a processed instance segmentation feature, which are sent back to the first decoding layer of the first decoding branch and the first decoding layer of the second decoding branch respectively, and are then passed by the two first decoding layers to the next step of the process;
feature sharing modules are likewise provided between the two second decoding layers and between the two third decoding layers.
- The multimodal multi-task workshop target recognition method according to claim 1, characterized in that the depth camera is an Intel RealSense D455 RGBD camera with a sampling resolution of 640×480.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/133437 WO2024108377A1 (en) | 2022-11-22 | 2022-11-22 | Multimodal multi-task workshop target recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024108377A1 true WO2024108377A1 (en) | 2024-05-30 |
Family
ID=91194928
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/133437 WO2024108377A1 (en) | 2022-11-22 | 2022-11-22 | Multimodal multi-task workshop target recognition method |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024108377A1 (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220254150A1 (en) * | 2021-02-05 | 2022-08-11 | Salesforce.Com, Inc. | Exceeding the limits of visual-linguistic multi-task learning |
CN112966644A (en) * | 2021-03-24 | 2021-06-15 | 中国科学院计算技术研究所 | Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof |
CN114494276A (en) * | 2022-04-18 | 2022-05-13 | 成都理工大学 | Two-stage multi-modal three-dimensional instance segmentation method |
CN114821014A (en) * | 2022-05-17 | 2022-07-29 | 湖南大学 | Multi-mode and counterstudy-based multi-task target detection and identification method and device |
Non-Patent Citations (1)
Title |
---|
TANG ZAIZUO, CHEN GUANGZHU; HAN YINHE; LIAO XIAOJUAN; RU QINGJUN; WU YUANYUAN: "Bi-stage multi-modal 3D instance segmentation method for production workshop scene", ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE., PINERIDGE PRESS, SWANSEA., GB, vol. 112, 1 June 2022 (2022-06-01), GB , pages 104858, XP093173111, ISSN: 0952-1976, DOI: 10.1016/j.engappai.2022.104858 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Thakkar et al. | Part-based graph convolutional network for action recognition | |
CN109241895B (en) | Dense crowd counting method and device | |
WO2020228446A1 (en) | Model training method and apparatus, and terminal and storage medium | |
US11328172B2 (en) | Method for fine-grained sketch-based scene image retrieval | |
CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
CN110059598B (en) | Long-term fast-slow network fusion behavior identification method based on attitude joint points | |
US20170032222A1 (en) | Cross-trained convolutional neural networks using multimodal images | |
CN110674741A (en) | Machine vision gesture recognition method based on dual-channel feature fusion | |
WO2023185494A1 (en) | Point cloud data identification method and apparatus, electronic device, and storage medium | |
WO2022242122A1 (en) | Video optimization method and apparatus, terminal device, and storage medium | |
WO2022152104A1 (en) | Action recognition model training method and device, and action recognition method and device | |
CN108133235A (en) | A kind of pedestrian detection method based on neural network Analysis On Multi-scale Features figure | |
CN110852295A (en) | Video behavior identification method based on multitask supervised learning | |
CN110310305A (en) | A kind of method for tracking target and device based on BSSD detection and Kalman filtering | |
CN107948586A (en) | Trans-regional moving target detecting method and device based on video-splicing | |
WO2019117393A1 (en) | Learning apparatus and method for depth information generation, depth information generation apparatus and method, and recording medium related thereto | |
CN114863407A (en) | Multi-task cold start target detection method based on visual language depth fusion | |
CN113920378B (en) | Bupleurum seed identification method based on attention mechanism | |
CN111476190A (en) | Target detection method, apparatus and storage medium for unmanned driving | |
CN115049833A (en) | Point cloud component segmentation method based on local feature enhancement and similarity measurement | |
CN107274425A (en) | A kind of color image segmentation method and device based on Pulse Coupled Neural Network | |
WO2024108377A1 (en) | Multimodal multi-task workshop target recognition method | |
CN116468902A (en) | Image processing method, device and non-volatile computer readable storage medium | |
CN115222768A (en) | Method and device for positioning tracking object in video, electronic equipment and storage medium | |
CN115457365A (en) | Model interpretation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22966077; Country of ref document: EP; Kind code of ref document: A1 |