CN116433903A - Instance segmentation model construction method, system, electronic equipment and storage medium

Info

Publication number
CN116433903A
CN116433903A (application CN202310347568.XA)
Authority
CN
China
Prior art keywords
network
depth estimation
backbone network
monocular
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310347568.XA
Other languages
Chinese (zh)
Inventor
田炜
余先旺
孙震
曹旭
林艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Intelligent New Energy Vehicle Research Institute
Original Assignee
Nanchang Intelligent New Energy Vehicle Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Intelligent New Energy Vehicle Research Institute filed Critical Nanchang Intelligent New Energy Vehicle Research Institute
Priority to CN202310347568.XA
Publication of CN116433903A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an instance segmentation model construction method, system, electronic equipment and storage medium, belonging to the technical field of perception for autonomous-driving image processing. The method comprises the following steps: training a monocular depth estimation model by adopting the training framework of a self-supervised monocular depth estimation network; extracting FPN features with a ResNeXt50 backbone network based on the monocular depth estimation model, and performing group normalization and weight normalization on the backbone network to obtain a preprocessed backbone network; adding a feature pyramid network to the monocular depth estimation model; constructing a depth information guide module based on improved spatial attention; loading depth information between the feature pyramid network and the preprocessed backbone network through the depth information guide module; and correcting regions of interest and synthesizing fixed-size regions according to the pyramid features so as to construct the target instance segmentation model. The method and system can effectively alleviate mis-segmentation and segmentation failure caused by low detection and segmentation accuracy and unclear segmentation boundaries for long-distance and/or large-size targets.

Description

Instance segmentation model construction method, system, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of perception for autonomous-driving image processing, and particularly relates to an instance segmentation model construction method, system, electronic equipment and storage medium.
Background
The visual perception system of an autonomous vehicle involves a number of visual tasks, including object detection, classification, segmentation, localization, object tracking, depth estimation, and the like. Object motion prediction and pose estimation based on object detection results are prerequisites for functions such as collision-avoidance control and motion trajectory planning of an autonomous vehicle. Object detection and instance segmentation in road scenes mainly judge the category attribute of an object and accurately locate its position. General road scenes can be divided into two categories: the things category, containing specific countable objects such as pedestrians and vehicles of different types, and the stuff category, containing texture-similar and hard-to-count regions. Instance segmentation can be regarded as an extension of object detection: it mainly targets things objects and, on the basis of detection, performs pixel-wise segmentation of varying numbers of targets of different classes. Unlike semantic segmentation, instances of the same class must be treated separately, and the number of objects of the same class is variable, so the instance segmentation problem cannot be solved simply by a pixel-wise classification scheme like semantic segmentation. The perception tasks of different scenes are compared in fig. 1: fig. 1(a) the original image, fig. 1(b) the image processed by semantic segmentation, fig. 1(c) the image processed by object detection, and fig. 1(d) the image processed by instance segmentation.
Taking a common autonomous-driving road scene as an example, the instance segmentation of densely distributed long-distance and/or large-size crowds, vehicles and similar objects sometimes suffers from unclear segmentation boundaries, or even segmentation errors and failures, because local image textures are similar or occlusion occurs; an instance segmentation failure case is shown in fig. 2, where fig. 2(a) shows the original image and its local enlargement and fig. 2(b) shows the instance segmentation result and its local enlargement. Compared with learning the mask of the target region and the position and size of the detection frame from a single two-dimensional image alone, the depth features introduced by deep-learning-based image processing make the positions of different objects in the three-dimensional scene distinguishable: regions with similar textures but large depth-value differences are not necessarily the same object. In particular, depth estimation information of scenes with occlusion relationships in a road environment, such as vehicle queues and pedestrians, can further guide and assist the instance segmentation task. However, few studies in the prior art focus on the feasibility and effectiveness of depth estimation information for improving instance segmentation model performance.
Therefore, how to use depth estimation information to guide instance segmentation, so as to overcome mis-segmentation or segmentation failure caused by low detection and segmentation accuracy for long-distance and/or large-size targets and by unclear segmentation boundaries, is a problem to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the technical problems, the invention provides an example segmentation model construction method, an example segmentation model construction system, electronic equipment and a storage medium.
In a first aspect, the invention provides a method for constructing an instance segmentation model, which comprises the following steps:
training a monocular depth estimation model by adopting the training framework of a self-supervised monocular depth estimation network;
extracting FPN features with a ResNeXt50 backbone network based on the monocular depth estimation model, and performing group normalization and weight normalization on the backbone network to obtain a preprocessed backbone network;
adding a feature pyramid network into the monocular depth estimation model;
constructing a depth information guide module based on the improved spatial attention;
loading depth information between the feature pyramid network and the preprocessed backbone network through the depth information guide module, so that the extracted depth information is propagated into pyramid features of different scales along the up-sampling and fusion path of the feature pyramid network;
and correcting regions of interest and synthesizing fixed-size regions according to the pyramid features carrying depth information, in combination with the monocular depth estimation model, so as to construct the target instance segmentation model.
Preferably, the self-supervised monocular depth estimation network comprises a monocular depth estimation network and a pose estimation network, wherein the backbone network of the monocular depth estimation network adopts a standard ResNet-50 network.
Preferably, the monocular depth estimation network generates multi-scale relative depth feature maps through depth estimation prediction heads at stages 4, 3, 2 and 1 respectively, wherein the relative depth ranges from 0 to 1.
Preferably, the decoder of the standard ResNet-50 network comprises a series of up-sampling blocks and fusion convolutions with skip connections; the pose estimation network performs several convolutions and activations on the feature map output by the last layer of the encoder of the standard ResNet-50 network so as to predict the transformation matrix parameters between two frames.
Preferably, the feature pyramid network converts the channels of the outputs F1, F2, F3 and F4 of the four stages layer1, layer2, layer3 and layer4 of the ResNeXt50 backbone network, generates pyramid features with Cout channels by up-sampling and fusion, and adds an extra output P5 based on the last-layer output of the ResNeXt50 backbone network.
Preferably, the depth information guide module is used for performing linear transformation on the features from the ResNeXt50 backbone network and the features from the monocular depth estimation network to obtain their respective subspace representations.
Preferably, a learnable weight parameter α is added to the depth information guide module; α is initialized to 0 at the start of the self-supervised monocular depth estimation network training, at which point the depth information guide module is equivalent to the lateral convolution of the ResNeXt50 backbone network.
In a second aspect, the invention provides an example segmentation model construction system, comprising:
the training module is used for training a monocular depth estimation model by adopting the training framework of a self-supervised monocular depth estimation network;
the preprocessing module is used for extracting FPN features with a ResNeXt50 backbone network based on the monocular depth estimation model, and performing group normalization and weight normalization on the backbone network to obtain a preprocessed backbone network;
the adding module is used for adding the feature pyramid network into the monocular depth estimation model;
a construction module for constructing a depth information guide module based on the improved spatial attention;
the fusion module is used for loading depth information between the feature pyramid network and the preprocessed backbone network through the depth information guide module, so that the extracted depth information is propagated into pyramid features of different scales along the up-sampling and fusion path of the feature pyramid network;
and the model building module is used for correcting regions of interest and synthesizing fixed-size regions according to the pyramid features carrying depth information, in combination with the monocular depth estimation model, so as to construct the target instance segmentation model.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for building an instance segmentation model according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program which, when executed by a processor, implements the instance segmentation model construction method as set forth in the first aspect.
Compared with the prior art, the instance segmentation model construction method, system, electronic equipment and storage medium provided by the invention have the following beneficial effects:
1. A multi-scale spatial attention module is added at the input end of the feature pyramid generation of the instance segmentation backbone network. By computing the similarity relationship between depth features and the features extracted by the instance segmentation backbone network, the original instance segmentation features are selectively weighted with the help of depth information, so that scene structure information is highlighted in the pyramid features extracted by the instance segmentation backbone network, providing more assistance for the detection and segmentation of large objects and long-distance targets in instance segmentation.
2. Compared with the existing model, the built instance segmentation model remarkably improves the detection and segmentation accuracy of a long-distance target and/or a large-size target by utilizing visual depth information, and simultaneously improves the accuracy of the instance segmentation boundary of a low-texture region; better support is provided for accurate perception of remote target tracking and target position in an automatic driving application scene.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a comparison of perception tasks for different scenes in the prior art;
FIG. 2 is a schematic diagram of an instance segmentation failure case in the prior art;
FIG. 3 is a flowchart of the instance segmentation model construction method provided in embodiment 1 of the present invention;
FIG. 4 is a diagram of the self-supervised monocular depth estimation network training framework provided in embodiment 1 of the present invention;
FIG. 5 is a detailed structure diagram of the ResNeXt50 backbone network provided in embodiment 1 of the present invention;
FIG. 6 is a schematic diagram of the DGM module according to embodiment 1 of the present invention;
FIG. 7 is a diagram of the SA-FPN backbone network according to embodiment 1 of the present invention;
FIG. 8 is a diagram of the overall architecture of the target instance segmentation model provided in embodiment 1 of the present invention;
FIG. 9 is a graph showing the improvement that depth information brings to instance segmentation boundaries in the experimental verification provided in embodiment 1 of the present invention;
FIG. 10 is a block diagram of the instance segmentation model construction system corresponding to the method of embodiment 1, provided in embodiment 2 of the present invention;
FIG. 11 is a schematic diagram of the hardware structure of the electronic device according to embodiment 3 of the present invention.
Reference numerals:
10-training module;
20-preprocessing module;
30-adding module;
40-construction module;
50-fusion module;
60-model building module;
70-bus, 71-processor, 72-memory, 73-communication interface.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. I.e. the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Taking a common autonomous-driving road scene as an example, the instance segmentation of densely distributed crowds, vehicles and similar objects sometimes suffers from unclear segmentation boundaries, or even segmentation errors and failures, because local image textures are similar. In response to these difficulties, methods that combine depth estimation with instance segmentation currently exist.
Current mainstream instance segmentation can be mainly divided into two kinds of methods: bottom-up and top-down. The top-down method follows the idea of detecting first and then segmenting: the bounding box of the object is first obtained through object detection, and the region inside the box is segmented by a segmentation network to obtain the instance segmentation mask. The bottom-up method acquires the instance segmentation mask by segmenting first and clustering second; it usually requires a pixel-grouping operation to generate the instance segmentation result, and its performance is weaker than that of mainstream instance segmentation.
Monocular depth estimation is one of the basic problems of computer vision; current monocular depth estimation tasks can be mainly divided into supervised and unsupervised modes. Supervised monocular depth estimation learns the mapping from a two-dimensional image to relative depth values in an end-to-end manner, but is limited by the inherent ill-posedness of the 2D-to-3D mapping; recent methods attempt to improve supervised monocular depth performance by adding geometric priors. Unsupervised monocular depth estimation is one of the most successful examples of unsupervised learning at present. Under the unsupervised framework, 3D packing-unpacking operations are used to replace the pooling (down-sampling) and up-sampling operations, so that more image detail information is retained in the encoding-decoding process, further improving the effect of unsupervised monocular depth estimation.
To better utilize the correlations and information assistance among multiple tasks, many scene-understanding deep learning methods integrate several subtask modules such as semantic segmentation and depth estimation. However, for the instance segmentation of densely distributed long-distance and/or large-size crowds, vehicles and similar objects in road scenes, existing methods combining depth estimation and instance segmentation lack the ability to accurately detect objects and predict fine-grained instance masks, and have difficulty capturing scene structure and context. The present application is proposed on this basis.
Example 1
Specifically, fig. 3 is a schematic flowchart of the instance segmentation model construction method according to this embodiment.
As shown in fig. 3, the instance segmentation model construction method of this embodiment includes the following steps:
s101, training a monocular depth estimation model by adopting a training framework of a self-supervision monocular depth estimation network.
The self-supervised monocular depth estimation network comprises a monocular depth estimation network and a pose estimation network. Preferably, the backbone network of the monocular depth estimation network adopts a standard ResNet-50 network, and the corresponding decoder mainly comprises a series of up-sampling blocks (UpConv_i) and fusion convolutions (FConv_i) with skip connections. The decoder of the monocular depth estimation network generates multi-scale relative depth maps through depth prediction heads (Depth Prediction Head) at stages 4, 3, 2 and 1 respectively; the relative depth ranges over [0, 1], so the final activation function of each prediction head is Sigmoid. A minimal sketch of such a prediction head follows.
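As an illustration, the following PyTorch sketch shows a per-stage depth prediction head ending in Sigmoid; the class name and the stage channel widths are assumptions for illustration, not the patent's exact implementation.

import torch
import torch.nn as nn

class DepthPredictionHead(nn.Module):
    # Minimal sketch (illustrative names): a 3x3 convolution maps a decoder
    # feature map to one channel, and Sigmoid squashes the output into the
    # relative-depth range [0, 1], matching the description above.
    def __init__(self, in_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.conv(x))  # relative depth in [0, 1]

# One head per decoder stage (4, 3, 2, 1) yields the multi-scale depth maps;
# the channel widths below are assumed for illustration.
heads = nn.ModuleList(DepthPredictionHead(c) for c in (256, 128, 64, 32))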
Further, the pose estimation network needs to regress the parameters of a highly abstract pose transformation matrix: several convolution and activation operations are performed directly on the last-layer output feature map of the standard ResNet-50 network to reduce the number of feature channels, and the transformation matrix parameters between two frames are finally predicted. It should be noted that the input of the pose estimation network is the result of concatenating two images along the channel dimension, so when a ResNet pre-trained network is used, the size of the first-layer convolution parameters does not match. The convolution kernel of the ResNet first layer has size C_in × C_out × K × K (K is the kernel size); when two pictures are concatenated as input, C_in doubles. Here the parameter-copying idea proposed by Godard et al. is adopted: the parameters are replicated by a factor of 2 along the C_in dimension, and the weights of the entire convolution kernel are divided by 2, as sketched below.
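A minimal PyTorch sketch of this parameter-copying trick; it assumes a torchvision ResNet-50 and two RGB frames concatenated into a 6-channel input.

import torch
import torch.nn as nn
from torchvision.models import resnet50

def make_pose_encoder() -> nn.Module:
    # Adapt the pretrained 3-channel first convolution to a 6-channel input
    # (two RGB frames concatenated along C) by replicating the weights along
    # C_in and halving them, so the expected activation magnitude is kept.
    net = resnet50(pretrained=True)
    old = net.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
    new = nn.Conv2d(6, old.out_channels, kernel_size=old.kernel_size,
                    stride=old.stride, padding=old.padding, bias=False)
    with torch.no_grad():
        new.weight.copy_(torch.cat([old.weight] * 2, dim=1) / 2.0)
    net.conv1 = new
    return net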
S102, extracting FPN features with a ResNeXt50 backbone network based on the monocular depth estimation model, and performing group normalization and weight normalization on the backbone network to obtain a preprocessed backbone network.
The adopted Cityscapes dataset is large, with picture width and height of 2048 and 1024 respectively. After data enhancement such as random cropping, the crop size cannot be too small, lest too much global image information be lost; hence, under distributed training, the number of pictures that can be processed at once on each GPU is small. While the batch normalization (BN) employed in the original ResNeXt model works well in many computer vision applications, its effectiveness is very limited for small-batch (mini-batch) training, especially when only 1-2 images per GPU are used. This is mainly because, when the training batch becomes small, it is difficult to accurately estimate the distribution from a few samples that follow a particular distribution, and the benefit of batch normalization is lost. Therefore, this embodiment adopts group normalization: on the basis of the original batch normalization, the feature channel dimension (C) is grouped and the normalization is carried out within each group, instead of normalizing every channel over the whole batch. The operation no longer depends on the batch dimension (N), which helps small-batch training; a sketch of such a block follows.
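The sketch below shows a batch-independent normalization block of the kind described, assuming the weight normalization mentioned above is the weight-standardization variant commonly paired with group normalization; the group count of 32 is an assumption.

import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    # Weight-standardized convolution: each filter is normalized to zero
    # mean and unit variance before use, which pairs well with GroupNorm
    # in small-batch training.
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def gn_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # GroupNorm normalizes within channel groups, so the statistics no
    # longer depend on the batch dimension N (out_ch is assumed to be
    # divisible by the group count).
    return nn.Sequential(WSConv2d(in_ch, out_ch, 3, padding=1, bias=False),
                         nn.GroupNorm(32, out_ch),
                         nn.ReLU(inplace=True))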
Further, self-supervised depth estimation requires constraining training with prior features of the depth map and reconstruction similarity. The self-supervised monocular depth estimation framework adopts two loss-function constraints, photometric loss (Photometric Loss) and depth smoothing loss (Smooth Loss); as shown in fig. 4, the total loss is L = 0.15·L_Smooth + L_Photometric. The photometric loss L_Photometric combines the brightness-constancy and spatial-smoothness priors used in classical dense matching algorithms, and takes the photometric difference between the resampled reference frame and the real target frame as part of the unsupervised loss function. The depth smoothing loss L_Smooth is a depth continuity constraint on non-boundary regions: it uses the continuity prior of the depth distribution in non-boundary regions to constrain the depth estimation result. A hedged sketch of these two terms is given below.
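A sketch of the two constraints, with a plain L1 term standing in for the full photometric loss (Monodepth2-style frameworks usually mix SSIM and L1); the function names and the exact photometric form are assumptions.

import torch

def smoothness_loss(disp: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
    # Edge-aware depth smoothness: penalize disparity gradients except where
    # the image itself has strong gradients (likely object boundaries).
    dx_d = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    dy_d = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    dx_i = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def photometric_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Brightness-constancy term between the resampled reference frame and
    # the real target frame (plain L1 here as a stand-in).
    return (pred - target).abs().mean()

def total_loss(disp, img, pred, target):
    # Combination per the formula above: L = 0.15 * L_Smooth + L_Photometric.
    return 0.15 * smoothness_loss(disp, img) + photometric_loss(pred, target)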
S103, adding a feature pyramid network to the monocular depth estimation model.
Here, the feature pyramid network converts the channels of the outputs F1, F2, F3 and F4 of the four stages layer1, layer2, layer3 and layer4 of the ResNeXt50 backbone network, generates pyramid features with Cout channels by up-sampling and fusion, and adds an extra output P5 based on the last-layer output of the ResNeXt50 backbone network, as shown in fig. 5; a sketch follows.
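The sketch below illustrates this construction, assuming ResNeXt50 stage channel widths of (256, 512, 1024, 2048) and a common output width C_out = 256; class and layer names are illustrative.

import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    # 1x1 lateral convolutions convert the channel counts of the backbone
    # outputs F1..F4 to a common C_out, top-down up-sampling fuses adjacent
    # levels, and an extra level P5 is derived from the coarsest output.
    def __init__(self, in_channels=(256, 512, 1024, 2048), c_out=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, c_out, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(c_out, c_out, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats = (F1, F2, F3, F4), fine -> coarse
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down fusion
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        outs = [s(x) for s, x in zip(self.smooth, laterals)]
        p5 = F.max_pool2d(outs[-1], kernel_size=1, stride=2)  # extra level P5
        return outs + [p5]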
S104, constructing a depth information guiding module based on the improved spatial attention.
The depth information guide module performs linear transformation on the features from the ResNeXt50 backbone network and the features from the monocular depth estimation network to obtain their respective subspace representations, and at the same time reduces the channel dimension, so as to avoid excessive computation when calculating the similarity matrix. Specifically, in order to let the instance segmentation backbone network sufficiently capture the long-range relationship between depth features and instance segmentation features during training, the depth information guide module (DGM) of this embodiment is designed by improving the spatial attention (Spatial Attention) module; the specific structure is shown in fig. 6. In addition, a learnable weight parameter α is added to the depth information guide module; α is initialized to 0 at the start of self-supervised monocular depth estimation network training, at which point the module is equivalent to the lateral convolution of the ResNeXt50 backbone network. The learnable parameter α gives the DGM a certain adaptability, especially at the beginning of training; on the other hand, it permits automatic parameter optimization without manually adjusting network parameters. A sketch of such a module follows.
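The following sketch shows one plausible form of the DGM consistent with the description: 1×1 convolutions project both feature sets into a reduced subspace, a pixel-wise similarity matrix re-weights the depth features, and the learnable α (initialized to 0) gates the attention branch on top of the lateral convolution. Class and parameter names (and the reduction factor) are assumptions.

import torch
import torch.nn as nn

class DepthGuideModule(nn.Module):
    def __init__(self, c_ins: int, c_dep: int, c_out: int, reduction: int = 8):
        super().__init__()
        c_sub = max(c_out // reduction, 1)
        self.lateral = nn.Conv2d(c_ins, c_out, 1)   # FPN lateral path
        self.query = nn.Conv2d(c_ins, c_sub, 1)     # subspace projections
        self.key = nn.Conv2d(c_dep, c_sub, 1)
        self.value = nn.Conv2d(c_dep, c_out, 1)
        self.alpha = nn.Parameter(torch.zeros(1))   # starts at 0: module
                                                    # reduces to the lateral conv

    def forward(self, f_ins: torch.Tensor, f_dep: torch.Tensor) -> torch.Tensor:
        # f_ins: instance features; f_dep: depth features (same scale assumed).
        b, _, h, w = f_ins.shape
        q = self.query(f_ins).flatten(2).transpose(1, 2)   # B x HW x C'
        k = self.key(f_dep).flatten(2)                     # B x C' x HW
        attn = torch.softmax(q @ k, dim=-1)                # B x HW x HW similarity
        v = self.value(f_dep).flatten(2).transpose(1, 2)   # B x HW x C_out
        guided = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.lateral(f_ins) + self.alpha * guided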
It should be noted that the dual-attention variant of the self-attention mechanism shows excellent performance in semantic segmentation applications: by introducing a spatial attention module and a channel attention module, the network can encode broader context information into local features, thereby enhancing its representation ability. In addition to the cross-task spatial-attention depth information guide module, a backbone network that improves instance segmentation performance using spatial self-attention is also proposed: the spatial self-attention FPN (SA-FPN), shown in fig. 7. Considering that the attention mechanism is demanding on computing resources and that the resolution of the Cityscapes images used for instance segmentation is high, adding self-attention modules at larger scales would further compress the training batch beyond what the hardware allows. Meanwhile, the feature map output by the last layer of the backbone network contains the richest high-level semantic information; adding the self-attention module on this basis better mines the correlations among high-level semantics and helps improve object detection and instance segmentation performance. A sketch of this last-level attention follows.
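A sketch of applying spatial self-attention only at the last backbone level, where the HW × HW attention map remains affordable; the use of nn.MultiheadAttention and the head count are assumptions (channels must be divisible by the number of heads).

import torch
import torch.nn as nn

class LastLevelSelfAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).permute(2, 0, 1)        # (HW, B, C) token sequence
        out, _ = self.attn(seq, seq, seq)          # spatial self-attention
        # Residual connection preserves the original feature map.
        return x + out.permute(1, 2, 0).reshape(b, c, h, w)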
S105, loading, through the depth information guide module, depth information between the feature pyramid network and the preprocessed backbone network, so that the extracted depth information is propagated into pyramid features of different scales along the up-sampling and fusion path of the feature pyramid network.
S106, correcting regions of interest and synthesizing fixed-size regions according to the pyramid features carrying depth information, and combining the monocular depth estimation model to construct the target instance segmentation model.
The instance segmentation model mainly comprises two parts: the RPN (Region Proposal Network) and the RoISH (RoI Segmentation Head). The RPN uses the concept of an "anchor" (Anchor) proposed in Faster R-CNN, i.e., a certain number of predefined bounding boxes generated at each location on the pyramid features. The RPN corrects the predefined anchors by predicting the offset from each predefined box to the real bounding box and the likelihood that the box contains a detection target, giving the final region-of-interest proposals (RoI, Region Proposal); a sketch of this anchor-correction step is given below. The RoISH crops fixed-size regions from the corresponding FPN pyramid features according to the center positions and width-height predictions of all regions of interest obtained in the first stage, so as to better unify object detection and segmentation across scales; classification prediction and segmentation prediction of the target are then performed on the uniformly sized region-of-interest proposals. Specifically, the overall architecture of the target instance segmentation model of this embodiment is shown in fig. 8, in which DGM (Depth Guide Module) is the depth information guide module and DGFPN (Depth Guided Feature Pyramid Network) is the feature pyramid network guided by depth information; the depth features of different scales drawn from the depth estimation network are led, via DGM modules, into the instance segmentation backbone FPN at the corresponding scales.
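For concreteness, the sketch below applies the standard Faster R-CNN box parameterization that the anchor-correction step relies on; it is a generic illustration, not the patent's code.

import torch

def apply_deltas(anchors: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    # anchors: (N, 4) boxes as (x1, y1, x2, y2); deltas: (N, 4) predicted
    # offsets (dx, dy, dw, dh) from each anchor to the true bounding box.
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h
    dx, dy, dw, dh = deltas.unbind(dim=1)
    pcx, pcy = cx + dx * w, cy + dy * h            # shift the center
    pw, ph = w * torch.exp(dw), h * torch.exp(dh)  # rescale width/height
    return torch.stack([pcx - 0.5 * pw, pcy - 0.5 * ph,
                        pcx + 0.5 * pw, pcy + 0.5 * ph], dim=1)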
The following experimental verification was performed on the instance segmentation model constructed in the above steps.
1.1 Dataset and evaluation indexes
The autonomous-driving scene perception datasets Cityscapes and Kitti are selected as experimental data. The Cityscapes dataset provides ground-truth annotations including semantic segmentation and instance segmentation, and contains 5000 finely annotated images (Cityscapes fine) and 20000 coarsely annotated images (Cityscapes coarse). The Kitti dataset contains semantic segmentation labels on three-dimensional point clouds, depth estimation labels synthesized from lidar data, and a small number of optical-flow estimation labels. All experiments adopted rich image enhancement during training, including random cropping subject to a minimum IoU threshold, random flipping, image photometric adjustment, chromaticity increments, and the like.
In the instance segmentation task, the Cityscapes evaluation indexes are computed from the overlap (IoU) between the predicted segmentation mask region M_pred and the corresponding label M_gt, as sketched after Table 1; the specific evaluation indexes are shown in Table 1.
TABLE 1 Description of instance segmentation and object detection indexes
[Table 1 is reproduced as an image in the original publication; its contents are not recoverable here.]
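As a concrete illustration of the underlying metric, a sketch of mask IoU between M_pred and M_gt (boolean arrays of the same shape; the function name is assumed):

import numpy as np

def mask_iou(m_pred: np.ndarray, m_gt: np.ndarray) -> float:
    # IoU between a predicted segmentation mask region and its label,
    # the quantity underlying the AP-style indexes in Table 1.
    inter = np.logical_and(m_pred, m_gt).sum()
    union = np.logical_or(m_pred, m_gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0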
1.2 Experimental parameters
All experiments referred to herein are based on Python 3.8; the deep learning toolkit PyTorch is version 1.8.2. The experimental hardware platform is an Ubuntu 18.04 LTS server fitted with 8 RTX 2070 Ti graphics cards; the server CPU model is Intel(R) Xeon(R) Platinum 7163 CPU @ 2.50GHz.
The self-supervised training hyper-parameters of the monocular depth estimation network mainly follow Monodepth2. The optimizer is Adam with learning rate 0.0001; the training batch size is 12 and training lasts 20 epochs; the step-decay learning-rate scheduler (StepLR Scheduler) uses a step size of 15 and a decay factor γ = 0.1; the smoothing-loss weight in the total depth estimation loss function is set to α = 0.001.
The depth-information-guided instance segmentation network employs the AdamW optimizer, in which the weight decay coefficient is empirically set to 5×10⁻⁴. A step-decay learning-rate scheduler (StepLR Scheduler) is likewise adopted in training, with decay factor γ = 0.2 and decay steps empirically set to [16, 34, 40, 43].
1.3 Analysis of experimental results
After training with monocular Kitti data, the depth estimation effect of the pre-trained depth estimation model was tested on the Kitti eigen test set. As shown in Table 2, under similar training the pre-trained model is comparable to the classical Monodepth2 model for self-supervised depth estimation.
TABLE 2 Self-supervised monocular depth estimation comparison on the Kitti eigen test set
[Table 2 is reproduced as an image in the original publication; its contents are not recoverable here.]
For the instance segmentation model, after training on the Cityscapes fine training set, the performance of the proposed multi-scale depth-information-guided instance segmentation model Multi-DG-CMR (Multi-scale Depth-guided Cascade Mask R-CNN) and of the corresponding classical instance segmentation baseline model CMR (Cascade Mask R-CNN) was first tested on the Cityscapes fine test set; see Table 3.
TABLE 3 Comparison of instance segmentation effects on the Cityscapes fine test set
[Table 3 is reproduced as an image in the original publication; its contents are not recoverable here.]
Table 3 contains the baseline model CMR and the proposed depth-information-guided instance segmentation model Multi-DG-CMR; it also compares the effect of the classical top-down instance segmentation model MR (Mask R-CNN) with COCO-dataset pre-training, and the effect on the test set of the bottom-up instance segmentation model based on ERFNet. From Table 3 it can be seen that the proposed depth guidance mechanism improves all indexes of the instance segmentation task, including mAP and mAP100m, and also improves the classical Mask R-CNN model pre-trained on the COCO dataset to a certain extent.
It is particularly worth noting that comparing the detection precision of Multi-DG-CMR within 100 m against that within 50 m shows that introducing the depth estimation guide module (DGM) especially improves the detection precision of long-distance targets. This is mainly because, in the improved spatial attention module of the DGM, the relative depth information in the depth features gives greater weight to long-distance targets, raising the network's attention to them during training.
In addition to the average-precision indexes, the precision of the above models on each instance segmentation class is also compared, as in Table 4.
TABLE 4 Comparison of per-class mask segmentation average precision (Mask mAP) on the Cityscapes fine test set
[Table 4 is reproduced as an image in the original publication; its contents are not recoverable here.]
As can be seen from Table 4, the proposed Multi-DG-CMR instance segmentation model provides a greater boost over CMR in the larger-size categories such as car (Car), truck (Truck) and bus (Bus), and performs comparably in the other categories. The overall average improvement of the models in Table 3 above comes primarily from the improvement in detection accuracy for these classes of targets.
One major constraint in the self-supervised training of the depth estimation network is the gradient-aware smoothing loss, which only constrains smoothness in non-boundary regions, while boundary regions rely solely on the other loss-function constraint. Indeed, most depth estimation results cannot preserve detail as finely as the original RGB image; but the boundary information of large-size objects is retained better, so the assistance to instance segmentation is more pronounced there.
As shown in fig. 9, in the CMR method the roof area of the car is close in appearance to the background, producing a boundary error; after depth information is introduced, the boundary information carried by the depth estimate in the roof area imposes a constraint on the instance segmentation network, so the result of the Multi-DG-CMR method segments the car roof completely.
Example 2
This embodiment provides a block diagram of the system corresponding to the method described in embodiment 1. FIG. 10 is a block diagram of the instance segmentation model construction system according to an embodiment of the present application; as shown in FIG. 10, the system includes:
a training module 10 for training a monocular depth estimation model using the training framework of a self-supervised monocular depth estimation network;
a preprocessing module 20 for extracting FPN features with a ResNeXt50 backbone network based on the monocular depth estimation model, and performing group normalization and weight normalization on the backbone network to obtain a preprocessed backbone network;
an adding module 30 for adding a feature pyramid network to the monocular depth estimation model;
a construction module 40 for constructing a depth information guide module based on the improved spatial attention;
a fusion module 50, configured to load depth information between the feature pyramid network and the preprocessed backbone network through the depth information guide module, so that the extracted depth information is propagated into pyramid features of different scales along the up-sampling and fusion path of the feature pyramid network;
a model building module 60, configured to correct regions of interest and synthesize fixed-size regions according to the pyramid features carrying depth information, in combination with the monocular depth estimation model, so as to construct the target instance segmentation model.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
Example 3
The example segmentation model construction method described in connection with fig. 3 may be implemented by an electronic device. Fig. 11 is a schematic diagram of the hardware structure of the electronic device according to the present embodiment.
The electronic device may include a processor 71 and a memory 72 storing computer program instructions.
In particular, the processor 71 may include a central processing unit (CPU), or an application specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 72 may include mass storage for data or instructions. By way of example and not limitation, memory 72 may comprise a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a universal serial bus (USB) drive, or a combination of two or more of these. Memory 72 may include removable or non-removable (or fixed) media, where appropriate. Memory 72 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, memory 72 is non-volatile memory. In particular embodiments, memory 72 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these. Where appropriate, the RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), where the DRAM may be fast page mode DRAM (FPM DRAM), extended data out DRAM (EDO DRAM), synchronous DRAM (SDRAM), or the like.
Memory 72 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 71.
The processor 71 implements the instance segmentation model construction method of embodiment 1 described above by reading and executing the computer program instructions stored in the memory 72.
In some of these embodiments, the electronic device may also include a communication interface 73 and a bus 70. As shown in fig. 11, the processor 71, the memory 72, and the communication interface 73 are connected to each other via the bus 70 and perform communication with each other.
The communication interface 73 is used to enable communication between various modules, devices, units and/or units in embodiments of the application. Communication interface 73 may also enable communication with other components such as: and the external equipment, the image/data acquisition equipment, the database, the external storage, the image/data processing workstation and the like are used for data communication.
Bus 70 includes hardware, software, or both, coupling the components of the device to one another. Bus 70 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, or a local bus. By way of example and not limitation, bus 70 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a front side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a low pin count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus, or a combination of two or more of these. Bus 70 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
Based on the instance segmentation model construction system described above, the electronic device may execute the instance segmentation model construction method of embodiment 1 of the present application.
In addition, in combination with the instance segmentation model construction method of embodiment 1 above, an embodiment of the present application may provide a storage medium for implementation. The storage medium has computer program instructions stored thereon; when executed by a processor, the computer program instructions implement the instance segmentation model construction method of embodiment 1 described above.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between combinations of these technical features, they should be considered within the scope of this description.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (10)

1. An instance segmentation model construction method is characterized by comprising the following steps:
training a monocular depth estimation model by adopting the training framework of a self-supervised monocular depth estimation network;
extracting FPN features with a ResNeXt50 backbone network based on the monocular depth estimation model, and performing group normalization and weight normalization on the backbone network to obtain a preprocessed backbone network;
adding a feature pyramid network into the monocular depth estimation model;
constructing a depth information guide module based on the improved spatial attention;
loading depth information between the feature pyramid network and the preprocessed backbone network through the depth information guide module, so that the extracted depth information is propagated into pyramid features of different scales along the up-sampling and fusion path of the feature pyramid network;
and correcting regions of interest and synthesizing fixed-size regions according to the pyramid features carrying depth information, in combination with the monocular depth estimation model, so as to construct the target instance segmentation model.
2. The instance segmentation model construction method according to claim 1, wherein the self-supervised monocular depth estimation network comprises a monocular depth estimation network and a pose estimation network, wherein the backbone network of the monocular depth estimation network adopts a standard ResNet-50 network.
3. The instance segmentation model construction method according to claim 2, wherein the monocular depth estimation network generates multi-scale relative depth feature maps through depth estimation prediction heads at stages 4, 3, 2 and 1 respectively, wherein the relative depth ranges from 0 to 1.
4. The instance segmentation model construction method according to claim 2, wherein the decoder of the standard ResNet-50 network comprises a series of up-sampling blocks and fusion convolutions with skip connections; the pose estimation network performs several convolutions and activations on the feature map output by the last layer of the encoder of the standard ResNet-50 network so as to predict the transformation matrix parameters between two frames.
5. The instance segmentation model construction method according to claim 1, wherein the feature pyramid network converts the channels of the outputs F1, F2, F3 and F4 of the four stages layer1, layer2, layer3 and layer4 of the ResNeXt50 backbone network, generates pyramid features with Cout channels by up-sampling and fusion, and adds an extra output P5 based on the last-layer output of the ResNeXt50 backbone network.
6. The instance segmentation model construction method according to claim 1, wherein the depth information guide module is configured to perform linear transformation on the features from the ResNeXt50 backbone network and the features from the monocular depth estimation network to obtain their respective subspace representations.
7. The method of claim 1, wherein a learnable weight parameter α is added to the depth information guide module, and α is initialized to 0 at the beginning of the self-supervised monocular depth estimation network training, the depth information guide module then being equivalent to the lateral convolution of the ResNeXt50 backbone network.
8. An instance segmentation model construction system, comprising:
the training module is used for training a monocular depth estimation model by adopting the training framework of a self-supervised monocular depth estimation network;
the preprocessing module is used for extracting FPN features with a ResNeXt50 backbone network based on the monocular depth estimation model, and performing group normalization and weight normalization on the backbone network to obtain a preprocessed backbone network;
the adding module is used for adding the feature pyramid network into the monocular depth estimation model;
a construction module for constructing a depth information guide module based on the improved spatial attention;
the fusion module is used for loading depth information between the feature pyramid network and the preprocessed backbone network through the depth information guide module, so that the extracted depth information is propagated into pyramid features of different scales along the up-sampling and fusion path of the feature pyramid network;
and the model building module is used for correcting regions of interest and synthesizing fixed-size regions according to the pyramid features carrying depth information, in combination with the monocular depth estimation model, so as to construct the target instance segmentation model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the instance segmentation model construction method of any one of claims 1-7 when the computer program is executed.
10. A storage medium having stored thereon a computer program, which when executed by a processor implements the instance segmentation model construction method according to any one of claims 1-7.
CN202310347568.XA 2023-04-03 2023-04-03 Instance segmentation model construction method, system, electronic equipment and storage medium Pending CN116433903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310347568.XA CN116433903A (en) 2023-04-03 2023-04-03 Instance segmentation model construction method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310347568.XA CN116433903A (en) 2023-04-03 2023-04-03 Instance segmentation model construction method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116433903A true CN116433903A (en) 2023-07-14

Family

ID=87092017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310347568.XA Pending CN116433903A (en) 2023-04-03 2023-04-03 Instance segmentation model construction method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116433903A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117314932A (en) * 2023-09-12 2023-12-29 四川大学华西第四医院(四川大学华西职业病医院) Token pyramid-based pancreatic bile duct segmentation method, model and storage medium
CN117314932B (en) * 2023-09-12 2024-06-07 四川大学华西第四医院(四川大学华西职业病医院) Token pyramid-based pancreatic bile duct segmentation method, model and storage medium
CN117292120A (en) * 2023-11-27 2023-12-26 南昌工程学院 Light-weight visible light insulator target detection method and system
CN117292120B (en) * 2023-11-27 2024-02-09 南昌工程学院 Light-weight visible light insulator target detection method and system

Similar Documents

Publication Publication Date Title
US20200250436A1 (en) Video object segmentation by reference-guided mask propagation
US10489913B2 (en) Methods and apparatuses, and computing devices for segmenting object
Žbontar et al. Stereo matching by training a convolutional neural network to compare image patches
US20190042888A1 (en) Training method, training apparatus, region classifier, and non-transitory computer readable medium
EP3211596A1 (en) Generating a virtual world to assess real-world video analysis performance
CN111126359B (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
US11687773B2 (en) Learning method and recording medium
CN116433903A (en) Instance segmentation model construction method, system, electronic equipment and storage medium
CN111914698B (en) Human body segmentation method, segmentation system, electronic equipment and storage medium in image
CN114022799A (en) Self-supervision monocular depth estimation method and device
CN111461212A (en) Compression method for point cloud target detection model
CN110879960B (en) Method and computing device for generating image data set for convolutional neural network learning
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN111768415A (en) Image instance segmentation method without quantization pooling
CN111553414A (en) In-vehicle lost object detection method based on improved Faster R-CNN
US20230281974A1 (en) Method and system for adaptation of a trained object detection model to account for domain shift
CN112613387A (en) Traffic sign detection method based on YOLOv3
CN112183649A (en) Algorithm for predicting pyramid feature map
CN112396657A (en) Neural network-based depth pose estimation method and device and terminal equipment
CN114359361A (en) Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium
Wang et al. Cross-domain learning using optimized pseudo labels: toward adaptive car detection in different weather conditions and urban cities
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection
CN111709377A (en) Feature extraction method, target re-identification method and device and electronic equipment
CN111738069A (en) Face detection method and device, electronic equipment and storage medium
CN111079634A (en) Method, device and system for detecting obstacle in vehicle running and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination