CN113537004A - Double-pyramid multivariate feature extraction network of image, image segmentation method, system and medium - Google Patents


Info

Publication number
CN113537004A
Authority
CN
China
Prior art keywords
feature
features
layer
semantic
pyramid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110747532.1A
Other languages
Chinese (zh)
Other versions
CN113537004B (en)
Inventor
杨大伟
任凤至
毛琳
张汝波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University
Priority to CN202110747532.1A
Publication of CN113537004A
Application granted
Publication of CN113537004B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology

Abstract

A double-pyramid multivariate feature extraction network for images, together with an image segmentation method, system and medium, belongs to the field of deep-learning image processing. The network comprises four input features, an instance feature pyramid, a semantic feature pyramid and two output features, the two output features being the instance feature pyramid output feature and the semantic feature pyramid output feature. The invention solves the problem that traditional feature extraction methods cannot meet the feature requirements of multi-task models: it provides detailed instance target feature information for target recognition tasks and rich semantic logic feature information for semantic analysis tasks, greatly improving the accuracy of multi-task models.

Description

Double-pyramid multivariate feature extraction network of image, image segmentation method, system and medium
Technical Field
The invention belongs to the field of deep-learning image processing, and in particular relates to a double-pyramid multivariate feature extraction method that provides two types of features for a multi-task model.
Background
Digital image analysis plays an important role in modern society, and machine vision has become a key research topic across many industries. The development of machine vision has gradually abandoned traditional hand-designed digital image processing algorithms in favour of deep learning, with the convolutional neural network as its representative, in order to achieve highly accurate analysis results. The patent "Feature extraction model and feature extraction method capable of sufficiently retaining image features" (publication number CN110659653A) proposes a lossless feature extraction operation on input images of arbitrary resolution, to address the problem that the backbone network continually discards feature information, leaving insufficient information for later analysis. The patent "A method for extracting image features by using a low-complexity scale pyramid" (publication number CN108537235A) proposes dividing the five groups of image blocks that form a scale pyramid, generated by filtering an image, into two parts for separate processing, and then merging the two processing results into a final feature point list.
In existing convolutional neural network models, the backbone feature extraction network originates from early image classification networks, and such traditional feature extraction networks are only suitable for frameworks with a single task requirement, such as target detection or semantic segmentation.
However, as the deep-learning computer vision field develops, deep neural networks are increasingly required to integrate multiple tasks. Each task in a multi-task model typically has its own goal, and tasks with different goals place very different demands on the features, so traditional feature extraction methods cannot satisfy them all. In deep-learning multi-task networks, the inability of traditional feature extraction to meet the feature requirements of multiple tasks has therefore become an urgent problem.
Disclosure of Invention
In order to solve the problem that traditional feature extraction methods cannot meet the feature requirements of multi-task models, the invention provides the following technical scheme: a double-pyramid multivariate feature extraction network for an image, consisting of four input features, an instance feature pyramid, a semantic feature pyramid and two output features, wherein the two output features consist of the instance feature pyramid output feature and the semantic feature pyramid output feature.
Further,
the instance feature pyramid consists of four layers of instance features, three upsampling modules, three additive fusion modules, four identical standard convolution layers and one merging module;
the instance feature pyramid constructs its four layers of instance features along a top-down path:
input feature 1 forms the fourth layer of instance features of the instance feature pyramid; one path of the fourth-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 2 and the enlarged fourth-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the third layer of instance features of the instance feature pyramid; one path of the third-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 3 and the upsampled third-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the second layer of instance features of the instance feature pyramid; one path of the second-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 4 and the upsampled second-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the first layer of instance features of the instance feature pyramid, which then passes through a standard convolution layer into the merging module to await merging;
the merging module merges the four sets of instance feature information awaiting merging and outputs the result as the instance feature pyramid output feature, one of the two output features of the double-pyramid multivariate feature extraction network;
the semantic feature pyramid consists of four layers of semantic features, three dilated (atrous) convolution layers, three additive fusion modules, four standard convolution layers and one merging module;
the semantic feature pyramid constructs its four layers of semantic features along a bottom-up path:
input feature 4 forms the first layer of semantic features of the semantic feature pyramid; one path of the first-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 3 and the reduced first-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the second layer of semantic features of the semantic feature pyramid; one path of the second-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 2 and the reduced second-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the third layer of semantic features of the semantic feature pyramid; one path of the third-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 1 and the reduced third-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the fourth layer of semantic features of the semantic feature pyramid, which then passes through a standard convolution layer into the merging module to await merging.
The merging module merges the four sets of semantic feature information awaiting merging and outputs the result as the semantic feature pyramid output feature, the other of the two output features of the double-pyramid multivariate feature extraction network.
Further,
the four input features of the double-pyramid multivariate feature extraction network are four results of coarse feature extraction on the same input image.
Among the four input features, input feature 1 is a three-dimensional matrix of size [256 × 25 × 38]; input feature 2 is a three-dimensional matrix of size [256 × 50 × 76]; input feature 3 is a three-dimensional matrix of size [256 × 100 × 152]; and input feature 4 is a three-dimensional matrix of size [256 × 200 × 304].
The upsampling module in the instance feature pyramid enlarges the size of the feature input to it by a factor of two.
The instance feature pyramid output feature is a three-dimensional matrix of size [256 × 25 × 38].
The dilated convolution layer in the semantic feature pyramid reduces the size of the feature input to it by a factor of two.
The semantic feature pyramid output feature is a three-dimensional matrix of size [256 × 200 × 304].
An image segmentation method comprises the following steps:
Step 1: read a dataset image and perform coarse feature extraction to obtain a three-dimensional matrix of size [256 × 25 × 38] as input feature 1, a three-dimensional matrix of size [256 × 50 × 76] as input feature 2, a three-dimensional matrix of size [256 × 100 × 152] as input feature 3, and a three-dimensional matrix of size [256 × 200 × 304] as input feature 4;
Step 2: pass input features 1, 2, 3 and 4 obtained in step 1 to the instance feature pyramid to obtain an instance target feature matrix of size [256 × 25 × 38];
Step 3: input the instance target feature matrix from step 2 into a region proposal network, then pass it through fully connected and mask generation structures to obtain the segmentation result for the instance targets in the panorama;
Step 4: pass input features 1, 2, 3 and 4 obtained in step 1 to the semantic feature pyramid to obtain a semantic feature matrix of size [256 × 200 × 304];
Step 5: input the semantic feature matrix from step 4 into a fully convolutional structure to obtain the semantic segmentation result for the panorama;
Step 6: merge the instance target segmentation result from step 3 and the semantic segmentation result from step 5 through a panoptic fusion structure to generate the panoptic segmentation result.
A computer system comprising a processor and a memory, wherein the processor executes code stored in the memory to implement the method.
A computer storage medium storing a computer program which, when executed by hardware, implements the method.
Advantageous effects: the invention provides a double-pyramid multivariate feature extraction network that supplies two types of features: it provides detailed instance target feature information for tasks centred on target recognition and rich semantic logic feature information for tasks centred on semantic analysis, and can greatly improve the accuracy of multi-task models. The method is suitable for multi-task integrated models for visual environment perception, such as unmanned driving and mobile robots.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the method
FIG. 2 is a schematic diagram of the instance feature pyramid
FIG. 3 is a schematic diagram of the semantic feature pyramid
FIG. 4 is the panoptic segmentation result of an outdoor scene in Example 1
FIG. 5 is the panoptic segmentation result of an indoor scene in Example 2
FIG. 6 is the panoptic segmentation result of a traffic scene in Example 3
Detailed Description
The invention is described in further detail below with reference to the detailed description and the accompanying drawings:
1. Technical scheme
Deep-learning network tasks fall into two broad categories: first, recognition of targets in an image, including target detection, target tracking and the like; second, semantic analysis of the whole image, including semantic segmentation and the like. To meet the feature requirements of multi-task models, the invention provides a double-pyramid multivariate feature extraction network that supplies two different types of features. The double-pyramid multivariate feature extraction network comprises an instance feature pyramid and a semantic feature pyramid. The instance feature pyramid acquires detailed feature information of instance targets in the image and can be used in fields such as target detection; the semantic feature pyramid acquires coarse feature information such as semantic position in the image, serves semantic analysis, and is suitable for fields such as semantic segmentation.
2. Double-pyramid multivariate feature extraction network
Definition of the double-pyramid multivariate feature extraction network: the network consists of four input features, an instance feature pyramid, a semantic feature pyramid and two output features.
The four input features consist of input feature 1, input feature 2, input feature 3 and input feature 4; the two output features consist of the instance feature pyramid output feature and the semantic feature pyramid output feature.
(1) Instance feature pyramid
Definition 1: the instance feature pyramid consists of four layers of instance features, three upsampling modules, three additive fusion modules, four identical standard convolution layers and one merging module.
In terms of its geometric construction, the instance feature pyramid builds its four layers of instance features along a top-down path, as sketched in the code below.
Input feature 1 forms the fourth layer of instance features of the instance feature pyramid; one path of the fourth-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 2 and the enlarged fourth-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the third layer of instance features of the instance feature pyramid; one path of the third-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 3 and the upsampled third-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the second layer of instance features of the instance feature pyramid; one path of the second-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 4 and the upsampled second-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the first layer of instance features of the instance feature pyramid, which then passes through a standard convolution layer into the merging module to await merging.
The merging module merges the four sets of instance feature information awaiting merging and outputs the result as the instance feature pyramid output feature, one of the two output features of the double-pyramid multivariate feature extraction network.
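As an illustration only, the following minimal PyTorch sketch mirrors the structure just described (the patent names no framework). The 3 × 3 kernels, nearest-neighbour upsampling, and the resize-then-sum behaviour of the merging module are assumptions; the patent fixes only the module counts, the ×2 size changes, and the input/output sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceFeaturePyramid(nn.Module):
    """Top-down pyramid: upsample x2, additive fusion, per-level conv, merge."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Four identical standard convolution layers, one per pyramid level.
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)])

    def forward(self, f1, f2, f3, f4):
        p4 = f1                                      # fourth (top) layer, smallest size
        p3 = f2 + F.interpolate(p4, scale_factor=2)  # upsample x2, additive fusion
        p2 = f3 + F.interpolate(p3, scale_factor=2)
        p1 = f4 + F.interpolate(p2, scale_factor=2)
        levels = [conv(p) for conv, p in zip(self.convs, (p4, p3, p2, p1))]
        # Merging module (assumed): resize every level to the smallest
        # resolution and sum, yielding the [256, 25, 38] output feature.
        size = levels[0].shape[-2:]
        return sum(F.interpolate(l, size=size) for l in levels)
```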
(2) Semantic feature pyramid
Definition 2: the semantic feature pyramid consists of four layers of semantic features, three dilated (atrous) convolution layers, three additive fusion modules, four standard convolution layers and one merging module.
In terms of its geometric construction, the semantic feature pyramid builds its four layers of semantic features along a bottom-up path, as sketched in the code below.
Input feature 4 forms the first layer of semantic features of the semantic feature pyramid; one path of the first-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 3 and the reduced first-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the second layer of semantic features of the semantic feature pyramid; one path of the second-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 2 and the reduced second-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the third layer of semantic features of the semantic feature pyramid; one path of the third-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 1 and the reduced third-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the fourth layer of semantic features of the semantic feature pyramid, which then passes through a standard convolution layer into the merging module to await merging.
The merging module merges the four sets of semantic feature information awaiting merging and outputs the result as the semantic feature pyramid output feature, the other of the two output features of the double-pyramid multivariate feature extraction network.
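Continuing the sketch above (same imports, same caveats), a companion module for the semantic feature pyramid follows. The stride-2, dilation-2, 3 × 3 dilated convolution, which halves each spatial dimension for the sizes given in the constraints below, and the upsample-then-sum merge are assumptions; the patent states only that each dilated convolution layer halves the feature size.

```python
class SemanticFeaturePyramid(nn.Module):
    """Bottom-up pyramid: dilated conv halves size, additive fusion, merge."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Three dilated convolutions; stride 2 halves the feature size and
        # dilation 2 enlarges the receptive field (parameter choices assumed).
        self.dilated = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=2, dilation=2)
             for _ in range(3)])
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)])

    def forward(self, f1, f2, f3, f4):
        s1 = f4                        # first (bottom) layer, largest size
        s2 = f3 + self.dilated[0](s1)  # halve size, additive fusion
        s3 = f2 + self.dilated[1](s2)
        s4 = f1 + self.dilated[2](s3)
        levels = [conv(s) for conv, s in zip(self.convs, (s1, s2, s3, s4))]
        # Merging module (assumed): upsample every level to the largest
        # resolution and sum, yielding the [256, 200, 304] output feature.
        size = levels[0].shape[-2:]
        return sum(F.interpolate(l, size=size) for l in levels)
```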
3. Constraints
(1) The four input features of the double-pyramid multivariate feature extraction network are four results of coarse feature extraction on the same input image.
(2) Among the four input features, input feature 1 is a three-dimensional matrix of size [256 × 25 × 38]; input feature 2 is a three-dimensional matrix of size [256 × 50 × 76]; input feature 3 is a three-dimensional matrix of size [256 × 100 × 152]; and input feature 4 is a three-dimensional matrix of size [256 × 200 × 304].
(3) The upsampling module in the instance feature pyramid enlarges the size of the feature input to it by a factor of two.
(4) The instance feature pyramid output feature is a three-dimensional matrix of size [256 × 25 × 38].
(5) The dilated convolution layer in the semantic feature pyramid reduces the size of the feature input to it by a factor of two.
(6) The semantic feature pyramid output feature is a three-dimensional matrix of size [256 × 200 × 304].
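A quick sanity check of constraints (1)-(6) against the sketches above, feeding random tensors of the stated sizes in place of the coarsely extracted backbone features (batch dimension added; illustrative only):

```python
f1 = torch.randn(1, 256, 25, 38)    # input feature 1, constraint (2)
f2 = torch.randn(1, 256, 50, 76)    # input feature 2
f3 = torch.randn(1, 256, 100, 152)  # input feature 3
f4 = torch.randn(1, 256, 200, 304)  # input feature 4

inst = InstanceFeaturePyramid()(f1, f2, f3, f4)
sem = SemanticFeaturePyramid()(f1, f2, f3, f4)
assert tuple(inst.shape[1:]) == (256, 25, 38)    # constraint (4)
assert tuple(sem.shape[1:]) == (256, 200, 304)   # constraint (6)
```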
4. Principle analysis
The four input features of the double-pyramid multivariate feature extraction network are different feature forms obtained by coarse extraction from the same image. Input feature 1 is the smallest in size and contains the richest instance target feature information, the richness decreasing in turn through input features 2, 3 and 4; input feature 4 is the largest in size and contains the richest semantic logic feature information, the richness decreasing in turn through input features 3, 2 and 1.
(1) The instance feature pyramid has rich instance target features
The instance feature pyramid takes input feature 1 as its topmost (fourth) layer of instance features, obtaining the most detailed instance target feature information; it then enlarges the features and passes them on layer by layer, so that the instance target features are continuously strengthened and become more salient, before being stored and output through the merging module.
(2) The semantic feature pyramid has rich semantic features
The semantic feature pyramid takes input feature 4 as its first layer of semantic features, obtaining the richest semantic logic feature information, then feeds the features into dilated convolution layers for scale transformation and passes them on layer by layer. By enlarging the receptive field of the convolution kernel, the dilated convolution layers strengthen characteristics such as positional logic in the image, so the first-layer semantic features, together with the second- and third-layer semantic features produced as the pyramid propagates, are further enhanced; the semantic features are thus continuously strengthened and become more salient, before being stored and output through the merging module.
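The receptive-field effect of dilation can be made concrete: a k × k kernel with dilation d spans k + (k - 1)(d - 1) positions per axis, so the 3 × 3, dilation-2 kernel assumed in the sketch above covers a 5 × 5 window while keeping only 3 × 3 parameters.

```python
def effective_kernel(k: int, d: int) -> int:
    """Span, per axis, of a k-by-k convolution kernel with dilation d."""
    return k + (k - 1) * (d - 1)

print(effective_kernel(3, 1))  # 3: standard convolution
print(effective_kernel(3, 2))  # 5: the dilation assumed in the sketch above
print(effective_kernel(3, 4))  # 9: wider context at the same parameter count
```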
5. Advantageous effects
(1) Provides two types of features
The invention can provide two different types of features for a multi-task model: rich instance feature information for tasks that centre on recognizing instance targets, such as target detection, and specific semantic feature information for tasks that centre on global semantic analysis, such as semantic segmentation.
(2) Suits multi-task models
Each task in a multi-task network model typically has a specific goal, and tasks with different goals place very different demands on the features. The two types of features provided by the invention can satisfy the differing feature requirements of multi-task models.
(3) Suits panoptic segmentation models
As a multi-task integrated network model, panoptic segmentation must achieve two different task targets: semantic segmentation of the panorama and instance segmentation of the instance targets within it. The two types of features provided by the invention meet the panoptic segmentation model's demands for feature information well. The semantic features needed by the semantic segmentation task, which supply positional logic information for segmentation, can be provided by the semantic feature pyramid of the invention; the instance target features needed by the instance segmentation task, which supply instance target detail for segmentation, can be provided by the instance feature pyramid of the invention. The double-pyramid multivariate feature extraction network supplies rich and comprehensive image features to the panoptic segmentation model and can greatly improve panoptic segmentation accuracy.
(4) Suits unmanned driving technology
The invention is a computer-vision environment perception technology suitable for the unmanned driving field. It can extract instance target information for pedestrians, vehicles, buildings and the like in the driving environment, together with semantic position information for the whole driving environment, supplying comprehensive feature information to the network model and providing an important safety guarantee for normal driving.
(5) Suits public traffic monitoring systems
The method effectively recognizes pedestrians, vehicles and the road environment, meets the demands of road traffic scenes, and offers drivers an aid to safe driving. With the accuracy and speed of the invention, feature information can be effectively extracted for vehicles violating regulations, pedestrians ignoring traffic rules, and accidents in the traffic environment, providing favourable conditions for subsequent recognition work and improving the efficiency of public monitoring systems.
The logic of the method is shown schematically in FIG. 1. The specific implementation steps of the algorithm are as follows (see also the pipeline sketch after this list):
Step 1: read a dataset image and perform coarse feature extraction through an arbitrary feature network to obtain a three-dimensional matrix of size [256 × 25 × 38] as input feature 1, a three-dimensional matrix of size [256 × 50 × 76] as input feature 2, a three-dimensional matrix of size [256 × 100 × 152] as input feature 3, and a three-dimensional matrix of size [256 × 200 × 304] as input feature 4;
Step 2: pass input features 1, 2, 3 and 4 obtained in step 1 to the instance feature pyramid to obtain an instance target feature matrix of size [256 × 25 × 38];
Step 3: input the instance target feature matrix from step 2 into a region proposal network, then pass it through fully connected and mask generation structures to obtain the segmentation result for the instance targets in the panorama;
Step 4: pass input features 1, 2, 3 and 4 obtained in step 1 to the semantic feature pyramid to obtain a semantic feature matrix of size [256 × 200 × 304];
Step 5: input the semantic feature matrix from step 4 into a fully convolutional structure to obtain the semantic segmentation result for the panorama;
Step 6: merge the instance target segmentation result from step 3 and the semantic segmentation result from step 5 through a panoptic fusion structure to generate the panoptic segmentation result.
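Steps 1-6 can be condensed into the high-level sketch below. The backbone, the region-proposal/mask head, the fully convolutional semantic head and the panoptic fusion structure are hypothetical stand-ins, since the patent does not specify their internals; a Mask R-CNN-style instance branch and an FCN-style semantic branch are assumed.

```python
def panoptic_segment(image, backbone, inst_pyr, sem_pyr,
                     rpn_mask_head, fcn_head, panoptic_fusion):
    f1, f2, f3, f4 = backbone(image)           # step 1: coarse feature extraction
    inst_feat = inst_pyr(f1, f2, f3, f4)       # step 2: [256, 25, 38] instance features
    inst_seg = rpn_mask_head(inst_feat)        # step 3: RPN + FC + mask generation
    sem_feat = sem_pyr(f1, f2, f3, f4)         # step 4: [256, 200, 304] semantic features
    sem_seg = fcn_head(sem_feat)               # step 5: fully convolutional semantic result
    return panoptic_fusion(inst_seg, sem_seg)  # step 6: fused panoptic result
```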
Example 1:
In this example, an outdoor activity scene is input into the network model and all objects in the outdoor scene undergo panoptic segmentation. The outdoor scene panoptic segmentation result is shown in FIG. 4.
Example 2:
In this example, an indoor living scene is input into the network model and all objects in the indoor scene undergo panoptic segmentation. The indoor scene panoptic segmentation result is shown in FIG. 5.
Example 3:
In this example, a road traffic scene is input into the network model, and instance targets such as pedestrians and vehicles, together with non-instance targets such as roads and sky, undergo panoptic segmentation. The traffic scene panoptic segmentation result is shown in FIG. 6.

Claims (6)

1. A double-pyramid multivariate feature extraction network of an image, characterized by comprising four input features, an instance feature pyramid, a semantic feature pyramid and two output features, wherein the two output features comprise the instance feature pyramid output feature and the semantic feature pyramid output feature.
2. The double-pyramid multivariate feature extraction network of an image according to claim 1, wherein:
the instance feature pyramid consists of four layers of instance features, three upsampling modules, three additive fusion modules, four identical standard convolution layers and one merging module;
the instance feature pyramid constructs its four layers of instance features along a top-down path:
input feature 1 forms the fourth layer of instance features of the instance feature pyramid; one path of the fourth-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 2 and the enlarged fourth-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the third layer of instance features of the instance feature pyramid; one path of the third-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 3 and the upsampled third-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the second layer of instance features of the instance feature pyramid; one path of the second-layer instance features then enters an upsampling module for size enlargement, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 4 and the upsampled second-layer instance features enter an additive fusion module together for feature fusion; the fusion result forms the first layer of instance features of the instance feature pyramid, which then passes through a standard convolution layer into the merging module to await merging;
the merging module merges the four sets of instance feature information awaiting merging and outputs the result as the instance feature pyramid output feature, one of the two output features of the double-pyramid multivariate feature extraction network;
the semantic feature pyramid consists of four layers of semantic features, three dilated (atrous) convolution layers, three additive fusion modules, four standard convolution layers and one merging module;
the semantic feature pyramid constructs its four layers of semantic features along a bottom-up path:
input feature 4 forms the first layer of semantic features of the semantic feature pyramid; one path of the first-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 3 and the reduced first-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the second layer of semantic features of the semantic feature pyramid; one path of the second-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 2 and the reduced second-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the third layer of semantic features of the semantic feature pyramid; one path of the third-layer semantic features then enters a dilated convolution layer for size reduction, while the other path passes through a standard convolution layer into the merging module to await merging;
input feature 1 and the reduced third-layer semantic features enter an additive fusion module together for feature fusion; the fusion result forms the fourth layer of semantic features of the semantic feature pyramid, which then passes through a standard convolution layer into the merging module to await merging;
the merging module merges the four sets of semantic feature information awaiting merging and outputs the result as the semantic feature pyramid output feature, the other of the two output features of the double-pyramid multivariate feature extraction network.
3. The double-pyramid multivariate feature extraction network of an image according to claim 1, wherein:
the four input features of the double-pyramid multivariate feature extraction network are four results of coarse feature extraction on the same input image;
among the four input features, input feature 1 is a three-dimensional matrix of size [256 × 25 × 38]; input feature 2 is a three-dimensional matrix of size [256 × 50 × 76]; input feature 3 is a three-dimensional matrix of size [256 × 100 × 152]; and input feature 4 is a three-dimensional matrix of size [256 × 200 × 304];
the upsampling module in the instance feature pyramid enlarges the size of the feature input to it by a factor of two;
the instance feature pyramid output feature is a three-dimensional matrix of size [256 × 25 × 38];
the dilated convolution layer in the semantic feature pyramid reduces the size of the feature input to it by a factor of two;
the semantic feature pyramid output feature is a three-dimensional matrix of size [256 × 200 × 304].
4. An image segmentation method, characterized by comprising the following steps:
Step 1: read a dataset image and perform coarse feature extraction to obtain a three-dimensional matrix of size [256 × 25 × 38] as input feature 1, a three-dimensional matrix of size [256 × 50 × 76] as input feature 2, a three-dimensional matrix of size [256 × 100 × 152] as input feature 3, and a three-dimensional matrix of size [256 × 200 × 304] as input feature 4;
Step 2: pass input features 1, 2, 3 and 4 obtained in step 1 to the instance feature pyramid to obtain an instance target feature matrix of size [256 × 25 × 38];
Step 3: input the instance target feature matrix from step 2 into a region proposal network, then pass it through fully connected and mask generation structures to obtain the segmentation result for the instance targets in the panorama;
Step 4: pass input features 1, 2, 3 and 4 obtained in step 1 to the semantic feature pyramid to obtain a semantic feature matrix of size [256 × 200 × 304];
Step 5: input the semantic feature matrix from step 4 into a fully convolutional structure to obtain the semantic segmentation result for the panorama;
Step 6: merge the instance target segmentation result from step 3 and the semantic segmentation result from step 5 through a panoptic fusion structure to generate the panoptic segmentation result.
5. A computer system comprising a processor and a memory, wherein the processor executes code stored in the memory to implement the method of claim 4.
6. A computer storage medium storing a computer program which, when executed by hardware, implements the method of claim 4.
CN202110747532.1A 2021-07-01 2021-07-01 Image double pyramid multi-element feature extraction network, image segmentation method, system and medium Active CN113537004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110747532.1A CN113537004B (en) 2021-07-01 2021-07-01 Image double pyramid multi-element feature extraction network, image segmentation method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110747532.1A CN113537004B (en) 2021-07-01 2021-07-01 Image double pyramid multi-element feature extraction network, image segmentation method, system and medium

Publications (2)

Publication Number Publication Date
CN113537004A 2021-10-22
CN113537004B CN113537004B (en) 2023-09-01

Family

ID=78097593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110747532.1A Active CN113537004B (en) 2021-07-01 2021-07-01 Image double pyramid multi-element feature extraction network, image segmentation method, system and medium

Country Status (1)

Country Link
CN (1) CN113537004B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190057507A1 (en) * 2017-08-18 2019-02-21 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
US20200334819A1 (en) * 2018-09-30 2020-10-22 Boe Technology Group Co., Ltd. Image segmentation apparatus, method and relevant computing device
CN110084274A (en) * 2019-03-29 2019-08-02 南京邮电大学 Realtime graphic semantic segmentation method and system, readable storage medium storing program for executing and terminal
CN111524150A (en) * 2020-07-03 2020-08-11 支付宝(杭州)信息技术有限公司 Image processing method and device
CN112232232A (en) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 Target detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姜世浩; 齐苏敏; 王来花; 贾惠: "Instance segmentation based on Mask R-CNN and multi-feature fusion" (基于Mask R-CNN和多特征融合的实例分割), 计算机技术与发展 (Computer Technology and Development), no. 09

Also Published As

Publication number Publication date
CN113537004B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
Min et al. Traffic sign recognition based on semantic scene understanding and structural traffic sign location
He et al. Rail transit obstacle detection based on improved CNN
WO2023030182A1 (en) Image generation method and apparatus
CN115439483B (en) High-quality welding seam and welding seam defect identification system, method and storage medium
CN112101153A (en) Remote sensing target detection method based on receptive field module and multiple characteristic pyramid
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN113128476A (en) Low-power consumption real-time helmet detection method based on computer vision target detection
CN115588126A (en) GAM, CARAFE and SnIoU fused vehicle target detection method
Song et al. Msfanet: A light weight object detector based on context aggregation and attention mechanism for autonomous mining truck
Pham Semantic road segmentation using deep learning
CN115019274A (en) Pavement disease identification method integrating tracking and retrieval algorithm
Yuan et al. Multi-level object detection by multi-sensor perception of traffic scenes
CN113537004A (en) Double-pyramid multivariate feature extraction network of image, image segmentation method, system and medium
Xiang et al. A real-time vehicle traffic light detection algorithm based on modified YOLOv3
Feng et al. Embedded YOLO: A real-time object detector for small intelligent trajectory cars
CN116229410A (en) Lightweight neural network road scene detection method integrating multidimensional information pooling
Wei et al. An Efficient Point Cloud-based 3D Single Stage Object Detector
Valiente et al. Robust perception and visual understanding of traffic signs in the wild
Lai et al. Aircraft Target Detection Based on Attention Mechanism and Faster R-CNN
CN117152646B (en) Unmanned electric power inspection AI light-weight large model method and system
Cheng Global-feature enhanced network for fast semantic segmentation
Wang et al. YOLOv5-Based Dense Small Target Detection Algorithm for Aerial Images Using DIOU-NMS
Vaidya et al. Detecting Buildings from Remote Sensing Imagery: Unleashing the Power of YOLOv5 and YOLOv8
CN116503838A (en) Traffic sign detection algorithm based on feature multi-scale fusion
CN117292335A (en) Dangerous chemical vehicle detection method based on improved YOLOv5 algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant