CN113592894B - Image segmentation method based on bounding box and co-occurrence feature prediction
- Publication number: CN113592894B
- Application number: CN202110999575.9A
- Authority: CN (China)
- Legal status: Active
Classifications
- G06T 7/194 — Segmentation; Edge detection involving foreground-background segmentation
- G06F 18/22 — Pattern recognition; Matching criteria, e.g. proximity measures
- G06F 18/253 — Pattern recognition; Fusion techniques of extracted features
- G06N 3/045 — Neural networks; Combinations of networks
- G06N 3/048 — Neural networks; Activation functions
- G06N 3/08 — Neural networks; Learning methods
- G06T 7/181 — Segmentation; Edge detection involving edge growing; involving edge linking
- G06T 7/62 — Analysis of geometric attributes of area, perimeter, diameter or volume
- G06T 7/90 — Determination of colour characteristics
- G06T 2207/20016 — Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
- G06T 2207/20081 — Training; Learning
Abstract
The invention relates to an image segmentation method based on bounding boxes and co-occurrence feature prediction. First, the detection-box range of the subject is determined using the FCOS method together with a multi-scale strategy and the centrality idea. Second, in the subject-determination stage, rich context features are produced by a context-aware pyramid feature extraction module; a channel attention mechanism module applied after the context-aware pyramid feature maps is combined with a spatial attention mechanism module applied after the low-level feature maps, and cross-entropy loss is used to supervise the generation of salient boundary localization information. The co-occurrence features of the image are then predicted: the similarity between co-occurrence features and target features is measured to learn the co-occurrence probability, and context prior information is added to enhance robustness in complex scenes. Finally, an edge-refinement stage segments the target accurately.
Description
Technical Field
The invention relates to the technical field of information, in particular to an image segmentation method based on boundary boxes and co-occurrence feature prediction.
Background
Image segmentation is the technique and process of dividing an image into a number of specific regions with distinctive properties and extracting the objects of interest. It is a key step from image processing to image analysis. Existing image segmentation methods fall mainly into the following categories: threshold-based methods, region-based methods, edge-based methods, methods based on specific theories, and so on. Image segmentation has been studied in many areas; for example, it serves as an effective tool for scholars in psychology, physiology, computer science and other fields to study visual perception, and image processing is in ever-growing demand in large-scale applications such as the military, remote sensing and meteorology. Although its potential applications are broad, most image segmentation methods still show shortcomings in practice, such as semantically ambiguous segmentation results and inaccurate segmentation boundaries.
Early algorithms mostly extracted low-level feature information of images to perform foreground segmentation. Low-level features include the textures, colors, positions and object contours of an image, but they have strong limitations; for example, different object classes with similar colors or textures in an image cannot be segmented effectively and accurately. With the development of neural networks, researchers gradually added high-level semantic features on this basis, for example training and predicting images with a fully convolutional network (FCN) and a deeply supervised network (DSN), so that images can be learned from and features extracted at multiple scales and multiple levels, or extracting high-level features with a VGG network and connecting high-level and low-level features to judge the saliency of each region of the image. However, these methods cannot make good use of the context information in the image; even when an algorithm integrates global context features, unreliable results are obtained in complex scenes because the contextual dependencies cannot be clarified. Moreover, objects of the same class in a picture are not necessarily located in spatially coherent regions. For this problem, researchers have studied the co-occurrence features between targets to improve the robustness of object recognition, i.e., identifying the class of an object from feature information that appears together with the target feature. In summary, using neural network technology to obtain high-level semantic feature information on top of low-level feature information such as color and texture, and integrating co-occurrence features that contain context prior information, constitutes a better image segmentation approach at present.
As image segmentation technology keeps advancing, its accuracy and precision improve, allowing it to be applied better in everyday practice; this in turn relies on the continued development and updating of the field of computer vision.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an image segmentation method based on bounding boxes and co-occurrence feature prediction. When determining the subject part, FCOS, a multi-scale strategy and the centrality method are introduced: bounding-box prediction is performed for all points inside the target object box, and the centrality method is applied to suppress low-quality detected bounding boxes, which alleviates the poor results obtained on pictures where the subject and background have similar colors or similarly complex features. The subject-determination segmentation method confirms the image foreground subject by combining low-level and high-level feature information, and uses cross-entropy loss to supervise the generation of salient boundary localization information. The edge-refinement method combines the image information obtained by spectral matting with the high-level semantic information obtained by a convolutional neural network, forms a Laplacian matrix through a graph structure and obtains its eigenvectors, so that edge details and characteristics are emphasized; in addition, the co-occurrence feature model is integrated to improve the accuracy of semantic segmentation. Finally, the boundary information obtained from subject determination is point-multiplied with the eigenvectors of the Laplacian matrix to obtain the resulting layers. The invention can recover edge details more accurately on the basis of the confirmed subject contour and improves the accuracy of foreground segmentation. By introducing high-level feature information, the universality of the application scenarios of image segmentation is improved, and foreground segmentation on complex backgrounds becomes easier to handle.
The invention provides the following technical scheme:
an image segmentation method based on bounding box and co-occurrence feature prediction is characterized by comprising the following steps:
1) Introducing an FCOS method, and determining a main body by using a multi-scale strategy and a centrality method:
1.1 Inputting an image and a real boundary box, determining and outputting a main body target boundary box range by using a multi-scale strategy and a centrality method based on a first-order full convolution target detection FCOS method;
1.2 ) Inputting an image based on the subject target bounding-box range, and adopting the context-aware pyramid feature extraction module CPFE for multi-scale high-level feature mapping to obtain rich context features; CPFE takes Conv3-3, Conv4-3 and Conv5-3 of the VGG-16 network architecture as its basic high-level features, and VGG-16 comprises 13 convolution layers and three fully connected layers;
1.3 After the context-aware pyramid features are extracted, a channel attention mechanism model is added and used, so that channels which show high response to the salient objects are assigned with larger weights; weighting the context awareness pyramid features by using a channel attention mechanism model to output new high-level features;
1.4 Acquiring low-level features of natural images, taking Conv1-2 and Conv2-2 in a VGG-16 network architecture as basic low-level features as input, and focusing on boundaries between salient objects and backgrounds by adopting a spatial attention mechanism model, so as to be beneficial to generating effective low-level features for containing boundary information;
1.5 Fusing the high-level features weighted by the channel attention mechanism model and the low-level feature output weighted by the space attention mechanism model, and supervising the generation of the significant boundary positioning information by using cross entropy loss; according to the positioning information, outputting a gray level image of the foreground outline of the image;
2) Co-occurrence feature classification:
firstly, an input image is given and a convolutional feature map is extracted with a pre-trained CNN; the pipeline then splits into three branches: first, feedforward networks φ(·) and ψ(·) convert the feature map into target-vector and co-occurrence-vector representations, and the probability of the co-occurrence features is estimated from their similarity; second, the context weighted by the co-occurrence probability is captured through the aggregate co-occurrence feature module ACF; third, global characteristics are captured with average pooling;
3) Edge refinement:
3.1 ) Firstly, after inputting an image, collect the image information features; the information features of an image come mainly from two sources: first, based on spectral matting, non-local color relation information is obtained from the input image from the perspective of spectral analysis; second, a convolutional neural network for scene analysis is used to generate high-level semantic relation information;
3.2 Combining non-local color relation information and advanced semantic relation information of the image, establishing an image layer, and revealing semantic objects and soft transition relations among the semantic objects in feature vectors of the Laplace matrix;
3.3 ) Extract the eigenvectors corresponding to the 100 smallest eigenvalues of the Laplacian matrix L, and then cluster the eigenvectors with k-means; after the edge-refinement process is finished, output the image layers formed from the Laplacian matrix;
4) Determining a main body result, a co-occurrence feature classification result and an edge precision result, and fusing:
4.1 ) First, process the result of the subject-determination part: binarize the gray-level map obtained from the output of step 1.5), keeping the subject contour and the salient white subject region;
4.2 ) Then, process the result of the co-occurrence features: obtain from the co-occurrence results a matrix corresponding to each class and point-multiply it with the gray-level map to determine the subject class; the intersection of each class's result matrix with the gray-level map is computed in turn, the class with the largest intersection among all classes is taken as the subject class, and the intersection between that class and the gray-level map is kept as co_map;
4.3 ) Then, process the result of the edge-refinement part: traverse the whole matrix set, find for each pixel the class matrix in which its transparency is largest, regard the pixel as belonging to that class, and set its transparency in all other matrices to 0;
4.4 ) Finally, point-multiply each matrix in the layer set output in step 4.3) with the gray-level map of the saliency detection result, determine which classes are to be kept in full and which intersection parts are to be kept, and combine all kept recorded parts to obtain the final required foreground subject part;
4.5 After the fusion process is finished, outputting a foreground part of the image.
Further, the specific process of the step 1.1) is as follows:
1.1.1 First, performing pixel-by-pixel regression prediction by using an FCOS method, wherein the prediction strategy is as follows:
The bounding box is $B_i=\left(x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}\right)$, where $\left(x_0^{(i)}, y_0^{(i)}\right)$ are the coordinates of the upper-left corner of the bounding box and $\left(x_1^{(i)}, y_1^{(i)}\right)$ the coordinates of the lower-right corner, with i denoting the i-th bounding box in the image; the pixel position is (m, n) and the vector a* = (l*, t*, r*, b*), where l*, t*, r*, b* denote the distances from the pixel to the left, top, right and bottom sides of the bounding box, respectively; after associating position (m, n) with bounding box B_i, the training regression target for that position can be expressed as:

$$l^{*}=m-x_0^{(i)},\qquad t^{*}=n-y_0^{(i)},\qquad r^{*}=x_1^{(i)}-m,\qquad b^{*}=y_1^{(i)}-n$$

Since the FCOS algorithm performs pixel-by-pixel regression based on the points inside the target object box, the regression targets are all positive samples, so the exp function is used to stretch them, i.e. the l*, t*, r*, b* obtained from the above formula are fed as variables into the exp function, and the output values are the stretched distances from the corresponding pixel to the left, top, right and bottom sides of the bounding box;
1.1.2 ) Secondly, a multi-scale strategy is used to resolve the hard-to-handle ambiguity caused by overlapping ground-truth boxes, i.e. which bounding box a position inside the overlap should regress to; to make better use of multi-scale features, the feature layer at each scale limits the range of the bounding-box regression, with the following specific steps:
1.1.2.1): Five levels of feature maps {P1, P2, P3, P4, P5} are defined to support training with the multi-scale strategy, each level satisfying different parameters H × W / s, where H and W are the height and width of the feature map and s is its downsampling rate; the parameters of P1 are 100 × 128 / 8, of P2 50 × 64 / 16, of P3 25 × 32 / 32, of P4 13 × 16 / 64, and of P5 7 × 8 / 128; the regression targets l*, t*, r*, b* are then computed in each layer;
1.1.2.2): Check whether max(l*, t*, r*, b*) > d_i or max(l*, t*, r*, b*) < d_{i−1} holds, where max denotes taking the maximum value; d_i is the maximum distance that feature level i needs to regress, corresponding to the five feature maps {P1, P2, P3, P4, P5} defined in the first step, i.e. 5 feature levels with i taking the values 1, 2, 3, 4, 5; d_0, d_1, d_2, d_3, d_4, d_5 are set to 0, 64, 128, 256, 512 and ∞ respectively;
1.1.2.3): If the judgment formula in step 1.1.2.2) is satisfied, the position is set as a negative sample and no bounding-box regression prediction is performed for it;
1.1.3 ) Finally, the centrality method is used; because the FCOS algorithm adopts a pixel-by-pixel regression strategy, it raises recall but also produces many low-quality center points and heavily offset predicted bounding boxes, so the centrality method is used to suppress these low-quality detected bounding boxes, and the strategy introduces no hyper-parameters; the centrality method adds a loss to the network that keeps the predicted bounding box as close to the center as possible:

$$centerness^{*}=\sqrt{\frac{\min(l^{*},r^{*})}{\max(l^{*},r^{*})}\times\frac{\min(t^{*},b^{*})}{\max(t^{*},b^{*})}}$$

where max denotes taking the maximum value and min denotes taking the minimum value;
From the above formula it can be seen that centrality is a quantity with a measuring function whose value lies between 0 and 1; the centerness* in the above formula is computed first, and the obtained centerness* is then trained with the cross-entropy loss; the cross-entropy loss is a function used in mathematics to measure the difference between two probability distributions, and it can measure the information loss between a predicted value and the true label value; the cross-entropy loss function Loss is expressed as:

$$Loss=-\sum_{i=1}^{n} e_{i}\,\log \hat{e}_{i}$$

where n indicates that there are n groups of labels, e_i is the probability distribution of the i-th true label, and ê_i is the probability distribution predicted for the i-th label;

centerness* is fed into the above formula as the predicted value ê_i; when centerness* ∈ [0.5, 1], e_i takes 1, and when centerness* ∈ [0, 0.5), e_i takes 0, and the Loss obtained after each training round is computed; when Loss is smaller than a given threshold, i.e. the information loss between the predicted value and the true label value meets the expected requirement, training ends; when Loss is larger than the given threshold, l*, t*, r*, b* are reduced, i.e. the bounding box is moved closer to the center, and centerness* and the corresponding Loss value are recomputed;
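As a concrete illustration of steps 1.1.1)–1.1.3), the following Python/NumPy sketch computes the regression targets l*, t*, r*, b* for a pixel inside a box, assigns the pixel to a feature level using the thresholds d_0 … d_5, and evaluates the centrality value together with a cross-entropy loss. It is a minimal sketch under the assumptions stated in the comments, not the patented implementation; all function and variable names are illustrative.

```python
import numpy as np

# Maximum regression distances d_0..d_5 for feature levels P1..P5 (step 1.1.2.2)
LEVEL_BOUNDS = [0, 64, 128, 256, 512, np.inf]

def regression_target(m, n, box):
    """Distances from pixel (m, n) to the four sides of box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return np.array([m - x0, n - y0, x1 - m, y1 - n], dtype=np.float32)

def assign_level(target):
    """Index i of the feature level Pi responsible for this target, or None
    if the pixel is treated as a negative sample (step 1.1.2.3)."""
    m = target.max()
    for i in range(1, 6):
        if LEVEL_BOUNDS[i - 1] <= m <= LEVEL_BOUNDS[i]:
            return i
    return None

def centerness(target):
    """Centrality value in [0, 1]; close to 1 when the pixel is near the box centre."""
    l, t, r, b = target
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def bce_loss(pred, label, eps=1e-7):
    """Cross-entropy loss between predicted centerness and its 0/1 label."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(label * np.log(pred) + (1 - label) * np.log(1 - pred))

# Example: a pixel at (120, 80) inside a ground-truth box
t = regression_target(120, 80, box=(60, 30, 220, 160))
level = assign_level(t)              # which of P1..P5 regresses this pixel
c = centerness(t)
label = 1.0 if c >= 0.5 else 0.0     # thresholding described in step 1.1.3
loss = bce_loss(pred=c, label=label)
```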
further, the specific process of the step 1.2) is as follows:
1.2.1 ) In order for the finally extracted high-level features to have scale-invariance and shape-invariance properties, atrous (dilated) convolutions with different dilation rates are adopted, the dilation rates being set to 3, 5 and 7 to capture context information;
1.2.2 By cross-channel connection, stitching feature maps from different porous convolutional layers with 1 x 1 dimension-reduction features; then, three different-scale features are obtained by using context sensing information, the three different-scale features are combined in pairs, and each two smaller-scale features are up-sampled to obtain larger-scale features, so that three-scale advanced features are output; the up-sampling is also called image interpolation, namely new elements are inserted between pixel points by adopting an interpolation algorithm on the basis of original image pixels, so that an original image is amplified;
1.2.3 Finally, the up-sampled advanced features are combined through cross-channel connection to be used as the output of the context sensing pyramid feature extraction module.
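A minimal PyTorch sketch of the context-aware pyramid feature extraction of steps 1.2.1)–1.2.3): atrous convolutions with dilation rates 3, 5 and 7 plus a 1×1 dimension-reduction branch are applied to each high-level VGG-16 feature map, concatenated across channels, and the smaller-scale outputs are upsampled before the final cross-channel concatenation. Channel counts and layer names are illustrative assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFE(nn.Module):
    """Context feature extraction on one VGG-16 feature map (e.g. Conv3-3/4-3/5-3)."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)                       # 1x1 dimension reduction
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=3, dilation=3)
        self.branch5 = nn.Conv2d(in_ch, out_ch, 3, padding=5, dilation=5)
        self.branch7 = nn.Conv2d(in_ch, out_ch, 3, padding=7, dilation=7)

    def forward(self, x):
        # Cross-channel concatenation of the 1x1 and atrous branches
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch7(x)], dim=1)

class CPFE(nn.Module):
    """Combine CFE outputs of Conv3-3, Conv4-3, Conv5-3 into one high-level feature."""
    def __init__(self, ch3=256, ch4=512, ch5=512):
        super().__init__()
        self.cfe3, self.cfe4, self.cfe5 = CFE(ch3), CFE(ch4), CFE(ch5)

    def forward(self, c3, c4, c5):
        f3, f4, f5 = self.cfe3(c3), self.cfe4(c4), self.cfe5(c5)
        # Upsample the smaller-scale features to the Conv3-3 resolution
        f4 = F.interpolate(f4, size=f3.shape[2:], mode='bilinear', align_corners=False)
        f5 = F.interpolate(f5, size=f3.shape[2:], mode='bilinear', align_corners=False)
        return torch.cat([f3, f4, f5], dim=1)   # output of the CPFE module
```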
Further, the specific process of the step 1.3) is as follows:
1.3.1 ) First, the high-level feature $f_h \in R^{W\times H\times C}$ obtained after context-aware pyramid feature extraction is unfolded into $F=[f_1, f_2, \ldots, f_C]$, where $f_c$ denotes the c-th channel of the high-level feature $f_h$, R denotes the set of spatial positions, W the width of the dimension, H the height of the dimension and C the total number of channels; average pooling is then applied to F to obtain the channel feature vector $v_h$, where the purpose of average pooling is to reduce the error caused by the increased variance of the estimate due to the limited neighborhood size, which helps preserve more background information of the image; the result of average pooling $S_j$ is:

$$S_{j}=\frac{1}{T}\sum_{i\in R_{j},\; r_{i}\le T} g_{i}$$

where T denotes the rank threshold selecting which activation values participate in pooling, R_j is the pooling region in the j-th feature map, i is the index of an activation value within this pooling region, and r_i and g_i are the rank and the activation value of activation i, respectively;
1.3.2 Then, the channel characteristic vector v obtained in the last step h Outputting to the ReLU layer through the full connection layer FC;
1.3.3 ) The channel feature vector is then mapped into [0, 1] using a Sigmoid operation, completing the normalization and producing the ca value, i.e. a weight for each channel of the high-level features; therefore $ca=F(v_h,W)=\sigma_1\big(fc_2(\delta(fc_1(v_h,W_1)),W_2)\big)$, where W_1, W_2 are parameters of the channel attention mechanism, σ_1 denotes the sigmoid operation, fc denotes a fully connected layer and δ the ReLU function; the fully connected layer is the computational layer that acts as a classifier in a convolutional neural network, and the ReLU layer is a computational layer containing the ReLU function, an activation function commonly used in artificial neural networks;
1.3.4 ) Finally, the high-level feature weighted by the channel attention over the context-aware pyramid features is output as $\hat f_h = ca \cdot f_h$, where · denotes the dot (element-wise) product.
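A minimal PyTorch sketch of the channel attention of steps 1.3.1)–1.3.4): average pooling over the spatial dimensions gives the channel vector v_h, two fully connected layers with a ReLU in between and a final sigmoid give the weight vector ca, which rescales the high-level feature f_h. Plain global average pooling is used here in place of the rank-thresholded pooling S_j, and the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, f_h):                      # f_h: (B, C, H, W)
        v_h = f_h.mean(dim=(2, 3))               # average pooling -> channel vector (B, C)
        ca = torch.sigmoid(self.fc2(torch.relu(self.fc1(v_h))))  # ca = sigma(fc2(ReLU(fc1(v_h))))
        return f_h * ca.view(*ca.shape, 1, 1)    # weight each channel of the high-level feature
```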
further, the specific process of the step 1.4) is as follows:
1.4.1 ) The high-level feature $\hat f_h$ weighted over the context-aware pyramid features is taken as input to capture the spatial points of interest; to obtain global information without increasing the number of parameters, two convolution layers are used, one with a 1 × k kernel and the other with a k × 1 kernel; the two convolution layers process the input $\hat f_h$ in parallel, the output of the convolution layer with the 1 × k kernel is denoted C_1 and the output of the convolution layer with the k × 1 kernel is denoted C_2; thus C_1 and C_2 satisfy:

$$C_1=\mathrm{conv1}(\hat f_h, W_1^{s}),\qquad C_2=\mathrm{conv2}(\hat f_h, W_2^{s})$$

where $W_1^{s}, W_2^{s}$ are parameters of the spatial attention mechanism, and conv1 and conv2 denote the convolution layers with kernels 1 × k and k × 1, respectively;
1.4.2 ) The output C_1 of the 1 × k convolution layer and the output C_2 of the k × 1 convolution layer are added, then mapped to [0, 1] using a Sigmoid operation, completing the normalization and giving the sa value; thus $sa=\sigma_2(C_1+C_2)$, where σ_2 denotes the sigmoid operation of this step;
1.4.3 ) For the low-level feature $f_l \in R^{W\times H\times C}$, where R denotes the set of spatial positions, W the width of the dimension, H the height of the dimension and C the total number of channels, the final weighted low-level feature is obtained by weighting f_l with sa: $\hat f_l = sa \cdot f_l$.
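A minimal PyTorch sketch of the spatial attention of steps 1.4.1)–1.4.3): two parallel convolutions with 1×k and k×1 kernels are applied to the channel-weighted high-level feature, their outputs are summed and passed through a sigmoid to give the spatial map sa, which then weights the low-level feature f_l. The kernel length k and the assumption that f_l has been resized to the same spatial size are illustrative choices.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels, k=9):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, 1, kernel_size=(1, k), padding=(0, k // 2))  # 1 x k
        self.conv2 = nn.Conv2d(channels, 1, kernel_size=(k, 1), padding=(k // 2, 0))  # k x 1

    def forward(self, f_h_weighted, f_l):
        # f_l is assumed to share the spatial size of f_h_weighted
        c1 = self.conv1(f_h_weighted)            # C1: output of the 1 x k branch
        c2 = self.conv2(f_h_weighted)            # C2: output of the k x 1 branch
        sa = torch.sigmoid(c1 + c2)              # sa in [0, 1], one value per spatial location
        return f_l * sa                          # weighted low-level features
```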
further, the specific process of the step 1.5) is as follows:
First the Laplace operator is used to obtain the true boundary and the saliency map output by the network, and then the cross-entropy loss L_B is used to supervise the generation of salient object boundaries;

the Laplace operator is a second-order differential operator in n-dimensional Euclidean space, defined as the divergence Δz of the gradient; since second derivatives can be used to detect edges, the Laplace operator is used to derive salient object boundaries; the Laplace operator is given by the following formula, where h and y are the standard Cartesian coordinates of the plane and z denotes the function being differentiated:

$$\Delta z=\frac{\partial^{2} z}{\partial h^{2}}+\frac{\partial^{2} z}{\partial y^{2}}$$

thus, by using the Laplace operator, the true boundary ΔY and the saliency map ΔQ of the network output can be obtained; the cross-entropy loss formula is:

$$L_{B}=-\frac{1}{size(Y)}\sum_{i=1}^{size(Y)}\Big[\Delta Y_{i}\log\Delta Q_{i}+(1-\Delta Y_{i})\log(1-\Delta Q_{i})\Big]$$

where Y denotes the set of true boundary maps, size(Y) the total number of true boundary maps in the set, i the i-th group, ΔY_i the i-th true boundary map represented with the Laplace operator, and ΔQ_i the saliency map of the i-th network output represented with the Laplace operator.
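The boundary supervision of step 1.5) can be sketched as follows in PyTorch: a Laplacian kernel is convolved with both the ground-truth map Y and the predicted saliency map Q, and the cross-entropy loss L_B is evaluated between the two edge maps. The specific 3×3 kernel and the tanh/abs squashing are assumptions consistent with the formulas above, not the patented choice.

```python
import torch
import torch.nn.functional as F

LAPLACE_KERNEL = torch.tensor([[0., 1., 0.],
                               [1., -4., 1.],
                               [0., 1., 0.]]).view(1, 1, 3, 3)

def laplace_edges(x):
    """Second-derivative (Laplacian) response of a (B, 1, H, W) map, mapped to [0, 1]."""
    kernel = LAPLACE_KERNEL.to(x.device, x.dtype)
    edges = F.conv2d(x, kernel, padding=1)
    return torch.clamp(torch.abs(torch.tanh(edges)), 0., 1.)

def boundary_loss(Q, Y, eps=1e-7):
    """Cross-entropy between Laplacian edges of the prediction Q and the ground truth Y."""
    dQ = laplace_edges(Q).clamp(eps, 1 - eps)
    dY = laplace_edges(Y)
    return -(dY * torch.log(dQ) + (1 - dY) * torch.log(1 - dQ)).mean()
```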
Further, the specific process of the step 2) is as follows:
2.1 ) Calculating the probability of the co-occurrence features:
2.1.1): The input convolutional neural network (CNN) feature map is regarded as N channel-dimensional features X, i.e. $X=\{x_1,\ldots,x_i,\ldots,x_N\}$, where $x_i$ denotes the input feature at position i, $i\in\{1,\ldots,N\}$; the probability of the co-occurrence feature $x_c$ of the target feature $x_t$ is set as:

$$p(x_c\mid x_t)=L(x_c\mid X=x_t)\propto \exp\big(s(x_c,x_t)\big)$$

where $p(x_c\mid x_t)$ denotes the likelihood of the co-occurrence feature $x_c$, also written as the likelihood function $L(x_c\mid X=x_t)$, i.e. the probability when X takes the value $x_t$; $s(x_c,x_t)$ is the similarity function between the co-occurrence feature $x_c$ and the target feature $x_t$, and exp denotes the exponential function; the similarity function is $s(x_c,x_t)=\varphi(x_t)^{T}\,\psi(x_c)$, where $\varphi(x_t)$ and $\psi(x_c)$ are the vector representations of the target feature $x_t$ and the co-occurrence feature $x_c$, · denotes the dot product and T denotes vector transposition; the vector representations are given by $\varphi(\cdot)$ and $\psi(\cdot)$, which are learned with feedforward networks; similarly, $s(x_i,x_t)=\varphi(x_t)^{T}\,\psi(x_i)$ denotes the similarity between the input feature $x_i$ at position i and the target feature $x_t$, where $\psi(x_i)$ is the vector representation of the input feature at position i;
2.1.2): The probability distribution of the co-occurrence features is computed with a Softmax over the spatial similarities; the probability distribution $p(x_c\mid x_t)'$ is computed as:

$$p(x_c\mid x_t)'=\sum_{k=1}^{K}\pi_{k}\,\frac{\exp\big(s_{k}(x_c,x_t)\big)}{\sum_{i=1}^{N}\exp\big(s_{k}(x_i,x_t)\big)}$$

where $\pi_k$ is the prior weight of the k-th component, K denotes dividing the vectors $\varphi(x_t)$ and $\psi(x_c)$ into K sub-components, and $s_k(x_c,x_t)$ is the similarity function between the co-occurrence feature $x_c$ and the target feature $x_t$ for the k-th component, $s_k(x_c,x_t)=\varphi_k(x_t)^{T}\,\psi_k(x_c)$, obtained from the vector representations $\varphi_k(x_t)$ and $\psi_k(x_c)$ of the target feature $x_t$ and the co-occurrence feature $x_c$ corresponding to the k-th sub-component; · denotes the dot product, T vector transposition and exp the exponential function; similarly, $s_k(x_i,x_t)=\varphi_k(x_t)^{T}\,\psi_k(x_i)$ denotes the similarity between the input feature $x_i$ at position i and the target feature $x_t$ for the k-th component, where $\psi_k(x_i)$ is the vector representation of $x_i$ corresponding to the k-th sub-component; the prior weight is

$$\pi_{k}=\frac{\exp\big(w_{k}^{T}\,\bar{x}\big)}{\sum_{k'=1}^{K}\exp\big(w_{k'}^{T}\,\bar{x}\big)},\qquad \bar{x}=\frac{1}{N}\sum_{i=1}^{N}\varphi(x_i)$$

where $\bar{x}$ is responsible for capturing the context information, $w_k$ is a learnable variable, N denotes the N positions of the feature map, $x_i$ is the input feature at position i and $\varphi(x_i)$ its vector representation, · denotes the dot product, T vector transposition and exp the exponential function with base e;
2.2 ) Building the aggregate co-occurrence feature module ACF:

the aggregate co-occurrence feature module aggregates context information across space by applying the co-occurrence probabilities through a self-attention mechanism; the specific formula is:

$$z_{t}=\sum_{c=1}^{N} p(x_c\mid x_t)'\,\omega_{c}$$

where $z_t$ is the aggregated feature output for target t, $p(x_c\mid x_t)'$ is the co-occurrence feature probability from the output of the previous step, and $\omega_c$ denotes the representation of the input feature $x_c$ of the corresponding channel c, obtained as $\omega_c=\Psi(x_c)$, where $x_c$ denotes a co-occurrence feature and $\Psi(\cdot)$ is learned with a feedforward network;
2.3 ) Average pooling features:

Features are captured with an average pooling layer, and a 1 × 1 convolution layer is attached at each feature location; the co-occurrence feature module is extended in this way; the average pooling result $S_j$ is:

$$S_{j}=\frac{1}{T}\sum_{i\in R_{j},\; r_{i}\le T} g_{i}$$

where T denotes the rank threshold selecting which activation values participate in pooling, R_j is the pooling region in the j-th feature map, i is the index of an activation value within this pooling region, and r_i and g_i are the rank and the activation value of activation i, respectively.
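A minimal PyTorch sketch of steps 2.1)–2.2): two feedforward (1×1 convolution) branches φ and ψ produce the target and co-occurrence vector representations, their dot-product similarity is normalized with a softmax to give p(x_c|x_t), and the ACF module aggregates the value features ω_c = Ψ(x_c) with these probabilities. Using a single softmax component (K = 1) and the channel sizes below are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ACF(nn.Module):
    """Aggregate co-occurrent features via similarity-weighted (self-attention style) pooling."""
    def __init__(self, in_ch, key_ch=64):
        super().__init__()
        self.phi = nn.Conv2d(in_ch, key_ch, 1)    # target-vector representation  phi(x_t)
        self.psi = nn.Conv2d(in_ch, key_ch, 1)    # co-occurrence representation  psi(x_c)
        self.Psi = nn.Conv2d(in_ch, in_ch, 1)     # value features  omega_c = Psi(x_c)

    def forward(self, x):                          # x: (B, C, H, W) CNN feature map
        B, C, H, W = x.shape
        q = self.phi(x).flatten(2).transpose(1, 2)   # (B, N, key_ch), N = H*W
        k = self.psi(x).flatten(2)                   # (B, key_ch, N)
        v = self.Psi(x).flatten(2).transpose(1, 2)   # (B, N, C)
        s = torch.bmm(q, k)                          # similarities s(x_c, x_t)
        p = torch.softmax(s, dim=-1)                 # p(x_c | x_t) over all positions c
        z = torch.bmm(p, v)                          # z_t = sum_c p(x_c|x_t) * omega_c
        return z.transpose(1, 2).view(B, C, H, W)
```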
Further, the specific process of the step 3.1) is as follows:
3.1.1 ) Obtaining the non-local color relationship: to represent the relationship between pixel pairs over a large range, a low-level non-local color relationship is built; the construction has two key points: first, about 2500 superpixels are generated with the superpixel segmentation method SLIC; second, the affinity of each superpixel with all superpixels within a radius of 20% of the corresponding image size is evaluated; for two superpixels s and t whose distance is less than 20% of the image size, the non-local color relationship of their centroids is defined as

$$W^{C}_{s,t}=\operatorname{erf}\big(a_c\,(b_c-\lVert c_s-c_t\rVert)\big)$$

where $c_s, c_t \in [0,1]$ are the mean color values of the superpixels, erf is the Gaussian error function, and $a_c, b_c$ control the rate at which the color affinity term falls off and the threshold at which it becomes 0;
3.1.2 ) Obtaining the high-level semantic relation information: the meaning of the semantic relation is to encourage the grouping of pixels belonging to the same scene object and to discourage the grouping of pixels from different objects; DeepLab-ResNet-101, i.e. a DeepLab model with a ResNet-101 backbone, is adopted as the feature extractor, and the semantic segmentation network is trained on the COCO-Stuff dataset, a public dataset whose images cover 91 object categories; the feature vectors of two superpixels s, t are used to represent the high-level semantic relation between superpixels,

$$W^{S}_{s,t}=\operatorname{erf}\big(a_s\,(b_s-\lVert \tilde f_s-\tilde f_t\rVert)\big)$$

where $\tilde f_s, \tilde f_t$ denote the mean feature vectors of s and t, erf is the Gaussian error function, and $a_s$ and $b_s$ are parameters controlling the rate at which the function falls off and the threshold at which it becomes negative;
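The superpixel affinities of step 3.1) can be sketched as follows: SLIC superpixels are extracted, and for each pair of nearby superpixels the non-local colour affinity and the semantic affinity are evaluated with the Gaussian error function, using falloff/threshold parameters a_c, b_c, a_s, b_s. The parameter values, the scikit-image SLIC call and the simple O(n²) loop are assumptions for illustration only.

```python
import numpy as np
from scipy.special import erf
from skimage.segmentation import slic

def superpixel_affinities(image, features, a_c=50.0, b_c=0.05, a_s=20.0, b_s=0.2):
    """image: HxWx3 in [0,1]; features: HxWxD semantic feature map (e.g. DeepLab output)."""
    labels = slic(image, n_segments=2500, start_label=0)     # ~2500 SLIC superpixels
    n = labels.max() + 1
    colors = np.array([image[labels == i].mean(axis=0) for i in range(n)])
    feats = np.array([features[labels == i].mean(axis=0) for i in range(n)])
    centers = np.array([np.argwhere(labels == i).mean(axis=0) for i in range(n)])

    radius = 0.2 * max(image.shape[:2])                      # 20% of the image size
    W_C = np.zeros((n, n))
    W_S = np.zeros((n, n))
    for s in range(n):                                       # plain double loop for clarity
        for t in range(s + 1, n):
            if np.linalg.norm(centers[s] - centers[t]) > radius:
                continue
            W_C[s, t] = W_C[t, s] = erf(a_c * (b_c - np.linalg.norm(colors[s] - colors[t])))
            W_S[s, t] = W_S[t, s] = erf(a_s * (b_s - np.linalg.norm(feats[s] - feats[t])))
    return W_C, W_S
```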
further, the specific process of forming the laplace matrix in the step 3.2) is as follows:
The Laplacian matrix L is constructed from the two groups of previously acquired pixel-pair relationships, the non-local color relationship $W_C$ and the high-level semantic relationship $W_S$, combined according to the principle of least-squares optimization:

$$L=D-\big(W_{L}+\sigma_{C} W_{C}+\sigma_{S} W_{S}\big)$$

where $W_L$ is the matrix containing all pixel-pair affinities, $W_C$ is the matrix containing the non-local color relationships, $W_S$ is the matrix containing the semantic relationships, $\sigma_S$ and $\sigma_C$ are parameters controlling the influence of the corresponding matrices and are set to 0.01, and D is the diagonal degree matrix.
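A sketch of steps 3.2)–3.3): the affinity matrices are combined into the graph Laplacian L = D − (W_L + σ_C·W_C + σ_S·W_S), the eigenvectors belonging to the 100 smallest eigenvalues are extracted with a sparse eigensolver, and the rows of those eigenvectors are clustered with k-means to form the layers. The pixel-pair affinity matrix W_L is assumed to be given; the sparse storage and the number of layers are implementation assumptions.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def build_laplacian(W_L, W_C, W_S, sigma_c=0.01, sigma_s=0.01):
    """L = D - (W_L + sigma_c * W_C + sigma_s * W_S), with D the diagonal degree matrix."""
    W = sp.csr_matrix(W_L) + sigma_c * sp.csr_matrix(W_C) + sigma_s * sp.csr_matrix(W_S)
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    return D - W

def cluster_layers(L, n_vectors=100, n_layers=5):
    """Eigenvectors of the 100 smallest eigenvalues, clustered with k-means into layers."""
    vals, vecs = eigsh(L, k=n_vectors, which='SM')   # smallest eigenvalues (slow but simple)
    assignment = KMeans(n_clusters=n_layers, n_init=10).fit_predict(vecs)
    return assignment                                 # one layer label per node (pixel/superpixel)
```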
Further, the specific process of the step 4.4) is as follows:
4.4.1 ) The result of the edge-refinement part is a set of two-dimensional matrices, each recording the pixel values of one class; in the processed result, a pixel belonging to the class has a value, while a pixel not belonging to the class is set to 0; whether a pixel belongs to the class is decided specifically by its transparency.

4.4.2 ) Then, each matrix in the two-dimensional matrix set is point-multiplied with the gray-level map of the subject-determination result; the intersection of the two has values and the non-intersecting part is 0; the matrix after point multiplication is traversed to obtain the number of pixels with values, i.e. the area of the intersection, denoted m; traversal likewise gives the area of the class, denoted small, and the area of the foreground subject of the gray-level map, denoted big;
Let bl be the ratio of the intersection area to the class area:

$$bl=\frac{m}{small}$$

Let BL be the ratio of the intersection area to the foreground-subject area:

$$BL=\frac{m}{big}$$
Therefore, if the value of bl is larger, the class is considered to be basically part of the foreground, and all records of the class are reserved; otherwise, judging the value of BL, and reserving an intersection part when the value of BL exceeds a set range;
4.4.3 All the parts of the reserved record are combined, and the co-map part reserved by the co-occurrence feature is taken as an intersection, so that the result output by the last image is obtained.
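The fusion rule of steps 4.4.1)–4.4.3) can be sketched as follows: each per-class transparency matrix from the edge-refinement stage is intersected with the binarized subject mask, the ratios bl = m/small and BL = m/big are evaluated, and the class is kept whole, kept only in its intersection, or discarded. The thresholds bl_th and BL_th and the way co_map is combined are assumptions; the patent only states that set ranges are used.

```python
import numpy as np

def fuse(layer_masks, subject_gray, co_map, bl_th=0.7, BL_th=0.2):
    """layer_masks: list of HxW transparency matrices (one per class, non-class pixels = 0);
    subject_gray: binarized HxW subject mask from step 1.5); co_map: HxW co-occurrence mask."""
    subject = (subject_gray > 0).astype(np.float32)
    big = subject.sum()                                  # area of the foreground subject
    result = np.zeros_like(subject)

    for mask in layer_masks:
        cls = (mask > 0).astype(np.float32)
        inter = cls * subject                            # point multiplication with the gray map
        m, small = inter.sum(), cls.sum()
        if small == 0:
            continue
        bl, BL = m / small, m / big                      # ratios defined in step 4.4.2)
        if bl > bl_th:                                   # class lies almost entirely in the subject
            result = np.maximum(result, cls)
        elif BL > BL_th:                                 # keep only the intersection part
            result = np.maximum(result, inter)

    # combine with the co_map region kept from the co-occurrence stage (assumed union here)
    return np.maximum(result, (co_map > 0).astype(np.float32) * subject)
```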
Compared with the prior art, the main advantages of the method are:
1) The accuracy of the edge processing in foreground segmentation is improved to a large extent;
2) Both the low-level and the high-level features of the image are taken into account, feature fusion at different scales is adopted, and the degree to which each feature contributes to saliency is considered, improving detection accuracy;
3) The subject part of the image is picked up automatically;
4) Introducing FCOS, the multi-scale strategy and the centrality idea allows more foreground samples to be used for training, making the position regression of the subject detection box more accurate;
5) Introducing FCOS, the multi-scale strategy and the centrality idea allows various objects to be detected, including image scenes where the subject and background colors are close, occluded or highly overlapping, which widens the application range of the method;
6) The accuracy of recognizing similar objects in the image is improved, and therefore the segmentation accuracy is improved.
Detailed Description
The invention is further described below with reference to examples.
The invention relates to an image segmentation method based on bounding boxes and co-occurrence feature prediction, which combines a subject-determination process for locating the salient region with an edge-refinement process, assisted by co-occurrence features, for segmenting the target accurately. First, the detection-box range of the subject is determined with the FCOS method, a multi-scale strategy and the centrality idea. Second, the subject-determination part designs rich context features with the context-aware pyramid feature extraction module, combines a channel attention mechanism module applied after the feature maps of the context-aware pyramid feature extraction module with a spatial attention mechanism module applied after the low-level feature maps, and uses cross-entropy loss to supervise the generation of salient boundary localization information. The co-occurrence features of the image are predicted: the similarity between the co-occurrence features and the target features is measured to learn the co-occurrence probability, context prior information is added to enhance robustness in complex scenes, an aggregate co-occurrence feature module is then built, and average pooling is attached to obtain global features, giving an image classification result that serves as one basis for edge refinement and improves classification accuracy. The edge-refinement process then obtains the non-local color features of the image based on the spectral matting technique and the high-level semantic features through a residual network, combines the two through a Laplacian matrix to classify the pixels of the image, and thereby segments the target accurately. Finally, the results of the two processes are fused.
Examples:
An image segmentation method based on bounding box and co-occurrence feature prediction comprises: 1. a subject-determination process that introduces the FCOS method and uses a multi-scale strategy and the centrality method; 2. a co-occurrence feature classification process; 3. an edge-refinement process and the fusion of all of the above.
The process of determining the subject by introducing the FCOS method with a multi-scale strategy and the centrality method is as follows:
1.1 An image and the real bounding boxes are input, and the range of the subject target bounding box is determined and output based on the FCOS method using a multi-scale strategy and the centrality method. FCOS (Fully Convolutional One-Stage Object Detection), a first-order fully convolutional object detection method, was proposed at the CVPR 2019 conference.
1.1.1 first, the FCOS method is adopted to perform pixel-by-pixel regression prediction, and the prediction strategy is as follows.
The bounding box is $B_i=\left(x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}\right)$, where $\left(x_0^{(i)}, y_0^{(i)}\right)$ are the coordinates of the upper-left corner of the bounding box and $\left(x_1^{(i)}, y_1^{(i)}\right)$ the coordinates of the lower-right corner, with i denoting the i-th bounding box in the image. The pixel position is (m, n) and the vector a* = (l*, t*, r*, b*), where l*, t*, r*, b* denote the distances from the pixel to the left, top, right and bottom sides of the bounding box, respectively. After associating position (m, n) with bounding box B_i, the training regression target for that position can be expressed as

$$l^{*}=m-x_0^{(i)},\qquad t^{*}=n-y_0^{(i)},\qquad r^{*}=x_1^{(i)}-m,\qquad b^{*}=y_1^{(i)}-n$$

Since the FCOS algorithm performs pixel-by-pixel regression based on the points inside the target object box, the regression targets are all positive samples, so the exp function is used to stretch them, i.e. the l*, t*, r*, b* obtained from the above formula are fed as variables into the exp function, and the output values are the stretched distances from the corresponding pixel to the left, top, right and bottom sides of the bounding box. This operation makes the final feature space larger and the degree of discrimination stronger. Here exp is the exponential function with the natural constant e as its base.
1.1.2 secondly, using a multi-scale strategy, the problem of intractable ambiguity caused by overlapping of the true value boxes, i.e. which bounding box should be regressed at a certain position in the overlap, is solved. In order to better utilize the multi-scale features, the feature layer at each scale defines the scope of the bounding box regression, which comprises the following specific steps:
the first step: the feature map, defined as five levels { P1, P2, P3, P4, P5}, is defined to satisfy the training of a multi-scale strategy, each level satisfying different parameters hxw/s, where H and W correspond to the height and width of the feature map, s being the downsampling rate of the feature map (sampling a sample sequence once at intervals of a number of samples, such that the new sequence is the downsampling of the original sequence). Wherein, the parameters corresponding to P1 are 100×128/8, the parameters corresponding to P2 are 50×64/16, the parameters corresponding to P3 are 25×32/32, the parameters corresponding to P4 are 13×16/64, and the parameters corresponding to P5 are 7×8/128. The regression targets in each layer are then calculated: l (L) * 、t * 、r * 、b * 。
And a second step of: judgment max (l) * 、t * 、r * 、b * )>d i Or max (l) * 、t * 、r * 、b * )<d i-1 Whether or not satisfied, wherein max represents taking the maximum value; (here d) i Is the maximum distance that the feature level i needs to return, corresponds to five hierarchical feature graphs { P1, P2, P3, P4, P5} defined in the first step, corresponds to 5 feature levels, corresponds to i values of 1, 2, 3, 4, 5, and sets d 0 、d 1 、d 2 、d 3 、d 4 、d 5 0, 64, 128, 256, 512, and ≡).
The third step: if the judgment formula in the second step is satisfied, the position is set as a negative sample and no bounding-box regression prediction is performed for it.
Since objects of different sizes are assigned to different feature layers for regression and most overlap occurs between objects of widely differing sizes, multi-scale prediction can greatly improve the prediction performance in the case of overlapping target boxes.
1.1.3 finally, the centrality method is used. Since FCOS algorithm uses a pixel-by-pixel regression strategy, while raising recall, many low quality center points and more offset prediction bounding boxes are generated, so the centrality method is used to suppress these low quality detected bounding boxes, and the strategy does not introduce any hyper-parameters. The centrality method adds a penalty to the network that ensures that the predicted bounding box is as close to the center as possible.
$$centerness^{*}=\sqrt{\frac{\min(l^{*},r^{*})}{\max(l^{*},r^{*})}\times\frac{\min(t^{*},b^{*})}{\max(t^{*},b^{*})}}$$

where max denotes taking the maximum value and min denotes taking the minimum value.
From the above formula it can be seen that centrality can be understood as a quantity with a measuring function whose value lies between 0 and 1. The centerness* in the above formula is computed first, and the obtained centerness* is then trained with the cross-entropy loss. Centrality can lower the weight of bounding boxes far from the object center, so these low-quality bounding boxes are very likely to be filtered out, significantly improving detection performance. Cross-entropy loss is a mathematical function used to measure the difference between two probability distributions and can measure the information loss between a predicted value and the true label value. The cross-entropy loss function Loss is expressed as:

$$Loss=-\sum_{i=1}^{n} e_{i}\,\log \hat{e}_{i}$$

where n indicates that there are n groups of labels, e_i is the probability distribution of the i-th true label, and ê_i is the probability distribution predicted for the i-th label.

centerness* is fed into the above formula as the predicted value ê_i; when centerness* ∈ [0.5, 1], e_i takes 1, and when centerness* ∈ [0, 0.5), e_i takes 0, and the Loss obtained after each training round is computed. When Loss is smaller than a given threshold, i.e. the information loss between the predicted value and the true label value meets the expected requirement, training ends; when Loss is larger than the given threshold, l*, t*, r*, b* are reduced, i.e. the bounding box is moved closer to the center, and centerness* and the corresponding Loss value are recomputed.
1.2 An image is input based on the subject target bounding-box range, and the context-aware pyramid feature extraction module (CPFE, Context-aware Pyramid Feature Extraction) is adopted for multi-scale high-level feature mapping to obtain rich context features. The CPFE model is an existing model proposed at the CVPR 2019 conference. CPFE takes Conv3-3, Conv4-3 and Conv5-3 of the VGG-16 network architecture as its basic high-level features; VGG-16 is a 16-layer deep convolutional neural network developed by researchers from the Visual Geometry Group of the University of Oxford together with Google, and contains 13 convolution layers (five convolution blocks, each with 2–3 convolution layers) and three fully connected layers. Conv3-3 denotes the third convolution layer inside the third convolution block, and Conv4-3 and Conv5-3 denote the third convolution layer inside the fourth and fifth convolution blocks, respectively.
The specific process is as follows:
a) In order for the finally extracted high-level features to have scale-invariance and shape-invariance properties, atrous convolutions (dilated convolutions) with different dilation rates are adopted, and the dilation rates are set to 3, 5 and 7 to capture context information. Atrous convolution introduces a "dilation rate" parameter into the convolution layer that defines the spacing between values when the convolution kernel processes the data. Its advantage is that a larger receptive field can be obtained and denser information acquired, improving the recognition and segmentation of small objects.
b) Feature maps from different porous convolutional layers are stitched with 1 x 1 dimension-reduction features by cross-channel connection. Then, three different-scale features (Conv 3-3, conv4-3 and Conv5-3 are basic high-level features) are obtained by using context awareness information (feature information output by the side of the VGG-16 network), the three different-scale features are combined in pairs, and each two smaller-scale features are up-sampled to obtain larger-scale features, so that three-scale high-level features are output; the up-sampling is also called image interpolation, i.e. new elements are inserted between pixel points by adopting an interpolation algorithm on the basis of original image pixels, so that the original image is amplified.
c) Finally, the high-level features obtained by up-sampling are combined through cross-channel connection and used as the output of the context perception pyramid feature extraction module.
1.3 after the context aware pyramid feature extraction, a channel attention mechanism model is added and used to assign more weight to channels that exhibit high response to salient objects. New high-level features are output by weighting the context-aware pyramid features using the channel attention mechanism model.
Wherein the channel attention mechanism model is as follows:
a) First, the high-level feature $f_h \in R^{W\times H\times C}$ obtained after context-aware pyramid feature extraction is unfolded into $F=[f_1, f_2, \ldots, f_C]$, where $f_c$ denotes the c-th channel of the high-level feature $f_h$, R denotes the set of spatial positions, W the width of the dimension, H the height of the dimension and C the total number of channels. Average pooling is then applied to F to obtain the channel feature vector $v_h$, where the purpose of average pooling is to reduce the error caused by the increased variance of the estimate due to the limited neighborhood size, which helps preserve more background information of the image. The result of average pooling $S_j$ is:

$$S_{j}=\frac{1}{T}\sum_{i\in R_{j},\; r_{i}\le T} g_{i}$$

where T denotes the rank threshold selecting which activation values participate in pooling, R_j is the pooling region in the j-th feature map, i is the index of an activation value within this pooling region, and r_i and g_i are the rank and the activation value of activation i, respectively.
b) Then, the channel characteristic vector v obtained in the last step is used for h Outputting to a ReLU layer through a Full Connection (FC) layer; the Fully Connected (FC) layer is the computational layer in the convolutional neural network that acts as a classifier. The ReLU layer refers to a computational layer that contains ReLU functions, which are commonly used activation functions in artificial neural networks.
c) Subsequently, the channel feature vector is mapped into [0, 1] using a Sigmoid operation, completing the normalization and producing the ca value (i.e. a weight for each channel of the high-level features). Therefore $ca=F(v_h,W)=\sigma_1\big(fc_2(\delta(fc_1(v_h,W_1)),W_2)\big)$, where W_1, W_2 are parameters of the channel attention mechanism, σ_1 denotes the sigmoid operation, fc denotes the FC layer and δ the ReLU function.
d) Finally, the high-level feature obtained after weighting the context-aware pyramid features is output as $\hat f_h = ca \cdot f_h$, where · denotes the dot (element-wise) product.
1.4 acquiring low-level features of natural images, taking Conv1-2 and Conv2-2 (Conv 1-2 represents a second convolution layer in a first convolution block and Conv2-2 represents a second convolution layer in a second convolution block) in a VGG-16 network architecture as basic low-level features as inputs. The low-level features of natural images generally contain abundant foreground and complex background details, but excessive detail information can cause noise. The spatial attention mechanism model is thus employed to focus more on the boundary between salient objects and background, helping to generate efficient low-level features for containing more boundary information.
Wherein the spatial attention mechanism model is as follows:
a) The high-level feature $\hat f_h$ obtained by weighting the context-aware pyramid features is taken as input to capture the spatial points of interest. To obtain global information without increasing the number of parameters, two convolution layers are used, one with a 1 × k kernel and the other with a k × 1 kernel. The two convolution layers process the input $\hat f_h$ in parallel; the output of the convolution layer with the 1 × k kernel is denoted C_1 and the output of the convolution layer with the k × 1 kernel is denoted C_2. Thus C_1 and C_2 satisfy:

$$C_1=\mathrm{conv1}(\hat f_h, W_1^{s}),\qquad C_2=\mathrm{conv2}(\hat f_h, W_2^{s})$$

where $W_1^{s}, W_2^{s}$ are parameters of the spatial attention mechanism, and conv1 and conv2 denote the convolution layers with kernels 1 × k and k × 1, respectively.
b) The output C_1 of the 1 × k convolution layer and the output C_2 of the k × 1 convolution layer are added, then mapped to [0, 1] using a Sigmoid operation, completing the normalization and giving the sa value. Thus $sa=\sigma_2(C_1+C_2)$, where σ_2 denotes the sigmoid operation of this step.
c) For the low-level feature $f_l \in R^{W\times H\times C}$, where R denotes the set of spatial positions, W the width of the dimension, H the height of the dimension and C the total number of channels, the final weighted low-level feature is obtained by weighting f_l with sa: $\hat f_l = sa \cdot f_l$.
1.5 The high-level features weighted by the channel attention mechanism model and the low-level feature output weighted by the spatial attention mechanism model are fused together, and cross-entropy loss is used to supervise the generation of salient boundary localization information. A gray-level map of the foreground contour of the image is output according to the localization information.
The Laplace operator is first used to obtain the true boundary and the boundary of the saliency map output by the network, and the cross entropy loss L_B is then used to supervise the generation of salient object boundaries.
The Laplace operator is a second-order differential operator in n-dimensional Euclidean space, defined as the divergence (Δz) of the gradient. Since the second derivative can be used to detect edges, the Laplace operator is used to derive salient object boundaries. The Laplacian is given by the following formula, where h and y are the standard Cartesian coordinates of the plane and z represents the function of these coordinates.
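The formula itself is not reproduced in this rendering; for the two-dimensional case described (coordinates h and y, function z), the Laplace operator takes the standard form:

$$\Delta z=\frac{\partial^{2} z}{\partial h^{2}}+\frac{\partial^{2} z}{\partial y^{2}}$$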
Thus, by using the Laplace operator, the true boundary (denoted by ΔY) and the boundary of the saliency map output by the network (denoted by ΔQ) can be obtained.
The cross entropy loss formula is as follows:
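The loss itself is not reproduced in this rendering; with ΔY_i and ΔQ_i as defined in the next sentence, a binary cross entropy of the usual form is consistent with the description (a reconstruction, so the exact normalization used in the patent's formula image may differ):

$$L_B=-\sum_{i=1}^{\operatorname{size}(Y)}\Big[\Delta Y_i\log \Delta Q_i+\big(1-\Delta Y_i\big)\log\big(1-\Delta Q_i\big)\Big]$$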
where Y denotes the set of real boundary maps, size(Y) denotes the total number of real boundary maps in the set, i denotes the i-th group, ΔY_i is the real boundary map of the i-th group represented by the Laplacian, and ΔQ_i is the saliency map of the i-th group of network outputs represented by the Laplacian.
The co-occurrence feature classification process is as follows:
First, given an input image, a convolutional feature map is extracted using a pre-trained CNN (a CNN, or convolutional neural network, is a feedforward neural network that includes convolution computation and has a deep structure, and is one of the representative algorithms of deep learning). The network then splits into three branches: first, feedforward networks convert the feature map into representations of a target vector and a co-occurrence vector (i.e., the features of the target object and of co-occurring objects are expressed as vectors), and the probability of co-occurrence features is estimated from their similarity; second, the cross-spatial context is captured with the co-occurrence probability through the aggregate co-occurrence feature module (ACF); third, global features are captured using average pooling. A feedforward network is a layered arrangement of neurons in which each neuron is connected only to neurons of the previous layer, receives the output of the previous layer and passes its output to the next layer, with no feedback between layers. Co-occurrence means that object A and object B tend to appear in the same scene, so one basis for recognizing object B is whether the co-occurring object A appears in the scene. A co-occurrence feature is a feature represented by a co-occurring object, and the co-occurrence probability is the probability distribution of the occurrence of co-occurrence features.
a) Calculate the probability of the co-occurrence features. Because of the uncertainty of co-occurrence features, this is divided into two steps: first, the likelihood of a co-occurrence feature is learned by measuring its similarity to the target feature; second, the probability distribution of the co-occurrence features is calculated using a Softmax with spatial similarity. Since some complex categories are difficult to distinguish with global semantic information alone, context priors are incorporated into the modeling process.
The first step is as follows: consider the input CNN feature map as N channel-dimensional features X, i.e., X = {x_1, ..., x_i, ..., x_N}, where x_i denotes the input feature at position i, i ∈ {1, …, N}. The probability of the co-occurrence feature x_c given the target feature x_t is set as:
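The probability expression is not reproduced in this rendering; from the definitions that follow (a similarity s between features turned into a probability with exp and a sum over all N positions), it has the softmax form below, which is a reconstruction:

$$p(x_c\mid x_t)=\frac{\exp\big(s(x_c,x_t)\big)}{\sum_{i=1}^{N}\exp\big(s(x_i,x_t)\big)}$$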
where p(x_c | x_t) denotes the likelihood of the co-occurrence feature x_c, i.e., the probability of X when the target feature is x_t; s(x_c, x_t) is the similarity function between the co-occurrence feature x_c and the target feature x_t, and exp denotes the exponential function. The similarity function s is the dot product of the vector representations of the target feature x_t and the co-occurrence feature x_c, where T denotes vector transposition; these vector representations are learned with feedforward networks. Similarly, s(x_i, x_t) denotes the similarity function between the input feature x_i at position i and the target feature x_t, given by the dot product of the vector representation of x_i with that of x_t.
The second step is as follows: the probability distribution of the co-occurrence features is computed using a Softmax with spatial similarity; the formula for the probability distribution p(x_c | x_t)' is as follows:
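The formula image is not reproduced in this rendering; reading the definitions that follow (prior weights π_k over K sub-components, each with its own similarity s_k), a consistent reconstruction is the mixture of softmaxes below; the exact form of π_k is described only verbally and is not reconstructed here:

$$p(x_c\mid x_t)'=\sum_{k=1}^{K}\pi_k\,\frac{\exp\big(s_k(x_c,x_t)\big)}{\sum_{i=1}^{N}\exp\big(s_k(x_i,x_t)\big)}$$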
where π_k is the prior weight of the k-th component, K denotes the number of sub-components into which the vector representations are divided, and s_k(x_c, x_t) is the similarity function between the co-occurrence feature x_c and the target feature x_t for the k-th component, obtained as the dot product of the vector representations of x_t and x_c corresponding to the k-th sub-component (T denotes vector transposition and exp the exponential function). Similarly, s_k(x_i, x_t) denotes the similarity function between the input feature x_i at position i and the target feature x_t for the k-th component, computed from the vector representation of x_i corresponding to the k-th sub-component. The prior weight π_k is responsible for capturing context information, where w_k is a learnable variable, N denotes the N channels of the feature map, x_i is the input feature at position i, · denotes dot product, T denotes vector transposition, and exp denotes the exponential function with base e.
b) Construct the aggregate co-occurrence feature module (ACF). The aggregate co-occurrence feature module aggregates cross-spatial context information through a self-attention mechanism and the co-occurrence probability. The specific formula is as follows:
where z_t is the aggregated feature output for target t, p(x_c | x_t)' is the co-occurrence probability from the output of the previous step a), and ω_c denotes the transformed representation of the input feature x_c of channel c, obtained from ω_c = Ψ(x_c), where x_c is a co-occurrence feature and Ψ is learned using a feedforward network.
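A condensed PyTorch-style sketch of the idea behind steps a) and b) is given below; it collapses the K-component mixture and the context prior into a single plain softmax, so it illustrates the aggregation pattern rather than the patented formulation, and the mapping names (theta, phi, psi) and the representation dimension are assumptions.

```python
import torch
import torch.nn as nn

class AggregateCoOccurrence(nn.Module):
    """Sketch of the ACF idea: weight transformed features by co-occurrence probabilities
    and sum them into an aggregated feature per target position (self-attention style)."""
    def __init__(self, channels: int, dim: int = 64):  # dim of the learned representations is an assumption
        super().__init__()
        self.theta = nn.Conv2d(channels, dim, 1)   # target-feature representation
        self.phi = nn.Conv2d(channels, dim, 1)     # co-occurrence-feature representation
        self.psi = nn.Conv2d(channels, dim, 1)     # omega_c = Psi(x_c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        t = self.theta(x).flatten(2)               # (B, D, N) target vectors
        f = self.phi(x).flatten(2)                 # (B, D, N) co-occurrence vectors
        v = self.psi(x).flatten(2)                 # (B, D, N) values omega
        s = torch.bmm(t.transpose(1, 2), f)        # similarities s(x_c, x_t), shape (B, N, N)
        p = torch.softmax(s, dim=-1)               # co-occurrence probabilities p(x_c | x_t)
        z = torch.bmm(v, p.transpose(1, 2))        # z_t = sum_c p(x_c | x_t) * omega_c
        return z.view(b, -1, h, w)
```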
c) Average-pool the features. Global features are captured with an average pooling layer and attached to each feature location with a 1×1 convolution layer; the co-occurrence feature module is extended by this method. The average pooled result S_j is obtained as follows:
where T represents a rank threshold for the activation values selected to participate in pooling, R_j denotes the pooling field in the j-th feature map, i denotes the index of an activation value within this pooling field, and r_i and g_i denote the rank and the activation value of activation i, respectively.
The edge refinement process is as follows:
3.1 first, after inputting an image, image information features are collected. The information features of an image come mainly from two aspects: firstly, based on spectral extinction, obtaining non-local color relation information (texture and color information) from an input image from the perspective of spectral analysis; and secondly, using a convolutional neural network for scene analysis to generate high-level semantic relation information.
3.1.1 Obtain the non-local color relationship. To represent relationships between pixel pairs over a larger range, a low-level non-local color relationship is constructed. The construction has two key points: first, 2500 superpixels are generated using superpixel segmentation (SLIC); superpixel segmentation is an image segmentation technique proposed in 2003 that groups pixels by the similarity of their features, so that a small number of superpixels replaces a large number of pixels to express the features of a picture, greatly reducing the complexity of image post-processing. Second, the affinity relationship of each superpixel with all superpixels within a radius of 20% of the image size is evaluated. For two superpixels s and t whose distance is less than 20% of the image size, a non-local color relationship of their centroids is defined, where c_s, c_t ∈ [0,1] are the mean color values of the superpixels, erf is the Gaussian error function, and a_c, b_c are the parameters of the radial relation term that control the rate of decrease and the threshold at which it becomes 0.
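As an illustration of step 3.1.1, the unoptimized sketch below computes SLIC superpixels and an erf-shaped color affinity between nearby superpixels; the exact affinity expression and the values of a_c and b_c are not given in this rendering, so the form used here is an assumption consistent with the description (an RGB image with values in [0,1] is also assumed).

```python
import numpy as np
from scipy.special import erf
from skimage.segmentation import slic

def nonlocal_color_affinities(image, n_segments=2500, a_c=50.0, b_c=0.05):
    """Sketch of the non-local color relationship; a_c and b_c are illustrative values."""
    labels = slic(image, n_segments=n_segments, start_label=0)
    n = labels.max() + 1
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    # centroid and mean color of every superpixel
    centroids = np.stack([np.array([ys[labels == k].mean(), xs[labels == k].mean()]) for k in range(n)])
    colors = np.stack([image[labels == k].mean(axis=0) for k in range(n)])

    radius = 0.2 * max(image.shape[:2])          # only pairs within 20% of the image size
    affinities = {}
    for s in range(n):
        for t in range(s + 1, n):
            if np.linalg.norm(centroids[s] - centroids[t]) < radius:
                d = np.linalg.norm(colors[s] - colors[t])
                # erf-shaped falloff: similar colors -> affinity near 1, distant colors -> near 0
                affinities[(s, t)] = erf(a_c * (b_c - d))
    return labels, affinities
```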
3.1.2 Obtain high-level semantic relation information. The purpose of the semantic relationship is to encourage grouping of pixels belonging to the same scene object and to discourage grouping of pixels from different objects. The semantic segmentation network is trained on the COCO-Stuff dataset using DeepLab-ResNet-101 as the feature extractor. ResNet-101, short for deep residual network, takes the VGG-19 network as a reference and modifies it by adding residual units through a shortcut mechanism, which reduces the complexity of deep networks and the difficulty of training. DeepLab-ResNet-101 consists of a DeepLab model with a ResNet-101 backbone. COCO-Stuff is a public dataset on which the DeepLab model is trained; its images include 91 classes of targets, and the dataset mainly addresses problems such as object detection and the context between objects. The feature vectors of two superpixels s and t are used to represent the high-level semantic relationship between superpixels, where the mean feature vectors of s and t are used, erf is the Gaussian error function, and a_s and b_s are the parameters that control the rate at which the function drops and the threshold at which it becomes negative.
3.2 Combine the non-local color relation information and the high-level semantic relation information of the image, establish the image layers, and reveal the semantic objects and the soft transition relations among them in the eigenvectors of the Laplace matrix.
The process of forming the laplace matrix is as follows:
The Laplace matrix L is constructed from the two groups of relationships between pixels acquired previously, the non-local color relationship and the high-level semantic relationship, combined according to the principle of least-squares optimization.
where W_L is the matrix containing all pixel-pair affinities, W_c is the matrix containing the non-local color relationships, W_s is the matrix containing the semantic relationships, σ_S and σ_C are parameters that control the influence of the corresponding matrices and are set to 0.01, and D is a diagonal matrix.
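The construction formulas are not reproduced in this rendering; a standard graph-Laplacian combination consistent with the symbols above is sketched below, where the way σ_C and σ_S enter (as multipliers of the corresponding relationship matrices) is an assumption:

$$W_L=\sigma_C\,W_c+\sigma_S\,W_s,\qquad D_{ii}=\sum_{j}\big(W_L\big)_{ij},\qquad L=D-W_L$$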
3.3 Extract the eigenvectors corresponding to the 100 smallest eigenvalues of the matrix L, and process these eigenvectors with k-means clustering. The k-means clustering algorithm is a partition-based clustering algorithm that takes k as a parameter and divides the data objects into k clusters, so that similarity within a cluster is high and similarity between clusters is low. After the edge refinement process is finished, the image layers formed from the Laplacian matrix are output.
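A minimal sketch of step 3.3 follows, assuming the Laplacian is available as a sparse symmetric matrix over the image pixels; the number of clusters (layers) is an assumption, since the patent does not fix it here.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def layers_from_laplacian(L_sparse, n_vectors=100, n_clusters=5):
    """Take the eigenvectors of the 100 smallest eigenvalues of L and cluster them with k-means."""
    # smallest eigenpairs of the symmetric graph Laplacian (shift-invert may be faster in practice)
    vals, vecs = eigsh(L_sparse, k=n_vectors, which='SM')
    # each pixel is described by its n_vectors spectral coordinates
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs)
    # one binary mask per cluster; reshape to H x W to obtain an image layer
    return [(labels == c).astype(float) for c in range(n_clusters)]
```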
Determining a main body result, a co-occurrence feature classification result and an edge precision result, and fusing:
4.1 First, process the result of determining the main body. As produced, it cannot be combined with the result of the edge refinement part, because the output of determining the main body is only the main body contour border. Therefore, the gray-scale map obtained from the output of step 1.5 is binarized, and the obvious white region of the main body is retained, rather than just a main body contour. Image binarization is the process of setting the gray value of each pixel in an image to 0 or 255, i.e., giving the whole image a clear black-and-white appearance. Binarizing the image greatly reduces the amount of data in the image and thereby highlights the contour of the object.
4.2 Next, process the result of the co-occurrence features. A matrix corresponding to each class is obtained from the co-occurrence result, and each matrix is dot-multiplied with the gray-scale map. The goal is to determine the main-body class: the result matrix computed for each class is traversed against the gray-scale map and the intersection part is calculated; the class with the largest intersection among all classes is the main-body class, and the intersection of this class with the gray-scale map is kept as co_map.
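A small sketch of this selection is shown below, assuming each co-occurrence class is available as a binary mask of the same size as the binarized subject map; the function and variable names are illustrative.

```python
import numpy as np

def pick_main_class(class_maps, gray_binary):
    """Choose the class whose map overlaps the binarized subject map the most.
    class_maps: dict {class_name: (H, W) 0/1 matrix}; gray_binary: (H, W) 0/1 matrix."""
    best_class, best_overlap, co_map = None, -1, None
    for name, m in class_maps.items():
        inter = m * gray_binary        # element-wise (dot) product = intersection mask
        overlap = int(inter.sum())     # size of the intersection
        if overlap > best_overlap:
            best_class, best_overlap, co_map = name, overlap, inter
    return best_class, co_map
```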
4.3 Then, the result of the edge refinement part is processed. Because the output of the edge refinement part is a set of image layers derived from the Laplacian matrix, it is represented as a collection of two-dimensional matrices, in which a pixel may have multiple transparencies, i.e., belong to multiple classes, so it cannot be directly combined with the other results. Therefore each pixel must be assigned to exactly one class, and this assignment is determined by traversal. The traversal process is: traverse the whole matrix set, find for each pixel the matrix of the class with the maximum transparency, regard that as the class to which the pixel belongs, and set the transparency of this pixel in all other matrices to 0.
4.4 Finally, dot-multiply each matrix in the layer set output in step 4.3 with the gray-scale map of the saliency detection result, and determine which classes to keep and the intersection parts between the kept classes. Combining all the recorded parts gives the final desired foreground body. The method comprises the following steps:
4.4.1 The processed result of the edge refinement part is a set of two-dimensional matrices; for each class, the values of its pixels are recorded. If a pixel belongs to the class, it keeps its value; if a pixel does not belong to the class, its value is set to 0. Whether a pixel belongs to a class is determined by its transparency.
4.4.2 Then, each matrix in the two-dimensional matrix set is dot-multiplied with the gray-scale map of the result of determining the main body; in the product, the intersection has non-zero values and the non-intersecting part is 0. The matrix after dot multiplication is traversed to count the pixels with non-zero values, i.e., the size of the intersection area, denoted m. Traversal likewise yields the area of the class, denoted small, and the area of the foreground body of the gray-scale map, denoted big.
Let b1 denote the ratio of intersection area to class area, expressed as:
let BL be the ratio of the intersection area to the foreground subject area, expressed by the expression:
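The two expressions are not reproduced in this rendering; from the definitions of m, small and big in step 4.4.2 they are simply:

$$b1=\frac{m}{\mathrm{small}},\qquad BL=\frac{m}{\mathrm{big}}$$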
Therefore, if the value of b1 exceeds a set threshold, the class is considered to be essentially part of the foreground and all records of the class are retained; otherwise the value of BL is judged, and the intersection part is retained when BL exceeds a certain threshold.
4.4.3 All retained recorded parts are combined, the intersection with the co_map part retained from the co-occurrence features is taken, and the result of the final image is output.
4.5 After the fusion of the determined main-body result and the edge refinement result is finished, the foreground part of the image is output.
Claims (10)
1. An image segmentation method based on bounding box and co-occurrence feature prediction is characterized by comprising the following steps:
1) Introducing an FCOS method, and determining a main body by using a multi-scale strategy and a centrality method:
1.1 Inputting an image and a real boundary box, determining and outputting a main body target boundary box range by using a multi-scale strategy and a centrality method based on a first-order full convolution target detection FCOS method;
1.2) Inputting an image based on the main body target bounding box range, and adopting the context-aware pyramid feature extraction module CPFE for multi-scale high-level feature mapping to obtain rich context features; CPFE takes Conv3-3, Conv4-3 and Conv5-3 in the VGG-16 network architecture as basic high-level features, and VGG-16 comprises 13 convolution layers and three fully connected layers;
1.3 After the context-aware pyramid features are extracted, a channel attention mechanism model is added and used, so that channels which show high response to the salient objects are assigned with larger weights; weighting the context awareness pyramid features by using a channel attention mechanism model to output new high-level features;
1.4) Acquiring the low-level features of the natural image, taking Conv1-2 and Conv2-2 in the VGG-16 network architecture as the basic low-level feature inputs, and adopting a spatial attention mechanism model to focus on the boundary between salient objects and the background, which is beneficial for generating effective low-level features that contain boundary information;
1.5 Fusing the high-level features weighted by the channel attention mechanism model and the low-level feature output weighted by the space attention mechanism model, and supervising the generation of the significant boundary positioning information by using cross entropy loss; according to the positioning information, outputting a gray level image of the foreground outline of the image;
2) Co-occurrence feature classification:
First, given an input image, a convolution feature map is extracted using a pre-trained CNN; the network then splits into three branches: first, feedforward networks convert the feature map into representations of a target vector and a co-occurrence vector, and the probability of co-occurrence features is estimated based on their similarity; second, the cross-spatial context is captured with the co-occurrence probability through the aggregate co-occurrence feature module ACF; third, global features are captured using average pooling;
3) Edge refinement:
3.1 Firstly, after inputting an image, collecting image information characteristics; the information features of an image come mainly from two aspects: firstly, based on spectral extinction, obtaining non-local color relation information from an input image from the perspective of spectral analysis; secondly, using a convolutional neural network for scene analysis to generate high-level semantic relation information;
3.2) Combining the non-local color relation information and the high-level semantic relation information of the image, establishing the image layers, and revealing the semantic objects and the soft transition relations among them in the eigenvectors of the Laplace matrix;
3.3) Extracting the eigenvectors corresponding to the 100 smallest eigenvalues of the Laplace matrix L, and then clustering the eigenvectors using k-means; after the edge refinement process is finished, outputting the image layers formed from the Laplacian matrix;
4) Determining a main body result, a co-occurrence feature classification result and an edge precision result, and fusing:
4.1 First, processing results of determining the body portion; performing binarization processing on the gray level map obtained through the output result of the step 1.5), and reserving the main body outline and the obvious main body white area;
4.2) Then, processing the result of the co-occurrence features; obtaining a matrix corresponding to each class from the co-occurrence results, dot-multiplying each matrix with the gray-scale map, and determining the main-body class by traversing and calculating the intersection of each class's result matrix with the gray-scale map; the class with the largest intersection among all classes is the main-body class, and the intersection between this class and the gray-scale map is retained as co_map;
4.3 Then, processing the result of the edge refinement portion; traversing the whole matrix set, finding out the matrix of the class with the maximum transparency of each pixel point, considering the matrix as the class to which the matrix belongs, and setting the transparency in other matrixes except the class to be 0;
4.4 And finally, carrying out dot multiplication on the Laplace matrix set output in the step 4.3) and the gray level diagram of the saliency detection result respectively, determining the needed reserved and intersection parts among reserved classes, and combining all reserved recorded parts to obtain the final needed foreground main body part;
4.5 After the fusion process is finished, outputting a foreground part of the image.
2. The image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific procedure of step 1.1) is as follows:
1.1.1 First, performing pixel-by-pixel regression prediction by using an FCOS method, wherein the prediction strategy is as follows:
A bounding box B_i is given by the coordinates of its upper-left corner and the coordinates of its lower-right corner, where i denotes the i-th bounding box in the image; the pixel position is (m, n), and the vector a* = (l*, t*, r*, b*), where l*, t*, r*, b* denote the distances from the pixel to the left, upper, right and lower sides of the bounding box, respectively; after associating position (m, n) with bounding box B_i, the training regression target for that position can be expressed as:
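The expression is not reproduced in this rendering; in the FCOS formulation, writing the elided corner coordinates of B_i as (x_0^{(i)}, y_0^{(i)}) for the upper-left corner and (x_1^{(i)}, y_1^{(i)}) for the lower-right corner (assumed notation), the regression target is:

$$l^{*}=m-x_{0}^{(i)},\quad t^{*}=n-y_{0}^{(i)},\quad r^{*}=x_{1}^{(i)}-m,\quad b^{*}=y_{1}^{(i)}-n$$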
since the FCOS algorithm performs pixel-by-pixel regression on points within the target object box, the regression targets are all positive samples, so the exp function is used to stretch the regression targets, i.e., the l*, t*, r*, b* obtained from the above formula are fed as variables into the exp function, and the output values are the stretched distances from the corresponding pixel to the left, upper, right and lower sides of the bounding box, respectively;
1.1.2) Secondly, a multi-scale strategy is used to solve the intractable ambiguity caused by overlapping ground-truth boxes, namely, which bounding box a position inside the overlap should regress to; in order to better utilize the multi-scale features, the feature layer at each scale defines the range of bounding box regression, with the following specific steps:
1.1.2.1): Define five levels of feature maps {P1, P2, P3, P4, P5} to support training with the multi-scale strategy, where each level has different parameters H × W / s, with H and W the height and width of the feature map and s its downsampling rate; then compute the regression targets l*, t*, r*, b* in each level;
1.1.2.2): Judge whether max(l*, t*, r*, b*) > d_i or max(l*, t*, r*, b*) < d_{i-1} holds, where max denotes taking the maximum value; d_i is the maximum distance that feature level i needs to regress, corresponding to the five levels of feature maps {P1, P2, P3, P4, P5} defined in the first step, i.e., 5 feature levels with i taking the values 1, 2, 3, 4, 5; d_0, d_1, d_2, d_3, d_4, d_5 are set to 0, 64, 128, 256, 512 and ∞;
1.1.2.3): If the condition in step 1.1.2.2) is satisfied, the position is set as a negative sample and bounding box regression prediction is not performed for it;
1.1.3 Finally, using a centrality method; because the FCOS algorithm uses a pixel-by-pixel regression strategy, a plurality of low-quality center points and prediction boundary boxes with more offsets can be generated while the recall rate is improved, a centrality method is used for inhibiting the low-quality detected boundary boxes, and no super-parameters are introduced into the strategy; the centrality method adds a loss to the network, and the loss ensures that the predicted bounding box is as close to the center as possible;
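The centrality formula referenced here is not reproduced in this rendering; in the FCOS method it has the standard form below, consistent with the max/min description that follows:

$$\mathrm{centerness}^{*}=\sqrt{\frac{\min\left(l^{*},r^{*}\right)}{\max\left(l^{*},r^{*}\right)}\times\frac{\min\left(t^{*},b^{*}\right)}{\max\left(t^{*},b^{*}\right)}}$$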
wherein max represents a maximum value, and min represents a minimum value;
from the above formula it can be derived that centerness is a measure whose value lies between 0 and 1; the centerness* in the above formula is computed first, and the obtained centerness* is then trained using cross entropy loss; cross entropy loss is a function used in mathematics to measure the difference between two probability distributions, and can measure the information loss between a predicted value and a real label value; the expression of the cross entropy loss function Loss is as follows:
n denotes that there are n groups of labels, e_i is the probability distribution of the i-th real label, and the corresponding predicted term is the probability distribution of the i-th label prediction;
centerness* is input as a value into the above formula: when centerness* ∈ [0.5, 1], the label value is taken as 1; when centerness* ∈ [0, 0.5), the label value is taken as 0, and the Loss obtained after each round of training is calculated; when the Loss is smaller than a given threshold, i.e., the information loss between the predicted value and the real label value meets the expected requirement, training ends; when the Loss is greater than the given threshold, l*, t*, r*, b* are reduced, i.e., the bounding box is moved closer to the center, and centerness* and the corresponding Loss value are recomputed.
3. The image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific procedure of step 1.2) is as follows:
1.2.1) In order for the finally extracted high-level features to have scale-invariance and shape-invariance properties, atrous convolutions with different dilation rates are adopted, and the dilation rates are set to 3, 5 and 7 respectively to capture context information;
1.2.2) Feature maps from the different atrous convolution layers are stitched with 1×1 dimension-reduction features through cross-channel connection; then three features of different scales are obtained using the context-aware information, the three features are combined in pairs, the smaller-scale feature in each pair is up-sampled, and high-level features at three scales are output; up-sampling, also called image interpolation, inserts new elements between pixels using an interpolation algorithm on the basis of the original image pixels, thereby enlarging the original image;
1.2.3 Finally, the up-sampled advanced features are combined through cross-channel connection to be used as the output of the context sensing pyramid feature extraction module.
4. The image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific procedure of step 1.3) is as follows:
1.3.1) First, the high-level features f_h ∈ R^(W×H×C) obtained after context-aware pyramid feature extraction are unfolded, where R denotes the set of spatial positions, W the width of the dimension, H the height of the dimension, and C the total number of channels; average pooling is then applied to obtain the channel feature vector v_h, where the purpose of average pooling is to reduce the error caused by the increase in the variance of the estimate due to the limited neighborhood size, thereby helping to preserve more background information of the image; the result of average pooling S_j is obtained as follows:
wherein T represents a rank threshold for the activation values selected to participate in pooling, R_j denotes the pooling field in the j-th feature map, i denotes the index of an activation value within this pooling field, and r_i and g_i denote the rank and the activation value of activation i, respectively;
1.3.2) Then, the channel feature vector v_h obtained in the previous step is passed through the fully connected layer FC and output to the ReLU layer;
1.3.3) The channel feature vector is then mapped to [0,1] using a Sigmoid operation to complete normalization and obtain the ca value, i.e., a weight matrix for each layer of the high-level features; therefore ca = F(v_h, W) = σ_1(fc_2(δ(fc_1(v_h, W_1)), W_2)), wherein W_1 and W_2 are parameters of the channel attention mechanism, σ_1 refers to the sigmoid operation, fc refers to the fully connected layer, and δ refers to the ReLU function; the fully connected layer is the computational layer that acts as a classifier in a convolutional neural network; the ReLU layer is a computational layer containing the ReLU function, a commonly used activation function in artificial neural networks;
1.3.4) Finally, the high-level features obtained by weighting the context-aware pyramid features with ca are output, where the weighting denotes a dot product;
5. the image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific procedure of step 1.4) is as follows:
1.4.1) The high-level features weighted by the context-aware pyramid features are input to capture spatial points of interest; to obtain global information without increasing the number of parameters, two convolution layers are used, one with a 1×k kernel and the other with a k×1 kernel; the two convolution layers process the input in parallel, the output value of the convolution layer with the 1×k kernel is set to C_1, and the output value of the convolution layer with the k×1 kernel is set to C_2; thus C_1 and C_2 satisfy the following expression:
wherein the weight terms refer to parameters of the spatial attention mechanism, and conv1 and conv2 refer to the convolution layers with kernels 1×k and k×1, respectively;
1.4.2) The output value C_1 of the convolution layer with the 1×k kernel and the output value C_2 of the convolution layer with the k×1 kernel are added, then mapped to [0,1] using a Sigmoid operation to complete normalization and obtain the sa value; thus sa = σ_2(C_1 + C_2), where σ_2 denotes the sigmoid operation of the current step;
1.4.3) The low-level features f_l ∈ R^(W×H×C), wherein R denotes the set of spatial positions, W the width of the dimension, H the height of the dimension, and C the total number of channels, are weighted with sa to obtain the final weighted low-level features;
6. the image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific procedure of step 1.5) is as follows:
the Laplace operator is first used to obtain the true boundary and the boundary of the saliency map output by the network, and the cross entropy loss L_B is then used to supervise the generation of salient object boundaries;
wherein the Laplace operator is a second order differential operator in an n-dimensional Euclidean space, and is defined as the divergence delta z of the gradient; since the second derivative can be used to detect edges, the laplace operator is used to derive significant object boundaries; the Laplace operator is given by the following formula, where h and y are set to be standard Cartesian coordinates of a plane, and z represents a curve function;
thus, by using the Laplace operator, the true boundary ΔY and the saliency map ΔQ of the network output can be obtained;
the cross entropy loss formula is as follows:
wherein Y represents the set of real boundary maps, size(Y) represents the total number of real boundary maps in the set, i represents the i-th group, ΔY_i is the real boundary map of the i-th group represented by the Laplacian, and ΔQ_i is the saliency map of the i-th group of network outputs represented by the Laplacian.
7. The image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific procedure of step 2) is as follows:
2.1 Calculating the probability of co-occurrence features):
2.1.1): Consider the input convolutional neural network (CNN) feature map as N channel-dimensional features X, i.e., X = {x_1, ..., x_i, ..., x_N}, where x_i denotes the input feature at position i, i ∈ {1, ..., N}; the probability of the co-occurrence feature x_c given the target feature x_t is set as:
wherein p(x_c | x_t) denotes the likelihood of the co-occurrence feature x_c, i.e., the probability of X when the target feature is x_t; s(x_c, x_t) is the similarity function between the co-occurrence feature x_c and the target feature x_t, and exp denotes the exponential function; the similarity function s is the dot product of the vector representations of the target feature x_t and the co-occurrence feature x_c, where T denotes vector transposition; these vector representations are learned with feedforward networks; similarly, s(x_i, x_t) denotes the similarity function between the input feature x_i at position i and the target feature x_t, given by the dot product of the vector representation of x_i with that of x_t;
2.1.2): Compute the probability distribution of the co-occurrence features using a Softmax with spatial similarity; the formula for computing the probability distribution p(x_c | x_t)' is as follows:
wherein π_k is the prior weight of the k-th component, K denotes the number of sub-components into which the vector representations are divided, and s_k(x_c, x_t) is the similarity function between the co-occurrence feature x_c and the target feature x_t for the k-th component, obtained as the dot product of the vector representations of x_t and x_c corresponding to the k-th sub-component (· denotes dot product, T denotes vector transposition, and exp the exponential function); similarly, s_k(x_i, x_t) denotes the similarity function between the input feature x_i at position i and the target feature x_t for the k-th component, computed from the vector representation of x_i corresponding to the k-th sub-component; the prior weight π_k is responsible for capturing context information, wherein w_k is a learnable variable, N denotes the N channels of the feature map, x_i is the input feature at position i, · denotes dot product, T denotes vector transposition, and exp denotes the exponential function with base e;
2.2 Building an aggregate appearance feature module ACF):
the aggregation co-occurrence feature module aggregates cross-space context information by applying co-occurrence probability through a self-attention mechanism; the specific formula is as follows:
wherein z_t is the aggregated feature output of target t, p(x_c | x_t)' is the co-occurrence feature probability in the output result of the previous step, and ω_c denotes the transformed representation of the input feature x_c of channel c, obtained from ω_c = Ψ(x_c), where x_c is a co-occurrence feature and Ψ is learned using a feedforward network;
2.3 Average pooling feature):
capturing features with an average pooling layer and attaching a 1×1 convolution layer to each feature location; the co-occurrence feature module is extended by this method; the average pooled result S_j is obtained as follows:
wherein T represents a rank threshold for the activation values selected to participate in pooling; R_j denotes the pooling field in the j-th feature map, and i denotes the index of an activation value within this pooling field; r_i and g_i denote the rank and the activation value of activation i, respectively.
8. The image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific procedure of step 3.1) is as follows:
3.1.1) Obtaining the non-local color relationship: the construction of the low-level non-local color relationship has two key points: first, 2500 superpixels are generated using superpixel segmentation SLIC; second, the affinity relationship of each superpixel with all superpixels within a radius of 20% of the image size is evaluated; for two superpixels s and t whose distance is less than 20% of the image size, a non-local color relationship of their centroids is defined, where c_s, c_t ∈ [0,1] are the mean color values of the superpixels, erf is the Gaussian error function, and a_c, b_c are the parameters of the radial relation term controlling the rate of decrease and the threshold at which it becomes 0;
3.1.2) Obtaining high-level semantic relation information: the purpose of the semantic relationship is to encourage grouping of pixels belonging to the same scene object and to discourage grouping of pixels from different objects; a semantic segmentation network is trained on the COCO-Stuff dataset using DeepLab-ResNet-101 as the feature extractor; DeepLab-ResNet-101 consists of a DeepLab model with a ResNet-101 backbone; COCO-Stuff is a public dataset on which DeepLab is trained, whose images include 91 classes of targets; the feature vectors of two superpixels s and t are applied to represent the high-level semantic relationship between superpixels, where the mean feature vectors of s and t are used, erf is the Gaussian error function, and a_s and b_s are the parameters controlling the rate at which the function drops and the threshold at which it becomes negative.
9. The image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific process of forming the laplace matrix in step 3.2) is as follows:
the Laplace matrix L is constructed from the two groups of relationships between pixels acquired previously, the non-local color relationship and the high-level semantic relationship, combined according to the principle of least-squares optimization;
wherein W_L is the matrix containing all pixel-pair affinities, W_c is the matrix containing the non-local color relationships, W_s is the matrix containing the semantic relationships, σ_S and σ_C are parameters controlling the influence of the corresponding matrices and are set to 0.01, and D is a diagonal matrix.
10. The image segmentation method based on bounding box and co-occurrence feature prediction according to claim 1, wherein the specific procedure of step 4.4) is as follows:
4.4.1) The result processed by the edge refinement part is a set of two-dimensional matrices, and the pixel values of each class are recorded separately; if a pixel belongs to the class, it keeps its value; if a pixel does not belong to the class, its value is set to 0; whether a pixel belongs to the class is determined by its transparency;
4.4.2) Then, each matrix in the two-dimensional matrix set is dot-multiplied with the gray-scale map of the result of determining the main body, where the intersection of the results has non-zero values and the non-intersecting part is 0; the matrix after dot multiplication is traversed to count the pixels with non-zero values, i.e., the size of the intersection area, regarded as m; traversal likewise yields the area of the class, denoted small, and the area of the foreground main body of the gray-scale map, denoted big;
Let bl be the ratio of intersection area to class area, expressed as:
let BL be the ratio of the intersection area to the foreground subject area, expressed by the expression:
therefore, if the value of bl exceeds the set range, the class is considered to be basically part of the foreground, and all records of the class are reserved; otherwise, judging the value of BL, and reserving an intersection part when the value of BL exceeds a set range;
4.4.3 All the parts of the reserved record are combined, and the co-map part reserved by the co-occurrence feature is taken as an intersection, so that the result output by the last image is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110999575.9A CN113592894B (en) | 2021-08-29 | 2021-08-29 | Image segmentation method based on boundary box and co-occurrence feature prediction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110999575.9A CN113592894B (en) | 2021-08-29 | 2021-08-29 | Image segmentation method based on boundary box and co-occurrence feature prediction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113592894A CN113592894A (en) | 2021-11-02 |
CN113592894B true CN113592894B (en) | 2024-02-02 |
Family
ID=78240093
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110999575.9A Active CN113592894B (en) | 2021-08-29 | 2021-08-29 | Image segmentation method based on boundary box and co-occurrence feature prediction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113592894B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463187B (en) * | 2022-04-14 | 2022-06-17 | 合肥高维数据技术有限公司 | Image semantic segmentation method and system based on aggregation edge features |
CN114913543A (en) * | 2022-04-14 | 2022-08-16 | 中国科学院微电子研究所 | Prediction method and device for bumblebee fighting behavior |
CN116596937B (en) * | 2023-07-19 | 2023-09-22 | 南京星罗基因科技有限公司 | Method for detecting parameters of chicken head organs |
CN116703980B (en) * | 2023-08-04 | 2023-10-24 | 南昌工程学院 | Target tracking method and system based on pyramid pooling transducer backbone network |
CN117292133B (en) * | 2023-10-30 | 2024-10-15 | 浙江芯劢微电子股份有限公司 | Super-pixel segmentation method and device for natural image |
CN118570099B (en) * | 2024-08-02 | 2024-10-11 | 中国人民解放军海军青岛特勤疗养中心 | Postoperative rehabilitation effect monitoring system based on image enhancement |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109711413A (en) * | 2018-12-30 | 2019-05-03 | 陕西师范大学 | Image, semantic dividing method based on deep learning |
CN110084249A (en) * | 2019-04-24 | 2019-08-02 | 哈尔滨工业大学 | The image significance detection method paid attention to based on pyramid feature |
AU2020103901A4 (en) * | 2020-12-04 | 2021-02-11 | Chongqing Normal University | Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field |
Also Published As
Publication number | Publication date |
---|---|
CN113592894A (en) | 2021-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113592894B (en) | Image segmentation method based on boundary box and co-occurrence feature prediction | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
Li et al. | SAR image change detection using PCANet guided by saliency detection | |
CN107066559B (en) | Three-dimensional model retrieval method based on deep learning | |
CN109829449B (en) | RGB-D indoor scene labeling method based on super-pixel space-time context | |
CN112184752A (en) | Video target tracking method based on pyramid convolution | |
CN106815323B (en) | Cross-domain visual retrieval method based on significance detection | |
Qi et al. | SaliencyRank: Two-stage manifold ranking for salient object detection | |
CN110728694B (en) | Long-time visual target tracking method based on continuous learning | |
CN110633708A (en) | Deep network significance detection method based on global model and local optimization | |
CN113592893B (en) | Image foreground segmentation method for determining combination of main body and accurate edge | |
CN110188763B (en) | Image significance detection method based on improved graph model | |
CN110211127B (en) | Image partition method based on bicoherence network | |
Li et al. | A review of deep learning methods for pixel-level crack detection | |
CN110378911B (en) | Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier | |
Feng et al. | A color image segmentation method based on region salient color and fuzzy c-means algorithm | |
CN108230330B (en) | Method for quickly segmenting highway pavement and positioning camera | |
CN110298248A (en) | A kind of multi-object tracking method and system based on semantic segmentation | |
CN114998890B (en) | Three-dimensional point cloud target detection algorithm based on graph neural network | |
CN111091129B (en) | Image salient region extraction method based on manifold ordering of multiple color features | |
CN109840518B (en) | Visual tracking method combining classification and domain adaptation | |
CN106157330A (en) | A kind of visual tracking method based on target associating display model | |
Gao et al. | Synergizing low rank representation and deep learning for automatic pavement crack detection | |
CN118115868A (en) | Remote sensing image target detection method, remote sensing image target detection device, computer equipment and storage medium | |
CN108765384B (en) | Significance detection method for joint manifold sequencing and improved convex hull |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||