CN114419322A - Image instance segmentation method and device, electronic equipment and storage medium

Image instance segmentation method and device, electronic equipment and storage medium

Info

Publication number
CN114419322A
CN114419322A
Authority
CN
China
Prior art keywords
solov2
target
model
segmentation
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210321177.6A
Other languages
Chinese (zh)
Other versions
CN114419322B (en)
Inventor
刘聪 (Liu Cong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Feihu Information Technology Tianjin Co Ltd
Original Assignee
Feihu Information Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Feihu Information Technology Tianjin Co Ltd filed Critical Feihu Information Technology Tianjin Co Ltd
Priority to CN202210321177.6A priority Critical patent/CN114419322B/en
Publication of CN114419322A publication Critical patent/CN114419322A/en
Application granted granted Critical
Publication of CN114419322B publication Critical patent/CN114419322B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image instance segmentation method and apparatus, an electronic device, and a storage medium. An image to be segmented is input into a target SOLOV2 model and segmented by the target SOLOV2 model to obtain a final instance segmentation map. The target SOLOV2 model comprises a ResNext101 network, an FPN network, a prediction network and an ARM module, and segmenting the image proceeds as follows: the ResNext101 network extracts target shallow features and target deep features from the image to be segmented; the FPN network fuses the target shallow features and target deep features into a target high-resolution mask feature; the prediction network performs instance segmentation on the target high-resolution mask feature to obtain an initial instance segmentation map; and the ARM module performs boundary information enhancement on the initial instance segmentation map, using the high-resolution mask feature, to obtain the final instance segmentation map.

Description

Image instance segmentation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image segmentation technologies, and in particular, to an image instance segmentation method and apparatus, an electronic device, and a storage medium.
Background
A bullet screen ("barrage"/danmaku) is the stream of viewer comments rendered as subtitles over a video while it plays. When there are too many comments they seriously occlude the video image and spoil the viewing experience; simply switching the bullet screen off also degrades the experience, and confining comments to the top of the frame makes them feel disconnected from the live content once their volume grows. For this reason, most video websites apply instance-segmentation-based intelligent anti-occlusion to the bullet screen, so that comments avoid covering the people in the frame.
Existing image instance segmentation methods generally fall into two-stage and one-stage approaches. Two-stage methods follow the classic detect-then-segment strategy: they localize precisely, but prediction latency is high, real-time operation is out of reach, and the instance segmentation result is constrained by the quality of the object detection boxes.
One-stage methods split instance segmentation into two parallel subtasks inside a single-stage network structure, keeping the computational cost as low as possible. The representative SOLO family has been refined continuously and reaches the industry SOTA in accuracy and in the accuracy-versus-speed trade-off. The core idea of SOLO is to recast segmentation as a position-classification problem, so neither anchors (anchor boxes) nor bounding boxes are needed: the pixels of each instance are assigned a category according to the instance's position and size, which segments the instance object. Concretely, if the center of an object falls in a grid cell, that cell is responsible for predicting the object's semantic category and assigning a position category to each pixel; however, every pixel in a proposal is treated equally, ignoring the target's shape and boundary information.
In summary, both existing families of instance segmentation methods segment instance boundaries inaccurately, and inaccurate instance boundaries make people in the frame jitter or flicker during video playback, which severely harms the viewing experience.
Disclosure of Invention
In view of this, the present invention provides an image instance segmentation method and apparatus, an electronic device, and a storage medium, so as to solve the prior-art problem that inaccurate segmentation of instance boundaries causes people to jitter or flicker during video playback and severely harms the viewing experience.
A first aspect of the invention discloses an image instance segmentation method, comprising the following steps:
acquiring an image to be segmented;
inputting the image to be segmented into a target SOLOV2 model, and segmenting the image through the target SOLOV2 model to obtain a final instance segmentation map, wherein the target SOLOV2 model is obtained by training a SOLOV2 model to be trained with an instance segmentation data set and comprises a ResNext101 network, an FPN network, a prediction network and an ARM module; segmenting the image to be segmented through the target SOLOV2 model comprises the following steps:
extracting features from the image to be segmented through the ResNext101 network to obtain a target shallow feature and a target deep feature;
fusing the target shallow feature and the target deep feature through the FPN network to obtain a target high-resolution mask feature;
performing instance segmentation on the target high-resolution mask feature through the prediction network to obtain an initial instance segmentation map, and inputting the target high-resolution mask feature and the initial instance segmentation map into the ARM module;
and performing boundary information enhancement through the ARM module, using the target high-resolution mask feature and the initial instance segmentation map, to obtain the final instance segmentation map.
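To make the data flow of these four steps concrete, a minimal PyTorch-style sketch is given below. The class and attribute names (TargetSOLOV2, pred_head, arm, and the two-tuple returned by the backbone) are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class TargetSOLOV2(nn.Module):
    """Illustrative wiring of the four components named in the method above."""
    def __init__(self, backbone, fpn, pred_head, arm):
        super().__init__()
        self.backbone = backbone    # ResNext101: extracts shallow + deep features
        self.fpn = fpn              # fuses them into one high-resolution mask feature
        self.pred_head = pred_head  # prediction network: category branch + mask branch
        self.arm = arm              # ARM module: boundary information enhancement

    def forward(self, image):
        shallow, deep = self.backbone(image)          # target shallow / deep features
        mask_feat = self.fpn(shallow, deep)           # target high-resolution mask feature
        initial_seg = self.pred_head(mask_feat)       # initial instance segmentation map
        final_seg = self.arm(mask_feat, initial_seg)  # final, boundary-enhanced map
        return final_seg
```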
Optionally, the SOLOV2 model to be trained includes a ResNext101 network to be trained, an FPN network to be trained, and a prediction network to be trained, and training the SOLOV2 model to be trained with the instance segmentation data set to obtain the target SOLOV2 model includes:
obtaining an instance segmentation data set, wherein the instance segmentation data set comprises a plurality of instance segmentation data;
for each instance segmentation data, inputting the instance segmentation data into the SOLOV2 model to be trained so that it performs instance segmentation on the data to obtain a first training instance segmentation map, constructing a first loss function from the first training instance segmentation map and the corresponding target instance segmentation map, and adjusting the parameters of the ResNext101 network to be trained, the FPN network to be trained and the prediction network to be trained with the first loss function until the SOLOV2 model to be trained converges, yielding an initial SOLOV2 model;
constructing a SOLOV2 model from the initial SOLOV2 model and an ARM module;
for each instance segmentation data, inputting the instance segmentation data into the SOLOV2 model;
extracting features from the instance segmentation data through the ResNext101 network in the SOLOV2 model to obtain shallow features and deep features;
fusing the shallow features and the deep features through the FPN network to obtain a high-resolution mask feature;
performing instance segmentation on the high-resolution mask feature through the prediction network in the SOLOV2 model to obtain a second training instance segmentation map, and inputting the high-resolution mask feature and the second training instance segmentation map into the ARM module in the SOLOV2 model;
enhancing the second training instance segmentation map with the high-resolution mask feature through the ARM module in the SOLOV2 model to obtain a third training instance segmentation map;
and constructing a second loss function from the third training instance segmentation map and the corresponding target instance segmentation map, and adjusting the parameters of the prediction network and the ARM module in the SOLOV2 model with the second loss function until the SOLOV2 model converges, yielding the target SOLOV2 model.
Optionally, the prediction network includes a category branch and a mask branch, and performing instance segmentation on the target high-resolution mask feature through the prediction network to obtain the initial instance segmentation map includes:
performing category prediction on the target high-resolution mask feature through the category branch to obtain at least one target category feature map;
and segmenting each target category feature map through the mask branch to obtain the initial instance segmentation map.
Optionally, performing boundary information enhancement through the ARM module, using the target high-resolution mask feature and the initial instance segmentation map, to obtain the final instance segmentation map includes:
predicting target instance edge features from the target high-resolution mask feature with a preset algorithm through the ARM module, and enhancing the boundary information of the initial instance segmentation map with the target instance edge features to obtain the final instance segmentation map.
A second aspect of the invention discloses an image instance segmentation apparatus, comprising:
an image acquisition unit for acquiring an image to be segmented;
a target SOLOV2 model for segmenting the input image to be segmented to obtain a final instance segmentation map, the target SOLOV2 model being obtained by a training unit that trains a SOLOV2 model to be trained with an instance segmentation data set in advance, and comprising a ResNext101 network, an FPN network, a prediction network and an ARM module;
the ResNext101 network for extracting features from the image to be segmented to obtain a target shallow feature and a target deep feature;
the FPN network for fusing the target shallow feature and the target deep feature to obtain a target high-resolution mask feature;
the prediction network for performing instance segmentation on the target high-resolution mask feature to obtain an initial instance segmentation map, and inputting the target high-resolution mask feature and the initial instance segmentation map into the ARM module;
and the ARM module for performing boundary information enhancement with the target high-resolution mask feature and the initial instance segmentation map to obtain the final instance segmentation map.
Optionally, the SOLOV2 model to be trained includes a ResNext101 network to be trained, an FPN network to be trained, and a prediction network to be trained, and the training unit includes:
an instance segmentation data acquisition unit for obtaining an instance segmentation data set, wherein the instance segmentation data set comprises a plurality of instance segmentation data;
a first training subunit for, for each instance segmentation data, inputting the instance segmentation data into the SOLOV2 model to be trained so that it performs instance segmentation on the data to obtain a first training instance segmentation map, constructing a first loss function from the first training instance segmentation map and the corresponding target instance segmentation map, and adjusting the parameters of the ResNext101 network to be trained, the FPN network to be trained and the prediction network to be trained with the first loss function until the SOLOV2 model to be trained converges, yielding an initial SOLOV2 model;
a SOLOV2 model construction unit for constructing a SOLOV2 model from the initial SOLOV2 model and an ARM module;
an input unit for inputting, for each instance segmentation data, the instance segmentation data into the SOLOV2 model;
a feature extraction unit for extracting features from the instance segmentation data through the ResNext101 network in the SOLOV2 model to obtain shallow features and deep features;
a fusion processing unit for fusing the shallow features and the deep features through the FPN network to obtain a high-resolution mask feature;
a first instance segmentation unit for performing instance segmentation on the high-resolution mask feature through the prediction network in the SOLOV2 model to obtain a second training instance segmentation map;
an image enhancement processing unit for inputting the high-resolution mask feature and the second training instance segmentation map into the ARM module in the SOLOV2 model, and enhancing the second training instance segmentation map with the high-resolution mask feature through that ARM module to obtain a third training instance segmentation map;
and a second training subunit for constructing a second loss function from the third training instance segmentation map and the corresponding target instance segmentation map, and adjusting the parameters of the prediction network and the ARM module in the SOLOV2 model with the second loss function until the SOLOV2 model converges, yielding the target SOLOV2 model.
Optionally, the prediction network includes a category branch and a mask branch; the prediction network, which performs category prediction on the target high-resolution mask feature and segments each resulting category feature map to obtain the initial instance segmentation map, is specifically configured to:
perform category prediction on the target high-resolution mask feature through the category branch to obtain at least one target category feature map; and segment each target category feature map through the mask branch to obtain the initial instance segmentation map.
Optionally, the ARM module, which performs boundary information enhancement with the target high-resolution mask feature and the initial instance segmentation map to obtain the final instance segmentation map, is specifically configured to:
predict target instance edge features from the target high-resolution mask feature with a preset algorithm, and enhance the boundary information of the initial instance segmentation map with the target instance edge features to obtain the final instance segmentation map.
A third aspect of the invention discloses an electronic device comprising a processor and a memory, wherein the memory stores program code and data for image instance segmentation, and the processor calls the program instructions in the memory to execute the image instance segmentation method disclosed in the first aspect of the invention.
A fourth aspect of the invention discloses a storage medium comprising a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the image instance segmentation method disclosed in the first aspect of the invention.
The invention provides an image instance segmentation method and apparatus, an electronic device, and a storage medium. A SOLOV2 model to be trained is trained in advance with a public instance segmentation data set to obtain a target SOLOV2 model comprising a ResNext101 network, an FPN network, a prediction network and an ARM module. After an image to be segmented is acquired, it is input into the target SOLOV2 model: the ResNext101 network extracts a target shallow feature and a target deep feature from the image; the FPN network fuses them into a target high-resolution mask feature; the prediction network performs instance segmentation on the mask feature to obtain an initial instance segmentation map; and finally the target high-resolution mask feature and the initial instance segmentation map are fed together into the ARM module, which enhances the boundary information to produce the final instance segmentation map. By integrating the ARM module into the target SOLOV2 model and using it to further strengthen the boundary information of the initial instance segmentation map, the boundaries of the resulting instance segmentation map become more accurate, solving the prior-art problem that inaccurate instance-boundary segmentation makes people jitter or flicker during video playback and severely harms the viewing experience.
Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention, and a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an image instance segmentation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of the process of training a SOLOV2 model to be trained with an instance segmentation data set to obtain a target SOLOV2 model, according to an embodiment of the present invention;
FIG. 3 is a flowchart of the process of segmenting an image to be segmented with a target SOLOV2 model, according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an image instance segmentation apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules, or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules, or units.
It is noted that the modifiers "a", "an" and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them as "one or more" unless the context clearly dictates otherwise.
Instance segmentation: both the object instances and their per-pixel segmentation masks must be predicted. Put simply, semantic segmentation does not distinguish between different instances of the same class: when an image contains several cats, semantic segmentation predicts all of their pixels as one undifferentiated region of the category "cat", whereas instance segmentation separates each cat.
ResNext101 network: the feature extraction network employed in the SOLOV2 model. In a neural network, especially in the computer vision (CV) field, image features are generally extracted first and aggregated at different granularities of the image. This part is the root of the whole CV task, because the downstream tasks (such as instance segmentation and object detection) all build on the extracted image features; this part of the network is therefore called the Backbone.
FPN network: a series of network layers that mix and combine image features and pass them on to the subsequent prediction layers; it is a key link in the SOLOV2 model. It reprocesses and makes full use of the important features extracted by the Backbone.
Convolutional layer: adjacent pixels in an image are closely related, while pixels farther apart are only weakly related. Each neuron therefore only needs to perceive local information, and global information can be obtained by aggregating the local information at higher layers. The convolution operation implements such local receptive fields, and because its weights are shared it greatly reduces the number of parameters, which is why it is so widely used.
Pooling layer: pooling, also called down-sampling, is usually inserted between convolutional layers to reduce the feature dimensions, remove redundant information, compress the features, and shrink the input of the next layer, thereby reducing computation and parameter counts and speeding up calculation. At the same time, the dimensionality reduction emphasizes global features and retains the important information, which can alleviate overfitting to some extent.
Activation layer: the activation layer is important for strengthening the learning ability of a neural network. It helps improve model robustness, enhances non-linear expressive power, mitigates vanishing gradients, and accelerates model convergence.
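For illustration, the three layer types defined above can be stacked as in the following PyTorch sketch (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),  # convolutional layer: local receptive fields, shared weights
    nn.ReLU(inplace=True),                       # activation layer: non-linear expressive power
    nn.MaxPool2d(kernel_size=2),                 # pooling layer: down-sampling, feature compression
)

x = torch.randn(1, 3, 224, 224)
print(block(x).shape)                              # torch.Size([1, 64, 112, 112])
# Weight sharing keeps the conv layer at 3*64*3*3 + 64 = 1792 parameters,
# independent of the spatial size of the input.
print(sum(p.numel() for p in block.parameters()))  # 1792
```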
Referring to fig. 1, a schematic flow chart of an image instance segmentation method provided by an embodiment of the present invention is shown; the image instance segmentation method specifically includes the following steps:
S101: Acquire an image to be segmented.
S102: Input the image to be segmented into a target SOLOV2 model, and segment it through the target SOLOV2 model to obtain a final instance segmentation map.
The target SOLOV2 model is obtained by training a SOLOV2 model to be trained with an instance segmentation data set, and includes a ResNext101 network, an FPN network, a prediction network, and an ARM module.
In the specific execution of step S102, the acquired image to be segmented is input into the target SOLOV2 model, obtained in advance by training the SOLOV2 model to be trained with the instance segmentation data set, so that the image is instance-segmented through the ResNext101 network, the FPN network, the prediction network and the ARM module of the target SOLOV2 model to obtain the instance segmentation map.
In the embodiment of the present application, the SOLOV2 model to be trained includes a ResNext101 network to be trained, an FPN network to be trained, and a prediction network to be trained. The ResNext101 network to be trained consists of a series of convolutional layers, pooling layers and activation layers; the prediction network to be trained comprises a class branch and a mask branch.
In the embodiment of the present application, the process of training the SOLOV2 model to be trained with the instance segmentation data set to obtain the target SOLOV2 model, as shown in fig. 2, specifically includes the following steps:
S201: Obtain an instance segmentation data set, wherein the instance segmentation data set comprises a plurality of instance segmentation data.
In an embodiment of the present application, the instance segmentation data set may be the MSCOCO instance segmentation data set. The data set can be chosen according to the practical application; the embodiments of the present application place no limitation on it.
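For reference, MSCOCO-format instance masks can be loaded with pycocotools as sketched below; the annotation-file path is a placeholder:

```python
from pycocotools.coco import COCO

# Placeholder path to an MSCOCO-format instance segmentation annotation file.
coco = COCO("annotations/instances_train2017.json")

img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    mask = coco.annToMask(ann)                           # per-instance binary mask (H x W)
    name = coco.loadCats(ann["category_id"])[0]["name"]
    print(name, int(mask.sum()), "foreground pixels")
```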
S202: for each instance segmentation data, the instance segmentation data is input into a SOLOV2 model to be trained, so that the SOLOV2 model to be trained performs instance segmentation on the instance segmentation data to obtain a first training instance segmentation graph, a first loss function is constructed by using the first training instance segmentation graph and a corresponding target instance segmentation graph, and parameters of a ResNext101 network to be trained, an FPN network to be trained and a prediction network to be trained are adjusted by using the first loss function until the SOLOV2 model to be trained converges to obtain an initial SOLOV2 model.
In the specific execution process of step S202, after the instance segmentation data set is acquired, for each instance segmentation data in the instance segmentation data set, the instance segmentation data may be input into the SOLOV2 model to be trained.
Extracting shallow features and deep features of example segmentation data through a ResNext101 network to be trained, and performing feature fusion on the extracted shallow features and deep features by using an FPN network to be trained to obtain uniform high-resolution mask features; carrying out instance segmentation processing on the obtained unified high-resolution mask features by using a prediction network to be trained to obtain a first training instance segmentation graph; and finally, constructing a first loss function according to the first training example segmentation graph and the target example segmentation graph corresponding to the example segmentation data.
And adjusting parameters of the ResNext101 network to be trained, the FPN network to be trained and the prediction network to be trained by using a first loss function until the SOLOV2 model to be trained converges to obtain an initial SOLOV2 model. The initial SOLOV2 model includes, among other things, a ResNext101 network, an FPN network, and a prediction network.
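A minimal sketch of this first training phase is shown below. It assumes a Dice loss between the predicted and target masks (the loss used by the SOLO papers; the patent itself only states that the loss compares the training and target instance segmentation maps), and the model and loader objects are assumed to exist:

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Dice loss between predicted soft masks and binary target masks (assumed loss)."""
    pred = pred.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.pow(2).sum(dim=1) + target.pow(2).sum(dim=1)
    return (1.0 - 2.0 * inter / (union + eps)).mean()

def train_phase1(model, loader, lr=0.01):
    """Phase 1: train backbone + FPN + prediction network end to end (model/loader assumed)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for images, target_masks in loader:            # instance segmentation data
        first_seg = model(images)                  # first training instance segmentation map
        loss = dice_loss(first_seg, target_masks)  # the "first loss function"
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```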
In embodiments of the present application, the shallow features of the instance segmentation data may be extracted by convolutional layers near the input end of the ResNext101 network to be trained (e.g., the first or second convolutional layer), and the deep features by convolutional layers near its output end (e.g., the last or second-to-last convolutional layer).
The feature map of the shallow features is large and carries information such as the color, texture and edges of the image; the receptive fields of the shallow features are small and overlap little, which ensures that the subsequent network can capture more image detail. The feature map of the deep features is small and carries the more abstract information in the image, such as semantic information and information about the image as a whole; the receptive fields of the deep features are large and overlap heavily, so the image information is compressed and the overall content of the image is well captured.
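One common way to tap such shallow and deep features from a ResNeXt-101 backbone is to return intermediate stages, as in the following sketch (assuming a recent torchvision; the patent does not prescribe this particular mechanism):

```python
import torch
import torchvision
from torchvision.models._utils import IntermediateLayerGetter

backbone = torchvision.models.resnext101_32x8d(weights=None)
# layer1 yields a shallow, high-resolution feature (color / texture / edge detail);
# layer4 yields a deep, low-resolution feature (abstract semantic content).
extractor = IntermediateLayerGetter(backbone, return_layers={"layer1": "shallow", "layer4": "deep"})

feats = extractor(torch.randn(1, 3, 224, 224))
print(feats["shallow"].shape)  # torch.Size([1, 256, 56, 56])
print(feats["deep"].shape)     # torch.Size([1, 2048, 7, 7])
```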
Optionally, in the embodiment of the present application, category prediction is performed by the category branch on the high-resolution mask feature obtained through the FPN network to be trained, yielding at least one category feature map; each category feature map is then segmented by the mask branch to obtain the first training instance segmentation map. The mask branch comprises a mask kernel branch and a mask feature branch.
Specifically, each category feature map is fed into four consecutive 3x3 convolutional layers in the mask kernel branch for further feature extraction, and a final dynamic convolution kernel is generated by a convolution of size 3x3xD. The category feature map also passes, in the mask feature branch, through three 3x3 convolutional layers with GroupNorm and ReLU activation to obtain the final mask features. Finally, the dynamic convolution kernel is dot-multiplied with the mask features to obtain the first training instance segmentation map, containing foreground and background.
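The two mask sub-branches and the final dot product can be sketched as follows. The channel count E is illustrative, and the dynamic kernels are simplified to 1x1 (so D = E) rather than the full 3x3xD case described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

E = 256  # mask-feature channels (illustrative)
D = E    # with 1x1 dynamic kernels, D = E * 1 * 1

class MaskKernelBranch(nn.Module):
    """Four consecutive 3x3 convs, then a conv whose D output channels form the dynamic kernel."""
    def __init__(self, in_ch):
        super().__init__()
        convs = []
        for i in range(4):
            convs += [nn.Conv2d(in_ch if i == 0 else E, E, 3, padding=1), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*convs)
        self.kernel_pred = nn.Conv2d(E, D, 3, padding=1)

    def forward(self, x):
        return self.kernel_pred(self.convs(x))  # (B, D, S, S): one kernel per cell of an SxS grid

class MaskFeatureBranch(nn.Module):
    """Three 3x3 conv + GroupNorm + ReLU stages producing the final mask features."""
    def __init__(self, in_ch):
        super().__init__()
        layers = []
        for i in range(3):
            layers += [nn.Conv2d(in_ch if i == 0 else E, E, 3, padding=1),
                       nn.GroupNorm(32, E), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

def dynamic_masks(kernels, mask_feat):
    """Dot-multiply each grid cell's dynamic kernel with the mask features (single image)."""
    D_, S, _ = kernels.shape                               # kernels: (D, S, S), e.g. batch index 0
    w = kernels.permute(1, 2, 0).reshape(S * S, D_, 1, 1)  # S*S dynamic 1x1 kernels
    return F.conv2d(mask_feat.unsqueeze(0), w)             # (1, S*S, H, W): one mask per cell
```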
S203: the initial SOLOV2 model and ARM modules were used to construct the SOLOV2 model.
In the specific process of executing step S203, after the SOLOV2 model to be trained is trained by using the instance segmentation data to obtain an initial SOLOV2 model, the initial SOLOV2 model and the ARM module constructed in advance may be used to construct the SOLOV2 model.
S204: for each instance segmentation data, the instance segmentation data is input into the SOLOV2 model.
In the specific execution of step S204, after the SOLOV2 model is constructed, each instance segmentation data is input into the SOLOV2 model constructed above.
S205: Extract features from the instance segmentation data through the ResNext101 network in the SOLOV2 model to obtain shallow features and deep features, and fuse the shallow and deep features through the FPN network to obtain a high-resolution mask feature.
In the specific implementation of step S205, after the instance segmentation data are input into the SOLOV2 model constructed above, the shallow features of the data may be extracted through the convolutional layers near the input end of the ResNext101 network in the SOLOV2 model (e.g., the first or second convolutional layer), and the deep features through the convolutional layers near its output end (e.g., the last or second-to-last convolutional layer).
The FPN network then applies a series of 3x3 convolutions, ReLU activations and bilinear upsampling interpolations to the extracted shallow and deep features to obtain the unified high-resolution mask feature.
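A sketch of this fusion step, simplified to two input levels (the actual FPN repeats 3x3 convolution, ReLU and bilinear upsampling across all pyramid levels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedMaskFeature(nn.Module):
    """Fuse a shallow (high-res) and a deep (low-res) feature into one mask feature."""
    def __init__(self, shallow_ch, deep_ch, out_ch=256):
        super().__init__()
        self.conv_shallow = nn.Sequential(nn.Conv2d(shallow_ch, out_ch, 3, padding=1),
                                          nn.ReLU(inplace=True))
        self.conv_deep = nn.Sequential(nn.Conv2d(deep_ch, out_ch, 3, padding=1),
                                       nn.ReLU(inplace=True))

    def forward(self, shallow, deep):
        deep = self.conv_deep(deep)
        # bilinear upsampling interpolation brings the deep feature to the shallow resolution
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv_shallow(shallow) + deep  # unified high-resolution mask feature
```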
S206: and carrying out example segmentation processing on the high-resolution mask features through a prediction network in the SOLOV2 model to obtain a second training example segmentation graph.
In the specific process of step S206, after obtaining the high resolution mask map of the example segmentation data, the class branch in the prediction network in the SOLOV2 model may be used to perform class prediction on the high resolution mask map to obtain at least one class feature map; for each class feature map, inputting the class feature map into four continuous 3x3 convolution layers in a mask kernel branch for further feature extraction, and finally generating a final dynamic convolution kernel through convolution with the size of 3x3 xD; the class characteristic diagram sequentially passes through 3 convolution layers of 3x3 in the mask characteristic branch, the group norm and the ReLU activation to obtain the final mask characteristic; and finally, performing point multiplication on the dynamic convolution kernel and the mask features to obtain a second training example segmentation graph finally comprising the foreground and the background. The mask branch comprises a mask kernel branch and a mask characteristic branch.
S207: and inputting the high-resolution mask feature and the second training example segmentation graph into an ARM module in the SOLOV2 model, and performing enhancement processing on the second training example segmentation graph by using the high-resolution mask feature through the ARM module in the SOLOV2 model to obtain a third training example segmentation graph.
In the specific process of step S207, after obtaining the second training example segmentation graph, the high resolution mask feature and the second training example segmentation graph may be input to an ARM module in the SOLOV2 model; and predicting the high-resolution mask features by adopting a preset algorithm through the ARM module to obtain example edge features, and performing boundary information enhancement processing on the initial example segmentation graph by utilizing the example edge features to obtain a third training example segmentation graph.
Specifically, the obtained high-resolution mask features are input into the ARM module in the SOLOV2 model, and the ARM module predicts instance edge features from them with the preset algorithm. First, inside the ARM module, the instance edge features and the second training instance segmentation map are fused by a 1x1 convolution, which merges the two inputs and reduces the channel count; the result is then processed by 3 parallel 3x3 convolutions with different dilation rates to generate 3 feature spaces under different receptive fields, denoted E, F and G, with {E, F, G} ∈ R^(C×H×W), each reshaped to R^(C×N), where N = H×W.
Next, F is transposed and matrix-multiplied with E, and a Softmax is applied to the result to obtain an attention map S ∈ R^(N×N). The feature space G is then multiplied with the attention map S and reshaped back to the original shape, which redistributes the boundary information of the instance edge features over the second training instance segmentation map, yielding the feature V ∈ R^(C×H×W). The Softmax that produces the attention map S is given in formula (1); the multiplication of G with S that yields V is given in formula (2).
Finally, the feature V is added pixel by pixel to the second training instance segmentation map, which strengthens the expression of boundary information in it and yields the third training instance segmentation result. It should be noted that where a weight in the attention map S is large, the mask boundary features are similar and appear as a highlighted foreground region; where a weight is small, the boundary features differ and represent background information.
S_ij = exp(E_i · F_j) / Σ_(i=1..N) exp(E_i · F_j)    (1)
where S_ij denotes the influence of the i-th position on the j-th position: the more similar the features of two spatial positions (in the feature spaces E and F), the higher their correlation S_ij.
V_j = Σ_(i=1..N) S_ij · G_i    (2)
where i indexes the spatial positions that contribute to the response at the current position: multiplying the attention map S_ij with the feature space G_i redistributes the relevant boundary information over the second training instance segmentation map, and the result is finally added to that map to obtain the final output. Because this output incorporates the correlations of the whole map, the expressive power of the boundary information in the original feature map is enhanced.
It should be noted that the preset algorithm may be the Canny edge detection algorithm, which concentrates as much as possible on the boundary information of the image so as to identify the true edge features.
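Putting step S207 together, the following is a condensed sketch of the ARM computation: Canny edges predicted from the mask features, the 1x1 fusion convolution, three dilated 3x3 convolutions forming E, F and G, the attention map of formula (1), and the reweighting and pixel-wise addition of formula (2). The dilation rates (1, 2, 4), the channel counts, and feeding the channel-mean of the mask features to Canny are illustrative assumptions; a batch size of 1 is assumed for brevity:

```python
import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def canny_edges(mask_feat):
    """Predict coarse instance edge features from the mask features via Canny (batch of 1)."""
    img = mask_feat.mean(dim=1)[0]  # channel mean as a proxy image for edge detection
    img = (255 * (img - img.min()) / (img.max() - img.min() + 1e-6)).byte().cpu().numpy()
    edges = cv2.Canny(img, 100, 200).astype(np.float32) / 255.0
    return torch.from_numpy(edges)[None, None].to(mask_feat.device)

class ARM(nn.Module):
    def __init__(self, seg_ch, ch=64):
        super().__init__()
        self.fuse = nn.Conv2d(seg_ch + 1, ch, 1)  # 1x1 conv: fuse edges + segmentation, cut channels
        self.e = nn.Conv2d(ch, ch, 3, padding=1, dilation=1)  # three parallel dilated 3x3 convs
        self.f = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
        self.g = nn.Conv2d(ch, ch, 3, padding=4, dilation=4)
        self.out = nn.Conv2d(ch, seg_ch, 1)

    def forward(self, mask_feat, seg):
        edges = canny_edges(mask_feat)
        edges = F.interpolate(edges, size=seg.shape[-2:], mode="bilinear", align_corners=False)
        x = self.fuse(torch.cat([seg, edges], dim=1))
        B, C, H, W = x.shape
        E = self.e(x).flatten(2)   # (B, C, N) with N = H*W
        Fm = self.f(x).flatten(2)
        G = self.g(x).flatten(2)
        S = torch.softmax(Fm.transpose(1, 2) @ E, dim=1)  # formula (1): (B, N, N) attention map
        V = (G @ S).view(B, C, H, W)                      # formula (2): redistribute boundary info
        return seg + self.out(V)                          # pixel-by-pixel addition
```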
S208: and constructing a second loss function by using the third training example segmentation graph and the corresponding target example segmentation graph, and adjusting parameters of a prediction network and an ARM module in the SOLOV2 model by using the second loss function until the SOLOV2 model converges to obtain a target SOLOV2 model.
In the specific process of executing step S208, after the third training example segmentation map is obtained, a second loss function may be further constructed by using the obtained third training example segmentation map and the corresponding target example segmentation map, and parameters of the prediction network and the ARM module in the SOLOV2 model are adjusted by using the second loss function until the SOLOV2 model converges, so as to obtain the target SOLOV2 model.
In the embodiment of the application, first, the Canny edge detection algorithm is additionally applied through the ARM module to the unified high-resolution mask features output by the FPN, predicting coarse instance edge features that provide relatively accurate position information. Second, the obtained instance edge features are fused with the first instance segmentation result and fed into the attention network inside the ARM module, which adaptively captures context information to produce a feature map enhanced with boundary information; the final (third training) instance segmentation map is then obtained from this enhanced feature map. After the SOLOV2 model to be trained has been trained with the instance segmentation data set to obtain the SOLOV2 model, a gradient separation strategy is adopted to further improve the accuracy of instance edge segmentation and shorten training time: the ResNext101 network and the FPN network in the SOLOV2 model are frozen during fine-tuning, i.e., only the parameters of the prediction network and the ARM module are adjusted. This effectively reduces problems such as instance truncation caused by inaccurate feature extraction.
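The gradient-separation strategy amounts to freezing the backbone and FPN parameters before the fine-tuning phase, e.g. (the attribute names model.backbone and model.fpn are illustrative):

```python
import torch

# Phase 2 (fine-tuning): freeze ResNext101 + FPN, adjust only the prediction network + ARM.
for module in (model.backbone, model.fpn):  # attribute names are illustrative
    for p in module.parameters():
        p.requires_grad = False
    module.eval()  # also fixes BatchNorm running statistics

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.001, momentum=0.9)  # driven by the second loss only
```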
In the embodiment of the present application, the process of segmenting an image to be segmented with the target SOLOV2 model, as shown in fig. 3, specifically includes the following steps:
S301: Extract features from the image to be segmented through the ResNext101 network to obtain a target shallow feature and a target deep feature.
In the specific implementation of step S301, the target shallow feature of the image to be segmented may be extracted through the convolutional layers near the input end of the ResNext101 network in the target SOLOV2 model (e.g., the first or second convolutional layer), and the target deep feature through the convolutional layers near its output end (e.g., the last or second-to-last convolutional layer).
S302: and fusing the shallow feature and the deep feature through an FPN (field programmable gate array) network to obtain the target high-resolution mask feature.
In the specific process of executing step S303, a series of 3x3 convolutions, ReLU activations, and bilinear upsampling interpolation processes are sequentially performed on the extracted target shallow layer features and target deep layer features through the FPN network in the target SOLOV2 model, so as to obtain target high resolution mask features.
S303: and carrying out example segmentation processing on the target high-resolution mask features through a prediction network to obtain an initial example segmentation graph, and inputting the high-resolution mask features and the initial example segmentation graph into an ARM module.
In the process of specifically executing step S303, after obtaining the target high-resolution mask map of the image to be segmented, performing class prediction on the target high-resolution mask map by using class branches in a prediction network in the target SOLOV2 model, so as to obtain at least one target class feature map; for each target class feature map, inputting the target class feature map into four continuous 3x3 convolution layers in a mask kernel branch for further feature extraction, and finally generating a target dynamic convolution kernel through convolution with the size of 3x3 xD; the class characteristic diagram sequentially passes through 3 convolution layers of 3x3 in the mask characteristic branch, the group norm and the ReLU activation to obtain a target mask characteristic; and finally, performing point multiplication on the target dynamic convolution kernel and the target mask characteristics to obtain an initial example segmentation graph comprising the foreground and the background. The mask branch comprises a mask kernel branch and a mask characteristic branch.
S304: and performing boundary information enhancement processing by using the high-resolution mask features and the initial example segmentation graph through an ARM module to obtain a final example segmentation graph.
In the specific process of executing step S304, after the initial instance segmentation graph is obtained, the target high resolution mask feature and the initial instance segmentation graph may be input to an ARM module in the target SOLOV2 model; and predicting the target high-resolution mask features by adopting a preset algorithm through the ARM module to obtain target example edge features, and performing boundary information enhancement processing on the initial example segmentation graph by utilizing the target example edge features to obtain an example segmentation graph.
The specific process of obtaining the example segmentation graph by using the high resolution mask feature through the ARM module is the same as the process of obtaining the third training example segmentation graph in step S207, and reference may be made to corresponding contents in step S207, which is not described herein again.
The invention provides an image instance segmentation method in which a SOLOV2 model to be trained is trained in advance with an instance segmentation data set to obtain a target SOLOV2 model comprising a ResNext101 network, an FPN network, a prediction network and an ARM module. After an image to be segmented is acquired, it is input into the target SOLOV2 model: the ResNext101 network extracts a target shallow feature and a target deep feature; the FPN network fuses them into a target high-resolution mask feature; the prediction network performs instance segmentation on the mask feature to obtain an initial instance segmentation map; finally the target high-resolution mask feature and the initial instance segmentation map are fed together into the ARM module, which enhances the boundary information to produce the final instance segmentation map. By integrating the ARM module into the target SOLOV2 model and using it to further strengthen the boundary information of the initial instance segmentation map, the boundaries of the resulting instance segmentation map become more accurate, solving the prior-art problem that inaccurate instance-boundary segmentation makes people jitter or flicker during video playback and severely harms the viewing experience.
Corresponding to the image instance segmentation method disclosed in the embodiment of the present invention, referring to fig. 4, an embodiment of the present invention further provides a schematic structural diagram of an image instance segmentation apparatus, which includes:
an image acquisition unit 41 for acquiring an image to be segmented;
a target SOLOV2 model 42 for segmenting the input image to be segmented to obtain an instance segmentation map, the target SOLOV2 model being obtained by a training unit that trains a SOLOV2 model to be trained with an instance segmentation data set in advance, and comprising a ResNext101 network, an FPN network, a prediction network and an ARM module;
the ResNext101 network for extracting features from the image to be segmented to obtain a target shallow feature and a target deep feature;
the FPN network for fusing the target shallow feature and the target deep feature to obtain a target high-resolution mask feature;
the prediction network for performing instance segmentation on the target high-resolution mask feature to obtain an initial instance segmentation map, and inputting the target high-resolution mask feature and the initial instance segmentation map into the ARM module;
and the ARM module for performing boundary information enhancement with the target high-resolution mask feature and the initial instance segmentation map to obtain the final instance segmentation map.
The invention provides an image instance segmentation apparatus in which a SOLOV2 model to be trained is trained in advance with an instance segmentation data set to obtain a target SOLOV2 model comprising a ResNext101 network, an FPN network, a prediction network and an ARM module. After an image to be segmented is acquired, it is input into the target SOLOV2 model: the ResNext101 network extracts a target shallow feature and a target deep feature; the FPN network fuses them into a target high-resolution mask feature; the prediction network performs instance segmentation on the mask feature to obtain an initial instance segmentation map; finally the target high-resolution mask feature and the initial instance segmentation map are fed together into the ARM module, which performs boundary enhancement to produce the final instance segmentation map. By integrating the ARM module into the target SOLOV2 model and using it to further enhance the initial instance segmentation map, the boundaries of the resulting instance segmentation map become more accurate, solving the prior-art problem that inaccurate instance-boundary segmentation makes people jitter or flicker during video playback and severely harms the viewing experience.
Optionally, the SOLOV2 model to be trained includes a ResNext101 network to be trained, an FPN network to be trained, and a prediction network to be trained, and the training unit includes:
an instance segmentation data acquisition unit for obtaining an instance segmentation data set, wherein the instance segmentation data set comprises a plurality of instance segmentation data;
a first training subunit for, for each instance segmentation data, inputting the instance segmentation data into the SOLOV2 model to be trained so that it performs instance segmentation on the data to obtain a first training instance segmentation map, constructing a first loss function from the first training instance segmentation map and the corresponding target instance segmentation map, and adjusting the parameters of the ResNext101 network to be trained, the FPN network to be trained and the prediction network to be trained with the first loss function until the SOLOV2 model to be trained converges, yielding an initial SOLOV2 model;
a SOLOV2 model construction unit for constructing a SOLOV2 model from the initial SOLOV2 model and an ARM module;
an input unit for inputting, for each instance segmentation data, the instance segmentation data into the SOLOV2 model;
a feature extraction unit for extracting features from the instance segmentation data through the ResNext101 network in the SOLOV2 model to obtain shallow features and deep features;
a fusion processing unit for fusing the shallow features and the deep features through the FPN network to obtain a high-resolution mask feature;
a first instance segmentation unit for performing instance segmentation on the high-resolution mask feature through the prediction network in the SOLOV2 model to obtain a second training instance segmentation map;
an image enhancement processing unit for inputting the high-resolution mask feature and the second training instance segmentation map into the ARM module in the SOLOV2 model, and enhancing the second training instance segmentation map with the high-resolution mask feature through that ARM module to obtain a third training instance segmentation map;
and a second training subunit for constructing a second loss function from the third training instance segmentation map and the corresponding target instance segmentation map, and adjusting the parameters of the prediction network and the ARM module in the SOLOV2 model with the second loss function until the SOLOV2 model converges, yielding the target SOLOV2 model.
Optionally, the prediction network includes a category branch and a mask branch; the prediction network, which performs category prediction on the target high-resolution mask feature and segments each resulting category feature map to obtain the initial instance segmentation map, is specifically configured to:
perform category prediction on the target high-resolution mask feature through the category branch to obtain at least one target category feature map; and segment each target category feature map through the mask branch to obtain the initial instance segmentation map.
Optionally, the ARM module, which enhances the initial instance segmentation map with the high-resolution mask feature to obtain the final instance segmentation map, is specifically configured to:
predict target instance edge features from the target high-resolution mask feature with a preset algorithm, and enhance the boundary information of the initial instance segmentation map with the target instance edge features to obtain the final instance segmentation map.
An embodiment of the present application further provides an electronic device, including a processor and a memory connected through a communication bus; the processor calls and executes the program stored in the memory, and the memory stores the program for implementing the image instance segmentation method.
Referring now to FIG. 5, a block diagram of an electronic device suitable for use in implementing the disclosed embodiments of the invention is shown. The electronic devices in the disclosed embodiments of the present invention may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the disclosed embodiments of the present invention.
As shown in fig. 5, the electronic device may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 507 including, for example, a liquid crystal display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, and the like; and a communication device 509. The communication device 509 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While FIG. 5 illustrates an electronic device having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the image instance segmentation method illustrated in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 509, or installed from the storage device 508, or installed from the ROM 502. When executed by the processing device 501, the computer program performs the above-described functions defined in the image instance segmentation method of the disclosed embodiments.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used for executing the image instance segmentation method described above.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire an image to be segmented; input the image to be segmented into a target SOLOV2 model, and perform image segmentation on the image to be segmented through the target SOLOV2 model to obtain a final instance segmentation map, wherein the target SOLOV2 model is obtained by training a SOLOV2 model to be trained by using an instance segmentation data set, and the target SOLOV2 model comprises a ResNext101 network, an FPN network, a prediction network and an ARM module. The process of performing image segmentation on the image to be segmented through the target SOLOV2 model comprises: performing feature extraction on the image to be segmented through the ResNext101 network to obtain a target shallow feature and a target deep feature; fusing the target shallow feature and the target deep feature through the FPN network to obtain a target high-resolution mask feature; performing instance segmentation processing on the target high-resolution mask feature through the prediction network to obtain an initial instance segmentation map, and inputting the target high-resolution mask feature and the initial instance segmentation map into the ARM module; and performing boundary information enhancement processing through the ARM module by using the target high-resolution mask feature and the initial instance segmentation map to obtain the final instance segmentation map.
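Putting the four components together, the inference path enumerated above might be composed as in the sketch below, which reuses the hypothetical PredictionHead and ARMModule from the earlier sketches. torchvision's resnext101_32x8d stands in for the ResNext101 network and a single lateral fusion stands in for the full FPN; none of these choices are taken from the embodiment itself.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnext101_32x8d

class SOLOV2WithARM(nn.Module):
    # Hypothetical end-to-end composition: backbone -> FPN-style fusion ->
    # prediction network -> ARM boundary enhancement.
    def __init__(self, prediction_net, arm, feat_channels=256):
        super().__init__()
        backbone = resnext101_32x8d(weights=None)  # pretrained weights omitted for brevity
        # Shallow features from an early stage, deep features from the last stage.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.deep = nn.Sequential(backbone.layer2, backbone.layer3, backbone.layer4)
        # One lateral connection stands in for the full FPN fusion.
        self.lat_shallow = nn.Conv2d(256, feat_channels, 1)
        self.lat_deep = nn.Conv2d(2048, feat_channels, 1)
        self.prediction_net = prediction_net
        self.arm = arm

    def forward(self, images):
        shallow = self.stem(images)  # target shallow feature
        deep = self.deep(shallow)    # target deep feature
        up = F.interpolate(self.lat_deep(deep), size=shallow.shape[-2:],
                           mode='bilinear', align_corners=False)
        mask_feat = self.lat_shallow(shallow) + up  # target high-resolution mask feature
        cls_maps, initial_seg = self.prediction_net(mask_feat)  # initial instance segmentation maps
        final_seg = self.arm(mask_feat, initial_seg)            # boundary-enhanced final maps
        return cls_maps, final_seg

if __name__ == '__main__':
    # Smoke test with the hypothetical modules sketched earlier.
    model = SOLOV2WithARM(PredictionHead(), ARMModule()).eval()
    with torch.no_grad():
        cls_maps, masks = model(torch.randn(1, 3, 512, 512))
    print(cls_maps.shape, masks.shape)  # expected: (1, 80, 128, 128) and (1, 1600, 128, 128)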
It should be noted that the computer readable medium mentioned above in the present disclosure may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium other than a computer readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wires, optical cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The embodiments in this specification are described in a progressive manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiments are substantially similar to the method embodiments and are therefore described relatively simply; for related points, reference may be made to the description of the method embodiments. The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Those of skill would further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also fall within the protection scope of the present invention.

Claims (10)

1. A method for image instance segmentation, the method comprising:
acquiring an image to be segmented;
inputting the image to be segmented into a target SOLOV2 model, and performing image segmentation on the image to be segmented through the target SOLOV2 model to obtain a final instance segmentation map; the target SOLOV2 model is obtained by training a SOLOV2 model to be trained by using an instance segmentation data set; the target SOLOV2 model comprises a ResNext101 network, an FPN network, a prediction network and an ARM module, and the process of performing image segmentation on the image to be segmented through the target SOLOV2 model comprises the following steps:
performing feature extraction on the image to be segmented through the ResNext101 network to obtain a target shallow feature and a target deep feature;
fusing the target shallow feature and the target deep feature through the FPN network to obtain a target high-resolution mask feature;
performing instance segmentation processing on the target high-resolution mask feature through the prediction network to obtain an initial instance segmentation map, and inputting the target high-resolution mask feature and the initial instance segmentation map into the ARM module;
and performing boundary information enhancement processing through the ARM module by using the target high-resolution mask feature and the initial instance segmentation map to obtain the final instance segmentation map.
2. The method of claim 1, wherein the SOLOV2 model to be trained comprises a ResNext101 network to be trained, a FPN network to be trained, and a prediction network to be trained, and wherein training the SOLOV2 model to be trained using the instance segmentation dataset to obtain a target SOLOV2 model comprises:
obtaining an instance segmentation data set, wherein the instance segmentation data set comprises a plurality of instance segmentation data;
for each instance segmentation data, inputting the instance segmentation data into a SOLOV2 model to be trained, so that the SOLOV2 model to be trained performs instance segmentation on the instance segmentation data to obtain a first training instance segmentation map, constructing a first loss function by using the first training instance segmentation map and a corresponding target instance segmentation map, and adjusting parameters of the ResNext101 network to be trained, the FPN network to be trained and the prediction network to be trained by using the first loss function until the SOLOV2 model to be trained converges to obtain an initial SOLOV2 model;
constructing a SOLOV2 model by using the initial SOLOV2 model and an ARM module;
for each of said instance segmentation data, inputting said instance segmentation data into said SOLOV2 model;
performing feature extraction on the instance segmentation data through the ResNext101 network in the SOLOV2 model to obtain a shallow feature and a deep feature;
fusing the shallow feature and the deep feature through the FPN network to obtain a high-resolution mask feature;
performing instance segmentation processing on the high-resolution mask feature through a prediction network in the SOLOV2 model to obtain a second training instance segmentation map, and inputting the high-resolution mask feature and the second training instance segmentation map into the ARM module in the SOLOV2 model;
enhancing the second training instance segmentation map by using the high-resolution mask feature through the ARM module in the SOLOV2 model to obtain a third training instance segmentation map;
and constructing a second loss function by using the third training instance segmentation map and the corresponding target instance segmentation map, and adjusting parameters of the prediction network and the ARM module in the SOLOV2 model by using the second loss function until the SOLOV2 model converges to obtain the target SOLOV2 model.
3. The method of claim 2, wherein the prediction network comprises a category branch and a mask branch, and wherein performing, by the prediction network, instance segmentation processing on the target high-resolution mask feature to obtain the initial instance segmentation map comprises:
performing category prediction on the target high-resolution mask feature through the category branch to obtain at least one target category feature map;
and performing segmentation processing on each target category feature map through the mask branch to obtain the initial instance segmentation map.
4. The method of claim 1, wherein performing, by the ARM module, boundary information enhancement processing using the target high-resolution mask feature and the initial instance segmentation map to obtain the final instance segmentation map comprises:
predicting target instance edge features from the target high-resolution mask feature by using a preset algorithm through the ARM module, and performing boundary information enhancement processing on the initial instance segmentation map by using the target instance edge features to obtain the final instance segmentation map.
5. An image instance segmentation apparatus, characterized in that the apparatus comprises:
an image-to-be-segmented acquiring unit, used for acquiring an image to be segmented;
a target SOLOV2 model, used for performing image segmentation on the input image to be segmented to obtain a final instance segmentation map; the target SOLOV2 model is obtained by a training unit through training a SOLOV2 model to be trained by using an instance segmentation data set; the target SOLOV2 model comprises a ResNext101 network, an FPN network, a prediction network and an ARM module;
the ResNext101 network is used for extracting the features of the image to be segmented to obtain a target shallow feature and a target deep feature;
the FPN network is used for fusing the target shallow feature and the target deep feature to obtain a target high-resolution mask feature;
the prediction network is used for performing instance segmentation processing on the target high-resolution mask feature to obtain an initial instance segmentation map, and inputting the target high-resolution mask feature and the initial instance segmentation map into the ARM module;
and the ARM module is used for performing boundary information enhancement processing by using the target high-resolution mask feature and the initial instance segmentation map to obtain the final instance segmentation map.
6. The apparatus of claim 5, wherein the SOLOV2 model to be trained comprises a ResNext101 network to be trained, an FPN network to be trained, and a prediction network to be trained, and wherein the training unit comprises:
an instance segmentation data acquisition unit, used for acquiring an instance segmentation data set, wherein the instance segmentation data set comprises a plurality of instance segmentation data;
a first training subunit, used for, for each instance segmentation data, inputting the instance segmentation data into a SOLOV2 model to be trained, so that the SOLOV2 model to be trained performs instance segmentation on the instance segmentation data to obtain a first training instance segmentation map, constructing a first loss function by using the first training instance segmentation map and a corresponding target instance segmentation map, and adjusting parameters of the ResNext101 network to be trained, the FPN network to be trained and the prediction network to be trained by using the first loss function until the SOLOV2 model to be trained converges to obtain an initial SOLOV2 model;
a SOLOV2 model construction unit, used for constructing a SOLOV2 model by using the initial SOLOV2 model and an ARM module;
an input unit, used for inputting the instance segmentation data into the SOLOV2 model for each instance segmentation data;
a feature extraction unit, used for performing feature extraction on the instance segmentation data through the ResNext101 network in the SOLOV2 model to obtain a shallow feature and a deep feature;
a fusion processing unit, used for fusing the shallow feature and the deep feature through the FPN network to obtain a high-resolution mask feature;
a first instance segmentation unit, used for performing instance segmentation processing on the high-resolution mask feature through the prediction network in the SOLOV2 model to obtain a second training instance segmentation map;
an image enhancement processing unit, used for inputting the high-resolution mask feature and the second training instance segmentation map into the ARM module in the SOLOV2 model, and enhancing the second training instance segmentation map by using the high-resolution mask feature through the ARM module in the SOLOV2 model to obtain a third training instance segmentation map;
and a second training subunit, used for constructing a second loss function by using the third training instance segmentation map and the corresponding target instance segmentation map, and adjusting the parameters of the prediction network and the ARM module in the SOLOV2 model by using the second loss function until the SOLOV2 model converges to obtain the target SOLOV2 model.
7. The apparatus according to claim 6, wherein the prediction network comprises a category branch and a mask branch, and the prediction network, which is configured to perform category prediction on the target high-resolution mask feature and to perform segmentation processing on each obtained category feature map to obtain the initial instance segmentation map, is specifically configured to:
perform category prediction on the target high-resolution mask feature through the category branch to obtain at least one target category feature map; and perform segmentation processing on each target category feature map through the mask branch to obtain the initial instance segmentation map.
8. The apparatus of claim 5, wherein the ARM module, which is configured to perform boundary information enhancement processing by using the target high-resolution mask feature and the initial instance segmentation map to obtain the final instance segmentation map, is specifically configured to:
predict target instance edge features from the target high-resolution mask feature by using a preset algorithm, and perform boundary information enhancement processing on the initial instance segmentation map by using the target instance edge features to obtain the final instance segmentation map.
9. An electronic device, comprising a processor and a memory, the memory storing program code and data for image instance segmentation, the processor being configured to invoke program instructions in the memory to perform an image instance segmentation method as claimed in any one of claims 1 to 4.
10. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute the image instance segmentation method according to any one of claims 1 to 4.
CN202210321177.6A 2022-03-30 2022-03-30 Image instance segmentation method and device, electronic equipment and storage medium Active CN114419322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210321177.6A CN114419322B (en) 2022-03-30 2022-03-30 Image instance segmentation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114419322A (en) 2022-04-29
CN114419322B CN114419322B (en) 2022-09-20

Family

ID=81263561

Country Status (1)

Country Link
CN (1) CN114419322B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836985A (en) * 2020-06-24 2021-12-24 富士通株式会社 Image processing apparatus, image processing method, and computer-readable storage medium
CN112348828A (en) * 2020-10-27 2021-02-09 浙江大华技术股份有限公司 Example segmentation method and device based on neural network and storage medium
CN113744280A (en) * 2021-07-20 2021-12-03 北京旷视科技有限公司 Image processing method, apparatus, device and medium
CN113724269A (en) * 2021-08-12 2021-11-30 浙江大华技术股份有限公司 Example segmentation method, training method of example segmentation network and related equipment
CN114092487A (en) * 2021-10-13 2022-02-25 山东师范大学 Target fruit instance segmentation method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593530A (en) * 2024-01-19 2024-02-23 杭州灵西机器人智能科技有限公司 Dense carton segmentation method and system
CN117593530B (en) * 2024-01-19 2024-06-04 杭州灵西机器人智能科技有限公司 Dense carton segmentation method and system

Also Published As

Publication number Publication date
CN114419322B (en) 2022-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant