CN114359291A - Method for training instance segmentation model and instance segmentation method
- Publication number
- CN114359291A (application CN202111500687.1A)
- Authority
- CN
- China
- Prior art keywords
- target
- coding
- value
- segmentation
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
Embodiments of the present application provide a method for training an instance segmentation model and an instance segmentation method. The training method comprises: inputting an initial code and the feature map of a training image into a network model to be trained to obtain a one-dimensional prediction code; acquiring a one-dimensional truth code; and obtaining a loss value of the current prediction result from the one-dimensional prediction code and the one-dimensional truth code, terminating the training process of the network model at least upon confirming that the loss value meets a set requirement, thereby obtaining a target instance segmentation model. With the method and the device, the target box and the mask can be predicted in parallel during instance segmentation of an image, and a higher-resolution mask can be represented at as small a computational cost as possible.
Description
Technical Field
The application relates to the field of image instance segmentation, in particular to a method for training an instance segmentation model and an instance segmentation method.
Background
As a classic visual task, object instance segmentation (also referred to as image instance segmentation) of an image or video requires a neural network to simultaneously locate the position of each object in the image and predict the segmentation mask corresponding to each object.
Instance segmentation methods in the related art fall into two major categories. Two-stage methods (e.g., Mask R-CNN, HTC) use ROI Align, a target-region feature extraction method built on a two-stage object detector, to extract target features according to the target box and perform foreground/background binary segmentation. Single-stage methods (e.g., BlendMask, CondInst) use dynamic convolution on top of a single-stage object detector to generate a dynamic weight for each target (i.e., each instance) and thereby produce the target mask.
The two-stage approach is limited by ROI Align, which interpolates target features of different sizes to a fixed size (e.g., 14x14), and by the practice of scaling the mask of each target to a fixed size (e.g., 28x28) for supervision; as a result, small targets cannot be supervised effectively and edge information of large targets is lost, which limits performance. The single-stage approach adopts a global mask scheme: a global feature map is generated from the whole image, and a segmentation mask is then produced from it by joint coding or dynamic convolution. Although the global mask is friendly to small targets, the information is redundant and the computational cost of the global feature map is relatively large.
In addition, most of these two-stage and single-stage methods rely on target-box or candidate-box (proposal) information, so the segmentation result depends on the detection result.
Therefore, how to improve the accuracy of instance segmentation results for images or videos has become an urgent technical problem to be solved.
Disclosure of Invention
It is an object of embodiments of the present application to provide a method of training an instance segmentation model and an instance segmentation method. The method can predict the target box and the mask in parallel during instance segmentation of an image. Some embodiments of the present application compress the high-resolution mask into a low-dimensional vector (that is, represent a higher-resolution mask at as small a computational cost as possible), which greatly reduces the amount of computation. Some embodiments perform supervised learning of a network model consisting of a backbone network and a language model using an implicit high-resolution truth value, so that the trained network model is well suited to image instance segmentation of targets (for example, small targets), and the edge information of each segmented target (or instance) is largely retained, thereby improving performance and accuracy.
In a first aspect, an embodiment of the present application provides a method for training an instance segmentation model, the method comprising: inputting an initial code and a training image into a network model to be trained to obtain a one-dimensional prediction code, wherein the one-dimensional prediction code comprises a target prediction code for each target on the training image; acquiring a one-dimensional truth code, wherein the one-dimensional truth code is obtained by discretely encoding the annotation data of each target on the training image, the one-dimensional truth code comprises a target truth code for each target, and the initial code is a one-dimensional vector with the same dimension as the one-dimensional truth code; and obtaining a loss value of the current prediction result from the one-dimensional prediction code and the one-dimensional truth code, and terminating the training process of the network model when the loss value is confirmed to meet a set requirement, to obtain a target instance segmentation model, wherein the target instance segmentation model comprises a target recurrent neural network and a target backbone network.
Some embodiments of the present application perform supervised learning of a language model of the related art with a high-resolution truth value (namely, the one-dimensional truth code) to obtain an image instance segmentation model, which benefits the segmentation of small targets and largely retains edge information, thereby improving performance and accuracy.
In some embodiments, the target truth code comprises a target box truth code, a classification truth code and a segmentation mask truth code; the target prediction code comprises a target box prediction code, a classification prediction code and a segmentation mask prediction code.
In some embodiments, acquiring the one-dimensional truth code comprises: encoding the target box annotation data, the classification annotation data and the segmentation mask annotation data of each target, respectively, to obtain the one-dimensional truth code.
Some embodiments of the present application provide a discrete factorized instance representation of the annotation data (for example, encoding the target box, the category and the mask separately); with this representation a high-resolution mask can be compressed into a low-dimensional vector, which greatly reduces the amount of computation.
In some embodiments, the annotation data of each target includes the following three types of data: target box annotation data, classification annotation data and segmentation mask annotation data; and acquiring the one-dimensional truth code comprises: completing the encoding of the various types of annotation data at least according to coding offsets set respectively for the different types of annotation data, to obtain the one-dimensional truth code.
By introducing coding offsets, some embodiments of the present application ensure that codes of the same category (for example, the codes of the several coordinates of a target box belong to one category, and the codes of a mask belong to another) stay close to each other while codes of different categories (for example, the code of a target box and the code of a classification) stay far apart, so that the network can better distinguish the various types of codes and decoding is easier, which leads to better network convergence.
In some embodiments, the annotation data of any target includes target box annotation data comprising the upper-left corner coordinate value and the lower-right corner coordinate value of the labeled target box; the target box truth code is obtained by the following encoding method: rounding and quantizing the upper-left and lower-right corner coordinate values to obtain the target box truth code of the target.
Some embodiments of the present application quantize the coordinates of the target box annotation data directly, which speeds up the encoding of the target box.
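By way of illustration only, the following sketch shows how such a rounding quantization of the box corners might look; the normalization by image size and the number of quantization bins are assumptions and are not fixed by the present application.

```python
import numpy as np

def encode_target_box(box_xyxy, image_wh, num_bins=1000):
    """Illustrative sketch: round-quantize the upper-left and lower-right box
    coordinates into integer codes. num_bins and the normalization are assumed."""
    w, h = image_wh
    x1, y1, x2, y2 = box_xyxy
    coords = np.array([x1 / w, y1 / h, x2 / w, y2 / h])        # normalize to [0, 1]
    return np.round(coords * (num_bins - 1)).astype(np.int64)  # 4 integer box codes
```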
In some embodiments, the annotation data of any target includes classification annotation data; the classification truth code is obtained by the following encoding method: obtaining the classification truth code of the target from the classification annotation data of the target and a rounding quantization operation.
Some embodiments of the present application perform rounding quantization on the classification annotation data to complete the encoding of the classification, so that the one-dimensional truth code is obtained quickly.
In some embodiments, the classification truth code is obtained by an encoding method comprising: summing the classification annotation data and a first coding offset to obtain offset classification annotation data; and performing the rounding quantization operation on the offset classification annotation data to obtain the classification truth code.
Some embodiments of the present application introduce a coding offset in the process of encoding the classification annotation data, so that the resulting classification truth code is separated as far as possible from codes of other categories (for example, the target box truth code or the segmentation mask truth code), which gives the network model better convergence.
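A minimal sketch of this offset-then-round step follows; the concrete offset value is an assumption chosen only so that class codes do not overlap the box-coordinate codes.

```python
def encode_classification(class_id, first_offset=1000):
    """Illustrative sketch: shift the class label by a first coding offset and
    round-quantize it. first_offset = 1000 is an assumed value."""
    return int(round(class_id + first_offset))
```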
In some embodiments, the annotation data of any target includes segmentation mask annotation data; the segmentation mask truth code is obtained by the following encoding method: performing at least sparse coding and quantization on the segmentation mask annotation data to obtain N quantized values, and using the N quantized values as the segmentation mask truth code of the target, where N is an integer greater than or equal to 1.
Some embodiments of the present application sparsely encode the segmentation mask annotation data and quantize the coding coefficients; with this representation the high-resolution mask can be compressed into a low-dimensional vector, which greatly reduces the amount of computation.
In some embodiments, the segmentation mask truth code of any target is obtained from its segmentation mask annotation data by the following encoding method: projecting the segmentation mask annotation data of the target to the frequency domain with a discrete cosine transform matrix to obtain a plurality of frequency-domain coefficients corresponding to the segmentation mask annotation data; selecting, from the plurality of frequency-domain coefficients, the N coefficients corresponding to the low-frequency components as the encoding result; and obtaining the N quantized values from the encoding result and a quantization operation, and using the N quantized values as the segmentation mask truth code of the target, where N is an integer greater than or equal to 1.
Some embodiments of the present application sparsely encode the segmentation mask annotation data with a discrete cosine transform matrix and quantize the low-frequency components of the coding coefficients, which compresses the amount of data to be processed without affecting the target segmentation and thus further speeds up data processing.
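For illustration, a sketch of this "DCT projection, keep low-frequency coefficients, quantize" pipeline is given below; the number of retained coefficients, the scalar quantization step, the second coding offset value and the choice of the top-left block as the low-frequency region are all assumptions, not the claimed implementation.

```python
import numpy as np
from scipy.fft import dctn

def encode_mask(mask_128, n_coeffs=64, quant_step=8.0, second_offset=2000):
    """Illustrative sketch: 2-D DCT projection of a 128x128 binary mask, keep the
    N low-frequency coefficients, quantize and offset them, then round."""
    coeffs = dctn(mask_128.astype(np.float32), norm="ortho")   # frequency-domain projection
    k = int(np.ceil(np.sqrt(n_coeffs)))
    low_freq = coeffs[:k, :k].flatten()[:n_coeffs]             # low-frequency block (assumption)
    return np.round(low_freq / quant_step + second_offset).astype(np.int64)
```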
In some embodiments, the size of the discrete cosine transform matrix is the same as the size of the segmentation mask truth code, and the value of each element of the discrete cosine transform matrix is computed from a cosine function.
Some embodiments of the present application provide a transform matrix for sparse coding of the segmentation mask annotation data. The discrete cosine transform matrix projects the segmentation mask annotation data into the frequency domain so that the low-frequency components can be picked out, which both compresses the amount of data to be processed and completes the encoding of the segmentation mask annotation data.
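A sketch of how such a cosine-based transform matrix can be built is shown below; the orthonormal DCT-II form is one standard choice and is used here only as an assumption.

```python
import numpy as np

def dct_matrix(size):
    """Illustrative sketch: an orthonormal DCT-II matrix whose elements are
    computed from a cosine function. Projecting a mask M with D @ M @ D.T
    yields its frequency-domain coefficients."""
    d = np.zeros((size, size))
    for k in range(size):
        for n in range(size):
            d[k, n] = np.cos(np.pi * k * (2 * n + 1) / (2 * size))
    d[0, :] *= np.sqrt(1.0 / size)
    d[1:, :] *= np.sqrt(2.0 / size)
    return d

# Example: D = dct_matrix(128); coeffs = D @ mask @ D.T
```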
In some embodiments, the N quantized values are obtained by: obtaining an initial quantization result at least from the encoding result and a quantization matrix; and rounding the initial quantization result to obtain the N quantized values.
Some embodiments of the present application complete the quantization of the encoding result by introducing a quantization matrix, and thereby obtain the truth code of the segmentation mask annotation data.
In some embodiments, the plurality of frequency-domain coefficients further include high-frequency components, and the initial quantization result is obtained as follows: computing the initial quantization result from the product of the encoding result and the inverse of the quantization matrix, wherein the values of the elements of the quantization matrix corresponding to the low-frequency components are larger than the values of the elements corresponding to the high-frequency components.
By setting the element values at different positions of the quantization matrix, some embodiments of the present application make the quantization error of the more important low-frequency components (i.e., those that strongly affect the segmentation of the target) smaller, while allowing a slightly larger error on the high-frequency components that have little effect on the segmentation, which improves the trained model.
In some embodiments, the initial quantization result is obtained by: computing the product of the encoding result and the inverse of the quantization matrix to obtain a quantized encoding result; and computing the sum of the quantized encoding result and a second coding offset to obtain the initial quantization result.
Some embodiments of the present application introduce a coding offset in the process of encoding the segmentation mask annotation data, so that the resulting segmentation mask truth code is separated as far as possible from codes of other categories (for example, the classification truth code or the target box truth code), which gives the network model better convergence.
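For illustration, the quantization step with the matrix and the second coding offset might look as follows; representing the quantization matrix as a diagonal matrix and the offset value 2000 are assumptions.

```python
import numpy as np

def quantize_coefficients(coding_result, quant_matrix, second_offset=2000):
    """Illustrative sketch: multiply the encoding result (N low-frequency DCT
    coefficients) by the inverse of the quantization matrix, add the second
    coding offset, and round to obtain the N quantized values."""
    q_inv = np.linalg.inv(quant_matrix)
    quantized = coding_result @ q_inv          # product with the inverse quantization matrix
    initial = quantized + second_offset        # shift by the second coding offset
    return np.round(initial).astype(np.int64)  # N quantized values
```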
In some embodiments, inputting the initial code and the training image into the network model to be trained to obtain the one-dimensional prediction code comprises: inputting the initial code into a query module of the network model to obtain a query vector; inputting the training image into a backbone network of the network model to obtain image features; inputting the image features and the query vector into a sequence-to-sequence module of the network model to obtain a cross-attention processing result; inputting the cross-attention processing result into a self-attention processing module of the network model to obtain a self-attention processing result; and inputting the self-attention processing result into a fully-connected layer of the network model to obtain the one-dimensional prediction code.
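A minimal sketch of this forward flow is given below; the module sizes, the use of nn.MultiheadAttention, the omission of the residual/normalization details, and the assumption that the backbone outputs d_model channels are all illustrative and not the claimed implementation.

```python
import torch
import torch.nn as nn

class InstanceSeqModel(nn.Module):
    """Illustrative sketch of the query -> cross attention -> self attention -> FC flow."""

    def __init__(self, backbone, per_target_len, d_model=256, n_heads=8):
        super().__init__()
        self.backbone = backbone                               # image feature extractor
        self.word_embed = nn.Linear(per_target_len, d_model)   # query module: initial code -> query vector
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fc = nn.Linear(d_model, per_target_len)           # prediction layer

    def forward(self, image, initial_code):
        # initial_code: (B, n_targets, per_target_len), one slot per target
        feat = self.backbone(image)                  # (B, d_model, H, W) image features
        kv = feat.flatten(2).transpose(1, 2)         # (B, H*W, d_model) feature sequence
        q = self.word_embed(initial_code)            # query vectors
        q1, _ = self.cross_attn(q, kv, kv)           # sequence-to-sequence cross attention
        q2, _ = self.self_attn(q1, q1, q1)           # self attention among target embeddings
        return self.fc(q2)                           # one-dimensional prediction code per target
```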
In some embodiments, the loss value is computed from the one-dimensional prediction code and the one-dimensional truth code, whose total length satisfies:
L = n · (4 + N + 1)
where L is the total length of the one-dimensional truth code, n is the total number of targets on the training image, and N is the length of the segmentation mask truth code (so each target contributes 4 target box codes, N mask codes and 1 classification code); B_j denotes the target truth code of the j-th target and O_{1,j} denotes the target prediction code of the j-th target, and the loss value is obtained by comparing O_{1,j} with B_j for each target.
Some embodiments of the present application provide a method for obtaining a training loss value, so that a judgment on whether a training process can be ended is more accurate and reasonable.
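As a sketch only, the code length and a per-target comparison could be computed as follows; the exact loss expression is not spelled out in this section, so the L1 comparison of O_{1,j} with B_j used here is an assumption.

```python
import numpy as np

def total_code_length(n_targets, mask_code_len):
    """L = n * (4 + N + 1): four box codes, N mask codes and one class code per target."""
    return n_targets * (4 + mask_code_len + 1)

def training_loss(pred_code, truth_code, n_targets, mask_code_len):
    """Illustrative sketch: slice the per-target codes and compare prediction with truth."""
    per_target = 4 + mask_code_len + 1
    loss = 0.0
    for j in range(n_targets):
        b_j = truth_code[j * per_target:(j + 1) * per_target]   # target truth code B_j
        o_j = pred_code[j * per_target:(j + 1) * per_target]    # target prediction code O_{1,j}
        loss += np.abs(o_j - b_j).mean()                        # assumed L1 comparison
    return loss / n_targets
```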
In a second aspect, some embodiments of the present application provide an instance segmentation method, comprising: obtaining a feature map of an image to be segmented with a target backbone network trained according to any embodiment of the first aspect; acquiring an initial code corresponding to the image to be segmented; inputting the feature map and the initial code into a target recurrent neural network trained according to any embodiment of the first aspect, and obtaining a one-dimensional segmentation-result code through the target recurrent neural network; and decoding the one-dimensional segmentation-result code to obtain the instance segmentation result corresponding to the image to be segmented.
Some embodiments of the present application provide a method of performing instance segmentation with the trained instance segmentation model; because the instance segmentation result output by the model is a one-dimensional code, it must be decoded in order to produce an output that meets the instance segmentation requirements.
In some embodiments, the one-dimensional segmentation-result code includes, for each target, a target box code value, a classification code value and a segmentation mask code; and decoding the one-dimensional segmentation-result code to obtain the instance segmentation result corresponding to the image to be segmented comprises: performing inverse quantization on the target box code value to obtain target box coordinate values; obtaining a classification value from the classification code value and an inverse quantization operation; and obtaining a two-dimensional segmentation mask at least from the segmentation mask code and the inverse quantization operation.
Some embodiments of the present application perform inverse quantization on the target box code value and the classification code value, respectively, to obtain the instance segmentation data expected as output.
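The following sketch mirrors the encoding sketches above; the bin count and the first coding offset value are the same assumed parameters and are not fixed by the present application.

```python
import numpy as np

def decode_box_and_class(box_code, class_code, image_wh,
                         num_bins=1000, first_offset=1000):
    """Illustrative sketch: inverse quantization of the box code values and the
    classification code value (the offset is removed before inverse quantization)."""
    w, h = image_wh
    x1, y1, x2, y2 = box_code / (num_bins - 1) * np.array([w, h, w, h])  # box inverse quantization
    class_value = int(round(class_code - first_offset))                  # remove the first coding offset
    return (x1, y1, x2, y2), class_value
```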
In some embodiments, obtaining the classification value from the classification code value and the inverse quantization operation comprises: subtracting the first coding offset from the classification code value to obtain a non-offset classification code value; and performing the inverse quantization operation on the non-offset classification code value to obtain the classification value.
Since the different types of codes of the same target are spaced apart, the coding offset is first subtracted and the inverse quantization operation is then performed, so that a classification result meeting the output requirement is obtained.
In some embodiments, obtaining the two-dimensional segmentation mask at least from the segmentation mask code and the inverse quantization operation comprises: obtaining an inverse quantization result from the segmentation mask code and the inverse quantization operation; selecting a plurality of coefficients from the inverse quantization result to replace a plurality of element values in an initialization matrix to obtain an inverse quantization matrix, wherein the value of each element of the initialization matrix is randomly assigned and the size of the initialization matrix is determined by the size of the output mask in the instance segmentation result of the target; and inversely transforming the inverse quantization matrix with an inverse discrete cosine transform matrix to obtain the two-dimensional segmentation mask, wherein the inverse discrete cosine transform matrix is the inverse of the discrete cosine transform matrix.
Some embodiments of the present application perform inverse quantization and an inverse discrete cosine transform on the instance mask code, so that the output two-dimensional segmentation mask meets the requirements of instance segmentation on the output result.
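An illustrative decoding sketch follows, mirroring the assumed parameters of the mask-encoding sketch above; the zero-initialized coefficient matrix (used here instead of the randomly initialized matrix described above), the top-left placement of the coefficients and the binarization threshold are all assumptions.

```python
import numpy as np
from scipy.fft import idctn

def decode_mask(mask_code, out_size=128, n_coeffs=64,
                quant_step=8.0, second_offset=2000):
    """Illustrative sketch: remove the second coding offset, invert the
    quantization, place the recovered low-frequency coefficients into a
    coefficient matrix, and apply the inverse DCT to get the 2-D mask."""
    coeffs = (mask_code.astype(np.float32) - second_offset) * quant_step  # inverse quantization
    k = int(np.ceil(np.sqrt(n_coeffs)))
    coeff_matrix = np.zeros((out_size, out_size), dtype=np.float32)
    coeff_matrix[:k, :k] = coeffs[:k * k].reshape(k, k)                   # restore low-frequency block
    mask = idctn(coeff_matrix, norm="ortho")                              # inverse discrete cosine transform
    return (mask > 127.5).astype(np.uint8)                                # binarize (assuming 0/255 labels)
```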
In some embodiments, obtaining the inverse quantization result from the segmentation mask code and the inverse quantization operation comprises: subtracting the second coding offset from the segmentation mask code to obtain a non-offset instance segmentation code; and performing the inverse quantization operation on the non-offset instance segmentation code to obtain the inverse quantization result.
Before performing the inverse quantization on the segmentation mask code, some embodiments of the present application also subtract the coding offset that was added to increase the spacing between different types of codes, so that the obtained two-dimensional segmentation mask meets the output requirements of the instance segmentation result.
In a third aspect, some embodiments of the present application provide a computer program product comprising a computer program, wherein the computer program when executed by a processor may implement the method according to any of the embodiments comprised in the first or second aspect.
In a fourth aspect, some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, may implement a method as set forth in any of the embodiments included in the first or second aspect.
In a fifth aspect, some embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, may implement the method according to any of the embodiments included in the first or second aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a first schematic diagram of a network architecture for training a network model according to an embodiment of the present application;
Fig. 2 is a second schematic diagram of a network architecture for training a network model according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for training an instance segmentation model according to an embodiment of the present application;
Fig. 4 is a schematic diagram of the process of encoding the annotation data to obtain the target truth code of each target included in the one-dimensional truth code, according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the process of encoding the segmentation mask annotation data according to an embodiment of the present application;
Fig. 6 is an architecture diagram of an image instance segmentation model including the target instance segmentation model provided by an embodiment of the present application;
Fig. 7 is a flowchart of an instance segmentation method provided by an embodiment of the present application;
Fig. 8 is a schematic diagram of the process of decoding the one-dimensional codes of the targets according to an embodiment of the present application;
Fig. 9 is a schematic composition diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
As described in the background section, the instance segmentation frameworks of the related art mainly include: 1) Mask R-CNN, which builds on Faster R-CNN, uses ROI Align to extract target features and adds an FCN branch to perform instance segmentation; 2) CondInst, which uses dynamic convolution so that the classification branch generates per-instance weights that are convolved with a global feature map to obtain the segmentation mask of each target; 3) SOLO, which encodes the segmentation masks into S×S channels, each channel representing the global mask of one instance, thereby achieving instance segmentation. The problems with these instance segmentation frameworks are: 1) the ROI Align mechanism limits the resolution of target features, details are lost, and the resolution and quality of the masks are limited; 2) mask generation depends on the target box or candidate box, so there is a coupling relationship.
At least to solve such problems, some embodiments of the present application provide a method for training an instance segmentation model and an instance segmentation method. The network model adopted by these embodiments consists of a language model disclosed in the related art and a backbone network for extracting image features. Because the input and output of a language model differ from the input and output required by image instance segmentation, some embodiments of the present application provide a novel discrete factorized instance representation (for example, the target box and class involved in instance segmentation are quantized directly, the segmentation mask corresponding to the instance is sparsely encoded with a discrete cosine transform, and the coding coefficients are quantized). With this representation the high-resolution mask can be compressed into a low-dimensional vector, which greatly reduces the amount of computation. Supervised learning of the network model with the implicit high-resolution truth value (i.e., the one-dimensional truth code) benefits the segmentation of targets of various sizes (for example, small targets) and largely retains edge information, thereby improving the performance and accuracy of instance segmentation of images by the network model (namely, a network consisting of an autoregressive language model and a backbone model). It should be noted that the target instance segmentation model of the present application can predict the box and mask of each target in parallel, and represents a higher-resolution mask at as small a computational cost as possible.
That is to say, some embodiments of the present application make full use of the Transformer and Sequence-to-Sequence mechanism of the related language model, so that the network can learn in an autoregressive manner and the target box and the segmentation mask can be predicted in parallel.
An exemplary process of training a network model to obtain a target instance segmentation model with an instance segmentation function for images (the model includes a target recurrent neural network and a target backbone network) is described below with reference to Fig. 1 to Fig. 5. In some embodiments of the present application, the target recurrent neural network has the same or a similar architecture as a language model, for example an autoregressive language model (Auto-Regressive Model).
Referring to Fig. 1, Fig. 1 shows a network architecture for training an image instance segmentation model, which includes: a network model 10, a mask discrete encoding module 200, and a loss value acquisition module 300.
The network model 10 consists of a language model disclosed in the related art (e.g., an autoregressive language model) and a backbone network, and uses the Transformer and Sequence-to-Sequence mechanism so that the network can learn in an autoregressive form. For example, as shown in Fig. 2 (which differs from Fig. 1 mainly in providing a reference architecture of the network model 10), the network model 10 includes a backbone network 400 and a language model 100, and the language model 100 (corresponding to a recurrent neural network) further includes a Word Embed module 110 (also called a query module), a cross-attention processing module 120, a self-attention processing module 130 and a fully-connected layer 140, as shown in Fig. 2.
The Word Embed module 110 is configured to transform the initial code into a parametric input form adapted to the model, providing an adaptively initialized word embedding vector (Embed).
The cross-attention processing module 120 is configured to let the initialized embedding vector Embed interact with the feature map, thereby generating the features of each target.
The self-attention processing module 130 is configured to let the embedding vectors Embed of all targets interact with one another, so as to enhance the target Embed belonging to the foreground and weaken the target Embed belonging to the background.
The fully-connected layer 140 is configured as a prediction layer that maps the target Embed to the output format, yielding the one-dimensional prediction codes of all targets.
The mask discrete encoding module 200 is configured to discretely encode the annotation data of each target on the training image to obtain the one-dimensional truth code of all targets. It should be noted that this module is only needed during the training of the network model.
The loss value acquisition module 300 is configured to obtain a loss value from the one-dimensional prediction code of all targets obtained in each training step and the one-dimensional truth code of all targets, so as to adjust the parameter values of the network model 10 according to the loss value until the one-dimensional prediction code output by the network model 10 meets the requirement, thereby obtaining the target instance segmentation model.
As can be seen from Fig. 1, a feature map and an initial code need to be input into the network model 10 during its training, and Fig. 2 provides the two functional modules that produce them. For example, a training image is input into the backbone network 400 of Fig. 2 to obtain a feature map (the backbone network 400 comprises, for example, ResNet or MobileNet), and the feature map obtained from the training image by the backbone network is input into the cross-attention processing module 120 of the network model 10 to train the network model. The initial mask generation module 500 shown in Fig. 2 randomly generates an initial code (the initial code is also a one-dimensional vector with the same dimension as the one-dimensional truth code) and inputs it into the Word Embed module 110 to train the network model.
The training process for the network model 10 is illustratively set forth below in conjunction with FIG. 3.
As shown in Fig. 3, an embodiment of the present application provides a method for training an instance segmentation model, the method comprising: S100, inputting the initial code and the training image into the network model to be trained to obtain a one-dimensional prediction code; S200, acquiring a one-dimensional truth code; and S300, obtaining a loss value of the prediction result from the one-dimensional prediction code and the one-dimensional truth code, and terminating the training process of the network model when the loss value is confirmed to meet the set requirement, to obtain the target instance segmentation model.
Some embodiments of the present application perform supervised learning of a language model of the related art with a high-resolution truth value (namely, the one-dimensional truth code) to obtain a network model with image instance segmentation capability, which benefits the segmentation of small targets and largely retains edge information, thereby improving performance and accuracy.
It should be noted that the network model to be trained referred to in S100 is the network model 10 of Fig. 1 or Fig. 2; the one-dimensional truth code referred to in S200 is obtained by the mask discrete encoding module of Fig. 1 or Fig. 2 through the encoding process described below; and the loss value referred to in S300 is obtained by the loss value acquisition module 300 of Fig. 1 or Fig. 2 through the operations described below.
The implementation of the steps of fig. 3 is exemplarily set forth below.
First, the implementation of S100 is exemplarily set forth.
The one-dimensional prediction code in S100 includes a target prediction code for each target, and the target prediction code includes a target box prediction code, a classification prediction code and a segmentation mask prediction code. The initial code is a one-dimensional vector with the same dimension as the one-dimensional prediction code and the one-dimensional truth code described in S200; for example, the initial code can be understood as including, for each target, an initial target box code, an initial classification code and an initial segmentation mask code, the corresponding values of which are obtained by random assignment. That is, the training of the network model 10 gives the trained network model the ability to perform instance segmentation on an input image and output the one-dimensional code corresponding to the segmentation result, where both the input initial code and the obtained one-dimensional prediction code have the same dimension as the one-dimensional truth code.
If the network model to be trained adopts the architecture shown in Fig. 1 and Fig. 2, in some embodiments of the present application S100 exemplarily comprises: inputting the initial code into the query module (namely the Word Embed module) of the network model to obtain a query vector; inputting the training image into the backbone network of the network model to obtain image features; inputting the image features and the query vector into the sequence-to-sequence module of the network model to obtain a cross-attention processing result; inputting the cross-attention processing result into the self-attention processing module of the network model to obtain a self-attention processing result; and inputting the self-attention processing result into the fully-connected layer of the network model to obtain the one-dimensional prediction code.
As an example, the initial code used to train the network model may be randomly initialized according to the dimension of the one-dimensional truth code derived from training image A. For example, the initial code designed in S100 is a randomly initialized one-dimensional vector I whose dimension is n·(4+N+1), the same as the dimension of the one-dimensional truth code given in the example of S200 below.
The network model to be trained referred to in S100 is, for example, the network model 10 described with reference to Fig. 1 or Fig. 2. As described above, the network model may include the Word Embed module 110, the cross-attention processing module 120, the self-attention processing module 130 and the fully-connected layer 140. During training, the feature map is input into the cross-attention processing module 120 and the initial code is input into the Word Embed module 110, and the network model being trained outputs a one-dimensional prediction code for all targets from these inputs. It will be appreciated that whether the training process of the network model can end is determined from the one-dimensional prediction code and the one-dimensional truth code.
As an example, a training process for the network model 10 of FIG. 2 is illustratively set forth in connection with FIG. 2.
First, the initial code designed in the example of S100 is input into the Word Embed module shown in Fig. 2 to obtain a query vector q: q = WordEmbed(I).
Note that in the language model 100 the total number of words (Chinese or English) is fixed, so the embedding vector Embed can be set in advance; in a visual task, however, the image is open-ended and the number of targets varies, so only a learnable Embed can be used. That is, an embedding vector Embed is initialized and interacts with the feature map obtained from the image in the sequence-to-sequence module (Seq2Seq), so as to obtain the embedding vector Embed corresponding to each target.
Second, the feature extracted from image A by the backbone network in Fig. 2 is denoted X. The cross-attention processing module 120 first computes the cross-attention mechanism between the query vector q and the feature X to obtain an intermediate result tgt, and then updates q:
q1 = LN(FC(ReLU(FC(tgt))) + tgt)
where LN denotes layer normalization, FC denotes a fully-connected layer, ReLU denotes the activation function, and d denotes the vector dimension used in the attention computation.
Third, the self-attention processing module 130 computes the self-attention mechanism over q1 to obtain an intermediate result tgt, and updates the query in the same form:
q2 = LN(FC(ReLU(FC(tgt))) + tgt)
Fourth, the final output is obtained through an FC layer, giving the one-dimensional prediction code O of all targets on image A: O = FC(q2).
It should be noted that the first to fourth steps are only an example of the training process, and the order of the steps may be adjusted in some embodiments. For example, in some embodiments of the present application the operations of the first and second steps may be performed simultaneously, or the second step may be performed before the first step. The one-dimensional prediction code of all targets on image A described above naturally includes the target prediction code of each target.
Next, an implementation process of S200 is exemplarily set forth.
As described above, in some embodiments of the present application the language model 100 included in the network model 10 is an autoregressive model (Auto-Regressive Model).
The one-dimensional truth code referred to in S200 is obtained by discretely encoding the annotation data of the multiple targets on the training image (e.g., the training image of Fig. 2). The one-dimensional truth code includes a target truth code for each target, and the target truth code of each target includes: a target box truth code, a classification truth code and a segmentation mask truth code.
For example, if the training image contains 5 targets (i.e., 5 instances to be segmented), the one-dimensional truth code referred to in S200 includes 5 target truth codes, which together form a one-dimensional vector; the 5 target truth codes are obtained by encoding each of the five targets separately. The target truth codes of the 5 targets each include a target box truth code, a classification truth code and a segmentation mask truth code, and these truth codes also form a one-dimensional vector as part of the one-dimensional truth code vector.
It should be noted that the order of the target box truth code, the classification truth code and the segmentation mask truth code within each target truth code is adjustable. That is, in some embodiments the target truth code of each target comprises, in order: the target box truth code, the classification truth code and the segmentation mask truth code. In other embodiments of the present application, the target truth code of each target comprises, in order: the classification truth code, the target box truth code and the segmentation mask truth code. In still other embodiments, the target truth code of each target comprises, in order: the segmentation mask truth code, the target box truth code and the classification truth code. In other words, the embodiments of the present application do not limit the order of the individual truth codes (namely the target box truth code, the classification truth code and the segmentation mask truth code) within each target truth code, it being understood that the same order must be used for all target truth codes within the same embodiment.
The annotation data of the multiple targets referred to in S200 may be obtained by manual annotation or by automatic annotation with a related network model. It can be understood that, in order to obtain the target box truth code, the classification truth code and the segmentation mask truth code of each target, the target box annotation data, the classification annotation data and the segmentation mask annotation data of each target on the training image need to be obtained in advance. That is, in some embodiments of the present application, S200 exemplarily comprises: encoding the target box annotation data, the classification annotation data and the segmentation mask annotation data of each target, respectively, to obtain the one-dimensional truth code.
Some embodiments of the present application provide a discrete factorized instance representation of the annotation data for image instance segmentation; for example, the target box, the category and the mask are encoded separately, and with this representation a high-resolution mask can be compressed into a low-dimensional vector, which greatly reduces the amount of computation.
In order to let the network model better distinguish the various types of codes (i.e., the target box codes for the target boxes, the classification codes for the classification data, and the mask codes for the segmentation masks), in some embodiments of the present application the annotation data of each target referred to in S200 includes the following three types of data: target box annotation data, classification annotation data and segmentation mask annotation data, and S200 comprises: completing the encoding of the various types of annotation data at least according to coding offsets set respectively for the different types of annotation data, to obtain the one-dimensional truth code. That is to say, by introducing coding offsets, some embodiments of the present application ensure that codes of the same category (for example, the codes of the several coordinates of a target box belong to one type, and the codes of a mask belong to another) stay close to each other while codes of different categories (for example, the code of a target box and the code of a classification) stay far apart, so that the network can better distinguish the various types of codes, decoding is easier, and network convergence is better.
It should be noted that in some embodiments of the present application different non-zero coding offsets are set for the encoding of the classification annotation data and of the segmentation mask annotation data, respectively (in this case the coding offset for the target box annotation data can be regarded as 0). In other embodiments, different non-zero coding offsets may be set for the encoding of the target box annotation data and of the segmentation mask annotation data, respectively (in this case the coding offset for the classification annotation data can be regarded as 0). In some embodiments, different coding offsets may be set for the encoding of the target box annotation data and of the classification annotation data, respectively (in this case the coding offset for the segmentation mask annotation data can be regarded as 0). In some embodiments, different non-zero coding offsets may also be set for the target box annotation data, the classification annotation data and the segmentation mask annotation data, respectively. It is also understood that in some embodiments of the present application no coding offset is set for the encoding of any type of annotation data.
The following describes the process of encoding the target box annotation data to obtain the target box truth code of each target, using as an example annotation data for which non-zero coding offsets are set only for the classification annotation data and the segmentation mask annotation data.
In some embodiments of the present application, the target box annotation data may be the coordinate values of the four vertices of the target box in a set coordinate system; in other embodiments, the target box annotation data may be the coordinate values of any three vertices, or of the diagonal vertices, of the target box in the set coordinate system.
For example, as shown in Fig. 4, in some embodiments of the present application the annotation data of any target includes target box annotation data comprising the upper-left and lower-right corner coordinate values of the labeled target box, and S200 exemplarily comprises: S310, rounding and quantizing the upper-left and lower-right corner coordinate values to obtain the target box truth code of the target. Some embodiments of the present application quantize the coordinates of the target box annotation data directly, which speeds up the encoding of the target box.
The following describes the process of encoding the classification annotation data to obtain the classification truth code of each target.
As shown in Fig. 4, in some embodiments of the present application, if the annotation data of any target includes classification annotation data, S200 exemplarily comprises: S320, obtaining the classification truth code of the target from the classification annotation data of the target and a rounding quantization operation. Some embodiments of the present application perform rounding quantization on the classification annotation data to complete the encoding of the classification and quickly obtain the one-dimensional truth code.
When a coding offset is set for the classification annotation data, in some embodiments of the present application obtaining the classification truth code of the target from its classification annotation data and the rounding quantization operation in S200 exemplarily comprises: summing the classification annotation data and a first coding offset to obtain offset classification annotation data; and performing the rounding quantization operation on the offset classification annotation data to obtain the classification truth code. Some embodiments of the present application set a coding offset for the classification annotation data so that, after the target box annotation data (or the segmentation mask annotation data) and the classification annotation data are encoded, a certain interval exists between the resulting codes of different categories.
The following describes the process of encoding the segmentation mask annotation data to obtain the segmentation mask truth code of each target.
It will be appreciated that in some embodiments of the present application the segmentation mask annotation data are the foreground and background values annotated manually within the target box of each target. For example, the segmentation mask annotation data of a target may be obtained by labeling the foreground (i.e., the target) within its target box as 255 and the background as 0; alternatively, the foreground may be labeled as 0 and the background as 255.
As shown in Fig. 4, in some embodiments of the present application the annotation data of any target includes segmentation mask annotation data, and S200 exemplarily comprises: S330, performing at least sparse coding and quantization on the segmentation mask annotation data to obtain N quantized values, and using the N quantized values as the segmentation mask truth code of the target, where N is an integer greater than or equal to 1. That is to say, some embodiments of the present application sparsely encode the segmentation mask annotation data and quantize the coding coefficients; with this representation the high-resolution mask can be compressed into a low-dimensional vector, which greatly reduces the amount of computation.
It should be noted that in some embodiments of the present application all coefficients obtained after sparse coding may be quantized to obtain the quantized values, which preserves the target information to the greatest extent. In other embodiments, only part of the coefficients obtained after sparse coding are quantized, which reduces the amount of data to be processed and increases the processing speed.
For example, to reduce the amount of data to be processed, as shown in Fig. 5, in some embodiments of the present application the process referred to in S200 of obtaining the N quantized values of any target from its segmentation mask annotation data exemplarily comprises: S510, projecting the segmentation mask annotation data of the target to the frequency domain with a discrete cosine transform matrix to obtain a plurality of frequency-domain coefficients corresponding to the segmentation mask annotation data; S520, selecting, from the plurality of frequency-domain coefficients, the N coefficients corresponding to the low-frequency components as the encoding result; and S530, obtaining the N quantized values from the encoding result and a quantization operation, and using the N quantized values as the segmentation mask truth code of the target, where N is an integer greater than or equal to 1. For example, in some embodiments the size of the discrete cosine transform matrix is the same as the size of the segmentation mask truth code, and the value of each element of the discrete cosine transform matrix is computed at least from a cosine function. It can be understood that in some embodiments of the present application the segmentation mask annotation data are sparsely encoded with the discrete cosine transform matrix and the low-frequency components of the coding coefficients are quantized, so that the amount of data to be processed is compressed without affecting the target segmentation, which further increases the processing speed.
It should be noted that, in some embodiments of the present application, the process of obtaining the N quantized values according to the encoding result and the quantization operation, which is referred to by S200, exemplarily includes: obtaining an initial quantization result at least according to the coding result and the quantization matrix; and rounding the initial quantization result to obtain the N quantization values. Some embodiments of the present application complete quantization operations on coding results by introducing a quantization matrix, and obtain a true value code for segmentation mask annotation data.
In some embodiments of the present application the values of the elements in the quantization matrix may all be equal; in other embodiments the element values differ by position in order to increase the amount of useful data retained. For example, in some embodiments the process referred to by S200 for obtaining the initial quantization result at least from the coding result and the quantization matrix exemplarily includes: computing the product of the coding result and the inverse matrix of the quantization matrix to obtain a quantized coding result; and computing the sum of the quantized coding result and a second coding offset to obtain the initial quantization result. It should be noted that, when the quantized coding result is obtained in this way, in order to better retain the data relevant to instance segmentation and discard unimportant edge detail (and thereby speed up processing), in some embodiments the plurality of frequency-domain coefficients also include coefficients corresponding to high-frequency components, and the values of the quantization-matrix elements corresponding to the low-frequency components are smaller than those corresponding to the high-frequency components. For example, the element values of the quantization matrix grow from the top-left position toward the bottom-right position, so that with this matrix and the quantization formula the quantization error of the low-frequency coefficients (top-left corner) is kept small while the quantization error of the high-frequency coefficients (bottom-right corner) is allowed to be larger. In addition, the second coding offset serves the same purpose as the coding offset described above: introducing a coding offset when encoding the segmentation mask annotation data separates the coding result of this class of annotation data from the coding results of the other classes of annotation data of the target as much as possible, which helps the network model converge. By choosing the quantization-matrix element values according to position, some embodiments keep the error of the more important low-frequency components (those with a strong influence on the segmentation of the target) small while tolerating a slightly larger error on the high-frequency components that barely influence the segmentation, which improves the trained model. A hedged sketch of this quantization step follows.
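The sketch below illustrates the quantization just described, under the assumptions that the length-N coding result corresponds to a sqrt(N) x sqrt(N) block, that the quantization matrix grows from the top-left (low-frequency) toward the bottom-right (high-frequency) positions, and that the second coding offset is an illustrative constant; the patent's concrete matrix and offset appear in the worked example further below.

import numpy as np

def quantize_coding_result(coding_result: np.ndarray,
                           quant_matrix: np.ndarray,
                           second_offset: float = 1500.0) -> np.ndarray:
    """Quantize the low-frequency coding result of one target.

    Following the description: quantized coding result = coding result
    multiplied by the inverse of the quantization matrix, then shifted by
    the second coding offset and rounded to integers.
    """
    k = quant_matrix.shape[0]
    block = coding_result.reshape(k, k)
    # Product with the inverse of the quantization matrix.
    quantized = block @ np.linalg.inv(quant_matrix)
    # The second coding offset separates mask codes from box/class codes.
    initial = quantized + second_offset
    # Rounding yields the N integer quantized values (the mask truth code).
    return np.rint(initial).reshape(-1).astype(np.int64)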
The process of encoding the annotation data by the mask discrete encoding module 200 is illustrated below, with reference to fig. 2, for a specific image A and a mask of size 128x128.
For the above training image A (also referred to as image A), the size of image A is (w, h, 3) (the resolution of the image is w x h and the number of channels is 3). The annotation data of the ith target on image A is denoted B_i = {b_i, c_i, m_i}, and the annotation data of all targets of image A is denoted B = {B_1, ..., B_n}. Here b_i is the target frame annotation data of the ith target and consists of an upper-left corner coordinate and a lower-right corner coordinate, c_i is the class annotation of the ith target (i.e. its classification annotation data), and m_i is a binary mask of the ith target (i.e. its segmentation mask annotation data). For example, the binary mask m_i has size 128x128, with 0 representing background and 255 representing foreground, and n is the total number of targets contained in training image A.
The encoding process for the above-mentioned annotation data is as follows.
First, the coordinates contained in the target frame annotation data b_i and the classification annotation data c_i are quantized by rounding, where round[·] denotes the rounding function (the number 2000 in the following equation is an exemplary value of the first coding offset set for the coding of the classification annotation data):

b_i^q = round[b_i],   c_i^q = round[c_i + 2000].
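A minimal sketch of this box/class encoding, assuming the exemplary first coding offset of 2000 and plain rounding as the quantization:

import numpy as np

FIRST_CODING_OFFSET = 2000  # exemplary offset for the class code

def encode_box_and_class(box_xyxy, class_id):
    """Round-quantize the box corners and the offset class label (b_i^q, c_i^q)."""
    box_code = np.rint(np.asarray(box_xyxy, dtype=np.float64)).astype(np.int64)
    class_code = int(round(class_id + FIRST_CODING_OFFSET))
    return box_code, class_code

# Example: a target with box (10.3, 20.8, 90.1, 120.6) and class 7.
box_code, class_code = encode_box_and_class((10.3, 20.8, 90.1, 120.6), 7)
# box_code -> [10, 21, 90, 121]; class_code -> 2007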
Next, for the binary mask m_i, a discrete cosine transform matrix A needs to be determined for the encoding process; the value of each element of the discrete cosine transform matrix is calculated as follows (the standard orthonormal discrete cosine transform for a mask side length of 128):

A(i, j) = c(i) * cos((2*j + 1) * i * π / (2 * 128)), where c(i) = sqrt(1/128) for i = 0 and c(i) = sqrt(2/128) for i > 0.
It should be noted that, when the mask size output by the instance segmentation is 256, the denominator 128 in the corresponding c(i) calculation formula is replaced by 256; similarly, a person skilled in the art can design a suitable c(i) formula according to specific requirements. It can therefore be appreciated that the model provided by some embodiments of the present application can dynamically adjust the size of the output mask by adjusting at least the values of the coefficients c(i) in the discrete cosine transform matrix. A sketch of constructing such a transform matrix follows.
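A minimal sketch of building the discrete cosine transform matrix A for an adjustable mask side length, under the assumption that c(i) follows the standard orthonormal DCT normalization discussed above:

import numpy as np

def build_dct_matrix(size: int = 128) -> np.ndarray:
    """Build the size x size DCT matrix A with
    A[i, j] = c(i) * cos((2*j + 1) * i * pi / (2 * size)).

    c(i) = sqrt(1/size) for i = 0 and sqrt(2/size) otherwise, so changing
    `size` (e.g. to 256) changes the denominator in c(i) and thus the mask size.
    """
    i = np.arange(size).reshape(-1, 1)   # row index
    j = np.arange(size).reshape(1, -1)   # column index
    a = np.cos((2 * j + 1) * i * np.pi / (2 * size))
    c = np.full((size, 1), np.sqrt(2.0 / size))
    c[0, 0] = np.sqrt(1.0 / size)
    return c * a

A = build_dct_matrix(128)
# A is orthonormal, so A @ A.T is (numerically) the identity matrix.
assert np.allclose(A @ A.T, np.eye(128), atol=1e-8)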
Secondly, the projection of the segmentation mask annotation data (i.e. the truth mask) in the frequency domain is calculated according to the following formula, giving the frequency-domain coefficients F_i of the ith target on image A:

F_i = A * m_i * A^T.
Again, exploiting the sparsity of F_i, the top-left N coefficients (the top-left coefficients correspond to the low-frequency components) are taken as the coding result, i.e. the top-left sqrt(N) x sqrt(N) block of F_i is kept and flattened into the length-N coding result F_i^N.
Finally, the N quantization values are obtained from the coding result and the quantization-plus-rounding operation, i.e. F_i^N is quantized according to the following formula (the number 1500 below is an exemplary value of the second coding offset set for the coding of the segmentation mask annotation data):

m_i^q = round[ F_i^N * Q^(-1) + 1500 ]
where Q is the quantization matrix given below; its element values grow from the top-left position toward the bottom-right position, because the top-left positions correspond to the low-frequency coefficients among the frequency-domain coefficients and the bottom-right positions correspond to the high-frequency coefficients:
Q=[[16,11,10,16,24,40,51,61]*sqrt(N),
[12,12,14,19,26,58,60,55]*sqrt(N),
[14,13,16,24,40,57,69,56]*sqrt(N),
[14,17,22,29,51,87,80,62]*sqrt(N),
[18,22,37,56,68,109,103,77]*sqrt(N),
[24,35,55,64,81,104,113,92]*sqrt(N),
[49,64,78,87,103,121,120,101]*sqrt(N),
[72,92,95,98,112,100,103,99]*sqrt(N)]
In the above manner, the annotation data of all targets on image A is discretely coded into the following one-dimensional truth code, namely the concatenation [b_1^q, c_1^q, m_1^q, ..., b_n^q, c_n^q, m_n^q]; the dimension of this one-dimensional truth code is n · (4 + N + 1), since each of the n targets contributes 4 box coordinate values, 1 classification value and N segmentation mask quantization values.
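A minimal sketch of assembling the one-dimensional truth code for all targets of one image, reusing the illustrative helpers sketched above (encode_box_and_class and quantize_coding_result; the transform matrix can be produced with build_dct_matrix); the per-target layout of 4 box values, 1 class value and N mask values gives a total length of n·(4+N+1):

import numpy as np

def encode_image_annotations(targets, dct_matrix, quant_matrix, n_coeffs=64):
    """Discretely encode all targets of one image into a one-dimensional truth code.

    `targets` is a list of (box_xyxy, class_id, mask_128x128) tuples.
    """
    codes = []
    k = int(np.sqrt(n_coeffs))
    for box, class_id, mask in targets:
        box_code, class_code = encode_box_and_class(box, class_id)
        freq = dct_matrix @ mask.astype(np.float64) @ dct_matrix.T  # F_i = A * m_i * A^T
        low_freq = freq[:k, :k].reshape(-1)                         # top-left N coefficients
        mask_code = quantize_coding_result(low_freq, quant_matrix)  # N quantized values
        codes.append(np.concatenate([box_code, [class_code], mask_code]))
    one_dim_truth_code = np.concatenate(codes)                      # length n * (4 + N + 1)
    return one_dim_truth_code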
it will be appreciated that the input to the masked discrete encoding module of fig. 1 or 2 is B ═ B1,...,BnOutput isThen, the loss value obtaining module 300 of FIG. l or FIG. 2 encodes according to the one-dimensional true valueTo determine whether the encoding process can end.
Finally, the implementation of S300 is illustrated.
In S300, the difference between the one-dimensional prediction code and the one-dimensional truth code is computed with a loss function, and this difference is used, at least in part, to decide whether the training of the model can end. In some embodiments of the present application, when the loss function shows that the difference between the one-dimensional prediction code and the one-dimensional truth code satisfies the set requirement, training is terminated and the target instance segmentation model is obtained. In some embodiments, training is terminated to obtain the target instance segmentation model only after the loss value satisfies the set requirement and the set number of training cycles has been reached.
Taking image A above as an example, the loss value is obtained from a loss function that, for every target j, compares the target truth code B_j of the jth target with the target prediction code O_{1,j} of the jth target, over the whole one-dimensional code of total length

L = n · (4 + N + 1)

where L is the total length of the one-dimensional truth code, n represents the total number of targets on the training image, and N represents the length of the segmentation mask truth code.
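The exact loss formula is not reproduced above; the following is only an illustrative sketch, assuming a simple mean absolute error between the predicted and true one-dimensional codes normalized by the total code length L = n·(4+N+1) — any sequence loss comparing B_j with O_{1,j} could be substituted.

import numpy as np

def code_loss(pred_code: np.ndarray, truth_code: np.ndarray) -> float:
    """Illustrative per-element L1 loss between one-dimensional codes (an
    assumption, not the patent's exact formula); both codes have length
    n * (4 + N + 1)."""
    assert pred_code.shape == truth_code.shape
    total_length = truth_code.size            # L = n * (4 + N + 1)
    return float(np.abs(pred_code - truth_code).sum() / total_length)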
That is, some embodiments of the present application provide a method for obtaining a training loss value, so that the determination of whether a training process can be ended is more accurate and reasonable.
It can be understood that the training process described above yields a target instance segmentation model with an instance segmentation capability for images, and that this model comprises a target recurrent neural network and a target backbone network. The instance segmentation method for an arbitrary image to be segmented is described below in combination with this model.
It should be noted that the ideal output of instance segmentation is the target frame coordinates, the classification and the two-dimensional segmentation mask of each target, whereas the trained target instance segmentation model outputs a one-dimensional code. Therefore, in some embodiments of the present application a mask discrete decoding module 600, as shown in fig. 6, is added, and this module converts the one-dimensional code into the output format of the instance segmentation result.
The difference between fig. 6 and fig. 2 is that the target instance segmentation model of fig. 6 contains the language model (the target recurrent neural network 160) and the backbone network (the target backbone network 410) obtained after training. This model is obtained by training the network model 10 of fig. 2, has the same structure as that network model, and is able to infer, from an input test image or image to be segmented, the one-dimensional code corresponding to the instance segmentation result. Fig. 6 additionally includes a mask discrete decoding module 600 that performs exactly the inverse of the mask discrete encoding of fig. 2 on the predicted one-dimensional code, so that the output satisfies the format requirement of an instance segmentation result: by decoding the one-dimensional code, the mask discrete decoding module 600 of fig. 6 outputs the target frame coordinate values, the classification value and the two-dimensional segmentation mask of each target.
The flow of the instance segmentation method is illustratively set forth below in conjunction with fig. 7.
As shown in fig. 7, some embodiments of the present application provide an instance segmentation method comprising: S710, obtaining a feature map of an image to be segmented through the target backbone network; S720, acquiring an initial code corresponding to the image to be segmented; S730, inputting the feature map and the initial code into the target recurrent neural network obtained by training according to any embodiment of the first aspect, and obtaining the one-dimensional code of the segmentation result through the target recurrent neural network; and S740, decoding the one-dimensional code of the segmentation result to obtain the instance segmentation result corresponding to the image to be segmented.
That is, some embodiments of the present application provide a method for instance segmentation using the trained instance segmentation model; because the segmentation result output by the model is a one-dimensional code, it must be decoded before an output satisfying the instance segmentation requirements can be produced.
S710 may obtain the feature map of the image to be segmented through the target backbone network 410 shown in fig. 6, which is obtained by training the network model of fig. 1, and S720 obtains the initial code through the initial mask generating module 500 shown in fig. 6. It can be understood that the target instance segmentation model obtained through the training process of the above embodiments produces, from the input image to be segmented and the initial code, the one-dimensional codes of the segmentation results of all targets on that image. The target recurrent neural network of S730 is obtained by training the network model 10 of fig. 1 or fig. 2. A sketch of this inference pipeline is given below.
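A minimal sketch of the S710-S740 pipeline; the argument names (backbone, initial-code generator, recurrent decoder, mask decoder) are placeholders standing in for the trained components of fig. 6, not a definitive API.

def instance_segmentation(image, backbone, init_code_module, recurrent_net, decode_fn):
    """Run the four steps S710-S740 with trained components (all arguments are
    illustrative callables standing in for the modules of fig. 6)."""
    feature_map = backbone(image)                             # S710: feature map of the image
    initial_code = init_code_module(image)                    # S720: initial one-dimensional code
    one_dim_code = recurrent_net(feature_map, initial_code)   # S730: predicted one-dimensional code
    return decode_fn(one_dim_code)                            # S740: boxes, classes, 2-D masks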
The process of S740, performed by the mask discrete decoding module 600 of fig. 6, is described below. It can be appreciated that the mask discrete decoding module of fig. 6 performs the inverse of the operations performed by the mask discrete encoding module of fig. 2.
In some embodiments of the present application the one-dimensional code of the segmentation result referred to in S740 includes, for each target, a target frame code value, a classification code value and a segmentation mask code, and, as shown in fig. 8, S740 exemplarily includes: S741, performing inverse quantization on the target frame code value to obtain the target frame coordinate values; S742, obtaining the classification value from the classification code value and an inverse quantization operation; and S743, obtaining the two-dimensional segmentation mask at least from the segmentation mask code and the inverse quantization operation. Some embodiments of the present application thus inverse-quantize the target frame code value and the classification code value, respectively, to obtain the instance segmentation data expected as output.
For example, for the ith target on image A, the target one-dimensional code predicted by the target instance segmentation model is assumed to be O_i = [b_i^q, c_i^q, m_i^q], where b_i^q denotes the predicted target frame code, c_i^q the predicted classification code and m_i^q the predicted segmentation mask code; the first two codes are inverse-quantized respectively with the following formulas:

b_i = b_i^q,   c_i = c_i^q - 2000.
it will be appreciated that this example performs the inverse quantization operation in contrast to the quantization process described above with respect to the image a encoding process. That is, in some embodiments of the present application, the deriving a classification value according to the classification coded value and the inverse quantization operation includes: subtracting a first coding offset (e.g., the number 2000 involved in the above example dequantization process) from the classified coded value to obtain a non-offset classified coded value; and carrying out the inverse quantization operation on the non-offset classification coding value to obtain the classification value. This is because different classes of codes for the same target are spaced, so the quantization operation needs to be performed by first subtracting the code offset to obtain a classification result meeting the output requirement.
In some embodiments of the present application S743 exemplarily includes: obtaining an inverse quantization result from the segmentation mask code and the inverse quantization operation; selecting a plurality of coefficients from the inverse quantization result to replace a plurality of element values in an initialization matrix, thereby obtaining a dequantized coefficient matrix, where the values of the elements of the initialization matrix are obtained by random assignment and the size of the initialization matrix is determined by the size of the output mask in the instance segmentation result of the target; and inverse-transforming this matrix with an inverse discrete cosine transform matrix, i.e. the inverse of the discrete cosine transform matrix, to obtain the two-dimensional segmentation mask. That is, some embodiments of the present application use inverse quantization followed by an inverse discrete cosine transform to convert the mask code so that the output two-dimensional segmentation mask meets the output format required of an instance segmentation result.
If a second coding offset was designed into the segmentation mask code during encoding, then in some embodiments of the present application obtaining the inverse quantization result from the segmentation mask code and the inverse quantization operation in S743 includes: subtracting the second coding offset from the segmentation mask code to obtain a non-offset instance segmentation code; and performing the inverse quantization operation on the non-offset instance segmentation code to obtain the inverse quantization result. That is, before inverse-quantizing the segmentation mask code, some embodiments of the present application must subtract the coding offset that was introduced to increase the spacing between the different types of codes, so that the resulting two-dimensional segmentation mask meets the output requirement of the instance segmentation result.
For example, the predicted mask coding coefficients m_i^q are first inverse-quantized (the opposite of the quantization used in the encoding process above):

F_i^N = (m_i^q - 1500) * Q,

which is the inverse of the coding operation. Then an initialization matrix of 128x128 elements, each defaulting to 0, is created, and its first N coefficients are replaced by the inverse-quantized coefficients F_i^N; after reshaping to shape (128, 128), the decoded mask (i.e. the corresponding two-dimensional segmentation mask) is obtained as m_i = A^T * M * A, with shape (128, 128), where M denotes the filled matrix. It can be appreciated that the decoding process performs the inverse of the encoding process.
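A minimal sketch of this mask decoding, assuming the same exemplary quantization matrix Q, second coding offset 1500 and DCT matrix A as in the encoding example; the retained low-frequency coefficients are written back into the top-left block of a zero-initialized 128x128 matrix, matching where they were taken from during encoding, and the final thresholding back to a binary mask is an illustrative choice.

import numpy as np

def decode_mask(mask_code: np.ndarray,
                quant_matrix: np.ndarray,
                dct_matrix: np.ndarray,
                second_offset: float = 1500.0,
                mask_size: int = 128) -> np.ndarray:
    """Invert the mask encoding: subtract the offset, multiply by Q, write the
    coefficients back into the low-frequency (top-left) block, and apply the
    inverse DCT m = A^T * F * A."""
    k = quant_matrix.shape[0]
    block = mask_code.reshape(k, k).astype(np.float64)
    # Inverse quantization: undo the offset, then multiply by Q.
    coeffs = (block - second_offset) @ quant_matrix
    # Zero-initialized frequency matrix with the retained block filled in.
    freq = np.zeros((mask_size, mask_size), dtype=np.float64)
    freq[:k, :k] = coeffs
    # Inverse discrete cosine transform (A is orthonormal, so its inverse is A^T).
    mask = dct_matrix.T @ freq @ dct_matrix
    # Threshold back to a binary 0/255 mask (illustrative choice).
    return (mask > 127.5).astype(np.uint8) * 255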
Some embodiments of the present application provide a computer program product comprising a computer program, wherein the computer program when executed by a processor may implement any of the embodiments comprised by the methods of fig. 3 or fig. 7 as described above.
Some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, may implement any of the embodiments included in the methods of fig. 3 or fig. 7, as described above.
Some embodiments of the present application provide an electronic device 800 comprising a memory 810, a processor 820 and a computer program stored on the memory 810 and executable on the processor 820, wherein the processor 820 can implement all embodiments involved in the methods of fig. 3 or fig. 7 when reading the program from the memory 810 through the bus 830 and executing the program.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Claims (20)
1. A method of training an instance segmentation model, the method comprising:
inputting an initial code and a training image into a trained network model to obtain a one-dimensional predictive code, wherein the one-dimensional predictive code comprises a target predictive code aiming at each target on the training image;
acquiring one-dimensional truth value codes, wherein the one-dimensional truth value codes are obtained by discretely coding the label data of each target on the training image, the one-dimensional truth value codes comprise target truth value codes for each target, and the initial codes are one-dimensional vectors with the same dimension as the one-dimensional truth value codes;
and obtaining a loss value of the current prediction result according to the one-dimensional prediction coding and the one-dimensional truth value coding, and terminating the training process of the network model at least by confirming that the loss value meets the set requirement to obtain a target instance segmentation model, wherein the target instance segmentation model comprises a target recurrent neural network and a target backbone network.
2. The method of claim 1,
the target truth encoding comprises: target box truth value coding, classification truth value coding and segmentation mask truth value coding;
the target predictive coding includes target frame predictive coding, class predictive coding, and segmentation mask predictive coding.
3. The method of claim 2, wherein the labeling data of any object includes object box labeling data, the object box labeling data including an upper left coordinate value and a lower right coordinate value for labeling the object box;
wherein,
the target box truth value coding is obtained by the following coding method: and rounding and quantizing the coordinate values of the upper left corner and the lower right corner to obtain a target frame true value code of any target.
4. The method of any of claims 1-3, wherein the annotation data for any object comprises classification annotation data;
wherein,
the classification truth value coding is obtained by the following coding method: and obtaining the classification truth value code of any target according to the classification marking data and the rounding quantization operation of any target.
5. The method according to any one of claims 2-4, characterized in that the classification truth encoding is obtained by an encoding method:
summing the classified marking data and a first coding offset to obtain offset classified marking data;
and carrying out the rounding quantization operation on the offset classification labeling data to obtain the classification truth value code.
6. The method of any one of claims 1-5, wherein the segmentation mask truth encoding is obtained by an encoding method comprising:
performing frequency domain projection on the segmentation mask marking data of any target according to a discrete cosine transform matrix to obtain a plurality of frequency domain coefficients corresponding to the segmentation mask marking data of any target;
selecting N frequency domain coefficients corresponding to the low frequency component from the plurality of frequency domain coefficients as an encoding result;
and obtaining N quantization values according to the coding result and the quantization operation, and using the N quantization values as the segmentation mask truth value coding of the any target, wherein N is an integer greater than or equal to 1.
7. The method of claim 6, wherein the size of the DCT matrix is the same as the size of the true-valued code of the segmentation mask, and wherein the values of the elements of the DCT matrix are calculated at least according to a cosine function.
8. The method of claim 6, wherein the N quantization values are obtained by:
obtaining an initial quantization result at least according to the coding result and the quantization matrix;
and rounding the initial quantization result to obtain the N quantization values.
9. The method of claim 8, wherein the plurality of frequency domain coefficients further includes a high frequency component;
wherein,
the initial quantification result is obtained by the following method:
and obtaining the initial quantization result according to the product of the coding result and an inverse matrix corresponding to the quantization matrix, wherein the values of a plurality of elements corresponding to the low-frequency component in the quantization matrix are larger than the values of the elements corresponding to the high-frequency component.
10. The method of claim 8, wherein the initial quantization result is obtained by:
calculating the product of the coding result and the inverse matrix corresponding to the quantization matrix to obtain a quantization coding result;
and calculating the sum of the quantization coding result and the second coding offset to obtain the initial quantization result.
11. The method of any one of claims 1-10,
inputting the initial code and the training image into a trained network model to obtain a one-dimensional predictive code, wherein the one-dimensional predictive code comprises:
inputting the initialization code into a query module included in the network model to obtain a query vector;
inputting the training image into a backbone network included in the network model to obtain image characteristics;
inputting the image features and the query vector into a sequence-to-sequence module included in the network model to obtain a cross attention processing result;
inputting the cross attention processing result into a self attention processing module included in the network model to obtain a self attention processing result;
and inputting the self-attention processing result into a full connection layer included by the network model to obtain the one-dimensional predictive coding.
12. The method of any one of claims 1-11, wherein the loss value is obtained by the formula:
L=n·(4+N+1)
wherein L is the total length of the one-dimensional truth code, n represents the total number of targets on the training image, N characterizes the length of the segmentation mask truth code, B_j represents the target truth code of the jth target, and O_{1,j} characterizes the target prediction code of the jth target.
13. An instance segmentation method, comprising:
acquiring a feature map of an image to be segmented through a target backbone network obtained through training according to any one of claims 1-12;
acquiring an initial code corresponding to the image to be segmented;
inputting the feature map and the initial code into a target recurrent neural network obtained by training according to any one of claims 1 to 12, and obtaining a one-dimensional code of a segmentation result through the target recurrent neural network;
and decoding the one-dimensional codes of the segmentation results to obtain example segmentation results corresponding to the image to be segmented.
14. The method of claim 13, wherein the segmentation result one-dimensional encoding includes a target frame encoding value, a classification encoding value, and a segmentation mask encoding corresponding to the segmentation result of each target;
wherein,
the decoding the one-dimensional encoding of the segmentation result to obtain an example segmentation result corresponding to the image to be segmented comprises:
carrying out inverse quantization processing on the target frame coding value to obtain a target frame coordinate value;
obtaining a classification value according to the classification coding value and inverse quantization operation;
and obtaining a two-dimensional segmentation mask at least according to the segmentation mask coding and the inverse quantization operation.
15. The method of claim 14, wherein deriving a classification value based on the classification-encoded value and an inverse quantization operation comprises:
subtracting a first coding offset from the classified coding value to obtain a non-offset classified coding value;
and carrying out the inverse quantization operation on the non-offset classification coding value to obtain the classification value.
16. The method of claim 14, wherein said deriving a two-dimensional segmentation mask from at least the segmentation mask coding and dequantizing operations comprises:
obtaining an inverse quantization result according to the segmentation mask code and the inverse quantization operation;
selecting a plurality of coefficients from the dequantization result to replace a plurality of element values in an initialization matrix to obtain a dequantization matrix, wherein the value of each element in the initialization matrix is obtained by random assignment, and the size of the initialization matrix is determined by the size of an output mask in an instance segmentation result corresponding to any one target;
and performing inverse transformation on the inverse quantization matrix according to an inverse discrete cosine transform matrix to obtain the two-dimensional segmentation mask, wherein the inverse discrete cosine transform matrix is an inverse matrix of a discrete cosine transform matrix.
17. The method of claim 16, wherein said deriving an inverse quantization result based on said segmentation mask encoding and said inverse quantization operation comprises:
subtracting a second coding offset from the segmentation mask code to obtain a non-offset example segmentation code;
and performing the inverse quantization operation on the non-offset example segmentation code to obtain the inverse quantization result.
18. A computer program product, characterized in that the computer program product comprises a computer program, wherein the computer program when executed by a processor is adapted to perform the method of any of claims 1-17.
19. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 17.
20. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program is operable to implement the method of any one of claims 1-17.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111500687.1A CN114359291A (en) | 2021-12-09 | 2021-12-09 | Method for training instance segmentation model and instance segmentation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111500687.1A CN114359291A (en) | 2021-12-09 | 2021-12-09 | Method for training instance segmentation model and instance segmentation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114359291A true CN114359291A (en) | 2022-04-15 |
Family
ID=81096647
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111500687.1A Pending CN114359291A (en) | 2021-12-09 | 2021-12-09 | Method for training instance segmentation model and instance segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114359291A (en) |
-
2021
- 2021-12-09 CN CN202111500687.1A patent/CN114359291A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117292120A (en) * | 2023-11-27 | 2023-12-26 | 南昌工程学院 | Light-weight visible light insulator target detection method and system |
CN117292120B (en) * | 2023-11-27 | 2024-02-09 | 南昌工程学院 | Light-weight visible light insulator target detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |