CN116993976A - Reference image segmentation model training method and reference image segmentation method


Info

Publication number
CN116993976A
Authority
CN
China
Prior art keywords
image
target
text
feature
reference image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310877057.9A
Other languages
Chinese (zh)
Other versions
CN116993976B (en)
Inventor
张兆翔
樊峻秘
甘睿彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202310877057.9A priority Critical patent/CN116993976B/en
Publication of CN116993976A publication Critical patent/CN116993976A/en
Application granted granted Critical
Publication of CN116993976B publication Critical patent/CN116993976B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/253: Fusion techniques of extracted features
    • G06F 18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256: Fusion of classification results relating to different input data, e.g. multimodal recognition
    • G06V 10/40: Extraction of image or video features
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion of extracted features
    • G06V 10/809: Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811: Fusion of classification results from classifiers operating on different input data, e.g. multi-modal recognition
    • Y02T 10/40: Engine management systems


Abstract

The invention relates to the technical field of computer vision and provides a reference image segmentation model training method and a reference image segmentation method. First, the text description corresponding to each target instance in an image sample is input into an initial reference image segmentation model, and a text encoder extracts features from the text description to obtain initial text features. An image encoder then extracts the image features of the image sample, and cross attention operations iteratively optimize and fuse the two branches to obtain optimized target text features and target cross-modal image fusion features. Finally, a decoder produces a segmentation result from the target cross-modal image fusion features, and model training is carried out by means of the calculated training loss. The method can effectively improve the segmentation capability of the target reference image segmentation model for low-quality text descriptions and reduce the occurrence of mismatching of confusable targets.

Description

Reference image segmentation model training method and reference image segmentation method
Technical Field
The invention relates to the technical field of computer vision, in particular to a reference image segmentation model training method and a reference image segmentation method.
Background
Image segmentation is an important and classical computer vision task and has wide application in the fields of intelligent driving, video analysis, remote sensing monitoring and the like.
Reference image segmentation guides a segmentation model to locate a specific target in an image by providing a natural language text description of that target, so that the corresponding target is segmented out; how to accurately express and fuse the feature information of the text branch and the image branch is the research focus of reference image segmentation. However, in existing reference image segmentation methods, the features of the text description branch are generated directly by a pre-trained language model, so reliable text features for guiding localization are difficult to obtain when facing low-quality text descriptions; this leads to mismatching of confusable targets, poor performance of the reference image segmentation model, and inaccurate segmentation results.
Disclosure of Invention
The invention provides a reference image segmentation model training method and a reference image segmentation method, which are used for overcoming the defect in the prior art that low-quality text descriptions yield unreliable text features and cause mismatching of confusable targets.
The invention provides a reference image segmentation model training method, which comprises the following steps:
Collecting each target instance in an image sample and text description corresponding to each target instance, inputting the text description to a text encoder in an initial reference image segmentation model, and extracting features of the text description by the text encoder to obtain initial text features;
inputting the image sample and the initial text feature into an image encoder in the initial reference image segmentation model, extracting the initial image feature of the image sample by the image encoder, optimizing the initial text feature by adopting a cross attention mechanism based on the initial image feature to obtain a text optimization feature, fusing the text optimization feature and the initial image feature to obtain a fused image feature, and performing iterative optimization and fusion on the text optimization feature and the fused image feature to obtain a target text feature and a target cross-modal image fusion feature;
inputting the target cross-modal image fusion characteristic to a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, calculating training loss based on the target text characteristic, a target instance label in the image sample and the segmentation result, and performing iterative optimization on structural parameters of the initial reference image segmentation model based on the training loss to obtain a target reference image segmentation model.
According to the reference image segmentation model training method provided by the invention, calculating the training loss based on the target text features, the target instance labels in the image sample and the segmentation result comprises the following steps:
calculating a segmentation loss based on a target instance tag in the image sample and the segmentation result;
based on target text features corresponding to different target examples in the image sample, calculating consistency loss corresponding to each target example and contrast loss corresponding to each target text feature;
and calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature.
According to the reference image segmentation model training method provided by the invention, calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature comprises the following steps:
calculating the total consistency loss corresponding to each image sample based on the consistency loss corresponding to each target instance, and calculating the total contrast loss corresponding to each image sample based on the contrast loss corresponding to each target text feature;
And calculating a first weighted summation result of the total consistency loss and the total contrast loss, calculating a second weighted summation result of the first weighted summation result and the segmentation loss, and taking the second weighted summation result as the training loss.
According to the training method of the reference image segmentation model provided by the invention, based on the target text features corresponding to different target examples in the image sample, the contrast loss corresponding to each target text feature is calculated, and the method comprises the following steps:
based on the target text features corresponding to the same target instance in the image sample, calculating positive similarity of the target text features corresponding to the same target instance;
calculating the negative similarity of the target text features corresponding to different target examples based on the target text features corresponding to different target examples in the image sample;
and calculating the contrast loss corresponding to each target text feature based on the positive similarity and the negative similarity.
According to the training method of the reference image segmentation model provided by the invention, the image encoder comprises a plurality of layers of structures which are connected in sequence, each layer of structure comprises a first input, a first output, a second input and a second output, wherein the first output of the former layer of structure is used as the first input of the latter layer of structure, and the second output of the former layer of structure is used as the second input of the latter layer of structure;
The first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature;
each layer of structure comprises a coding block, a text perception fusion module, a first residual module, a second residual module, a first addition module and a second addition module, wherein the second input of each layer of structure passes through the coding block to obtain candidate image features; the first input of each layer of structure and the candidate image features obtained by the current layer of structure pass through the text perception fusion module, which optimizes the first input of the current layer of structure by adopting a cross attention mechanism to obtain text optimization features, and fuses the text optimization features with the candidate image features by adopting a cross attention mechanism to obtain fused image features;
The text optimization features pass through the first residual module and are then added to the first input of the current layer structure by the first addition module to obtain the first output of the current layer structure;
and the fused image features pass through the second residual module and are then added to the candidate image features by the second addition module to obtain the second output of the current layer structure.
According to the reference image segmentation model training method provided by the invention, the text encoder is a pre-trained language model.
According to the training method of the reference image segmentation model provided by the invention, each target instance in the image sample and the text description corresponding to each target instance are collected, and then the method comprises the following steps:
and constructing the image sample, each target instance and a text description triplet corresponding to each target instance, and taking each triplet as a training sample of the initial reference image segmentation model.
The invention also provides a reference image segmentation method, which comprises the following steps:
acquiring a to-be-segmented image and description information corresponding to a target object in the to-be-segmented image;
and inputting the image to be segmented and the description information into a target reference image segmentation model determined by the reference image segmentation model training method, so as to obtain a segmentation result corresponding to the image to be segmented, which is output by the target reference image segmentation model.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the reference image segmentation model training method or the reference image segmentation method according to any one of the above when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a reference image segmentation model training method, or a reference image segmentation method, as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a reference image segmentation model training method, or a reference image segmentation method, as described in any of the above.
According to the reference image segmentation model training method and the reference image segmentation method, the reference image segmentation model training method comprises the steps of firstly collecting each target instance in an image sample and text description corresponding to each target instance, inputting the text description to a text encoder in an initial reference image segmentation model, and extracting features of the text description by the text encoder to obtain initial text features; then inputting the image sample and the initial text feature to an image encoder in an initial reference image segmentation model, extracting the initial image feature of the image sample by the image encoder, positioning and optimizing the initial text feature by using an image context feature by adopting a cross attention mechanism to obtain a text optimization feature, fusing the text optimization feature and the initial image feature to achieve the aim of reversely optimizing the image feature to obtain a fused image feature, and performing iterative optimization and fusion on the text optimization feature and the fused image feature to obtain a target text feature and a target cross-modal image fusion feature; and finally, inputting the target cross-modal image fusion characteristic to a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, calculating training loss based on the target text characteristic, a target instance label in the image sample and the segmentation result, and performing model training by using the training loss. The method can effectively improve the segmentation capability of the target reference image segmentation model obtained through training on low-quality text description, and reduce the occurrence of the situation of wrong matching of the confusion target.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort.
FIG. 1 is a flow chart of a reference image segmentation model training method provided by the invention;
FIG. 2 is a schematic diagram of the structure of an image encoder in the reference image segmentation model training method provided by the invention;
FIG. 3 is a schematic structural diagram of the TAF module in the reference image segmentation model training method provided by the invention;
FIG. 4 is a schematic diagram of the construction of training samples in the reference image segmentation model training method provided by the invention;
FIG. 5 is a flow chart of the reference image segmentation method provided by the invention;
FIG. 6 is a schematic diagram of a reference image segmentation model training device provided by the invention;
FIG. 7 is a schematic diagram of a reference image segmentation apparatus according to the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description and claims of the invention, terms such as "first" and "second" may explicitly or implicitly include one or more of the features so described. In the description of the invention, unless otherwise indicated, "a plurality" means two or more. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Because existing reference image segmentation methods have difficulty obtaining reliable text features for guiding localization when facing low-quality text descriptions, mismatching of confusable targets occurs, the performance of the reference image segmentation model is poor, and the obtained segmentation result is inaccurate. Therefore, an embodiment of the invention provides a reference image segmentation model training method.
Fig. 1 is a flow chart of a training method for a reference image segmentation model according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s11, collecting each target instance in an image sample and text description corresponding to each target instance, inputting the text description to a text encoder in an initial reference image segmentation model, and extracting features of the text description by the text encoder to obtain initial text features;
s12, inputting an image sample and initial text features into an image encoder in an initial reference image segmentation model, extracting initial image features of the image sample by the image encoder, optimizing the initial text features by adopting a cross attention mechanism based on the initial image features to obtain text optimized features, fusing the text optimized features and the initial image features by adopting the cross attention mechanism to obtain fused image features, and carrying out iterative optimization and fusion on alternative text features and fused image features to obtain target text features and target cross-modal image fusion features;
s13, inputting the target cross-mode image fusion characteristic to a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, calculating training loss based on the target text characteristic, a target instance label in an image sample and the segmentation result, and performing iterative optimization on structural parameters of the initial reference image segmentation model based on the training loss to obtain the target reference image segmentation model.
Specifically, in the reference image segmentation model training method provided in the embodiment of the present invention, the execution subject is a reference image segmentation model training device, and the device may be configured in a computer, where the computer may be a local computer or a cloud computer, and the local computer may be a computer, a tablet, or the like, and is not limited herein specifically.
First, step S11 is executed to collect each target instance in the image samples and the text descriptions corresponding to each target instance. There may be a plurality of image samples, each image sample may include one or more target instances, i.e. target objects, and each target instance may correspond to one or more text descriptions. For example, one image sample may include two target instances, person 1 and person 2; the text descriptions corresponding to person 1 include (1) the girl bending her left elbow and (2) the girl at the side of the pool, and the text descriptions corresponding to person 2 include (1) the girl brushing her teeth and (2) the girl brushing her teeth in front.
Thereafter, an initial reference image segmentation model is introduced, which includes a text encoder, an image encoder and a decoder. The text encoder in the initial reference image segmentation model extracts features from the text descriptions to obtain the initial text features. The text encoder may be an untrained initial text encoder or a pre-trained language model, i.e., a language model pre-trained on a corpus. It will be appreciated that the initial text features are rough text features; if they were directly used to segment a target instance in an image sample, the segmentation result would be inaccurate. Therefore, deeper target text features need to be acquired.
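As an illustration, if the pre-trained language model is a BERT-style encoder loaded from the HuggingFace transformers library (an assumption; the invention does not name a specific model), the initial text features could be extracted roughly as follows:

```python
# Hedged sketch of the text-encoder step; the model name, tokenizer and
# maximum length are illustrative assumptions, not specified by the invention.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

descriptions = ["the girl bending her left elbow",
                "the girl at the side of the pool"]
tokens = tokenizer(descriptions, padding="max_length", max_length=20,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    # (B, T, C_l): one rough initial text feature vector per word token
    initial_text_features = text_encoder(**tokens).last_hidden_state
```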
Then step S12 is performed: the image sample and the initial text features are input to the image encoder in the initial reference image segmentation model, by which the initial image features of the image sample can be extracted. The image encoder may comprise a backbone network with multiple layers, and each layer of the backbone network may comprise coding blocks, which can be used for image feature extraction. Accordingly, the image encoder further includes a processing structure coupled to each layer of the backbone network. The processing structure may have the following functions:
The initial text features are optimized by a cross attention mechanism using the initial image features to obtain the text optimization features. Here, a cross attention operation is constructed in which the query Q is built from the initial text features, and the key K and the value V are built from the initial image features, where L represents the initial text features, F represents the initial image features, and the keys and values are obtained from F through two separate 1×1 convolution layers. The optimized text optimization features are obtained in combination with the normalized exponential function (softmax).
A cross attention mechanism then continues to be adopted to fuse the text optimization features with the initial image features to obtain the fused image features. Here, a cross attention operation is constructed with the initial image features as the query Q′ and the optimized text optimization features as the key K′ and the value V′, and the fused image features, into which the text optimization features have been fused, are obtained in combination with the normalized exponential function (softmax).
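Concretely, both directions reduce to the standard cross-attention pattern sketched below; the square-root scaling is an assumption, and the 1×1 convolution projections are omitted for brevity:

```python
# Minimal sketch of the two cross-attention directions described above;
# the sqrt(d) scaling is an assumption, and the 1x1-convolution
# projections for Q/K/V are omitted for brevity.
import torch

def cross_attention(q, k, v):
    # q: (B, Tq, C), k/v: (B, Tk, C) -> (B, Tq, C)
    attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

B, T, HW, C = 2, 20, 64, 256
L = torch.randn(B, T, C)    # initial text features
F = torch.randn(B, HW, C)   # initial image features
L_opt = cross_attention(L, F, F)            # text optimization features
F_fused = cross_attention(F, L_opt, L_opt)  # fused image features
```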
The above process is iterated with residual connections in the image encoder; the text optimization features produced by the processing structure connected to the last layer of the backbone network are the target text features, and the fused image features into which they are fused are the target cross-modal image fusion features.
Finally, step S13 is executed: the target cross-modal image fusion features are input to the decoder in the initial reference image segmentation model to obtain the segmentation result output by the decoder, and the training loss is calculated using the target text features, the target instance labels in the image sample and the segmentation result. The training loss may include a text feature extraction loss, which may be determined from the target text features corresponding to different target instances in each image sample, and a segmentation loss, which may be determined from the target instance labels in the image sample and the segmentation result.
The calculated training loss is used to iteratively optimize the structural parameters of the initial reference image segmentation model until a preset number of iterations is reached or the training loss converges, so as to obtain the target reference image segmentation model.
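A minimal sketch of this outer optimization loop might look as follows; the optimizer choice, the learning rate and the model/loss interfaces are assumptions for illustration, not part of the invention:

```python
# Hedged sketch of the training loop; `model` is assumed to expose the
# forward pass and the training loss described above.
import torch

def train(model, loader, num_epochs=40, lr=1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for image, text_tokens, instance_mask in loader:
            seg_logits, target_text_feats = model(image, text_tokens)
            loss = model.training_loss(seg_logits, target_text_feats,
                                       instance_mask)
            optimizer.zero_grad()
            loss.backward()  # iteratively optimize the structural parameters
            optimizer.step()
    return model
```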
According to the reference image segmentation model training method provided by the embodiment of the invention, firstly, text descriptions corresponding to target instances in an image sample are collected, the text descriptions are input to a text encoder in an initial reference image segmentation model, and the text encoder performs feature extraction on the text descriptions to obtain initial text features; then the image sample and the initial text features are input to an image encoder in the initial reference image segmentation model, the image encoder extracts initial image features of the image sample, the initial text features are located and optimized with the image context features by adopting a cross attention mechanism to obtain text optimization features, the text optimization features and the initial image features are fused to achieve the aim of reversely optimizing the initial image features to obtain fused image features, and the text optimization features and the fused image features are iteratively optimized and fused to obtain target text features and target cross-modal image fusion features; finally, the target cross-modal image fusion features are input to a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, training loss is calculated based on the target text features, target instance labels in the image sample and the segmentation result, and model training is performed with the training loss. The method can effectively improve the segmentation capability of the trained target reference image segmentation model for low-quality text descriptions, and reduce the occurrence of mismatching of confusable targets.
On the basis of the above embodiment, the training method for the reference image segmentation model provided in the embodiment of the present invention calculates training loss based on the target text feature, the target instance tag in the image sample, and the segmentation result, including:
calculating a segmentation loss based on a target instance tag in the image sample and the segmentation result;
based on target text features corresponding to different target examples in the image sample, calculating consistency loss corresponding to each target example and contrast loss corresponding to each target text feature;
and calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature.
Specifically, when calculating the training loss, the segmentation loss may be calculated first by using the target instance labels in the image sample and the segmentation result. The segmentation loss may be calculated using the cross entropy loss function L_ce.
Then, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature can be calculated by using the target text features corresponding to different target instances in each image sample. Wherein the consistency loss is used to ensure a smaller differentiation between target text features of the same target instance and the contrast loss is used to ensure a larger differentiation between target text features of different target instances.
For each target instance O_i in an image sample, the corresponding target text features may include m_i (m_i ≥ 1) entries, which together form a target text feature tuple G_i, where L_ia denotes the a-th (1 ≤ a ≤ m_i) target text feature corresponding to target instance O_i.
The consistency loss corresponding to target instance O_i is calculated from the pairwise differences between each feature L_ia and every other feature L_ib (b ≠ a) of the same target instance, averaged over the total number of such pairs.
If the target instance has only one text description, i.e. m_i = 1, then the consistency loss corresponding to that instance is 0.
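The formula itself appears only as an image in the original publication; a reconstruction consistent with the surrounding description, in which the pairwise distance term d(·,·) is an assumption, is:

$$\mathcal{L}_{consist}^{i} = \frac{1}{m_i\,(m_i-1)} \sum_{a=1}^{m_i} \sum_{b \neq a} d\left(L_{ia},\, L_{ib}\right), \qquad \mathcal{L}_{consist}^{i} = 0 \ \text{when } m_i = 1,$$

where m_i·(m_i − 1) is the total number of (a, b) pairs over which the average is taken.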
For different target instances O_a, O_b ∈ {O_1, O_2, …, O_n} in an image sample I, the corresponding target text feature tuples G_a and G_b have lengths m_a and m_b respectively. Supervising the target text features with the context information of the image features means that target text features corresponding to the same target instance should have a smaller degree of differentiation, while target text features corresponding to different target instances should have a larger degree of differentiation.
For a target text feature L_ai ∈ G_a in image sample I, every other feature L_aj ∈ G_a (i ≠ j) forms a positive paired sample with it, and every target text feature L_bj ∈ G_b forms a negative paired sample with it. By means of the positive paired samples and the negative paired samples, the contrast loss corresponding to each target text feature can be calculated.
Further, training loss can be calculated using the segmentation loss, the consistency loss for each target instance, and the contrast loss for each target text feature.
In the embodiment of the invention, the training loss includes the consistency loss and the contrast loss, so that a discriminative constraint can be imposed on the target text features describing different target instances, in order to uncover the constraints implicitly existing between the target instances in the image sample and the text descriptions.
On the basis of the above embodiment, the training method for the reference image segmentation model provided in the embodiment of the present invention calculates a contrast loss corresponding to each target text feature based on the target text features corresponding to different target instances in the image sample, including:
based on the target text features corresponding to the same target instance in the image sample, calculating positive similarity of the target text features corresponding to the same target instance;
calculating the negative similarity of the target text features corresponding to different target examples based on the target text features corresponding to different target examples in the image sample;
and calculating the contrast loss corresponding to each target text feature based on the positive similarity and the negative similarity.
Specifically, when calculating the contrast loss corresponding to each target text feature, the positive similarity S_pos may first be calculated from the target text features corresponding to the same target instance in the image sample; when m_a = 1, S_pos = 1.
The negative similarity S_neg of the target text features corresponding to different target instances is then calculated from the target text features corresponding to different target instances in the image sample.
Finally, the contrast loss corresponding to each target text feature is calculated from the positive similarity and the negative similarity.
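Since these formulas are likewise reproduced only as images in the publication, the sketch below shows one plausible realization under two explicit assumptions: the similarities are pairwise cosine similarities averaged over the positive and negative pairs, and the loss takes an InfoNCE-style log-ratio form with a temperature τ:

```python
# Hedged sketch of the contrast loss for one target text feature; the
# cosine-similarity averaging, the log-ratio form and the temperature
# are assumptions, since the original formulas are not reproduced in text.
import torch
import torch.nn.functional as F

def contrast_loss(G_a, G_b, i, tau=0.1):
    L_ai = G_a[i]
    pos = [F.cosine_similarity(L_ai, L_aj, dim=0)
           for j, L_aj in enumerate(G_a) if j != i]   # positive pairs
    # S_pos = 1 when the instance has a single description (m_a = 1)
    S_pos = torch.stack(pos).mean() if pos else torch.tensor(1.0)
    neg = [F.cosine_similarity(L_ai, L_bj, dim=0) for L_bj in G_b]
    S_neg = torch.stack(neg).mean()                    # negative pairs
    return -torch.log(torch.exp(S_pos / tau)
                      / (torch.exp(S_pos / tau) + torch.exp(S_neg / tau)))
```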
On the basis of the above embodiment, the training method for the reference image segmentation model provided in the embodiment of the present invention calculates the training loss based on the segmentation loss, the consistency loss corresponding to each target instance, and the contrast loss corresponding to each target text feature, including:
calculating the total consistency loss corresponding to each image sample based on the consistency loss corresponding to each target instance, and calculating the total contrast loss corresponding to each image sample based on the contrast loss corresponding to each target text feature;
And calculating a first weighted summation result of the total consistency loss and the total contrast loss, calculating a second weighted summation result of the first weighted summation result and the segmentation loss, and taking the second weighted summation result as the training loss.
Specifically, when calculating the training loss, the consistency losses corresponding to the target instances may first be used to calculate the total consistency loss corresponding to each image sample: the total consistency loss L_consist corresponding to an image sample is obtained by averaging the consistency losses of all the target instances contained in that image sample. The total contrast loss corresponding to each image sample is calculated from the contrast losses corresponding to the target text features: the total contrast loss L_contra corresponding to an image sample is obtained by averaging over all the target text features corresponding to that image sample. When there is only one target instance in an image sample, the total contrast loss corresponding to that image sample is L_contra = 0.
Thereafter, the training loss can be calculated by the following formula:
L = L_ce + α·(L_contra + β·L_consist);
where L_contra + β·L_consist is the first weighted summation result, and α and β are weight values.
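In code form, this double weighted summation is a direct transcription of the formula above; the default values of the weights α and β below are placeholders, since the text does not specify them:

```python
# Direct transcription of L = L_ce + alpha * (L_contra + beta * L_consist);
# the default weight values are illustrative placeholders.
def training_loss(L_ce, L_contra, L_consist, alpha=0.1, beta=1.0):
    first = L_contra + beta * L_consist   # first weighted summation result
    return L_ce + alpha * first           # second weighted summation result
```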
In the embodiment of the invention, this double weighted fusion makes the obtained training loss more beneficial to model training.
On the basis of the above embodiment, the image encoder includes sequentially connected multi-layer structures, each layer structure including a first input, a first output, a second input and a second output, the first output of the former layer structure being the first input of the latter layer structure, the second output of the former layer structure being the second input of the latter layer structure;
the first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature;
each layer of structure comprises a coding block, a text perception fusion module, a first residual module, a second residual module, a first addition module and a second addition module, wherein the second input of each layer of structure passes through the coding block to obtain candidate image features; the first input of each layer of structure and the candidate image features obtained by the current layer of structure pass through the text perception fusion module, which optimizes the first input of the current layer of structure by adopting a cross attention mechanism to obtain text optimization features, and fuses the text optimization features with the candidate image features by adopting a cross attention mechanism to obtain fused image features;
The text optimization features pass through the first residual module and are then added to the first input of the current layer structure by the first addition module to obtain the first output of the current layer structure;
and the fused image features pass through the second residual module and are then added to the candidate image features by the second addition module to obtain the second output of the current layer structure.
Specifically, the image encoder includes a plurality of sequentially connected layer structures N; for example, it may include a four-layer structure, i.e. 4×N. Only a schematic diagram of one layer structure is shown in fig. 2. Each layer structure i of the image encoder includes a first input L_i, a first output L_io, a second input V_ii and a second output F_io.
The first output of the previous layer structure serves as the first input of the subsequent layer structure in the image encoder, and the second output of the previous layer structure serves as the second input of the subsequent layer structure. The first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature.
Each layer structure includes a coding block (a Swin block), a Text-Aware Fusion (TAF) module, a first residual module, a second residual module, a first addition module and a second addition module. In fig. 2, the "+" connected to the first residual module indicates the first addition module, and the "+" connected to the second residual module indicates the second addition module.
The second input V_ii of each layer structure passes through the Swin coding block to obtain the candidate image feature V_i. The first input L_i of each layer structure and the candidate image feature V_i obtained by the current layer structure then pass through the TAF module to obtain the text optimization feature and the fused image feature, respectively.
After passing through the first residual module, the text optimization feature is added to the first input L_i of the current layer structure by the first addition module to obtain the first output L_io of the current layer structure.
After passing through the second residual module, the fused image feature is added to the candidate image feature V_i by the second addition module to obtain the second output F_io of the current layer structure.
As shown in fig. 3, which shows a schematic structure of the TAF module, the inputs of the TAF module are the candidate image feature V_i of shape (B, HW, C_i) and the first input L_i of the current layer structure of shape (B, C_l, T), where B is the number of triplets formed by the image sample, each target instance and the text description corresponding to each target instance, and may be greater than or equal to 1; H is the height and W the width of the candidate image feature; C_i is the number of channels of the candidate image feature; C_l is the channel dimension of the text feature; and T is the maximum length of the text feature, i.e. the maximum number of words contained in the text description.
V_i (B, HW, C_i) is transposed to V1 (B, C_i, HW) and passed through two separate 1×1 convolution layers to obtain V2 (B, C_l, HW) and V3 (B, HW, C_l).
L_i (B, C_l, T) is passed through a 1×1 convolution layer and a 1×1 convolution layer w_l, respectively, to obtain L1 (B, T, C_l) and L2 (B, C_l, T).
L1 (B, T, C_l) is matrix-multiplied with V2 (B, C_l, HW) and normalized by the exponential function (softmax); the result is matrix-multiplied with V3 (B, HW, C_l) to obtain L_CA. L_CA is passed through a 1×1 convolution layer w_cl to obtain L3 (B, C_l, T). L2 (B, C_l, T) and L3 (B, C_l, T) are combined by element-wise multiplication and passed through a 1×1 convolution layer w_rl to obtain the text optimization feature of shape (B, C_l, T).
V_i (B, HW, C_i) is passed through a 1×1 convolution layer and a 1×1 convolution layer w_i, respectively, to obtain V4 and V5, both of shape (B, HW, C_i). The text optimization feature (B, C_l, T) is passed through two separate 1×1 convolution layers to obtain L4 (B, C_i, T) and L5 (B, T, C_i).
V4 is matrix-multiplied with L4 and normalized by the exponential function; the result is matrix-multiplied with L5 (B, T, C_i) to obtain V_CA (B, HW, C_i). V_CA is passed through a 1×1 convolution layer w_ci to obtain L6; L6 and V5 are combined by element-wise multiplication and passed through a 1×1 convolution layer w_fi to obtain the fused image feature of shape (B, HW, C_i).
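Putting the two branches together, a sketch of the TAF module under the shapes above could look as follows; nn.Linear over the channel dimension stands in for the 1×1 convolution layers (equivalent for this tensor layout), and the layer names are illustrative:

```python
# Hedged sketch of the Text-Aware Fusion (TAF) module following the
# shapes described above; nn.Linear replaces the 1x1 convolutions.
import torch
import torch.nn as nn

class TAF(nn.Module):
    def __init__(self, c_i: int, c_l: int):
        super().__init__()
        # text-optimization branch
        self.w_v2 = nn.Linear(c_i, c_l)   # V_i -> V2
        self.w_v3 = nn.Linear(c_i, c_l)   # V_i -> V3
        self.w_l1 = nn.Linear(c_l, c_l)   # L_i -> L1
        self.w_l  = nn.Linear(c_l, c_l)   # L_i -> L2
        self.w_cl = nn.Linear(c_l, c_l)   # L_CA -> L3
        self.w_rl = nn.Linear(c_l, c_l)   # -> text optimization feature
        # image-fusion branch
        self.w_v4 = nn.Linear(c_i, c_i)   # V_i -> V4
        self.w_i  = nn.Linear(c_i, c_i)   # V_i -> V5
        self.w_l4 = nn.Linear(c_l, c_i)   # text -> L4
        self.w_l5 = nn.Linear(c_l, c_i)   # text -> L5
        self.w_ci = nn.Linear(c_i, c_i)   # V_CA -> L6
        self.w_fi = nn.Linear(c_i, c_i)   # -> fused image feature

    def forward(self, v, l):
        # v: (B, HW, C_i) candidate image feature; l: (B, C_l, T) text input
        v2 = self.w_v2(v).transpose(1, 2)           # (B, C_l, HW)
        v3 = self.w_v3(v)                           # (B, HW, C_l)
        l1 = self.w_l1(l.transpose(1, 2))           # (B, T, C_l)
        l2 = self.w_l(l.transpose(1, 2))            # (B, T, C_l)
        attn_t = torch.softmax(l1 @ v2, dim=-1)     # (B, T, HW)
        l3 = self.w_cl(attn_t @ v3)                 # L_CA -> L3: (B, T, C_l)
        l_opt = self.w_rl(l2 * l3).transpose(1, 2)  # (B, C_l, T)

        v4 = self.w_v4(v)                           # (B, HW, C_i)
        v5 = self.w_i(v)                            # (B, HW, C_i)
        l4 = self.w_l4(l_opt.transpose(1, 2)).transpose(1, 2)  # (B, C_i, T)
        l5 = self.w_l5(l_opt.transpose(1, 2))       # (B, T, C_i)
        attn_v = torch.softmax(v4 @ l4, dim=-1)     # (B, HW, T)
        v_ca = attn_v @ l5                          # (B, HW, C_i)
        f_fused = self.w_fi(self.w_ci(v_ca) * v5)   # (B, HW, C_i)
        return l_opt, f_fused

taf = TAF(c_i=256, c_l=512)
l_opt, f_fused = taf(torch.randn(2, 64, 256), torch.randn(2, 512, 20))
```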
In the embodiment of the invention, the text perception fusion module, the residual modules and the addition modules are introduced into the image encoder, so that the text optimization features and the fused image features can be obtained, which ensures the segmentation capability of the target reference image segmentation model for low-quality text descriptions.
On the basis of the above embodiment, in the reference image segmentation model training method provided in the embodiment of the present invention, after collecting each target instance in the image sample and the text description corresponding to each target instance, the method includes:
And constructing the image sample, each target instance and a text description triplet corresponding to each target instance, and taking each triplet as a training sample of the initial reference image segmentation model.
Specifically, as shown in fig. 4, each acquired image sample may include a plurality of target instances, each of which may have a plurality of textual descriptions. Thus, to facilitate training of the initial reference image segmentation model, after the collection of each target instance in the image sample and the corresponding textual description of each target instance, triplets may be constructed from one image sample, one target instance, and one textual description, with each triplet being taken as a training sample of the initial reference image segmentation model.
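A sketch of this triplet construction, with illustrative field names, is shown below; each combination of one image, one instance and one description becomes one training sample:

```python
# Sketch of the triplet construction: one training sample per
# (image sample, target instance, text description) combination.
# The field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Triplet:
    image_path: str       # the image sample
    instance_id: int      # the target instance (its mask label)
    description: str      # one text description of that instance

def build_triplets(samples):
    """samples: iterable of (image_path, {instance_id: [descriptions]})."""
    triplets = []
    for image_path, instances in samples:
        for instance_id, descriptions in instances.items():
            for description in descriptions:
                triplets.append(Triplet(image_path, instance_id, description))
    return triplets
```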
As shown in fig. 5, on the basis of the above embodiment, the embodiment of the present invention further provides a reference image segmentation method, which includes:
s21, acquiring an image to be segmented and description information corresponding to a target object in the image to be segmented;
s22, inputting the image to be segmented and the description information into a target reference image segmentation model determined by the reference image segmentation model training method provided in the above embodiments, so as to obtain a segmentation result corresponding to the image to be segmented, which is output by the target reference image segmentation model.
Specifically, in the reference image segmentation method provided in the embodiment of the present invention, the execution subject is a reference image segmentation device, and the device may be configured in a computer, where the computer may be a local computer or a cloud computer, and the local computer may be a computer, a tablet, or the like, and is not limited herein specifically.
Step S21 is first executed to obtain the image to be segmented and the description information corresponding to the target object in the image to be segmented. The image to be segmented is an image in which the category of the target object needs to be determined, and it may include one or more target objects, for example people, animals, trees, buildings or roads. The description information corresponding to each target object may include one or more pieces, which is not specifically limited herein.
Then, step S22 is executed, in which the image to be segmented and the description information are input into the target reference image segmentation model determined by the reference image segmentation model training method provided in the above embodiments, so as to obtain a segmentation result corresponding to the image to be segmented output by the target reference image segmentation model. The segmentation result is a pixel-by-pixel segmentation result, which is used for representing the category of each target object contained in the image to be segmented.
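At inference time, usage reduces to a single forward pass; the following sketch assumes the model interface from the training sketch above and a 0.5 threshold on the predicted mask, both of which are illustrative assumptions:

```python
# Hedged usage sketch: `model` is a trained target reference image
# segmentation model with the forward signature assumed above.
import torch

@torch.no_grad()
def segment(model, image, description, tokenizer):
    tokens = tokenizer([description], return_tensors="pt")
    seg_logits, _ = model(image.unsqueeze(0), tokens)
    # pixel-wise segmentation result for the described target object
    return (seg_logits.sigmoid() > 0.5).squeeze(0)
```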
According to the reference image segmentation method provided by the embodiment of the invention, due to the adoption of the target reference image segmentation model, the segmentation result is more accurate and reliable, and the occurrence of wrong matching of the confusion target is reduced.
As shown in fig. 6, on the basis of the above embodiment, an apparatus for training a reference image segmentation model is provided in an embodiment of the present invention, including:
the text coding module 61 is configured to collect each target instance in the image sample and text descriptions corresponding to each target instance, input the text descriptions to a text encoder in an initial reference image segmentation model, and perform feature extraction on the text descriptions by the text encoder to obtain initial text features;
the text image fusion module 62 is configured to input the image sample and the initial text feature to an image encoder in the initial reference image segmentation model, extract, by the image encoder, the initial image feature of the image sample, optimize, based on the initial image feature, the initial text feature by using a cross attention mechanism to obtain a text optimized feature, fuse the text feature optimization with the initial image feature to obtain a fused image feature, and iteratively optimize and fuse the text optimized feature and the fused image feature to obtain a target text feature and a target cross-modal image fusion feature;
The model training module 63 is configured to input the target cross-modal image fusion feature to a decoder in the initial reference image segmentation model, obtain a segmentation result output by the decoder, calculate a training loss based on the target text feature, a target instance tag in the image sample, and the segmentation result, and iteratively optimize structural parameters of the initial reference image segmentation model based on the training loss, so as to obtain a target reference image segmentation model.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the model training module is specifically configured to:
calculating a segmentation loss based on a target instance tag in the image sample and the segmentation result;
based on target text features corresponding to different target examples in the image sample, calculating consistency loss corresponding to each target example and contrast loss corresponding to each target text feature;
and calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the model training module is further specifically configured to:
calculating the total consistency loss corresponding to each image sample based on the consistency loss corresponding to each target instance, and calculating the total contrast loss corresponding to each image sample based on the contrast loss corresponding to each target text feature;
and calculating a first weighted summation result of the total consistency loss and the total contrast loss, calculating a second weighted summation result of the first weighted summation result and the segmentation loss, and taking the second weighted summation result as the training loss.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the model training module is further specifically configured to:
based on the target text features corresponding to the same target instance in the image sample, calculating positive similarity of the target text features corresponding to the same target instance;
calculating the negative similarity of the target text features corresponding to different target examples based on the target text features corresponding to different target examples in the image sample;
and calculating the contrast loss corresponding to each target text feature based on the positive similarity and the negative similarity.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the image encoder includes sequentially connected multi-layer structures, each layer structure including a first input, a first output, a second input and a second output, the first output of the former layer structure serving as the first input of the latter layer structure, and the second output of the former layer structure serving as the second input of the latter layer structure;
The first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature;
each layer of structure comprises a coding block, a text perception fusion module, a first residual module, a second residual module, a first addition module and a second addition module, wherein the second input of each layer of structure passes through the coding block to obtain candidate image features; the first input of each layer of structure and the candidate image features obtained by the current layer of structure pass through the text perception fusion module, which optimizes the first input of the current layer of structure by adopting a cross attention mechanism to obtain text optimization features, and fuses the text optimization features with the candidate image features by adopting a cross attention mechanism to obtain fused image features;
The text optimization features pass through the first residual module and are then added to the first input of the current layer structure by the first addition module to obtain the first output of the current layer structure;
and the fused image features pass through the second residual module and are then added to the candidate image features by the second addition module to obtain the second output of the current layer structure.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the text encoder is a pre-trained language model.
On the basis of the above embodiment, the reference image segmentation model training device provided in the embodiment of the present invention further includes a triplet construction module, configured to:
and constructing the image sample, each target instance and a text description triplet corresponding to each target instance, and taking each triplet as a training sample of the initial reference image segmentation model.
Specifically, the functions of each module in the reference image segmentation model training device provided in the embodiment of the present invention correspond one-to-one with the operation flow of each step in the above method embodiment, and the achieved effects are consistent.
As shown in Fig. 7, on the basis of the above embodiments, an embodiment of the present invention provides a reference image segmentation apparatus, including:
an obtaining module 71, configured to obtain an image to be segmented and description information corresponding to a target object in the image to be segmented;
a reference image segmentation module 72, configured to input the image to be segmented and the description information into a target reference image segmentation model determined by the reference image segmentation model training method provided in the above embodiments, so as to obtain the segmentation result corresponding to the image to be segmented that is output by the target reference image segmentation model.
Specifically, the functions of the modules in the reference image segmentation apparatus provided in the embodiment of the present invention correspond one-to-one with the steps of the above method embodiment, and achieve the same effects; a minimal inference sketch follows.
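The inference flow of the apparatus reduces to a single forward pass. The sketch below assumes a trained PyTorch model whose call interface takes the preprocessed image and the tokenized description; the function and argument names are illustrative:

    import torch

    def segment_by_reference(model, image, description):
        # image: preprocessed image tensor; description: tokenized text of the
        # target object; returns the segmentation mask predicted by the model.
        model.eval()
        with torch.no_grad():
            mask = model(image, description)
        return mask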
Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, where the processor 810, the communication interface 820 and the memory 830 communicate with each other through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the reference image segmentation model training method or the reference image segmentation method provided in the embodiments described above.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program, the computer program can be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer can perform the reference image segmentation model training method or the reference image segmentation method provided in the foregoing embodiments.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the reference image segmentation model training method or the reference image segmentation method provided in the above embodiments.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some technical features thereof can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A reference image segmentation model training method, comprising:
collecting each target instance in an image sample and the text description corresponding to each target instance, inputting the text description into a text encoder in an initial reference image segmentation model, and extracting, by the text encoder, features of the text description to obtain initial text features;
inputting the image sample and the initial text feature into an image encoder in the initial reference image segmentation model, extracting, by the image encoder, the initial image feature of the image sample, optimizing the initial text feature by means of a cross-attention mechanism based on the initial image feature to obtain a text optimization feature, fusing the text optimization feature and the initial image feature to obtain a fused image feature, and performing iterative optimization and fusion on the text optimization feature and the fused image feature to obtain a target text feature and a target cross-modal image fusion feature;
inputting the target cross-modal image fusion feature into a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, calculating a training loss based on the target text feature, a target instance label in the image sample and the segmentation result, and performing iterative optimization on structural parameters of the initial reference image segmentation model based on the training loss to obtain a target reference image segmentation model.
2. The reference image segmentation model training method of claim 1, wherein calculating the training loss based on the target text feature, the target instance label in the image sample and the segmentation result comprises:
calculating a segmentation loss based on the target instance label in the image sample and the segmentation result;
calculating, based on the target text features corresponding to different target instances in the image sample, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature;
and calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature.
3. The reference image segmentation model training method of claim 2, wherein calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature comprises:
calculating the total consistency loss corresponding to each image sample based on the consistency loss corresponding to each target instance, and calculating the total contrast loss corresponding to each image sample based on the contrast loss corresponding to each target text feature;
and calculating a first weighted summation result of the total consistency loss and the total contrast loss, calculating a second weighted summation result of the first weighted summation result and the segmentation loss, and taking the second weighted summation result as the training loss.
4. The reference image segmentation model training method of claim 2, wherein calculating the contrast loss corresponding to each target text feature based on the target text features corresponding to different target instances in the image sample comprises:
calculating the positive similarity between the target text features corresponding to the same target instance in the image sample;
calculating the negative similarity between the target text features corresponding to different target instances in the image sample;
and calculating the contrast loss corresponding to each target text feature based on the positive similarity and the negative similarity.
5. The reference image segmentation model training method of claim 1, wherein the image encoder comprises a plurality of sequentially connected layer structures, each layer structure having a first input, a first output, a second input and a second output, the first output of a previous layer structure serving as the first input of the next layer structure, and the second output of the previous layer structure serving as the second input of the next layer structure;
the first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature;
each layer structure comprises a coding block, a text-aware fusion module, a first residual module, a second residual module, a first addition module and a second addition module; the second input of each layer structure passes through the coding block to obtain candidate image features; the text-aware fusion module receives the first input of the current layer structure and the candidate image features obtained by the current layer structure, optimizes the first input of the current layer structure by means of a cross-attention mechanism to obtain text optimization features, and fuses the text optimization features with the candidate image features by means of a cross-attention mechanism to obtain fused image features;
the text optimization features pass through the first residual module and are then added to the first input of the current layer structure by the first addition module to obtain the first output of the current layer structure;
and the fused image features pass through the second residual module and are then added to the candidate image features by the second addition module to obtain the second output of the current layer structure.
6. The reference image segmentation model training method according to any one of claims 1-5, wherein the text encoder is a pre-trained language model.
7. The reference image segmentation model training method according to any one of claims 1-5, wherein collecting each target instance in the image sample and the text description corresponding to each target instance comprises:
constructing triplets of the image sample, each target instance and the text description corresponding to each target instance, and taking each triplet as a training sample of the initial reference image segmentation model.
8. A reference image segmentation method, comprising:
acquiring an image to be segmented and description information corresponding to a target object in the image to be segmented;
inputting the image to be segmented and the description information into a target reference image segmentation model determined by the reference image segmentation model training method according to any one of claims 1-7, to obtain a segmentation result corresponding to the image to be segmented output by the target reference image segmentation model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the reference image segmentation model training method of any one of claims 1-7 or the reference image segmentation method of claim 8.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the reference image segmentation model training method according to any one of claims 1-7, or the reference image segmentation method according to claim 8.
CN202310877057.9A 2023-07-17 2023-07-17 Reference image segmentation model training method and reference image segmentation method Active CN116993976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310877057.9A CN116993976B (en) 2023-07-17 2023-07-17 Reference image segmentation model training method and reference image segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310877057.9A CN116993976B (en) 2023-07-17 2023-07-17 Reference image segmentation model training method and reference image segmentation method

Publications (2)

Publication Number Publication Date
CN116993976A true CN116993976A (en) 2023-11-03
CN116993976B CN116993976B (en) 2024-06-14

Family

ID=88522374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310877057.9A Active CN116993976B (en) 2023-07-17 2023-07-17 Reference image segmentation model training method and reference image segmentation method

Country Status (1)

Country Link
CN (1) CN116993976B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210045716A1 (en) * 2019-08-13 2021-02-18 GE Precision Healthcare LLC Method and system for providing interaction with a visual artificial intelligence ultrasound image segmentation module
WO2021146951A1 (en) * 2020-01-21 2021-07-29 京东方科技集团股份有限公司 Text detection method and apparatus, and storage medium
CN111832570A (en) * 2020-07-02 2020-10-27 北京工业大学 Image semantic segmentation model training method and system
CN113792113A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language model obtaining and task processing method, device, equipment and medium
WO2022205657A1 (en) * 2021-04-02 2022-10-06 中国科学院深圳先进技术研究院 Csm image segmentation method and apparatus, terminal device, and storage medium
US20220391755A1 (en) * 2021-05-26 2022-12-08 Salesforce.Com, Inc. Systems and methods for vision-and-language representation learning
CN115018805A (en) * 2022-06-21 2022-09-06 推想医疗科技股份有限公司 Segmentation model training method, image segmentation method, device, equipment and medium
CN115757692A (en) * 2022-10-20 2023-03-07 华为技术有限公司 Data processing method and device
CN115393854A (en) * 2022-10-27 2022-11-25 粤港澳大湾区数字经济研究院(福田) Visual alignment processing method, terminal and storage medium
CN115761757A (en) * 2022-11-04 2023-03-07 福州大学 Multi-mode text page classification method based on decoupling feature guidance
CN116229584A (en) * 2022-12-31 2023-06-06 重庆傲雄在线信息技术有限公司 Text segmentation recognition method, system, equipment and medium in artificial intelligence field

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHAO YANG et al.: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2022, 27 September 2022, pages 18134-18144 *
ZHICHAO WEI et al.: "Linguistic Query-Guided Mask Generation for Referring Image Segmentation", arXiv, vol. 2023, 22 March 2023, pages 3-4 *
SU Yu: "Research on Image-Text Cross-Modal Retrieval Methods Based on Multimodal Pre-training" (基于多模态预训练的图文跨模态检索方法研究), Wanfang Data, vol. 2022, 14 November 2022 *

Also Published As

Publication number Publication date
CN116993976B (en) 2024-06-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant