CN116993976A - Reference image segmentation model training method and reference image segmentation method


Info

Publication number
CN116993976A
Authority
CN
China
Prior art keywords
image
target
text
feature
reference image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310877057.9A
Other languages
Chinese (zh)
Other versions
CN116993976B (en)
Inventor
张兆翔
樊峻秘
甘睿彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202310877057.9A priority Critical patent/CN116993976B/en
Publication of CN116993976A publication Critical patent/CN116993976A/en
Application granted granted Critical
Publication of CN116993976B publication Critical patent/CN116993976B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/253: Fusion techniques of extracted features
    • G06F 18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256: Fusion of classification results relating to different input data, e.g. multimodal recognition
    • G06V 10/40: Extraction of image or video features
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion of extracted features
    • G06V 10/809: Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811: Fusion of classification results from classifiers operating on different input data, e.g. multi-modal recognition
    • Y02T 10/40: Engine management systems


Abstract

The invention relates to the technical field of computer vision and provides a reference image segmentation model training method and a reference image segmentation method. First, the text description corresponding to each target instance in an image sample is input into an initial reference image segmentation model, and a text encoder extracts features from the text description to obtain initial text features. An image encoder then extracts the image features of the image sample, and cross attention operations iteratively optimize and fuse the two branches to obtain optimized target text features and target cross-modal image fusion features. Finally, a decoder produces a segmentation result from the target cross-modal image fusion features, and model training is carried out by means of the calculated training loss. The method can effectively improve the segmentation capability of the target reference image segmentation model for low-quality text descriptions and reduce the occurrence of mismatching of confusable targets.

Description

Reference image segmentation model training method and reference image segmentation method
Technical Field
The invention relates to the technical field of computer vision, in particular to a reference image segmentation model training method and a reference image segmentation method.
Background
Image segmentation is an important and classical computer vision task and has wide application in the fields of intelligent driving, video analysis, remote sensing monitoring and the like.
Reference image segmentation guides a segmentation model to locate a specific target in an image by providing a natural language text description of that target, so that the corresponding target is segmented out; how to accurately express and fuse the feature information of the text branch and the image branch is the research focus of reference image segmentation. However, in existing reference image segmentation methods, the features of the text description branch are generated directly by a pre-trained language model, so reliable text features for guiding localization are difficult to obtain when facing low-quality text descriptions; this leads to mismatching of confusable targets, poor performance of the reference image segmentation model, and inaccurate segmentation results.
Disclosure of Invention
The invention provides a reference image segmentation model training method and a reference image segmentation method, which are used for overcoming the defect in the prior art that low-quality text descriptions yield unreliable text features and cause mismatching of confusable targets.
The invention provides a reference image segmentation model training method, which comprises the following steps:
Collecting each target instance in an image sample and text description corresponding to each target instance, inputting the text description to a text encoder in an initial reference image segmentation model, and extracting features of the text description by the text encoder to obtain initial text features;
inputting the image sample and the initial text feature into an image encoder in the initial reference image segmentation model, extracting the initial image feature of the image sample by the image encoder, optimizing the initial text feature by adopting a cross attention mechanism based on the initial image feature to obtain a text optimization feature, fusing the text optimization feature and the initial image feature to obtain a fused image feature, and performing iterative optimization and fusion on the text optimization feature and the fused image feature to obtain a target text feature and a target cross-modal image fusion feature;
inputting the target cross-modal image fusion characteristic to a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, calculating training loss based on the target text characteristic, a target instance label in the image sample and the segmentation result, and performing iterative optimization on structural parameters of the initial reference image segmentation model based on the training loss to obtain a target reference image segmentation model.
According to the reference image segmentation model training method provided by the invention, calculating the training loss based on the target text features, the target instance labels in the image sample and the segmentation result comprises the following steps:
calculating a segmentation loss based on a target instance tag in the image sample and the segmentation result;
based on target text features corresponding to different target examples in the image sample, calculating consistency loss corresponding to each target example and contrast loss corresponding to each target text feature;
and calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature.
According to the reference image segmentation model training method provided by the invention, calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature comprises the following steps:
calculating the total consistency loss corresponding to each image sample based on the consistency loss corresponding to each target instance, and calculating the total contrast loss corresponding to each image sample based on the contrast loss corresponding to each target text feature;
And calculating a first weighted summation result of the total consistency loss and the total contrast loss, calculating a second weighted summation result of the first weighted summation result and the segmentation loss, and taking the second weighted summation result as the training loss.
According to the training method of the reference image segmentation model provided by the invention, based on the target text features corresponding to different target examples in the image sample, the contrast loss corresponding to each target text feature is calculated, and the method comprises the following steps:
based on the target text features corresponding to the same target instance in the image sample, calculating positive similarity of the target text features corresponding to the same target instance;
calculating the negative similarity of the target text features corresponding to different target examples based on the target text features corresponding to different target examples in the image sample;
and calculating the contrast loss corresponding to each target text feature based on the positive similarity and the negative similarity.
According to the training method of the reference image segmentation model provided by the invention, the image encoder comprises a plurality of layers of structures which are connected in sequence, each layer of structure comprises a first input, a first output, a second input and a second output, wherein the first output of the former layer of structure is used as the first input of the latter layer of structure, and the second output of the former layer of structure is used as the second input of the latter layer of structure;
The first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature;
each layer of structure comprises a coding block, a text perception fusion module, a first residual module, a second residual module, a first addition module and a second addition module, wherein the second input of each layer of structure passes through the coding block to obtain candidate image features; the first input of each layer of structure and the candidate image features obtained by the current layer of structure pass through the text perception fusion module, which optimizes the first input of the current layer of structure by adopting a cross attention mechanism to obtain text optimization features, and fuses the text optimization features with the candidate image features by adopting a cross attention mechanism to obtain fused image features;
The text optimization features pass through the first residual module and are then added to the first input of the current layer structure by the first addition module to obtain the first output of the current layer structure;
and the fused image features pass through the second residual module and are then added to the candidate image features by the second addition module to obtain the second output of the current layer structure.
According to the reference image segmentation model training method provided by the invention, the text encoder is a pre-trained language model.
According to the training method of the reference image segmentation model provided by the invention, each target instance in the image sample and the text description corresponding to each target instance are collected, and then the method comprises the following steps:
and constructing the image sample, each target instance and a text description triplet corresponding to each target instance, and taking each triplet as a training sample of the initial reference image segmentation model.
The invention also provides a reference image segmentation method, which comprises the following steps:
acquiring a to-be-segmented image and description information corresponding to a target object in the to-be-segmented image;
and inputting the image to be segmented and the description information into a target reference image segmentation model determined by the reference image segmentation model training method, so as to obtain a segmentation result corresponding to the image to be segmented, which is output by the target reference image segmentation model.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the reference image segmentation model training method or the reference image segmentation method according to any one of the above when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a reference image segmentation model training method, or a reference image segmentation method, as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a reference image segmentation model training method, or a reference image segmentation method, as described in any of the above.
According to the reference image segmentation model training method and the reference image segmentation method, the reference image segmentation model training method comprises the steps of firstly collecting each target instance in an image sample and text description corresponding to each target instance, inputting the text description to a text encoder in an initial reference image segmentation model, and extracting features of the text description by the text encoder to obtain initial text features; then inputting the image sample and the initial text feature to an image encoder in an initial reference image segmentation model, extracting the initial image feature of the image sample by the image encoder, positioning and optimizing the initial text feature by using an image context feature by adopting a cross attention mechanism to obtain a text optimization feature, fusing the text optimization feature and the initial image feature to achieve the aim of reversely optimizing the image feature to obtain a fused image feature, and performing iterative optimization and fusion on the text optimization feature and the fused image feature to obtain a target text feature and a target cross-modal image fusion feature; and finally, inputting the target cross-modal image fusion characteristic to a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, calculating training loss based on the target text characteristic, a target instance label in the image sample and the segmentation result, and performing model training by using the training loss. The method can effectively improve the segmentation capability of the target reference image segmentation model obtained through training on low-quality text description, and reduce the occurrence of the situation of wrong matching of the confusion target.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort.
FIG. 1 is a flow chart of a reference image segmentation model training method provided by the invention;
FIG. 2 is a schematic diagram of the structure of an image encoder in the reference image segmentation model training method provided by the invention;
FIG. 3 is a schematic structural diagram of the TAF module in the reference image segmentation model training method provided by the invention;
FIG. 4 is a schematic diagram of the construction of training samples in the reference image segmentation model training method provided by the invention;
FIG. 5 is a flow chart of the reference image segmentation method provided by the invention;
FIG. 6 is a schematic diagram of a reference image segmentation model training device provided by the invention;
FIG. 7 is a schematic diagram of a reference image segmentation apparatus according to the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description and claims of the invention, terms such as "first" and "second" may explicitly or implicitly include one or more of the features so described. In the description of the invention, unless otherwise indicated, "a plurality" means two or more. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Because existing reference image segmentation methods have difficulty obtaining reliable text features for guiding localization when facing low-quality text descriptions, mismatching of confusable targets occurs, the performance of the reference image segmentation model is poor, and the obtained segmentation result is inaccurate. Therefore, an embodiment of the invention provides a reference image segmentation model training method.
Fig. 1 is a flow chart of a training method for a reference image segmentation model according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s11, collecting each target instance in an image sample and text description corresponding to each target instance, inputting the text description to a text encoder in an initial reference image segmentation model, and extracting features of the text description by the text encoder to obtain initial text features;
s12, inputting an image sample and initial text features into an image encoder in an initial reference image segmentation model, extracting initial image features of the image sample by the image encoder, optimizing the initial text features by adopting a cross attention mechanism based on the initial image features to obtain text optimized features, fusing the text optimized features and the initial image features by adopting the cross attention mechanism to obtain fused image features, and carrying out iterative optimization and fusion on alternative text features and fused image features to obtain target text features and target cross-modal image fusion features;
s13, inputting the target cross-mode image fusion characteristic to a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, calculating training loss based on the target text characteristic, a target instance label in an image sample and the segmentation result, and performing iterative optimization on structural parameters of the initial reference image segmentation model based on the training loss to obtain the target reference image segmentation model.
Specifically, in the reference image segmentation model training method provided in the embodiment of the present invention, the execution subject is a reference image segmentation model training device, and the device may be configured in a computer, where the computer may be a local computer or a cloud computer, and the local computer may be a computer, a tablet, or the like, and is not limited herein specifically.
First, step S11 is executed to collect each target instance in the image samples and the text descriptions corresponding to each target instance. There may be a plurality of image samples, each image sample may include one or more target instances, i.e. target objects, and each target instance may correspond to one or more text descriptions. For example, one image sample may include two target instances, person 1 and person 2; the text descriptions corresponding to person 1 include (1) the girl bending her left elbow and (2) the girl at the side of the pool, and the text descriptions corresponding to person 2 include (1) the girl brushing her teeth and (2) the girl brushing her teeth in front.
Thereafter, an initial reference image segmentation model is introduced, which includes a text encoder, an image encoder and a decoder. The text encoder in the initial reference image segmentation model extracts features from the text descriptions to obtain the initial text features. The text encoder may be an untrained initial text encoder or a pre-trained language model, i.e., a language model pre-trained on a corpus. It will be appreciated that the initial text features are rough text features; if they were directly used to segment a target instance in an image sample, the segmentation result would be inaccurate. Therefore, deeper target text features need to be acquired.
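As an illustration, if the pre-trained language model is a BERT-style encoder loaded from the HuggingFace transformers library (an assumption; the invention does not name a specific model), the initial text features could be extracted roughly as follows:

```python
# Hedged sketch of the text-encoder step; the model name, tokenizer and
# maximum length are illustrative assumptions, not specified by the invention.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

descriptions = ["the girl bending her left elbow",
                "the girl at the side of the pool"]
tokens = tokenizer(descriptions, padding="max_length", max_length=20,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    # (B, T, C_l): one rough initial text feature vector per word token
    initial_text_features = text_encoder(**tokens).last_hidden_state
```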
Then step S12 is performed: the image sample and the initial text features are input to the image encoder in the initial reference image segmentation model, by which the initial image features of the image sample can be extracted. The image encoder may comprise a backbone network with multiple layers, and each layer of the backbone network may comprise coding blocks, which can be used for image feature extraction. Accordingly, the image encoder further includes a processing structure coupled to each layer of the backbone network. The processing structure may have the following functions:
The initial text features are optimized by a cross attention mechanism using the initial image features to obtain the text optimization features. Here, a cross attention operation is constructed in which the query Q is built from the initial text features, and the key K and the value V are built from the initial image features, where L represents the initial text features, F represents the initial image features, and the keys and values are obtained from F through two separate 1×1 convolution layers. The optimized text optimization features are obtained in combination with the normalized exponential function (softmax).
A cross attention mechanism then continues to be adopted to fuse the text optimization features with the initial image features to obtain the fused image features. Here, a cross attention operation is constructed with the initial image features as the query Q′ and the optimized text optimization features as the key K′ and the value V′, and the fused image features, into which the text optimization features have been fused, are obtained in combination with the normalized exponential function (softmax).
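Concretely, both directions reduce to the standard cross-attention pattern sketched below; the square-root scaling is an assumption, and the 1×1 convolution projections are omitted for brevity:

```python
# Minimal sketch of the two cross-attention directions described above;
# the sqrt(d) scaling is an assumption, and the 1x1-convolution
# projections for Q/K/V are omitted for brevity.
import torch

def cross_attention(q, k, v):
    # q: (B, Tq, C), k/v: (B, Tk, C) -> (B, Tq, C)
    attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

B, T, HW, C = 2, 20, 64, 256
L = torch.randn(B, T, C)    # initial text features
F = torch.randn(B, HW, C)   # initial image features
L_opt = cross_attention(L, F, F)            # text optimization features
F_fused = cross_attention(F, L_opt, L_opt)  # fused image features
```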
The above process is iterated with residual connections in the image encoder; the text optimization features produced by the processing structure connected to the last layer of the backbone network are the target text features, and the fused image features into which they are fused are the target cross-modal image fusion features.
Finally, step S13 is executed: the target cross-modal image fusion features are input to the decoder in the initial reference image segmentation model to obtain the segmentation result output by the decoder, and the training loss is calculated using the target text features, the target instance labels in the image sample and the segmentation result. The training loss may include a text feature extraction loss, which may be determined from the target text features corresponding to different target instances in each image sample, and a segmentation loss, which may be determined from the target instance labels in the image sample and the segmentation result.
The calculated training loss is used to iteratively optimize the structural parameters of the initial reference image segmentation model until a preset number of iterations is reached or the training loss converges, so as to obtain the target reference image segmentation model.
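A minimal sketch of this outer optimization loop might look as follows; the optimizer choice, the learning rate and the model/loss interfaces are assumptions for illustration, not part of the invention:

```python
# Hedged sketch of the training loop; `model` is assumed to expose the
# forward pass and the training loss described above.
import torch

def train(model, loader, num_epochs=40, lr=1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for image, text_tokens, instance_mask in loader:
            seg_logits, target_text_feats = model(image, text_tokens)
            loss = model.training_loss(seg_logits, target_text_feats,
                                       instance_mask)
            optimizer.zero_grad()
            loss.backward()  # iteratively optimize the structural parameters
            optimizer.step()
    return model
```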
According to the reference image segmentation model training method provided by the embodiment of the invention, firstly, text descriptions corresponding to target instances in an image sample are collected, the text descriptions are input to a text encoder in an initial reference image segmentation model, and the text encoder performs feature extraction on the text descriptions to obtain initial text features; then the image sample and the initial text features are input to an image encoder in the initial reference image segmentation model, the image encoder extracts initial image features of the image sample, the initial text features are located and optimized with the image context features by adopting a cross attention mechanism to obtain text optimization features, the text optimization features and the initial image features are fused to achieve the aim of reversely optimizing the initial image features to obtain fused image features, and the text optimization features and the fused image features are iteratively optimized and fused to obtain target text features and target cross-modal image fusion features; finally, the target cross-modal image fusion features are input to a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, training loss is calculated based on the target text features, target instance labels in the image sample and the segmentation result, and model training is performed with the training loss. The method can effectively improve the segmentation capability of the trained target reference image segmentation model for low-quality text descriptions, and reduce the occurrence of mismatching of confusable targets.
On the basis of the above embodiment, the training method for the reference image segmentation model provided in the embodiment of the present invention calculates training loss based on the target text feature, the target instance tag in the image sample, and the segmentation result, including:
calculating a segmentation loss based on a target instance tag in the image sample and the segmentation result;
based on target text features corresponding to different target examples in the image sample, calculating consistency loss corresponding to each target example and contrast loss corresponding to each target text feature;
and calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature.
Specifically, when calculating the training loss, the segmentation loss may be calculated first by using the target instance labels in the image sample and the segmentation result. The segmentation loss may be calculated using the cross entropy loss function L_ce.
Then, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature can be calculated by using the target text features corresponding to different target instances in each image sample. Wherein the consistency loss is used to ensure a smaller differentiation between target text features of the same target instance and the contrast loss is used to ensure a larger differentiation between target text features of different target instances.
For each target instance O_i in an image sample, the corresponding target text features may include m_i (m_i ≥ 1) entries, which together form a target text feature tuple G_i, where L_ia denotes the a-th (1 ≤ a ≤ m_i) target text feature corresponding to target instance O_i.
The consistency loss corresponding to target instance O_i is calculated from the pairwise differences between each feature L_ia and every other feature L_ib (b ≠ a) of the same target instance, averaged over the total number of such pairs.
If the target instance has only one text description, i.e. m_i = 1, then the consistency loss corresponding to that instance is 0.
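The formula itself appears only as an image in the original publication; a reconstruction consistent with the surrounding description, in which the pairwise distance term d(·,·) is an assumption, is:

$$\mathcal{L}_{consist}^{i} = \frac{1}{m_i\,(m_i-1)} \sum_{a=1}^{m_i} \sum_{b \neq a} d\left(L_{ia},\, L_{ib}\right), \qquad \mathcal{L}_{consist}^{i} = 0 \ \text{when } m_i = 1,$$

where m_i·(m_i − 1) is the total number of (a, b) pairs over which the average is taken.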
For different target instances O_a, O_b ∈ {O_1, O_2, …, O_n} in an image sample I, the corresponding target text feature tuples G_a and G_b have lengths m_a and m_b respectively. Supervising the target text features with the context information of the image features means that target text features corresponding to the same target instance should have a smaller degree of differentiation, while target text features corresponding to different target instances should have a larger degree of differentiation.
For a target text feature L_ai ∈ G_a in image sample I, every other feature L_aj ∈ G_a (i ≠ j) forms a positive paired sample with it, and every target text feature L_bj ∈ G_b forms a negative paired sample with it. By means of the positive paired samples and the negative paired samples, the contrast loss corresponding to each target text feature can be calculated.
Further, training loss can be calculated using the segmentation loss, the consistency loss for each target instance, and the contrast loss for each target text feature.
In the embodiment of the invention, the training loss includes the consistency loss and the contrast loss, so that a discriminative constraint can be imposed on the target text features describing different target instances, in order to uncover the constraints implicitly existing between the target instances in the image sample and the text descriptions.
On the basis of the above embodiment, the training method for the reference image segmentation model provided in the embodiment of the present invention calculates a contrast loss corresponding to each target text feature based on the target text features corresponding to different target instances in the image sample, including:
based on the target text features corresponding to the same target instance in the image sample, calculating positive similarity of the target text features corresponding to the same target instance;
calculating the negative similarity of the target text features corresponding to different target examples based on the target text features corresponding to different target examples in the image sample;
and calculating the contrast loss corresponding to each target text feature based on the positive similarity and the negative similarity.
Specifically, when calculating the contrast loss corresponding to each target text feature, the positive similarity S_pos may first be calculated from the target text features corresponding to the same target instance in the image sample; when m_a = 1, S_pos = 1.
The negative similarity S_neg of the target text features corresponding to different target instances is then calculated from the target text features corresponding to different target instances in the image sample.
Finally, the contrast loss corresponding to each target text feature is calculated from the positive similarity and the negative similarity.
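Since these formulas are likewise reproduced only as images in the publication, the sketch below shows one plausible realization under two explicit assumptions: the similarities are pairwise cosine similarities averaged over the positive and negative pairs, and the loss takes an InfoNCE-style log-ratio form with a temperature τ:

```python
# Hedged sketch of the contrast loss for one target text feature; the
# cosine-similarity averaging, the log-ratio form and the temperature
# are assumptions, since the original formulas are not reproduced in text.
import torch
import torch.nn.functional as F

def contrast_loss(G_a, G_b, i, tau=0.1):
    L_ai = G_a[i]
    pos = [F.cosine_similarity(L_ai, L_aj, dim=0)
           for j, L_aj in enumerate(G_a) if j != i]   # positive pairs
    # S_pos = 1 when the instance has a single description (m_a = 1)
    S_pos = torch.stack(pos).mean() if pos else torch.tensor(1.0)
    neg = [F.cosine_similarity(L_ai, L_bj, dim=0) for L_bj in G_b]
    S_neg = torch.stack(neg).mean()                    # negative pairs
    return -torch.log(torch.exp(S_pos / tau)
                      / (torch.exp(S_pos / tau) + torch.exp(S_neg / tau)))
```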
On the basis of the above embodiment, the training method for the reference image segmentation model provided in the embodiment of the present invention calculates the training loss based on the segmentation loss, the consistency loss corresponding to each target instance, and the contrast loss corresponding to each target text feature, including:
calculating the total consistency loss corresponding to each image sample based on the consistency loss corresponding to each target instance, and calculating the total contrast loss corresponding to each image sample based on the contrast loss corresponding to each target text feature;
And calculating a first weighted summation result of the total consistency loss and the total contrast loss, calculating a second weighted summation result of the first weighted summation result and the segmentation loss, and taking the second weighted summation result as the training loss.
Specifically, when calculating the training loss, the consistency losses corresponding to the target instances may first be used to calculate the total consistency loss corresponding to each image sample: the total consistency loss L_consist corresponding to an image sample is obtained by averaging the consistency losses of all the target instances contained in that image sample. The total contrast loss corresponding to each image sample is calculated from the contrast losses corresponding to the target text features: the total contrast loss L_contra corresponding to an image sample is obtained by averaging over all the target text features corresponding to that image sample. When there is only one target instance in an image sample, the total contrast loss corresponding to that image sample is L_contra = 0.
Thereafter, the training loss can be calculated by the following formula:
L = L_ce + α·(L_contra + β·L_consist);
where L_contra + β·L_consist is the first weighted summation result, and α and β are weight values.
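In code form, this double weighted summation is a direct transcription of the formula above; the default values of the weights α and β below are placeholders, since the text does not specify them:

```python
# Direct transcription of L = L_ce + alpha * (L_contra + beta * L_consist);
# the default weight values are illustrative placeholders.
def training_loss(L_ce, L_contra, L_consist, alpha=0.1, beta=1.0):
    first = L_contra + beta * L_consist   # first weighted summation result
    return L_ce + alpha * first           # second weighted summation result
```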
In the embodiment of the invention, this double weighted fusion makes the obtained training loss more beneficial to model training.
On the basis of the above embodiment, the image encoder includes sequentially connected multi-layer structures, each layer structure including a first input, a first output, a second input and a second output, the first output of the former layer structure being the first input of the latter layer structure, the second output of the former layer structure being the second input of the latter layer structure;
the first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature;
each layer of structure comprises a coding block, a text perception fusion module, a first residual module, a second residual module, a first addition module and a second addition module, wherein the second input of each layer of structure passes through the coding block to obtain candidate image features; the first input of each layer of structure and the candidate image features obtained by the current layer of structure pass through the text perception fusion module, which optimizes the first input of the current layer of structure by adopting a cross attention mechanism to obtain text optimization features, and fuses the text optimization features with the candidate image features by adopting a cross attention mechanism to obtain fused image features;
The text optimization features pass through the first residual module and are then added to the first input of the current layer structure by the first addition module to obtain the first output of the current layer structure;
and the fused image features pass through the second residual module and are then added to the candidate image features by the second addition module to obtain the second output of the current layer structure.
Specifically, the image encoder includes a plurality of sequentially connected layer structures N; for example, it may include a four-layer structure, i.e. 4×N. Only a schematic diagram of one layer structure is shown in fig. 2. Each layer structure i of the image encoder includes a first input L_i, a first output L_io, a second input V_ii and a second output F_io.
The first output of the previous layer structure serves as the first input of the subsequent layer structure in the image encoder, and the second output of the previous layer structure serves as the second input of the subsequent layer structure. The first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature.
Each layer structure includes a coding block (a Swin block), a Text-Aware Fusion (TAF) module, a first residual module, a second residual module, a first addition module and a second addition module. In fig. 2, the "+" connected to the first residual module indicates the first addition module, and the "+" connected to the second residual module indicates the second addition module.
The second input V_ii of each layer structure passes through the Swin coding block to obtain the candidate image feature V_i. The first input L_i of each layer structure and the candidate image feature V_i obtained by the current layer structure then pass through the TAF module to obtain the text optimization feature and the fused image feature, respectively.
After passing through the first residual module, the text optimization feature is added to the first input L_i of the current layer structure by the first addition module to obtain the first output L_io of the current layer structure.
After passing through the second residual module, the fused image feature is added to the candidate image feature V_i by the second addition module to obtain the second output F_io of the current layer structure.
As shown in fig. 3, which shows a schematic structure of the TAF module, the inputs of the TAF module are the candidate image feature V_i of shape (B, HW, C_i) and the first input L_i of the current layer structure of shape (B, C_l, T), where B is the number of triplets formed by the image sample, each target instance and the text description corresponding to each target instance, and may be greater than or equal to 1; H is the height and W the width of the candidate image feature; C_i is the number of channels of the candidate image feature; C_l is the channel dimension of the text feature; and T is the maximum length of the text feature, i.e. the maximum number of words contained in the text description.
V_i (B, HW, C_i) is transposed to V1 (B, C_i, HW) and passed through two separate 1×1 convolution layers to obtain V2 (B, C_l, HW) and V3 (B, HW, C_l).
L_i (B, C_l, T) is passed through a 1×1 convolution layer and a 1×1 convolution layer w_l, respectively, to obtain L1 (B, T, C_l) and L2 (B, C_l, T).
L1 (B, T, C_l) is matrix-multiplied with V2 (B, C_l, HW) and normalized by the exponential function (softmax); the result is matrix-multiplied with V3 (B, HW, C_l) to obtain L_CA. L_CA is passed through a 1×1 convolution layer w_cl to obtain L3 (B, C_l, T). L2 (B, C_l, T) and L3 (B, C_l, T) are combined by element-wise multiplication and passed through a 1×1 convolution layer w_rl to obtain the text optimization feature of shape (B, C_l, T).
V_i (B, HW, C_i) is passed through a 1×1 convolution layer and a 1×1 convolution layer w_i, respectively, to obtain V4 and V5, both of shape (B, HW, C_i). The text optimization feature (B, C_l, T) is passed through two separate 1×1 convolution layers to obtain L4 (B, C_i, T) and L5 (B, T, C_i).
V4 is matrix-multiplied with L4 and normalized by the exponential function; the result is matrix-multiplied with L5 (B, T, C_i) to obtain V_CA (B, HW, C_i). V_CA is passed through a 1×1 convolution layer w_ci to obtain L6; L6 and V5 are combined by element-wise multiplication and passed through a 1×1 convolution layer w_fi to obtain the fused image feature of shape (B, HW, C_i).
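Putting the two branches together, a sketch of the TAF module under the shapes above could look as follows; nn.Linear over the channel dimension stands in for the 1×1 convolution layers (equivalent for this tensor layout), and the layer names are illustrative:

```python
# Hedged sketch of the Text-Aware Fusion (TAF) module following the
# shapes described above; nn.Linear replaces the 1x1 convolutions.
import torch
import torch.nn as nn

class TAF(nn.Module):
    def __init__(self, c_i: int, c_l: int):
        super().__init__()
        # text-optimization branch
        self.w_v2 = nn.Linear(c_i, c_l)   # V_i -> V2
        self.w_v3 = nn.Linear(c_i, c_l)   # V_i -> V3
        self.w_l1 = nn.Linear(c_l, c_l)   # L_i -> L1
        self.w_l  = nn.Linear(c_l, c_l)   # L_i -> L2
        self.w_cl = nn.Linear(c_l, c_l)   # L_CA -> L3
        self.w_rl = nn.Linear(c_l, c_l)   # -> text optimization feature
        # image-fusion branch
        self.w_v4 = nn.Linear(c_i, c_i)   # V_i -> V4
        self.w_i  = nn.Linear(c_i, c_i)   # V_i -> V5
        self.w_l4 = nn.Linear(c_l, c_i)   # text -> L4
        self.w_l5 = nn.Linear(c_l, c_i)   # text -> L5
        self.w_ci = nn.Linear(c_i, c_i)   # V_CA -> L6
        self.w_fi = nn.Linear(c_i, c_i)   # -> fused image feature

    def forward(self, v, l):
        # v: (B, HW, C_i) candidate image feature; l: (B, C_l, T) text input
        v2 = self.w_v2(v).transpose(1, 2)           # (B, C_l, HW)
        v3 = self.w_v3(v)                           # (B, HW, C_l)
        l1 = self.w_l1(l.transpose(1, 2))           # (B, T, C_l)
        l2 = self.w_l(l.transpose(1, 2))            # (B, T, C_l)
        attn_t = torch.softmax(l1 @ v2, dim=-1)     # (B, T, HW)
        l3 = self.w_cl(attn_t @ v3)                 # L_CA -> L3: (B, T, C_l)
        l_opt = self.w_rl(l2 * l3).transpose(1, 2)  # (B, C_l, T)

        v4 = self.w_v4(v)                           # (B, HW, C_i)
        v5 = self.w_i(v)                            # (B, HW, C_i)
        l4 = self.w_l4(l_opt.transpose(1, 2)).transpose(1, 2)  # (B, C_i, T)
        l5 = self.w_l5(l_opt.transpose(1, 2))       # (B, T, C_i)
        attn_v = torch.softmax(v4 @ l4, dim=-1)     # (B, HW, T)
        v_ca = attn_v @ l5                          # (B, HW, C_i)
        f_fused = self.w_fi(self.w_ci(v_ca) * v5)   # (B, HW, C_i)
        return l_opt, f_fused

taf = TAF(c_i=256, c_l=512)
l_opt, f_fused = taf(torch.randn(2, 64, 256), torch.randn(2, 512, 20))
```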
In the embodiment of the invention, the text perception fusion module, the residual modules and the addition modules are introduced into the image encoder, so that the text optimization features and the fused image features can be obtained, which ensures the segmentation capability of the target reference image segmentation model for low-quality text descriptions.
On the basis of the above embodiment, in the reference image segmentation model training method provided in the embodiment of the present invention, after collecting each target instance in the image sample and the text description corresponding to each target instance, the method includes:
And constructing the image sample, each target instance and a text description triplet corresponding to each target instance, and taking each triplet as a training sample of the initial reference image segmentation model.
Specifically, as shown in fig. 4, each acquired image sample may include a plurality of target instances, each of which may have a plurality of textual descriptions. Thus, to facilitate training of the initial reference image segmentation model, after the collection of each target instance in the image sample and the corresponding textual description of each target instance, triplets may be constructed from one image sample, one target instance, and one textual description, with each triplet being taken as a training sample of the initial reference image segmentation model.
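A sketch of this triplet construction, with illustrative field names, is shown below; each combination of one image, one instance and one description becomes one training sample:

```python
# Sketch of the triplet construction: one training sample per
# (image sample, target instance, text description) combination.
# The field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Triplet:
    image_path: str       # the image sample
    instance_id: int      # the target instance (its mask label)
    description: str      # one text description of that instance

def build_triplets(samples):
    """samples: iterable of (image_path, {instance_id: [descriptions]})."""
    triplets = []
    for image_path, instances in samples:
        for instance_id, descriptions in instances.items():
            for description in descriptions:
                triplets.append(Triplet(image_path, instance_id, description))
    return triplets
```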
As shown in fig. 5, on the basis of the above embodiment, the embodiment of the present invention further provides a reference image segmentation method, which includes:
s21, acquiring an image to be segmented and description information corresponding to a target object in the image to be segmented;
s22, inputting the image to be segmented and the description information into a target reference image segmentation model determined by the reference image segmentation model training method provided in the above embodiments, so as to obtain a segmentation result corresponding to the image to be segmented, which is output by the target reference image segmentation model.
Specifically, in the reference image segmentation method provided in the embodiment of the present invention, the execution subject is a reference image segmentation device, and the device may be configured in a computer, where the computer may be a local computer or a cloud computer, and the local computer may be a computer, a tablet, or the like, and is not limited herein specifically.
Step S21 is first executed to obtain the image to be segmented and the description information corresponding to the target object in the image to be segmented. The image to be segmented is an image in which the category of the target object needs to be determined, and it may include one or more target objects, for example people, animals, trees, buildings or roads. The description information corresponding to each target object may include one or more pieces, which is not specifically limited herein.
Then, step S22 is executed, in which the image to be segmented and the description information are input into the target reference image segmentation model determined by the reference image segmentation model training method provided in the above embodiments, so as to obtain a segmentation result corresponding to the image to be segmented output by the target reference image segmentation model. The segmentation result is a pixel-by-pixel segmentation result, which is used for representing the category of each target object contained in the image to be segmented.
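At inference time, usage reduces to a single forward pass; the following sketch assumes the model interface from the training sketch above and a 0.5 threshold on the predicted mask, both of which are illustrative assumptions:

```python
# Hedged usage sketch: `model` is a trained target reference image
# segmentation model with the forward signature assumed above.
import torch

@torch.no_grad()
def segment(model, image, description, tokenizer):
    tokens = tokenizer([description], return_tensors="pt")
    seg_logits, _ = model(image.unsqueeze(0), tokens)
    # pixel-wise segmentation result for the described target object
    return (seg_logits.sigmoid() > 0.5).squeeze(0)
```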
According to the reference image segmentation method provided by the embodiment of the invention, due to the adoption of the target reference image segmentation model, the segmentation result is more accurate and reliable, and the occurrence of wrong matching of the confusion target is reduced.
As shown in fig. 6, on the basis of the above embodiment, an apparatus for training a reference image segmentation model is provided in an embodiment of the present invention, including:
the text coding module 61 is configured to collect each target instance in the image sample and text descriptions corresponding to each target instance, input the text descriptions to a text encoder in an initial reference image segmentation model, and perform feature extraction on the text descriptions by the text encoder to obtain initial text features;
the text image fusion module 62 is configured to input the image sample and the initial text feature to an image encoder in the initial reference image segmentation model, extract, by the image encoder, the initial image feature of the image sample, optimize, based on the initial image feature, the initial text feature by using a cross attention mechanism to obtain a text optimized feature, fuse the text feature optimization with the initial image feature to obtain a fused image feature, and iteratively optimize and fuse the text optimized feature and the fused image feature to obtain a target text feature and a target cross-modal image fusion feature;
The model training module 63 is configured to input the target cross-modal image fusion feature to a decoder in the initial reference image segmentation model, obtain a segmentation result output by the decoder, calculate a training loss based on the target text feature, a target instance tag in the image sample, and the segmentation result, and iteratively optimize structural parameters of the initial reference image segmentation model based on the training loss, so as to obtain a target reference image segmentation model.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the model training module is specifically configured to:
calculating a segmentation loss based on a target instance tag in the image sample and the segmentation result;
based on target text features corresponding to different target examples in the image sample, calculating consistency loss corresponding to each target example and contrast loss corresponding to each target text feature;
and calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the model training module is further specifically configured to:
calculating the total consistency loss corresponding to each image sample based on the consistency loss corresponding to each target instance, and calculating the total contrast loss corresponding to each image sample based on the contrast loss corresponding to each target text feature;
and calculating a first weighted summation result of the total consistency loss and the total contrast loss, calculating a second weighted summation result of the first weighted summation result and the segmentation loss, and taking the second weighted summation result as the training loss.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the model training module is further specifically configured to:
based on the target text features corresponding to the same target instance in the image sample, calculating positive similarity of the target text features corresponding to the same target instance;
calculating the negative similarity of the target text features corresponding to different target examples based on the target text features corresponding to different target examples in the image sample;
and calculating the contrast loss corresponding to each target text feature based on the positive similarity and the negative similarity.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the image encoder includes sequentially connected multi-layer structures, each layer structure including a first input, a first output, a second input and a second output, the first output of the former layer structure serving as the first input of the latter layer structure, and the second output of the former layer structure serving as the second input of the latter layer structure;
The first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature;
each layer of structure comprises a coding block, a text perception fusion module, a first residual module, a second residual module, a first addition module and a second addition module, wherein the second input of each layer of structure passes through the coding block to obtain candidate image features; the first input of each layer of structure and the candidate image features obtained by the current layer of structure pass through the text perception fusion module, which optimizes the first input of the current layer of structure by adopting a cross attention mechanism to obtain text optimization features, and fuses the text optimization features with the candidate image features by adopting a cross attention mechanism to obtain fused image features;
The text optimization features pass through the first residual module and are then added to the first input of the current layer structure by the first addition module to obtain the first output of the current layer structure;
and the fused image features pass through the second residual module and are then added to the candidate image features by the second addition module to obtain the second output of the current layer structure.
On the basis of the above embodiment, in the reference image segmentation model training device provided in the embodiment of the present invention, the text encoder is a pre-trained language model.
On the basis of the above embodiment, the reference image segmentation model training device provided in the embodiment of the present invention further includes a triplet construction module, configured to:
and constructing the image sample, each target instance and a text description triplet corresponding to each target instance, and taking each triplet as a training sample of the initial reference image segmentation model.
Specifically, the functions of each module in the reference image segmentation model training device provided in the embodiment of the present invention correspond one-to-one with the operation flow of each step in the above method embodiment, and the achieved effects are consistent.
As shown in Fig. 7, on the basis of the above embodiments, an embodiment of the present invention provides a reference image segmentation apparatus, including:
an obtaining module 71, configured to obtain an image to be segmented and description information corresponding to a target object in the image to be segmented;
a reference image segmentation module 72, configured to input the image to be segmented and the description information into a target reference image segmentation model determined by the reference image segmentation model training method provided in the above embodiments, so as to obtain the segmentation result corresponding to the image to be segmented that is output by the target reference image segmentation model.
Specifically, the functions of the modules in the reference image segmentation apparatus provided in the embodiment of the present invention correspond one-to-one with the steps of the above method embodiment, and achieve the same effects; a minimal inference sketch follows.
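The inference flow of the apparatus reduces to a single forward pass. The sketch below assumes a trained PyTorch model whose call interface takes the preprocessed image and the tokenized description; the function and argument names are illustrative:

    import torch

    def segment_by_reference(model, image, description):
        # image: preprocessed image tensor; description: tokenized text of the
        # target object; returns the segmentation mask predicted by the model.
        model.eval()
        with torch.no_grad():
            mask = model(image, description)
        return mask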
Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, where the processor 810, the communication interface 820 and the memory 830 communicate with each other through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the reference image segmentation model training method or the reference image segmentation method provided in the embodiments described above.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program, the computer program can be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer can perform the reference image segmentation model training method or the reference image segmentation method provided in the foregoing embodiments.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the reference image segmentation model training method or the reference image segmentation method provided in the above embodiments.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some technical features thereof can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A reference image segmentation model training method, comprising:
collecting each target instance in an image sample and the text description corresponding to each target instance, inputting the text description into a text encoder in an initial reference image segmentation model, and extracting, by the text encoder, features of the text description to obtain initial text features;
inputting the image sample and the initial text feature into an image encoder in the initial reference image segmentation model, extracting, by the image encoder, the initial image feature of the image sample, optimizing the initial text feature by means of a cross-attention mechanism based on the initial image feature to obtain a text optimization feature, fusing the text optimization feature and the initial image feature to obtain a fused image feature, and performing iterative optimization and fusion on the text optimization feature and the fused image feature to obtain a target text feature and a target cross-modal image fusion feature;
inputting the target cross-modal image fusion feature into a decoder in the initial reference image segmentation model to obtain a segmentation result output by the decoder, calculating a training loss based on the target text feature, a target instance label in the image sample and the segmentation result, and performing iterative optimization on structural parameters of the initial reference image segmentation model based on the training loss to obtain a target reference image segmentation model.
2. The reference image segmentation model training method of claim 1, wherein calculating the training loss based on the target text feature, the target instance label in the image sample and the segmentation result comprises:
calculating a segmentation loss based on the target instance label in the image sample and the segmentation result;
calculating, based on the target text features corresponding to different target instances in the image sample, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature;
and calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature.
3. The reference image segmentation model training method of claim 2, wherein calculating the training loss based on the segmentation loss, the consistency loss corresponding to each target instance and the contrast loss corresponding to each target text feature comprises:
calculating the total consistency loss corresponding to each image sample based on the consistency loss corresponding to each target instance, and calculating the total contrast loss corresponding to each image sample based on the contrast loss corresponding to each target text feature;
and calculating a first weighted summation result of the total consistency loss and the total contrast loss, calculating a second weighted summation result of the first weighted summation result and the segmentation loss, and taking the second weighted summation result as the training loss.
4. The reference image segmentation model training method of claim 2, wherein calculating the contrast loss corresponding to each target text feature based on the target text features corresponding to different target instances in the image sample comprises:
calculating the positive similarity between the target text features corresponding to the same target instance in the image sample;
calculating the negative similarity between the target text features corresponding to different target instances in the image sample;
and calculating the contrast loss corresponding to each target text feature based on the positive similarity and the negative similarity.
5. The reference image segmentation model training method of claim 1, wherein the image encoder comprises a plurality of sequentially connected layer structures, each layer structure having a first input, a first output, a second input and a second output, the first output of a previous layer structure serving as the first input of the next layer structure, and the second output of the previous layer structure serving as the second input of the next layer structure;
the first input of the first layer structure is the initial text feature, the second input of the first layer structure is the initial image feature, the first output of the last layer structure is the target text feature, and the second output of the last layer structure is the target cross-modal image fusion feature;
each layer structure comprises a coding block, a text-aware fusion module, a first residual module, a second residual module, a first addition module and a second addition module; the second input of each layer structure passes through the coding block to obtain candidate image features; the text-aware fusion module receives the first input of the current layer structure and the candidate image features obtained by the current layer structure, optimizes the first input of the current layer structure by means of a cross-attention mechanism to obtain text optimization features, and fuses the text optimization features with the candidate image features by means of a cross-attention mechanism to obtain fused image features;
the text optimization features pass through the first residual module and are then added to the first input of the current layer structure by the first addition module to obtain the first output of the current layer structure;
and the fused image features pass through the second residual module and are then added to the candidate image features by the second addition module to obtain the second output of the current layer structure.
6. The reference image segmentation model training method according to any one of claims 1-5, wherein the text encoder is a pre-trained language model.
7. The reference image segmentation model training method according to any one of claims 1-5, wherein collecting each target instance in the image sample and the text description corresponding to each target instance comprises:
constructing triplets of the image sample, each target instance and the text description corresponding to each target instance, and taking each triplet as a training sample of the initial reference image segmentation model.
8. A reference image segmentation method, comprising:
acquiring an image to be segmented and description information corresponding to a target object in the image to be segmented;
inputting the image to be segmented and the description information into a target reference image segmentation model determined by the reference image segmentation model training method according to any one of claims 1-7, to obtain a segmentation result corresponding to the image to be segmented output by the target reference image segmentation model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the reference image segmentation model training method of any one of claims 1-7 or the reference image segmentation method of claim 8.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the reference image segmentation model training method according to any one of claims 1-7, or the reference image segmentation method according to claim 8.
CN202310877057.9A 2023-07-17 2023-07-17 Reference image segmentation model training method and reference image segmentation method Active CN116993976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310877057.9A CN116993976B (en) 2023-07-17 2023-07-17 Reference image segmentation model training method and reference image segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310877057.9A CN116993976B (en) 2023-07-17 2023-07-17 Reference image segmentation model training method and reference image segmentation method

Publications (2)

Publication Number Publication Date
CN116993976A true CN116993976A (en) 2023-11-03
CN116993976B CN116993976B (en) 2024-06-14

Family

ID=88522374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310877057.9A Active CN116993976B (en) 2023-07-17 2023-07-17 Reference image segmentation model training method and reference image segmentation method

Country Status (1)

Country Link
CN (1) CN116993976B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210045716A1 (en) * 2019-08-13 2021-02-18 GE Precision Healthcare LLC Method and system for providing interaction with a visual artificial intelligence ultrasound image segmentation module
WO2021146951A1 (en) * 2020-01-21 2021-07-29 京东方科技集团股份有限公司 Text detection method and apparatus, and storage medium
CN111832570A (en) * 2020-07-02 2020-10-27 北京工业大学 Image semantic segmentation model training method and system
CN113792113A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language model obtaining and task processing method, device, equipment and medium
WO2022205657A1 (en) * 2021-04-02 2022-10-06 中国科学院深圳先进技术研究院 Csm image segmentation method and apparatus, terminal device, and storage medium
US20220391755A1 (en) * 2021-05-26 2022-12-08 Salesforce.Com, Inc. Systems and methods for vision-and-language representation learning
CN115018805A (en) * 2022-06-21 2022-09-06 推想医疗科技股份有限公司 Segmentation model training method, image segmentation method, device, equipment and medium
CN115757692A (en) * 2022-10-20 2023-03-07 华为技术有限公司 Data processing method and device
CN115393854A (en) * 2022-10-27 2022-11-25 粤港澳大湾区数字经济研究院(福田) Visual alignment processing method, terminal and storage medium
CN115761757A (en) * 2022-11-04 2023-03-07 福州大学 Multi-mode text page classification method based on decoupling feature guidance
CN116229584A (en) * 2022-12-31 2023-06-06 重庆傲雄在线信息技术有限公司 Text segmentation recognition method, system, equipment and medium in artificial intelligence field

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHAO YANG et al.: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2022, 27 September 2022, pages 18134-18144 *
ZHICHAO WEI et al.: "Linguistic Query-Guided Mask Generation for Referring Image Segmentation", arXiv, vol. 2023, 22 March 2023, pages 3-4 *
SU Yu: "Research on Image-Text Cross-Modal Retrieval Methods Based on Multimodal Pre-training" (基于多模态预训练的图文跨模态检索方法研究), Wanfang Data, vol. 2022, 14 November 2022 *

Also Published As

Publication number Publication date
CN116993976B (en) 2024-06-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant