CN115761222A - Image segmentation method, remote sensing image segmentation method and device


Info

Publication number: CN115761222A
Application number: CN202211182413.7A
Authority: CN (China)
Prior art keywords: image, vector, feature, text, feature vector
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Other versions: CN115761222B
Inventors: 于超辉, 周强, 王志斌, 王帆
Current assignee: Alibaba China Co Ltd
Original assignee: Alibaba China Co Ltd
Events: application filed by Alibaba China Co Ltd; priority to CN202211182413.7A; publication of CN115761222A; application granted; publication of CN115761222B

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of this specification provide an image segmentation method, a remote sensing image segmentation method, and corresponding apparatus. The image segmentation method comprises the following steps: acquiring an image to be segmented; performing feature extraction on the image to be segmented to obtain a global image feature vector and a local image feature vector; constructing a prompt text vector from a random text vector, the global image feature vector and preset category labels; performing feature extraction on the prompt text vector to obtain a text feature vector; and determining the segmentation result of the image to be segmented through feature compilation from the local image feature vector and the text feature vector. For a single image to be segmented, deep features are fully mined, so that the segmentation result better matches the way people use images, downstream reprocessing of the segmentation result produces good results, and both the accuracy of the segmentation result and the user experience are improved.

Description

Image segmentation method, remote sensing image segmentation method and device
Technical Field
The embodiments of this specification relate to the field of image processing technology, and in particular to an image segmentation method and a remote sensing image segmentation method.
Background
With the development of computer technology, artificial intelligence has been widely applied in the field of image processing. Image segmentation is a technique that divides an image to be segmented into several image regions of different types according to given segmentation conditions, and applying machine learning has greatly improved both the effect and the efficiency of segmentation.
At present, image segmentation is mainly realized with machine learning: based on the image features of an image, a neural network model is pre-trained on training samples in a supervised or unsupervised manner, and the trained model is then used to segment the image to be segmented.
However, such a neural network model is pre-trained only on the image features of the image; other features are not fully utilized, deep features of the image cannot be mined from the segmented image regions, and downstream reprocessing of the segmentation result cannot fully draw on the way people use images to produce good processing results. The segmentation results are therefore not accurate enough and the user experience is lacking, so a more accurate image segmentation method with better user experience is needed.
Disclosure of Invention
In view of this, embodiments of the present specification provide an image segmentation method. One or more embodiments of the present disclosure also relate to a remote sensing image segmentation method, an image segmentation apparatus, a remote sensing image segmentation apparatus, a computing device, a computer-readable storage medium, and a computer program, so as to solve the technical drawbacks of the prior art.
According to a first aspect of embodiments herein, there is provided an image segmentation method including:
acquiring an image to be segmented;
performing feature extraction on an image to be segmented to obtain a global image feature vector and a local image feature vector;
constructing a prompt text vector according to the random text vector, the global image feature vector and a preset category label;
extracting the characteristics of the prompt text vector to obtain a text characteristic vector;
and determining the segmentation result of the image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
According to a second aspect of embodiments of the present specification, there is provided a remote sensing image segmentation method including:
receiving a remote sensing image segmentation instruction input by a user, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmentation object;
carrying out feature extraction on a remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector;
constructing a prompt text vector according to the random text vector, the global image feature vector and the category label;
extracting the characteristics of the prompt text vector to obtain a text characteristic vector;
and determining a segmentation result aiming at the target segmentation object in the remote sensing image to be segmented through feature compilation according to the local image feature vector and the text feature vector.
According to a third aspect of embodiments herein, there is provided an image segmentation apparatus including:
the image segmentation device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is configured to acquire an image to be segmented;
the image segmentation method comprises the steps that a first extraction module is configured to extract features of an image to be segmented to obtain a global image feature vector and a local image feature vector;
the first construction module is configured to construct a prompt text vector according to the random text vector, the global image feature vector and a preset category label;
the second extraction module is configured to perform feature extraction on the prompt text vector to obtain a text feature vector;
and the first segmentation module is configured to determine a segmentation result of the image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
According to a fourth aspect of embodiments herein, there is provided a remote sensing image segmentation apparatus including:
the remote sensing image segmentation method comprises a receiving module, a segmentation module and a segmentation module, wherein the receiving module is configured to receive a remote sensing image segmentation instruction input by a user, and the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmentation object;
the third extraction module is configured to extract features of the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector;
the second construction module is configured to construct a prompt text vector according to the random text vector, the global image feature vector and the category label;
the fourth extraction module is configured to perform feature extraction on the prompt text vector to obtain a text feature vector;
and the second segmentation module is configured to determine a segmentation result aiming at the target segmentation object in the remote sensing image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
According to a fifth aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the image segmentation method or the remote sensing image segmentation method described above.
According to a sixth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the image segmentation method or the remote sensing image segmentation method described above.
According to a seventh aspect of embodiments herein, there is provided a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the image segmentation method or the remote sensing image segmentation method described above.
In one or more embodiments of this specification, an image to be segmented is acquired; feature extraction is performed on it to obtain a global image feature vector and a local image feature vector; a prompt text vector is constructed from a random text vector, the global image feature vector and preset category labels; feature extraction is performed on the prompt text vector to obtain a text feature vector; and the segmentation result of the image to be segmented is determined through feature compilation from the local image feature vector and the text feature vector. Because the prompt text vector is built from the global image feature vector, the random text vector and the preset category labels, the subsequent segmentation uses text features together with image features, and the deep features of each single image to be segmented are fully mined so that the segmentation result better matches the way people use images.
Drawings
Fig. 1 is a flowchart of an image segmentation method provided in an embodiment of the present specification;
Fig. 2 is a flowchart of a remote sensing image segmentation method provided in an embodiment of the present specification;
Fig. 3 is a flowchart of a process in which an image segmentation method is applied to entity identification in remote sensing images, according to an embodiment of the present specification;
Fig. 4 is a system architecture diagram of an image segmentation system provided in an embodiment of the present specification;
Fig. 5A is a schematic diagram of a remote sensing image to be segmented in a remote sensing image segmentation method provided in an embodiment of the present specification;
Fig. 5B is a schematic diagram of the segmentation result for a remote sensing image to be segmented in a remote sensing image segmentation method provided in an embodiment of the present specification;
Fig. 6 is a schematic structural diagram of an image segmentation apparatus provided in an embodiment of the present specification;
Fig. 7 is a schematic structural diagram of a remote sensing image segmentation apparatus provided in an embodiment of the present specification;
Fig. 8 is a block diagram of a computing device provided in an embodiment of the present specification.
Detailed Description
In the following description, numerous specific details are set forth to facilitate a thorough understanding of this specification. However, this specification can be implemented in many other ways than those described here, and those skilled in the art can make similar extensions without departing from its substance; this specification is therefore not limited by the specific embodiments disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish information of the same kind from one another. For example, without departing from the scope of one or more embodiments of this specification, a "first" may also be referred to as a "second", and similarly a "second" as a "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
First, the noun terms referred to in one or more embodiments of the present specification are explained.
ImageNet: a large image database for image segmentation in which image data is tagged with a class label.
CLIP (Contrastive Language-Image Pre-training, introduced in "Learning Transferable Visual Models From Natural Language Supervision"): a segmentation approach that guides image segmentation with text features. Specifically, a neural network model is pre-trained on the correspondence between images and text, enabling deep feature mining of the image to be segmented and accurate image segmentation.
DenseCLIP (Language-Guided Dense Prediction with Context-Aware Prompting): a segmentation method that guides image segmentation with text features and achieves pixel-text level segmentation by densifying the image features.
SIFT (Scale-Invariant Feature Transform): an image feature extraction algorithm that finds extreme points in scale space and extracts their position, scale and rotation invariants as image features.
HOG (Histogram of Oriented Gradients): an image feature extraction algorithm that builds image features by computing and accumulating histograms of gradient orientations over local regions of the image.
ORB (Oriented FAST and Rotated BRIEF): an image feature extraction algorithm that finds a number of key points in an image and computes, for each key point, a feature vector from the pixel-value changes around it.
VGG model (Visual Geometry Group network model): a neural network model characterized by small convolution layers, small pooling layers, greater depth and wider feature maps.
ResNet model: a neural network model with a very deep multi-layer structure and residual processing modules; variants include ResNet-50 and ResNet-101.
ViT model (Vision Transformer model): an image feature extraction model that uses a dedicated vector mapping layer to obtain image feature vectors of fixed dimensionality, which are then fed to a Transformer for image feature extraction.
CNN (Convolutional Neural Network): a multi-convolutional-layer neural network model trained with forward and backward propagation.
One-hot encoding: a text feature extraction scheme that uses an N-bit state register to encode N text states; each state has its own register bit, only one of which is valid at any time, yielding a one-hot text feature vector.
TF-IDF (Term Frequency-Inverse Document Frequency): a text feature extraction algorithm that obtains a text feature vector by counting the term frequency (TF) of each word and weighting it with an inverse document frequency (IDF) factor.
Transformer model: a neural network model based on the attention mechanism that can extract and analyze the semantic features of natural language text and generate target text. Its structure comprises an encoder and a decoder; the encoder contains an embedding layer that encodes the input natural language text into feature vectors, the text context represented by those vectors is analyzed with the attention mechanism, and the target text is produced by the decoder output.
FCN model (Fully Convolutional Network): an image classification model obtained by replacing the fully connected layers of a CNN model with convolutional layers, so that images of arbitrary size can be taken as input.
U-Net model: an image classification model with a fully convolutional encoder-decoder structure, comprising a contracting path and an expanding path: the input image is downsampled from its original resolution along the contracting path and expanded back to the corresponding output resolution along the expanding path. Because U-Net strongly preserves local features, local details of the output image are restored with high fidelity.
FPN model (Feature Pyramid Network model): an image classification model based on multi-scale image features. Image feature vectors at different scales are obtained by upsampling, entity classes in the image are predicted at each scale, and the predictions are aggregated into a final result that classifies the entities in the image. Variants include the Semantic FPN model, which incorporates text features.
Automatic gradient update methods: algorithms that automatically adjust model parameters, including SGD, MBGD, Momentum, Nesterov Momentum, AdaGrad, AdaDelta, RMSprop and Adam.
At present, CLIP pre-trains a neural network model on corresponding text and image samples and then uses it to segment the image to be segmented. However, CLIP cannot make full use of the features of the particular image being segmented, so the segmentation result depends entirely on the training effect of the neural network model. When the training samples are insufficient or the model is insufficiently trained, the segmentation result is not accurate enough, downstream processing of that result cannot match the way people use images well enough to produce good processing results, and the user experience suffers.
DenseCLIP, which builds on CLIP, densifies the image features to obtain pixel-text correspondences on top of the image-text correspondences, improving segmentation accuracy to some extent. However, DenseCLIP uses a prompt text vector of a single uniform form for the subsequent segmentation, so it cannot generate a more targeted prompt text vector from the high-dimensional features of each image, and thus cannot obtain more targeted text features to combine with the image features and raise the segmentation accuracy for a single image. Because the pre-trained neural network model acquires only a static analysis capability, the features of the image to be segmented are still not fully utilized: the segmentation result depends entirely on the training effect, the model's performance is difficult to improve, the accuracy of the segmentation result is insufficient, downstream processing cannot match the way people use images well enough to produce good results, and the user experience is lacking.
In view of the above problems, this specification provides an image segmentation method, and further relates to a remote sensing image segmentation method, an image segmentation apparatus, a remote sensing image segmentation apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to fig. 1, fig. 1 shows a flowchart of an image segmentation method provided in an embodiment of the present specification, which specifically includes the following steps.
Step S102: and acquiring an image to be segmented.
The image to be segmented is a multimedia image containing several entities. It may be a real image acquired by an image acquisition device or a virtual image generated by image generation software, and it may take the form of a picture, a video frame, or the like; neither is limited here.
The image to be segmented may be received from a user or obtained from a local or remote database.
Illustratively, an image to be segmented, Image_1, sent by a user is received.
Acquiring the image to be segmented provides the material basis of features for the subsequent image feature extraction.
Step S104: and performing feature extraction on the image to be segmented to obtain a global image feature vector and a local image feature vector.
The global image feature is an image feature representing a global high-dimensional feature of an image to be segmented and is used for representing features such as color, texture, shape, structure, entity distribution and the like of the image. The global image feature vector is a high-dimensional vector of global image features. The local image features are image features representing local low-dimensional features of the image to be segmented and are used for representing features such as pixels, entity edges and the like of the image. The local image feature vector is a low-dimensional vector of local image features. The local image feature vector can be characterized by a plurality of feature vector graphs for characterizing image features with different dimensions.
Feature extraction is performed on the image to be segmented with an image feature extraction algorithm to obtain the global image feature vector and the local image feature vector. The algorithm may be a non-machine-learning algorithm such as SIFT, HOG or ORB, or a machine learning model such as a VGG, ResNet, CNN or ViT model; a pre-trained image feature extraction model comprises a global feature extraction module and a local feature extraction module.
Exemplarily, feature extraction is performed on the image to be segmented with the SIFT algorithm to obtain a global image feature vector I and a local image feature vector Image Embedding.
Extracting both the global and the local image feature vectors means the method is not limited to local image features: it provides the feature-vector basis for the subsequent image segmentation, and obtaining the global image features lays the foundation for the subsequent construction of the prompt text vector.
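As a concrete illustration, the following is a minimal PyTorch sketch of this step. The backbone, its layer sizes and the split into a pooled global vector and a spatial local feature map are assumptions for illustration only, since the embodiment leaves the extractor open (SIFT, HOG, ORB, or a VGG/ResNet/CNN/ViT model):

```python
# Hedged sketch: a toy convolutional backbone whose output is split into a
# global image feature vector (pooled) and a local feature map. All layer
# sizes are illustrative assumptions, not values taken from the patent.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self, local_dim: int = 512, global_dim: int = 1024):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, local_dim, kernel_size=7, stride=8, padding=3),
            nn.ReLU(),
        )
        self.global_head = nn.Linear(local_dim, global_dim)

    def forward(self, image: torch.Tensor):
        local = self.stem(image)               # local features: (B, C, H/8, W/8)
        pooled = local.mean(dim=(2, 3))        # global average pooling: (B, C)
        global_vec = self.global_head(pooled)  # global image feature vector I
        return global_vec, local               # (I, Image Embedding)

image = torch.randn(1, 3, 512, 512)            # image to be segmented
I, image_embedding = ImageEncoder()(image)
print(I.shape, image_embedding.shape)          # (1, 1024) and (1, 512, 64, 64)
```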
Step S106: and constructing a prompt text vector according to the random text vector, the global image feature vector and the preset category label.
The random text vector is generated from random noise and does not correspond to any entity category; corresponding entity category information can be obtained from it through the subsequent feature extraction so that it comes to represent the corresponding entity category features. Specifically, random noise is obtained and text-vector encoded to generate the random text vector.
The preset category labels are the category label vectors of the sample entities in the sample images, obtained after pre-training a neural network model. For example, if a sample image contains 10 entities (a chair, a table, a desk lamp, and so on), pre-training the neural network model on that sample image identifies the categories of the 10 entities and assigns category labels accordingly. In the subsequent feature compilation, the preset category labels guide the compilation of the local image features, from which the entities in the image to be segmented are then obtained.
The prompt text vector is a text vector in which features of other modalities are added to the text features; in image processing, it uses the multi-modal feature correspondence to specify the processing direction.
The prompt text vector is constructed from the random text vector, the global image feature vector and the preset category labels; specifically, the three are fused into one vector. The fusion may use a fully connected layer of a neural network model, or the vectors may be spliced directly; neither is limited here.
Exemplarily, feature fusion is performed with a fully connected layer of the neural network model on the random text vector V, the global image feature vector I and the preset category label CLS to obtain the prompt text vector Prompt.
Constructing the prompt text vector from the random text vector, the global image feature vector and the preset category labels lets the subsequent segmentation use text features together with image features, and the deep features of each single image to be segmented are fully mined so that every image has its own dynamic prompt text vector, further strengthening the correlation between the text features and the image features.
Step S108: and extracting the characteristics of the prompt text vector to obtain a text characteristic vector.
The text feature vector is a fusion vector containing entity category features, text features and global image features in the image to be segmented.
Feature extraction is performed on the prompt text vector with a text feature extraction algorithm to obtain the text feature vector. The algorithm may be a non-machine-learning algorithm such as one-hot encoding or TF-IDF, or a machine learning model such as the Transformer model and its derivatives.
Illustratively, feature extraction is performed on the prompt text vector Prompt with TF-IDF to obtain the text feature vector Text Embedding.
Extracting features from the prompt text vector yields a richer, more targeted and deeper text feature vector, provides the feature-vector basis for the subsequent feature compilation that determines the segmentation result, and improves the accuracy of that result.
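As one possible machine learning variant of this step, the sketch below runs spliced prompt tokens through a small Transformer encoder; the layer counts, the three-token prompt shape and the mean-pooled readout are illustrative assumptions, not details fixed by the text:

```python
# Hedged sketch: a small Transformer encoder as the text feature extractor.
# Depth, head count and the mean-pooling readout are assumptions.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
prompt = torch.randn(20, 3, 512)     # 20 prompts of 3 tokens each: [V_n ; I' ; CLS_n]
tokens = encoder(prompt)             # contextualized prompt tokens
text_embedding = tokens.mean(dim=1)  # one 512-d text feature vector per category
print(text_embedding.shape)          # (20, 512)
```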
Step S110: and determining the segmentation result of the image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
Feature compilation aligns the features of several feature vectors according to their correlation and then segments the image to be segmented with the aligned feature vectors. Specifically, feature compilation comprises feature alignment and image segmentation. Feature alignment aligns the local image feature vector with the text feature vector, so that the two establish an image-text correspondence on the local image features.
Exemplarily, feature alignment is performed on the local image feature vector Image Embedding and the text feature vector Text Embedding to obtain an aligned feature vector Embedding, and the image to be segmented is segmented according to that feature vector.
In the embodiments of this specification, an image to be segmented is acquired; feature extraction is performed on it to obtain a global image feature vector and a local image feature vector; a prompt text vector is constructed from a random text vector, the global image feature vector and preset category labels; feature extraction is performed on the prompt text vector to obtain a text feature vector; and the segmentation result is determined through feature compilation from the local image feature vector and the text feature vector. Because the prompt text vector is built from the global image feature vector, the random text vector and the preset category labels, the subsequent segmentation uses text features together with image features, and the deep features of each single image to be segmented are fully mined so that the segmentation result better matches the way people use images.
Optionally, step S106 includes the following specific steps:
carrying out dimension mapping on the global image feature vector to obtain a global image feature vector with the same vector dimension as the random text vector;
and splicing the random text vector, the dimensionality mapped global image feature vector and a preset category label to obtain a prompt text vector.
Dimension mapping maps vectors of different dimensions to a uniform dimension for subsequent vector computation. It may use a mapper (Projector), which can be a fully connected layer module of a neural network model, or a preset transpose matrix may be used to map the vectors to a uniform dimension; neither is limited here.
The random text vector, the dimension-mapped global image feature vector and the preset category labels are spliced to obtain the prompt text vector; specifically, the dimension-mapped global image feature vector is spliced to each random text vector and preset category label.
Exemplarily, the random text vectors Vn (V1 to V20) are twenty 512-dimensional vectors with twenty corresponding preset category labels CLSn, and the global image feature vector I is a 1024-dimensional vector. I is dimension-mapped with a transpose matrix T1 into a 512-dimensional global image feature vector I'. Splicing I' to each of the 20 random text vectors and preset category labels yields twenty 512-dimensional prompt text vectors: Prompt1{V1 + I' + CLS1}, Prompt2{V2 + I' + CLS2}, Prompt3{V3 + I' + CLS3} … Prompt20{V20 + I' + CLS20}.
Dimension mapping of the global image feature vector gives it the same vector dimension as the random text vector, and splicing the random text vector, the dimension-mapped global image feature vector and the preset category labels yields the prompt text vector. The dimension mapping guarantees that the subsequent splicing is feasible, and the deep features of each single image to be segmented are fully mined, so that every image has its own dynamic prompt text vector, further strengthening the correlation between the text features and the image features.
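A minimal sketch of this mapping-and-splicing step follows the 20-class, 512/1024-dimension example above. The linear-layer mapper, the randomly initialized V and CLS tensors, and the token-wise stacking (rather than summing) are one possible reading of the Prompt_n{V_n + I' + CLS_n} notation, assumed here for illustration:

```python
# Hedged sketch of prompt construction: map the 1024-d global vector I into
# the 512-d text space, then splice [V_n ; I' ; CLS_n] for each of 20 classes.
import torch
import torch.nn as nn

num_classes, text_dim, global_dim = 20, 512, 1024

projector = nn.Linear(global_dim, text_dim)    # dimension mapper (Projector)
V = torch.randn(num_classes, text_dim)         # random text vectors V1..V20
CLS = torch.randn(num_classes, text_dim)       # preset category label vectors
I = torch.randn(global_dim)                    # global image feature vector

I_prime = projector(I)                                   # I': (512,)
I_expanded = I_prime.unsqueeze(0).expand(num_classes, -1)

# Prompt_n = {V_n + I' + CLS_n}: one three-token prompt per category
prompt = torch.stack([V, I_expanded, CLS], dim=1)        # (20, 3, 512)
print(prompt.shape)
```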
Optionally, before step S110, the following specific steps are further included:
performing cross attention calculation on the text feature vector and the local image feature vector to determine a target text feature vector;
and fine-tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector.
In the cross attention calculation, a pre-trained neural network model yields the vector weight corresponding to each vector; the cross attention mechanism determines the deep feature relationships between the vectors through a weight matrix, so that the result vector obtained by weighting with the corresponding weights represents not only its own features but also the deep features of the related vectors.
Cross attention is computed between the text feature vector and the local image feature vector to determine the target text feature vector; specifically, the cross attention calculation yields the corresponding vector weights, and the target text feature vector is obtained by weighted calculation.
The text feature vector is fine-tuned based on the target text feature vector to obtain the updated text feature vector.
Illustratively, cross attention is computed between the text feature vector Text Embedding and the local image feature vector Image Embedding to obtain the corresponding vector weights ω1 and ω2, and the target text feature vector Target Text Embedding is obtained by weighting with ω1 and ω2. Based on the entity category features of Target Text Embedding, the text feature vector Text Embedding is fine-tuned to obtain the updated text feature vector Text Embedding.
Cross attention between the text feature vector and the local image feature vector determines the target text feature vector, and the text feature vector is fine-tuned on that basis to obtain the updated text feature vector. Through the cross attention mechanism, the text feature vector and the local image feature vector deeply represent each other's features, yielding a text feature vector with richer deep features and, in the subsequent feature compilation, a more accurate segmentation result.
Optionally, performing cross attention calculation on the text feature vector and the local image feature vector to determine a target text feature vector, including the following specific steps:
and performing cross attention calculation on the text feature vector and the local image feature vector by using a preset multilayer structure translation model decoder to determine a target text feature vector.
The multi-layer-structure translation model is a text translation model with a multiple-hidden-layer structure; its decoder is the decoding module for the feature vectors and represents the deeper layers of the vectors. The model may be the Transformer model or one of its derivatives, and is not limited here.
The multi-layer-structure translation model is a multi-hidden-layer neural network model with a cross attention mechanism. Taking the Transformer model as an example, the corresponding QKV calculation (Query, Key, Value weighted fully connected layer calculation) is set up: the text feature vector is set as the query vector and the local image feature vector as the key and value vectors, and the corresponding vector weights ω1 and ω2 are obtained to determine the target text feature vector.
Exemplarily, the text feature vector Text Embedding is set as the query vector Q and the local image feature vector Image Embedding as the key vector K and value vector V; the corresponding vector weights ω1 and ω2 are obtained, and weighted calculation with a Transformer decoder layer yields the target text feature vector Target Text Embedding.
A preset multi-layer-structure translation model decoder thus performs the cross attention calculation between the text feature vector and the local image feature vector to determine the target text feature vector. The text feature vector and the local image feature vector can represent each other's features at a deeper level, producing a text feature vector with richer deep features; the subsequent feature compilation then yields a more accurate segmentation result, and both the compilation efficiency and the image segmentation efficiency improve.
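A hedged PyTorch sketch of this decoder step is shown below; `nn.MultiheadAttention` stands in for the Transformer decoder layer, and the residual fine-tuning weight gamma is an assumption not taken from the text:

```python
# Hedged sketch of the cross-attention fine-tuning: the text features act as
# queries over the flattened local image features; gamma is an assumed
# residual weight for the fine-tuning step.
import torch
import torch.nn as nn

dim, num_classes = 512, 20
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
gamma = 0.1                                          # assumed residual scale

text_embedding = torch.randn(1, num_classes, dim)    # query: (B, 20, 512)
image_embedding = torch.randn(1, dim, 64, 64)        # local features: (B, C, H, W)
kv = image_embedding.flatten(2).transpose(1, 2)      # key/value: (B, H*W, C)

target_text, _ = attn(query=text_embedding, key=kv, value=kv)
updated_text = text_embedding + gamma * target_text  # fine-tuned Text Embedding
print(updated_text.shape)                            # (1, 20, 512)
```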
Optionally, step S110 includes the following specific steps:
carrying out multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector;
and determining a segmentation result of the image to be segmented through feature compiling based on the feature alignment vector and the local image feature vector.
The feature alignment vector is a multi-modal feature comprising a local image feature and a text feature, and the local image feature and the text feature in the feature alignment vector have a spatial correspondence.
Multi-scale features are obtained by upsampling the local image features to different degrees, yielding local image features at different scales; the text features remain the original text features, and feature alignment between the text features and each scale of local image features yields multi-modal feature vectors (feature alignment vectors) at different scales. Specifically, the upsampling follows a preset sampling rule: for example, for an image to be segmented of size 512x512, image feature extraction yields a 16x16 feature map of the local image feature vector, which is then upsampled 3 times, by factors of 2, 4 and 8, giving feature maps of sizes 32x32, 64x64 and 128x128. Theoretically, the smaller the scale, the more accurately the feature map characterizes the local image features.
Multi-scale feature alignment performs finer-grained feature alignment of the local image feature vector according to the text feature vector, for example pixel-text level alignment, so that every pixel corresponds to a text feature. The feature alignment may use a pre-trained neural network model, a preset vector alignment matrix, or a cross multiplication of the feature vectors.
Multi-scale feature alignment is performed on the local image feature vector and the text feature vector to obtain the feature alignment vector.
Exemplarily, the local image feature vector Image Embedding at scale 16x16 is upsampled by factors of 2, 4 and 8 to obtain local image feature vectors {Image Embedding_1, Image Embedding_2, Image Embedding_3} of sizes 32x32, 64x64 and 128x128, and pixel-text level feature alignment is performed on them with the text feature vector to obtain the feature alignment vector Multi-scale Alignment.
And performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector, and determining a segmentation result of the image to be segmented through feature compilation based on the feature alignment vector and the local image feature vector. The accuracy of the feature alignment vector is ensured, so that the segmentation result obtained by subsequent feature compilation is more accurate.
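The upsampling part of this example can be sketched in a few lines; bilinear interpolation is an assumption, as the text does not fix the interpolation mode:

```python
# Hedged sketch of the multi-scale step: upsample the 16x16 local feature map
# by factors of 2, 4 and 8. The bilinear mode is an illustrative assumption.
import torch
import torch.nn.functional as F

image_embedding = torch.randn(1, 512, 16, 16)   # base local feature map
multi_scale = [
    F.interpolate(image_embedding, scale_factor=s, mode="bilinear", align_corners=False)
    for s in (2, 4, 8)
]
print([m.shape[-1] for m in multi_scale])       # [32, 64, 128]
```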
Optionally, performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector, including the following specific steps:
and performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector.
And performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector, wherein the specific mode is to multiply the local image feature vector and a transposed matrix of the text feature vector to obtain the feature alignment vector.
Exemplarily, the local image feature vector Image Embedding is multiplied by the transposed matrix (Text Embedding)ᵀ of the text feature vector Text Embedding to obtain the feature alignment vector Multi-scale Alignment.
And performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector. The feature alignment is rapidly carried out on the local image feature vector and the text feature vector, the feature alignment vector is obtained through calculation, the accuracy of the feature alignment vector is guaranteed, and the accuracy of a segmentation result obtained through subsequent feature compiling is guaranteed.
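In tensor terms, this cross multiplication is a per-pixel dot product between image features and text features; the sketch below shows it for one scale, with all tensor sizes carried over from the running example as assumptions:

```python
# Hedged sketch of the cross multiplication: multiply pixel features by the
# transposed text features to get a per-class pixel-text score map.
import torch

image_embedding = torch.randn(1, 512, 32, 32)       # one scale of local features
text_embedding = torch.randn(1, 20, 512)            # updated text feature vectors

b, c, h, w = image_embedding.shape
pixels = image_embedding.flatten(2).transpose(1, 2) # (B, H*W, C)
score = pixels @ text_embedding.transpose(1, 2)     # (B, H*W, 20) pixel-text alignment
alignment = score.transpose(1, 2).reshape(b, -1, h, w)  # (B, 20, H, W)
print(alignment.shape)
```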
Optionally, determining the segmentation result of the image to be segmented through feature compilation based on the feature alignment vector and the local image feature vector includes the following specific steps:
splicing the feature alignment vector and the local image feature vector in corresponding dimensions to obtain a spliced feature vector;
and performing feature compiling on the spliced feature vector to obtain a segmentation result of the image to be segmented.
The local image feature vector in this embodiment is a multi-scale local image feature vector; it may be the multi-scale upsampled local image feature vector of the previous embodiment or another multi-scale upsampled local image feature vector, which is not limited here.
The feature alignment vector and the local image feature vector are spliced along their corresponding dimensions to obtain the spliced feature vector.
Exemplarily, the local image feature vectors {Image Embedding_1, Image Embedding_2, Image Embedding_3} of sizes 32x32, 64x64 and 128x128 are each spliced along the corresponding dimensions with the feature alignment vector Multi-scale Alignment to obtain the spliced feature vector Concatenate Embedding.
Splicing the feature alignment vector with the local image feature vector along corresponding dimensions and then compiling the spliced feature vector gives the local image features direct access to the richer, deeper and more accurate alignment features, while avoiding the situation in which features over-fitted during the earlier feature alignment keep the local image features from being fully reflected in the feature compilation; the accuracy of the segmentation result is improved accordingly.
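Continuing the running tensor shapes, the splice is a channel-wise concatenation per scale; the snippet below is a sketch for the 32x32 scale only, with assumed sizes:

```python
# Hedged sketch of the splice: concatenate one scale's alignment score map
# with its local feature map along the channel dimension.
import torch

feature_map = torch.randn(1, 512, 32, 32)           # Image Embedding_1 (32x32 scale)
alignment = torch.randn(1, 20, 32, 32)              # Multi-scale Alignment, same scale

concat = torch.cat([feature_map, alignment], dim=1) # Concatenate Embedding: (1, 532, 32, 32)
print(concat.shape)
```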
Optionally, step S104 includes the following specific steps:
inputting an image to be segmented into a pre-trained image encoder, and performing feature extraction on the image to be segmented by using the image encoder to obtain a global image feature vector and a local image feature vector;
correspondingly, step S108 includes the following specific steps:
inputting the prompt text vector into a text encoder, and performing feature extraction on the prompt text vector by using the text encoder to obtain a text feature vector;
correspondingly, step S110 includes the following specific steps:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector;
and inputting the local image feature vector and the feature alignment vector into a pre-trained decoder, and performing feature compilation on the local image feature vector and the feature alignment vector by using the decoder to determine the segmentation result of the image to be segmented.
In the embodiments of the present description, a local image feature vector is a multi-scale local image feature vector and is a corresponding multi-scale feature vector diagram.
The image encoder is a pre-trained image feature extraction model, a neural network model such as a VGG, ResNet, CNN or ViT model, comprising a global image feature extraction module and a local image feature extraction module. The text encoder is a pre-trained text feature extraction model, a neural network model such as the Transformer model or one of its derivatives. The decoder is an image classification model, a neural network model that identifies and classifies the different entities in the image to produce the segmentation result of the image to be segmented; it may be an FCN, U-Net or FPN model.
The image to be segmented is input into the pre-trained image encoder, which performs feature extraction on it to obtain the global image feature vector and the local image feature vector.
The local image feature vector and the feature alignment vector are input into the pre-trained decoder, which performs feature compilation on them to determine the segmentation result of the image to be segmented. The entity types of the local image feature vector may be classified at different granularities, for example at 2x2, 4x4 or 8x8 granularity, or at 1x1 pixel level: for each pixel of the feature maps at the different scales, the corresponding entity type is determined. Specifically, the entity-type confidence of each pixel is calculated, and the one or more entity types with the highest confidence are determined as the entity type of that pixel.
Exemplarily, an image to be segmented (containing 5 entities: entity 1, entity 2, entity 3, entity 4 and entity 5) is input into a pre-trained VGG model; the global image feature extraction module of the VGG model yields the global image feature vector I, and its local image feature extraction module yields the local image feature vector Image Embedding. The prompt text vector Prompt is input into a Transformer model, which performs feature extraction on it to obtain the text feature vector Text Embedding. The local image feature vector Image Embedding and the feature alignment vector Multi-scale Alignment are input into a pre-trained Semantic FPN model, which performs feature compilation on them: the entity types of the local image feature vector are classified at pixel level based on the feature alignment vector, the entity-type confidence of each pixel is determined, and the entity type with the highest confidence is taken as the entity type of that pixel. The classification result (entity 1: table, entity 2: chair, entity 3: desk lamp, entity 4: cat, entity 5: person) gives the segmentation result of the image to be segmented.
Extracting the global and local image feature vectors with a pre-trained image encoder improves their accuracy, and with it the accuracy of the prompt text vector, of the subsequent feature alignment vector and of the image segmentation; extracting the features of the prompt text vector with a pre-trained text encoder improves the accuracy of the text feature vector, of the subsequent feature alignment vector and of the image segmentation; and using a pre-trained decoder improves the accuracy of the image segmentation. Together, the pre-trained image encoder, text encoder and decoder also improve the efficiency of the segmentation.
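The pixel-level classification by highest confidence at the end of this example reduces to a softmax and an argmax over the class channel; a minimal sketch with assumed tensor sizes:

```python
# Hedged sketch: per-pixel entity classification by highest confidence over
# the decoder's class-score channels. Sizes are illustrative assumptions.
import torch

logits = torch.randn(1, 20, 128, 128)     # decoder output: per-pixel class scores
confidence = logits.softmax(dim=1)        # entity-type confidence of each pixel
segmentation = confidence.argmax(dim=1)   # (1, 128, 128) map of entity types
print(segmentation.shape)
```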
Optionally, the method further includes the following specific steps:
obtaining a sample image set, wherein the sample image set comprises a plurality of sample images and label images corresponding to the sample images;
extracting a first sample image and a first label image corresponding to the first sample image from a sample image set, wherein the first sample image is any sample image;
inputting a first sample image into a preset image encoder, and performing feature extraction on the first sample image by using the image encoder to obtain a first global image feature vector and a first local image feature vector;
constructing a first prompt text vector according to the random text vector, the first global image feature vector and a preset category label;
inputting the first prompt text vector into a text encoder, and performing feature extraction on the first prompt text vector by using the text encoder to obtain a first text feature vector;
performing multi-scale feature alignment on the first local image feature vector and the first text feature vector to obtain a first feature alignment vector;
inputting the first local image feature vector and the first feature alignment vector into a pre-trained decoder, and performing feature compilation on the first local image feature vector and the first feature alignment vector by using the decoder to determine a segmentation result of the first sample image;
determining a total loss according to a segmentation result of the first sample image, the first feature alignment vector and the first label image;
and adjusting parameters of the image encoder and the decoder based on the total loss, and returning to the step of extracting the first sample image and the first label image corresponding to the first sample image from the sample image set until a training stopping condition is reached.
The sample image set is a pre-constructed set of sample images comprising several sample images and the label image corresponding to each sample image. Each sample image is a multimedia image sample containing several entities, and its label image is the sample pre-annotated with entity categories. The annotation may be manual, or may use a pre-trained entity annotation algorithm, which may be a pixel-level classification method based on pixel values or a method that classifies image features with a neural network model; neither is limited here. The sample image set may be constructed from the results of earlier image segmentation, obtained from an open-source database, or produced by manually annotating the sample images; none of these is limited here.
The first global image feature is an image feature representing a global high-dimensional feature of the first sample image, and is used for representing features such as color, texture, shape, structure, entity distribution and the like of the image. The first global image feature vector is a high-dimensional vector of first global image features.
The first local image feature is an image feature representing a local low-dimensional feature of the first sample image, and is used for representing features such as pixels and solid edges of the image. The first local image feature vector is a low-dimensional vector of the first local image feature. The first local image feature vector can be characterized by a plurality of feature vector graphs which are characterized by image features with different dimensions.
The first prompt text vector is a text vector with other modal characteristics added on the first text characteristic, and the first prompt text vector is used for correspondingly stipulating the training direction of the model training through the multi-modal characteristics in the model training process. The first feature alignment vector is a multi-modal feature comprising a first local image feature and a first text feature, and the first local image feature and the first text feature in the first feature alignment vector have a spatial correspondence.
The total loss is determined from component loss values: with the first label image as the verification image, loss values are computed against the segmentation result of the first sample image and against the first feature alignment vector, and the total loss value determined from these components is used to evaluate the model performance of the image encoder and decoder.
The stop-training condition is a preset termination condition for model training. It may be a preset number of training iterations, the iterative training ending when that number is reached, or a loss threshold, the training ending when the total loss value satisfies the threshold.
The total loss is determined from the segmentation result of the first sample image, the first feature alignment vector and the first label image; specifically, with the first label image as the verification image, the component loss values are computed against the segmentation result of the first sample image and against the first feature alignment vector, and the total loss value for evaluating the model performance of the image encoder and decoder is then determined from those component loss values.
And adjusting parameters of the image encoder and the decoder based on the total loss, and returning to execute the step of extracting the first sample image and the first label image corresponding to the first sample image from the sample image set until the training stopping condition is reached.
Illustratively, a Sample Image set Sample is obtained, wherein the Sample Image set comprises a plurality of Sample images Image Sample { Image Sample 1, image Sample 2 … … Image Sample n } and Label images Image 1 l 1, image Label 2 … … Image Label n } corresponding to each Sample Image, a first Sample Image Sample m and a first Label Image m corresponding to the first Sample Image are extracted from the Sample Image set Image Sample, inputting a first Sample Image Sample m into a preset Image encoder, performing feature extraction on the first Sample Image Sample m by using the Image encoder to obtain a first global Image feature vector I (m) and a first local Image feature vector Image Embedding (m), and according to the random Text vector V, the first global Image feature vector I (m) and a preset category Label CLS, constructing a first Prompt Text vector Prompt m, inputting the first Prompt Text vector Prompt m into a Text encoder, performing feature extraction on the first Prompt Text vector Prompt m by using the Text encoder to obtain a first Text feature vector TextEmodd (m), performing Multi-scale feature Alignment on the first local Image feature vector ImageEmbedded (m) and the first Text feature vector TextEmodd (m) to obtain a first feature Alignment vector Multi-scale Alignment (m), inputting the first local Image feature vector ImageEmbedded (m) and the first feature Alignment vector Multi-scale Alignment (m) into a pre-trained decoder, and performing feature Alignment on the first local Image feature vector ImageEmbedded (m) and the first feature Alignment vector Multi-scale Alignment (m) by using the decoder And performing eigen-compilation, namely determining a segmentation Result m of the first sample Image, determining the total Loss according to the segmentation Result m of the first sample Image, the first feature Alignment vector Multi-scale Alignment (m) and the first label Image Label, adjusting parameters of an Image encoder and an Image decoder by using an automatic gradient updating method based on the total Loss, and returning to the step of extracting the first sample Image and the first label Image corresponding to the first sample Image from the sample Image set until a training stopping condition is reached.
A sample image set is obtained, wherein the sample image set comprises a plurality of sample images and label images corresponding to the sample images. A first sample image and a first label image corresponding to the first sample image are extracted from the sample image set. The first sample image is input into a preset image encoder, which performs feature extraction on the first sample image to obtain a first global image feature vector and a first local image feature vector. A first prompt text vector is constructed according to a random text vector, the first global image feature vector and a preset class label, and is input into a text encoder, which performs feature extraction on it to obtain a first text feature vector. Multi-scale feature alignment is performed on the first local image feature vector and the first text feature vector to obtain a first feature alignment vector. The first local image feature vector and the first feature alignment vector are input into a pre-trained decoder, which performs feature compilation on them to determine the segmentation result of the first sample image. The total loss is determined according to the segmentation result of the first sample image, the first feature alignment vector and the first label image, and parameters of the image encoder and the decoder are adjusted based on the total loss. Through the first sample image and the first label image, supervised model training is carried out on the image encoder and the decoder, and their parameters are adjusted through the total loss. When the training stopping condition is reached, training ends and the trained image encoder and decoder are obtained, which guarantees the performance and accuracy of the trained model and improves the accuracy of image segmentation.
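For concreteness, the training procedure above can be sketched in PyTorch-style Python as follows. This is a minimal illustrative sketch, not the implementation of this embodiment: image_encoder, text_encoder, decoder, build_prompt, multi_scale_align, total_loss_fn and sample_set are assumed placeholder callables standing in for the components described above.

```python
import torch

def train_segmentation(image_encoder, text_encoder, decoder, build_prompt,
                       multi_scale_align, total_loss_fn, sample_set,
                       random_text_vector, class_labels,
                       max_steps=10000, loss_threshold=1e-3):
    # Only the image encoder and the decoder are adjusted, as described above.
    params = list(image_encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.AdamW(params, lr=1e-4)

    for step in range(max_steps):
        # Extract a first sample image and its first label image.
        sample_img, label_img = sample_set.draw()

        # Feature extraction: first global and first local image feature vectors.
        global_feat, local_feat = image_encoder(sample_img)

        # Construct the first prompt text vector; extract the first text feature vector.
        prompt = build_prompt(random_text_vector, global_feat, class_labels)
        text_feat = text_encoder(prompt)

        # Multi-scale feature alignment, then feature compilation by the decoder.
        align_vec = multi_scale_align(local_feat, text_feat)
        seg_result = decoder(local_feat, align_vec)

        # Total loss from segmentation result, alignment vector and label image.
        loss = total_loss_fn(seg_result, align_vec, label_img)

        # Automatic gradient update of the parameters.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Stop-training condition: a loss value threshold (or max_steps iterations).
        if loss.item() < loss_threshold:
            break
    return image_encoder, decoder
```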
Optionally, determining the total loss according to the segmentation result of the first sample image, the first feature alignment vector, and the first label image, includes the following specific steps:
calculating segmentation loss by using a preset segmentation loss function according to the segmentation result of the first sample image and the first label image;
calculating alignment loss by using a preset alignment loss function according to the first characteristic alignment vector and the first label image;
calculating a contrast loss by using a preset contrast loss function according to the first characteristic alignment vector and the first label image;
the segmentation loss, alignment loss, and contrast loss are weighted to obtain the total loss.
At present, DenseCLIP adopts only the loss of the preset category labels, that is, parameter adjustment of the model is performed only through the segmentation loss, so the constraint imposed by the loss value is weak. When features are not aligned and well contrasted, the training effect of a sample on the model is difficult to guarantee, which seriously affects the accuracy of the segmentation result.
The segmentation loss is a classification loss value which characterizes the entity types in the segmentation result. The alignment loss characterizes the spatial loss after alignment between features in the feature alignment vector, and the contrast loss is the loss corresponding to each feature in the feature alignment vector.
The segmentation loss, the alignment loss and the contrast loss are weighted to obtain the total loss. Specifically, preset loss weights are used to weight the segmentation loss, the alignment loss and the contrast loss, with the weighted calculation performed using Formula 1 below:
Loss = γ1·Loss_Seg + γ2·Loss_Align + γ3·Loss_Contrast    (Formula 1)

wherein Loss denotes the total loss, γ1 denotes the weight of the segmentation loss, Loss_Seg denotes the segmentation loss, γ2 denotes the weight of the alignment loss, Loss_Align denotes the alignment loss, γ3 denotes the weight of the contrast loss, and Loss_Contrast denotes the contrast loss.
Illustratively, the segmentation loss is calculated to be 0.18 using a preset segmentation loss function according to the segmentation result of the first sample image and the first label image, the alignment loss is calculated to be 0.24 using the preset alignment loss function according to the first feature alignment vector and the first label image, the contrast loss is calculated to be 0.36 using a preset contrast loss function according to the first feature alignment vector and the first label image, the weight of the segmentation loss is 0.4, the weight of the alignment loss is 0.2, the weight of the contrast loss is 0.4, and the segmentation loss 0.18, the alignment loss 0.24, and the contrast loss 0.36 are weighted using formula 1 to obtain the total loss of 0.264.
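As a quick check, the weighted calculation of Formula 1 for this example can be reproduced in a few lines of Python:

```python
# Formula 1: Loss = γ1·Loss_Seg + γ2·Loss_Align + γ3·Loss_Contrast
weights = (0.4, 0.2, 0.4)       # preset weights: segmentation, alignment, contrast
losses  = (0.18, 0.24, 0.36)    # the component losses from the example above
total_loss = sum(w * l for w, l in zip(weights, losses))
print(round(total_loss, 3))     # 0.264
```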
Calculating a segmentation loss by using a preset segmentation loss function according to a segmentation result of the first sample image and the first label image, calculating an alignment loss by using a preset alignment loss function according to the first feature alignment vector and the first label image, calculating a contrast loss by using a preset contrast loss function according to the first feature alignment vector and the first label image, and weighting the segmentation loss, the alignment loss and the contrast loss to obtain a total loss. The accuracy of total loss is guaranteed, the performance and the accuracy of the model obtained through training are further guaranteed, and the accuracy of image segmentation is improved.
Optionally, the method for calculating the contrast loss by using a preset contrast loss function according to the first feature alignment vector and the first label image includes the following specific steps:
and performing sample-point-by-sample point contrast loss calculation on the first feature alignment vector and the first label image by using a preset contrast loss function to obtain the contrast loss.
The sample points are the feature sample points on the first feature alignment vector and the first label image, i.e., image feature points at a preset granularity, such as feature sample points at granularities of 1x1, 2x2, 4x4 and 8x8. The sample points comprise difficult sample points and easy sample points, corresponding respectively to feature sample points in the image for which entity classification is difficult and feature sample points for which entity classification is easy; the difficulty is determined by comparing the contrast loss with a preset contrast loss threshold.
After the difficulty of the sample points is determined, they are labeled accordingly. The labels serve as a result of this round of training and can be used in subsequent iterative training, in which training proceeds in order from easy sample points to difficult sample points, improving the training effect of the model.
And performing sample-point-by-sample-point contrast loss calculation on the first feature alignment vector and the first label image by using a preset contrast loss function to obtain the contrast loss. This provides reference data for subsequent model training, improves the training effect of the model, and improves the accuracy of subsequent segmentation results.
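A minimal sketch of such a sample-point-by-sample-point calculation is given below, assuming an InfoNCE-style cross-entropy over per-point class scores; the embodiment does not fix the exact contrast loss function, so both the formulation and the hard_threshold parameter here are assumptions.

```python
import torch
import torch.nn.functional as F

def pointwise_contrast_loss(align_vec, label_img, tau=0.07, hard_threshold=1.0):
    # align_vec: (C, H, W) first feature alignment vector, one score map per class.
    # label_img: (H, W) integer class indices from the first label image.
    C, H, W = align_vec.shape
    logits = align_vec.reshape(C, -1).t() / tau      # (H*W, C) per-point scores
    targets = label_img.reshape(-1)                  # (H*W,) per-point classes
    # InfoNCE-style per-point loss: the true class is the positive,
    # every other class is a negative.
    per_point = F.cross_entropy(logits, targets, reduction="none")
    # Points whose loss exceeds the preset threshold are labeled difficult
    # sample points; the rest are easy sample points.
    hard_mask = (per_point > hard_threshold).reshape(H, W)
    return per_point.mean(), hard_mask
```

The returned hard_mask corresponds to the difficult-sample-point labeling described above and could drive the easy-to-hard training schedule.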
Optionally, after step S110, the following specific steps are further included:
sending the segmentation result of the image to be segmented to the front end for displaying so that a user edits the segmentation result at the front end;
receiving an editing result fed back by the front end;
and taking the editing result as a sample image, and training an image segmentation model, wherein the image segmentation model comprises an image encoder for performing image feature extraction, a text encoder for performing text feature extraction and a decoder for performing feature decoding.
The front end is the front end of the client with the image segmentation function, which can execute the steps S102 to S110.
By sending the segmentation result to the front end for display, the user can perform further editing operations after directly observing the visual effect of the segmentation result.
The editing result is an image processing result obtained by the user performing an editing operation on the segmentation result displayed at the front end, and the editing operation may be adjusting an image area of an entity in the segmentation result, adjusting display parameters such as an image scale, a size, a color, and a contrast of the segmentation result, and adjusting a category of the entity in the segmentation result, which is not limited herein.
The image segmentation model is a model with an image segmentation function and comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding the features. The image encoder is a neural network model with an image feature extraction function, such as a VGG model, a ResNet model, a CNN model or a VIT model. The text encoder is a neural network model with a text feature extraction function, such as a Transformer model or one of its derivative models. The decoder is a neural network model with the function of classifying entities in the image; it can identify and classify different entities in the image to obtain the segmentation result of the image to be segmented, and may be an FCN model, a U-Net model, an FPN model, etc.
And sending the segmentation result of the image to be segmented to the front end for displaying, wherein the specific mode is that different entity areas in the segmentation result of the image to be segmented are subjected to category labeling and then sent to the front end for displaying. The type labeling may be a text label, an outline label, or a color label, and is not limited herein.
Illustratively, an image to be segmented sent by a user through a client with an image segmentation function is received, the image to be segmented is a photo containing 5 entities, the segmentation result of the photo is obtained by executing the steps S102 to S110, image areas of the 5 entities in the segmentation result are labeled with categories with different colors, and the segmentation result subjected to category labeling is displayed at the front end of the client. And receiving the editing result fed back by the front end.
Because the sample sets used for training the image segmentation model are generally universal sample sets, such as ADE20K, COCO-Stuff10K and ADE20K-Full, a user may, in actual use, need sample images and corresponding label images that better match the actual application scene, and manually constructing a large number of such sample and label images is costly. The segmentation result can therefore be displayed at the front end and edited by the user, yielding an editing result that better conforms to the actual application scene to be used as a sample image for training the image encoder and the decoder, which improves the accuracy of the segmentation result and saves model training costs.
Exemplarily, traffic lights in a general sample set are horizontal; a model trained on such a set entity-identifies horizontal traffic lights and then segments them to obtain segmentation results. Traffic lights in another region are vertical, and a model trained on the general sample set has difficulty segmenting images containing vertical traffic lights, so the user performs an editing operation, and the editing result is used as a sample image to train the image segmentation model, improving its ability to identify and segment the vertical traffic lights of that region.
The segmentation result of the image to be segmented is sent to the front end to be displayed, and the segmentation result can be visually displayed to a user, so that the user can further process the segmentation result, and the user experience is improved. And receiving an editing result fed back by the front end, taking the editing result as a sample image, and training an image segmentation model, wherein the image segmentation model comprises an image encoder for performing image feature extraction, a text encoder for performing text feature extraction and a decoder for performing feature decoding. The training effect of the image segmentation model is improved, the effect of segmenting the subsequent image to be segmented is improved, and the accuracy of the subsequent segmentation result is improved.
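A minimal sketch of this feedback loop follows; sample_set, trainer and their methods are hypothetical placeholders, since the embodiment does not prescribe a concrete interface:

```python
def incorporate_front_end_edit(sample_set, image_to_segment, edit_result, trainer):
    """Treat the user's edited segmentation result as a new labeled sample
    and fine-tune the image segmentation model on the grown sample set."""
    # The editing result fed back by the front end serves as the label image
    # for the original image to be segmented.
    sample_set.add(image=image_to_segment, label=edit_result)
    # Retrain the image segmentation model (image encoder, text encoder,
    # decoder) on the updated sample set.
    trainer.fit(sample_set)
```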
Referring to fig. 2, fig. 2 shows a flowchart of a remote sensing image segmentation method provided in an embodiment of the present specification, which specifically includes the following steps.
Step S202: receiving a remote sensing image segmentation instruction input by a user, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmentation object;
step S204: performing feature extraction on a remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector;
step S206: constructing a prompt text vector according to the random text vector, the global image feature vector and the category label;
step S208: extracting the characteristics of the prompt text vector to obtain a text characteristic vector;
step S210: and determining a segmentation result aiming at the target segmentation object in the remote sensing image to be segmented through feature compilation according to the local image feature vector and the text feature vector.
The embodiment of the specification is applied to a functional service providing terminal with a remote sensing image segmentation function.
The remote sensing image segmentation instruction is an image segmentation instruction which is generated by the client and sent to the functional service providing terminal after the user uploads the remote sensing image to be segmented and the category label of the target segmentation object through the client.
The remote sensing image to be segmented is a remote sensing multimedia image containing a plurality of surface entities, i.e., a real remote sensing image acquired by a remote sensing image acquisition device; it may take the form of a picture, a video frame and the like, which is not limited herein. The target segmentation object is a surface entity that the user requires to be entity-identified and segmented to obtain a corresponding image, and the category label of the target segmentation object is one of the preset categories corresponding to the target segmentation object.
The specific implementation manner in the embodiment of this specification has been described in detail in the embodiment of fig. 1, and is not described herein again.
In the embodiment of the specification, a remote sensing image segmentation instruction input by a user is received, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a category label of a target segmentation object, feature extraction is performed on the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector, a prompt text vector is constructed according to a random text vector, the global image feature vector and the category label, feature extraction is performed on the prompt text vector to obtain a text feature vector, and a segmentation result for the target segmentation object in the remote sensing image to be segmented is determined through feature compilation according to the local image feature vector and the text feature vector. The method comprises the steps of extracting features of a remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector, constructing according to the global image feature vector, a random text vector and a preset category label to obtain a prompt text vector, subsequently segmenting the image by using the text features and the image features, and fully excavating deep features of a single remote sensing image to be segmented to enable the segmentation result to better meet image use habits of people.
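Illustratively, the remote sensing image segmentation instruction could be carried as a simple request payload on the client side; the field names and the service.segment call below are assumptions for illustration, not part of this embodiment:

```python
# A hypothetical instruction as assembled by the client.
instruction = {
    "image_to_segment": "satellite_scene_001.tif",  # remote sensing image to be segmented
    "target_class_label": "building",               # category label of the target segmentation object
}

def handle_instruction(service, instruction):
    # The functional service providing terminal runs steps S204-S210 and
    # returns the segmentation result for the target segmentation object only.
    return service.segment(instruction["image_to_segment"],
                           instruction["target_class_label"])
```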
Optionally, after step S210, the following specific steps are further included:
sending the segmentation result of the remote sensing image to be segmented to the front end for displaying so that a user can edit the segmentation result at the front end;
receiving an editing result fed back by the front end;
and taking the editing result as a sample image, and training an image segmentation model, wherein the image segmentation model comprises an image encoder for performing image feature extraction, a text encoder for performing text feature extraction and a decoder for performing feature decoding.
The front end is the front end of the client with the remote sensing image segmentation function.
By sending the segmentation result to the front end for display, the user can perform further editing operation after directly observing the visual effect of the segmentation result.
The editing result is an image processing result obtained by the user performing an editing operation on the segmentation result displayed at the front end, and the editing operation may be adjusting an image area of the surface entity in the segmentation result, adjusting display parameters such as an image proportion, a size, a color, and a contrast of the segmentation result, and adjusting a type of the surface entity in the segmentation result, which is not limited herein.
The image segmentation model is a model with an image segmentation function and comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding the features. The image encoder is a neural network model with an image feature extraction function, such as a VGG model, a ResNet model, a CNN model or a VIT model. The text encoder is a neural network model with a text feature extraction function, such as a Transformer model or one of its derivative models. The decoder is a neural network model with the function of classifying the earth surface entities in the image; it can identify and classify different earth surface entities in the image to obtain the segmentation result of the remote sensing image to be segmented, and may be an FCN model, a U-Net model, an FPN model, etc.
And sending the segmentation result of the remote sensing image to be segmented to the front end for displaying, wherein the specific mode is that after class marking is carried out on image areas of different earth surface entities in the segmentation result of the remote sensing image to be segmented, the image areas are sent to the front end for displaying. The category labeling mode may be text labeling, outline labeling, or color labeling, which is not limited herein.
Illustratively, a remote sensing image to be segmented sent by a user through a client with an image segmentation function is received, the remote sensing image to be segmented is a remote sensing satellite map containing 3 types of earth surface entities (lakes, roads and buildings), a target segmentation object is a building, the segmentation result of the remote sensing satellite map is obtained by executing the steps S202-S210, the image area of the building in the segmentation result is labeled by categories with different outlines, and the segmentation result subjected to category labeling is displayed at the front end of the client. And receiving an editing result fed back by the user, wherein the editing result is an image processing result obtained by adjusting the image area of the building in the segmentation result by the user.
Because the sample sets used for training the image segmentation model are generally universal sample sets, such as ADE20K, COCO-Stuff10K and ADE20K-Full, a user may, in actual use, need sample images and corresponding label images that better match the actual application scene, and manually constructing a large number of such sample and label images is costly. The segmentation result can therefore be displayed at the front end and edited by the user, yielding an editing result that better conforms to the actual application scene to be used as a sample image for training the image encoder and the decoder, which improves the accuracy of the segmentation result and saves model training costs.
For example, buildings in a general sample set are high-density; a model trained on such a set entity-identifies and segments high-density surface entities to obtain the target segmentation object (buildings). Buildings in certain regions, however, are low-density, and a model trained on the general sample set has difficulty segmenting remote sensing images containing such buildings, so the user performs editing operations, and the editing result is used as a sample image to train the image segmentation model, improving its ability to identify and segment the low-density buildings of that region.
The segmentation result of the remote sensing image to be segmented is sent to the front end to be displayed, and the segmentation result can be visually displayed to a user, so that the user can further process the segmentation result, and the user experience is improved. And receiving an editing result fed back after the user edits the segmentation result, so that the actual use requirement of the user can be met, the adaptability and the accuracy of the remote sensing image segmentation are improved, and the user experience is improved. And taking the editing result as a sample image, and training an image segmentation model, wherein the image segmentation model comprises an image encoder for performing image feature extraction, a text encoder for performing text feature extraction and a decoder for performing feature decoding. The training effect of the image segmentation model is improved, the effect of segmenting the subsequent remote sensing images to be segmented is improved, and the accuracy of the subsequent segmentation result is improved.
The following will further describe the image segmentation method provided in this specification with reference to fig. 3 as an example of application of the image segmentation method to entity identification of a remote sensing image. Fig. 3 shows a processing flow chart of an image segmentation method applied to entity identification of a remote sensing image according to an embodiment of the present specification, and specifically includes the following steps.
Step S302: receiving a remote sensing image to be segmented sent by a user through a client;
the remote sensing image to be segmented is a multimedia image comprising a plurality of surface entities. The plurality of surface entities may be roads, trees, oil storage tanks, traffic vehicles, buildings, and the like.
Step S304: inputting the remote sensing image to be segmented into a VIT model, and performing feature extraction on the image to be segmented by using the VIT model to obtain a global image feature vector and a local image feature vector;
step S306: carrying out dimension mapping on the global image feature vector to obtain a global image feature vector with the same vector dimension as the random text vector;
step S308: splicing the random text vector, the dimensionality mapped global image feature vector and a preset category label to obtain a prompt text vector;
step S310: constructing a prompt text vector according to the random text vector, the global image feature vector and a preset category label;
step S312: inputting the prompt text vector into a Transformer model, and extracting the characteristics of the prompt text vector by using the Transformer model to obtain a text characteristic vector;
step S314: performing cross attention calculation on the text characteristic vector and the local image characteristic vector by using a preset Transformer model with a 6-layer structure to determine a target text characteristic vector;
step S316: fine-tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector;
step S318: performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector;
step S320: splicing the feature alignment vector and the local image feature vector in corresponding dimensions to obtain a spliced feature vector;
step S322: inputting the local image feature vector and the feature alignment vector into a Semantic FPN model, and performing feature compilation on the local image feature vector and the feature alignment vector by using the Semantic FPN model to determine a segmentation result of the remote sensing image to be segmented;
step S324: and sending the segmentation result to the front end of the client for display.
In the embodiment of the specification, a VIT model is used to perform feature extraction on the remote sensing image to be segmented, obtaining a global image feature vector and a local image feature vector. The global image feature vector is dimension-mapped and then spliced with the random text vector and a preset category label to obtain a prompt text vector, and the richer and deeper features of the prompt text vector are mined based on a cross attention mechanism and a Transformer model. The image is subsequently segmented by using both the text features and the image features, fully mining the deep features of a single image to be segmented, so that the segmentation result better meets the image use habits of people and a good processing result is obtained when the segmentation result is reprocessed downstream, improving the accuracy of the segmentation result. A cross multiplication operation on the local image feature vector and the text feature vector produces the feature alignment vector, which guarantees the feature correlation of the subsequent segmentation result and further improves its accuracy, and image segmentation with a Semantic FPN model improves the accuracy of the segmentation result still further.
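Pulling steps S302 to S324 together, the inference pipeline can be condensed into the following PyTorch-style sketch. The module names (vit, mapper, text_encoder, cross_attention, semantic_fpn) and all tensor shapes are illustrative assumptions standing in for the models named above, not the embodiment's actual implementation.

```python
import torch

def segment_remote_sensing_image(image, vit, mapper, text_encoder,
                                 cross_attention, semantic_fpn,
                                 random_text_vector, class_labels):
    # S304: extract global and local image feature vectors with the VIT backbone.
    # Assumed shapes: global_feat (D_img,), local_feat (D, H, W).
    global_feat, local_feat = vit(image)

    # S306: dimension-map the global feature to the random text vector's dimension.
    global_mapped = mapper(global_feat)                     # (D_txt,)

    # S308/S310: splice random text vector, mapped global feature and preset
    # category labels into the prompt text vector.
    # Assumed shapes: random_text_vector (T, D_txt), class_labels (K, D_txt).
    prompt = torch.cat([random_text_vector,
                        global_mapped.unsqueeze(0),
                        class_labels], dim=0)

    # S312: extract the text feature vector with the Transformer text encoder.
    text_feat = text_encoder(prompt)                        # assumed (K, D)

    # S314/S316: cross attention (a 6-layer Transformer decoder) between text
    # and local image features; the result fine-tunes the text feature vector.
    text_feat = text_feat + cross_attention(text_feat, local_feat)

    # S318: cross multiplication of local image features and text features
    # yields the feature alignment vector, one map per category.
    align_vec = torch.einsum("kd,dhw->khw", text_feat, local_feat)

    # S320/S322: splice alignment vector and local features along the channel
    # dimension and let the Semantic FPN decoder compile the segmentation result.
    fused = torch.cat([align_vec, local_feat], dim=0)
    return semantic_fpn(fused)
```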
Fig. 4 is a system architecture diagram of an image segmentation system provided in an embodiment of the present specification.
As shown in fig. 4, the image to be segmented is input into the image encoder, and the global image feature vector and the local image feature vector are extracted. The global image feature vector is input into the mapper, dimension-mapped, and spliced with the random text vector and a preset category label to construct the prompt text vector. The prompt text vector is input into the text encoder, and the text feature vector is extracted. The text feature vector and the local image feature vector are input into the Transformer model, where the text feature vector is fine-tuned using a cross attention mechanism to obtain the updated text feature vector. Feature alignment of the text feature vector and the local image feature vector yields the feature alignment vector, which is cascaded with the local image feature vector and input into the decoder to obtain the segmentation result of the image to be segmented. The segmentation loss is calculated using the segmentation result and a pre-obtained label image, and the contrast loss and the alignment loss are calculated respectively using the pre-obtained label image and the feature alignment vector. The segmentation, contrast and alignment losses are used to adjust the parameters of the image encoder and the decoder.
Fig. 5A shows a schematic diagram of a remote sensing image to be segmented of a remote sensing image segmentation method provided in an embodiment of the present specification. Fig. 5B is a schematic diagram illustrating a segmentation result of a remote sensing image to be segmented according to a remote sensing image segmentation method provided in an embodiment of the present specification.
The embodiment of the specification relates to the front-end display of a client with a remote sensing image segmentation function.
As shown in fig. 5A, the remote sensing image to be segmented includes the surface entities of a road and oil storage tanks; fig. 5A is the remote sensing image to be segmented before image segmentation. The preset category label is set to the vector corresponding to the "oil storage tank", and the segmentation result is obtained by processing with the remote sensing image segmentation method of the embodiment of fig. 2. As shown in fig. 5B, the oil storage tanks in the remote sensing image to be segmented undergo corresponding entity identification and the entity images corresponding to the oil storage tanks are obtained by segmentation, while the other surface entity (the road), which has no preset category label, does not undergo entity identification.
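In terms of the pipeline sketch given after fig. 3, this example amounts to supplying a single preset category label; all names below are the same hypothetical placeholders:

```python
# Only "oil storage tank" is supplied as a preset category label, so only oil
# storage tanks are entity-identified; the road, having no preset category
# label, is left unsegmented (fig. 5B).
result = segment_remote_sensing_image(
    image=remote_sensing_image,                 # the image of fig. 5A
    vit=vit, mapper=mapper, text_encoder=text_encoder,
    cross_attention=cross_attention, semantic_fpn=semantic_fpn,
    random_text_vector=random_text_vector,
    class_labels=label_embedding(["oil storage tank"]),  # hypothetical embedding helper
)
```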
Corresponding to the above method embodiment, the present specification further provides an image segmentation apparatus embodiment, and fig. 6 shows a schematic structural diagram of an image segmentation apparatus provided in an embodiment of the present specification. As shown in fig. 6, the apparatus includes:
a first obtaining module 602 configured to obtain an image to be segmented;
a first extraction module 604, configured to perform feature extraction on an image to be segmented to obtain a global image feature vector and a local image feature vector;
a first construction module 606 configured to construct a prompt text vector according to the random text vector, the global image feature vector, and a preset category label;
a second extraction module 608, configured to perform feature extraction on the prompt text vector to obtain a text feature vector;
and the first segmentation module 610 is configured to determine a segmentation result of the image to be segmented through feature compilation according to the local image feature vector and the text feature vector.
Optionally, the first building module 606 is further configured to:
carrying out dimension mapping on the global image feature vector to obtain a global image feature vector with the same vector dimension as the random text vector; and splicing the random text vector, the dimensionality mapped global image feature vector and a preset category label to obtain a prompt text vector.
Optionally, the apparatus further comprises:
the updating module is configured to perform cross attention calculation on the text feature vector and the local image feature vector and determine a target text feature vector; and fine-tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector.
Optionally, the update module is further configured to:
and performing cross attention calculation on the text feature vector and the local image feature vector by using a preset multilayer structure translation model decoder to determine a target text feature vector.
Optionally, the first segmentation module 610 is further configured to:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector; and determining a segmentation result of the image to be segmented through feature compiling based on the feature alignment vector and the local image feature vector.
Optionally, the first segmentation module 610 is further configured to:
and performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector.
Optionally, the first segmentation module 610 is further configured to:
splicing the feature alignment vector and the local image feature vector in corresponding dimensions to obtain a spliced feature vector;
and performing feature compiling on the spliced feature vector to obtain a segmentation result of the image to be segmented.
Optionally, the first extraction module 604 is further configured to:
inputting an image to be segmented into a pre-trained image encoder, and performing feature extraction on the image to be segmented by using the image encoder to obtain a global image feature vector and a local image feature vector;
correspondingly, the second extraction module 608 is further configured to:
inputting the prompt text vector into a text encoder, and performing feature extraction on the prompt text vector by using the text encoder to obtain a text feature vector;
correspondingly, the first segmentation module 610 is further configured to:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector; and inputting the local image feature vector and the feature alignment vector into a pre-trained decoder, and performing feature coding on the local image feature vector and the feature alignment vector by using the decoder to determine a segmentation result of the image to be segmented.
Optionally, the apparatus further comprises:
the training module is configured to obtain a sample image set, wherein the sample image set comprises a plurality of sample images and label images corresponding to the sample images; extracting a first sample image and a first label image corresponding to the first sample image from a sample image set, wherein the first sample image is any sample image; inputting a first sample image into a preset image encoder, and performing feature extraction on the first sample image by using the image encoder to obtain a first global image feature vector and a first local image feature vector; constructing a first prompt text vector according to the random text vector, the first global image feature vector and a preset category label; inputting the first prompt text vector into a text encoder, and performing feature extraction on the first prompt text vector by using the text encoder to obtain a first text feature vector; performing multi-scale feature alignment on the first local image feature vector and the first text feature vector to obtain a first feature alignment vector; inputting the first local image feature vector and the first feature alignment vector into a pre-trained decoder, and performing feature coding on the first local image feature vector and the first feature alignment vector by using the decoder to determine a segmentation result of the first sample image; determining total loss according to the segmentation result of the first sample image, the first feature alignment vector and the first label image; and adjusting parameters of the image encoder and the decoder based on the total loss, and returning to the step of extracting the first sample image and the first label image corresponding to the first sample image from the sample image set until a training stopping condition is reached.
Optionally, the training module is further configured to:
calculating segmentation loss by using a preset segmentation loss function according to the segmentation result of the first sample image and the first label image; calculating alignment loss by using a preset alignment loss function according to the first characteristic alignment vector and the first label image; calculating a contrast loss by using a preset contrast loss function according to the first characteristic alignment vector and the first label image; the segmentation loss, alignment loss, and contrast loss are weighted to obtain a total loss.
Optionally, the training module is further configured to:
and performing sample-point-by-sample point contrast loss calculation on the first feature alignment vector and the first label image by using a preset contrast loss function to obtain the contrast loss.
Optionally, the apparatus further comprises:
the first editing and training module is configured to send a segmentation result of an image to be segmented to a front end for displaying, so that a user edits the segmentation result at the front end, receives an editing result fed back by the front end, takes the editing result as a sample image, and trains an image segmentation model, wherein the image segmentation model comprises an image encoder for performing image feature extraction, a text encoder for performing text feature extraction and an encoder for performing feature decoding.
In the embodiment of the description, an image to be segmented is obtained, feature extraction is performed on the image to be segmented to obtain a global image feature vector and a local image feature vector, a prompt text vector is constructed according to a random text vector, the global image feature vector and a preset category label, feature extraction is performed on the prompt text vector to obtain a text feature vector, and a segmentation result of the image to be segmented is determined through feature compilation according to the local image feature vector and the text feature vector. The method comprises the steps of extracting features of an image to be segmented to obtain a global image feature vector and a local image feature vector, constructing according to the global image feature vector, a random text vector and a preset category label to obtain a prompt text vector, segmenting the image by using the text features and the image features in the follow-up process, and fully mining deep features of a single image to be segmented to enable the segmentation result to better meet image use habits of people.
The above is a schematic scheme of an image segmentation apparatus of the present embodiment. It should be noted that the technical solution of the image segmentation apparatus belongs to the same concept as the technical solution of the image segmentation method described above, and for details that are not described in detail in the technical solution of the image segmentation apparatus, reference may be made to the description of the technical solution of the image segmentation method described above.
Corresponding to the above method embodiment, the present specification further provides an embodiment of a remote sensing image segmentation apparatus, and fig. 7 shows a schematic structural diagram of the remote sensing image segmentation apparatus provided in an embodiment of the present specification. As shown in fig. 7, the apparatus includes:
a receiving module 702, configured to receive a remote sensing image segmentation instruction input by a user, wherein the remote sensing image segmentation instruction includes a remote sensing image to be segmented and a category label of a target segmentation object;
a third extraction module 704, configured to perform feature extraction on the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector;
a second construction module 706 configured to construct a prompt text vector from the random text vector, the global image feature vector, and the category label;
a fourth extraction module 708 configured to perform feature extraction on the prompt text vector to obtain a text feature vector;
and the second segmentation module 710 is configured to determine a segmentation result for the target segmentation object in the remote sensing image to be segmented through feature compilation according to the local image feature vector and the text feature vector.
Optionally, the apparatus further comprises:
the first editing training module is configured to send a segmentation result of the remote sensing image to be segmented to the front end for displaying, so that a user edits the segmentation result at the front end, receives an editing result fed back by the front end, takes the editing result as a sample image, and trains an image segmentation model, wherein the image segmentation model comprises an image encoder for performing image feature extraction, a text encoder for performing text feature extraction and a decoder for performing feature decoding.
In the embodiment of the specification, a remote sensing image segmentation instruction input by a user is received, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a category label of a target segmentation object, feature extraction is performed on the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector, a prompt text vector is constructed according to a random text vector, the global image feature vector and the category label, feature extraction is performed on the prompt text vector to obtain a text feature vector, and a segmentation result for the target segmentation object in the remote sensing image to be segmented is determined through feature compilation according to the local image feature vector and the text feature vector. The method comprises the steps of extracting features of a remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector, constructing according to the global image feature vector, a random text vector and a preset category label to obtain a prompt text vector, subsequently segmenting the image by using the text features and the image features, and fully excavating deep features of a single remote sensing image to be segmented to enable the segmentation result to better meet image use habits of people.
The foregoing is a schematic configuration of a remote sensing image segmentation apparatus according to the present embodiment. It should be noted that the technical solution of the remote sensing image segmentation apparatus and the technical solution of the remote sensing image segmentation method belong to the same concept, and details of the technical solution of the remote sensing image segmentation apparatus, which are not described in detail, can be referred to the description of the technical solution of the remote sensing image segmentation method.
Fig. 8 shows a block diagram of a computing device provided in an embodiment of the present specification. The components of the computing device 800 include, but are not limited to, memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and the database 850 is used to store data.
Computing device 800 also includes access device 840, which enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 840 may include one or more of any type of network interface (e.g., a Network Interface Controller), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, etc.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 8 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
Wherein the processor 820 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the image segmentation method or the remote sensing image segmentation method described above.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device belongs to the same concept as the technical solutions of the image segmentation method and the remote sensing image segmentation method, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the image segmentation method or the remote sensing image segmentation method.
An embodiment of the present specification further provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the steps of the image segmentation method or the remote sensing image segmentation method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solutions of the image segmentation method and the remote sensing image segmentation method, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the image segmentation method or the remote sensing image segmentation method.
An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the image segmentation method or the remote sensing image segmentation method.
The above is an illustrative scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program is the same concept as the technical solution of the image segmentation method and the remote sensing image segmentation method, and details not described in detail in the technical solution of the computer program can be referred to the description of the technical solution of the image segmentation method or the remote sensing image segmentation method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, read-Only Memory (ROM), random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, and to thereby enable others skilled in the art to best understand the specification and utilize the specification. The specification is limited only by the claims and their full scope and equivalents.

Claims (14)

1. An image segmentation method, comprising:
acquiring an image to be segmented;
extracting features of the image to be segmented to obtain a global image feature vector and a local image feature vector;
constructing a prompt text vector according to the random text vector, the global image feature vector and a preset category label;
extracting the characteristics of the prompt text vector to obtain a text characteristic vector;
and determining the segmentation result of the image to be segmented through feature compilation according to the local image feature vector and the text feature vector.
2. The method of claim 1, wherein constructing a prompt text vector from the random text vector, the global image feature vector, and a preset category label comprises:
carrying out dimension mapping on the global image feature vector to obtain a global image feature vector with the same vector dimension as that of the random text vector;
and splicing the random text vector, the dimensionality mapped global image feature vector and a preset category label to obtain a prompt text vector.
3. The method according to claim 1, before determining a segmentation result of the image to be segmented according to the local image feature vector and the text feature vector through feature coding, further comprising:
performing cross attention calculation on the text feature vector and the local image feature vector to determine a target text feature vector;
and fine-tuning the text feature vector based on the target text feature vector to obtain the updated text feature vector.
4. The method of claim 3, the cross-attention computing the text feature vector and the local image feature vector, determining a target text feature vector, comprising:
and performing cross attention calculation on the text feature vector and the local image feature vector by using a preset multilayer structure translation model decoder to determine a target text feature vector.
5. The method according to any one of claims 1 to 4, wherein the determining a segmentation result of the image to be segmented according to the local image feature vector and the text feature vector through feature coding comprises:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector;
and determining a segmentation result of the image to be segmented through feature coding based on the feature alignment vector and the local image feature vector.
6. The method of claim 5, wherein the determining a segmentation result of the image to be segmented based on the feature alignment vector and the local image feature vector through feature coding comprises:
splicing the feature alignment vector and the local image feature vector in corresponding dimensions to obtain a spliced feature vector;
and performing feature compiling on the spliced feature vector to obtain a segmentation result of the image to be segmented.
7. The method according to claim 1, wherein the extracting features of the image to be segmented to obtain a global image feature vector and a local image feature vector comprises:
inputting the image to be segmented into a pre-trained image encoder, and performing feature extraction on the image to be segmented by using the image encoder to obtain a global image feature vector and a local image feature vector;
the extracting the feature of the prompt text vector to obtain a text feature vector includes:
inputting the prompt text vector into a text encoder, and performing feature extraction on the prompt text vector by using the text encoder to obtain a text feature vector;
the determining the segmentation result of the image to be segmented according to the local image feature vector and the text feature vector through feature coding comprises:
carrying out multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector;
inputting the local image feature vector and the feature alignment vector into a pre-trained decoder, and performing feature coding on the local image feature vector and the feature alignment vector by using the decoder to determine a segmentation result of the image to be segmented.
8. The method of claim 7, further comprising:
obtaining a sample image set, wherein the sample image set comprises a plurality of sample images and label images corresponding to the sample images;
extracting a first sample image and a first label image corresponding to the first sample image from the sample image set, wherein the first sample image is any sample image;
inputting the first sample image into a preset image encoder, and performing feature extraction on the first sample image by using the image encoder to obtain a first global image feature vector and a first local image feature vector;
constructing a first prompt text vector according to the random text vector, the first global image feature vector and a preset category label;
inputting the first prompt text vector into a text encoder, and performing feature extraction on the first prompt text vector by using the text encoder to obtain a first text feature vector;
performing multi-scale feature alignment on the first local image feature vector and the first text feature vector to obtain a first feature alignment vector;
inputting the first local image feature vector and the first feature alignment vector into a pre-trained decoder, performing feature coding on the first local image feature vector and the first feature alignment vector by using the decoder, and determining a segmentation result of the first sample image;
determining a total loss according to the segmentation result of the first sample image, the first feature alignment vector and the first label image;
and adjusting parameters of the image encoder and the decoder based on the total loss, and returning to execute the step of extracting the first sample image and the first label image corresponding to the first sample image from the sample image set until a training stopping condition is reached.
9. The method of claim 8, determining a total loss from the segmentation result of the first sample image, the first feature alignment vector, and the first label image, comprising:
calculating segmentation loss by using a preset segmentation loss function according to the segmentation result of the first sample image and the first label image;
calculating alignment loss by using a preset alignment loss function according to the first characteristic alignment vector and the first label image;
calculating a contrast loss by using a preset contrast loss function according to the first feature alignment vector and the first label image;
weighting the segmentation loss, the alignment loss and the contrast loss to obtain a total loss.
10. The method according to any one of claims 1-4 and 7-9, further comprising, after the determining of the segmentation result of the image to be segmented:
sending the segmentation result of the image to be segmented to a front end for display, so that a user can edit the segmentation result at the front end;
receiving an editing result fed back by the front end;
and taking the editing result as a sample image, and training an image segmentation model, wherein the image segmentation model comprises an image encoder for performing image feature extraction, a text encoder for performing text feature extraction and a decoder for performing feature decoding.
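In code terms, this feedback loop reduces to appending the user's edited mask to the training set and rerunning the claim-8 loop. The helper below is a hypothetical illustration of that glue, not part of the claimed method.

    def incorporate_user_edit(dataset, image, edited_mask):
        # The edited result becomes the label image of a new training sample;
        # the image segmentation model is then retrained on the grown set.
        dataset.append((image, edited_mask))
        return dataset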
11. A remote sensing image segmentation method, comprising:
receiving a remote sensing image segmentation instruction input by a user, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmentation object;
extracting the features of the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector;
constructing a prompt text vector according to the random text vector, the global image feature vector and the category label;
performing feature extraction on the prompt text vector to obtain a text feature vector;
and determining, through feature decoding according to the local image feature vector and the text feature vector, a segmentation result for the target segmentation object in the remote sensing image to be segmented.
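End to end, the distinctive step in claim 11 is building one prompt text vector per user-supplied category label out of a random (learnable) context, the global image feature, and the label's embedding. The sketch below layers a hypothetical token layout on the earlier multi_scale_feature_alignment example; all module interfaces and dimensions are assumptions, not the patented implementation.

    import torch

    def segment_remote_sensing(image, class_labels, image_encoder, text_encoder,
                               decoder, embed_label, ctx_len=4, dim=64):
        # class_labels: e.g. ["building", "water", "road"] from the user's instruction.
        global_feat, local_feats = image_encoder(image)  # (dim,) vector + multi-scale maps
        prompts = []
        for lab in class_labels:
            ctx = torch.randn(ctx_len, dim)              # random text vector (context tokens)
            g = global_feat.reshape(1, dim)              # global image feature as one token
            tok = embed_label(lab).reshape(1, dim)       # category label embedding
            prompts.append(torch.cat([ctx, g, tok], dim=0))  # one prompt per category
        text_feats = text_encoder(torch.stack(prompts))  # -> (1, K, dim) class embeddings
        align = multi_scale_feature_alignment(local_feats, text_feats)
        return decoder(local_feats, align)               # mask for the target object(s)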
12. The method of claim 11, further comprising, after the determining of the segmentation result for the target segmentation object in the remote sensing image to be segmented:
sending the segmentation result of the remote sensing image to be segmented to a front end for display, so that a user can edit the segmentation result at the front end;
receiving an editing result fed back by the front end;
and taking the editing result as a sample image, and training an image segmentation model, wherein the image segmentation model comprises an image encoder for performing image feature extraction, a text encoder for performing text feature extraction and a decoder for performing feature decoding.
13. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the image segmentation method of any one of claims 1 to 10 or the remote sensing image segmentation method of any one of claims 11 to 12.
14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the image segmentation method of any one of claims 1 to 10 or the remote sensing image segmentation method of any one of claims 11 to 12.
CN202211182413.7A 2022-09-27 2022-09-27 Image segmentation method, remote sensing image segmentation method and device Active CN115761222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211182413.7A CN115761222B (en) 2022-09-27 2022-09-27 Image segmentation method, remote sensing image segmentation method and device


Publications (2)

Publication Number Publication Date
CN115761222A (en) 2023-03-07
CN115761222B CN115761222B (en) 2023-11-03

Family

ID=85352056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211182413.7A Active CN115761222B (en) 2022-09-27 2022-09-27 Image segmentation method, remote sensing image segmentation method and device

Country Status (1)

Country Link
CN (1) CN115761222B (en)



Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160014482A1 (en) * 2014-07-14 2016-01-14 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
US20210390700A1 (en) * 2020-06-12 2021-12-16 Adobe Inc. Referring image segmentation
US20220156992A1 (en) * 2020-11-18 2022-05-19 Adobe Inc. Image segmentation using text embedding
WO2022142450A1 (en) * 2020-12-28 2022-07-07 北京达佳互联信息技术有限公司 Methods and apparatuses for image segmentation model training and for image segmentation
CN112818955A (en) * 2021-03-19 2021-05-18 北京市商汤科技开发有限公司 Image segmentation method and device, computer equipment and storage medium
CN114036336A (en) * 2021-11-15 2022-02-11 上海交通大学 Semantic division-based pedestrian image searching method based on visual text attribute alignment
CN114283127A (en) * 2021-12-14 2022-04-05 山东大学 Multi-mode information-guided medical image segmentation system and image processing method
CN114529757A (en) * 2022-01-21 2022-05-24 四川大学 Cross-modal single-sample three-dimensional point cloud segmentation method
CN114565625A (en) * 2022-03-09 2022-05-31 昆明理工大学 Mineral image segmentation method and device based on global features
CN114943789A (en) * 2022-03-28 2022-08-26 华为技术有限公司 Image processing method, model training method and related device
CN115035213A (en) * 2022-05-23 2022-09-09 中国农业银行股份有限公司 Image editing method, device, medium and equipment
CN115631205A (en) * 2022-12-01 2023-01-20 阿里巴巴(中国)有限公司 Method, device and equipment for image segmentation and model training
CN116128894A (en) * 2023-01-31 2023-05-16 马上消费金融股份有限公司 Image segmentation method and device and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ALEC RADFORD: "Learning Transferable Visual Models From Natural Language Supervision", arXiv, pages 1-48 *
TIMO LUDDECKE: "Image Segmentation Using Text and Image Prompts", arXiv, pages 1-14 *
YOU HONGFENG; TIAN SHENGWEI; YU LONG; LYU YALONG: "Detection and Segmentation of Remote Sensing Images Based on Word Embedding", Acta Electronica Sinica, no. 01, pages 78-86 *
WEI QINGWEI: "Research on Referring Target Segmentation Methods Based on Text Expression", Journal of Test and Measurement Technology, pages 42-47 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758093A (en) * 2023-05-30 2023-09-15 首都医科大学宣武医院 Image segmentation method, model training method, device, equipment and medium
CN116758093B (en) * 2023-05-30 2024-05-07 首都医科大学宣武医院 Image segmentation method, model training method, device, equipment and medium
CN117314938A (en) * 2023-11-16 2023-12-29 中国科学院空间应用工程与技术中心 Image segmentation method and device based on multi-scale feature fusion decoding
CN117314938B (en) * 2023-11-16 2024-04-05 中国科学院空间应用工程与技术中心 Image segmentation method and device based on multi-scale feature fusion decoding
CN117992992A (en) * 2024-04-07 2024-05-07 武昌首义学院 Extensible satellite information data cloud platform safe storage method and system

Also Published As

Publication number Publication date
CN115761222B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN115761222B (en) Image segmentation method, remote sensing image segmentation method and device
CN112101165B (en) Interest point identification method and device, computer equipment and storage medium
CN111311578B (en) Object classification method and device based on artificial intelligence and medical image equipment
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN111127493A (en) Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN111739027B (en) Image processing method, device, equipment and readable storage medium
CN113538480A (en) Image segmentation processing method and device, computer equipment and storage medium
US20220215656A1 (en) Method, apparatus, device for image processing, and storage medium
CN114067119B (en) Training method of panorama segmentation model, panorama segmentation method and device
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN113204659B (en) Label classification method and device for multimedia resources, electronic equipment and storage medium
CN113411550B (en) Video coloring method, device, equipment and storage medium
CN116453121B (en) Training method and device for lane line recognition model
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN115496820A (en) Method and device for generating image and file and computer storage medium
CN115577768A (en) Semi-supervised model training method and device
CN116168274A (en) Object detection method and object detection model training method
CN116980541B (en) Video editing method, device, electronic equipment and storage medium
CN117253044B (en) Farmland remote sensing image segmentation method based on semi-supervised interactive learning
CN114283315A (en) RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion
CN117540221A (en) Image processing method and device, storage medium and electronic equipment
CN116361502B (en) Image retrieval method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant