CN115761222B - Image segmentation method, remote sensing image segmentation method and device - Google Patents

Image segmentation method, remote sensing image segmentation method and device

Info

Publication number
CN115761222B
CN115761222B
Authority
CN
China
Prior art keywords
image
vector
text
feature
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211182413.7A
Other languages
Chinese (zh)
Other versions
CN115761222A (en)
Inventor
于超辉
周强
王志斌
王帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211182413.7A priority Critical patent/CN115761222B/en
Publication of CN115761222A publication Critical patent/CN115761222A/en
Application granted granted Critical
Publication of CN115761222B publication Critical patent/CN115761222B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the present specification provide an image segmentation method, a remote sensing image segmentation method and corresponding apparatuses. The image segmentation method comprises the following steps: acquiring an image to be segmented; performing feature extraction on the image to be segmented to obtain a global image feature vector and a local image feature vector; constructing a prompt text vector from a random text vector, the global image feature vector and preset category labels; performing feature extraction on the prompt text vector to obtain a text feature vector; and determining a segmentation result of the image to be segmented through feature compilation according to the local image feature vector and the text feature vector. By fully mining the deep features of each individual image to be segmented, the segmentation result better matches the way people actually use images, downstream reprocessing of the segmentation result produces good results, and both the accuracy of the segmentation result and the user experience are improved.

Description

Image segmentation method, remote sensing image segmentation method and device
Technical Field
The embodiment of the specification relates to the technical field of image processing, in particular to an image segmentation method and a remote sensing image segmentation method.
Background
With the development of computer technology, artificial intelligence has been widely applied in the field of image processing. Image segmentation is a technique that divides an image to be segmented into several image regions of different types according to certain segmentation conditions, and machine learning has greatly improved both the effectiveness and the efficiency of segmentation.
At present, image segmentation based on machine learning mainly relies on the image features of the image itself: a neural network model is pre-trained on training samples in a supervised or unsupervised manner, and the trained model is then used to segment the image to be segmented.
However, because the neural network model is pre-trained only on the image features, other features are not fully utilized: the deep features of the image are not fully mined when segmenting image regions, the result does not match the way people actually use images, and downstream reprocessing of the segmentation result yields poor results. The segmentation result is therefore inaccurate and the user experience suffers. Thus, there is a need for a more accurate image segmentation method with a better user experience.
Disclosure of Invention
In view of this, the present embodiment provides an image segmentation method. One or more embodiments of the present disclosure relate to a remote sensing image segmentation method, an image segmentation apparatus, a remote sensing image segmentation apparatus, a computing device, a computer-readable storage medium, and a computer program, which solve the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present specification, there is provided an image segmentation method, including:
acquiring an image to be segmented;
extracting features of the image to be segmented to obtain a global image feature vector and a local image feature vector;
constructing a prompt text vector according to the random text vector, the global image feature vector and the preset category label;
extracting features of the prompt text vector to obtain a text feature vector;
and determining a segmentation result of the image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
According to a second aspect of embodiments of the present disclosure, there is provided a remote sensing image segmentation method, including:
receiving a remote sensing image segmentation instruction input by a user, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmented object;
extracting features of the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector;
constructing a prompt text vector according to the random text vector, the global image feature vector and the category label;
extracting features of the prompt text vector to obtain a text feature vector;
and determining a segmentation result aiming at the target segmentation object in the remote sensing image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
According to a third aspect of embodiments of the present specification, there is provided an image segmentation apparatus, comprising:
the first acquisition module is configured to acquire an image to be segmented;
the first extraction module is configured to extract features of the image to be segmented to obtain a global image feature vector and a local image feature vector;
the first construction module is configured to construct a prompt text vector according to the random text vector, the global image feature vector and the preset category label;
the second extraction module is configured to perform feature extraction on the prompt text vector to obtain a text feature vector;
the first segmentation module is configured to determine a segmentation result of the image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
According to a fourth aspect of embodiments of the present specification, there is provided a remote sensing image segmentation apparatus, comprising:
the remote sensing image segmentation module is configured to receive a remote sensing image segmentation instruction input by a user, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmented object;
the third extraction module is configured to extract characteristics of the remote sensing image to be segmented to obtain a global image characteristic vector and a local image characteristic vector;
The second construction module is configured to construct a prompt text vector according to the random text vector, the global image feature vector and the category label;
the fourth extraction module is configured to perform feature extraction on the prompt text vector to obtain a text feature vector;
the second segmentation module is configured to determine a segmentation result aiming at the target segmented object in the remote sensing image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
According to a fifth aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer executable instructions that, when executed by the processor, implement the steps of the image segmentation method or the remote sensing image segmentation method described above.
According to a sixth aspect of embodiments of the present specification, there is provided a computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the steps of the above-described image segmentation method or remote sensing image segmentation method.
According to a seventh aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-described image segmentation method or remote sensing image segmentation method.
In one or more embodiments of the present specification, an image to be segmented is acquired; feature extraction is performed on the image to be segmented to obtain a global image feature vector and a local image feature vector; a prompt text vector is constructed from a random text vector, the global image feature vector and preset category labels; feature extraction is performed on the prompt text vector to obtain a text feature vector; and a segmentation result of the image to be segmented is determined through feature compilation according to the local image feature vector and the text feature vector. Because the prompt text vector is built from the global image feature vector, the random text vector and the preset category labels, the image is subsequently segmented using both text features and image features, and the deep features of each image to be segmented are fully mined. The segmentation result therefore better matches the way people actually use images, downstream reprocessing of the result yields good results, the accuracy of the segmentation result is improved, and the user experience is improved.
Drawings
FIG. 1 is a flow chart of an image segmentation method provided in one embodiment of the present disclosure;
FIG. 2 is a flowchart of a remote sensing image segmentation method according to an embodiment of the present disclosure;
FIG. 3 is a process flow diagram of an image segmentation method for entity identification applied to a remote sensing image according to one embodiment of the present disclosure;
FIG. 4 is a system architecture diagram of an image segmentation system according to one embodiment of the present disclosure;
FIG. 5A is a schematic diagram of a remote sensing image to be segmented according to a remote sensing image segmentation method provided in an embodiment of the present disclosure;
FIG. 5B is a schematic diagram of a segmentation result of a remote sensing image to be segmented according to a remote sensing image segmentation method provided in an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an image segmentation apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a remote sensing image segmentation apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. However, this specification can be embodied in many forms other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the specification is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present specification. The word "if" as used herein may, depending on the context, be interpreted as "when", "upon" or "in response to a determination".
First, terms related to one or more embodiments of the present specification will be explained.
ImageNet: a large image database for image segmentation, wherein image data is appended with category labels.
CLIP (Learning Transferable Visual Models From Natural Language Supervision): a method that guides image segmentation through text features. A neural network model is pre-trained on the correspondence between images and texts, enabling deep feature mining of the image to be segmented and accurate image segmentation.
DenseCLIP (Language-Guided Dense Prediction with Context-Aware Prompting): a method that guides image segmentation with text features and achieves pixel-text level image segmentation by densely analysing the image features.
SIFT (Scale-Invariant Feature Transform): an image feature extraction algorithm that extracts image features by searching for extreme points in scale space and extracting their position, scale and rotation invariants.
HOG (Histogram of Oriented Gradients): an image feature extraction algorithm that constructs image features by computing and accumulating histograms of oriented gradients over local regions of an image.
ORB (Oriented FAST and Rotated BRIEF): an image feature extraction algorithm that finds a number of key points in an image and computes a feature vector for each key point from the pixel-value variations around it.
VGG model (Visual Geometry Group Network): a neural network model characterized by small convolution kernels, small pooling layers, a deeper architecture and wider feature maps.
ResNet model: a neural network model having a supermultilayer network structure and a residual processing module. Including ResNet-50, resNet-101, etc.
ViT model (Vision Transformer): an image feature extraction model that uses a dedicated vector mapping layer to obtain image feature vectors of fixed dimension and then extracts image features with a Transformer model.
CNN (Convolutional Neural Network): a multi-layer convolutional neural network model trained with forward propagation and back propagation.
One-hot encoding: a text feature extraction scheme that uses an N-bit state register to encode N text states; each state has its own register bit and only one bit is valid at any time, yielding a valid text feature vector.
TF-IDF (Term Frequency-Inverse Document Frequency): a text feature extraction algorithm that obtains a text feature vector by counting the term frequency (TF) of each word and weighting it with an inverse document frequency (IDF) factor.
Transformer model: a neural network model based on an attention mechanism can extract and analyze semantic features of natural language texts through the attention mechanism to generate target texts. The model structure of the transducer model comprises an encoder and a decoder, wherein the encoder comprises an embedding layer, natural language texts of an input model can be encoded into feature vectors through embedding calculation, then text contexts represented by the feature vectors are analyzed based on an attention mechanism, and target texts are obtained through output of the decoding layer.
FCN model (Fully Convolutional Network): an image classification model obtained by replacing the fully connected layers of a CNN with convolutional layers, so that images of arbitrary size can be used as input.
U-Net model: an image classification model with forward path is a full convolution network structure. The method comprises a compression path and an expansion path, wherein the downsampling processing is carried out on an input image with a certain resolution on the compression path, the downsampled image is expanded on the expansion path to obtain an output image with a corresponding resolution, and the U-Net has strong retention capacity on local characteristics, so that local details of the output image have high reduction degree.
FPN model (Feature Pyramid Network): an image classification model based on multi-scale image features. Feature vectors of the image at different scales are obtained by upsampling, entity categories in the image are predicted at each scale, and the predictions are aggregated into a final result, classifying the entities in the image. Variants include the Semantic FPN model, which incorporates text features.
Automatic gradient update methods: algorithms that automatically adjust model parameters, including SGD, MBGD, Momentum, Nesterov Momentum, Adagrad, AdaDelta, RMSprop, Adam, etc.
At present, CLIP pre-trains a neural network model on corresponding text and image samples and uses it to segment the image to be segmented. However, the features of the specific image to be segmented are not fully exploited, so the segmentation result depends entirely on how well the neural network model was trained. When the training samples are insufficient or the model is under-trained, the segmentation result is not accurate enough, downstream processing of the result cannot match the way people actually use images or produce good results, and the user experience suffers.
DenseCLIP builds on CLIP by densifying the image features, deriving pixel-text correspondences from the image-text correspondences, which improves segmentation accuracy to some extent. However, it uses a prompt text vector of a single uniform form for subsequent segmentation, so it cannot generate a more targeted prompt text vector from the high-dimensional features of each individual image, and thus cannot obtain more targeted text features to combine with the image features and raise the segmentation accuracy for a single image. In essence, the static analysis capability obtained by pre-training the neural network model cannot fully exploit the features of the image to be segmented; the segmentation result depends entirely on the training effect, the model's performance is hard to improve further, the segmentation result is not accurate enough, downstream processing cannot match the way people actually use images or produce good results, and the user experience suffers.
Based on the above-described problems, in the present specification, an image segmentation method is provided, and the present specification relates to a remote sensing image segmentation method, an image segmentation apparatus, a remote sensing image segmentation apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Referring to fig. 1, fig. 1 shows a flowchart of an image segmentation method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step S102: and acquiring an image to be segmented.
The image to be segmented is a multimedia image containing several entities. It may be a real image captured by an image acquisition device or a virtual image generated by image generation software, and it may take the form of a picture, a video frame, etc.; neither is limited herein.
The image to be segmented may be obtained by receiving it from a user or by retrieving it from a local or remote database.
Illustratively, an image image_1 to be segmented sent by a user is received.
By acquiring the image to be segmented, a feature material foundation is provided for subsequent image feature extraction.
Step S104: and extracting features of the image to be segmented to obtain a global image feature vector and a local image feature vector.
The global image features are image features characterizing global, high-dimensional properties of the image to be segmented, such as its color, texture, shape, structure and entity distribution; the global image feature vector is a high-dimensional vector of these features. The local image features are image features characterizing local, low-dimensional properties of the image to be segmented, such as pixels and entity edges; the local image feature vector is a low-dimensional vector of these features and may take the form of several feature maps characterizing image features at different scales.
Feature extraction is performed on the image to be segmented to obtain the global image feature vector and the local image feature vector; specifically, an image feature extraction algorithm is applied to the image to be segmented. The algorithm may be a non-machine-learning algorithm such as SIFT, HOG or ORB, or a machine learning algorithm such as a VGG, ResNet, CNN or ViT model. A pre-trained image feature extraction model comprises a global feature extraction module and a local feature extraction module.
By way of example, feature extraction is performed on the image to be segmented with the SIFT algorithm, obtaining the global image feature vector I and the local image feature vector Image Embedding.
By performing feature extraction on the image to be segmented, both the global and the local image feature vectors are obtained. The method is thus not limited to local image features: the local features provide the feature vector basis for the subsequent image segmentation, while the global image features lay the foundation for the subsequent construction of the prompt text vector.
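As an illustration of this step only, the following sketch shows how a single backbone can expose both a global image feature vector and a local feature map. The tiny CNN, the layer sizes and the names ToyImageEncoder and global_head are hypothetical stand-ins for the pre-trained encoder (VGG/ResNet/ViT) named in the text, not the patent's actual model.

```python
import torch
import torch.nn as nn

class ToyImageEncoder(nn.Module):
    """Illustrative stand-in for the pre-trained image encoder."""
    def __init__(self, global_dim=1024, local_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, local_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.global_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(local_dim, global_dim),
        )

    def forward(self, x):
        local_map = self.backbone(x)               # B x C x H/4 x W/4 local features
        global_vec = self.global_head(local_map)   # B x global_dim summary vector
        return global_vec, local_map

enc = ToyImageEncoder()
g, l = enc(torch.randn(1, 3, 512, 512))
print(g.shape, l.shape)  # torch.Size([1, 1024]) torch.Size([1, 512, 128, 128])
```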
Step S106: and constructing a prompt text vector according to the random text vector, the global image feature vector and the preset category label.
The random text vector is a text vector generated from random noise; it does not initially correspond to any entity category, and the corresponding entity category information is obtained through subsequent feature extraction, so that it comes to characterize the corresponding entity category features. Specifically, random noise is acquired and text-vector encoding is performed on it to generate the random text vector.
The preset category labels are the category label vectors of the sample entities in the sample images, obtained after pre-training the neural network model. For example, if a sample image contains 10 entities (a chair, a table, a desk lamp, etc.), pre-training the neural network model on this sample image identifies the categories of the 10 entities and assigns the corresponding category labels. In the subsequent feature compilation, the preset category labels guide the compilation of the local image features, so that the corresponding entities can be segmented from the image to be segmented.
The prompt text vector is a text vector in which features of other modalities are attached to the text features; through the correspondence of multi-modal features, it specifies the processing direction of the image processing.
The prompt text vector is constructed from the random text vector, the global image feature vector and the preset category labels by vector fusion. The fusion may be feature fusion using a fully connected layer of a neural network model, or direct vector splicing; neither is limited herein.
Illustratively, feature fusion is performed on the random text vector V, the global image feature vector I and the preset category label CLS using a fully connected layer of the neural network model, obtaining the prompt text vector Prompt.
By constructing a prompt text vector from the random text vector, the global image feature vector and the preset category labels, the image is subsequently segmented using both text features and image features. The deep features of each individual image to be segmented are fully mined, so that every image to be segmented has its own dynamic prompt text vector, further strengthening the correlation between text features and image features.
Step S108: and extracting the characteristics of the prompt text vector to obtain a text characteristic vector.
The text feature vector is a fusion vector containing entity category features, text features and global image features in the image to be segmented.
Feature extraction is performed on the prompt text vector with a text feature extraction algorithm to obtain the text feature vector. The algorithm may be a non-machine-learning algorithm such as one-hot encoding or TF-IDF, or a machine learning algorithm such as a Transformer model and its derivatives.
Illustratively, the Text feature vector Text Embedding is obtained by feature extraction of the Prompt Text vector Prompt by using TF-IDF.
Extracting features from the prompt text vector yields a richer, more targeted, deeper text feature vector, which provides the feature vector basis for the subsequent feature compilation and segmentation and improves the accuracy of the subsequent segmentation result.
Step S110: and determining a segmentation result of the image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
Feature compilation aligns the feature vectors according to their correlation and then segments the image to be segmented using the aligned feature vectors. Specifically, feature compilation comprises feature alignment and image segmentation. Feature alignment aligns the local image feature vector with the text feature vector, establishing an image-text correspondence on the local image features.
Illustratively, an aligned feature vector Embedding is obtained through feature alignment of the local image feature vector Image Embedding and the text feature vector Text Embedding, and the image to be segmented is segmented according to the feature vector Embedding.
In the embodiments of the present specification, an image to be segmented is acquired; feature extraction is performed on it to obtain a global image feature vector and a local image feature vector; a prompt text vector is constructed from a random text vector, the global image feature vector and preset category labels; feature extraction is performed on the prompt text vector to obtain a text feature vector; and a segmentation result is determined through feature compilation according to the local image feature vector and the text feature vector. Because the prompt text vector is built from the global image feature vector, the random text vector and the preset category labels, the image is segmented using both text and image features, and the deep features of each image to be segmented are fully mined. The segmentation result therefore better matches the way people actually use images, downstream reprocessing yields good results, and both accuracy and user experience are improved. An end-to-end sketch of these steps is given below.
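The following minimal sketch ties steps S102-S110 together under the assumption that the three sub-models are supplied as callables; every name (PromptSegmenter, ctx_dim) and the token-wise prompt construction are illustrative assumptions, not the claimed implementation, and the details are expanded in the embodiments below.

```python
import torch
import torch.nn as nn

class PromptSegmenter(nn.Module):
    """Hypothetical wrapper tying steps S102-S110 together (batch size 1)."""
    def __init__(self, image_encoder, text_encoder, decoder,
                 num_classes=20, ctx_dim=512, global_dim=1024):
        super().__init__()
        self.image_encoder = image_encoder  # image -> (global_vec, local_map)
        self.text_encoder = text_encoder    # prompt vectors -> text features
        self.decoder = decoder              # (local, text) -> segmentation
        # learnable random text vectors and preset class label vectors
        self.random_ctx = nn.Parameter(torch.randn(num_classes, ctx_dim))
        self.cls_labels = nn.Parameter(torch.randn(num_classes, ctx_dim))
        self.projector = nn.Linear(global_dim, ctx_dim)  # dimension mapping

    def forward(self, image):
        g, local_map = self.image_encoder(image)           # step S104
        g = self.projector(g)                              # map 1024 -> 512
        # step S106: prompt_n = {V_n, I', CLS_n}, one per preset class
        prompt = torch.stack(
            [self.random_ctx, g.expand_as(self.random_ctx), self.cls_labels],
            dim=1)                                         # classes x 3 x 512
        text_feat = self.text_encoder(prompt)              # step S108
        return self.decoder(local_map, text_feat)          # step S110

# toy stand-ins so the sketch runs end to end (assumptions throughout)
img_enc = lambda x: (torch.randn(1, 1024), torch.randn(1, 512, 128, 128))
txt_enc = lambda p: p.mean(dim=1)                          # classes x 512
dec = lambda local, txt: torch.einsum('bchw,kc->bkhw', local, txt)
model = PromptSegmenter(img_enc, txt_enc, dec)
print(model(torch.randn(1, 3, 512, 512)).shape)            # 1 x 20 x 128 x 128
```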
Optionally, step S106 includes the following specific steps:
performing dimension mapping on the global image feature vector to obtain the global image feature vector with the same vector dimension as the random text vector;
and splicing the random text vector, the global image feature vector after dimension mapping and the preset category label to obtain a prompt text vector.
Dimension mapping maps vectors of different dimensions to a unified dimension for the subsequent vector calculation. A specific method is to use a mapper (Projector), which may be a fully connected layer module of a neural network model, or a preset transpose matrix that maps the vectors to the unified dimension; neither is limited herein.
The random text vector, the dimension-mapped global image feature vector and the preset category label are spliced to obtain the prompt text vector; specifically, the dimension-mapped global image feature vector is correspondingly spliced with the random text vectors and the preset category labels to obtain the prompt text vector.
For example, the random text vectors are Vn (V1-V20), 20 vectors of dimension 512; the preset category labels are CLSn, the corresponding 20 preset category labels; and the global image feature vector I is a 1024-dimensional vector. The global image feature vector is dimension-mapped through the transpose matrix T1 to obtain a 512-dimensional global image feature vector I'. Correspondingly splicing I' with the 20 random text vectors and preset category labels yields 20 512-dimensional prompt text vectors Prompt: Prompt1{V1+I'+CLS1}, Prompt2{V2+I'+CLS2}, Prompt3{V3+I'+CLS3} …… Prompt20{V20+I'+CLS20}.
Dimension mapping yields a global image feature vector with the same vector dimension as the random text vector, and splicing the random text vector, the dimension-mapped global image feature vector and the preset category labels yields the prompt text vector. Dimension mapping guarantees that the subsequent splicing is feasible, while the deep features of each image to be segmented are fully mined, so that every image to be segmented has its own dynamic prompt text vector, further strengthening the correlation between text and image features.
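A minimal sketch of the dimension mapping and splicing in this embodiment, using the sizes from the worked example (20 random text vectors and class labels of dimension 512, a 1024-dimensional global feature). The nn.Linear projector and the choice to stack V_n, I' and CLS_n as three prompt tokens are assumptions; a transpose matrix and flat concatenation would match the text equally well.

```python
import torch
import torch.nn as nn

# assumed sizes from the worked example above
num_prompts, txt_dim, img_dim = 20, 512, 1024
V = torch.randn(num_prompts, txt_dim)     # random text vectors V1..V20
CLS = torch.randn(num_prompts, txt_dim)   # preset category label vectors
I = torch.randn(img_dim)                  # global image feature vector I

projector = nn.Linear(img_dim, txt_dim)   # dimension mapping (Projector)
I_proj = projector(I)                     # 512-dimensional I'

# splice V_n, I' and CLS_n for each of the 20 prompts
prompt = torch.stack(
    [V, I_proj.expand(num_prompts, txt_dim), CLS], dim=1)
print(prompt.shape)  # torch.Size([20, 3, 512]) -> Prompt1..Prompt20
```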
Optionally, before step S110, the method further includes the following specific steps:
performing cross attention calculation on the text feature vector and the local image feature vector, and determining a target text feature vector;
and fine tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector.
Cross-attention calculation obtains the vector weights corresponding to each vector through the pre-trained neural network model. The cross-attention mechanism determines the deep feature relations between vectors through a weight matrix, so that the result vector obtained by weighting with the corresponding weights characterizes not only its own features but also the deep features of the related vectors.
And performing cross attention calculation on the text feature vector and the local image feature vector to determine a target text feature vector, wherein the specific mode is that the text feature vector and the local image feature vector are subjected to cross attention calculation to obtain corresponding vector weights, and the target text feature vector is obtained according to weighting calculation.
And fine tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector, wherein the text feature vector is fine-tuned based on the entity category feature of the target text vector to obtain the updated text feature vector.
Illustratively, cross-attention calculation is performed on the text feature vector Text Embedding and the local image feature vector Image Embedding to obtain the corresponding vector weights ω1 and ω2, and weighted calculation with ω1 and ω2 yields the target text feature vector Target Text Embedding. The text feature vector Text Embedding is then fine-tuned based on the entity category features of Target Text Embedding to obtain the updated text feature vector Text Embedding.
Cross-attention calculation is performed on the text feature vector and the local image feature vector to determine the target text feature vector, and the text feature vector is fine-tuned based on it to obtain the updated text feature vector. Through the cross-attention mechanism, the text feature vector and the local image feature vector further characterize each other's features, yielding a richer, deeper text feature vector and, in the subsequent feature compilation, a more accurate segmentation result.
Optionally, the cross attention calculation is performed on the text feature vector and the local image feature vector, and the target text feature vector is determined, including the following specific steps:
and performing cross attention calculation on the text feature vector and the local image feature vector by using a preset multi-layer structure translation model decoder, and determining a target text feature vector.
The multi-layer-structure translation model is a text translation model with a multi-hidden-layer structure, and its decoder is a decoding module that characterizes the deeper features between vectors. The multi-layer-structure translation model may be a Transformer model or one of its derivatives, which is not limited herein.
The multi-layer-structure translation model is a multi-hidden-layer neural network model with a cross-attention mechanism. Taking the Transformer model as an example, a corresponding QKV calculation (Query, Key, Value: a weighted fully connected layer calculation) is set up: the text feature vector is set as the query vector, the local image feature vector as the key vector and value vector, the corresponding vector weights ω1 and ω2 are obtained, and the target text feature vector is determined.
Illustratively, the text feature vector Text Embedding is set as the query vector Q, and the local image feature vector Image Embedding as the key vector K and value vector V; the corresponding vector weights ω1 and ω2 are obtained, and weighted calculation with the Transformer model decoding layer yields the target text feature vector Target Text Embedding.
Cross-attention calculation is performed on the text feature vector and the local image feature vector with a preset multi-layer-structure translation model decoder, and the target text feature vector is determined. The text feature vector and the local image feature vector thus further characterize each other's features, yielding a text feature vector with rich deep features; the subsequent feature compilation therefore produces a more accurate segmentation result, and both the efficiency of feature compilation and of image segmentation are improved.
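The QKV assignment above can be sketched with a standard multi-head attention layer, the text features as query and the flattened local image features as key and value. The single layer, the eight heads and the residual fine-tuning coefficient gamma are assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

embed_dim, num_classes, num_pixels = 512, 20, 16 * 16
text_emb = torch.randn(1, num_classes, embed_dim)   # Text Embedding (query)
image_emb = torch.randn(1, num_pixels, embed_dim)   # flattened local features

# one decoder layer of a multi-layer-structure translation model (Transformer)
attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
target_text, _ = attn(query=text_emb, key=image_emb, value=image_emb)

# fine-tune the text features with the attended result; a residual update
# with a small coefficient gamma is one plausible reading of "fine tuning"
gamma = 0.1
updated_text = text_emb + gamma * target_text        # updated Text Embedding
```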
Optionally, step S110 includes the following specific steps:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector;
and determining a segmentation result of the image to be segmented through feature compiling based on the feature alignment vector and the local image feature vector.
The feature alignment vector is a multi-modal feature comprising local image features and text features, the local image features and text features in the feature alignment vector having a spatial correspondence.
Multi-scale features are obtained by applying upsampling operations of different degrees to the local image features, while the text features remain the original text features; aligning the local image features at different scales with the text features yields multi-modal feature vectors (feature alignment vectors) at different scales. Specifically, the upsampling follows a preset sampling rule. For example, if the image to be segmented is 512x512 and feature extraction yields a 16x16 feature map for the local image feature vector, the feature map is upsampled three times, by factors of 2, 4 and 8, to obtain feature maps of 32x32, 64x64 and 128x128. In theory, the smaller the scale, the more accurately the feature map characterizes the local image features.
Multi-scale feature alignment is to perform feature alignment at a finer granularity on local image feature vectors based on text feature vectors, e.g., pixel-text level feature alignment such that each pixel corresponds to a characterized text feature. The feature alignment may be feature alignment of feature vectors using a neural network model trained in advance, feature alignment of feature vectors using a preset vector alignment matrix, or feature alignment of feature vectors by cross-multiplying feature vectors.
The local image feature vector and the text feature vector are subjected to multi-scale feature alignment to obtain feature pairs Ji Xiangliang, and the specific mode is that the local image feature vector is up-sampled to obtain a multi-scale local image feature vector, and the text feature vector is utilized to carry out fine-granularity feature alignment on the multi-scale local image feature vector to obtain a feature alignment vector.
Illustratively, the 16x16 local image feature vector Image Embedding is upsampled by factors of 2, 4 and 8 to obtain local image feature vectors Image Embedding {Image Embedding_1, Image Embedding_2, Image Embedding_3} of sizes 32x32, 64x64 and 128x128, and pixel-text level feature alignment is performed on each of them with the text feature vector to obtain the feature alignment vector Multi-scale Alignment.
And carrying out multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector, and determining a segmentation result of the image to be segmented through feature compiling based on the feature alignment vector and the local image feature vector. The accuracy of the feature alignment vector is ensured, so that the segmentation result obtained by subsequent feature compiling is more accurate.
Optionally, performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector, including the following specific steps:
and performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector.
The feature alignment vector is obtained by cross-multiplying the local image feature vector and the text feature vector; specifically, the local image feature vector is multiplied by the transpose matrix of the text feature vector.
Illustratively, the feature alignment vector Multi-scale Alignment is obtained by multiplying the local image feature vector Image Embedding by the transpose matrix (Text Embedding)^T of the text feature vector Text Embedding.
And performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector. Feature alignment is carried out on the local image feature vector and the text feature vector, the feature alignment vector is obtained through calculation, the accuracy of the feature alignment vector is guaranteed, and the accuracy of a segmentation result obtained through subsequent feature compiling is guaranteed.
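A minimal sketch of the cross-multiplication at one scale: flattening the local image feature map and multiplying it by the transpose of the text feature matrix yields a pixel-text alignment map with one channel per preset category. The shapes follow the 32x32 example above; the batch handling is an assumption.

```python
import torch

B, C, H, W, K = 1, 512, 32, 32, 20        # K preset categories
image_emb = torch.randn(B, C, H, W)        # one scale of the local features
text_emb = torch.randn(B, K, C)            # updated text feature vectors

# cross-multiplication: local image features times (Text Embedding)^T
flat = image_emb.flatten(2).transpose(1, 2)        # B x (H*W) x C
align = flat @ text_emb.transpose(1, 2)            # B x (H*W) x K scores
align = align.transpose(1, 2).reshape(B, K, H, W)  # feature alignment map
```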
Optionally, determining a segmentation result of the image to be segmented through feature compiling based on the feature alignment vector and the local image feature vector comprises the following specific steps:
splicing the feature alignment vector and the local image feature vector in corresponding dimensions to obtain a spliced feature vector;
and performing feature compiling on the spliced feature vectors to obtain a segmentation result of the image to be segmented.
The local image feature vector in this embodiment is a multi-scale local image feature vector; it may be the local image feature vector subjected to multi-scale upsampling in the previous embodiment, or one subjected to a separate multi-scale upsampling, which is not limited herein.
The feature alignment vector and the local image feature vector are spliced along the corresponding dimensions to obtain the spliced feature vector; specifically, the feature alignment vector is spliced with each of the multi-scale local feature vectors along their corresponding dimensions.
Illustratively, the feature alignment vector Multi-scale Alignment is spliced along the corresponding dimensions with the local image feature vectors Image Embedding {Image Embedding_1, Image Embedding_2, Image Embedding_3} of sizes 32x32, 64x64 and 128x128 to obtain the spliced feature vector Concatenate Embedding.
By splicing the feature alignment vector with the local image feature vector along the corresponding dimensions and compiling the spliced feature vector, the segmentation result of the image to be segmented is obtained. The local image features are paired in advance with richer, deeper and more accurate alignment features, while over-fitting of partial features in the preceding feature alignment, which would prevent the local image features from being fully reflected in the feature compilation, is avoided; the accuracy of the segmentation result is thus improved.
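The corresponding-dimension splicing can be sketched as channel-wise concatenation of each scale's alignment map with its local feature map; torch.cat on the channel axis is one plausible reading of "corresponding dimension", and the shapes carry over from the example above.

```python
import torch

B, C, K = 1, 512, 20
concat_per_scale = []
for size in (32, 64, 128):                      # the three example scales
    image_emb = torch.randn(B, C, size, size)   # Image Embedding_i
    align = torch.randn(B, K, size, size)       # alignment map at this scale
    # splice along the channel (feature) dimension
    concat_per_scale.append(torch.cat([image_emb, align], dim=1))
print([t.shape for t in concat_per_scale])      # B x (C+K) x size x size each
```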
Optionally, step S104 includes the following specific steps:
inputting the image to be segmented into a pre-trained image encoder, and extracting the characteristics of the image to be segmented by using the image encoder to obtain a global image characteristic vector and a local image characteristic vector;
correspondingly, the step S108 includes the following specific steps:
inputting the prompt text vector into a text encoder, and extracting features of the prompt text vector by using the text encoder to obtain a text feature vector;
correspondingly, the step S110 includes the following specific steps:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector;
And inputting the local image feature vectors and the feature alignment vectors into a pre-trained decoder, and performing feature compiling on the local image feature vectors and the feature alignment vectors by using the decoder to determine the segmentation result of the image to be segmented.
In this embodiment, the local image feature vector is a multi-scale local image feature vector, i.e., a set of corresponding multi-scale feature maps.
The image encoder is a pre-trained image feature extraction model, a neural network model such as a VGG, ResNet, CNN or ViT model, comprising a global image feature extraction module and a local image feature extraction module. The text encoder is a pre-trained text feature extraction model, a neural network model such as a Transformer model or one of its derivatives. The decoder is an image classification model, a neural network model that identifies and classifies the different entities in an image to obtain the segmentation result of the image to be segmented; it may be an FCN, U-Net or FPN model, etc.
The method comprises the steps of inputting an image to be segmented into a pre-trained image encoder, carrying out feature extraction on the image to be segmented by using the image encoder to obtain a global image feature vector and a local image feature vector.
The local image feature vectors and the feature alignment vectors are input into the pre-trained decoder, which compiles them to determine the segmentation result of the image to be segmented. The entity-category classification of the local image feature vectors may be performed at different granularities, for example at 2x2, 4x4 or 8x8 granularity, or at the 1x1 pixel level, determining the corresponding entity category for each pixel of the feature maps at each scale. A specific method is to calculate the confidence of each entity category for every pixel and assign the one or more categories with the highest confidence to that pixel.
For example, an image to be segmented (containing 5 entities: entity 1, entity 2, entity 3, entity 4, entity 5) is input into a pre-trained VGG model; the global image feature extraction module of the VGG model yields the global image feature vector I, and its local image feature extraction module yields the local image feature vector Image Embedding. The prompt text vector Prompt is input into a Transformer model, which extracts its features to obtain the text feature vector Text Embedding. The local image feature vector Image Embedding and the feature alignment vector Multi-scale Alignment are input into a pre-trained Semantic FPN model, which compiles them: the entity categories of the local image feature vector are classified at the pixel level based on Multi-scale Alignment, the confidence of each entity category is determined for every pixel, and the category with the highest confidence is assigned to the pixel. This yields the classification result (entity 1: table, entity 2: chair, entity 3: desk lamp, entity 4: cat, entity 5: person) and thus the segmentation result of the image to be segmented.
Using a pre-trained image encoder for feature extraction improves the accuracy of the global and local image feature vectors, and hence of the prompt text vector, the subsequent feature alignment vector and the image segmentation. Using a pre-trained text encoder improves the accuracy of the text feature vector, the subsequent feature alignment vector and the segmentation. Using a pre-trained decoder improves the segmentation accuracy. At the same time, the pre-trained image encoder, text encoder and decoder improve the efficiency of the image segmentation.
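To illustrate the pixel-level classification performed by the decoder, here is a toy stand-in for a Semantic-FPN-style head: it maps the spliced features to per-pixel class scores, upsamples to the input resolution, and takes the highest-confidence category per pixel. The two-layer head and all sizes are assumptions, not the pre-trained decoder itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoder(nn.Module):
    """Stand-in for the pre-trained decoder (FCN/U-Net/Semantic FPN)."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, concat_feat, out_size):
        logits = self.head(concat_feat)               # per-pixel class scores
        logits = F.interpolate(logits, size=out_size,
                               mode='bilinear', align_corners=False)
        return logits.argmax(dim=1)  # highest-confidence category per pixel

dec = ToyDecoder(in_ch=512 + 20, num_classes=20)
seg = dec(torch.randn(1, 532, 128, 128), out_size=(512, 512))
print(seg.shape)  # torch.Size([1, 512, 512]) segmentation mask
```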
Optionally, the method further comprises the following specific steps:
acquiring a sample image set, wherein the sample image set comprises a plurality of sample images and label images corresponding to the sample images;
extracting a first sample image and a first label image corresponding to the first sample image from a sample image set, wherein the first sample image is any sample image;
inputting the first sample image into a preset image encoder, and extracting the characteristics of the first sample image by using the image encoder to obtain a first global image characteristic vector and a first local image characteristic vector;
Constructing a first prompt text vector according to the random text vector, the first global image feature vector and the preset category label;
inputting the first prompt text vector into a text encoder, and extracting features of the first prompt text vector by using the text encoder to obtain a first text feature vector;
performing multi-scale feature alignment on the first local image feature vector and the first text feature vector to obtain a first feature alignment vector;
inputting the first local image feature vector and the first feature alignment vector into a pre-trained decoder, and performing feature compiling on the first local image feature vector and the first feature alignment vector by using the decoder to determine a segmentation result of the first sample image;
determining a total loss according to the segmentation result of the first sample image, the first feature alignment vector and the first label image;
based on the total loss, parameters of the image encoder and decoder are adjusted, and the step of extracting the first sample image and the first label image corresponding to the first sample image from the sample image set is performed back until the training stopping condition is reached.
The sample image set is pre-constructed and comprises several sample images and the label image corresponding to each sample image. A sample image is a multimedia image sample containing several entities; its label image is the sample annotated in advance with entity categories. The annotation may be manual, or produced by a pre-trained entity annotation algorithm, which may be a pixel-level classification method based on pixel values or a method that classifies image features with a neural network model; neither is limited herein. The sample image set may be built from segmentation results obtained in advance, retrieved from a corresponding open-source database, or obtained by manually annotating the sample images; none of these is limited herein.
The first global image feature is an image feature representing a global high-dimensional feature of the first sample image and is used for representing the color, texture, shape, structure, entity distribution and other features of the image. The first global image feature vector is a high-dimensional vector of the first global image feature.
The first local image feature is an image feature characterizing a local low-dimensional feature of the first sample image, and is used for characterizing pixels, physical edges, and the like of the image. The first local image feature vector is a low-dimensional vector of the first local image feature. The first local image feature vector may be characterized as a plurality of feature vector diagrams characterizing image features of different dimensions.
The first prompt text vector is a text vector with other modal features attached to the first text feature, and is used for correspondingly prescribing a training direction of model training through the multi-modal features in the model training process. The first feature alignment vector is a multi-modal feature comprising a first local image feature and a first text feature, the first local image feature and the first text feature in the first feature alignment vector having a spatial correspondence.
The total loss is obtained by taking the first label image as the verification image, computing component loss values for the segmentation result of the first sample image and for the first feature alignment vector respectively, and combining these component losses into a value used to evaluate the performance of the image encoder and the decoder.
The stop-training condition is a preset termination condition for model training: it may be a preset number of training iterations, so that iterative training ends when that number is reached, or a loss-value threshold, so that training ends when the total loss meets the threshold.
The total loss is determined from the segmentation result of the first sample image, the first feature alignment vector and the first label image: with the first label image as the verification image, component loss values are computed for the segmentation result and for the first feature alignment vector, and the total loss used to evaluate the image encoder and decoder is determined from these components.
The parameters of the image encoder and the decoder are adjusted based on the total loss, and the process returns to the step of extracting the first sample image and the first label image corresponding to the first sample image from the sample image set, until the training stopping condition is reached.
Illustratively, a sample image set Image Sample is obtained, comprising a plurality of sample images {Image Sample 1, Image Sample 2 … Image Sample n} and the corresponding label images {Image Label 1, Image Label 2 … Image Label n}. A first sample image Sample m and its corresponding first label image Label m are extracted from the sample image set. Sample m is input into a preset image encoder, which performs feature extraction to obtain a first global image feature vector I(m) and a first local image feature vector Image Feature(m). A first prompt text vector Prompt m is constructed from a random text vector V, the first global image feature vector I(m) and a preset category label CLS, and is input into a text encoder, which performs feature extraction to obtain a first text feature vector Text Embedding(m). Multi-scale feature alignment is performed on Image Feature(m) and Text Embedding(m) to obtain a first feature alignment vector Multi-scale Alignment(m). Image Feature(m) and Multi-scale Alignment(m) are input into a pre-trained decoder, which performs feature compiling to determine a segmentation result Result m of the first sample image. A total Loss is determined according to Result m, Multi-scale Alignment(m) and the first label image Label m; the parameters of the image encoder and the decoder are adjusted by an automatic gradient updating method based on the total Loss, and the process returns to the step of extracting a first sample image and its corresponding first label image from the sample image set, until the training stopping condition is reached.
A sample image set is obtained, comprising a plurality of sample images and the label images corresponding to the sample images. A first sample image (any sample image) and its corresponding first label image are extracted from the set. The first sample image is input into a preset image encoder, which performs feature extraction to obtain a first global image feature vector and a first local image feature vector. A first prompt text vector is constructed from a random text vector, the first global image feature vector and a preset category label, and is input into the text encoder, which performs feature extraction to obtain a first text feature vector. Multi-scale feature alignment is performed on the first local image feature vector and the first text feature vector to obtain a first feature alignment vector. The first local image feature vector and the first feature alignment vector are input into a pre-trained decoder, which performs feature compiling to determine a segmentation result of the first sample image. A total loss is determined according to the segmentation result of the first sample image, the first feature alignment vector and the first label image; the parameters of the image encoder and the decoder are adjusted based on the total loss, and the process returns to the step of extracting the first sample image and the corresponding first label image from the sample image set, until the training stopping condition is reached. The image encoder and the decoder are thus trained in a supervised manner with the first image sample and the first label image, their parameters are adjusted through the total loss, and training ends once the stopping condition is reached, yielding a trained image encoder and decoder. This ensures the performance and accuracy of the trained model and improves the accuracy of image segmentation.
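For reference, the supervised training flow above can be condensed into the following minimal sketch in PyTorch style. Every name here (image_encoder, text_encoder, decoder, build_prompt, align_features, total_loss_fn, and the hyperparameter values) is a hypothetical placeholder standing in for the components described in this specification, not an API it defines.

```python
import torch

def train(image_encoder, text_encoder, decoder, build_prompt, align_features,
          total_loss_fn, dataset, random_text_vec, class_labels,
          max_steps=10000, loss_threshold=0.05):
    params = (list(image_encoder.parameters()) + list(text_encoder.parameters())
              + list(decoder.parameters()))
    optimizer = torch.optim.AdamW(params, lr=1e-4)
    step = 0
    while True:
        for sample_img, label_img in dataset:
            # Feature extraction: one global vector and multi-scale local maps.
            global_feat, local_feats = image_encoder(sample_img)
            # Prompt = [random text vector; mapped global feature; class labels].
            prompt = build_prompt(random_text_vec, global_feat, class_labels)
            text_feat = text_encoder(prompt)
            # Multi-scale alignment of local image and text features.
            align_vec = align_features(local_feats, text_feat)
            seg_result = decoder(local_feats, align_vec)
            loss = total_loss_fn(seg_result, align_vec, label_img)
            optimizer.zero_grad()
            loss.backward()  # automatic gradient updating
            optimizer.step()
            step += 1
            # Stop at the preset iteration count or loss-value threshold.
            if step >= max_steps or loss.item() <= loss_threshold:
                return image_encoder, text_encoder, decoder
```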
Optionally, determining the total loss according to the segmentation result of the first sample image, the first feature alignment vector and the first label image includes the following specific steps:
calculating segmentation loss by using a preset segmentation loss function according to a segmentation result of the first sample image and the first label image;
calculating alignment loss according to the first feature alignment vector and the first label image by using a preset alignment loss function;
calculating contrast loss according to the first feature alignment vector and the first label image by using a preset contrast loss function;
the segmentation, alignment and contrast losses are weighted to obtain the total loss.
At present, methods such as DenseCLIP adjust the parameters of the model with a preset category label and the segmentation loss alone. The constraint imposed by a single loss value is weak: when the features are poorly aligned and poorly contrasted, it is difficult to guarantee the training effect of the samples on the model, which seriously affects the accuracy of the segmentation result. In the embodiment of the specification, the alignment loss and the contrast loss are added, strengthening the constraint, improving the training effect on the model, and improving the accuracy of the segmentation result.
The segmentation loss is a classification loss value characterizing the entity categories in the segmentation result. The alignment loss is the spatial loss after alignment between the features in the feature alignment vector, and the contrast loss is the correspondence loss of each feature in the feature alignment vector.
The segmentation loss, the alignment loss and the contrast loss are weighted to obtain the total loss; that is, the total loss is obtained by weighting the three losses with preset loss weights. The weighting is calculated using Formula 1:

Loss = γ1·Loss_seg + γ2·Loss_Align + γ3·Loss_Contrast    (Formula 1)

where Loss denotes the total loss, γ1 the weight of the segmentation loss, Loss_seg the segmentation loss, γ2 the weight of the alignment loss, Loss_Align the alignment loss, γ3 the weight of the contrast loss, and Loss_Contrast the contrast loss.
Illustratively, the segmentation loss is calculated as 0.18 with the preset segmentation loss function from the segmentation result of the first sample image and the first label image; the alignment loss is calculated as 0.24 with the preset alignment loss function from the first feature alignment vector and the first label image; and the contrast loss is calculated as 0.36 with the preset contrast loss function from the first feature alignment vector and the first label image. With the weight of the segmentation loss as 0.4, the weight of the alignment loss as 0.2 and the weight of the contrast loss as 0.4, weighting the three losses with Formula 1 gives a total loss of 0.264.
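A minimal sketch of the weighted combination in Formula 1 follows, reproducing the worked example above; the weight values are the illustrative ones from the example, not values prescribed by the specification.

```python
def total_loss(loss_seg: float, loss_align: float, loss_contrast: float,
               gamma1: float = 0.4, gamma2: float = 0.2,
               gamma3: float = 0.4) -> float:
    # Formula 1: Loss = γ1·Loss_seg + γ2·Loss_Align + γ3·Loss_Contrast
    return gamma1 * loss_seg + gamma2 * loss_align + gamma3 * loss_contrast

print(total_loss(0.18, 0.24, 0.36))  # ≈ 0.264, matching the worked example
```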
According to the segmentation result of the first sample image and the first label image, the segmentation loss is calculated using a preset segmentation loss function; according to the first feature alignment vector and the first label image, the alignment loss is calculated using a preset alignment loss function; according to the first feature alignment vector and the first label image, the contrast loss is calculated using a preset contrast loss function; and the segmentation loss, the alignment loss and the contrast loss are weighted to obtain the total loss. This guarantees the accuracy of the total loss, further ensures the performance and accuracy of the trained model, and improves the accuracy of image segmentation.
Optionally, calculating the contrast loss from the first feature alignment vector and the first label image by using a preset contrast loss function includes the following specific steps:
performing sample-point-by-sample-point contrast loss calculation on the first feature alignment vector and the first label image by using a preset contrast loss function to obtain the contrast loss.
The sample points are feature sample points on the first feature alignment vector and the first label image: image feature points of a preset granularity, for example feature sample points of granularity 1x1, 2x2, 4x4 or 8x8. The sample points comprise hard sample points and easy sample points, corresponding to feature sample points whose entities are hard to classify and feature sample points whose entities are easy to classify in the image; the difficulty is determined by comparing the contrast loss with a preset contrast loss threshold.
After the difficulty of the sample points is determined, they are labeled accordingly. As a result of this round of training, they can be added to subsequent iterative training, training first on the easy sample points and then on the hard sample points, which improves the training effect of the model.
Sample-point-by-sample-point contrast loss calculation is performed on the first feature alignment vector and the first label image with the preset contrast loss function to obtain the contrast loss. This provides reference data for subsequent model training, improves the training effect of the model, and improves the accuracy of subsequent segmentation results.
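The specification does not give the exact form of the contrast loss function. The following is one hedged reading, in which a temperature-scaled softmax contrast over per-point class similarities is computed on the feature alignment vector at each of the example granularities (1x1, 2x2, 4x4, 8x8), realised here with average pooling; all shapes, the temperature and the loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def pointwise_contrast_loss(align_vec, label_img, granularities=(1, 2, 4, 8),
                            temperature=0.07):
    # align_vec: [B, K, H, W] similarity of each point to each of K classes
    # label_img: [B, H, W] integer class indices from the first label image
    losses = []
    for g in granularities:
        # Pool to the g x g sample-point granularity and temperature-scale.
        logits = F.avg_pool2d(align_vec, g) / temperature       # [B, K, H/g, W/g]
        labels = F.interpolate(label_img[:, None].float(),
                               scale_factor=1.0 / g,
                               mode="nearest").squeeze(1).long()
        # Softmax contrast per sample point against the labeled class.
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()
```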
Optionally, after step S110, the following specific steps are further included:
sending the segmentation result of the image to be segmented to the front end for display, so that a user edits the segmentation result at the front end;
receiving an editing result fed back by the front end;
and training an image segmentation model by taking the editing result as a sample image, wherein the image segmentation model comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features.
The front end is the front end of the client having the image segmentation function, which can perform the above steps S102 to S110.
By sending the segmentation result to the front end for display, the user can directly observe the visual effect of the segmentation result and then perform further editing operations.
The editing result is an image processing result obtained after the user performs editing operation on the segmentation result displayed at the front end, and the editing operation may be to adjust an image area of an entity in the segmentation result, or may be to adjust display parameters such as an image proportion, a size, a color, a contrast and the like of the segmentation result, or may be to adjust a category of the entity in the segmentation result, which is not limited herein.
The image segmentation model is a model with an image segmentation function and comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features. The image encoder is a neural network model with an image feature extraction function, and can be a VGG model, a ResNet model, a CNN model, a VIT model or the like. The text encoder is a neural network model with a text feature extraction function, and can be a Transformer model, a derivative model thereof or the like. The decoder is a neural network model with a function of classifying the entities in the image; it can identify and classify different entities in the image so as to obtain the segmentation result of the image to be segmented, and can be an FCN model, a U-Net model, an FPN model or the like.
The segmentation result of the image to be segmented is sent to the front end for display; specifically, the different entity areas in the segmentation result are labeled by category and then sent to the front end for display. The manner of category labeling can be text labeling, outline labeling or color labeling, and is not limited herein.
Illustratively, an image to be segmented sent by a user through a client with the image segmentation function is received; the image is a photo containing 5 entities. The segmentation result of the photo is obtained by executing steps S102 to S110, the image areas of the 5 entities in the segmentation result are labeled with categories in different colors, and the category-labeled segmentation result is displayed at the front end of the client. The editing result fed back by the front end is then received.
Because the sample set used for training the image segmentation model is generally a general-purpose sample set, such as ADE20K, COCO-Stuff10K or ADE20K-Full, in actual use a user may need sample images and corresponding label images that better fit the actual application scene, and manually constructing a large number of such sample and label images is costly. The segmentation result can therefore be displayed at the front end and edited by the user to obtain an editing result that better fits the actual application scene, and the editing result is used as a sample image to train the image encoder and the decoder, which improves the accuracy of the segmentation result and saves the cost of model training.
For example, the traffic lights in a general sample set are horizontal; a model trained on such a set performs entity recognition on horizontal traffic lights and segments them to obtain a segmentation result. In some regions, however, the traffic lights are vertical, and a model trained on the general sample set has difficulty segmenting images to be segmented that contain vertical traffic lights. The user therefore edits the result, the editing result is obtained as a sample image, and the image segmentation model is trained with it, improving the model's ability to recognize and segment the vertical traffic lights of that region.
The segmentation result of the image to be segmented is sent to the front end for display, and can be intuitively displayed to the user, so that the user can further process the segmentation result, and the user experience is improved. And receiving an editing result fed back by the front end, taking the editing result as a sample image, and training an image segmentation model, wherein the image segmentation model comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features. The training effect of the image segmentation model is improved, the effect of segmenting the subsequent image to be segmented is improved, and the accuracy of the subsequent segmentation result is improved.
Referring to fig. 2, fig. 2 shows a flowchart of a remote sensing image segmentation method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step S202: receiving a remote sensing image segmentation instruction input by a user, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmented object;
step S204: extracting features of the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector;
step S206: constructing a prompt text vector according to the random text vector, the global image feature vector and the category label;
step S208: extracting features of the prompt text vector to obtain a text feature vector;
step S210: and determining a segmentation result aiming at the target segmentation object in the remote sensing image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
The embodiment of the specification is applied to the functional service providing end with the remote sensing image segmentation function.
The remote sensing image segmentation instruction is an image segmentation instruction generated by the client and sent to the functional service providing end after the user uploads the remote sensing image to be segmented and the class label of the target segmented object through the client.
The remote sensing image to be segmented is a remote sensing multimedia image comprising a plurality of earth surface entities; it is a real remote sensing image acquired by remote sensing image acquisition equipment and can be in the form of a picture, a video frame or the like, which is not limited herein. The target segmented object is the surface entity in the remote sensing image to be segmented that the user needs to have recognized and segmented, and the class label of the target segmented object is the label corresponding to the target segmented object among the preset category labels.
The specific implementation manner in the embodiment of the present disclosure has been described in detail in the embodiment of the foregoing fig. 1, and will not be described herein again.
In the embodiment of the specification, a remote sensing image segmentation instruction input by a user is received, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmented object, feature extraction is carried out on the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector, a prompt text vector is constructed according to a random text vector, the global image feature vector and the class label, feature extraction is carried out on the prompt text vector to obtain a text feature vector, and a segmentation result aiming at the target segmented object in the remote sensing image to be segmented is determined through feature compiling according to the local image feature vector and the text feature vector. The method comprises the steps of extracting features of a remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector, constructing a prompt text vector according to the global image feature vector, a random text vector and a preset category label, further segmenting the image by using text features and image features, fully mining own deep features of the single remote sensing image to be segmented, enabling a segmentation result to better meet image use habits of people, obtaining a good processing result when the segmentation result is reprocessed at the downstream, improving accuracy of the segmentation result and improving user experience.
Optionally, after step S210, the following specific steps are further included:
transmitting the segmentation result of the remote sensing image to be segmented to the front end for display, so that a user edits the segmentation result at the front end;
receiving an editing result fed back by the front end;
and training an image segmentation model by taking the editing result as a sample image, wherein the image segmentation model comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features.
The front end is the front end of the client with the remote sensing image segmentation function.
By sending the segmentation result to the front end for display, the user can execute further editing operation after directly observing the visual effect of the segmentation result.
The editing result is an image processing result obtained after the user performs editing operation on the segmentation result displayed at the front end, and the editing operation may be to adjust an image area of the ground surface entity in the segmentation result, or may be to adjust display parameters such as an image proportion, a size, a color, a contrast and the like of the segmentation result, or may be to adjust a category of the ground surface entity in the segmentation result, which is not limited herein.
The image segmentation model is a model with an image segmentation function and comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features. The image encoder is a neural network model with an image feature extraction function, and can be a VGG model, a ResNet model, a CNN model, a VIT model or the like. The text encoder is a neural network model with a text feature extraction function, and can be a Transformer model, a derivative model thereof or the like. The decoder is a neural network model with a function of classifying the earth surface entities in the image; it can identify and classify different earth surface entities in the image so as to obtain the segmentation result of the remote sensing image to be segmented, and can be an FCN model, a U-Net model, an FPN model or the like.
The segmentation result of the remote sensing image to be segmented is sent to the front end for display; specifically, the image areas of the different surface entities in the segmentation result are labeled by category and then sent to the front end for display. The manner of category labeling can be text labeling, outline labeling or color labeling, and is not limited herein.
Illustratively, a remote sensing image to be segmented sent by a user through a client with the image segmentation function is received; the image is a remote sensing satellite image containing 3 types of surface entities (lakes, roads and buildings), and the target segmented object is buildings. The segmentation result of the remote sensing satellite image is obtained by executing steps S202 to S210, the image areas of the buildings in the segmentation result are labeled with categories in distinct outlines, and the category-labeled segmentation result is displayed at the front end of the client. The editing result fed back by the user is then received, the editing result being the image processing result obtained after the user adjusts the image areas of the buildings in the segmentation result.
Because the sample set used for training the image segmentation model is generally a general-purpose sample set, such as ADE20K, COCO-Stuff10K or ADE20K-Full, in actual use a user may need sample images and corresponding label images that better fit the actual application scene, and manually constructing a large number of such sample and label images is costly. The segmentation result can therefore be displayed at the front end and edited by the user to obtain an editing result that better fits the actual application scene, and the editing result is used as a sample image to train the image encoder and the decoder, which improves the accuracy of the segmentation result and saves the cost of model training.
For example, the buildings in a general sample set are high-density; a model trained on such a set performs entity recognition on high-density surface entities and segments the buildings as target objects. In some regions, however, the buildings are low-density, and a model trained on the general sample set has difficulty segmenting the remote sensing images of those regions. The user therefore edits the result, the editing result is obtained as a sample image to train the image segmentation model, improving the model's ability to recognize and segment the low-density buildings of those regions.
The segmentation result of the remote sensing image to be segmented is sent to the front end for display, and can be intuitively displayed to the user, so that the user can further process the segmentation result, and the user experience is improved. The editing result fed back by the user after editing the segmentation result is received, so that the actual use requirement of the user can be met, the adaptability and accuracy of remote sensing image segmentation are improved, and the user experience is improved. And training an image segmentation model by taking the editing result as a sample image, wherein the image segmentation model comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features. The training effect of the image segmentation model is improved, the segmentation effect of the subsequent remote sensing image to be segmented is improved, and the accuracy of the subsequent segmentation result is improved.
The image segmentation method provided in the present specification will be further described with reference to fig. 3 by taking an application of the image segmentation method to entity recognition of a remote sensing image as an example. Fig. 3 is a flowchart of a process of an image segmentation method applied to entity recognition of a remote sensing image according to an embodiment of the present disclosure, which specifically includes the following steps.
Step S302: receiving a remote sensing image to be segmented sent by a user through a client;
the remote sensing image to be segmented is a multimedia image comprising a plurality of surface entities. The plurality of surface entities may be roads, trees, oil storage tanks, traffic vehicles, buildings, and the like.
Step S304: inputting the remote sensing image to be segmented into a VIT model, and extracting features of the remote sensing image to be segmented by using the VIT model to obtain a global image feature vector and a local image feature vector;
step S306: performing dimension mapping on the global image feature vector to obtain the global image feature vector with the same vector dimension as the random text vector;
step S308: splicing the random text vector, the overall image feature vector after dimension mapping and a preset category label to obtain a prompt text vector;
step S310: constructing a prompt text vector according to the random text vector, the global image feature vector and the preset category label;
Step S312: inputting the prompt text vector into a Transformer model, and extracting features of the prompt text vector by using the Transformer model to obtain a text feature vector;
step S314: performing cross attention calculation on the text feature vector and the local image feature vector by using a preset Transformer model with a 6-layer structure, and determining a target text feature vector;
step S316: fine tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector;
step S318: performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector;
step S320: splicing the feature alignment vector and the local image feature vector in corresponding dimensions to obtain a spliced feature vector;
step S322: inputting the local image feature vector and the feature alignment vector into a Semantic FPN model, and performing feature compiling on the local image feature vector and the feature alignment vector by using the Semantic FPN model to determine a segmentation result of the remote sensing image to be segmented;
step S324: and sending the segmentation result to the front end of the client for display.
In the embodiment of the specification, feature extraction is performed on the remote sensing image to be segmented with a VIT model to obtain a global image feature vector and a local image feature vector. The global image feature vector is dimension-mapped and then spliced with a random text vector and a preset category label to obtain a prompt text vector; richer and deeper features of the prompt text vector are then mined based on a cross attention mechanism and a Transformer model, so that the image is segmented using both text features and image features. For each image to be segmented, its own deep features are fully mined, so the segmentation result better matches people's image use habits and yields a good processing result when the segmentation result is reprocessed downstream, improving the accuracy of the segmentation result. A cross multiplication operation is performed on the local image feature vector and the text feature vector to obtain a feature alignment vector, which guarantees the feature correlation of the subsequent segmentation result and further improves its accuracy; finally, the segmentation result is obtained with the Semantic FPN model.
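Steps S302 to S324 can be condensed into the following illustrative inference sketch. The handles vit, mapper, text_encoder, cross_attn and decoder are hypothetical stand-ins for the VIT image encoder, the dimension mapper, the Transformer text encoder, the 6-layer cross-attention module and the Semantic FPN decoder; the residual update in S316 and all tensor shapes are assumptions rather than details fixed by the specification.

```python
import torch

@torch.no_grad()
def segment_remote_sensing_image(image, vit, mapper, text_encoder,
                                 cross_attn, decoder,
                                 random_text_vec, class_labels):
    global_feat, local_feats = vit(image)                     # S304
    mapped = mapper(global_feat)                              # S306: dim mapping
    # S308: splice random text vector, mapped global feature, category labels.
    prompt = torch.cat([random_text_vec, mapped, class_labels], dim=-1)
    text_feat = text_encoder(prompt)                          # S312
    target = cross_attn(text_feat, local_feats)               # S314
    text_feat = text_feat + target                            # S316: fine tuning
    # S318: cross multiplication -> feature alignment vector [B, K, H, W]
    align_vec = torch.einsum("bchw,bkc->bkhw", local_feats, text_feat)
    fused = torch.cat([align_vec, local_feats], dim=1)        # S320: splice
    return decoder(fused)                                     # S322
```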
Fig. 4 shows a system architecture diagram of an image segmentation system according to an embodiment of the present disclosure.
As shown in fig. 4, the image to be segmented is input into the image encoder, and a global image feature vector and a local image feature vector are extracted. The global image feature vector is input into the mapper for dimension mapping, then spliced with a random text vector and a preset category label to obtain a prompt text vector, which is input into the text encoder to extract a text feature vector. The text feature vector and the local image feature vector are input into a Transformer model, where a cross attention mechanism fine-tunes the text feature vector to obtain an updated text feature vector. Feature alignment is performed on the text feature vector and the local image feature vector to obtain a feature alignment vector; after the feature alignment vector and the local image feature vector are cascaded, they are input into the decoder to obtain the segmentation result of the image to be segmented. The segmentation loss is calculated from the segmentation result and the pre-acquired label image, and contrast loss and alignment loss are respectively calculated with the pre-acquired label image. The segmentation loss, contrast loss and alignment loss are used to adjust the parameters of the image encoder and the decoder.
Fig. 5A illustrates a schematic diagram of a remote sensing image to be segmented according to an embodiment of the present disclosure. Fig. 5B is a schematic diagram illustrating a segmentation result of a remote sensing image to be segmented according to an embodiment of the present disclosure.
The embodiment of the specification is applied to the front-end display of a client with a remote sensing image segmentation function.
As shown in fig. 5A, the remote sensing image to be segmented contains the surface entities of a road and oil storage tanks; fig. 5A is the remote sensing image before image segmentation. The preset category label is set to the vector corresponding to the oil storage tank, and the segmentation result is obtained with the remote sensing image segmentation method of the embodiment of fig. 2. As shown in fig. 5B, the oil storage tanks in the remote sensing image are entity-recognized and the entity images corresponding to them are obtained by segmentation, while no entity recognition is performed on the other surface entities (the road) outside the preset category label.
Corresponding to the above method embodiments, the present disclosure further provides an image segmentation apparatus embodiment, and fig. 6 shows a schematic structural diagram of an image segmentation apparatus according to one embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:
A first acquisition module 602 configured to acquire an image to be segmented;
a first extraction module 604, configured to perform feature extraction on the image to be segmented, so as to obtain a global image feature vector and a local image feature vector;
a first construction module 606 configured to construct a hint text vector from the random text vector, the global image feature vector, and the preset category labels;
a second extraction module 608, configured to perform feature extraction on the prompt text vector, to obtain a text feature vector;
the first segmentation module 610 is configured to determine a segmentation result of the image to be segmented through feature compilation according to the local image feature vector and the text feature vector.
Optionally, the first construction module 606 is further configured to:
performing dimension mapping on the global image feature vector to obtain a global image feature vector with the same vector dimension as the random text vector; and splicing the random text vector, the dimension-mapped global image feature vector and the preset category label to obtain a prompt text vector.
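A minimal sketch of this construction follows, assuming the global image feature is a [B, D_img] vector, the random text vector is a sequence of [B, N_ctx, D_txt] context embeddings and the preset category labels are [B, K, D_txt] label embeddings; all names and shapes are illustrative, not defined by the specification.

```python
import torch

class PromptBuilder(torch.nn.Module):
    def __init__(self, d_img: int, d_txt: int):
        super().__init__()
        # Dimension mapping so the global feature matches the text dimension.
        self.mapper = torch.nn.Linear(d_img, d_txt)

    def forward(self, random_text, global_feat, class_labels):
        mapped = self.mapper(global_feat).unsqueeze(1)        # [B, 1, D_txt]
        # Splice: [random text vector; mapped global feature; class labels]
        return torch.cat([random_text, mapped, class_labels], dim=1)
```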
Optionally, the apparatus further comprises:
the updating module is configured to perform cross attention calculation on the text feature vector and the local image feature vector and determine a target text feature vector; and fine tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector.
Optionally, the update module is further configured to:
and performing cross attention calculation on the text feature vector and the local image feature vector by using a preset multi-layer structure translation model decoder, and determining a target text feature vector.
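One hedged reading of this update module is a Transformer decoder in which the text features act as queries and the flattened local image features as keys and values, with a residual update as the fine-tuning step. The layer count of 6 follows the processing flow described above; all other dimensions are illustrative assumptions.

```python
import torch

class TextUpdater(torch.nn.Module):
    def __init__(self, d_model: int = 512, n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        layer = torch.nn.TransformerDecoderLayer(d_model, n_heads,
                                                 batch_first=True)
        self.decoder = torch.nn.TransformerDecoder(layer, n_layers)

    def forward(self, text_feat, local_feats):
        # text_feat: [B, K, D]; local_feats: [B, D, H, W] -> [B, H*W, D]
        memory = local_feats.flatten(2).transpose(1, 2)
        target = self.decoder(tgt=text_feat, memory=memory)  # cross attention
        return text_feat + target  # fine-tuned (updated) text feature vector
```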
Optionally, the first segmentation module 610 is further configured to:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector; and determining a segmentation result of the image to be segmented through feature compiling based on the feature alignment vector and the local image feature vector.
Optionally, the first segmentation module 610 is further configured to:
and performing cross multiplication operation on the local image feature vector and the text feature vector to obtain a feature alignment vector.
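A minimal sketch of the cross multiplication operation follows, assuming the local image feature vector is a [B, C, H, W] feature map and the text feature vector holds one C-dimensional feature per category; the shapes are assumptions.

```python
import torch

def cross_multiply(local_feats: torch.Tensor, text_feat: torch.Tensor):
    # Dot each text feature (one per class) with every spatial position,
    # producing one alignment map per class.
    # local_feats: [B, C, H, W]; text_feat: [B, K, C] -> [B, K, H, W]
    return torch.einsum("bchw,bkc->bkhw", local_feats, text_feat)
```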
Optionally, the first segmentation module 610 is further configured to:
splicing the feature alignment vector and the local image feature vector in corresponding dimensions to obtain a spliced feature vector;
and performing feature compiling on the spliced feature vectors to obtain a segmentation result of the image to be segmented.
Optionally, the first extraction module 604 is further configured to:
inputting the image to be segmented into a pre-trained image encoder, and extracting the characteristics of the image to be segmented by using the image encoder to obtain a global image characteristic vector and a local image characteristic vector;
Correspondingly, the second extraction module 608 is further configured to:
inputting the prompt text vector into a text encoder, and extracting features of the prompt text vector by using the text encoder to obtain a text feature vector;
correspondingly, the first segmentation module 610 is further configured to:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector; and inputting the local image feature vector and the feature alignment vector into a pre-trained decoder, and performing feature compiling on the local image feature vector and the feature alignment vector by using the decoder to determine the segmentation result of the image to be segmented.
Optionally, the apparatus further comprises:
the training module is configured to acquire a sample image set, wherein the sample image set comprises a plurality of sample images and label images corresponding to the sample images; extracting a first sample image and a first label image corresponding to the first sample image from a sample image set, wherein the first sample image is any sample image; inputting the first sample image into a preset image encoder, and extracting the characteristics of the first sample image by using the image encoder to obtain a first global image characteristic vector and a first local image characteristic vector; constructing a first prompt text vector according to the random text vector, the first global image feature vector and the preset category label; inputting the first prompt text vector into a text encoder, and extracting features of the first prompt text vector by using the text encoder to obtain a first text feature vector; performing multi-scale feature alignment on the first local image feature vector and the first text feature vector to obtain a first feature pair Ji Xiangliang; inputting the first local image feature vector and the first feature alignment vector into a pre-trained decoder, and performing feature compiling on the first local image feature vector and the first feature alignment vector by using the decoder to determine a segmentation result of the first sample image; determining a total loss according to the segmentation result of the first sample image, the first feature pair Ji Xiangliang and the first label image; based on the total loss, parameters of the image encoder and decoder are adjusted, and the step of extracting the first sample image and the first label image corresponding to the first sample image from the sample image set is performed back until the training stopping condition is reached.
Optionally, the training module is further configured to:
calculating segmentation loss by using a preset segmentation loss function according to the segmentation result of the first sample image and the first label image; calculating alignment loss according to the first feature alignment vector and the first label image by using a preset alignment loss function; calculating contrast loss according to the first feature alignment vector and the first label image by using a preset contrast loss function; and weighting the segmentation loss, the alignment loss and the contrast loss to obtain the total loss.
Optionally, the training module is further configured to:
and performing sample-point-by-sample-point contrast loss calculation on the first feature alignment vector and the first label image by using a preset contrast loss function to obtain the contrast loss.
Optionally, the apparatus further comprises:
the first editing training module is configured to send the segmentation result of the image to be segmented to the front end for display, so that a user edits the segmentation result at the front end, the editing result fed back by the front end is received, the editing result is used as a sample image, and an image segmentation model is trained, wherein the image segmentation model comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features.
In the embodiment of the specification, an image to be segmented is obtained, feature extraction is performed on the image to be segmented, a global image feature vector and a local image feature vector are obtained, a prompt text vector is constructed according to a random text vector, the global image feature vector and a preset category label, feature extraction is performed on the prompt text vector, a text feature vector is obtained, and a segmentation result of the image to be segmented is determined through feature compiling according to the local image feature vector and the text feature vector. The method comprises the steps of extracting features of an image to be segmented to obtain a global image feature vector and a local image feature vector, constructing a prompt text vector according to the global image feature vector, a random text vector and a preset category label, further segmenting the image by utilizing text features and image features, fully excavating deep features of the image to be segmented, enabling a segmentation result to better meet image use habits of people, obtaining a good processing result when the segmentation result is reprocessed at the downstream, improving accuracy of the segmentation result and improving user experience.
The above is a schematic solution of an image segmentation apparatus of the present embodiment. It should be noted that, the technical solution of the image segmentation apparatus and the technical solution of the image segmentation method belong to the same concept, and details of the technical solution of the image segmentation apparatus, which are not described in detail, can be referred to the description of the technical solution of the image segmentation method.
Corresponding to the method embodiment, the present disclosure further provides an embodiment of a remote sensing image segmentation apparatus, and fig. 7 shows a schematic structural diagram of a remote sensing image segmentation apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
the receiving module 702 is configured to receive a remote sensing image segmentation instruction input by a user, where the remote sensing image segmentation instruction includes the remote sensing image to be segmented and a class label of the target segmented object;
a third extraction module 704, configured to perform feature extraction on the remote sensing image to be segmented, so as to obtain a global image feature vector and a local image feature vector;
a second construction module 706 configured to construct a hint text vector from the random text vector, the global image feature vector, and the category label;
a fourth extraction module 708 configured to perform feature extraction on the prompt text vector to obtain a text feature vector;
the second segmentation module 710 is configured to determine a segmentation result for the target segmentation object in the remote sensing image to be segmented according to the local image feature vector and the text feature vector through feature compilation.
Optionally, the apparatus further comprises:
the second editing training module is configured to send the segmentation result of the remote sensing image to be segmented to the front end for display, so that a user edits the segmentation result at the front end; to receive the editing result fed back by the front end; and to train an image segmentation model with the editing result as a sample image, wherein the image segmentation model comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features.
In the embodiment of the specification, a remote sensing image segmentation instruction input by a user is received, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmented object, feature extraction is carried out on the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector, a prompt text vector is constructed according to a random text vector, the global image feature vector and the class label, feature extraction is carried out on the prompt text vector to obtain a text feature vector, and a segmentation result aiming at the target segmented object in the remote sensing image to be segmented is determined through feature compiling according to the local image feature vector and the text feature vector. The method comprises the steps of extracting features of a remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector, constructing a prompt text vector according to the global image feature vector, a random text vector and a preset category label, further segmenting the image by using text features and image features, fully mining own deep features of the single remote sensing image to be segmented, enabling a segmentation result to better meet image use habits of people, obtaining a good processing result when the segmentation result is reprocessed at the downstream, improving accuracy of the segmentation result and improving user experience.
The foregoing is a schematic scheme of a remote sensing image segmentation apparatus of this embodiment. It should be noted that, the technical solution of the remote sensing image segmentation device and the technical solution of the remote sensing image segmentation method belong to the same concept, and details of the technical solution of the remote sensing image segmentation device which are not described in detail can be referred to the description of the technical solution of the remote sensing image segmentation method.
FIG. 8 illustrates a block diagram of a computing device provided in one embodiment of the present description. The components of computing device 800 include, but are not limited to, memory 810 and processor 820. Processor 820 is coupled to memory 810 through bus 830 and database 850 is used to hold data.
Computing device 800 also includes access device 840, access device 840 enabling computing device 800 to communicate via one or more networks 860. Examples of such networks include public switched telephone networks (PSTN, Public Switched Telephone Network), local area networks (LAN, Local Area Network), wide area networks (WAN, Wide Area Network), personal area networks (PAN, Personal Area Network), or combinations of communication networks such as the internet. Access device 840 may include one or more of any type of network interface, wired or wireless, such as a network interface card (NIC, Network Interface Controller), an IEEE 802.11 wireless local area network (WLAN, Wireless Local Area Networks) wireless interface, a worldwide interoperability for microwave access (Wi-MAX, World Interoperability for Microwave Access) interface, an ethernet interface, a universal serial bus (USB, Universal Serial Bus) interface, a cellular network interface, a bluetooth interface, a near field communication (NFC, Near Field Communication) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 8 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
Wherein the processor 820 is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the image segmentation method or the remote sensing image segmentation method described above.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device belongs to the same concept as the technical solutions of the image segmentation method and the remote sensing image segmentation method, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solutions of the image segmentation method or the remote sensing image segmentation method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described image segmentation method or remote sensing image segmentation method.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solutions of the image segmentation method and the remote sensing image segmentation method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solutions of the image segmentation method or the remote sensing image segmentation method.
An embodiment of the present disclosure further provides a computer program, where the computer program, when executed in a computer, causes the computer to perform the steps of the image segmentation method or the remote sensing image segmentation method described above.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solutions of the image segmentation method and the remote sensing image segmentation method belong to the same concept, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solutions of the image segmentation method or the remote sensing image segmentation method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code that may be in source code form, object code form, executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims (13)

1. An image segmentation method, comprising:
acquiring an image to be segmented;
extracting features of the image to be segmented to obtain a global image feature vector and a local image feature vector;
constructing a prompt text vector according to the random text vector, the global image feature vector and a preset category label;
extracting features of the prompt text vector to obtain a text feature vector;
Performing cross attention calculation on the text feature vector and the local image feature vector to determine a target text feature vector;
fine tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector;
and determining a segmentation result of the image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
2. The method of claim 1, the constructing a hint text vector from a random text vector, the global image feature vector, and a preset category label, comprising:
performing dimension mapping on the global image feature vector to obtain a global image feature vector with the same vector dimension as the random text vector;
and splicing the random text vector, the global image feature vector subjected to dimension mapping and a preset category label to obtain a prompt text vector.
3. The method of claim 1, the cross-attention computation of the text feature vector and the local image feature vector, determining a target text feature vector, comprising:
and performing cross attention calculation on the text feature vector and the local image feature vector by using a preset multi-layer structure translation model decoder, and determining a target text feature vector.
4. A method according to any one of claims 1-3, wherein said determining a segmentation result of the image to be segmented from the local image feature vector and the text feature vector via feature compilation comprises:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector;
and determining a segmentation result of the image to be segmented through feature compiling based on the feature alignment vector and the local image feature vector.
5. The method of claim 4, wherein determining the segmentation result of the image to be segmented based on the feature alignment vector and the local image feature vector via feature compilation comprises:
splicing the feature alignment vector and the local image feature vector in corresponding dimensions to obtain a spliced feature vector;
and performing feature compiling on the spliced feature vectors to obtain a segmentation result of the image to be segmented.
6. The method of claim 1, the feature extraction of the image to be segmented to obtain a global image feature vector and a local image feature vector, comprising:
inputting the image to be segmented into a pre-trained image encoder, and extracting features of the image to be segmented by using the image encoder to obtain a global image feature vector and a local image feature vector;
The feature extraction is performed on the prompt text vector to obtain a text feature vector, which comprises the following steps:
inputting the prompt text vector into a text encoder, and extracting features of the prompt text vector by using the text encoder to obtain a text feature vector;
the determining the segmentation result of the image to be segmented through feature compiling according to the local image feature vector and the text feature vector comprises the following steps:
performing multi-scale feature alignment on the local image feature vector and the text feature vector to obtain a feature alignment vector;
inputting the local image feature vector and the feature alignment vector into a pre-trained decoder, and performing feature compiling on the local image feature vector and the feature alignment vector by using the decoder to determine a segmentation result of the image to be segmented.
7. The method of claim 6, further comprising:
acquiring a sample image set, wherein the sample image set comprises a plurality of sample images and label images corresponding to the sample images;
extracting a first sample image and a first label image corresponding to the first sample image from the sample image set, wherein the first sample image is any sample image;
inputting the first sample image into a preset image encoder, and extracting features of the first sample image by using the image encoder to obtain a first global image feature vector and a first local image feature vector;
constructing a first prompt text vector according to the random text vector, the first global image feature vector and a preset category label;
inputting the first prompt text vector into a text encoder, and extracting features of the first prompt text vector by using the text encoder to obtain a first text feature vector;
performing multi-scale feature alignment on the first local image feature vector and the first text feature vector to obtain a first feature alignment vector;
inputting the first local image feature vector and the first feature alignment vector into a pre-trained decoder, and performing feature compiling on the first local image feature vector and the first feature alignment vector by using the decoder to determine a segmentation result of the first sample image;
determining total loss according to the segmentation result of the first sample image, the first feature alignment vector and the first label image;
and adjusting parameters of the image encoder and the decoder based on the total loss, and returning to the step of extracting a first sample image and a first label image corresponding to the first sample image from the sample image set, until a training stop condition is reached.
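Claim 7 amounts to a conventional supervised loop. A skeleton under the assumption of a PyTorch-style model that returns both the segmentation result and the feature alignment vector; all names are placeholders:

```python
def train(model, optimizer, dataset, total_loss_fn, max_steps=10_000):
    """Training-loop skeleton for claim 7; `model` is assumed to return the
    segmentation result and the first feature alignment vector."""
    step = 0
    while step < max_steps:                                 # training-stop condition
        for sample_img, label_img in dataset:               # extract a sample/label pair
            seg_result, align_vec = model(sample_img)
            loss = total_loss_fn(seg_result, align_vec, label_img)  # claim 8's total loss
            optimizer.zero_grad()
            loss.backward()                                 # gradients for encoder and decoder
            optimizer.step()                                # adjust parameters on the total loss
            step += 1
            if step >= max_steps:
                break
```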
8. The method of claim 7, determining a total loss from the segmentation result of the first sample image, the first feature alignment vector, and the first label image, comprising:
calculating segmentation loss by using a preset segmentation loss function according to the segmentation result of the first sample image and the first label image;
calculating alignment loss by using a preset alignment loss function according to the first characteristic alignment vector and the first label image;
calculating contrast loss by using a preset contrast loss function according to the first feature alignment vector and the first label image;
and weighting the segmentation loss, the alignment loss and the contrast loss to obtain a total loss.
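A sketch of claim 8's weighted three-term objective. The segmentation and alignment terms are plain cross-entropy here, and the contrast term is replaced by an entropy penalty purely for illustration, since the claim names the loss functions but does not spell them out; the weights are arbitrary:

```python
import torch
import torch.nn.functional as F

def total_loss(seg_logits, align_maps, label, w=(1.0, 0.5, 0.5)):
    l_seg = F.cross_entropy(seg_logits, label)      # segmentation loss on the final prediction
    l_align = F.cross_entropy(align_maps, label)    # alignment maps should already predict the label
    p = F.softmax(align_maps, dim=1)                # stand-in "contrast" term: an entropy penalty
    l_con = -(p * p.clamp_min(1e-8).log()).sum(1).mean()
    return w[0] * l_seg + w[1] * l_align + w[2] * l_con  # weighted total loss

seg = torch.randn(2, 4, 32, 32)
align = torch.randn(2, 4, 32, 32)
label = torch.randint(0, 4, (2, 32, 32))
print(total_loss(seg, align, label))
```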
9. The method according to any one of claims 1-3, 6-8, further comprising, after said determining a segmentation result of the image to be segmented:
transmitting the segmentation result of the image to be segmented to a front end for display, so that a user can edit the segmentation result at the front end;
receiving an editing result fed back by the front end;
and training an image segmentation model by taking the editing result as a sample image, wherein the image segmentation model comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features.
10. A remote sensing image segmentation method, comprising:
receiving a remote sensing image segmentation instruction input by a user, wherein the remote sensing image segmentation instruction comprises a remote sensing image to be segmented and a class label of a target segmented object;
extracting features of the remote sensing image to be segmented to obtain a global image feature vector and a local image feature vector;
constructing a prompt text vector according to the random text vector, the global image feature vector and the category label;
extracting features of the prompt text vector to obtain a text feature vector;
performing cross-attention calculation on the text feature vector and the local image feature vector to determine a target text feature vector;
fine-tuning the text feature vector based on the target text feature vector to obtain an updated text feature vector;
and determining a segmentation result for the target segmentation object in the remote sensing image to be segmented through feature compiling according to the local image feature vector and the text feature vector.
11. The method of claim 10, further comprising, after said determining a segmentation result for the target segmentation object in the remote sensing image to be segmented:
transmitting the segmentation result of the remote sensing image to be segmented to a front end for display, so that a user can edit the segmentation result at the front end;
receiving an editing result fed back by the front end;
and training an image segmentation model by taking the editing result as a sample image, wherein the image segmentation model comprises an image encoder for extracting image features, a text encoder for extracting text features and a decoder for decoding features.
12. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer executable instructions that, when executed by the processor, implement the steps of the image segmentation method of any one of claims 1-9 or the remote sensing image segmentation method of any one of claims 10-11.
13. A computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the steps of the image segmentation method of any one of claims 1-9 or the remote sensing image segmentation method of any one of claims 10-11.
CN202211182413.7A 2022-09-27 2022-09-27 Image segmentation method, remote sensing image segmentation method and device Active CN115761222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211182413.7A CN115761222B (en) 2022-09-27 2022-09-27 Image segmentation method, remote sensing image segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211182413.7A CN115761222B (en) 2022-09-27 2022-09-27 Image segmentation method, remote sensing image segmentation method and device

Publications (2)

Publication Number Publication Date
CN115761222A CN115761222A (en) 2023-03-07
CN115761222B (en) 2023-11-03

Family

ID=85352056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211182413.7A Active CN115761222B (en) 2022-09-27 2022-09-27 Image segmentation method, remote sensing image segmentation method and device

Country Status (1)

Country Link
CN (1) CN115761222B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758093B (en) * 2023-05-30 2024-05-07 首都医科大学宣武医院 Image segmentation method, model training method, device, equipment and medium
CN117314938B (en) * 2023-11-16 2024-04-05 中国科学院空间应用工程与技术中心 Image segmentation method and device based on multi-scale feature fusion decoding
CN117992992B (en) * 2024-04-07 2024-07-05 武昌首义学院 Extensible satellite information data cloud platform safe storage method and system
CN118413923B (en) * 2024-06-27 2024-10-18 杭州字棒棒科技有限公司 Intelligent desk lamp learning auxiliary system and method based on AI voice interaction
CN118673177A (en) * 2024-08-22 2024-09-20 浙江大华技术股份有限公司 Feature construction method, cross-modal retrieval method, device and computer storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160014482A1 (en) * 2014-07-14 2016-01-14 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
US11657230B2 (en) * 2020-06-12 2023-05-23 Adobe Inc. Referring image segmentation
US11615567B2 (en) * 2020-11-18 2023-03-28 Adobe Inc. Image segmentation using text embedding

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022142450A1 (en) * 2020-12-28 2022-07-07 北京达佳互联信息技术有限公司 Methods and apparatuses for image segmentation model training and for image segmentation
CN112818955A (en) * 2021-03-19 2021-05-18 北京市商汤科技开发有限公司 Image segmentation method and device, computer equipment and storage medium
CN114036336A (en) * 2021-11-15 2022-02-11 上海交通大学 Semantic division-based pedestrian image searching method based on visual text attribute alignment
CN114283127A (en) * 2021-12-14 2022-04-05 山东大学 Multi-mode information-guided medical image segmentation system and image processing method
CN114529757A (en) * 2022-01-21 2022-05-24 四川大学 Cross-modal single-sample three-dimensional point cloud segmentation method
CN114565625A (en) * 2022-03-09 2022-05-31 昆明理工大学 Mineral image segmentation method and device based on global features
CN114943789A (en) * 2022-03-28 2022-08-26 华为技术有限公司 Image processing method, model training method and related device
CN115035213A (en) * 2022-05-23 2022-09-09 中国农业银行股份有限公司 Image editing method, device, medium and equipment
CN115631205A (en) * 2022-12-01 2023-01-20 阿里巴巴(中国)有限公司 Method, device and equipment for image segmentation and model training
CN116128894A (en) * 2023-01-31 2023-05-16 马上消费金融股份有限公司 Image segmentation method and device and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Image Segmentation Using Text and Image Prompts; Timo Luddecke; arXiv; pp. 1-14 *
Learning Transferable Visual Models From Natural Language Supervision; Alec Radford; arXiv; pp. 1-48 *
Remote sensing image detection and segmentation based on Word Embedding; You Hongfeng; Tian Shengwei; Yu Long; Lyu Yalong; Acta Electronica Sinica (No. 01); pp. 78-86 *
Research on referring object segmentation based on textual expressions; Wei Qingwei; Journal of Test and Measurement Technology; pp. 42-47, 59 *

Also Published As

Publication number Publication date
CN115761222A (en) 2023-03-07

Similar Documents

Publication Publication Date Title
CN115761222B (en) Image segmentation method, remote sensing image segmentation method and device
CN111311578B (en) Object classification method and device based on artificial intelligence and medical image equipment
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
AU2019268184B2 (en) Precise and robust camera calibration
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
CN112101165A (en) Interest point identification method and device, computer equipment and storage medium
CN111739027B (en) Image processing method, device, equipment and readable storage medium
CN113538480A (en) Image segmentation processing method and device, computer equipment and storage medium
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN113411550B (en) Video coloring method, device, equipment and storage medium
CN117540221B (en) Image processing method and device, storage medium and electronic equipment
CN112084859A (en) Building segmentation method based on dense boundary block and attention mechanism
CN117253044B (en) Farmland remote sensing image segmentation method based on semi-supervised interactive learning
CN116453121A (en) Training method and device for lane line recognition model
CN115496820A (en) Method and device for generating image and file and computer storage medium
CN115577768A (en) Semi-supervised model training method and device
CN116980541B (en) Video editing method, device, electronic equipment and storage medium
Qin et al. Face inpainting network for large missing regions based on weighted facial similarity
CN114495916A (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN117576248A (en) Image generation method and device based on gesture guidance
CN116361502B (en) Image retrieval method, device, computer equipment and storage medium
CN113570509A (en) Data processing method and computer device
CN111914850B (en) Picture feature extraction method, device, server and medium
CN116541556A (en) Label determining method, device, equipment and storage medium
CN115205624A (en) Cross-dimension attention-convergence cloud and snow identification method and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant