CN115631205B - Method, device and equipment for image segmentation and model training - Google Patents

Method, device and equipment for image segmentation and model training

Info

Publication number
CN115631205B
CN115631205B (application CN202211528536.1A)
Authority
CN
China
Prior art keywords
image
segmentation
category
vector
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211528536.1A
Other languages
Chinese (zh)
Other versions
CN115631205A (en)
Inventor
周强
于超辉
王志斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202211528536.1A
Publication of CN115631205A
Application granted
Publication of CN115631205B
Legal status: Active
Anticipated expiration

Classifications

    • G06T 7/11 — Image analysis; Segmentation; Edge detection; Region-based segmentation
    • G06V 10/26 — Image preprocessing; Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/774 — Processing image or video features in feature spaces; Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/10 — Scenes; Scene-specific elements; Terrestrial scenes
    • G06V 20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06T 2207/10032 — Image acquisition modality; Satellite or aerial image; Remote sensing
    • G06T 2207/20081 — Special algorithmic details; Training; Learning
    • G06T 2207/20084 — Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/30181 — Subject of image; Earth observation

Abstract

The application provides a method, a device and equipment for image segmentation and model training. According to the method, the candidate category names of an image segmentation scene/task are input into an image segmentation model together with the image. The image segmentation model automatically maps the input category names to text embedding vectors in a unified category representation space, extracts the image features of the image, and performs image segmentation according to the image features and the text embedding vectors to obtain the position masks of the image and the category information corresponding to each position mask. Since a unified category naming scheme and category representation space do not need to be built manually, the method can be applied to a variety of image segmentation scenes/tasks that use different category systems, which improves the generalization capability and robustness of the image segmentation model as well as the accuracy of image segmentation.

Description

Method, device and equipment for image segmentation and model training
Technical Field
The present application relates to computer technologies, and in particular, to a method, an apparatus, and a device for image segmentation and model training.
Background
Image segmentation is a core technology in the field of computer vision and plays a key role in applications ranging from autonomous driving to remote sensing image analysis. Because of the cost of data collection and annotation, the currently available datasets each cover a limited range of scenarios: for example, the ADE20K dataset contains indoor, outdoor, and natural scenes, the Cityscapes dataset focuses on urban street scenes, and datasets such as COCO and COCO-Stuff target object detection and segmentation.
Most current image segmentation schemes select a dataset for the scene of a specific image segmentation task and train the image segmentation model on that single dataset, so the resulting model generalizes poorly and segments images with low accuracy.
Disclosure of Invention
The application provides a method, a device and equipment for image segmentation and model training, which are intended to solve the problems of poor generalization capability and low segmentation accuracy of existing image segmentation models.
In one aspect, the present application provides an image segmentation method, including:
acquiring an image to be segmented and candidate category names; inputting the image and the candidate category names into an image segmentation model, extracting image features of the image through the image segmentation model, mapping the candidate category names to text embedding vectors in a unified category representation space, and determining position masks of the image and the category information corresponding to each position mask according to the image features and the text embedding vectors; and outputting the position masks of the image and the corresponding category information, wherein each position mask indicates a segmented region in the image and the category information corresponding to a position mask indicates the category of that segmented region.
In another aspect, the present application provides an image segmentation model training method, including:
acquiring a plurality of datasets and the candidate category names of each dataset, wherein each dataset comprises sample images and image segmentation annotation results of the sample images, and the annotation results comprise position masks of the sample images and the category information corresponding to the position masks; inputting a sample image and the candidate category names of the dataset to which it belongs into an image segmentation model to be trained, extracting image features of the sample image through the image segmentation model, mapping the candidate category names to text embedding vectors in a unified category representation space, and determining an image segmentation prediction result according to the image features and the text embedding vectors, wherein the prediction result comprises predicted position masks of the sample image and the predicted category information corresponding to the position masks; and calculating a loss according to the image segmentation prediction result and the image segmentation annotation result of the sample image, and training the model parameters of the image segmentation model to obtain a trained image segmentation model; the trained image segmentation model is used to segment an input image and to determine the position masks of the input image and the category information corresponding to the position masks.
In another aspect, the present application provides a method for segmenting a remote sensing image, including:
acquiring a remote sensing image to be segmented and candidate ground object category names; inputting the remote sensing image and the ground object category names into an image segmentation model, extracting image features of the remote sensing image through the image segmentation model, mapping the ground object category names to text embedding vectors in a unified category representation space, and determining position masks of the remote sensing image and the ground object category information corresponding to each position mask according to the image features and the text embedding vectors; and outputting the position masks of the remote sensing image and the corresponding ground object category information, wherein each position mask indicates a segmented region in the image and the corresponding ground object category information indicates the ground object category of that segmented region.
In another aspect, the present application provides a cloud server, including: a processor, and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes computer-executable instructions stored by the memory to implement the method of any of the above aspects.
In another aspect, the present application provides a computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of the above aspects when executed by a processor.
According to the image segmentation and model training method, device, and equipment provided by the application, the candidate category names of an image segmentation scene/task are input into the image segmentation model together with the image. The image segmentation model automatically maps the input category names to text embedding vectors in a unified category representation space, extracts the image features of the image, and performs image segmentation according to the image features and the text embedding vectors to obtain the position masks of the image and the category information corresponding to each position mask. Since a unified category naming scheme and category representation space do not need to be built manually, the approach can be applied to a variety of image segmentation scenes/tasks that use different category systems, which improves the generalization capability and robustness of the image segmentation model and the accuracy of image segmentation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of a system architecture on which the image segmentation of the present application is based;
FIG. 2 is a flowchart of an image segmentation method provided in an exemplary embodiment of the present application;
FIG. 3 is a block diagram of an image segmentation model provided in an exemplary embodiment of the present application;
FIG. 4 is a block diagram of a class guided decoding layer according to an exemplary embodiment of the present application;
FIG. 5 is a detailed flowchart of an image segmentation method provided in an exemplary embodiment of the present application;
FIG. 6 is a flow chart of a method for segmenting a remote sensing image according to an exemplary embodiment of the present application;
FIG. 7 is a flowchart of an image segmentation model training method provided by an exemplary embodiment of the present application;
FIG. 8 is a block diagram of an image segmentation model training provided by an exemplary embodiment of the present application;
FIG. 9 is a detailed flowchart of an image segmentation model training method according to an exemplary embodiment of the present application;
FIG. 10 is a schematic illustration of data enhancement provided by an exemplary embodiment of the present application;
FIG. 11 is a block diagram of a remote sensing image segmentation apparatus according to an exemplary embodiment of the present application;
FIG. 12 is a block diagram of an image segmentation model training apparatus according to an exemplary embodiment of the present application;
FIG. 13 is a schematic structural diagram of a cloud server according to an exemplary embodiment of the present application.
The above figures show specific embodiments of the present application, which are described in more detail below. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Image segmentation is a core technology in the field of computer vision and plays a key role in applications ranging from autonomous driving to remote sensing image analysis. Because of the cost of data collection and annotation, the currently available datasets each cover a limited range of scenarios: for example, the ADE20K dataset contains indoor, outdoor, and natural scenes, the Cityscapes dataset focuses on urban street scenes, and datasets such as COCO and COCO-Stuff target object detection and segmentation.
Most current image segmentation schemes select a dataset for the scene of a specific image segmentation task and train the image segmentation model on that single dataset, so the resulting model generalizes poorly and segments images with low accuracy.
To address the problems of poor generalization capability and low segmentation accuracy of image segmentation models in the prior art, the image segmentation method of the application is designed for image segmentation tasks in a variety of different scenes, where the tasks in different scenes use different category systems, i.e. different sets of candidate category names. In the method, the image to be segmented is input into the image segmentation model, and the candidate category names are input into the model as language guidance information. The image segmentation model extracts the image features of the image and maps the category names to text embedding vectors, where the text embedding vectors of the category names reflect the semantic relations between the category names. Position masks of the image and the category information corresponding to each position mask are then determined according to the image features and the text embedding vectors; during segmentation, the model's predictions are redirected, driven/guided by the text embedding vectors, onto the input candidate category names.
The application also provides an image segmentation model training method that trains the image segmentation model on multiple datasets without unifying their categories or re-annotating them, thereby improving the generalization capability, robustness, and accuracy of the image segmentation model.
Referring to fig. 1, fig. 1 is a schematic diagram of the system architecture on which the image segmentation of the present application is based. The system architecture shown in fig. 1 may specifically include a server and an end-side device. The server may be a server cluster deployed in the cloud and stores a trained image segmentation model that supports image segmentation under a variety of category-system scenarios. With operation logic preset in the server, the server can provide the image segmentation function for these different category-system scenarios.
The end-side device may specifically be a hardware device with network communication, computing, and information display functions, including but not limited to a smart phone, a tablet computer, a desktop computer, an internet-of-things device, a cluster deployed in the cloud, and the like.
Through communication interaction with the server, the user can submit the image to be segmented and the category names of the category system in use to the server through the end-side device; these category names are the candidate category names. The server can input the image and the candidate category names into the image segmentation model, extract the image features of the image through the image segmentation model, map the category names to text embedding vectors, and determine the position masks of the image and the category information corresponding to each position mask according to the image features and the text embedding vectors. The server may then output the position masks of the image and the corresponding category information.
For example, the server may display the position masks of the image and the corresponding category information online; or the server may send the position masks and the corresponding category information to the end-side device; or the server may provide a download link so that the end-side device can download the position masks of the image and the corresponding category information via that link.
In addition, the server may also store a plurality of datasets, where different datasets use different category systems, i.e. have different sets of category names. The server can train the image segmentation model using these datasets with different category name sets, so as to obtain a universal image segmentation model suitable for a variety of different category-system scenarios.
It should be noted that the image segmentation model training method and the image segmentation method may be implemented on the same server: the server trains the image segmentation model using a plurality of datasets and deploys the trained model as a local service, thereby providing an external image segmentation service.
Alternatively, the image segmentation model training method and the image segmentation method may be implemented on different servers. Specifically, a first server stores a plurality of datasets that use different category systems and therefore have different sets of category names. The first server trains the image segmentation model using these datasets to obtain a universal image segmentation model suitable for a variety of category-system scenarios, and deploys the model to a second server. The second server provides the image segmentation service: it obtains the image to be processed and the candidate category names uploaded by the end-side device, executes the processing flow of the image segmentation method, inputs the image and the category names into the image segmentation model, extracts the image features of the image through the model, maps the category names to text embedding vectors, and determines the position masks of the image and the category information corresponding to each position mask according to the image features and the text embedding vectors. The second server then outputs the position masks of the image and the corresponding category information.
The image segmentation method and the image segmentation model training method provided by the application can be applied to different image segmentation tasks such as semantic segmentation, instance segmentation, and panoptic segmentation, and in particular to application scenarios such as land parcel segmentation, ground object classification detection, ground object change detection, target detection, lesion and organ detection in medical images, video surveillance and object tracking, and shelf-vacancy recognition in retail scenes. The image data to be segmented may be of various types, such as remote sensing images, high-definition images, and depth images, which is not specifically limited here.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of an image segmentation method according to an exemplary embodiment of the present application. The method is executed by the server that performs the image segmentation flow. As shown in fig. 2, the method includes the following specific steps:
step S201, obtaining an image to be segmented and a category name to be selected.
The image segmentation method provided by the embodiment is suitable for an image segmentation scene/task which is used for arbitrarily segmenting one or more segmentation regions from an image, wherein each segmentation region corresponds to one category information, and identifies the category information of an object in each segmentation region, and can be particularly applied to various different image segmentation scenes/tasks. When applied to different image segmentation scenarios/tasks, the images to be segmented may be different types of images. For example, the image to be segmented may be a remote sensing image, a high definition image, a depth image, and the like, and is not limited herein.
When the method is applied to different image segmentation scenes/tasks, the used category systems can be different, the name of the category to be selected refers to the name of the category in the category system used in the currently applied image segmentation scene/task, and the set of the name of the category to be selected is the available category name space for image segmentation.
Step S202, inputting the image and the candidate category names into an image segmentation model, extracting image features of the image through the image segmentation model, mapping the category names to text embedding vectors in a unified category representation space, and determining the position masks of the image and the category information corresponding to each position mask according to the image features and the text embedding vectors.
After the image to be segmented and the candidate category names are obtained, the candidate category names are used as input data and are fed into the image segmentation model together with the image. Inside the model the input candidate category names take the place of a fixed unified category name space: they serve as the category name space used for image segmentation and guide the segmentation process, so that segmentation is performed under the specific, input-defined category name space (the set of candidate category names).
Specifically, the image encoding module of the image segmentation model encodes (performs feature extraction on) the input image to obtain the image features of the image. The text encoding module of the image segmentation model maps the input category names to text embedding vectors in a unified category representation space; each text embedding vector contained in the category representation space is a text representation of a category name (also referred to as a text embedding, category representation, or category embedding). Then, the position masks of the image and the category information corresponding to each position mask are determined according to the image features and the text embedding vectors of the candidate category names: the text embedding vectors serve as the classification representations used during segmentation, and the segmentation query vector corresponding to a segmented region is aligned with the text embedding vectors to determine the category name of that region. Since a unified category naming scheme and category representation space do not need to be constructed manually, accurate image segmentation can be achieved.
The image segmentation model is obtained by training on a plurality of datasets with different category name spaces and can therefore be applied to image segmentation scenes/tasks that use the category name space of any of these datasets. The specific training process of the image segmentation model is described in detail in the following embodiments.
Step S203, outputting the position masks of the image and the category information corresponding to each position mask, where a position mask indicates a segmented region in the image and the corresponding category information indicates the category of that segmented region.
In this embodiment, segmenting the image produces the position masks of the image and the category information corresponding to each position mask. A position mask indicates a segmented region in the image, and the corresponding category information indicates the category of that region. In semantic segmentation a position mask may be a category-region mask indicating the region covered by a category in the image; in panoptic segmentation a position mask may be an object-instance mask indicating the region in which a particular object instance is located.
Illustratively, take an image segmentation scene in which a remote sensing image is divided into the coverage areas of different ground objects together with the category information of those ground objects. For an input remote sensing image the segmentation result may include one or more position masks and the category information corresponding to each position mask. Each position mask indicates the position of a segmented region in the remote sensing image, i.e. which pixels of the remote sensing image the region contains, and the category information corresponding to the position mask is the ground object category of that segmented region.
For example, a position mask may be a binary mask of the same size as the remote sensing image, in which the mask values correspond one-to-one to the pixels of the image. Each mask value is 0 or 1 and indicates whether the corresponding pixel belongs to the segmented region: pixels inside the region have mask value 1 and pixels outside it have mask value 0, so the position mask determines a segmented region consisting of all pixels whose mask value is 1.
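As a toy illustration of this binary-mask convention, the following snippet (Python/NumPy; the 4x4 size and region are invented purely for illustration) builds such a position mask and recovers the pixels of the segmented region from it.

    import numpy as np

    h, w = 4, 4                                        # toy image size, invented for illustration
    position_mask = np.zeros((h, w), dtype=np.uint8)   # same size as the (toy) image
    position_mask[1:3, 1:3] = 1                        # pixels belonging to the segmented region

    region_pixels = np.argwhere(position_mask == 1)    # (row, col) coordinates with mask value 1
    print(region_pixels)                               # the 2x2 block of pixels covered by the mask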
Optionally, the server may display the location mask of the image and category information corresponding to the location mask online.
Alternatively, the server may send the location mask of the image and the category information corresponding to the location mask to the end-side device, and the end-side device displays the location mask and the category information.
Optionally, the server may provide a download link to the end-side device, so that the end-side device downloads the position mask of the image and the category information corresponding to the position mask to the local according to the download link, and displays the position mask and the category information locally.
In this embodiment, the candidate category names of the current image segmentation scene/task are input into the image segmentation model together with the image. The image segmentation model automatically maps the input category names to text embedding vectors in a unified category representation space, extracts the image features of the image, and performs image segmentation according to the image features and the text embedding vectors to obtain the position masks of the image and the category information corresponding to each position mask. Since a unified category naming scheme and category representation space do not need to be built manually, the method can be applied to a variety of image segmentation scenes/tasks that use different category systems, which improves the generalization capability and robustness of the image segmentation model and the accuracy of image segmentation.
In an optional embodiment, outputting the position masks of the image and the category information corresponding to each position mask may also be implemented as follows:
generating segmentation result information of the image according to the position mask of the image and the category information corresponding to the position mask, wherein the segmentation result information comprises the position information and the corresponding category information of a segmentation area in the image; and outputting segmentation result information of the image.
Specifically, a segmented region in the image can be determined from a position mask of the image, and the category information corresponding to that position mask is used as the category information of the segmented region it determines.
For example, the segmented regions corresponding to the position masks may be marked on the image with filled areas or borders of different colors, where different colors represent different categories, and the category names may also be displayed directly in (or near) the segmented regions to form a segmentation result map. Presenting the position masks of the image and their corresponding category information as a segmentation result map shows the segmentation result more intuitively and makes it easy for users to inspect and distinguish the segmented regions of different categories.
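A hedged sketch of how such a segmentation result map could be rendered (Python/NumPy): each position mask is blended onto the image in its own colour and its category name is anchored near the region. The palette, alpha value, and label placement are illustrative choices, not specified by the patent.

    import numpy as np

    def render_result_map(image, masks, category_names, alpha=0.5):
        """image: (H, W, 3) uint8; masks: list of (H, W) arrays of 0/1; category_names: list of str."""
        palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255)]  # illustrative colours
        out = image.astype(np.float32)
        labels = []
        for i, (mask, name) in enumerate(zip(masks, category_names)):
            color = np.array(palette[i % len(palette)], dtype=np.float32)
            region = mask.astype(bool)
            out[region] = (1 - alpha) * out[region] + alpha * color   # colour the segmented region
            ys, xs = np.nonzero(mask)
            if len(ys) > 0:                                           # place the label near the region centre
                labels.append((name, (int(xs.mean()), int(ys.mean()))))
        return out.astype(np.uint8), labels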
Illustratively, take an image segmentation scene in which different land parcels and their categories are identified in a remote sensing image. The image to be segmented is a remote sensing image containing at least one land parcel, different segmented regions in the segmentation result information correspond to different parcels, and the category information of a segmented region is the ground object category of the parcel it corresponds to. When the segmentation result information of the image is output, the remote sensing image can be displayed, and the positions of the segmented regions and their corresponding ground object categories are marked in the displayed remote sensing image according to the segmentation result information.
Illustratively, take an image segmentation scene in which the objects appearing in an image and the categories they belong to are identified. The image to be segmented contains at least one target object, different segmented regions in the segmentation result information correspond to different objects, and the category information of a segmented region is the category of the object it corresponds to. When the segmentation result information of the image is output, the image is displayed, and the positions of the segmented regions where the target objects are located and the category information of the target objects are marked in the displayed image according to the segmentation result information.
For example, when applied to e-commerce scenes, the method can be used for identifying target commodities appearing in a given image and commodity categories to which the target commodities belong. When the method is applied to an automatic driving scene, the method can be used for identifying objects such as vehicles, roads, roadside equipment, green plants and the like appearing in a given image and marking the positions of different objects and the category information of the objects.
Further, in response to a correction operation on the position information and/or category information of a segmented region in the segmentation result information of the image, the corresponding position information and/or category information in the segmentation result information is updated.
Illustratively, if the user needs to correct the position of a segmented region, the user may enter new position information for the region in a text box, or adjust the position of the region by dragging it on the displayed image. If the user needs to correct the category information of a segmented region, the user may enter the category information directly in an input box, or click the category information of the region to be corrected and choose the correct category from the displayed category list.
In this embodiment, the operation and manner of correcting the position information and/or category information of a segmented region are not specifically limited.
In an alternative embodiment, the image segmentation model may use the model framework shown in fig. 3. Referring to fig. 3, the image segmentation model comprises an image encoder, a pixel feature extractor, and a text encoder.
The image encoder down-samples the input image to extract a first image feature of the input image, i.e. an image feature at a lower resolution. Illustratively, the image encoder may use any backbone network for extracting image features, such as ResNet, an improved ResNet variant, or another convolutional neural network (CNN), which is not limited here.
The pixel feature extractor transforms the first image feature extracted by the image encoder to obtain a second image feature of the image at the pixel level. The second image feature is pixel-level and has a higher resolution than the first image feature. Transforming the first image feature into the pixel-level image feature may be implemented with common upsampling methods such as transposed convolution (also called deconvolution), upsampling, or unpooling, which is not limited here.
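A hedged PyTorch sketch of such a pixel feature extractor, here using transposed convolutions to upsample the backbone's first image feature to a pixel-level feature map; the channel sizes and the number of upsampling stages are assumptions for illustration, not values taken from the patent.

    import torch
    import torch.nn as nn

    class PixelFeatureExtractor(nn.Module):
        def __init__(self, in_channels=2048, out_channels=256):   # channel sizes are assumed
            super().__init__()
            self.upsample = nn.Sequential(
                nn.ConvTranspose2d(in_channels, 512, kernel_size=2, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(512, out_channels, kernel_size=2, stride=2), nn.ReLU(),
            )

        def forward(self, first_image_feature):          # e.g. (B, 2048, H/32, W/32) from the backbone
            return self.upsample(first_image_feature)    # pixel-level second image feature

    pixel_features = PixelFeatureExtractor()(torch.randn(1, 2048, 16, 16))   # -> (1, 256, 64, 64)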
The text encoder automatically maps the input category names to the unified category representation, producing a text embedding vector for each category name. The text embedding vectors capture the semantic relations among different categories, so that category names with similar semantics are embedded closer together. Therefore, whatever category system the input category names come from, the text encoder can map them into the unified category representation, which makes the method suitable for image segmentation scenes/tasks with different category systems and easy to extend to more image segmentation scenes/tasks.
For example, the weight parameters of the text encoder may be initialized with the parameters of a CLIP (Contrastive Language-Image Pre-training) model and then fixed, i.e. not updated during the training of the image segmentation model. CLIP is a model pre-trained by contrasting text-image pairs. The text embedding space of CLIP is well suited as the unified category representation, since categories with similar semantics have closely related text embedding vectors. A CLIP-like model may also be used as the text encoder, which is not specifically limited here.
Specifically, the text encoder may introduce learnable (trainable) context information (also referred to as a prompt template or prompt text) consisting of a number of word vectors (V1, …, Vm, where m is the number of word vectors, as shown in fig. 3). During the training of the image segmentation model the context information can be initialized randomly and optimized in the iterative training, and it is fixed after training is finished. The text encoder then maps the input category names to the unified category representation based on this context information, obtaining the text embedding vectors of the input category names.
The specific way in which the text encoder maps an input category name to a text embedding vector using the context information (prompt template/prompt text) is similar to the way the CLIP model maps input text to a text embedding vector, and is not described again here.
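A hedged PyTorch sketch of the mapping described above: m learnable context word vectors (V1, …, Vm) are prepended to the token embeddings of each candidate category name and passed through a frozen text encoder whose weights would, in practice, come from a CLIP-style model. The stand-in transformer encoder, the tokenisation, the mean pooling, and all dimensions are assumptions for illustration, not the patent's concrete implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PromptedTextEncoder(nn.Module):
        """Maps category-name tokens to embeddings in a shared category representation space."""
        def __init__(self, vocab_size=49408, dim=512, num_context=8, num_layers=2):
            super().__init__()
            self.token_embedding = nn.Embedding(vocab_size, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)   # stand-in frozen encoder
            for p in list(self.token_embedding.parameters()) + list(self.encoder.parameters()):
                p.requires_grad_(False)                    # encoder weights fixed (e.g. CLIP-style init)
            # learnable context word vectors V1..Vm, randomly initialised
            self.context = nn.Parameter(torch.randn(num_context, dim) * 0.02)

        def forward(self, name_token_ids):                 # (C, L) token ids, one row per category name
            name_emb = self.token_embedding(name_token_ids)                       # (C, L, dim)
            ctx = self.context.unsqueeze(0).expand(name_emb.size(0), -1, -1)      # (C, m, dim)
            prompts = torch.cat([ctx, name_emb], dim=1)                           # [V1..Vm, name tokens]
            encoded = self.encoder(prompts)                                       # (C, m+L, dim)
            text_emb = encoded.mean(dim=1)                                        # pooled category embedding
            return F.normalize(text_emb, dim=-1)

    # usage: token ids would come from the tokenizer of the chosen pre-trained model
    name_tokens = torch.randint(0, 49408, (3, 10))          # 3 candidate category names (toy tokens)
    text_embeddings = PromptedTextEncoder()(name_tokens)    # -> (3, 512)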
As shown in fig. 3, the image segmentation model further includes a class-guided decoder that can dynamically adapt the predicted class information to different image segmentation scenes/tasks. The class-guided decoder contains a number of initial segmentation query vectors (N in fig. 3 denotes the number of segmentation query vectors). The class-guided decoder de-duplicates the segmentation query vectors and aggregates the first image feature and the text embedding vectors into the de-duplicated segmentation query vectors to obtain the aggregated segmentation query vectors.
Further, the category information corresponding to each segmentation query vector can be determined by aligning the aggregated segmentation query vectors with the text embedding vectors output by the text encoder.
Illustratively, taking the dot product of the segmentation query vectors with the text embedding vectors gives, for each segmentation query vector, a classification probability for each category, and the category with the highest classification probability is taken as the category information corresponding to that query vector.
The aggregated segmentation query vectors output by the class-guided decoder may be passed through a multi-layer perceptron (MLP) to further extract features for region segmentation (the regression task) and to enhance the region feature information they contain. The position mask corresponding to each aggregated segmentation query vector can then be determined from the processed segmentation query vectors and the second image feature output by the pixel feature extractor.
For example, the dot product of each processed segmentation query vector with the feature vector of each pixel in the pixel-level second image feature gives the similarity probability between that pixel and the query vector. Given a set probability threshold, pixels whose similarity probability is greater than or equal to the threshold are assigned mask value 1 and pixels below the threshold are assigned mask value 0, which yields the position mask corresponding to each segmentation query vector.
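A hedged PyTorch sketch (with assumed shapes and an illustrative 0.5 threshold) of the two predictions just described: class scores from the dot product between the aggregated segmentation query vectors and the text embedding vectors, and binary position masks from the dot product between the MLP-processed query vectors and the pixel-level second image feature followed by thresholding.

    import torch
    import torch.nn as nn

    def predict_classes_and_masks(queries, text_embeddings, pixel_features, mask_mlp, threshold=0.5):
        # queries: (N, D) aggregated segmentation query vectors
        # text_embeddings: (C, D), one per candidate category name (dimensions already matched)
        # pixel_features: (D, H, W) pixel-level second image feature; threshold of 0.5 is illustrative
        class_probs = (queries @ text_embeddings.t()).softmax(dim=-1)     # (N, C) classification probabilities
        class_ids = class_probs.argmax(dim=-1)                            # highest-probability category per query

        mask_queries = mask_mlp(queries)                                  # region features via the MLP
        d, h, w = pixel_features.shape
        similarity = mask_queries @ pixel_features.view(d, h * w)         # (N, H*W) query-pixel similarity
        masks = (similarity.sigmoid() >= threshold).view(-1, h, w).long() # binary position masks
        return masks, class_ids, class_probs

    # usage with toy tensors (N=100 queries, C=5 candidate categories, D=256)
    mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
    masks, class_ids, _ = predict_classes_and_masks(
        torch.randn(100, 256), torch.randn(5, 256), torch.randn(256, 64, 64), mlp)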
For example, referring to fig. 4, the class-guided decoder may be built by stacking a plurality of identical class-guided decoding layers, each of which includes a multi-head self-attention (MHSA) module, a cross-attention module, and a feed-forward network (FFN) module.
The multi-head self-attention module de-duplicates the input segmentation query vectors (self-deduplication). The cross-attention module aggregates the first image feature and the text embedding vectors into the de-duplicated segmentation query vectors to obtain the aggregated segmentation query vectors. The feed-forward network module applies a non-linear transformation to the aggregated segmentation query vectors to improve their representational capability.
The cross-attention module may include a visual-query cross-attention module and a text-query cross-attention module. The visual-query cross-attention module aggregates the input image features into the input segmentation query vectors to obtain new segmentation query vectors, and the text-query cross-attention module aggregates the input text embedding vectors into the input segmentation query vectors to obtain new segmentation query vectors. Both modules can be implemented with existing cross-attention neural networks.
When a computation between the text embedding vectors and the segmentation query vectors is required, the dimension of the text embedding vectors may be adjusted with a linear adapter so that it matches the dimension of the segmentation query vectors.
In addition, the text embedding vectors to which the text encoding module maps the category names can be stored; for the same input category names the stored text embedding vectors can later be fetched directly without running the text encoding module again, which reduces the computation cost introduced by the text encoder.
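A minimal sketch of this caching idea in Python; the cache key and the encoder/tokenizer calls are illustrative stand-ins rather than a prescribed interface.

    _text_embedding_cache = {}

    def get_text_embeddings(category_names, text_encoder, tokenizer):
        key = tuple(category_names)
        if key not in _text_embedding_cache:                  # encode only on the first request
            _text_embedding_cache[key] = text_encoder(tokenizer(category_names))
        return _text_embedding_cache[key]                     # reuse stored embeddings afterwards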
In fig. 4, the visual-query cross-attention module in the class-guided decoding layer is placed before the text-query cross-attention module, i.e. the first image feature is aggregated into the segmentation query vectors first and the text embedding vectors are then aggregated into the segmentation query vectors that already carry the first image feature, giving the aggregated segmentation query vectors. In other implementations the class-guided decoding layer may place the text-query cross-attention module first and the visual-query cross-attention module after it, i.e. the text embedding vectors are aggregated into the segmentation query vectors first and the first image feature is then aggregated into the segmentation query vectors that already carry the text embedding vectors, giving the aggregated segmentation query vectors. The arrangement (stacking) order of the visual-query and text-query cross-attention modules is not specifically limited.
In addition, fig. 4 shows, as an example, six class-guided decoding layers stacked into the class-guided decoder; the number of stacked class-guided decoding layers is not specifically limited here.
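A hedged PyTorch sketch of one class-guided decoding layer as described above: multi-head self-attention over the segmentation query vectors, visual-query cross-attention with the first image feature, text-query cross-attention with the (linearly adapted) text embedding vectors, and an FFN. Dimensions, normalisation placement, and residual connections are assumptions for illustration.

    import torch
    import torch.nn as nn

    class ClassGuidedDecodingLayer(nn.Module):
        def __init__(self, dim=256, text_dim=512, num_heads=8):   # dimensions are assumed values
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.visual_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.text_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.text_adapter = nn.Linear(text_dim, dim)          # linear adapter matching dimensions
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

        def forward(self, queries, image_features, text_embeddings):
            # queries: (B, N, dim); image_features: (B, HW, dim); text_embeddings: (B, C, text_dim)
            q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])        # de-duplicate
            q = self.norms[1](q + self.visual_cross_attn(q, image_features, image_features)[0])
            t = self.text_adapter(text_embeddings)
            q = self.norms[2](q + self.text_cross_attn(q, t, t)[0])                          # class guidance
            return self.norms[3](q + self.ffn(q))

    # a class-guided decoder would stack e.g. 6 such layers, feeding each layer's output queries
    # into the next while reusing the same image features and text embeddings at every layer
    layer = ClassGuidedDecodingLayer()
    out = layer(torch.zeros(1, 100, 256), torch.randn(1, 4096, 256), torch.randn(1, 5, 512))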
The number of segmentation query vectors in the class-guided decoder can be set according to the size of the category name space of the actual image segmentation scene; it should be greater than or equal to the largest category name space among the targeted image segmentation scenes so that the model is suitable for all of them. For example, 100 segmentation query vectors generally satisfy the requirements of image segmentation; the number can be set and adjusted as needed and is not specifically limited here.
The segmentation query vectors in the class-guided decoder may be initialized to zero vectors, each associated with a learnable position encoding.
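A two-line PyTorch sketch of that initialisation (N = 100 and dim = 256 are illustrative values): the segmentation query vectors start as zero vectors, and each query has a learnable position encoding.

    import torch
    import torch.nn as nn

    num_queries, dim = 100, 256                                     # illustrative sizes
    query_features = nn.Parameter(torch.zeros(num_queries, dim))    # queries initialised to zero vectors
    query_positions = nn.Embedding(num_queries, dim)                # learnable position encoding per query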
In addition, in other embodiments, unlike the image segmentation model in fig. 3 that includes both an image encoder and a pixel feature extractor, the image segmentation model may include only a single image encoding module. This module directly extracts a pixel-level third image feature of the image and uses it simultaneously as the first image feature and the second image feature in the image segmentation procedure.
In this embodiment, the class-guided decoder follows the standard Transformer architecture: it transforms the segmentation query vectors with the multi-head self-attention and cross-attention mechanisms and the FFN module. The multi-head self-attention module performs self-deduplication on the segmentation query vectors, so that the model can use the pairwise relationships between them to predict/infer the position masks corresponding to all the segmentation query vectors globally. Cross-attention between the segmentation query vectors and the image features of the input image gathers additional image information using the entire image as context. Cross-attention between the segmentation query vectors and the text embedding vectors guides each query vector toward the category corresponding to the text embedding vector of an input category name, allowing the category information corresponding to the query vector to be predicted/inferred.
Referring to fig. 5, fig. 5 is a detailed flowchart of an image segmentation method provided in an exemplary embodiment of the present application. Based on the model framework shown in fig. 3, the image segmentation method of this embodiment includes the following specific steps:
Step S501, acquiring the image to be segmented and the candidate category names.
The image segmentation method provided by this embodiment is suitable for any image segmentation scene/task in which one or more segmented regions are extracted from an image, each segmented region corresponds to one piece of category information, and the category of the object in each segmented region is identified; it can be applied to a variety of different image segmentation scenes/tasks. In different scenes/tasks the image to be segmented may be of different types, for example a remote sensing image, a high-definition image, or a depth image, which is not limited here.
Different image segmentation scenes/tasks may use different category systems. The candidate category names are the names of the categories in the category system used by the current image segmentation scene/task, and the set of candidate category names is the category name space available for image segmentation.
Step S502, inputting the image into the image encoder of the image segmentation model, and inputting the candidate category names into the text encoder of the image segmentation model.
Step S503, mapping the candidate category names to text embedding vectors through the text encoder using the trained context information.
Specifically, learnable (trainable) context information (also referred to as a prompt template or prompt text) is introduced, consisting of a number of word vectors. At the start of training the image segmentation model the word vectors contained in the context information are initialized randomly; the context information is optimized during iterative training and fixed after training is finished.
In this step, the input category names are mapped into a unified category representation space by the text encoder, based on the trained context information, to obtain the text embedding vectors of the category names. The unified category representation space is not a category representation space that exists in an actual scene; it is formed during the model training process and covers the embeddings of the category names of all datasets used in training. The category representation space contains the text embedding vectors of many category names, i.e. the text representations of the category names, also called category representations.
Illustratively, the weight parameters of the text encoder may be initialized with the parameters of a pre-trained CLIP (Contrastive Language-Image Pre-training) model and then fixed, so that they are not updated while training the image segmentation model. CLIP is a model pre-trained by contrasting text-image pairs. In this embodiment the text embedding space of CLIP is used as the unified category representation space, in which categories with similar semantics have closely related text embedding vectors (category representations). A CLIP-like pre-trained model may also be used as the text encoder, which is not specifically limited here.
The specific way in which the text encoder maps an input category name to a text embedding vector using the context information (prompt template/prompt text) is similar to the way the CLIP model maps input text to a text embedding vector, and is not described again here.
In this step, the input category names are automatically mapped to the unified category representation by the text encoder, producing a text embedding vector for each category name. The text embedding vectors capture the semantic relations among different categories, so that category names with similar semantics are embedded closer together. Therefore, whatever category system the input category names come from, the text encoder can map them into the unified category representation, which makes the method suitable for image segmentation scenes/tasks with different category systems and easy to extend to more image segmentation scenes/tasks.
Step S504, encoding the image through the image encoder to obtain the first image feature of the image.
In this embodiment, the input image is down-sampled by the image encoder to extract lower-resolution image features of the input image.
Illustratively, the image encoder may use any backbone network for extracting image features, such as ResNet, an improved ResNet variant, or another convolutional neural network (CNN), which is not limited here.
Optionally, at the start of training, the weight parameters of the image encoder may be initialized with the weights of a backbone pre-trained on ImageNet, or with the weights of the pre-trained CLIP image encoder. Initializing both the text encoder and the image encoder with pre-trained CLIP weights can improve the performance and robustness of the image segmentation model, for example raising its intersection over union (IoU).
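For instance, a ResNet backbone initialised from ImageNet-pre-trained weights could be obtained as below (the torchvision API is shown as one possible choice; the patent does not prescribe a specific library, and CLIP image-encoder weights could be loaded analogously).

    import torch.nn as nn
    from torchvision.models import resnet50, ResNet50_Weights   # torchvision shown as one possible choice

    backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)       # ImageNet-pre-trained weights
    image_encoder = nn.Sequential(*list(backbone.children())[:-2])    # keep only the feature-map stages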
And step S505, carrying out duplicate removal processing on the input query vector for segmentation through a classification guide decoder, and aggregating the first image feature and text embedded vector with the query vector for segmentation after the duplicate removal processing to obtain the aggregated query vector for segmentation.
As shown in fig. 4, the class guide decoder includes multiple stacked class guide decoding layers, and the segmentation query vector input by the first class guide decoding layer is the initial segmentation query vector. The partitioning query vector output by the previous classification guidance decoding layer is used as the partitioning query vector input by the next classification guidance decoding layer. The first image feature and the text embedding vector are used as input of each layer classification guide decoding layer. Each layer classification directs the decoding layer to perform the process of step S505. Inputting the initial query vector for segmentation into a multi-head self-attention module at a first layer of a classified guide decoding layer in a classified guide decoder, and circularly executing the processing of the step S505 for multiple times through the classified guide decoder to obtain the final aggregated query vector for segmentation.
Optionally, in this step, the first image features may be aggregated into the query vector for segmentation, and then the text embedding vector may be aggregated into the query vector for segmentation with the first image features aggregated, so as to obtain an aggregated query vector for segmentation.
Optionally, in this step, the text embedding vector is aggregated into the query vector for segmentation, and then the first image feature is aggregated into the query vector for segmentation aggregated with the text embedding vector, so as to obtain an aggregated query vector for segmentation.
And S506, aligning the aggregated query vector for segmentation and the text embedded vector to obtain category prediction information corresponding to the aggregated query vector for segmentation.
In this step, the category information corresponding to each query vector for segmentation can be determined by performing alignment processing on the aggregated query vector for segmentation and the text embedded vector output by the text encoder.
Specifically, by computing the dot product between each query vector for segmentation and each text embedding vector, the classification prediction probability that the query vector for segmentation corresponds to the category of each text embedding vector can be obtained, and the category information with the largest classification prediction probability is determined as the category information corresponding to that query vector for segmentation.
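As a rough illustration of this alignment step (the sizes N and K and the embedding dimension are arbitrary assumptions), the dot product between the aggregated query vectors and the text embedding vectors yields the classification prediction probabilities:

```python
import torch

N, K, dim = 100, 3, 512                  # N query vectors, K categories (assumed sizes)
queries = torch.randn(N, dim)            # aggregated query vectors for segmentation
text_emb = torch.randn(K, dim)           # text embedding vectors from the text encoder

logits = queries @ text_emb.T            # dot products: (N, K) classification scores
probs = logits.softmax(dim=-1)           # classification prediction probabilities
pred_category = probs.argmax(dim=-1)     # category with the largest probability per query
```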
Step S507, the first image feature is transformed by the pixel feature extractor to obtain a second image feature of the image.
In this step, the first image feature is transformed by the pixel feature extractor to obtain a pixel-level second image feature of the image, and the pixel-level second image feature is used for regression prediction to determine the position mask of the image.
Step S508, determining a position mask corresponding to the aggregated query vector for segmentation according to the aggregated query vector for segmentation and the second image feature.
In this step, feature extraction is performed on the aggregated query vector for segmentation by a multilayer perceptron (MLP) to further extract features for region segmentation (regression task), and region feature information included in the query vector for segmentation is enhanced to obtain a processed query vector for segmentation. Further, the position mask corresponding to the aggregated query vector for division is determined according to the processed query vector for division and the second image feature output by the pixel feature extractor.
Specifically, the dot product between the processed query vector for segmentation and the feature vector of each pixel in the pixel-level second image feature output by the pixel feature extractor may be computed to obtain the similarity probability between the pixel and each query vector for segmentation. According to the similarity probabilities and a set probability threshold, the mask values corresponding to pixels whose similarity probability is greater than or equal to the probability threshold are set to 1, and the mask values corresponding to pixels whose similarity probability is smaller than the probability threshold are set to 0, so as to obtain the position mask corresponding to each query vector for segmentation.
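The mask prediction described above might be sketched as follows, assuming the processed query vectors and the pixel-level feature map are already available; the threshold of 0.5 and the tensor shapes are illustrative assumptions.

```python
import torch

N, dim, H, W = 100, 256, 128, 128                  # assumed sizes
queries = torch.randn(N, dim)                      # processed query vectors (after the MLP)
pixel_feat = torch.randn(dim, H, W)                # pixel-level second image feature

# Dot product between each query and every pixel feature vector -> similarity map
sim = torch.einsum("nd,dhw->nhw", queries, pixel_feat)
sim_prob = sim.sigmoid()                           # similarity probability per pixel

threshold = 0.5                                    # assumed probability threshold
masks = (sim_prob >= threshold).to(torch.uint8)    # 1 inside the region, 0 outside
```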
Step S509, determining the position mask of the image and the category information corresponding to the position mask according to the category prediction information and the position mask corresponding to the aggregated query vector for segmentation.
After the category prediction information and the position mask corresponding to each query vector for segmentation are determined, the position mask corresponding to each query vector for segmentation determines one segmented region in the image, and the category prediction information corresponding to that query vector for segmentation is used as the category prediction information of the segmented region determined by its position mask, thereby obtaining the position mask of the image and the category information corresponding to the position mask.
And step S510, outputting the position mask of the image and the category information corresponding to the position mask.
This step is consistent with the implementation manner of step S203, and for details, reference is made to the relevant content of step S203, which is not described herein again.
This embodiment provides an exemplary image segmentation model framework and explains a specific image segmentation process in detail based on it. The candidate category names of the current image segmentation scene/task are input into the image segmentation model together with the image; the image segmentation model automatically maps the input category names into text embedding vectors in a unified category representation space, extracts the image features of the image, and performs image segmentation according to the image features and the text embedding vectors to obtain the position mask of the image and the category information corresponding to the position mask. A unified category naming scheme and category representation space does not need to be manually established, so the framework is applicable to various image segmentation scenes/tasks using different category systems, which improves the generalization capability and robustness of the image segmentation model and the accuracy of image segmentation.
In an alternative embodiment, the image segmentation method can be applied to a scene segmented by a remote sensing image. Referring to fig. 6, fig. 6 is a flowchart of a remote sensing image segmentation method according to an exemplary embodiment of the present application. As shown in fig. 6, the method comprises the following specific steps:
Step S601, obtaining a remote sensing image to be segmented and names of ground object categories to be selected.
Step S602, inputting the remote sensing image and the name of the surface feature category into an image segmentation model, extracting the image characteristics of the remote sensing image through the image segmentation model, mapping the name of the surface feature category into a text embedding vector, and determining the position mask of the remote sensing image and the information of the surface feature category corresponding to the position mask according to the image characteristics and the text embedding vector.
And step S603, outputting a position mask of the remote sensing image and surface feature category information corresponding to the position mask, wherein the position mask indicates a segmentation region in the image, and the surface feature category information corresponding to the position mask indicates surface feature category information of the segmentation region in the image.
In an application scene example, the remote sensing image segmentation method can be used for identifying different land areas and category information in a remote sensing image, the image to be segmented is the remote sensing image containing at least one land, different segmentation areas in segmentation result information correspond to different lands, and the category information corresponding to the segmentation areas is the land object category information of the land corresponding to the segmentation areas.
Specifically, when a position mask of the remote sensing image and ground object category information corresponding to the position mask are output, the position information of a land parcel in the remote sensing image is determined according to the position mask of the remote sensing image; determining the land feature category information corresponding to the land parcel in the remote sensing image according to the land feature category information corresponding to the position mask; and marking the position of the land parcel and the land parcel category information corresponding to the land parcel in the remote sensing image according to the position information of the land parcel in the remote sensing image and the land parcel category information corresponding to the land parcel.
The method can be applied to remote sensing image change detection. Specifically, image segmentation is performed on remote sensing images of two time phases respectively, the ground object categories of different land areas in the remote sensing image of each time phase are identified, and the land areas whose categories have changed can be determined by comparing the image segmentation result information of the two time-phase remote sensing images. For example, an area where a green space is changed to a building, an area where a water area is changed to a road, and the like.
The method can be applied to road segmentation based on remote sensing images, and the names of the ground object categories to be selected can comprise roads and backgrounds. Specifically, a remote sensing image covering an urban road is acquired, the remote sensing image is subjected to image segmentation, a road area and a background area (non-road area) in the remote sensing image are identified, and the actual position of the road can be calculated according to the road area in the remote sensing image and used for constructing map data.
The difference between this embodiment and the above method embodiment is that the image to be processed is a remote sensing image, the name of the category to be selected is a name of a surface feature category, and a specific processing flow is similar to the implementation flow of the above image segmentation method, which is referred to in detail in the above embodiment of the image segmentation method, and is not described herein again.
Fig. 7 is a flowchart of an image segmentation model training method according to an exemplary embodiment of the present application. The execution subject of the method provided by the application can be the server for training the image segmentation model.
In the present embodiment, the plurality of data sets may be a plurality of different data sets using different classification systems in different image segmentation scenes. Since different data sets use different category systems, the category namespaces (containing the names of the categories to be selected) of the different data sets are different.
The main problem of image segmentation model training based on multiple different data sets is that different data sets use different category systems, and the classification categories of different data sets are inconsistent, including: category overlap, category label (e.g., category ID) conflicts, naming differences for categories, etc. For example, the category "human" in the ADE20K data set corresponds to the categories "human" and "rider" in the Cityscapes data set.
In existing schemes for training an image segmentation model on multiple data sets, a unified set of one-hot category labels usually has to be manually established, i.e., a unified category system is built, the sample images in each data set are re-annotated, and the image segmentation model is then trained with the re-annotated data sets, which is time-consuming and error-prone. Furthermore, a one-hot taxonomy is inflexible and not extensible.
According to the image segmentation model training method provided by the embodiment, the image segmentation model can be trained by using a plurality of data sets without manually establishing a uniform category label and re-labeling the data sets, so that the image segmentation model is suitable for image segmentation scenes/tasks corresponding to the data sets.
Fig. 8 is a framework diagram of image segmentation model training provided in an exemplary embodiment of the present application. In fig. 8, training with the image segmentation model shown in fig. 3 is taken as an example; the structure and specific functions of the image segmentation model are consistent with those of the image segmentation model used in the above embodiments of the image segmentation method, for which reference is made to the related descriptions of the above embodiments, and details are not repeated here. During model training, category names are input into the image segmentation model together with the sample image; the category names are the candidate category names of the data set where the sample image is located, and the candidate categories of a data set usually also include a background category, which indicates that a region does not belong to any other candidate category of the data set. In fig. 8, M denotes the number of data sets, C denotes the number of input sample images, and K denotes the number of input category names (excluding the background category); the number of sample images differs between data sets. N denotes the number of query vectors for segmentation. H × W denotes the resolution of the input sample image, which may differ between sample images.
As shown in fig. 7, the image segmentation model training method provided in this embodiment specifically includes the following steps:
step S701, obtaining a plurality of data sets and category names to be selected of the data sets, wherein the data sets comprise sample images and image segmentation and annotation results of the sample images, and the image segmentation and annotation results comprise position masks of the sample images and category information corresponding to the position masks.
The multiple data sets may be known data sets for various image segmentation scenes/tasks, and different data sets may use different category systems, that is, different data sets may have different candidate category names. For example, the plurality of data sets may include ADE20K, Cityscapes, Mapillary Vistas, COCO, COCO-Stuff, and the like.
Each data set comprises a plurality of sample images and an image segmentation labeling result of each sample image. The image segmentation labeling result comprises a position mask of the sample image and category information corresponding to the position mask.
Step S702, inputting a sample image and a name of a category to be selected of a data set where the sample image is located into an image segmentation model to be trained, extracting image features of the sample image through the image segmentation model, mapping the name of the category to be selected into a text embedding vector, and determining an image segmentation prediction result according to the image features and the text embedding vector, wherein the image segmentation prediction result comprises a prediction result of a position mask of the sample image and a prediction result of category information corresponding to the position mask.
When model training is carried out, the sample image and the name of the category to be selected of the data set where the sample image is located are input into an image segmentation model to be trained, and an image segmentation prediction result of the sample image is predicted through the image segmentation model.
In this step, the specific process of extracting the image features of the sample image through the image segmentation model, mapping the candidate category names to text embedding vectors, and determining the image segmentation prediction result according to the image features and the text embedding vectors is similar to that of the above image segmentation method embodiments, and is not repeated here.
And step S703, calculating loss according to the image segmentation prediction result and the image segmentation labeling result of the sample image, and training model parameters of the image segmentation model to obtain the trained image segmentation model.
In this embodiment, the loss is calculated according to the image segmentation prediction result and the image segmentation labeling result, which can be implemented by using the same loss function in a manner of calculating the loss when training the image segmentation model in the prior art, and is not described herein again.
The image segmentation model trained by the method of the embodiment is used for realizing the method flow provided by the image segmentation method embodiment, performing image segmentation on the input image, and determining the position mask of the input image and the category information corresponding to the position mask.
In this embodiment, the candidate category names of the data set and the sample image are input into the image segmentation model together. The image segmentation model automatically maps the input category names into text embedding vectors in a unified category representation space, extracts the image features of the image, and performs image segmentation according to the image features and the text embedding vectors to obtain an image segmentation prediction result; the model parameters of the image segmentation model are then trained according to the image segmentation prediction result and the image segmentation labeling result. A unified category naming scheme and category representation space does not need to be manually established, so the image segmentation model can be trained across a plurality of different data sets, and the trained image segmentation model is applicable to various image segmentation scenes/tasks using different data sets (under different category systems), which improves the generalization capability and robustness of the image segmentation model and the accuracy of image segmentation.
In an alternative embodiment, the image segmentation model to be trained may adopt the image segmentation model provided by any one of the above method embodiments. Trainable parameters in the image segmentation model include: the weight parameters of the image encoder, pixel feature extractor, and classification-guided decoder, as well as the context information.
The present embodiment describes in detail a processing flow of the image segmentation model training method with reference to the structure of the image segmentation model. As shown in fig. 9, the detailed steps of the training of the image segmentation model are as follows:
step S901, obtaining a plurality of data sets and names of categories to be selected of the data sets, where the data sets include sample images and image segmentation labeling results of the sample images, and the image segmentation labeling results include position masks of the sample images and category labels corresponding to the position masks.
This step is similar to the implementation manner of step S701, and is not described herein again.
In this embodiment, the number and types of training samples can be enriched by performing data enhancement on sample images in the data set.
Since the method of the present embodiment uses a plurality of different data sets in training the image segmentation model, the different data sets typically have different features, such as resolution, style, ratio, color, brightness, etc. In this embodiment, different data enhancement strategies are used to perform data enhancement on sample images in different data sets, and training of an image segmentation model is performed based on the data set after data enhancement.
Specifically, the data enhancement policy corresponding to each data set may be configured in advance. And when each sample image is enhanced, determining a data set to which the sample image belongs, and selecting a data enhancement strategy corresponding to the data set to be used for enhancing the sample image.
Any data enhancement strategy can comprise at least one enhancement mode as follows: random scale dithering, random horizontal flipping, random cropping, and random color dithering.
Referring to fig. 10, it is assumed that images A and D in fig. 10 belong to the same data set, and images B and C belong to another data set; the resolution, brightness, and the like of the sample images in the two different data sets differ. In fig. 10, enhancement strategies A1, A2, ..., Ai respectively represent i different data enhancement strategies. Images A and D automatically adopt enhancement strategy A1, and images B and C automatically adopt enhancement strategy Ai for data enhancement. The enhanced images are input into the image encoder of the image segmentation model.
For example, when cropping the sample images in different data sets, a crop size of 512 × 512 is used for the ADE20K data set; a crop size of 512 × 1024 is used for the Cityscapes data set; a crop size of 640 × 640 is used for the COCO-Stuff-10k data set; and a crop size of 1280 × 1280 is used for the Mapillary Vistas data set.
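A minimal sketch of selecting a per-data-set augmentation policy is shown below; the policy contents are assumptions (only the crop sizes follow the examples above), and in a real segmentation pipeline the same geometric transforms would also have to be applied to the annotation masks.

```python
import torchvision.transforms as T

# Assumed per-data-set augmentation policies; parameters are illustrative only.
AUG_POLICIES = {
    "ade20k": T.Compose([
        T.RandomResizedCrop(512, scale=(0.5, 1.0)),   # random scale jitter + crop
        T.RandomHorizontalFlip(),                     # random horizontal flipping
        T.ColorJitter(brightness=0.2, contrast=0.2),  # random color jitter
    ]),
    "cityscapes": T.Compose([
        T.RandomResizedCrop((512, 1024), scale=(0.5, 1.0)),
        T.RandomHorizontalFlip(),
    ]),
}

def augment(sample_image, dataset_name):
    """Pick the augmentation policy of the data set the sample image belongs to."""
    policy = AUG_POLICIES[dataset_name]
    return policy(sample_image)
```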
And step S902, inputting the sample image into an image encoder of the image segmentation model, and inputting the name of the category to be selected of the data set where the sample image is positioned into a text encoder of the image segmentation model.
And step S903, mapping the category name into a text embedded vector by using the context information through a text encoder.
In this embodiment, learnable (trainable) context information (also referred to as a prompt template, prompt text) is introduced, and the context information includes word vectors corresponding to a plurality of category names. In the training initial stage of the image segmentation model, word vectors contained in the context information are initialized randomly. And optimizing the context information in iterative training, and fixing the context information after the training is finished.
In the step, a text encoder is used for mapping the category names of the input data sets to a uniform category expression space by using the current context information, so that text embedding vectors of the category names are obtained. The unified category expression space is not a category expression space existing in an actual scene, but is formed in the model training process, and can cover the embedding of the category names of all data sets used in the training process.
For example, the weight parameters of the text encoder may be initialized with the parameters of a pre-trained CLIP (Contrastive Language-Image Pre-training) model, and the weight parameters of the text encoder are fixed and not updated during training of the image segmentation model. The CLIP model is pre-trained on contrasting text-image pairs. In this embodiment, the text embedding vector space of the CLIP model is used as the unified category representation space, and categories with similar semantics have text embedding vectors (i.e., category representations) that are closer together. In addition, the text encoder may also use a CLIP-like pre-training model, which is not specifically limited here.
The specific implementation manner of the text encoder mapping the input category name to the text embedding vector by using the context information (the prompt template and the prompt text) is similar to the implementation manner of mapping the input text to the text embedding vector in the CLIP model, and is not described herein again.
In this step, the category names of the input data set are automatically mapped into a unified category representation space through the text encoder, and a text embedding vector is obtained for each category name, so that the text embedding vectors of the category names encode the semantic relations among different categories, and category names with similar semantics have closer representations. Therefore, regardless of which data set's category names are input, the text encoder can map them into the unified category representation space, so that multiple data sets with different category namespaces can be used to train the image segmentation model, and the trained model can be adapted to image segmentation scenes/tasks with different category systems and easily extended to more image segmentation scenes/tasks.
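The learnable context described in step S903 could be sketched as a set of trainable word vectors shared across category names and prepended to each name's token embeddings before they enter the frozen text encoder; the context length and dimension below are assumptions.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """Trainable context (prompt) word vectors prepended to each category name.

    The token dimension and context length are assumptions; the text encoder is
    any frozen CLIP-style encoder operating on token embeddings.
    """
    def __init__(self, n_ctx=8, dim=512):
        super().__init__()
        # Randomly initialized at the start of training, optimized by back-propagation.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, name_token_embeddings):
        # name_token_embeddings: (K, L, dim) word vectors of K category names
        K = name_token_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(K, -1, -1)          # share context across names
        return torch.cat([ctx, name_token_embeddings], dim=1)  # (K, n_ctx + L, dim)

context = LearnableContext()
prompted = context(torch.randn(3, 4, 512))   # 3 category names, 4 tokens each
print(prompted.shape)                        # torch.Size([3, 12, 512])
```

During training only the context parameter receives gradients; the text encoder weights stay fixed, which matches the description above.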
And step S904, encoding the sample image through an image encoder to obtain a first image characteristic of the sample image.
In this step, the sample image is down-sampled by an image encoder to extract image features of a lower dimension of the sample image.
Illustratively, the image encoder may use any backbone network model for extracting image features, such as ResNet, ResNet variants, other convolutional neural networks (CNNs), etc., and is not limited here.
As shown in fig. 4, the classification guide decoder includes a plurality of layers of classification guide decoding layers stacked, and the query vector for segmentation input by the first layer of classification guide decoding layer is the initial query vector for segmentation. The partitioning query vector output by the previous classification guidance decoding layer is used as the partitioning query vector input by the next classification guidance decoding layer.
The first image feature and the text embedded vector are used as the input of each layer of classification guiding decoding layer, the initial query vector for segmentation is input into a multi-head self-attention module of the first layer of classification guiding decoding layer in a classification guiding decoder, the steps S905-S906 are executed once by each layer of classification guiding decoding layer, and the steps S905-S906 are executed for multiple times in a circulating manner by the classification guiding decoder, so that the final aggregated query vector for segmentation is obtained.
Step S905, the classification guiding decoder performs a deduplication process on the initial plurality of query vectors for segmentation, and aggregates the image feature and text embedding vector with the query vectors for segmentation after the deduplication process, to obtain aggregated query vectors for segmentation.
In this step, a multi-head self-attention module in the decoder is guided by classification, and the query vector for segmentation input into the module is subjected to self-duplication removal to remove the duplicated query vector for segmentation.
Referring to fig. 4, the class guide decoder includes a Cross-Attention module per class guide decoding layer, and the Cross-Attention (Cross-Attention) module may include a Visual-query Cross-Attention (Visual-query Cross-Attention) module and a Text-query Cross-Attention (Text-query Cross-Attention) module. The vision-query cross attention module is used for aggregating the input image features into the input query vector for segmentation to obtain a new query vector for segmentation. The text-query cross attention module is used for aggregating the input text embedding vectors into the input query vectors for segmentation to obtain new query vectors for segmentation.
Optionally, as shown in fig. 4, the visual-query cross-attention module may be placed before the text-query cross-attention module in the classification-guided decoding layer. In this case, the first image feature is first aggregated into the query vector for segmentation, and then the text embedding vector is aggregated into the query vector for segmentation that has aggregated the first image feature, so as to obtain the aggregated query vector for segmentation.
Optionally, the text-query cross-attention module may instead be placed before the visual-query cross-attention module in the classification-guided decoding layer. In this case, the text embedding vector is first aggregated into the query vector for segmentation, and then the first image feature is aggregated into the query vector for segmentation that has aggregated the text embedding vector, so as to obtain the aggregated query vector for segmentation.
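Putting the attention modules of steps S905-S906 together, one classification-guided decoding layer might look like the following sketch; the dimensions, layer count, and module ordering (visual-query before text-query) are assumptions, and normalization layers are omitted for brevity.

```python
import torch
import torch.nn as nn

class ClassGuidedDecodingLayer(nn.Module):
    """One decoding layer, sketched under assumptions about dimensions and ordering.

    Self-attention de-duplicates the query vectors, the visual-query cross-attention
    aggregates the image features, and the text-query cross-attention aggregates the
    text embedding vectors (the two cross-attention modules can also be swapped).
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, image_feat, text_emb):
        # queries: (B, N, dim), image_feat: (B, HW, dim), text_emb: (B, K, dim)
        q, _ = self.self_attn(queries, queries, queries)          # de-duplication among queries
        q, _ = self.visual_cross_attn(q, image_feat, image_feat)  # aggregate first image feature
        q, _ = self.text_cross_attn(q, text_emb, text_emb)        # aggregate text embedding vectors
        return q + self.ffn(q)                                    # queries for the next layer

layers = nn.ModuleList([ClassGuidedDecodingLayer() for _ in range(6)])  # stacked layers (depth assumed)

queries = torch.randn(1, 100, 256)        # initial query vectors for segmentation
image_feat = torch.randn(1, 32 * 32, 256) # flattened first image feature
text_emb = torch.randn(1, 3, 256)         # text embedding vectors
for layer in layers:                      # image feature and text embeddings feed every layer
    queries = layer(queries, image_feat, text_emb)
```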
And step S906, aligning the aggregated query vector for division with the text embedded vector to obtain category prediction information corresponding to the aggregated query vector for division.
In this step, the category information corresponding to each query vector for segmentation can be determined by performing alignment processing on the aggregated query vector for segmentation and the text embedded vector output by the text encoder.
Specifically, by computing the dot product between each query vector for segmentation and each text embedding vector, the classification prediction probability that the query vector for segmentation corresponds to the category of each text embedding vector can be obtained, and the category information with the largest classification prediction probability is determined as the category information corresponding to that query vector for segmentation.
Step S907, the first image feature is transformed by the pixel feature extractor to obtain a second image feature of the sample image.
In this step, the first image feature is transformed by the pixel feature extractor to obtain a pixel-level second image feature of the sample image, and the pixel-level second image feature is used for regression prediction to determine the position mask of the image.
Step S908 determines a position mask corresponding to the aggregated query vector for segmentation based on the aggregated query vector for segmentation and the second image feature.
In this step, feature extraction is performed on the aggregated query vector for segmentation by a multilayer perceptron (MLP) to further extract features for region segmentation (regression task), and region feature information included in the query vector for segmentation is enhanced to obtain a processed query vector for segmentation. Further, a position mask corresponding to each segmentation query vector after aggregation is determined according to the processed segmentation query vector and the second image feature output by the pixel feature extractor.
Specifically, the dot product between the processed query vector for segmentation and the feature vector of each pixel in the pixel-level second image feature output by the pixel feature extractor may be computed to obtain the similarity probability between the pixel and each query vector for segmentation. According to the similarity probabilities and a set probability threshold, the mask values corresponding to pixels whose similarity probability is greater than or equal to the probability threshold are set to 1, and the mask values corresponding to pixels whose similarity probability is smaller than the probability threshold are set to 0, so as to obtain the position mask corresponding to each query vector for segmentation.
Step S909 determines the position mask of the sample image and the category information corresponding to the position mask, based on the category prediction information and the position mask corresponding to the aggregated query vector for segmentation.
After the category prediction information and the position mask corresponding to each query vector for segmentation are determined, the position mask corresponding to each query vector for segmentation determines one segmented region in the sample image, and the category prediction information corresponding to that query vector for segmentation is used as the category prediction information of the segmented region determined by its position mask, thereby obtaining the position mask of the sample image and the category information corresponding to the position mask.
Step S910, loss is calculated according to the image segmentation prediction result and the image segmentation labeling result of the sample image.
In this step, a contrastive classification loss (denoted $L_{cls}$) is calculated according to the category prediction information in the image segmentation prediction result and the category information corresponding to the position mask of the same sample image in the image segmentation labeling result; a binary mask loss, expressed as a binary focal loss (denoted $L_{focal}$), and a dice loss (denoted $L_{dice}$) are calculated according to the position mask of the sample image in the image segmentation prediction result and the position mask of the same sample image in the image segmentation labeling result.
According to the contrastive classification loss $L_{cls}$, the binary focal loss $L_{focal}$ and the dice loss $L_{dice}$, the final loss $L$ may be calculated as
$$L=\sum_{k=1}^{M}\frac{1}{N_k}\sum_{i=1}^{N_k}\Big(L_{cls}^{(i)}+\lambda_{focal}\,L_{focal}^{(i)}+\lambda_{dice}\,L_{dice}^{(i)}\Big),$$
where $M$ is the number of data sets, $N_k$ is the number of sample images in the $k$-th data set, and $\lambda_{focal}$ and $\lambda_{dice}$ are hyper-parameters, which may be set to 20.0 and 1.0, respectively.
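Under the assumption that each query vector has already been matched to an annotated region, the combined loss could be sketched as follows; the helper names and the use of torchvision's sigmoid_focal_loss are illustrative, not the patent's implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def dice_loss(mask_logits, gt_masks, eps=1.0):
    """Dice loss between predicted mask logits and binary ground-truth masks."""
    pred = mask_logits.sigmoid().flatten(1)        # (N, H*W)
    gt = gt_masks.flatten(1).float()
    inter = (pred * gt).sum(-1)
    return (1 - (2 * inter + eps) / (pred.sum(-1) + gt.sum(-1) + eps)).mean()

def segmentation_loss(cls_logits, gt_labels, mask_logits, gt_masks,
                      lambda_focal=20.0, lambda_dice=1.0):
    """Per-sample combined loss; matching of queries to annotations is assumed done."""
    loss_cls = F.cross_entropy(cls_logits, gt_labels)                # classification term
    loss_focal = sigmoid_focal_loss(mask_logits, gt_masks.float(),
                                    reduction="mean")                # binary focal loss
    loss_dice = dice_loss(mask_logits, gt_masks)                     # dice loss
    return loss_cls + lambda_focal * loss_focal + lambda_dice * loss_dice

loss = segmentation_loss(torch.randn(100, 3), torch.randint(0, 3, (100,)),
                         torch.randn(100, 128, 128), torch.randint(0, 2, (100, 128, 128)))
```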
In this embodiment, the loss is calculated according to the image segmentation prediction result and the image segmentation labeling result, which can be implemented by using the same loss function in a manner of calculating the loss when training the image segmentation model in the prior art, and is not described herein again.
Step S911, the context information and the weight parameters of the image encoder, the pixel feature extractor and the classification-guided decoder are updated according to the loss to obtain a trained image segmentation model.
After the loss is calculated, the weight parameters of the image encoder, the pixel feature extractor and the classification-guided decoder in the image segmentation model, as well as the context information, are updated by back-propagating the loss. After training is completed, these weight parameters and the context information are fixed.
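A minimal sketch of updating only the trainable components (with the text encoder kept frozen) is shown below; the placeholder modules, learning rate, and optimizer choice are assumptions.

```python
import itertools
import torch
import torch.nn as nn

# Placeholders standing in for the real components (assumed names, not the patent's code).
image_encoder = nn.Linear(8, 8)
pixel_feature_extractor = nn.Linear(8, 8)
class_guided_decoder = nn.Linear(8, 8)
context = nn.Parameter(torch.randn(8, 8))   # learnable context (prompt) vectors

# Only these parameters are updated; the text encoder weights stay frozen.
trainable = itertools.chain(
    image_encoder.parameters(),
    pixel_feature_extractor.parameters(),
    class_guided_decoder.parameters(),
    [context],
)
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.05)

loss = image_encoder(torch.randn(2, 8)).sum() + context.sum()  # stand-in scalar loss
loss.backward()                                                # back-propagate the loss
optimizer.step()
optimizer.zero_grad()
```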
By the image segmentation model trained by the method of the embodiment, the method flow provided by the image segmentation method embodiment can be realized, the input image is segmented, and the position mask of the input image and the category information corresponding to the position mask are determined.
In this embodiment, the candidate category names of the data set and the sample image are input into the image segmentation model together. The image segmentation model automatically maps the input category names into text embedding vectors in a unified category representation space, extracts the image features of the image, and performs image segmentation according to the image features and the text embedding vectors to obtain an image segmentation prediction result; the model parameters of the image segmentation model are then trained according to the image segmentation prediction result and the image segmentation labeling result. A unified category naming scheme and category representation space does not need to be manually established, so the image segmentation model can be trained across a plurality of different data sets, and the trained image segmentation model is applicable to various image segmentation scenes/tasks using different data sets (under different category systems), which improves the generalization capability and robustness of the image segmentation model and the accuracy of image segmentation.
Fig. 11 is a structural diagram of a remote sensing image segmentation apparatus according to an exemplary embodiment of the present application. The device provided by the embodiment is applied to executing an image segmentation method or a remote sensing image segmentation method. As shown in fig. 11, the image segmentation apparatus 110 includes: a data acquisition module 1101, a first image segmentation module 1102 and a result output module 1103.
The data obtaining module 1101 is configured to obtain an image to be segmented and a name of a category to be selected.
The first image segmentation module 1102 is configured to input the image and the category name into an image segmentation model, extract image features of the image through the image segmentation model, map the category name into a text embedding vector in a uniform category representation space, and determine a position mask of the image and category information corresponding to the position mask according to the image features and the text embedding vector.
The result output module 1103 is configured to output a position mask of the image and category information corresponding to the position mask, where the position mask indicates a divided region in the image, and the category information corresponding to the position mask indicates category information of the divided region in the image.
In an alternative embodiment, the image segmentation model includes an image encoder, a pixel feature extractor, and a text encoder. In implementing the extraction of image features of an image through an image segmentation model, and mapping a category name to a text embedding vector, the first image segmentation module 1102 is further configured to: encoding the image through an image encoder to obtain a first image characteristic of the image; transforming the first image characteristic through a pixel characteristic extractor to obtain a second image characteristic of a pixel level of the image; and mapping the category names into text embedding vectors in a unified category representation space by a text encoder by using the trained context information.
In an alternative embodiment, the image segmentation model further comprises a classification-guided decoder, which is embedded with an initial plurality of query vectors for segmentation. When determining the position mask of the image and the category information corresponding to the position mask according to the image features and the text embedding vector is implemented, the first image segmentation module 1102 is further configured to: carry out de-duplication processing on the initial plurality of query vectors for segmentation through the classification-guided decoder, and aggregate the image features and the text embedding vector with the query vectors for segmentation after the de-duplication processing to obtain aggregated query vectors for segmentation; align the aggregated query vectors for segmentation and the text embedding vector to obtain category prediction information corresponding to the aggregated query vectors for segmentation; determine a position mask corresponding to the aggregated query vectors for segmentation according to the aggregated query vectors for segmentation and the second image feature; and determine the position mask of the image and the category information corresponding to the position mask according to the category prediction information and the position mask corresponding to the aggregated query vectors for segmentation.
The apparatus provided in this embodiment may be specifically configured to execute the image segmentation method or the remote sensing image segmentation method provided in any of the above embodiments, and specific functions and technical effects that can be achieved are not described herein again.
Fig. 12 is a block diagram of an image segmentation model training apparatus according to an exemplary embodiment of the present application. The device provided by the embodiment is applied to the method for executing the training of the image segmentation model. As shown in fig. 12, the image segmentation model training apparatus 120 includes: a data set processing module 1201, a second image segmentation module 1202 and a model parameter training module 1203.
The data set processing module 1201 is configured to obtain a plurality of data sets and names of categories to be selected of the data sets, where the data sets include a sample image and an image segmentation and annotation result of the sample image, and the image segmentation and annotation result includes a position mask of the sample image and category information corresponding to the position mask.
The second image segmentation module 1202 is configured to input the sample image and a name of a category to be selected of a data set where the sample image is located into an image segmentation model to be trained, extract image features of the sample image through the image segmentation model, map the name of the category to be selected into a text embedding vector in a uniform category representation space, and determine an image segmentation prediction result according to the image features and the text embedding vector, where the image segmentation prediction result includes a prediction result of a position mask of the sample image and a prediction result of category information corresponding to the position mask.
The model parameter training module 1203 is configured to calculate a loss according to the image segmentation prediction result and the image segmentation labeling result of the sample image, and train a model parameter of the image segmentation model to obtain a trained image segmentation model. The trained image segmentation model is used for carrying out image segmentation on an input image and determining a position mask of the input image and category information corresponding to the position mask.
In an optional embodiment, the image segmentation model includes an image encoder, a pixel feature extractor, and a text encoder, where the text encoder includes context information to be trained, and when the image feature of the sample image is extracted through the image segmentation model and the name of the category to be selected is mapped to a text embedding vector, the second image segmentation module 1202 is further configured to: coding the sample image through an image coder to obtain a first image characteristic of the sample image; and transforming the first image characteristic through a pixel characteristic extractor to obtain a second image characteristic of the sample image at the pixel level. The category names are mapped to text embedding vectors in a unified category representation space using context information by a text encoder.
In an alternative embodiment, the image segmentation model further comprises a classification-guided decoder, which is embedded with an initial plurality of query vectors for segmentation. In implementing the determining of the image segmentation prediction result according to the image feature and the text embedding vector, the second image segmentation module 1202 is further configured to: carrying out duplicate removal processing on a plurality of initial query vectors for segmentation through a classified guide decoder, and aggregating the first image feature and text embedded vector with the query vectors for segmentation after the duplicate removal processing to obtain aggregated query vectors for segmentation; aligning the aggregated query vector for segmentation and the text embedded vector to obtain category prediction information corresponding to the aggregated query vector for segmentation; determining a position mask corresponding to the aggregated query vector for segmentation according to the aggregated query vector for segmentation and the second image characteristics; and determining the position mask of the sample image and the category information corresponding to the position mask according to the category prediction information and the position mask corresponding to the aggregated query vector for segmentation.
In an alternative embodiment, when implementing the model parameters for training the image segmentation model, the model parameter training module 1203 is further configured to: according to the loss, context information is updated, and weight parameters of an image encoder, a pixel feature extractor and a classification guide decoder.
In an alternative embodiment, after obtaining the plurality of data sets is implemented, the data set processing module 1201 is further configured to:
and carrying out data enhancement on sample images in different data sets by using different data enhancement strategies, and carrying out training on an image segmentation model based on the data set subjected to data enhancement.
The apparatus provided in this embodiment may be specifically configured to execute the image segmentation model training method provided in any of the above embodiments, and specific functions and technical effects that can be achieved are not described herein again.
Fig. 13 is a schematic structural diagram of a cloud server according to an exemplary embodiment of the present application. The cloud server is used for operating the method provided by any method embodiment. As shown in fig. 13, the cloud server includes: memory 134, and processor 135.
A memory 134, communicatively coupled to the processor 135, for storing computer programs/computer-executable instructions, and may be configured to store various other data to support operations on the cloud server. The memory 134 may be an Object Storage Service (OSS).
The memory 134 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The processor 135 is coupled to the memory 134, and is configured to execute the computer program/computer execution instruction stored in the memory 134, so as to implement the method provided by any of the above method embodiments, and specific functions and technical effects that can be achieved are not described herein again.
Further, as shown in fig. 13, the cloud server further includes: firewall 131, load balancer 132, communications component 136, power component 138, and other components. Only some of the components are schematically shown in fig. 13, and the cloud server is not meant to include only the components shown in fig. 13.
The communication component of fig. 13 described above is configured to facilitate communication between the device in which the communication component is located and other devices in a wired or wireless manner. The device where the communication component is located can access a wireless network based on a communication standard, such as a WiFi, a 2G, 3G, 4G/LTE, 5G and other mobile communication networks, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
The power supply assembly of fig. 13 provides power to the various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when the computer-executable instructions are executed by a processor, the computer-executable instructions are used to implement the solutions provided in any of the above method embodiments, and specific functions and technical effects that can be achieved are not described herein again.
An embodiment of the present application further provides a computer program product, where the computer program product includes: the computer program is stored in a readable storage medium, at least one processor of the electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program to enable the electronic device to execute the scheme provided by any one of the above method embodiments, and specific functions and achievable technical effects are not described herein again.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant region, and are provided with corresponding operation entries for the user to select authorization or denial.
In some of the flows described in the above embodiments and in the drawings, a number of operations are included which occur in a particular order, but it should be clearly understood that these operations may be performed out of order or in parallel as they occur herein, merely to distinguish between the various operations, and the sequence number itself does not represent any order of execution. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different. The meaning of "a plurality" is two or more unless specifically limited otherwise.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. An image segmentation method, comprising:
acquiring an image to be segmented and a category name to be selected;
inputting the image and the category name into an image segmentation model, extracting first image features of the image through the image segmentation model, converting the first image features into second image features at a pixel level, and mapping the category name into a text embedding vector in a uniform category representation space;
carrying out duplicate removal processing on a plurality of initial query vectors for segmentation, and aggregating the first image feature and the text embedding vector with the query vectors for segmentation after the duplicate removal processing to obtain aggregated query vectors for segmentation;
aligning the aggregated query vector for segmentation with the text embedded vector to obtain category prediction information corresponding to the aggregated query vector for segmentation; determining a position mask corresponding to the aggregated query vector for segmentation according to the aggregated query vector for segmentation and the second image characteristics;
determining a position mask of the image and category information corresponding to the position mask according to category prediction information and the position mask corresponding to the aggregated query vector for segmentation;
outputting a position mask of the image and category information corresponding to the position mask, wherein the position mask indicates a segmented region in the image, and the category information corresponding to the position mask indicates the category information of the segmented region in the image.
2. The method of claim 1, wherein the image segmentation model comprises an image encoder, a pixel feature extractor, and a text encoder,
the extracting, by the image segmentation model, a first image feature of the image, transforming the first image feature into a second image feature at a pixel level, and mapping the category name to a text-embedded vector includes:
encoding the image through the image encoder to obtain a first image characteristic of the image; transforming the first image characteristic through the pixel characteristic extractor to obtain a second image characteristic of the image at a pixel level;
mapping, by the text encoder, the category name to a text embedding vector in a unified category representation space using the trained context information.
3. The method according to claim 1 or 2, wherein the outputting the position mask of the image and the category information corresponding to the position mask comprises:
generating segmentation result information of the image according to the position mask of the image and the category information corresponding to the position mask, wherein the segmentation result information comprises the position information and the corresponding category information of a segmentation area in the image;
and outputting the segmentation result information of the image.
4. The method according to claim 3, wherein after outputting the segmentation result information of the image, the method further comprises:
and updating the position information and/or the category information of the segmentation areas in the segmentation result information of the image in response to the correction operation on the position information and/or the category information of the segmentation areas in the segmentation result information of the image.
5. The method according to claim 3, wherein the outputting segmentation result information of the image comprises:
the image is a remote sensing image, and the remote sensing image is displayed;
and marking the position of the segmentation region and the information of the ground object category corresponding to the segmentation region in the displayed remote sensing image according to the segmentation result information.
6. The method according to claim 3, wherein the outputting segmentation result information of the image comprises:
displaying the image, wherein the image is an image containing at least one target object;
and marking the position of a segmentation region where the target object is located and the category information of the target object in the displayed image according to the segmentation result information.
7. An image segmentation model training method is characterized by comprising the following steps:
acquiring a plurality of data sets and names of categories to be selected of the data sets, wherein the data sets comprise sample images and image segmentation and annotation results of the sample images, and the image segmentation and annotation results comprise position masks of the sample images and category information corresponding to the position masks;
inputting the sample image and the name of the category to be selected of the data set where the sample image is located into an image segmentation model to be trained, extracting a first image feature of the sample image through the image segmentation model, transforming the first image feature into a second image feature at a pixel level, and mapping the name of the category to be selected into a text embedding vector in a uniform category representation space;
carrying out duplicate removal processing on a plurality of initial query vectors for segmentation, and aggregating the first image feature and the text embedding vector with the query vectors for segmentation after the duplicate removal processing to obtain aggregated query vectors for segmentation;
aligning the aggregated query vector for segmentation with the text embedded vector to obtain category prediction information corresponding to the aggregated query vector for segmentation; determining a position mask corresponding to the aggregated query vector for segmentation according to the aggregated query vector for segmentation and the second image characteristics;
determining an image segmentation prediction result according to the category prediction information and the position mask corresponding to the aggregated query vector for segmentation, wherein the image segmentation prediction result comprises a prediction result of the position mask of the sample image and a prediction result of the category information corresponding to the position mask;
calculating loss according to the image segmentation prediction result and the image segmentation labeling result of the sample image, and training model parameters of the image segmentation model to obtain a trained image segmentation model;
the trained image segmentation model is used for carrying out image segmentation on an input image and determining a position mask of the input image and category information corresponding to the position mask.
8. The method of claim 7, wherein the image segmentation model comprises an image encoder, a pixel feature extractor, and a text encoder, the text encoder comprising context information to be trained,
the extracting, by the image segmentation model, a first image feature of the sample image, transforming the first image feature into a second image feature at a pixel level, and mapping the name of the category to be selected as a text embedding vector includes:
encoding the sample image through the image encoder to obtain the first image feature of the sample image; transforming the first image feature through the pixel feature extractor to obtain the second image feature of the sample image at the pixel level;
mapping, by the text encoder, the name of the category to be selected into a text embedding vector in the unified category representation space using the context information.
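The "context information to be trained" in claim 8 is reminiscent of learnable prompt (context) tokens shared across all category names. The following toy, CoOp-style sketch shows one way such a text encoder could map candidate category names into a unified category representation space; the vocabulary size, two-layer transformer encoder, mean pooling and projection head are all assumptions made for the example and are not taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedTextEncoder(nn.Module):
    """Toy text encoder whose trainable context tokens are shared by all category names,
    mapping every candidate category into one unified category representation space."""

    def __init__(self, vocab_size=10000, dim=256, n_ctx=8, out_dim=256):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        # "Context information to be trained": learnable prompt embeddings.
        self.context = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, name_token_ids: torch.Tensor) -> torch.Tensor:
        # name_token_ids: (num_categories, name_len) integer token ids of the category names.
        name_emb = self.token_embed(name_token_ids)                        # (C, L, D)
        ctx = self.context.unsqueeze(0).expand(name_emb.size(0), -1, -1)   # (C, n_ctx, D)
        tokens = torch.cat([ctx, name_emb], dim=1)                         # prepend shared context
        encoded = self.encoder(tokens).mean(dim=1)                         # pool over tokens
        return F.normalize(self.proj(encoded), dim=-1)                     # unit-norm text embedding vectors
```

Because only the context tokens (and, optionally, the projection) need to be trained, category names from different data sets land in the same embedding space, which is what allows one model to be trained over multiple data sets with differing label vocabularies.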
9. A remote sensing image segmentation method, characterized by comprising:
acquiring a remote sensing image to be segmented and a name of a ground object category to be selected;
inputting the remote sensing image and the name of the ground object category to be selected into an image segmentation model, extracting a first image feature of the remote sensing image through the image segmentation model, transforming the first image feature into a second image feature at a pixel level, and mapping the name of the ground object category to be selected into a text embedding vector in a unified category representation space;
performing de-duplication processing on a plurality of initial query vectors for segmentation, and aggregating the first image feature and the text embedding vector with the de-duplicated query vectors for segmentation to obtain aggregated query vectors for segmentation;
aligning the aggregated query vector for segmentation with the text embedding vector to obtain category prediction information corresponding to the aggregated query vector for segmentation; determining a position mask corresponding to the aggregated query vector for segmentation according to the aggregated query vector for segmentation and the second image feature;
determining a position mask of the remote sensing image and ground object category information corresponding to the position mask according to category prediction information and the position mask corresponding to the aggregated query vector for segmentation;
and outputting the position mask of the remote sensing image and the ground object category information corresponding to the position mask, wherein the position mask indicates a segmentation region in the remote sensing image, and the ground object category information corresponding to the position mask indicates the ground object category of the segmentation region in the remote sensing image.
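As an illustration only, the inference path recited in claim 9, from aggregated query vectors to position masks and ground object category information, could be sketched as follows; the threshold value and all identifiers are assumptions of this example rather than the patented implementation:

```python
import torch

@torch.no_grad()
def predict_masks_and_categories(query_vecs,    # (Q, D) aggregated query vectors for segmentation
                                 text_embeds,   # (C, D) text embedding vectors of the candidate ground object categories
                                 pixel_feats,   # (D, H, W) pixel-level second image feature
                                 mask_threshold=0.5):
    # Align each query vector with the candidate categories to get its category prediction.
    class_probs = (query_vecs @ text_embeds.t()).softmax(dim=-1)      # (Q, C)
    category_ids = class_probs.argmax(dim=-1)                         # (Q,) index into the supplied category names
    # Correlate each query vector with the pixel-level features to get its position mask.
    mask_probs = torch.einsum('qd,dhw->qhw', query_vecs, pixel_feats).sigmoid()
    position_masks = mask_probs > mask_threshold                      # (Q, H, W) boolean position masks
    return position_masks, category_ids
```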
10. The method according to claim 9, wherein the outputting of the position mask of the remote sensing image and the ground object category information corresponding to the position mask comprises:
determining position information of the land parcel in the remote sensing image according to the position mask of the remote sensing image;
determining the ground object category information corresponding to the land parcel in the remote sensing image according to the ground object category information corresponding to the position mask;
and marking, in the remote sensing image, the position of the land parcel and the ground object category information corresponding to the land parcel according to the position information of the land parcel in the remote sensing image and the ground object category information corresponding to the land parcel.
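One plausible (but assumed) reading of the "position information of the land parcel" in claim 10 is a bounding box derived from each position mask. The sketch below pairs such boxes with their ground object category names; the dictionary keys and the bounding-box representation are illustrative choices only:

```python
import numpy as np

def parcels_from_masks(position_masks, category_names):
    """Turn per-parcel position masks (N, H, W) into bounding-box position information
    paired with the corresponding ground object category names."""
    parcels = []
    for mask, name in zip(position_masks, category_names):
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:          # empty mask: this parcel does not appear in the image
            continue
        bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))  # (x0, y0, x1, y1)
        parcels.append({"bbox": bbox, "ground_object_category": name})
    return parcels
```

The resulting list can then be drawn onto the remote sensing image with a routine like the mask-overlay sketch given after claim 6 above.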
11. A cloud server, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the method according to any one of claims 1-10.
12. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, are configured to implement the method of any one of claims 1-10.
CN202211528536.1A 2022-12-01 2022-12-01 Method, device and equipment for image segmentation and model training Active CN115631205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211528536.1A CN115631205B (en) 2022-12-01 2022-12-01 Method, device and equipment for image segmentation and model training

Publications (2)

Publication Number Publication Date
CN115631205A CN115631205A (en) 2023-01-20
CN115631205B (en) 2023-03-21

Family

ID=84910867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211528536.1A Active CN115631205B (en) 2022-12-01 2022-12-01 Method, device and equipment for image segmentation and model training

Country Status (1)

Country Link
CN (1) CN115631205B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761222B (en) * 2022-09-27 2023-11-03 阿里巴巴(中国)有限公司 Image segmentation method, remote sensing image segmentation method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018086299A1 (en) * 2016-11-11 2018-05-17 广东电网有限责任公司清远供电局 Image processing-based insulator defect detection method and system
CN109949316A (en) * 2019-03-01 2019-06-28 东南大学 A kind of Weakly supervised example dividing method of grid equipment image based on RGB-T fusion

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4181728B2 (en) * 2000-05-29 2008-11-19 株式会社パスコ How to embed digital watermark information in vector map data
US11048977B1 (en) * 2018-09-27 2021-06-29 Apple Inc. Method and device for pixel-level object segmentation
US11887310B2 (en) * 2020-05-15 2024-01-30 Apple Inc. Interactive image segmentation
US11455485B2 (en) * 2020-06-29 2022-09-27 Adobe Inc. Content prediction based on pixel-based vectors
CN111932529B (en) * 2020-09-10 2020-12-29 腾讯科技(深圳)有限公司 Image classification and segmentation method, device and system
EP3989114A1 (en) * 2020-10-26 2022-04-27 Société BIC System and method for recognizing online handwriting
US11615567B2 (en) * 2020-11-18 2023-03-28 Adobe Inc. Image segmentation using text embedding
CN114328906A (en) * 2021-09-23 2022-04-12 腾讯科技(深圳)有限公司 Multistage category determination method, model training method and related device
CN114565768A (en) * 2022-03-11 2022-05-31 北京达佳互联信息技术有限公司 Image segmentation method and device
CN114821223A (en) * 2022-03-30 2022-07-29 阿里巴巴(中国)有限公司 Pre-training image text model processing method and image-text retrieval system
CN114581456B (en) * 2022-05-09 2022-10-14 深圳市华汉伟业科技有限公司 Multi-image segmentation model construction method, image detection method and device
CN115147598B (en) * 2022-06-02 2023-07-14 粤港澳大湾区数字经济研究院(福田) Target detection segmentation method and device, intelligent terminal and storage medium

Also Published As

Publication number Publication date
CN115631205A (en) 2023-01-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant