CN116342884A - Image segmentation and model training method and server - Google Patents

Image segmentation and model training method and server

Info

Publication number
CN116342884A
Authority
CN
China
Prior art keywords
image
segmentation
feature
feature map
reversible
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310333563.1A
Other languages
Chinese (zh)
Other versions
CN116342884B (en)
Inventor
纪德益
陶明渊
叶杰平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Cloud Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Cloud Computing Ltd
Priority to CN202310333563.1A
Publication of CN116342884A
Application granted
Publication of CN116342884B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application provides a method and a server for image segmentation and model training. According to the method, for a first image to be processed, a first feature map containing spatial detail features is extracted through a lightweight semantic segmentation coding network; the first image is subjected to reversible downsampling to obtain a second image, the second image is input into a depth semantic segmentation coding network for feature extraction to obtain a second feature map, and the inverse processing of the reversible downsampling is applied to the second feature map to recover a third feature map of larger resolution, so that the image information lost in image downsampling can be reduced. The first feature map and the third feature map are fused and then input into a segmentation prediction network for image segmentation to obtain an image segmentation result. By fusing the feature maps of the two branches, a fused feature containing both spatial detail features and high-level semantic features is obtained, so that image segmentation based on the fused feature improves segmentation accuracy; moreover, only one forward inference is needed, which improves the speed and efficiency of image segmentation.

Description

Image segmentation and model training method and server
Technical Field
The present disclosure relates to computer technology, and in particular, to a method and a server for image segmentation and model training.
Background
Image segmentation is the process of dividing a digital image into mutually disjoint regions; its main purpose is to extract the part of interest from an image, and it is a key step in image analysis. Ultra-high resolution image segmentation is an important branch of image segmentation technology, with wide application in medical imaging, autonomous driving, remote sensing and aerial imagery.
Because the resolution of an ultra-high resolution image is high, with width and height usually of thousands or even tens of thousands of pixels, segmenting such an image directly consumes excessive computing resources. Current ultra-high resolution image segmentation methods generally adopt multi-stage processing: the ultra-high resolution image is cut into a plurality of image blocks of lower resolution, the image blocks are segmented separately, and the block-wise segmentation results are then stitched together into a complete image segmentation result. However, this approach requires running image segmentation on every image block, which consumes many computing resources and makes image segmentation inefficient.
Disclosure of Invention
The application provides a method and a server for image segmentation and model training, which are used to solve the problems of high computing-resource consumption and low efficiency in ultra-high resolution image segmentation.
In a first aspect, the present application provides an image segmentation method, including:
responding to an image segmentation request, and acquiring a first image to be processed;
extracting a first feature map of the first image through a lightweight semantic segmentation coding network, carrying out reversible downsampling on the first image to obtain a second image, inputting the second image into a depth semantic segmentation coding network to carry out feature extraction to obtain a second feature map, and carrying out inverse processing of the reversible downsampling on the second feature map to obtain a third feature map;
and inputting the fused first feature map and third feature map into a segmentation prediction network for prediction to obtain an image segmentation result, and outputting the image segmentation result.
In a second aspect, the present application provides a model training method for image segmentation, including:
wherein the image segmentation model to be trained comprises: a lightweight semantic segmentation coding network, a depth semantic segmentation coding network and a segmentation prediction network;
extracting a first feature map of a sample image through the lightweight semantic segmentation coding network, carrying out reversible downsampling on the sample image to obtain a second image, inputting the second image into the depth semantic segmentation coding network to carry out feature extraction to obtain a second feature map, and carrying out the inverse processing of the reversible downsampling on the second feature map to obtain a third feature map;
The first feature map and the third feature map are fused and then input into the segmentation prediction network for prediction, and a first image segmentation result is obtained;
and calculating a first loss according to the first image segmentation result and the image segmentation marking information of the sample image, updating parameters of the image segmentation model according to the first loss to obtain a trained image segmentation model, wherein the image segmentation model is used for encoding and predicting an input image to obtain an image segmentation result.
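By way of illustration only, the loss-and-update step of this aspect could look like the following sketch, assuming a per-pixel cross-entropy as the first loss; the model, optimizer and tensor names are placeholders rather than part of the claimed method:

```python
import torch.nn.functional as F

def training_step(model, optimizer, sample_image, annotation):
    """One parameter update of the image segmentation model (illustrative sketch).

    sample_image: (B, 3, H, W) float tensor
    annotation:   (B, H, W) long tensor of per-pixel class labels
    """
    # Forward pass: the model runs both encoding branches, fuses the first and
    # third feature maps and predicts per-pixel class logits.
    logits = model(sample_image)                       # (B, num_classes, h, w)
    # Bring the prediction to the resolution of the annotation before the loss.
    logits = F.interpolate(logits, size=annotation.shape[-2:],
                           mode="bilinear", align_corners=False)
    # First loss: a per-pixel cross-entropy is assumed here.
    loss = F.cross_entropy(logits, annotation)
    # Update the parameters of the image segmentation model according to the loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```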
In a third aspect, the present application provides an image segmentation method, including:
receiving an image segmentation request sent by terminal equipment, wherein the image segmentation request comprises a first image to be processed, and the first image is a remote sensing image, an aerial image or a medical image;
decomposing the first image into a plurality of frequency domain components with different resolutions, merging the frequency domain components with different resolutions, and inputting the merged frequency domain components into a lightweight semantic segmentation coding network for feature extraction to obtain a first feature map;
the first image is subjected to reversible downsampling to obtain a second image, the second image is input into a depth semantic segmentation coding network to perform feature extraction to obtain a second feature map, and the inverse processing of the reversible downsampling is performed on the second feature map to obtain a third feature map;
Inputting the fused first feature map and third feature map into a segmentation prediction network for prediction to obtain an image segmentation result;
and sending the image segmentation result to the end-side equipment.
In a fourth aspect, the present application provides a server comprising: a processor, and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes computer-executable instructions stored in the memory to implement the method of any of the above aspects.
According to the image segmentation and model training method and the server provided by the application, for a first image to be processed, a first feature map of the first image is extracted through a lightweight semantic segmentation coding network; the first image is subjected to reversible downsampling to obtain a second image, the second image is input into a depth semantic segmentation coding network for feature extraction to obtain a second feature map, and the inverse processing of the reversible downsampling is carried out on the second feature map to obtain a third feature map; the fused first feature map and third feature map are input into a segmentation prediction network for image segmentation to obtain an image segmentation result, and the image segmentation result is output. By inputting the full-size image into the lightweight semantic segmentation coding network, a first feature map containing spatial detail features can be extracted. By performing feature extraction on the downsampled image with the depth semantic segmentation coding network, a second feature map containing high-dimensional semantic features can be extracted; the second feature map is restored to a third feature map of larger resolution through the inverse processing of the downsampling, which effectively reduces the image information lost in the downsampling of the image, and the resolution of the third feature map is identical or close to that of the first feature map, so that the two can be fused directly. By fusing the feature maps of the two branches, a fused feature containing both spatial detail features and high-dimensional semantic features is obtained; performing image segmentation based on this fused feature improves the accuracy of the image segmentation result, and the image segmentation result can be obtained with a single forward inference, which greatly improves the speed and efficiency of image segmentation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a diagram of an exemplary image segmentation system architecture suitable for use herein;
FIG. 2 is a flowchart of an image segmentation method according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of an image segmentation model according to an exemplary embodiment of the present application;
FIG. 4 is a detailed architecture diagram of an example image segmentation model provided in an exemplary embodiment of the present application;
FIG. 5 is a detailed flow chart of an image segmentation method according to an exemplary embodiment of the present application;
FIG. 6 is a flow chart of a model training method for image segmentation according to an exemplary embodiment of the present application;
FIG. 7 is a diagram of a model training architecture for image segmentation provided in an exemplary embodiment of the present application;
fig. 8 is a flowchart of an image segmentation method according to another exemplary embodiment of the present application;
fig. 9 is a schematic structural diagram of an image segmentation apparatus according to an exemplary embodiment of the present application;
FIG. 10 is a schematic structural diagram of a model training device for image segmentation according to an exemplary embodiment of the present application;
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terms referred to in this application are explained first:
ultra-high resolution image: the image with the resolution reaching a certain resolution threshold can be a remote sensing image, an aerial image, a medical image and the like. In different application scenarios, the definition of the ultra-high resolution image may be different, which generally refers to an image with a resolution of 5000×5000, and in some application scenarios, an image with a resolution of 3000×2000 is also referred to as an ultra-high resolution image. For a scene requiring multi-stage processing (blocking) during image segmentation, the method provided by the application can realize single-stage image segmentation under the condition of not blocking, and the speed and the efficiency of image segmentation are obviously improved on the premise of ensuring the image segmentation accuracy.
Lightweight semantic segmentation encoding network: refers to an encoder for semantic segmentation whose number of encoding layers is less than or equal to a first preset number of layers. Because it contains fewer encoding layers, usually several to a few dozen, its inference speed is faster. For example, the encoders of lightweight real-time semantic segmentation networks such as the Short-Term Dense Concatenate network (STDC), the bilateral segmentation network (BiSeNet), and the image cascade network (ICNet) containing multiple resolution branches can be used as the lightweight semantic segmentation encoding network.
Depth semantic segmentation coding network: refers to an encoder for semantic segmentation whose number of coding layers is greater than or equal to a second preset number of layers; it contains a larger number of coding layers, typically several tens of layers or even more. Examples include the Pyramid Scene Parsing Network (PSPNet) and the DeepLab-series semantic segmentation networks. DeepLab is a series of deep-learning-based semantic segmentation networks, including DeepLab v1, DeepLab v2, DeepLab v3, DeepLab v3+ and other deep semantic segmentation networks.
Ultra-high resolution image segmentation is an important branch of image segmentation technology, with wide application in medical imaging, autonomous driving, remote sensing and aerial imagery. Because the resolution of an ultra-high resolution image is high, with width and height usually of thousands or even tens of thousands of pixels, segmenting such an image directly consumes excessive computing resources.
At present, segmentation methods for ultra-high resolution images generally adopt multi-stage processing: the ultra-high resolution image is cut into a plurality of image blocks of lower resolution, the image blocks are segmented separately, and the block-wise segmentation results are then stitched together into a complete image segmentation result. However, this approach requires running image segmentation on every image block, which consumes many computing resources and makes image segmentation inefficient.
The application provides an image segmentation method in which the coding network of the image segmentation model comprises two branches: one branch uses a lightweight semantic segmentation coding network with a small number of layers, and the other branch uses a depth semantic segmentation coding network with a large number of layers. When segmenting an image, a first feature map of the first image to be processed is extracted through the lightweight semantic segmentation coding network; the first image is subjected to reversible downsampling to obtain a second image, the second image is input into the depth semantic segmentation coding network for feature extraction to obtain a second feature map, and the inverse processing of the reversible downsampling is carried out on the second feature map to obtain a third feature map; the fused first feature map and third feature map are input into a segmentation prediction network for image segmentation to obtain an image segmentation result, and the image segmentation result is output. By inputting the full-size image into the lightweight semantic segmentation coding network, a first feature map containing spatial detail features can be extracted. By performing feature extraction on the downsampled image with the depth semantic segmentation coding network, a second feature map containing high-dimensional semantic features can be extracted; the second feature map is restored to a third feature map of larger resolution through the inverse processing of the downsampling, which effectively reduces the image information lost in the downsampling of the image, and the resolution of the third feature map is identical or close to that of the first feature map, so that the two can be fused directly. By fusing the feature maps of the two branches, a fused feature containing both spatial detail features and high-dimensional semantic features is obtained; performing image segmentation based on this fused feature improves the accuracy of the image segmentation result, and the image segmentation result can be obtained with a single forward inference, which greatly improves the speed and efficiency of image segmentation.
Fig. 1 is a diagram of an exemplary image segmentation system architecture applicable to the present application, and as shown in fig. 1, the system architecture may specifically include a server and an end-side device.
The server may be a local server or a server cluster set in the cloud. Communication links capable of being communicated are arranged between the server and each end side device, and communication connection between the server and each end side device can be achieved. The server stores an image segmentation model, the encoder portion of which includes two branches, one branch using a lightweight semantic segmentation encoding network with a smaller number of layers and the other branch using a depth semantic segmentation encoding network with a greater number of layers. The image segmentation model also includes a segmentation prediction network (i.e., decoder portion). The server may store a trained image segmentation model, which may be trained by the current server or by another server.
The terminal side device is a device used by a user, and specifically may be a hardware device with a network communication function, an operation function and an information display function, which includes, but is not limited to, a smart phone, a tablet computer, a desktop computer, an internet of things device, a platform or a server of a mechanism, and the like.
The terminal side device sends an image segmentation request to the server and provides a first image to be processed. The server performs image segmentation processing on the input first image based on the image segmentation model to obtain an image segmentation result.
The server responds to the image segmentation request and acquires the first image to be processed; a first feature map of the first image is extracted through a lightweight semantic segmentation coding network, the first image is subjected to reversible downsampling to obtain a second image, the second image is input into a depth semantic segmentation coding network for feature extraction to obtain a second feature map, and the inverse processing of the reversible downsampling is performed on the second feature map to obtain a third feature map; the fused first feature map and third feature map are input into a segmentation prediction network for prediction to obtain an image segmentation result. The server outputs the image segmentation result to the end-side device.
The image segmentation method provided by the application can be particularly applied to image segmentation of various image data such as remote sensing images, aerial images and medical images, and can be applied to various application fields such as remote sensing, electronic commerce, medical treatment and safety monitoring, and is not listed here.
For example, when applied to a ground feature segmentation scene for remote sensing images, the end-side device may be a server of the remote sensing system or a terminal device providing the remote sensing image. When ground feature segmentation of a remote sensing image is required, the end-side device sends an image segmentation request to the server, where the request may carry the remote sensing image to be processed. The server responds to the image segmentation request and acquires the remote sensing image to be processed; a first feature map of the remote sensing image is extracted through a lightweight semantic segmentation coding network, the remote sensing image is subjected to reversible downsampling to obtain a second image, the second image is input into a depth semantic segmentation coding network for feature extraction to obtain a second feature map, and the inverse processing of the reversible downsampling is performed on the second feature map to obtain a third feature map; the fused first feature map and third feature map are input into a segmentation prediction network for prediction to obtain a ground feature segmentation result of the remote sensing image, in which the regions where various ground features are located in the remote sensing image are segmented. The server outputs the ground feature segmentation result of the remote sensing image to the end-side device, and the end-side device outputs the result for the user to view and use. For example, the ground feature segmentation result of the remote sensing image is an image mask with the same resolution as the remote sensing image, where each pixel value in the image mask represents the ground feature class to which the corresponding pixel in the remote sensing image belongs. When the ground feature segmentation result of the remote sensing image is output, the regions covered by different ground feature classes can be marked on the remote sensing image according to the image mask.
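For instance, marking the class regions on the remote sensing image according to the image mask can be as simple as blending a per-class colour into the image; the palette below is an assumed example, not part of the method itself:

```python
import numpy as np

def mark_ground_features(remote_sensing_image, mask, palette, alpha=0.5):
    """Colour the regions covered by each ground feature class on the image.

    remote_sensing_image: (H, W, 3) uint8 RGB image
    mask:                 (H, W) integer image mask, one class index per pixel
    palette:              (num_classes, 3) uint8 colour per class (assumed palette)
    """
    class_colour = palette[mask]                                   # (H, W, 3)
    marked = (1 - alpha) * remote_sensing_image + alpha * class_colour
    return marked.astype(np.uint8)
```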
For example, when applied to an image search scenario in the field of electronic commerce, the end-side device may be a server of an image search system. Upon receiving an image search instruction from a user, a target object image input by the user is acquired; this image may be an ultra-high resolution image containing the target object. The end-side device takes the target object image as the image to be segmented and sends an image segmentation request containing the target object image to the server. The server responds to the image segmentation request and acquires the target object image; a first feature map of the target object image is extracted through a lightweight semantic segmentation coding network, the target object image is subjected to reversible downsampling to obtain a second image, the second image is input into a depth semantic segmentation coding network for feature extraction to obtain a second feature map, and the inverse processing of the reversible downsampling is performed on the second feature map to obtain a third feature map; the fused first feature map and third feature map are input into a segmentation prediction network for prediction to obtain an image segmentation result of the target object image, in which the region where the target object is located is segmented. The server outputs the image segmentation result of the target object image to the end-side device. The end-side device identifies and searches for the target object according to the region where the target object is located, obtains related information of the target object, and outputs the related information of the target object.
For example, when applied to an image segmentation scenario in the medical field, the end-side device may be a server of a medical system. When a medical image segmentation instruction from a user is received, a medical image input by the user is acquired; the medical image may be an ultra-high resolution image. The end-side device takes the medical image as the image to be segmented and sends an image segmentation request containing the medical image to the server. The server responds to the image segmentation request and acquires the medical image; a first feature map of the medical image is extracted through a lightweight semantic segmentation coding network, the medical image is subjected to reversible downsampling to obtain a second image, the second image is input into a depth semantic segmentation coding network for feature extraction to obtain a second feature map, and the inverse processing of the reversible downsampling is performed on the second feature map to obtain a third feature map; the fused first feature map and third feature map are input into a segmentation prediction network for prediction to obtain an image segmentation result of the medical image, in which the regions where different tissue structures in the medical image are located are segmented. The server outputs the image segmentation result of the medical image to the end-side device. The end-side device then uses the regions where different tissue structures are located in the medical image for processing functions such as medical teaching and lesion identification.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of an image segmentation method according to an exemplary embodiment of the present application. The execution subject of the embodiment is a server in the architecture of the image segmentation system, and the method provided by the embodiment is used for realizing accurate and efficient image segmentation of the ultra-high resolution image.
Exemplary, fig. 3 is a schematic architecture diagram of an image segmentation model provided in an embodiment of the present application, and as shown in fig. 3, an encoding portion of the image segmentation model includes two branches: the first branch uses a lightweight semantic segmentation coding network to extract a first feature map containing rich detail features based on a full-size image; and the second branch uses a depth semantic segmentation coding network to perform reversible downsampling on the first image to obtain a second image with smaller resolution, the second image is input into the depth semantic segmentation coding network, a second feature map containing high-dimensional (advanced) semantic features is extracted, and then the second feature map is subjected to reversible downsampling inverse processing to recover a third feature map with larger resolution. The image segmentation model also includes a segmentation prediction network that is a decoding portion. And after the feature maps extracted by the two branches of the coding part are fused, the feature maps are input into a segmentation prediction network of the decoding part for prediction (namely decoding) to obtain an image segmentation result of the first image.
Illustratively, the lightweight semantic segmentation encoding network may be implemented using an encoder of any lightweight semantic segmentation network having a number of encoding layers less than or equal to a first preset number of layers. For example, encoders for lightweight semantic segmentation networks such as real-time semantic segmentation networks (STDC), bilateral segmentation networks (BiSeNet), image cascade networks (ICNet) containing multiple resolution branches, etc. can be used.
Illustratively, the depth semantic segmentation coding network may be implemented using the encoder of any depth semantic segmentation network having a number of coding layers greater than or equal to the second preset number of layers. For example, the encoder of any one of the following depth semantic segmentation networks may be used: the Pyramid Scene Parsing Network (PSPNet), or any semantic segmentation network of the DeepLab series. DeepLab comprises a series of semantic segmentation networks based on deep learning, specifically including DeepLab v1, DeepLab v2, DeepLab v3, DeepLab v3+ and other deep semantic segmentation networks.
The second preset layer number is greater than the first preset layer number, and the first preset layer number and the second preset layer number can be set and adjusted according to actual application scenes and experience values, which are not particularly limited herein.
Based on the image segmentation model architecture shown in fig. 3, as shown in fig. 2, the image segmentation method specifically comprises the following steps:
step S201, in response to the image segmentation request, a first image to be processed is acquired.
In this embodiment, the first image to be processed may be an ultra-high resolution image or a high resolution image. In different application scenarios, there may be images of various different sources. For example, the image may be a remote sensing image, an aerial image, a medical image, or the like captured by a satellite.
The image segmentation request may be a request sent by the end-side device, where the request includes a first image to be processed, or includes storage address information of the first image to be processed. The server may extract the first image to be processed from the request, or extract storage address information of the first image to be processed from the request, and acquire the first image to be processed according to the storage address information of the first image.
In addition, the image segmentation request may be triggered by a user through an interactive interface provided by the end-side device, or automatically triggered by an application running on the end-side device when image segmentation is required, or triggered by other means, which is not specifically limited herein.
Step S202, extracting a first feature map of a first image through a lightweight semantic segmentation coding network.
In the step, the first branch of the image segmentation model coding part is used for extracting the space detail characteristic of the first image with ultrahigh resolution or high resolution by using a lightweight semantic segmentation coding network to obtain a first characteristic diagram, so that the reasoning speed and the reasoning efficiency can be improved.
And step S203, performing reversible downsampling on the first image to obtain a second image, inputting the second image into a depth semantic segmentation coding network to perform feature extraction to obtain a second feature map, and performing reversible downsampling on the second feature map to obtain a third feature map.
In the step, a second branch of the image segmentation model coding part is used for carrying out reversible downsampling on a first image to obtain a second image with smaller resolution; and extracting high-level semantic features of the second image with smaller resolution by using the depth semantic segmentation coding network to obtain a second feature map.
In this embodiment, reversible downsampling is performed in the second branch of the coding part of the image segmentation model, and the inverse processing of the reversible downsampling is applied to the smaller-resolution second feature map extracted by the depth semantic segmentation coding network, so that the second feature map is restored to a third feature map of larger resolution. This reduces the image information lost in the image downsampling process and improves the expression capability of the third feature map, so that the image segmentation accuracy is improved while the inference speed is improved.
The above-described step S202 and step S203 are implemented using two parallel branch networks of the image segmentation model, which are performed in parallel.
And S204, inputting the fused first feature map and the fused third feature map into a segmentation prediction network for prediction to obtain an image segmentation result, and outputting the image segmentation result.
In this step, when the first feature map and the third feature map are fused, the two maps may be brought to a uniform resolution by upsampling and then fused by a simple parameter-free operation such as concatenation (splicing) or summation, so as to obtain the fused feature map of the first feature map and the third feature map. The fused feature map is input into the segmentation prediction network for prediction, and the image segmentation result of the first image is obtained.
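A minimal sketch of this fusion and prediction step, assuming bilinear upsampling and channel-wise concatenation (the prediction_network argument is a placeholder for the segmentation prediction network):

```python
import torch
import torch.nn.functional as F

def fuse_and_predict(first_feature_map, third_feature_map, prediction_network):
    """Fuse the two branch feature maps and run the segmentation prediction network.

    first_feature_map: (B, C1, H, W) spatial-detail features from the lightweight branch
    third_feature_map: (B, C3, h, w) high-level semantic features from the deep branch
    """
    # Upsample the third feature map to the resolution of the first feature map.
    third = F.interpolate(third_feature_map, size=first_feature_map.shape[-2:],
                          mode="bilinear", align_corners=False)
    # Parameter-free fusion: concatenation along channels (summation would also
    # work when the channel counts match).
    fused = torch.cat([first_feature_map, third], dim=1)
    return prediction_network(fused)   # per-pixel class logits
```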
For example, the image segmentation result of the first image may be an image mask having the same resolution as the first image, the pixel values in the image mask representing the class to which the corresponding pixels in the first image belong. Different classes correspond to different segmented regions in the first image.
For example, the feature segmentation result of the remote sensing image is an image mask with the same resolution as the remote sensing image, and the pixel value in the image mask represents the feature class to which the corresponding pixel in the remote sensing image belongs. When the ground feature segmentation result of the remote sensing image is output, different areas covered by the ground feature categories can be marked on the remote sensing image according to the image mask.
In this embodiment, two branches of the image segmentation model coding part are used, one branch uses a lightweight network to extract the spatial detail features of the image with larger resolution, the other branch uses a depth network to extract the advanced semantic features of the image with smaller resolution after downsampling, and the first feature map and the third feature map extracted by the two branches are fused to obtain a fused feature map, wherein the fused feature map simultaneously contains the spatial detail features and the advanced semantic features and has better expression capability. Further, the image segmentation result is obtained by predicting through the segmentation prediction network according to the fusion feature map, and the image segmentation result can be obtained by performing one-time reasoning in a single-stage mode, so that the reasoning speed and efficiency of the image segmentation model can be remarkably improved, and meanwhile, the accuracy of image segmentation is improved.
In an alternative embodiment, since the lightweight semantic segmentation encoding network used in the first branch of the image segmentation model encoding section contains fewer layers, there is no need to downsample or crop the input ultra-high resolution image, so the spatial detail features of the full-size image can be obtained while maintaining a high inference speed. In this embodiment, in the first branch, the original input first image is replaced by high-frequency residuals, which are input into the lightweight semantic segmentation encoding network to enhance the spatial detail features of the input image, so that the extracted first feature map contains richer spatial detail features.
In the step S202, the first feature map of the first image is extracted through the lightweight semantic segmentation encoding network, which may be specifically implemented as follows:
carrying out Laplacian pyramid decomposition on the first image to obtain a plurality of frequency domain components (namely high-frequency residual errors) with different resolutions; and after the frequency domain components with different resolutions are fused, inputting the frequency domain components into a lightweight semantic segmentation coding network for feature extraction, and obtaining a first feature map.
Specifically, when the first image is subjected to Laplacian pyramid decomposition to obtain a plurality of frequency domain components (namely high-frequency residuals) with different resolutions, a Gaussian blur pyramid of the first image I is first generated, giving a set of Gaussian blur images g_i(I) of successively lower resolutions, where n is the number of frequency domain components of different resolutions and n is a positive number greater than 1. Then, the high-frequency residuals between the Gaussian blur images of adjacent layers in the Gaussian pyramid are calculated, and the resulting high-frequency residuals of different sizes form the Laplacian pyramid. Illustratively, the high-frequency residuals may be calculated according to the following formula (1), resulting in a plurality of frequency domain components H_i of different resolutions:

H_i = g_i(I) - U(g_{i+1}(I))    Formula (1)

where H_i represents the i-th frequency domain component, I represents the first image, g_i(I) represents the i-th layer Gaussian blur image of the first image I, g_{i+1}(I) represents the (i+1)-th layer Gaussian blur image of the first image I, obtained by Gaussian blurring and downsampling g_i(I), and U(·) represents an upsampling operation.
Illustratively, when the Gaussian pyramid is obtained, the original high-resolution image is subjected to Gaussian blur and downsampled into an original 1/2-size image, a Gaussian blur image is obtained, multi-layer Gaussian blur and downsampling are iteratively performed, and the Gaussian blur images with different sizes are obtained to form the Gaussian pyramid.
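A minimal sketch of this decomposition (with average pooling standing in for Gaussian blurring plus downsampling, purely for brevity; the function names are illustrative, not part of the method):

```python
import torch.nn.functional as F

def blur_and_downsample(image):
    # Stand-in for the Gaussian blur + 1/2 downsampling of one pyramid step
    # (average pooling is used here only to keep the sketch short).
    return F.avg_pool2d(image, kernel_size=2, stride=2)

def laplacian_pyramid(first_image, n=2):
    """Decompose the first image I into n high-frequency residuals H_i
    following formula (1): H_i = g_i(I) - U(g_{i+1}(I))."""
    gaussian = [first_image]                       # g_1(I): the original image
    for _ in range(n):
        gaussian.append(blur_and_downsample(gaussian[-1]))
    residuals = []
    for i in range(n):
        upsampled = F.interpolate(gaussian[i + 1], size=gaussian[i].shape[-2:],
                                  mode="bilinear", align_corners=False)
        residuals.append(gaussian[i] - upsampled)  # H_i at the resolution of g_i(I)
    return residuals                               # frequency domain components
```

The returned residuals are the frequency domain components that are fused and fed into the lightweight semantic segmentation coding network.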
Optionally, after the first image is decomposed by using a Laplacian pyramid to obtain a plurality of frequency domain components (i.e., high-frequency residuals) with different resolutions, the plurality of frequency domain components with different resolutions may be spliced or summed to obtain a fusion result; the fusion result is input into the lightweight semantic segmentation coding network for feature extraction to obtain a plurality of first feature maps with different resolutions. The plurality of first feature maps with different resolutions may include: the feature map output by the last layer of the lightweight semantic segmentation coding network, and the feature map output by at least one intermediate layer.
Further, when the first feature maps are fused with the third feature map in step S204, the first feature maps with different resolutions are first fused among themselves: the lower-resolution first feature maps are upsampled to a uniform resolution and then spliced or summed, yielding a fused feature map of the first feature maps; this fused feature map and the third feature map are then upsampled to a uniform resolution and spliced or summed, yielding the fusion result of the first feature maps and the third feature map.
Optionally, when the first feature map is fused with the third feature map in step S204, the first feature maps with different resolutions and the third feature map may be fused by upsampling to a uniform resolution, and splicing or summing to obtain a fusion result of the first feature map and the third feature map.
In step S204, the feature graphs extracted by the two branches can be fused through simple up-sampling, splicing or summing, so that the fusion process is simple and the efficiency is high.
In this embodiment, in the first branch of the coding portion of the image segmentation model, the original input first image is replaced by high-frequency residuals, which are input into the lightweight semantic segmentation coding network; this enhances the spatial detail features of the input image so that the extracted first feature map contains richer spatial detail features, which improves the image segmentation accuracy while also improving image segmentation efficiency.
In an alternative embodiment, in step S203, the reversible downsampling of the first image to obtain the second image may be implemented in the following manner:
performing at least one-stage reversible waveform transformation on the first image to obtain a plurality of sub-band images; and fusing the plurality of sub-band images to obtain a second image.
Optionally, performing one-time reversible waveform transformation on the first image to obtain a plurality of first-level subband images; and fusing the plurality of primary subband images to obtain a second image.
Optionally, performing two-stage reversible waveform transformation on the first image to obtain a plurality of two-stage subband images; and fusing the plurality of secondary sub-band images to obtain a second image.
Optionally, performing three-level reversible waveform transformation on the first image to obtain a plurality of three-level subband images; and fusing the plurality of three-level sub-band images to obtain a second image.
Further, in the process of performing multi-level reversible waveform transformation on the first image, a first convolution operation can be added between any two adjacent reversible waveform transformations, the characteristic extraction is performed on the sub-band image obtained by the previous-level reversible waveform transformation through the first convolution operation, the sub-band image after the first convolution operation can be understood as a characteristic image containing characteristic information, and the next reversible waveform transformation is performed on the sub-band image after the first convolution operation.
Illustratively, taking two-stage reversible waveform transformation as an example, performing reversible waveform transformation on the first image to obtain a plurality of one-stage subband images; respectively carrying out first convolution operation on the first-level sub-band images, and then carrying out reversible waveform transformation again to obtain a plurality of second-level sub-band images; and fusing the plurality of secondary sub-band images to obtain a second image.
Illustratively, the first convolution operation may include at least one fully-connected layer.
In addition, when the plurality of sub-band images are fused to obtain the second image, a third convolution operation can be performed on the plurality of sub-band images according to the number of channels of the depth semantic segmentation coding network to fuse the plurality of sub-band images, so that the number of channels of the second image obtained after fusion is consistent with the number of channels of the depth semantic segmentation coding network. The third convolution operation is used for reducing the channel number of the plurality of sub-band images, and can be implemented by using a convolution layer with the output channel number consistent with the channel number input by the depth semantic segmentation coding network.
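A minimal sketch of this reversible downsampling, assuming a Haar discrete wavelet transform implemented directly with tensor slicing; the two convolution modules stand in for the first and third convolution operations described above, and all channel counts are assumptions for a 3-channel input:

```python
import torch
import torch.nn as nn

def haar_dwt(x):
    """One level of a 2D Haar discrete wavelet transform: splits (B, C, H, W)
    into four half-resolution subbands (LL, LH, HL, HH)."""
    a, b = x[..., 0::2, :], x[..., 1::2, :]      # even / odd rows
    aa, ab = a[..., 0::2], a[..., 1::2]          # even / odd columns
    ba, bb = b[..., 0::2], b[..., 1::2]
    ll = (aa + ab + ba + bb) / 2
    lh = (aa - ab + ba - bb) / 2
    hl = (aa + ab - ba - bb) / 2
    hh = (aa - ab - ba + bb) / 2
    return [ll, lh, hl, hh]

class ReversibleDownsample(nn.Module):
    """Two-level DWT downsampling of a 3-channel first image into a second image
    whose resolution is 1/4 of the input; deep_in_channels is an assumed value
    matching the input of the depth semantic segmentation coding network."""
    def __init__(self, deep_in_channels=3):
        super().__init__()
        self.first_conv = nn.Conv2d(12, 12, kernel_size=3, padding=1)     # first convolution operation
        self.third_conv = nn.Conv2d(48, deep_in_channels, kernel_size=1)  # third convolution operation

    def forward(self, first_image):
        level1 = torch.cat(haar_dwt(first_image), dim=1)   # 4 primary subbands -> 12 channels
        level1 = self.first_conv(level1)                    # feature extraction between levels
        level2 = torch.cat(haar_dwt(level1), dim=1)         # 16 secondary subbands -> 48 channels
        return self.third_conv(level2)                       # fused second image
```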
Further, in the step S203, the inverse process of the reversible downsampling is performed on the second feature map to obtain a third feature map, which may be specifically implemented as follows:
performing a second convolution operation on the second feature map, and splitting the second feature map into a plurality of sub-band feature maps; and carrying out inverse transformation of at least one level of reversible waveform transformation on the plurality of sub-band feature diagrams split by the second feature diagram to obtain a third feature diagram.
Wherein the second convolution operation is configured to increase the number of channels of the second feature map to split the second feature map into a plurality of subband feature maps. The second convolution operation may be implemented using a convolution layer having a number of output channels equal to the sum of the number of channels of the split plurality of sub-band feature maps.
In the inverse processing of the reversible downsampling, the number of levels of the inverse transformation of the reversible waveform transformation is the same as the number of levels of the reversible waveform transformation performed during the reversible downsampling. The number of subband feature maps into which the second feature map needs to be split, and their channel counts, are determined according to the number of levels of the inverse transformation; the second convolution operation is used to adjust the number of channels of the second feature map so that the adjusted channel count is equal to the sum of the channel counts of the subband feature maps to be split out, and the second feature map is thereby split into the plurality of subband feature maps.
After the second feature map is split into the plurality of subband feature maps, at least one level of the inverse transformation of the reversible waveform transformation is applied to the subband feature maps split from the second feature map, and the third feature map with larger resolution is obtained.
Illustratively, in the case where two-level reversible waveform transformation is performed on the first image during the reversible downsampling, the inverse transformation of the two-level reversible waveform transformation is performed on the plurality of subband feature maps split from the second feature map during the inverse processing, so as to obtain the third feature map.
Illustratively, in the case where one-level reversible waveform transformation is performed on the first image during the reversible downsampling, the inverse transformation of the one-level reversible waveform transformation is performed on the plurality of subband feature maps split from the second feature map during the inverse processing, so as to obtain the third feature map.
Illustratively, in the case where three-level reversible waveform transformation is performed on the first image during the reversible downsampling, the inverse transformation of the three-level reversible waveform transformation is performed on the plurality of subband feature maps split from the second feature map during the inverse processing, so as to obtain the third feature map.
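A corresponding sketch of the inverse processing under the same Haar-transform assumption, for the two-level case; the 1×1 convolution stands in for the second convolution operation that expands the channel count before the split, and the channel counts are assumptions chosen only to keep the sketch self-consistent:

```python
import torch
import torch.nn as nn

def haar_iwt(ll, lh, hl, hh):
    """Inverse of the Haar DWT above: reassembles four subbands into one map
    of twice the height and width (exact inverse of haar_dwt)."""
    aa = (ll + lh + hl + hh) / 2
    ab = (ll - lh + hl - hh) / 2
    ba = (ll + lh - hl - hh) / 2
    bb = (ll - lh - hl + hh) / 2
    batch, channels, height, width = ll.shape
    out = ll.new_zeros(batch, channels, 2 * height, 2 * width)
    out[..., 0::2, 0::2] = aa
    out[..., 0::2, 1::2] = ab
    out[..., 1::2, 0::2] = ba
    out[..., 1::2, 1::2] = bb
    return out

class ReversibleRestore(nn.Module):
    """Restore the second feature map to a higher-resolution third feature map:
    a channel-expanding second convolution operation, a split into 16 subband
    feature maps, and two levels of inverse transforms."""
    def __init__(self, in_channels, subband_channels):
        super().__init__()
        # second convolution operation: expand to 16 x subband_channels so the
        # result can be split into 16 subband feature maps
        self.second_conv = nn.Conv2d(in_channels, 16 * subband_channels, kernel_size=1)

    def forward(self, second_feature_map):
        x = self.second_conv(second_feature_map)
        subbands = list(torch.chunk(x, 16, dim=1))           # 16 subband feature maps
        # first inverse level: every 4 subbands are merged into one map
        level1 = [haar_iwt(*subbands[4 * i:4 * i + 4]) for i in range(4)]
        # second inverse level: the 4 maps are merged into the third feature map
        return haar_iwt(*level1)                              # resolution x4 vs. input
```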
In an alternative embodiment, the reversible waveform transform may be a discrete wavelet transform (Discrete Wavelet Transform, DWT for short). Obtaining a plurality of sub-band images by performing at least one level of discrete wavelet transform on the first image; and fusing the plurality of sub-band images to obtain a second image. Further, when the inverse processing of the reversible waveform transformation is performed, at least one level of inverse discrete wavelet transform (Invert Discrete Wavelet Transform, abbreviated as IWT or IDWT) is performed on the plurality of subband feature maps split from the second feature map, to obtain a third feature map.
In another alternative embodiment, the reversible waveform transform may be a contourlet transform. A plurality of subband images are obtained by performing at least one level of contourlet transformation on the first image, and the plurality of subband images are fused to obtain the second image. Further, when the inverse processing of the reversible waveform transformation is performed, at least one level of inverse contourlet transformation is performed on the plurality of subband feature maps split from the second feature map, so as to obtain the third feature map.
Illustratively, taking the process of reversible downsampling as an example, the specific implementation process of the step S203 is as follows: performing two-stage discrete wavelet transformation on the first image to obtain 8 secondary subband images; performing convolution operation on the 8 secondary sub-band images, and fusing the 8 secondary sub-band images into a second image; inputting the second image into a depth semantic segmentation coding network for feature extraction to obtain a second feature map; splitting the second characteristic diagram into 8 sub-band characteristic diagrams, and carrying out two-stage discrete wavelet inverse transformation on the 8 sub-band characteristic diagrams to obtain 1 third characteristic diagram. The resolution of the third feature map is greater than the resolution of the second feature map, more closely approximating the resolution of the original first image.
In the embodiment, the resolution of the input image of the depth semantic segmentation coding network is reduced by carrying out reversible downsampling on the first image, so that the reasoning speed and efficiency of the depth semantic segmentation coding network are effectively improved; and the second feature image extracted by the depth semantic segmentation coding network is subjected to reversible downsampling inverse processing, the second feature image is restored to a third feature image with larger resolution, partial image information lost due to downsampling can be restored, and the quality of the third feature image is improved, so that the accuracy and the efficiency of image segmentation can be improved.
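The "reversible" property can be checked directly with the Haar helpers from the sketches above: one forward transform followed by its inverse reproduces the input exactly, which is why this scheme loses less image information than ordinary downsampling:

```python
import torch

x = torch.randn(1, 3, 64, 64)                    # a toy "first image"
restored = haar_iwt(*haar_dwt(x))                # forward transform, then its inverse
print(torch.allclose(x, restored, atol=1e-6))    # True: no image information is lost
```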
For example, fig. 4 is a detailed architecture diagram of an example of an image segmentation model provided in an embodiment of the present application. As shown in fig. 4, for a first image to be processed, in the first branch of the encoding portion of the image segmentation model, the first image is first subjected to Laplacian pyramid decomposition to obtain a plurality of frequency domain components with different resolutions; fig. 4 takes two frequency domain components H_0 and H_1 of different resolutions as an example, where the resolution of H_0 is consistent with that of the first image and the resolution of H_1 is 1/4 of the resolution of the first image. The lower-resolution H_1 is upsampled and then spliced with H_0 (the spliced result has the same resolution as the first image), and the result is input into the lightweight semantic segmentation coding network for feature extraction; the feature maps output by the last layer and an intermediate layer of the lightweight semantic segmentation coding network are taken respectively, giving two first feature maps whose resolutions are 1/16 and 1/8 of the resolution of the first image.
In the second branch of the coding part of the image segmentation model, a Discrete Wavelet Transform (DWT) is applied to the first image to obtain 4 primary subband images, whose resolution is 1/2 of the resolution of the first image; a first convolution operation is applied to each primary subband image, which performs feature extraction so as to strengthen the feature information in the subband images. Then, the Discrete Wavelet Transform (DWT) is applied again to the primary subband images after the first convolution operation to obtain 16 secondary subband images, whose resolution is 1/4 of the resolution of the first image. The 16 secondary subband images are fused through a third convolution operation and input into the depth semantic segmentation coding network for feature extraction to obtain a second feature map, whose resolution is 1/32 of the resolution of the first image. Further, the second feature map is split into 16 subband feature maps through a second convolution operation; every 4 of these subband feature maps are subjected to an inverse discrete wavelet transform (IWT), yielding 4 feature maps (resolution 1/16 of the first image resolution), and these 4 feature maps are subjected to the inverse discrete wavelet transform (IWT) again to obtain a third feature map with resolution 1/8 of the first image resolution.
Further, the first feature map with resolution 1/16 of the first image resolution obtained by the first branch is upsampled to 1/8 of the first image resolution, and is then fused with the first feature map of resolution 1/8 of the first image resolution from the first branch and the third feature map of resolution 1/8 of the first image resolution from the second branch; the fused feature map is input into the segmentation prediction network for prediction, thus obtaining the image segmentation result.
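Putting the pieces together, a skeleton of the single forward pass of fig. 4 might look as follows; every sub-module is a placeholder for the corresponding network described above, the laplacian_pyramid helper is the one sketched earlier, and the sketch keeps only a single first feature map for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchSegmenter(nn.Module):
    """Single-stage two-branch skeleton following fig. 4. All sub-modules passed
    in (lightweight_encoder, deep_encoder, reversible_down, reversible_restore,
    prediction_head) are placeholders for the corresponding networks; the
    resolutions in the comments follow the example of fig. 4."""
    def __init__(self, lightweight_encoder, deep_encoder,
                 reversible_down, reversible_restore, prediction_head, pyramid_levels=2):
        super().__init__()
        self.lightweight_encoder = lightweight_encoder
        self.deep_encoder = deep_encoder
        self.reversible_down = reversible_down
        self.reversible_restore = reversible_restore
        self.prediction_head = prediction_head
        self.pyramid_levels = pyramid_levels

    def forward(self, first_image):
        # Branch 1: high-frequency residuals -> lightweight encoder -> first feature map
        residuals = laplacian_pyramid(first_image, n=self.pyramid_levels)
        base = residuals[0].shape[-2:]
        stacked = torch.cat([F.interpolate(r, size=base, mode="bilinear",
                                           align_corners=False) for r in residuals], dim=1)
        first_map = self.lightweight_encoder(stacked)          # e.g. 1/8 of input resolution

        # Branch 2: reversible downsampling -> deep encoder -> inverse processing
        second_image = self.reversible_down(first_image)        # 1/4 of input resolution
        second_map = self.deep_encoder(second_image)             # 1/32 of input resolution
        third_map = self.reversible_restore(second_map)          # 1/8 of input resolution

        # Fusion and prediction: one forward pass, no tiling of the input
        third_map = F.interpolate(third_map, size=first_map.shape[-2:],
                                  mode="bilinear", align_corners=False)
        logits = self.prediction_head(torch.cat([first_map, third_map], dim=1))
        # Restore the prediction to the full resolution of the first image
        return F.interpolate(logits, size=first_image.shape[-2:],
                             mode="bilinear", align_corners=False)
```

The whole segmentation result is produced in one forward pass, without cutting the input image into blocks.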
Illustratively, the first convolution operation may be implemented by a convolutional neural network (CNN) comprising at least one fully-connected layer. The second convolution operation, which increases the number of channels of the second feature map, may be implemented using a convolution layer whose number of output channels is equal to the sum of the channel counts of the split subband feature maps. The third convolution operation is used to reduce the number of channels of the 16 secondary subband images and may be implemented using a convolution layer (CNN) whose number of output channels is consistent with the number of channels input by the depth semantic segmentation coding network.
Fig. 5 is a detailed flowchart of an image segmentation method according to an exemplary embodiment of the present application, based on the model architecture shown in fig. 4. The specific steps are as follows:
Step S501, in response to an image segmentation request, a first image to be processed is acquired.
This step is similar to step S201 described above, and will not be described again here.
Step S502, carrying out Laplacian pyramid decomposition on the first image to obtain a plurality of frequency domain components with different resolutions.
And step S503, after merging the frequency domain components with different resolutions, inputting the frequency domain components into a lightweight semantic segmentation coding network for feature extraction to obtain a plurality of first feature maps with different resolutions.
In this embodiment, the steps S502 to S503 are the processing flow of the first branch of the coding portion of the image segmentation model, and detailed implementation is described in the foregoing embodiments, which are not repeated here.
Step S504, carrying out reversible waveform transformation on the first image to obtain a plurality of first-level subband images.
Step S505, after the first convolution operation is performed on the first-level sub-band images, reversible waveform transformation is performed again, so as to obtain a plurality of second-level sub-band images.
And step S506, fusing the plurality of secondary sub-band images to obtain a second image.
And step S507, inputting the second image into a depth semantic segmentation coding network for feature extraction to obtain a second feature map.
Step S508, performing a second convolution operation on the second feature map, and splitting the second feature map into a plurality of sub-band feature maps.
Step S509, performing inverse transformation of two-stage reversible waveform transformation on the plurality of sub-band feature maps split from the second feature map, to obtain a third feature map.
In this embodiment, the steps S504-S509 are the processing flow of the second branch of the image segmentation model coding part, and detailed implementation is described in the foregoing embodiment, which is not repeated here.
Step S510, fusing the first feature map and the third feature map to obtain a fused feature map; and inputting the fusion feature map into a segmentation prediction network for prediction to obtain an image segmentation result.
This step is similar to step S204 described above, and will not be described again here.
Step S511, outputting the image segmentation result.
The detailed flowchart of the image segmentation method is provided in this embodiment, and the specific implementation manner and the technical effects that can be achieved refer to the corresponding content in the foregoing embodiment, which is not described herein again.
Fig. 6 is a flowchart of a model training method for image segmentation according to an exemplary embodiment of the present application, where the present embodiment provides a training method for an image segmentation model used in any one of the foregoing image segmentation method embodiments. The image segmentation model comprises: a lightweight semantic segmentation coding network, a depth semantic segmentation coding network and a segmentation prediction network. As shown in fig. 6, the image segmentation model may be trained by:
Step S601, a training set is obtained, wherein the training set comprises a plurality of sample images and image segmentation labeling information of the sample images.
In this embodiment, a corresponding training set is obtained for an image segmentation task of a specific application of an image segmentation model to be trained. The training set comprises a plurality of sample images corresponding to the current image segmentation task and image segmentation annotation information of each sample image. The image segmentation labeling information of the sample image is a reference image segmentation result of the sample image, and specifically may be a mask image, wherein a pixel value in the image mask represents real category information to which a corresponding pixel in the sample image belongs.
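For intuition, a toy annotation mask could look like the array below, where each value is the class index of the corresponding sample-image pixel; the class meanings are illustrative only and not part of the patent.

```python
import numpy as np

# 0 = background, 1 = building, 2 = road (hypothetical classes for illustration).
mask = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 0, 0],
], dtype=np.int64)
```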
And step S602, extracting a first feature map of the sample image through a lightweight semantic segmentation coding network.
In this step, through the first branch of the encoding portion of the image segmentation model, the spatial detail features of the ultra-high-resolution or high-resolution sample image are extracted using the lightweight semantic segmentation coding network to obtain the first feature map, which improves the inference speed and efficiency.
And step S603, performing reversible downsampling on the sample image to obtain a second image, inputting the second image into the depth semantic segmentation coding network to perform feature extraction to obtain a second feature map, and performing the inverse process of the reversible downsampling on the second feature map to obtain a third feature map.
In the step, a second branch of the image segmentation model coding part is used for carrying out reversible downsampling on a sample image to obtain a second image with smaller resolution; and extracting high-level semantic features of the second image with smaller resolution by using the depth semantic segmentation coding network to obtain a second feature map.
In this embodiment, reversible downsampling is performed in the second branch of the encoding portion of the image segmentation model, and the lower-resolution second feature map extracted by the depth semantic segmentation coding network is subjected to the inverse process of the reversible downsampling, so that it is restored to a higher-resolution third feature map. This reduces the image information lost in the image downsampling process and improves the expression capability of the image segmentation model, thereby improving the accuracy of image segmentation while increasing the inference speed.
The above-described step S602 and step S603 are implemented by two parallel branch networks of the image segmentation model and may be performed in parallel.
And step S604, inputting the fused first feature map and the fused third feature map into a segmentation prediction network for prediction, and obtaining a first image segmentation result.
In this step, when the first feature map and the third feature map are fused, they may first be upsampled to a uniform resolution and then fused by a simple operation such as concatenation or summation, so as to obtain the fused feature map of the first feature map and the third feature map. The fused feature map is input into the segmentation prediction network for prediction, and the prediction result is the first image segmentation result of the sample image.
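A sketch of this fusion-and-prediction step is given below (the concatenation variant is shown); the function name and the seg_head module are placeholders, not the actual prediction network.

```python
import torch
import torch.nn.functional as F

def fuse_and_predict(first_feat: torch.Tensor, third_feat: torch.Tensor, seg_head):
    """Upsample the first feature map to the third feature map's resolution, concatenate the
    two along channels, and run the segmentation prediction network on the fused map."""
    first_up = F.interpolate(first_feat, size=third_feat.shape[-2:],
                             mode='bilinear', align_corners=False)
    fused = torch.cat([first_up, third_feat], dim=1)   # summation also works if channel counts match
    return seg_head(fused)                              # per-class logits (first image segmentation result)
```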
Step S605, calculating a first loss according to the first image segmentation result and the image segmentation labeling information of the sample image, updating parameters of the image segmentation model according to the first loss to obtain a trained image segmentation model, and using the image segmentation model to encode and predict an input image to obtain an image segmentation result.
In this embodiment, the cross entropy loss is calculated according to the first image segmentation result and the image segmentation labeling information of the sample image, so as to obtain the first loss. The first loss may be understood as the task loss of the image segmentation task. The parameters of the image segmentation model are updated according to the first loss, and the trained image segmentation model is obtained after multiple rounds of iterative training until a convergence condition is met.
The convergence condition may be that the number of iterations exceeds a preset iteration number threshold, that the calculated loss is smaller than a preset loss threshold, or that the variation of the model parameters between two iterations is smaller than a variation threshold. The convergence condition may be set and adjusted according to the application scenario and empirical information; for example, the maximum iteration number may be set to 40K, 80K or 160K. The convergence condition is not particularly limited herein.
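A minimal training-loop sketch with a cross-entropy task loss and an iteration-count / loss-threshold stopping test could look as follows; the stand-in model, the synthetic data and the thresholds are assumptions for illustration, not the actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES, MAX_ITERS, LOSS_EPS = 5, 80_000, 1e-3   # 80K is one of the example iteration budgets

# Stand-in for the image segmentation model; the real model is the two-branch network of fig. 4.
model = nn.Conv2d(3, NUM_CLASSES, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(MAX_ITERS):
    image = torch.randn(2, 3, 64, 64)                       # synthetic sample images
    target = torch.randint(0, NUM_CLASSES, (2, 64, 64))     # synthetic annotation masks
    logits = model(image)                                   # first image segmentation result
    first_loss = F.cross_entropy(logits, target)            # task loss ("first loss")
    optimizer.zero_grad()
    first_loss.backward()
    optimizer.step()
    if first_loss.item() < LOSS_EPS:                        # loss-threshold convergence test
        break
```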
In the image segmentation model training process, the processing process of the image segmentation model on the sample image is consistent with the processing process of the image segmentation model on the first image in the image segmentation method embodiment, and the specific implementation manner and the technical effects that can be achieved are referred to the image segmentation method embodiment, and are not repeated here.
In an alternative embodiment, during the training of the image segmentation model, a super-resolution reconstruction module is added. The super-resolution reconstruction module is configured to upsample the third feature map extracted by the second branch so as to reconstruct the original input in the frequency domain, obtaining a third image with the same resolution as the originally input sample image. A wavelet smoothing loss function is then calculated according to the third image and the sample image to obtain a second loss. This second loss can constrain and optimize the reversible downsampling and its inverse processing, so as to reduce the image information lost by downsampling.
Specifically, the third image obtained by reconstruction may be subjected to the same reversible downsampling as the sample image, so as to obtain at least one level of subband image corresponding to the third image. And calculating a wavelet smoothing loss function according to each level of sub-band image of the third image and each level of sub-band image of the sample image to obtain a second loss.
Illustratively, taking a discrete wavelet transform as the reversible downsampling, one discrete wavelet transform transforms an input image into 4 subband images, including one low-frequency subband and three high-frequency subbands. Let L denote the number of discrete wavelet transform levels used by the reversible downsampling process in the second branch of the encoding portion of the image segmentation network. Performing an L-level discrete wavelet transform on the sample image yields the level-1 to level-L subband images of the sample image, and performing an L-level discrete wavelet transform on the third image yields the level-1 to level-L subband images of the third image. The value of L may be 1, 2 or 3, and may be set according to the application scenario and empirical values; it is not specifically limited herein.
For each level from 1 to L, according to the level-l subband images of the third image and the level-l subband images of the sample image, the L2 loss between their low-frequency subbands and the L1 loss between their high-frequency subbands are calculated; the L2 loss of the low-frequency subband and the L1 loss of the high-frequency subbands are weighted and summed to obtain the loss of that level, and the losses of all L levels are summed to obtain the second loss.
Illustratively, from the level-1 to level-L subband images of the third image and the level-1 to level-L subband images of the sample image, the wavelet smoothing loss function may be calculated to obtain the second loss using the following equation (2):

L_{wsl} = \sum_{l=1}^{L} \Big( \lambda_1 \big\| I_{l,b:1} - \hat{I}_{l,b:1} \big\|_2 + \lambda_2 \sum_{k=2}^{4} \big\| I_{l,b:k} - \hat{I}_{l,b:k} \big\|_1 \Big)    (2)

wherein L_{wsl} represents the second loss; I_{l,b:1} represents the low-frequency subband in the level-l subband image obtained after the l-level discrete wavelet transform of the sample image, and I_{l,b:k} represents the k-th high-frequency subband in that level-l subband image of the sample image; \hat{I}_{l,b:1} represents the low-frequency subband in the level-l subband image obtained after the l-level discrete wavelet transform of the third image, and \hat{I}_{l,b:k} represents the k-th high-frequency subband in that level-l subband image of the third image; k takes the values 2, 3, 4; \lambda_1 and \lambda_2 are the constraint weights for the low and high frequencies, respectively, and may be set based on empirical values, e.g., \lambda_1 = 1 and \lambda_2 = 0.8, without specific limitation herein; \|\cdot\|_2 denotes the 2-norm and \|\cdot\|_1 denotes the 1-norm.
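Under equation (2), and reusing the one-level Haar DWT helper sketched after the fig. 4 discussion, one possible implementation of the second loss is the following. The recursion on the low-frequency subband follows the standard multi-level DWT convention, the weights default to the example values λ1 = 1 and λ2 = 0.8, and the function name is illustrative.

```python
import torch

def wavelet_smooth_loss(sample_img: torch.Tensor, recon_img: torch.Tensor,
                        levels: int = 2, lam_low: float = 1.0, lam_high: float = 0.8):
    """Second loss: per DWT level, L2 between the low-frequency subbands and L1 between the
    high-frequency subbands of the sample image and the reconstructed third image, weighted
    and summed over all levels. Uses haar_dwt from the earlier sketch; spatial sizes must be
    divisible by 2**levels."""
    loss = sample_img.new_zeros(())
    x, y = sample_img, recon_img
    for _ in range(levels):
        x_ll, x_hl, x_lh, x_hh = torch.chunk(haar_dwt(x), 4, dim=1)
        y_ll, y_hl, y_lh, y_hh = torch.chunk(haar_dwt(y), 4, dim=1)
        low = torch.norm(x_ll - y_ll, p=2)                       # L2 on the low-frequency subband
        high = sum(torch.norm(a - b, p=1)                        # L1 on the three high-frequency subbands
                   for a, b in [(x_hl, y_hl), (x_lh, y_lh), (x_hh, y_hh)])
        loss = loss + lam_low * low + lam_high * high
        x, y = x_ll, y_ll                                        # next level decomposes the low-frequency subband
    return loss
```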
In this embodiment, the L2 loss is calculated for the low-frequency subband and the L1 loss is calculated for the high-frequency subbands. Constraining the high-frequency subbands with the L1 loss makes the texture distribution of the feature map extracted through the second branch identical or closer to the texture distribution of the original sample image, while avoiding the overfitting that could be caused by also constraining the high-frequency subbands with the L2 loss. Because the low-frequency subband represents the basic structural details of the image, constraining it with the L2 loss makes the spatial details of the feature maps extracted by the second branch as close as possible to those of the original sample image, which drives the second branch to better extract spatial detail features and improves the expression capability of the feature maps extracted by the second branch.
Further, when updating the parameters of the image segmentation model, the parameters of the image segmentation model are updated according to the first loss and the second loss. The weights of different losses can be set and adjusted according to actual application scenes and experience values, and are not particularly limited herein.
Illustratively, the first and second losses are weighted and summed to obtain a first integrated loss, and parameters of the image segmentation model are updated based on the first integrated loss.
In an optional embodiment, in the image segmentation model training process, a segmentation prediction network for training may be further added, and the second feature map is input into the segmentation prediction network for training to perform prediction, so as to obtain a second image segmentation result. And calculating a third loss according to the second image segmentation result and the image segmentation annotation information of the sample image.
Specifically, the cross entropy loss can be calculated according to the second image segmentation result and the image segmentation labeling information of the sample image, so as to obtain a third loss. The third loss may be understood as a split loss of the second branch.
Further, when updating the parameters of the image segmentation model, the first loss, the second loss and the third loss may be weighted and summed to obtain a second comprehensive loss, and the parameters of the image segmentation model are updated according to the second comprehensive loss.
For example, the weights of the first loss and the third loss may be the same, while the second loss uses a weight different from that of the first loss and the third loss, the weight of the second loss being small relative to the weight of the first loss; for example, the weights of the first loss and the third loss are 1 and the weight of the second loss is 0.1. In this embodiment, the weights of different losses may be set and adjusted according to the actual application scenario and empirical values, and are not specifically limited herein.
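For illustration, the second comprehensive loss with the example weights above could be assembled as below; the weights are the example values from this paragraph, not values mandated by the patent.

```python
def combined_loss(first_loss, second_loss, third_loss, w1=1.0, w2=0.1, w3=1.0):
    """Second comprehensive loss: weighted sum of the task loss, the wavelet smoothing loss,
    and the auxiliary segmentation loss of the second branch."""
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```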
In another alternative embodiment, only the first loss and the third loss may be calculated, and the first loss and the third loss may be weighted and summed to obtain a third comprehensive loss, and the parameters of the image segmentation model are updated according to the third comprehensive loss. The weights of different losses can be set and adjusted according to actual application scenes and experience values, and are not particularly limited herein.
Illustratively, based on the architecture of the image segmentation model shown in fig. 4, the architecture shown in fig. 7 may be constructed at training time; as shown in fig. 7, a super-resolution reconstruction module and a segmentation prediction network for training are added during training. The dashed lines in fig. 7 represent the flows that are only performed during training. After training is completed, the super-resolution reconstruction module and the segmentation prediction network used for training are removed, yielding the image segmentation model architecture shown in fig. 4.
Fig. 8 is a flowchart of an image segmentation method according to another exemplary embodiment of the present application. As shown in fig. 8, the specific steps of the image segmentation method are as follows:
step S801, receiving an image segmentation request sent by a terminal device, where the image segmentation request includes a first image to be processed, and the first image is a remote sensing image, an aerial image or a medical image.
In this embodiment, when image segmentation is required, the end device sends an image segmentation request to the server, where the image segmentation request includes a first image to be processed.
The first image to be processed may be an ultra-high resolution image such as a remote sensing image, an aerial image or a medical image, or may be other types of ultra-high resolution images or high resolution images, which are not limited herein.
Step S802, decomposing the first image into a plurality of frequency domain components with different resolutions, fusing the frequency domain components with different resolutions, and inputting the fused frequency domain components into a lightweight semantic segmentation coding network for feature extraction to obtain a first feature map.
Step S803, performing reversible downsampling on the first image to obtain a second image, inputting the second image into a depth semantic segmentation coding network to perform feature extraction to obtain a second feature map, and performing the inverse process of the reversible downsampling on the second feature map to obtain a third feature map.
And step S804, inputting the fused first feature map and third feature map into a segmentation prediction network for prediction, and obtaining an image segmentation result.
The specific implementation manner of the steps S802 to S804 is identical to the implementation manner of the steps S202 to S204, and the specific implementation manner and effect are referred to the relevant content of the foregoing embodiment, which is not repeated here.
Step S805, transmitting the image segmentation result to the end-side device.
In this embodiment, the server transmits the image segmentation result to the end-side device. The end-side device receives the image segmentation result sent by the server and outputs the image segmentation result.
For example, when applied to a remote sensing image ground feature segmentation scenario, the image segmentation result may be the ground feature segmentation result of the remote sensing image. The ground feature segmentation result segments the areas where various ground features are located in the remote sensing image. The server outputs the ground feature segmentation result of the remote sensing image to the end-side device, and the end-side device outputs the ground feature segmentation result of the remote sensing image for the user to view and use.
For example, the feature segmentation result of the remote sensing image is an image mask with the same resolution as the remote sensing image, and the pixel value in the image mask represents the feature class to which the corresponding pixel in the remote sensing image belongs. When the ground feature segmentation result of the remote sensing image is output, different areas covered by the ground feature categories can be marked on the remote sensing image according to the image mask.
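As an illustration of how such an image mask could be rendered for viewing, the sketch below colorizes class indices with a hypothetical palette; the class names and colors are assumptions and not part of the patent.

```python
import numpy as np

def render_land_cover(mask: np.ndarray, palette: dict) -> np.ndarray:
    """Colorize a ground-feature segmentation mask: each pixel's class index is mapped
    to an RGB color so the segmented regions can be displayed over the remote sensing image."""
    canvas = np.zeros((*mask.shape, 3), dtype=np.uint8)
    for class_id, color in palette.items():
        canvas[mask == class_id] = color
    return canvas

# Hypothetical classes: 0 = background, 1 = building, 2 = water, 3 = vegetation.
palette = {0: (0, 0, 0), 1: (200, 0, 0), 2: (0, 0, 200), 3: (0, 150, 0)}
overlay = render_land_cover(np.zeros((4, 4), dtype=np.int64), palette)
```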
For example, when applied to an image search scenario in the field of electronic commerce, the end-side device may be a server of the image search system. The image segmentation result may be an image segmentation result of the target item image, which may segment an area of the target item image where the target item is located. The server outputs the image segmentation result of the target object image to the end-side device. And the terminal side equipment identifies and searches the target object according to the area where the target object is located, obtains the related information of the target object, and outputs the related information of the target object.
For example, when applied to an image segmentation scenario in the medical field, the end-side device may be a server of the medical system. The image segmentation result may be an image segmentation result of the medical image, which may segment out areas of the medical image where different tissue structures are located. And the server outputs the image segmentation result of the medical image to the end-side equipment. The terminal equipment realizes processing functions such as medical teaching and focus identification according to the areas where different tissue structures are located in the medical image.
In this embodiment, two branches of the encoding portion of the image segmentation model are used: one branch uses a lightweight network to extract the spatial detail features of the higher-resolution image, and the other branch uses a deep network to extract the high-level semantic features of the lower-resolution, downsampled image. The first feature map and the third feature map extracted by the two branches are fused to obtain a fused feature map, which contains both spatial detail features and high-level semantic features and therefore has better expression capability. Further, the image segmentation result is obtained by prediction through the segmentation prediction network according to the fused feature map, and can be obtained by a single inference in a single-stage manner, so that the inference speed and efficiency of the image segmentation model are significantly improved while the accuracy of image segmentation is also improved.
Fig. 9 is a schematic structural diagram of an image segmentation apparatus according to an exemplary embodiment of the present application. The image segmentation device provided by the embodiment of the application can execute the processing flow provided by the embodiment of the image segmentation method. As shown in fig. 9, the image dividing apparatus 90 includes: an image acquisition module 91, a first encoding module 92, a second encoding module 93 and a prediction module 94.
The image acquisition module 91 is configured to acquire a first image to be processed in response to an image segmentation request.
The first encoding module 92 is configured to extract a first feature map of the first image through the lightweight semantic segmentation encoding network.
The second encoding module 93 is configured to perform reversible downsampling on the first image to obtain a second image, input the second image into the depth semantic segmentation encoding network to perform feature extraction to obtain a second feature map, and perform the inverse process of the reversible downsampling on the second feature map to obtain a third feature map.
The prediction module 94 is configured to fuse the first feature map and the third feature map, input the fused first feature map and the fused third feature map to the segmentation prediction network for prediction, obtain an image segmentation result, and output the image segmentation result.
In an alternative embodiment, when performing the reversible downsampling of the first image to obtain the second image, the second encoding module 93 is further configured to: performing at least one-stage reversible waveform transformation on the first image to obtain a plurality of sub-band images; and fusing the plurality of sub-band images to obtain a second image.
In an alternative embodiment, at least one level of reversible waveform transformation is performed on the first image to obtain a plurality of sub-band images; when the plurality of subband images are fused to obtain the second image, the second encoding module 93 is further configured to: performing reversible waveform transformation on the first image to obtain a plurality of primary subband images; respectively carrying out first convolution operation on the first-level sub-band images, and then carrying out reversible waveform transformation again to obtain a plurality of second-level sub-band images; and fusing the plurality of secondary sub-band images to obtain a second image.
In an alternative embodiment, when implementing the inverse process of the reversible downsampling on the second feature map to obtain the third feature map, the second encoding module 93 is further configured to: performing a second convolution operation on the second feature map, and splitting the second feature map into a plurality of sub-band feature maps; and carrying out the inverse transformation of at least one level of reversible waveform transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map.
In an alternative embodiment, when implementing the inverse transformation of the at least one level of reversible waveform transformation on the plurality of subband feature maps split from the second feature map to obtain the third feature map, the second encoding module 93 is further configured to: carrying out the inverse transformation of a two-level reversible waveform transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map.
In an alternative embodiment, when implementing at least one level of reversible waveform transformation of the first image to obtain a plurality of subband images, the second encoding module 93 is further configured to: at least one level of discrete wavelet transform is performed on the first image to obtain a plurality of subband images.
Accordingly, when implementing the inverse transformation of at least one level of reversible waveform transformation on the plurality of subband feature maps split from the second feature map to obtain the third feature map, the second encoding module 93 is further configured to: carrying out at least one level of inverse discrete wavelet transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map.
In an alternative embodiment, when implementing at least one level of reversible waveform transformation of the first image to obtain a plurality of subband images, the second encoding module 93 is further configured to: and performing at least one level of contourlet transformation on the first image to obtain a plurality of sub-band images.
Accordingly, when implementing the inverse transformation of at least one level of reversible waveform transformation on the plurality of subband feature maps split from the second feature map to obtain the third feature map, the second encoding module 93 is further configured to: carrying out at least one level of inverse contourlet transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map.
In an alternative embodiment, in implementing the extraction of the first feature map of the first image through the lightweight semantic segmentation encoding network, the first encoding module 92 is further configured to: carrying out Laplacian pyramid decomposition on the first image to obtain a plurality of frequency domain components with different resolutions; and after the frequency domain components with different resolutions are fused, inputting the frequency domain components into a lightweight semantic segmentation coding network for feature extraction, and obtaining a first feature map.
In an alternative embodiment, after implementing the fusion of the frequency domain components with different resolutions, the lightweight semantic segmentation encoding network is input to perform feature extraction, and when obtaining the first feature map, the first encoding module 92 is further configured to: splicing or summing the frequency domain components with different resolutions to obtain a fusion result; and inputting the fusion result into a lightweight semantic segmentation coding network to perform feature extraction to obtain a plurality of first feature graphs with different resolutions.
The device provided in the embodiment of the present application may be specifically used to execute the scheme provided in any one of the embodiments of the image segmentation method, and specific functions and technical effects that can be achieved are not described herein.
Fig. 10 is a schematic structural diagram of a model training apparatus for image segmentation according to an exemplary embodiment of the present application. The image segmentation model training device provided by the embodiment of the application can execute the processing flow provided by the embodiment of the image segmentation model training method. In this embodiment, the image segmentation model to be trained includes: a lightweight semantic segmentation coding network, a depth semantic segmentation coding network and a segmentation prediction network.
As shown in fig. 10, the model training apparatus 100 for image segmentation includes: a first coding unit 1001, a second coding unit 1002, a prediction unit 1003, and a training unit 1004.
Wherein the first encoding unit 1001 is configured to extract a first feature map of a sample image through a lightweight semantic segmentation encoding network.
The second encoding unit 1002 is configured to perform reversible downsampling on the sample image to obtain a second image, input the second image into a depth semantic segmentation encoding network to perform feature extraction to obtain a second feature map, and perform the inverse process of the reversible downsampling on the second feature map to obtain a third feature map.
The prediction unit 1003 is configured to fuse the first feature map and the third feature map, input the fused first feature map and the fused third feature map to a segmentation prediction network, and predict the fused first feature map and the third feature map to obtain a first image segmentation result.
The training unit 1004 is configured to calculate a first loss according to the first image segmentation result and the image segmentation labeling information of the sample image, and update parameters of the image segmentation model according to the first loss, so as to obtain a trained image segmentation model. The image segmentation model is used for encoding and predicting the input image to obtain an image segmentation result.
In an alternative embodiment, training unit 1004 is further configured to: up-sampling the third feature map according to the resolution of the sample image to obtain a third image with the same resolution as the sample image; and calculating a wavelet smoothing loss function according to the third image and the sample image to obtain a second loss. In implementing updating parameters of the image segmentation model according to the first penalty, the training unit 1004 is further configured to: and updating parameters of the image segmentation model according to the first loss and the second loss.
In an alternative embodiment, training unit 1004 is further configured to: inputting the second feature map into a segmentation prediction network for prediction to obtain a second image segmentation result; and calculating a third loss according to the second image segmentation result and the image segmentation annotation information of the sample image. In implementing updating parameters of the image segmentation model according to the first loss and the second loss, the training unit 1004 is further configured to: and updating parameters of the image segmentation model according to the first loss, the second loss and the third loss.
The device provided in the embodiment of the present application may be specifically used to execute the scheme provided in any one of the embodiments of the image segmentation model training method, and specific functions and technical effects that can be achieved are not described herein again.
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 11, the server includes: a memory 1101 and a processor 1102. Memory 1101 is used to store computer-executable instructions and may be configured to store various other data to support operations on a server. The processor 1102 is communicatively connected to the memory 1101, and is configured to execute computer-executable instructions stored in the memory 1101, so as to implement the technical solution provided in any one of the above method embodiments, and the specific functions and the technical effects that can be implemented are similar, and are not repeated herein.
In fig. 11, a cloud server deployed in the cloud is taken as an example of the server; the server may also be a local server.
Optionally, as shown in fig. 11, the server further includes: firewall 1103, load balancer 1104, communication component 1105, power component 1106, and other components. Only some of the components are schematically shown in fig. 11, which does not mean that the server only comprises the components shown in fig. 11.
The embodiment of the application further provides a computer readable storage medium, in which computer executable instructions are stored, and when the computer executable instructions are executed by a processor, the computer executable instructions are used to implement the scheme provided by any one of the method embodiments, and specific functions and technical effects that can be implemented are not described herein.
The embodiment of the application also provides a computer program product, which comprises: the computer program is stored in a readable storage medium, and the computer program can be read from the readable storage medium by at least one processor of the server, where execution of the computer program by at least one processor causes the server to execute the solution provided by any one of the method embodiments, and specific functions and technical effects that can be achieved are not described herein. The embodiment of the application provides a chip, which comprises: the processing module and the communication interface, the processing module can execute the technical scheme of the server in the foregoing method embodiment. Optionally, the chip further includes a storage module (e.g. a memory), where the storage module is configured to store the instructions, and the processing module is configured to execute the instructions stored in the storage module, and execution of the instructions stored in the storage module causes the processing module to execute the technical solution provided in any one of the foregoing method embodiments.
The memory may be an object store (Object Storage Service, OSS).
The memory may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication component is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located may access a wireless network based on a communication standard, such as WiFi, a mobile communication network of the second generation mobile communication system (2G), the third generation mobile communication system (3G), the fourth generation mobile communication system (4G)/Long Term Evolution (LTE) or the fifth generation mobile communication system (5G), or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component provides power for various components of equipment where the power supply component is located. The power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, compact disk read-only memory (CD-ROM), optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should be noted that, the user information (including but not limited to user equipment information, user attribute information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations appearing in a particular order are included, but it should be clearly understood that the operations may be performed out of order or performed in parallel in the order in which they appear herein, merely for distinguishing between the various operations, and the sequence number itself does not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types. The meaning of "a plurality of" is two or more, unless specifically defined otherwise.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. An image segmentation method, comprising:
responding to an image segmentation request, and acquiring a first image to be processed;
extracting a first feature map of the first image through a lightweight semantic segmentation coding network, carrying out reversible downsampling on the first image to obtain a second image, inputting the second image into a depth semantic segmentation coding network to carry out feature extraction to obtain a second feature map, and carrying out inverse processing of the reversible downsampling on the second feature map to obtain a third feature map;
And inputting the fused first feature map and third feature map into a segmentation prediction network for prediction to obtain an image segmentation result, and outputting the image segmentation result.
2. The method of claim 1, wherein the reversibly downsampling the first image to obtain a second image comprises:
performing at least one-stage reversible waveform transformation on the first image to obtain a plurality of sub-band images;
and fusing the plurality of sub-band images to obtain the second image.
3. The method of claim 2, wherein said performing at least one stage of reversible waveform transformation on said first image results in a plurality of subband images; fusing the plurality of sub-band images to obtain the second image, including:
performing reversible waveform transformation on the first image to obtain a plurality of primary subband images;
after the first convolution operation is carried out on the primary sub-band images respectively, reversible waveform transformation is carried out again to obtain a plurality of secondary sub-band images;
and fusing the plurality of secondary sub-band images to obtain the second image.
4. The method according to claim 2, wherein said performing the inverse processing of said reversible downsampling on said second feature map to obtain a third feature map comprises:
Performing a second convolution operation on the second feature map, and splitting the second feature map into a plurality of sub-band feature maps;
and carrying out at least one level of inverse transformation of the reversible waveform transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map.
5. The method of claim 4, wherein said performing at least one inverse of said reversible waveform transformation on said plurality of sub-band feature maps split from said second feature map to obtain said third feature map comprises:
and carrying out a two-level inverse transformation of the reversible waveform transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map.
6. The method of claim 4, wherein said performing at least one stage of reversible waveform transformation on said first image results in a plurality of subband images, comprising:
performing at least one level of discrete wavelet transform on the first image to obtain a plurality of sub-band images;
correspondingly, the performing at least one level of inverse transformation of the reversible waveform transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map includes:
and carrying out at least one level of inverse discrete wavelet transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map.
7. The method of claim 4, wherein said performing at least one stage of reversible waveform transformation on said first image results in a plurality of subband images, comprising:
performing at least one-stage contourlet transformation on the first image to obtain a plurality of sub-band images;
correspondingly, the performing at least one level of inverse transformation of the reversible waveform transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map includes:
and carrying out at least one level of inverse contourlet transformation on the plurality of sub-band feature maps split from the second feature map to obtain the third feature map.
8. The method of any of claims 1-7, wherein the extracting the first feature map of the first image through the lightweight semantic segmentation encoding network comprises:
carrying out Laplacian pyramid decomposition on the first image to obtain a plurality of frequency domain components with different resolutions;
and after the frequency domain components with different resolutions are fused, inputting the frequency domain components into a lightweight semantic segmentation coding network for feature extraction, and obtaining a first feature map.
9. The method of claim 8, wherein the merging the frequency domain components with the plurality of different resolutions, inputting the merged frequency domain components into a lightweight semantic segmentation encoding network for feature extraction, and obtaining a first feature map includes:
splicing or summing the frequency domain components with different resolutions to obtain a fusion result;
and inputting the fusion result into the lightweight semantic segmentation coding network to perform feature extraction to obtain a plurality of first feature graphs with different resolutions.
10. A model training method for image segmentation, comprising:
the image segmentation model to be trained comprises: a lightweight semantic segmentation coding network, a depth semantic segmentation coding network and a segmentation prediction network,
extracting a first feature map of a sample image through the lightweight semantic segmentation coding network, carrying out reversible downsampling on the sample image to obtain a second image, inputting the second image into the depth semantic segmentation coding network to carry out feature extraction to obtain a second feature map, and carrying out the inverse processing of the reversible downsampling on the second feature map to obtain a third feature map;
the first feature map and the third feature map are fused and then input into the segmentation prediction network for prediction, and a first image segmentation result is obtained;
And calculating a first loss according to the first image segmentation result and the image segmentation marking information of the sample image, updating parameters of the image segmentation model according to the first loss to obtain a trained image segmentation model, wherein the image segmentation model is used for encoding and predicting an input image to obtain an image segmentation result.
11. The method as recited in claim 10, further comprising:
up-sampling the third feature map according to the resolution of the sample image to obtain a third image with the same resolution as the sample image;
calculating a wavelet smoothing loss function according to the third image and the sample image to obtain a second loss;
the updating parameters of the image segmentation model according to the first loss comprises:
and updating parameters of the image segmentation model according to the first loss and the second loss.
12. The method as recited in claim 11, further comprising:
inputting the second feature map into the segmentation prediction network for prediction to obtain a second image segmentation result;
calculating a third loss according to the second image segmentation result and the image segmentation marking information of the sample image;
Said updating parameters of said image segmentation model based on said first loss and said second loss, comprising:
and updating parameters of the image segmentation model according to the first loss, the second loss and the third loss.
13. An image segmentation method, comprising:
receiving an image segmentation request sent by terminal equipment, wherein the image segmentation request comprises a first image to be processed, and the first image is a remote sensing image, an aerial image or a medical image;
decomposing the first image into a plurality of frequency domain components with different resolutions, merging the frequency domain components with different resolutions, and inputting the merged frequency domain components into a lightweight semantic segmentation coding network for feature extraction to obtain a first feature map;
the first image is subjected to reversible downsampling to obtain a second image, the second image is input into a depth semantic segmentation coding network to perform feature extraction to obtain a second feature map, and the second feature map is subjected to the inverse processing of the reversible downsampling to obtain a third feature map;
inputting the fused first feature map and third feature map into a segmentation prediction network for prediction to obtain an image segmentation result;
And sending the image segmentation result to the end-side equipment.
14. A server, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1-13.
CN202310333563.1A 2023-03-28 2023-03-28 Image segmentation and model training method and server Active CN116342884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310333563.1A CN116342884B (en) 2023-03-28 2023-03-28 Image segmentation and model training method and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310333563.1A CN116342884B (en) 2023-03-28 2023-03-28 Image segmentation and model training method and server

Publications (2)

Publication Number Publication Date
CN116342884A true CN116342884A (en) 2023-06-27
CN116342884B CN116342884B (en) 2024-02-06

Family

ID=86892788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310333563.1A Active CN116342884B (en) 2023-03-28 2023-03-28 Image segmentation and model training method and server

Country Status (1)

Country Link
CN (1) CN116342884B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951288A (en) * 2020-07-15 2020-11-17 University of South China Skin cancer lesion segmentation method based on deep learning
US20220315243A1 (en) * 2021-04-01 2022-10-06 Chongqing University Method for identification and recognition of aircraft take-off and landing runway based on PSPNet network
ZA202207731B (en) * 2021-04-25 2022-07-27 Zhejiang Normal University A semantic segmentation system and method based on dual feature fusion for IoT sensing
CN113159051A (en) * 2021-04-27 2021-07-23 Changchun University of Science and Technology Remote sensing image lightweight semantic segmentation method based on edge decoupling
CN113470057A (en) * 2021-06-29 2021-10-01 Shanghai SenseTime Intelligent Technology Co., Ltd. Semantic segmentation method and device, electronic equipment and computer-readable storage medium
US11521377B1 (en) * 2021-10-26 2022-12-06 Nanjing University Of Information Sci. & Tech. Landslide recognition method based on Laplacian pyramid remote sensing image fusion
CN114120102A (en) * 2021-11-03 2022-03-01 China Huaneng Group Clean Energy Technology Research Institute Co., Ltd. Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN114913182A (en) * 2022-06-22 2022-08-16 Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. Image segmentation method, device, equipment and storage medium
CN115311454A (en) * 2022-07-05 2022-11-08 Changsha University of Science and Technology Image segmentation method based on residual feature optimization and attention mechanism
CN115457498A (en) * 2022-09-22 2022-12-09 Hefei University of Technology Urban road semantic segmentation method based on dual attention and dense connection
CN115713535A (en) * 2022-11-07 2023-02-24 Alibaba China Co., Ltd. Image segmentation model determination method and image segmentation method
CN115546236A (en) * 2022-11-24 2022-12-30 Alibaba China Co., Ltd. Image segmentation method and device based on wavelet transform

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DEYI JI et al.: "Structural and Statistical Texture Knowledge Distillation for Semantic Segmentation", 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16855-16864 *
PENGJU LIU, HONGZHI ZHANG, KAI ZHANG et al.: "Multi-Level Wavelet-CNN for Image Restoration", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 82-84 *
SHAOHUA GUO, LIANG LIU, ZHENYE GAN et al.: "ISDNet: Integrating Shallow and Deep Networks for Efficient Ultra-High Resolution Segmentation", 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-3 *
YAN YITONG: "Research on Lightweight Image Super-Resolution Based on Multi-Scale Feature Fusion", China Master's Theses Full-Text Database - Information Science and Technology, pages 138-2624 *
LIU YUN, LU CHENGZE, LI SHIJIE et al.: "Lightweight Semantic Segmentation Based on Efficient Multi-Scale Feature Extraction", Chinese Journal of Computers, vol. 45, no. 7, pages 1517-1528 *
ZHAO DUODUO; ZHANG JIANWU; FU JIANFENG: "Research on Real-Time Pedestrian Flow Counting Method Based on Deep Learning", Chinese Journal of Sensors and Actuators, no. 8, pages 87-94 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475155A (en) * 2023-12-26 2024-01-30 Xiamen Ruiwei Information Technology Co., Ltd. Lightweight remote sensing image segmentation method based on semi-supervised learning
CN117475155B (en) * 2023-12-26 2024-04-02 Xiamen Ruiwei Information Technology Co., Ltd. Lightweight remote sensing image segmentation method based on semi-supervised learning
CN117808685A (en) * 2024-02-29 2024-04-02 Guangdong Qinzhi Technology Research Institute Co., Ltd. Method and device for enhancing infrared image data

Also Published As

Publication number Publication date
CN116342884B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN116342884B (en) Image segmentation and model training method and server
CN111832570A (en) Image semantic segmentation model training method and system
CN111819580A (en) Neural architecture search for dense image prediction tasks
CN116597039B (en) Image generation method and server
CN112565777B (en) Deep learning model-based video data transmission method, system, medium and device
EP3958168A1 (en) Method and device for identifying video
CN113095346A (en) Data labeling method and data labeling device
CN114926338A (en) Model training method and device, electronic equipment and storage medium
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
WO2022028197A1 (en) Image processing method and device thereof
CN116978011B (en) Image semantic communication method and system for intelligent target recognition
CN116775807A (en) Natural language processing and model training method, equipment and storage medium
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
CN114067327A (en) Text recognition method and device, readable medium and electronic equipment
CN114170425A (en) Model training method, image classification method, server and storage medium
CN110796003B (en) Lane line detection method and device and electronic equipment
CN115760607A (en) Image restoration method, device, readable medium and electronic equipment
CN114330239A (en) Text processing method and device, storage medium and electronic equipment
CN111726476B (en) Image processing method, device, equipment and computer readable medium
EP3683733A1 (en) A method, an apparatus and a computer program product for neural networks
CN115118972A (en) Video image coding and decoding method and related equipment
CN114095728B (en) End-to-end video compression method, device and computer readable storage medium
CN116934557B (en) Behavior prediction information generation method, device, electronic equipment and readable medium
CN115661238B (en) Method and device for generating travelable region, electronic equipment and computer readable medium
CN117787380A (en) Model acquisition method, device, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant