CN115035301A - Method and device for image segmentation - Google Patents

Method and device for image segmentation

Info

Publication number
CN115035301A
CN115035301A (application CN202210734947.XA)
Authority
CN
China
Prior art keywords: features, layer, fusion, different sizes, different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210734947.XA
Other languages
Chinese (zh)
Inventor
张凯昱 (Zhang Kaiyu)
杨青 (Yang Qing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Du Xiaoman Technology Beijing Co Ltd
Original Assignee
Du Xiaoman Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Du Xiaoman Technology Beijing Co Ltd filed Critical Du Xiaoman Technology Beijing Co Ltd
Priority to CN202210734947.XA priority Critical patent/CN115035301A/en
Publication of CN115035301A publication Critical patent/CN115035301A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention aims to provide a method and a device for image segmentation. The method comprises the following steps: when an image segmentation operation is performed, for acquired features belonging to the same feature layer, performing a first fusion operation on the features using dilated (hole) convolutions with different dilation rates, so as to fuse features of different sizes within the same feature layer; and, for features of different sizes from different feature layers, performing a second fusion operation on the features via skip connections, so as to fuse features of different sizes across different feature layers. The embodiments of the application have the following advantages: fusion of features from different feature layers and of different sizes is achieved, the problem in the prior art that features of different sizes on different feature layers cannot be fused is solved, and the performance of the algorithm model is improved.

Description

Method and device for image segmentation
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for image segmentation.
Background
An algorithm for the image segmentation task classifies each pixel in an image and groups pixels with the same property, so as to extract from the image matrix the regions that satisfy the task objective.
In prior-art schemes, mainstream image segmentation algorithms typically give the deep features of an image a larger receptive field by cascading multiple convolutional layers with strides greater than 1. Although cascading multiple convolutional layers helps to enlarge the receptive field, the layer-by-layer reduction of the feature resolution causes useful information to be lost. In addition, the cascaded convolutional-layer structure gives up the reuse of the features of each level, which further degrades the performance of the algorithm. Moreover, image segmentation methods based on prior-art schemes cannot fuse features of different levels and different sizes.
Disclosure of Invention
The invention aims to provide a method and a device for image segmentation.
According to an embodiment of the present application, there is provided a method for image segmentation, wherein the method comprises:
when an image segmentation operation is performed, for acquired features belonging to the same feature layer, performing a first fusion operation on the features using dilated convolutions with different dilation rates, so as to fuse features of different sizes within the same feature layer;
and, for features of different sizes from different feature layers, performing a second fusion operation on the features via skip connections, so as to fuse features of different sizes across different feature layers.
According to an embodiment of the present application, there is provided an apparatus for image segmentation, wherein the apparatus includes:
when an image segmentation operation is performed, for acquired features belonging to the same feature layer, performing a first fusion operation on the features using dilated convolutions with different dilation rates, so as to fuse features of different sizes within the same feature layer;
and performing, for features of different sizes from different feature layers, a second fusion operation on the features via skip connections, so as to fuse features of different sizes across different feature layers.
According to an embodiment of the present application, there is provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the embodiment of the present application when executing the program.
According to an embodiment of the present application, there is provided a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method of the embodiment of the present application.
Compared with the prior art, the embodiments of the application have the following advantages: according to the scheme of the embodiments of the application, when an image is segmented, fusion is performed both for features belonging to the same feature layer and for features of different sizes from different feature layers, so that features from different feature layers and of different sizes are fused; this solves the problem in the prior art that features of different sizes on different feature layers cannot be fused, and improves the performance of the algorithm model. Moreover, the scheme according to the embodiments of the application expands the expressive capability of the model by using a neural architecture search method.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 shows a flow diagram of a method for image segmentation according to an embodiment of the present application;
FIG. 2(a) shows a schematic diagram of an exemplary network architecture according to an embodiment of the present application;
FIG. 2(b) shows a schematic diagram of an exemplary ASPP module according to an embodiment of the present application;
FIG. 2(c) shows a schematic diagram of the operation of an exemplary down-sampling search unit according to an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of an apparatus for image segmentation according to an embodiment of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The term "computer device" or "computer" in this context refers to an intelligent electronic device that can execute predetermined processes such as numerical calculation and/or logic calculation by running predetermined programs or instructions, and may include a processor and a memory, wherein the processor executes a pre-stored instruction stored in the memory to execute the predetermined processes, or the predetermined processes are executed by hardware such as ASIC, FPGA, DSP, or a combination thereof. Computer devices include, but are not limited to, servers, personal computers, laptops, tablets, smart phones, and the like.
The computer equipment comprises user equipment and network equipment. Wherein the user equipment includes but is not limited to computers, smart phones, PDAs, etc.; the network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a Cloud Computing (Cloud Computing) based Cloud consisting of a large number of computers or network servers, wherein Cloud Computing is one of distributed Computing, a super virtual computer consisting of a collection of loosely coupled computers. The computer equipment can be independently operated to realize the application, and can also be accessed into a network to realize the application through the interactive operation with other computer equipment in the network. The network in which the computer device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
It should be noted that the user devices, network devices, networks, etc. are merely examples, and other existing or future computer devices or networks may be included within the scope of the present application and are also included herein by reference, as applicable.
The methods discussed below, some of which are illustrated by flow diagrams, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. The processor(s) may perform the necessary tasks.
Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present application. This application may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element may be termed a second element, and, similarly, a second element may be termed a first element, without departing from the scope of example embodiments. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, there are no intervening elements present. Other words used to describe the relationship between elements (e.g., "between" versus "directly between", "adjacent" versus "directly adjacent to", etc.) should be interpreted in a similar manner.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
The present invention is described in further detail below with reference to the attached drawing figures.
Fig. 1 shows a flow chart of a method for image segmentation according to an embodiment of the present application. The method includes step S1 and step S2.
In step S1, when an image segmentation operation is performed, for acquired features belonging to the same feature layer, a first fusion operation is performed on the features using dilated (hole) convolutions with different dilation rates, so as to fuse features of different sizes within the same feature layer.
Preferably, an Atrous Spatial Pyramid Pooling (ASPP) module is used to perform the first fusion operation on the features with dilated convolutions of different dilation rates, so as to fuse features of different sizes within the same feature layer.
The ASPP module applies dilated convolutions with different dilation rates to the input features in parallel and fuses the results.
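By way of illustration only (this sketch is not part of the original disclosure), the following PyTorch snippet shows the idea behind such a fusion: a 3x3 convolution with dilation rate r covers a (2r+1)x(2r+1) receptive field while, with padding equal to r, every branch keeps the spatial size of the input, so responses at several effective sizes can be concatenated and fused. The channel counts and dilation rates below are assumptions for the example, not values taken from the application.

```python
import torch
import torch.nn as nn

class DilatedFusion(nn.Module):
    """Parallel dilated ("hole") 3x3 convolutions with different dilation rates.

    A 3x3 kernel with dilation r spans (2r+1)x(2r+1) pixels; with padding=r
    every branch keeps the input resolution, so the branch outputs can be
    fused directly by channel concatenation and a 1x1 convolution.
    """
    def __init__(self, in_ch=64, out_ch=64, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 64, 32, 32)
print(DilatedFusion()(x).shape)  # torch.Size([1, 64, 32, 32])
```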
In step S2, for features of different sizes from different feature layers, a second fusion operation is performed on the features via skip connections, so as to fuse features of different sizes across different feature layers.
Specifically, for the acquired features of different sizes from different feature layers, the features of each feature layer are concatenated, via skip connections as in the U-Net structure, with the features of the symmetric layer in the up-sampling path, so as to complete the fusion of features of different sizes from different feature layers.
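A minimal sketch of this concatenation step (the shapes and the interpolation mode are assumptions; the actual up-sampling operator is determined by the search described below): the decoder feature is up-sampled to the resolution of its symmetric encoder feature and the two are concatenated along the channel dimension, as in U-Net.

```python
import torch
import torch.nn.functional as F

# Encoder (feature-layer) output and the decoder feature of its symmetric layer;
# the shapes are made up for the example.
encoder_feat = torch.randn(1, 64, 128, 128)   # shallow, higher-resolution feature
decoder_feat = torch.randn(1, 128, 64, 64)    # deeper feature being decoded

up = F.interpolate(decoder_feat, scale_factor=2, mode="bilinear", align_corners=False)
fused = torch.cat([encoder_feat, up], dim=1)  # skip connection: channel concatenation
print(fused.shape)                            # torch.Size([1, 192, 128, 128])
```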
As is familiar to those skilled in the art, the conventional U-Net structure fuses shallow and deep features of the same size by fusing features of the same level within an encoder-decoder structure.
According to one embodiment, the method comprises step S3.
In step S3, the unit structures of the down-sampling layers and the up-sampling layers are searched using a neural architecture search method.
Specifically, the step S3 includes a step S301.
In step S301, the search units of the down-sampling layers and the up-sampling layers are constructed layer by layer.
In a down-sampling search unit, the candidate operators are formed by conventional convolution, channel-by-channel (depthwise) convolution combined with point-by-point (1x1) convolution, and pooling operations; in an up-sampling search unit, the candidate operators are formed by transposed convolution and by up-sampling followed by a 1x1 convolution.
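The following sketch illustrates what such candidate sets might look like in PyTorch. The kernel sizes follow the description of FIG. 2(c) below, while the 1x1 convolutions appended to the pooling branches are an assumption added only so that every candidate produces the same number of output channels; they are not taken from the application.

```python
import torch.nn as nn

def downsampling_candidates(in_ch, out_ch):
    """Candidate operators of a down-sampling search unit (stride 2)."""
    return nn.ModuleDict({
        "conv3x3": nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        # channel-by-channel (depthwise) convolution followed by a point-by-point (1x1) convolution
        "dw_pw": nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch),
            nn.Conv2d(in_ch, out_ch, 1),
        ),
        # pooling candidates; the trailing 1x1 convolutions only equalize channel counts for this sketch
        "maxpool": nn.Sequential(nn.MaxPool2d(3, stride=2, padding=1),
                                 nn.Conv2d(in_ch, out_ch, 1)),
        "meanpool": nn.Sequential(nn.AvgPool2d(3, stride=2, padding=1),
                                  nn.Conv2d(in_ch, out_ch, 1)),
    })

def upsampling_candidates(in_ch, out_ch):
    """Candidate operators of an up-sampling search unit (factor 2)."""
    return nn.ModuleDict({
        "transposed_conv": nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2),
        "upsample_1x1": nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, out_ch, 1),
        ),
    })
```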
According to one embodiment, the method comprises step S4.
In step S4, network training is performed to obtain the trained weight parameters of the operators and the network structure parameters of the down-sampling and up-sampling search units.
According to one embodiment, the network training process includes a model search stage and a model fine-tuning stage. In the model search stage, the weight parameters of the operators and the network structure parameters of the down-sampling and up-sampling search units are trained alternately using a training set and a validation set until the loss function converges.
In the model fine-tuning stage, the operator with the largest structure parameter in each search unit is selected; the selected operators, together with a convolution kernel of size 1 whose number of output channels equals the number of classes, form a new network, and this network is then trained from scratch until the model converges.
The following describes embodiments of the present application with reference to FIGS. 2(a), 2(b) and 2(c).
FIG. 2(a) shows a schematic diagram of an exemplary network structure according to an embodiment of the present application. FIG. 2(b) shows a schematic diagram of an exemplary ASPP module according to an embodiment of the present application. FIG. 2(c) shows a schematic diagram of the operation of an exemplary down-sampling search unit according to an embodiment of the present application.
The network shown in FIG. 2(a) includes 4 down-sampling search units with different output feature channel counts, denoted down,c64,/2; down,c128,/2; down,c256,/2; and down,c512,/2. Here "c64" indicates that the number of output feature channels is 64, and "/2" indicates that the feature map is down-sampled by a factor of 2.
The network also contains 4 up-sampling search units, denoted up,c256,2x; up,c128,2x; up,c64,2x; and up,c3,2x. Here "2x" indicates that the feature map is up-sampled by a factor of 2.
The network shown in FIG. 2(a) further includes 3 ASPP modules, and the number of channels of each convolution operator in an ASPP module matches the number of feature channels input to that module. As the resolution of the input feature map gradually decreases, the dilation rates of the parallel dilated convolutions in the 3 ASPP modules gradually decrease; the dilation rates of the 3 ASPP modules are (30, 48, 66), (10, 22, 34) and (6, 12, 18), respectively.
When an image segmentation operation is performed, the acquired features serve as the input (inputs) of the network. For the acquired features belonging to the same feature layer, a first fusion operation is performed on the features through the 3 ASPP modules, each using three dilated convolutions with different dilation rates, so as to fuse features of different sizes within the same feature layer. For the acquired features of different sizes from different feature layers, the features of each feature layer are concatenated, via skip connections as in the U-Net structure, with the features of the symmetric layer in the up-sampling path, so as to fuse features of different sizes across different feature layers; the fused features are then obtained as the output (outputs) of the network. Specifically, as shown in the figure, the features passing through the ASPP modules with 64, 128 and 256 feature channels are concatenated with the features of the symmetric up-sampling layers with 64, 128 and 256 channels, respectively, and a 1x1 convolution is applied to the concatenated features to obtain the output of the network.
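Because FIG. 2(a) itself is not reproduced here, the sketch below only traces the data flow described in the preceding paragraph. The searched down-/up-sampling units and the ASPP modules are replaced by plain convolutions as stand-ins, the 3-channel input and the placement of the final 1x1 convolution are assumptions, and the sketch is not the network of the application.

```python
import torch
import torch.nn as nn

# Placeholders only: the real network uses searched units and ASPP modules here.
def down(cin, cout):  # stands in for a "down,cN,/2" search unit
    return nn.Conv2d(cin, cout, 3, stride=2, padding=1)

def up(cin, cout):    # stands in for an "up,cN,2x" search unit
    return nn.ConvTranspose2d(cin, cout, 2, stride=2)

def aspp(ch):         # stands in for an ASPP module (see FIG. 2(b))
    return nn.Conv2d(ch, ch, 3, padding=1)

class Fig2aSketch(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.d1, self.d2 = down(3, 64), down(64, 128)
        self.d3, self.d4 = down(128, 256), down(256, 512)
        self.a64, self.a128, self.a256 = aspp(64), aspp(128), aspp(256)
        self.u1, self.u2 = up(512, 256), up(256 + 256, 128)
        self.u3, self.u4 = up(128 + 128, 64), up(64 + 64, num_classes)
        self.head = nn.Conv2d(num_classes, num_classes, 1)  # final 1x1 convolution

    def forward(self, x):
        f64 = self.d1(x)
        f128, f256, f512 = self.d2(f64), None, None
        f256 = self.d3(f128)
        f512 = self.d4(f256)
        s64, s128, s256 = self.a64(f64), self.a128(f128), self.a256(f256)  # same-layer fusion
        y = self.u1(f512)
        y = self.u2(torch.cat([y, s256], dim=1))   # skip connection (256 channels)
        y = self.u3(torch.cat([y, s128], dim=1))   # skip connection (128 channels)
        y = self.u4(torch.cat([y, s64], dim=1))    # skip connection (64 channels)
        return self.head(y)

print(Fig2aSketch()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
```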
FIG. 2(b) shows a schematic diagram of the ASPP module with the smallest input feature resolution.
As shown in FIG. 2(b), in the ASPP module whose output feature channel count is 64, three dilated convolutions with dilation rates r=30, r=48 and r=66 are applied to the input features in parallel, and two normal convolutions (i.e., with dilation rate r=1) are applied as well. Here BN denotes batch normalization, ReLU is the activation function, and globalmeanpool denotes a global average pooling layer. The outputs of the dilated and normal convolution branches are then concatenated to obtain the output of the ASPP module.
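One possible reading of this module is sketched below. The branch layout (three dilated 3x3 branches at r=30/48/66, one 1x1 branch at r=1, and one global-average-pooling branch), the placement of BN and ReLU after every convolution, and the bilinear resizing of the pooled branch are assumptions inferred from the description rather than taken from the original figure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k, dilation=1):
    pad = dilation if k == 3 else 0
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),   # BN
        nn.ReLU(inplace=True),    # ReLU
    )

class ASPP(nn.Module):
    """Assumed branch layout for FIG. 2(b): three dilated 3x3 branches,
    one 1x1 branch (r=1) and one global-average-pooling branch, concatenated."""
    def __init__(self, ch=64, rates=(30, 48, 66)):
        super().__init__()
        self.dilated = nn.ModuleList(conv_bn_relu(ch, ch, 3, r) for r in rates)
        self.conv1x1 = conv_bn_relu(ch, ch, 1)
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # globalmeanpool
            conv_bn_relu(ch, ch, 1),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [b(x) for b in self.dilated] + [self.conv1x1(x)]
        g = F.interpolate(self.global_branch(x), size=(h, w), mode="bilinear",
                          align_corners=False)
        return torch.cat(outs + [g], dim=1)   # concatenation of all branches

m = ASPP().eval()   # eval(): the pooled 1x1 branch cannot provide batch statistics for BN
print(m(torch.randn(1, 64, 64, 64)).shape)  # torch.Size([1, 320, 64, 64])
```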
When constructing the candidate units, the method according to this example initializes a structure parameter for each candidate operator of a search unit in the manner of the FBNet network, and uses the weighted sum of the outputs of the candidate operators, weighted by a resampling of the structure parameters (for example, a Gumbel-softmax), as the output of the candidate unit.
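A minimal sketch of such a candidate unit, assuming FBNet-style Gumbel-softmax resampling of the structure parameters and assuming two plain convolutions as the candidate operators (the real candidate operators are the ones listed above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Candidate unit: the output is the weighted sum of all candidate-operator
    outputs, the weights being a Gumbel-softmax resampling of the per-operator
    structure parameters (initialized here to zero)."""
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))  # structure parameters

    def forward(self, x, tau=1.0):
        w = F.gumbel_softmax(self.alpha, tau=tau, hard=False)  # differentiable resampling
        return sum(w_i * op(x) for w_i, op in zip(w, self.candidates))

ops = [nn.Conv2d(16, 16, 3, padding=1), nn.Conv2d(16, 16, 5, padding=2)]
print(MixedOp(ops)(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```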
The network shown in FIG. 2(a) constructs the search units layer by layer. In a down-sampling search unit, the candidate operators are formed by conventional convolution, channel-by-channel convolution combined with point-by-point convolution, and pooling operations; in an up-sampling search unit, the candidate operators are formed by transposed convolution and by up-sampling followed by a 1x1 convolution.
FIG. 2(c) shows a schematic diagram of the operation of an exemplary down-sampling search unit according to an embodiment of the present application. As shown in FIG. 2(c), the input (inputs) of each operation in the down-sampling search unit is the product of the original input and the structure parameter (processed by the Gumbel-softmax) corresponding to that operator. The operations in the down-sampling search unit include a 3x3 convolution (denoted conv,3x3), a 3x3 depth-wise convolution (denoted depthwise-conv,3x3), a 1x1 point-wise convolution (denoted pointwise-conv,1x1), a 3x3 max pooling (denoted maxpooling,3x3) and a 3x3 mean pooling (denoted meanpooling,3x3).
The training process of the network shown in FIG. 2(a) is divided into a model search stage and a fine-tuning stage. The search stage is trained according to Algorithm 1 shown below; during this training, the weight parameters of the operators and the network structure parameters are trained alternately using a training set and a validation set until the loss function converges:
(Algorithm 1 is provided as an image in the original publication and is not reproduced here.)
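Since Algorithm 1 is only available as an image, the sketch below reflects nothing more than the alternating scheme stated in the text; the optimizers, the loss, and the assumption that structure parameters are named "alpha" are illustrative choices, not details of the original algorithm.

```python
import torch
import torch.nn as nn

def search_stage(model, train_loader, val_loader, epochs=50):
    """Alternately train operator weights (training set) and network structure
    parameters (validation set), as described for the model search stage."""
    arch_params   = [p for n, p in model.named_parameters() if "alpha" in n]
    weight_params = [p for n, p in model.named_parameters() if "alpha" not in n]
    w_opt = torch.optim.SGD(weight_params, lr=0.01, momentum=0.9)
    a_opt = torch.optim.Adam(arch_params, lr=3e-4)
    criterion = nn.CrossEntropyLoss()          # per-pixel classification loss

    for _ in range(epochs):                    # in practice: until the loss converges
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            w_opt.zero_grad()                  # 1) weight update on the training set
            criterion(model(x_tr), y_tr).backward()
            w_opt.step()
            a_opt.zero_grad()                  # 2) structure-parameter update on the validation set
            criterion(model(x_val), y_val).backward()
            a_opt.step()
```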
After the model search is finished, 4 corresponding structure parameters are obtained for each down-sampling search unit, and 2 corresponding structure parameters are obtained for each up-sampling search unit.
In the model fine-tuning stage, the operator with the largest structure parameter in each search unit is first selected as the final operator; the selected operators, together with a convolution kernel of size 1 whose number of output channels equals the number of classes, form a new network, and the new network is then trained from scratch until the model converges. The training process of the fine-tuning stage uses Algorithm 2 below:
(Algorithm 2 is provided as an image in the original publication and is not reproduced here.)
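Likewise, as Algorithm 2 is only available as an image, the following sketch shows only the operator-selection step described above, under the assumption that each search unit exposes its structure parameters as "alpha" and its candidate operators as "candidates"; appending the 1x1 convolution with as many output channels as classes and retraining from scratch are indicated in the comments.

```python
import torch
import torch.nn as nn

def collapse_search_units(model):
    """Replace every search unit by the candidate operator whose structure
    parameter is largest (the operator finally selected for the new network)."""
    for name, child in list(model.named_children()):
        if hasattr(child, "alpha") and hasattr(child, "candidates"):
            best = int(torch.argmax(child.alpha.detach()))
            setattr(model, name, child.candidates[best])
        else:
            collapse_search_units(child)   # recurse into nested modules
    return model

# Usage sketch: collapse the searched model, append the 1x1 classification
# convolution (out_channels = number of classes), then train from scratch.
# final_net = nn.Sequential(collapse_search_units(search_model),
#                           nn.Conv2d(last_feature_channels, num_classes, 1))
```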
it should be noted that the foregoing examples are only for better illustrating the technical solutions of the present invention, and not for limiting the present invention, and those skilled in the art should understand that any implementation manner of the network based on the embodiments of the present application should be included in the scope of the present invention.
According to the method, when the image is segmented, the fusion is carried out based on the features corresponding to the same feature layer and the features corresponding to different feature layers and different sizes respectively, so that the fusion of the features of different feature layers and the features of different sizes is realized, the problem that the features of different sizes on different feature layers cannot be fused in the prior art is solved, and the performance of an algorithm model is improved; moreover, the method according to the embodiment of the application expands the expression capability of the model by using a neural network searching method.
Fig. 3 shows a schematic structural diagram of an apparatus for image segmentation according to an embodiment of the present application.
The apparatus comprises: a device which, when an image segmentation operation is performed, performs a first fusion operation on acquired features belonging to the same feature layer using dilated convolutions with different dilation rates, so as to fuse features of different sizes within the same feature layer (hereinafter referred to as the "first fusion device 1"); and a device which performs a second fusion operation on features of different sizes from different feature layers via skip connections, so as to fuse features of different sizes across different feature layers (hereinafter referred to as the "second fusion device 2").
Referring to FIG. 3, when an image segmentation operation is performed, for acquired features belonging to the same feature layer, the first fusion device 1 performs a first fusion operation on the features using dilated convolutions with different dilation rates, so as to fuse features of different sizes within the same feature layer.
Preferably, an Atrous Spatial Pyramid Pooling (ASPP) module is used to perform the first fusion operation on the features with dilated convolutions of different dilation rates, so as to fuse features of different sizes within the same feature layer.
The ASPP module applies dilated convolutions with different dilation rates to the input features in parallel and fuses the results.
For features of different sizes from different feature layers, the second fusion device 2 performs a second fusion operation on the features via skip connections, so as to fuse features of different sizes across different feature layers.
Specifically, for the acquired features of different sizes from different feature layers, the second fusion device 2 concatenates the features of each feature layer, via skip connections as in the U-Net structure, with the features of the symmetric layer in the up-sampling path, so as to complete the fusion of features of different sizes from different feature layers.
As is familiar to those skilled in the art, the conventional U-Net structure fuses shallow and deep features of the same size by fusing features of the same level within an encoder-decoder structure.
According to one embodiment, the apparatus comprises a network search device.
The network search device searches the unit structures of the down-sampling layers and the up-sampling layers using a neural architecture search method.
Specifically, the network search device constructs the search units of the down-sampling layers and the up-sampling layers layer by layer.
In a down-sampling search unit, the candidate operators are formed by conventional convolution, channel-by-channel convolution combined with point-by-point convolution, and pooling operations; in an up-sampling search unit, the candidate operators are formed by transposed convolution and by up-sampling followed by a 1x1 convolution.
According to one embodiment, the apparatus comprises a network training device.
The network training device performs network training to obtain the trained weight parameters of the operators and the network structure parameters of the down-sampling and up-sampling search units.
According to one embodiment, the network training process includes a model search stage and a model fine-tuning stage. In the model search stage, the network training device trains the weight parameters of the operators and the network structure parameters of the down-sampling and up-sampling search units alternately using a training set and a validation set until the loss function converges.
In the model fine-tuning stage, the network training device selects the operator with the largest structure parameter in each search unit; the selected operators, together with a convolution kernel of size 1 whose number of output channels equals the number of classes, form a new network, and the network is then trained from scratch until the model converges.
According to the apparatus of the embodiments of the application, when an image is segmented, fusion is performed both for features belonging to the same feature layer and for features of different sizes from different feature layers, so that features from different feature layers and of different sizes are fused; this solves the problem in the prior art that features of different sizes on different feature layers cannot be fused, and improves the performance of the algorithm model. Moreover, the apparatus according to the embodiments of the application expands the expressive capability of the model by using a neural architecture search method.
The software program of the present invention can be executed by a processor to implement the steps or functions described above. Also, the software programs (including associated data structures) of the present invention can be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functionality of the present invention may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various functions or steps.
In addition, a portion of the present invention may be implemented as a computer program product, for example computer program instructions which, when executed by a computer, may invoke or provide the method and/or technical solution according to the present invention through the operation of the computer. Program instructions which invoke the methods of the present invention may be stored on fixed or removable recording media, and/or transmitted via a data stream on a broadcast or other signal-bearing medium, and/or stored within the working memory of a computer device operating in accordance with the program instructions. An embodiment according to the invention comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform the method and/or technical solution according to the embodiments of the invention described above.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not to denote any particular order.

Claims (10)

1. A method for image segmentation, wherein the method comprises:
when an image segmentation operation is performed, for acquired features belonging to the same feature layer, performing a first fusion operation on the features using dilated convolutions with different dilation rates to fuse features of different sizes within the same feature layer;
and, for features of different sizes from different feature layers, performing a second fusion operation on the features via skip connections to fuse features of different sizes across different feature layers.
2. The method of claim 1, wherein performing the second fusion operation on the features via skip connections comprises:
for the acquired features of different sizes from different feature layers, concatenating the features of each feature layer, via skip connections as in the U-Net structure, with the features of the symmetric layer in the up-sampling path, so as to complete the fusion of features of different sizes from different feature layers.
3. The method of claim 1, wherein the method comprises:
and searching the unit structures of the down-sampling layer and the up-sampling layer by adopting a neural network searching method.
4. The method of claim 3, wherein the method comprises:
constructing the search units of the down-sampling layers and the up-sampling layers layer by layer;
wherein, in a down-sampling search unit, the candidate operators are formed by conventional convolution, channel-by-channel convolution combined with point-by-point convolution, and pooling operations, and, in an up-sampling search unit, the candidate operators are formed by transposed convolution and by up-sampling followed by a 1x1 convolution.
5. The method according to claim 3 or 4, wherein the method comprises:
and obtaining the weight parameters and the network structure parameters of the operators of the trained lower sampling layer and upper sampling layer searching units by network training.
6. The method of claim 5, wherein the network training process comprises a model search stage and a model fine-tuning stage, the method comprising:
in the model search stage, training the weight parameters of the operators and the network structure parameters of the down-sampling and up-sampling search units alternately using a training set and a validation set until the loss function converges;
in the model fine-tuning stage, selecting the operator with the largest structure parameter in each search unit, forming a new network from the selected operators together with a convolution kernel of size 1 whose number of output channels equals the number of classes, and training the new network from scratch until the model converges.
7. An apparatus for image segmentation, wherein the apparatus comprises:
when an image segmentation operation is performed, for acquired features belonging to the same feature layer, performing a first fusion operation on the features using dilated convolutions with different dilation rates so as to fuse features of different sizes within the same feature layer;
and, for features of different sizes from different feature layers, performing a second fusion operation on the features via skip connections so as to fuse features of different sizes across different feature layers.
8. The apparatus of claim 7, wherein the apparatus comprises:
and the device is used for searching the unit structures of the down-sampling layer and the up-sampling layer by adopting a neural network searching method.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN202210734947.XA 2022-06-27 2022-06-27 Method and device for image segmentation Pending CN115035301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210734947.XA CN115035301A (en) 2022-06-27 2022-06-27 Method and device for image segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210734947.XA CN115035301A (en) 2022-06-27 2022-06-27 Method and device for image segmentation

Publications (1)

Publication Number Publication Date
CN115035301A true CN115035301A (en) 2022-09-09

Family

ID=83127096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210734947.XA Pending CN115035301A (en) 2022-06-27 2022-06-27 Method and device for image segmentation

Country Status (1)

Country Link
CN (1) CN115035301A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993737A (en) * 2023-09-27 2023-11-03 西南科技大学 Lightweight fracture segmentation method based on convolutional neural network
CN116993737B (en) * 2023-09-27 2024-03-29 西南科技大学 Lightweight fracture segmentation method based on convolutional neural network

Similar Documents

Publication Publication Date Title
CN108062754B (en) Segmentation and identification method and device based on dense network image
CN112634296B (en) RGB-D image semantic segmentation method and terminal for gate mechanism guided edge information distillation
CN112308200B (en) Searching method and device for neural network
CN111047551A (en) Remote sensing image change detection method and system based on U-net improved algorithm
WO2022017025A1 (en) Image processing method and apparatus, storage medium, and electronic device
Couturier et al. Image denoising using a deep encoder-decoder network with skip connections
CN112862681A (en) Super-resolution method, device, terminal equipment and storage medium
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN114238904B (en) Identity recognition method, and training method and device of dual-channel hyper-resolution model
CN116310667B (en) Self-supervision visual characterization learning method combining contrast loss and reconstruction loss
CN113724128A (en) Method for expanding training sample
CN115035301A (en) Method and device for image segmentation
Wu et al. CASR: a context-aware residual network for single-image super-resolution
US11030726B1 (en) Image cropping with lossless resolution for generating enhanced image databases
CN117237623B (en) Semantic segmentation method and system for remote sensing image of unmanned aerial vehicle
CN116188501B (en) Medical image segmentation method based on multi-scale cross attention
CN110458849B (en) Image segmentation method based on feature correction
CN112364933A (en) Image classification method and device, electronic equipment and storage medium
Wang et al. Multi-scale dense and attention mechanism for image semantic segmentation based on improved DeepLabv3+
Ai et al. ELUNet: an efficient and lightweight U-shape network for real-time semantic segmentation
CN116805393A (en) Hyperspectral image classification method and system based on 3DUnet spectrum-space information fusion
CN115578261A (en) Image processing method, deep learning model training method and device
US20230196093A1 (en) Neural network processing
CN117011156A (en) Image processing method, device, equipment and storage medium
CN115424060A (en) Model training method, image classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination