WO2021003936A1 - Image segmentation method, electronic device, and computer-readable storage medium - Google Patents

Image segmentation method, electronic device, and computer-readable storage medium

Info

Publication number
WO2021003936A1
WO2021003936A1 (PCT/CN2019/118294)
Authority
WO
WIPO (PCT)
Prior art keywords
sampling
image
pooling feature
feature set
segmented
Prior art date
Application number
PCT/CN2019/118294
Other languages
French (fr)
Chinese (zh)
Inventor
陈玥蓉
韩茂琨
王健宗
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2021003936A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to an image segmentation method and device, an electronic device, and a computer-readable storage medium.
  • For Convolutional Neural Networks (CNN) used for classification, fully connected layers are often added at the end of the network, and the output of the fully connected layer is processed by a softmax function to obtain category probability information. However, this category probability information is one-dimensional: it can only identify the category of the entire picture, not the category of each pixel, and the results are particularly unsatisfactory when processing image edges.
  • The embodiments of the present application provide an image segmentation method and device, an electronic device, and a computer-readable storage medium, which aim to solve the problem of insufficient accuracy of image semantic segmentation in the related art. By replacing the fully connected layer with a deconvolution layer and adding another fully connected layer, each pixel of the image can be classified, further improving the accuracy of image semantic segmentation.
  • In a first aspect, an embodiment of the present application provides an image segmentation method, including: acquiring an image to be segmented; performing convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets; performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to a predetermined down-sampling multiple of the image to be segmented; during the up-sampling processing, calculating a total mask score according to the intersection over union of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented; and segmenting the final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
  • In a second aspect, an embodiment of the present application provides an image segmentation device, including: an image acquisition unit for acquiring an image to be segmented; a down-sampling processing unit for performing convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets; an up-sampling processing unit for performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to a predetermined down-sampling multiple of the image to be segmented; a mask total score calculation unit for calculating, during the up-sampling processing, a total mask score according to the intersection over union of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented; and an image segmentation unit for segmenting the final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
  • In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are configured to execute the method of any one of the above first aspects.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to execute the method flow described in any one of the first aspects.
  • Through the above technical solutions, and in view of the insufficient accuracy of image semantic segmentation in the related art, each pixel of the image can be classified by replacing the fully connected layer with a deconvolution layer.
  • Specifically, in this technical solution the convolutional neural network includes a convolutional layer, an activation layer, and a pooling layer, as well as a deconvolution layer that replaces the original fully connected layer. After the image to be segmented is obtained, the convolutional layer classifies the features in the image to be segmented, that is, its pixels, according to different feature types or subjects; the activation layer then highlights the important features in the classification result; and the pooling layer shrinks the parameter matrix of the data from the activation layer, reducing the data volume and the number of parameters to be processed in the next step, which both speeds up computation and prevents overfitting.
  • In the convolutional neural network of the related art, the size of the output image gradually decreases after each convolution step, and when the fully connected layer is finally reached, the category probability information obtained is one-dimensional: it can only identify the category of the entire image, not the category of each pixel, and the results are particularly unsatisfactory at image edges. Therefore, in the technical solution of the present application, the fully connected layer is replaced by a deconvolution layer. Deconvolution is equivalent to running an ordinary convolution in reverse: for example, with a 2x2 input matrix and a 3x3 convolution kernel, setting the deconvolution parameters pad=0 and stride=1 produces a 4x4 output matrix, which amounts to completely inverting the convolution. Here, convolution corresponds to down-sampling processing, and deconvolution corresponds to up-sampling processing.
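  • To make the 2x2-to-4x4 example above concrete, the following is a minimal sketch (not part of the application) using a transposed convolution in PyTorch; the single input/output channel and the random weights are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# a 2x2 input matrix, as in the example above (batch and channel dims added)
x = torch.randn(1, 1, 2, 2)

# "deconvolution" with a 3x3 kernel, pad=0, stride=1
deconv = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                            kernel_size=3, stride=1, padding=0, bias=False)

y = deconv(x)
print(y.shape)  # torch.Size([1, 1, 4, 4]) -- the 4x4 output matrix
```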
  • Through the technical solution of the present application, the output image of the convolutional neural network is therefore restored in pixel dimensions, which facilitates effective classification of the features of the output image and improves the accuracy of image semantic segmentation.
  • Fig. 1 shows a flowchart of an image segmentation method according to an embodiment of the present application;
  • Fig. 2 shows a schematic diagram of image segmentation according to an embodiment of the present application;
  • Fig. 3 shows a flowchart of an image segmentation method according to another embodiment of the present application;
  • Fig. 4 shows a block diagram of an image segmentation device according to an embodiment of the present application;
  • Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present application.
  • Fig. 1 shows a flowchart of an image segmentation method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps.
  • Step 102: Acquire an image to be segmented.
  • Step 104: Perform convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets.
  • The convolutional neural network includes a convolutional layer, an activation layer, and a pooling layer, as well as a deconvolution layer that replaces the original fully connected layer. After the image to be segmented is obtained, the convolutional layer classifies the features in the image, that is, its pixels, according to different feature types or subjects; the activation layer then highlights the important features in the classification result; and the pooling layer shrinks the parameter matrix of the data from the activation layer, reducing the data volume and the number of parameters to be processed in the next step, which both speeds up computation and prevents overfitting.
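  • As a hedged illustration of this down-sampling stage, the sketch below stacks five convolution/activation/pooling blocks so that each block halves the width and height, producing the five pooling feature sets; the channel widths and 3x3 kernels are assumptions for the example and are not prescribed by the application.

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    # one convolution -> activation -> pooling stage; pooling halves H and W
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class DownsamplingBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            block(3, 64),     # pool1: w/2,  h/2
            block(64, 128),   # pool2: w/4,  h/4
            block(128, 256),  # pool3: w/8,  h/8
            block(256, 512),  # pool4: w/16, h/16
            block(512, 512),  # pool5: w/32, h/32
        ])

    def forward(self, image):
        feats = []
        x = image
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [pool1, pool2, pool3, pool4, pool5]

pools = DownsamplingBackbone()(torch.randn(1, 3, 224, 224))
print([tuple(p.shape[-2:]) for p in pools])
# [(112, 112), (56, 56), (28, 28), (14, 14), (7, 7)]
```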
  • Step 106: Perform up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented.
  • Through this up-sampling, the output image of the convolutional neural network is restored in pixel dimensions, which facilitates effective classification of the features of the output image and improves the accuracy of image semantic segmentation.
  • The up-sampling processing includes interpolation processing and deconvolution processing. Interpolation refers to inserting new elements between pixels, on the basis of the original image pixels, using a suitable interpolation algorithm; deconvolution refers to compressing the basic wavelet to improve the vertical resolution of the data. Both methods can effectively improve the accuracy of the image.
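  • The sketch below contrasts the two up-sampling options mentioned here, once with values interpolated between existing pixels and once with a learned transposed convolution; the channel count and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pool5 = torch.randn(1, 512, 7, 7)  # a hypothetical w/32 x h/32 feature map

# interpolation: insert new values between the existing pixels (bilinear here)
up_interp = F.interpolate(pool5, scale_factor=2, mode="bilinear", align_corners=False)

# deconvolution (transposed convolution): a learned 2x up-sampling
up_deconv = nn.ConvTranspose2d(512, 512, kernel_size=2, stride=2)(pool5)

print(up_interp.shape, up_deconv.shape)  # both: torch.Size([1, 512, 14, 14])
```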
  • Step 108: During the up-sampling processing, calculate the total mask score according to the intersection over union of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented.
  • Step 110: Segment the final result of the up-sampling processing based on the total mask score using a smooth L2 loss function to obtain a segmented image.
  • Here, the total mask score equals the product of the intersection over union (IoU) of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented. In this way, when the classification score is high but the IoU is low, the branch that produces the total mask score is penalized. The total mask score can therefore be trained toward an optimum during the up-sampling process, yielding an optimized up-sampling result.
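  • A minimal sketch of this scoring rule, assuming binary masks of equal size and a classification mask score already produced by the network:

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    # intersection over union of two boolean masks
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def total_mask_score(pred_mask, gt_mask, cls_mask_score):
    # total score = IoU(predicted mask, actual mask) * classification mask score,
    # so a high classification score paired with a low IoU is penalized
    return mask_iou(pred_mask, gt_mask) * cls_mask_score
```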
  • The smooth L2 loss function is also known as the least square error: in essence, it minimizes the sum of the squared differences between the target value and the estimated value, which keeps individual feature weights from becoming too large and makes the feature weights more even, helping to obtain an optimized segmented image.
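  • Read literally ("minimize the sum of squares of the difference between the target value and the estimated value"), the loss can be sketched as a plain least-squares term; treating "smooth L2" exactly this way is an assumption for illustration.

```python
import torch

def least_square_error(estimate, target, weight=1.0):
    # sum of squared differences between the estimated and target values,
    # optionally weighted (the description later reports a weight of 1 as best)
    return weight * torch.sum((estimate - target) ** 2)
```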
  • In addition, in one implementation of the present application, the smooth L2 loss function may optionally be combined with the softmax function for image segmentation; that is, the softmax function is applied on top of the segmentation result of the smooth L2 loss function for more precise segmentation.
  • The softmax function, also called the normalized exponential function, normalizes the gradient logarithm of a discrete probability distribution over a finite number of items. Softmax maps the outputs of multiple neurons into the (0, 1) interval, which can be interpreted as the probability that the current output belongs to each category, making it convenient to select the category with the highest probability as the prediction target. Compared with other functions that could be used to select a maximum, softmax uses the exponential, which makes large values larger and small values smaller, increasing the contrast between categories and making the neural network learn more efficiently.
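  • A minimal softmax sketch illustrating the mapping into (0, 1) and the contrast-increasing effect of the exponential (the maximum is subtracted only for numerical stability):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.argmax())  # probabilities summing to 1; class 0 is selected
```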
  • Through the above technical solutions, replacing the fully connected layer with a deconvolution layer and additionally adding another fully connected layer to classify each pixel of the image can improve the accuracy of image semantic segmentation.
  • Fig. 2 shows a schematic diagram of image segmentation according to an embodiment of the present application.
  • As shown in Fig. 2, w represents the width and h represents the height. The image to be segmented (image), whose width and height are w and h, is convolved and pooled to generate the first pooling feature set (pool1), with width and height reduced to w/2 and h/2. The first pooling feature set is convolved and pooled to generate the second pooling feature set (pool2), with width and height reduced to w/4 and h/4; the second pooling feature set is convolved and pooled to generate the third pooling feature set (pool3), with width and height reduced to w/8 and h/8; the third pooling feature set is convolved and pooled to generate the fourth pooling feature set (pool4), with width and height reduced to w/16 and h/16; and the fourth pooling feature set is convolved and pooled to generate the fifth pooling feature set (pool5), with width and height reduced to w/32 and h/32. At this point, the resolution of the picture has been greatly reduced along with its width and height, resulting in lower image quality.
  • Therefore, deconvolution, that is, up-sampling processing, can be used. Deconvolution is equivalent to running an ordinary convolution in reverse: for example, with a 2x2 input matrix and a 3x3 convolution kernel, setting pad=0 and stride=1 outputs a 4x4 matrix, completely inverting the convolution. Up-sampling can therefore increase the original resolution, and when applied to the pooling feature sets produced by convolution and pooling, it restores their resolution.
  • Specifically, when the predetermined down-sampling multiple of the image to be segmented is 32, the fifth pooling feature set among the five pooling feature sets is subjected to 32x up-sampling processing, and the result of the 32x up-sampling is then segmented with softmax, thereby realizing a 32x restoration of the fifth pooling feature set and improving the accuracy of the result obtained by the 32x up-sampling processing.
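  • A hedged sketch of this 32x branch in the style of a fully convolutional segmentation head (FCN-32s-like): a 1x1 convolution scores pool5 per class, one transposed convolution restores the input resolution in a single 32x step, and softmax is applied per pixel; the class count and kernel/padding choices are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsample32x(nn.Module):
    def __init__(self, in_ch=512, num_classes=21):
        super().__init__()
        self.score = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        # kernel=64, stride=32, padding=16 gives an exact 32x enlargement
        self.up32 = nn.ConvTranspose2d(num_classes, num_classes,
                                       kernel_size=64, stride=32, padding=16)

    def forward(self, pool5):
        logits = self.up32(self.score(pool5))   # back to the original w x h
        return F.softmax(logits, dim=1)         # per-pixel class probabilities

out = Upsample32x()(torch.randn(1, 512, 7, 7))
print(out.shape)  # torch.Size([1, 21, 224, 224])
```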
  • When the predetermined down-sampling multiple of the image to be segmented is 16, the fifth pooling feature set among the five pooling feature sets is subjected to 2x up-sampling processing to obtain a first up-sampling feature set; the first up-sampling feature set is fused with the fourth pooling feature set of the five pooling feature sets to obtain the final result of the up-sampling processing, and the final result is then segmented with softmax, thereby realizing the restoration of the fourth pooling feature set and improving the accuracy of the result obtained by the 16x up-sampling processing.
  • Simply restoring the fourth pooling feature set by 16x can improve the accuracy of the result to some extent. However, since the fifth pooling feature set has already been generated, that is, since the fourth pooling feature set has already been further filtered and its features highlighted in the 32x down-sampled fifth pooling feature set, the fifth pooling feature set can be used effectively: up-sampling it by 2x restores its width and height to w/16 and h/16, the same as the fourth pooling feature set, so the two can be fused, and 16x up-sampling is performed after the fusion. The fusion referred to here means merging, one by one, the features of the pixels of the fourth pooling feature set with the features of the pixels obtained by 2x up-sampling of the fifth pooling feature set.
  • Therefore, compared with simply restoring the fourth pooling feature set by 16x, this further improves the accuracy of the up-sampling result, helps to further sharpen the image edges, and improves the accuracy of classification at the image edges.
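  • A hedged sketch of the 16x branch: pool5 is scored and up-sampled 2x, fused here by element-wise addition with the scored pool4 (one common way to merge per-pixel features; the application only says the pixel features are merged one by one), and the fused map is up-sampled 16x. The channel and kernel choices remain assumptions.

```python
import torch
import torch.nn as nn

class Upsample16x(nn.Module):
    def __init__(self, ch4=512, ch5=512, num_classes=21):
        super().__init__()
        self.score4 = nn.Conv2d(ch4, num_classes, kernel_size=1)
        self.score5 = nn.Conv2d(ch5, num_classes, kernel_size=1)
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes,
                                      kernel_size=4, stride=2, padding=1)    # w/32 -> w/16
        self.up16 = nn.ConvTranspose2d(num_classes, num_classes,
                                       kernel_size=32, stride=16, padding=8)  # w/16 -> w

    def forward(self, pool4, pool5):
        first_upsampled = self.up2(self.score5(pool5))   # first up-sampling feature set
        fused = first_upsampled + self.score4(pool4)     # fusion with pool4 (same size)
        return self.up16(fused)                          # final result of the up-sampling

out = Upsample16x()(torch.randn(1, 512, 14, 14), torch.randn(1, 512, 7, 7))
print(out.shape)  # torch.Size([1, 21, 224, 224])
```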
  • When the predetermined down-sampling multiple of the image to be segmented is 8, the fifth pooling feature set among the five pooling feature sets is subjected to 2x up-sampling processing to obtain a first up-sampling feature set; the first up-sampling feature set is fused with the fourth pooling feature set of the five pooling feature sets to obtain a fusion result; the fusion result is subjected to 2x up-sampling processing to obtain a second up-sampling feature set; and the second up-sampling feature set is fused with the third pooling feature set of the five pooling feature sets to obtain the final result of the up-sampling processing, which is then segmented with softmax, thereby realizing the restoration of the third pooling feature set and improving the accuracy of the result obtained by the 8x up-sampling processing.
  • The fusion described here means merging, one by one, the features of the pixels of the fourth pooling feature set with the features of the pixels obtained by 2x up-sampling of the fifth pooling feature set, which amounts to a first feature correction of the pixels in the fourth pooling feature set and makes their features more separable for classification. The fusion result can then be up-sampled by 2x, restoring its width and height to w/8 and h/8, the same as the third pooling feature set, so that it can be fused with the third pooling feature set; 8x up-sampling is performed after this fusion, so that the features of the pixels in the third pooling feature set are corrected by the features filtered and highlighted in the fourth and fifth pooling feature sets, making the pixel features in the final fusion result more accurate and better suited for classification. Therefore, compared with simply restoring the third pooling feature set by 8x, this further improves the accuracy of the up-sampling result, helps to further sharpen the image edges, and improves the accuracy of classification at the image edges.
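  • A hedged sketch of the 8x branch, continuing the same assumptions: pool5 is up-sampled 2x and fused with pool4, the fusion result is up-sampled 2x and fused with pool3, and the second fusion is up-sampled 8x.

```python
import torch
import torch.nn as nn

class Upsample8x(nn.Module):
    def __init__(self, ch3=256, ch4=512, ch5=512, num_classes=21):
        super().__init__()
        self.score3 = nn.Conv2d(ch3, num_classes, kernel_size=1)
        self.score4 = nn.Conv2d(ch4, num_classes, kernel_size=1)
        self.score5 = nn.Conv2d(ch5, num_classes, kernel_size=1)
        self.up2_a = nn.ConvTranspose2d(num_classes, num_classes,
                                        kernel_size=4, stride=2, padding=1)   # w/32 -> w/16
        self.up2_b = nn.ConvTranspose2d(num_classes, num_classes,
                                        kernel_size=4, stride=2, padding=1)   # w/16 -> w/8
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes,
                                      kernel_size=16, stride=8, padding=4)    # w/8 -> w

    def forward(self, pool3, pool4, pool5):
        first = self.up2_a(self.score5(pool5))   # first up-sampling feature set
        fusion = first + self.score4(pool4)      # fuse with pool4
        second = self.up2_b(fusion)              # second up-sampling feature set
        final = second + self.score3(pool3)      # fuse with pool3
        return self.up8(final)                   # final result of the up-sampling

out = Upsample8x()(torch.randn(1, 256, 28, 28),
                   torch.randn(1, 512, 14, 14),
                   torch.randn(1, 512, 7, 7))
print(out.shape)  # torch.Size([1, 21, 224, 224])
```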
  • Fig. 3 shows a flowchart of an image segmentation method according to another embodiment of the present application. As shown in Fig. 3, the method includes the following steps.
  • Step 302: Acquire an image to be segmented.
  • Step 304: Perform convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets.
  • Step 306: Perform up-sampling processing on the specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented.
  • Step 308: Determine whether the number of fusions performed during the up-sampling processing equals the number of fusions specified for the predetermined down-sampling multiple. If so, proceed to step 310; if not, return to step 306 and continue the up-sampling processing, including the fusion process.
  • Each predetermined down-sampling multiple corresponds to a number of fusions that must be completed. Checking the number of fusions performed during up-sampling therefore determines whether the up-sampling step can end and the image segmentation step can begin, and prevents the up-sampling result from being output while the level of feature restoration is still insufficient because the required number of fusions has not been reached. This monitoring of the up-sampling processing further guarantees the accuracy of the final result.
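  • A minimal control-flow sketch of this check; the mapping from down-sampling multiple to required fusions (32x: 0, 16x: 1, 8x: 2) is inferred from the cases described above, and run_upsampling_step is a hypothetical callback that performs one up-sampling step and reports whether it included a fusion.

```python
# hypothetical mapping inferred from the 32x / 16x / 8x cases described above
REQUIRED_FUSIONS = {32: 0, 16: 1, 8: 2}

def upsample_with_fusion_check(downsample_multiple, run_upsampling_step):
    required = REQUIRED_FUSIONS[downsample_multiple]
    fusions_done = run_upsampling_step()        # step 306: initial up-sampling step
    while fusions_done < required:              # step 308: fusion count reached?
        fusions_done += run_upsampling_step()   # no: return to step 306
    return fusions_done                         # yes: proceed to segmentation (step 310)
```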
  • Step 310: Segment the final result of the up-sampling processing through the smooth L2 loss function and the softmax function to obtain a segmented image.
  • Here too, the total mask score equals the product of the IoU of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented, so that a high classification score combined with a low IoU penalizes the branch producing the total mask score; the total mask score can thus be trained toward an optimum during up-sampling, yielding an optimized up-sampling result. The smooth L2 loss function, also known as the least square error, minimizes the sum of the squared differences between the target value and the estimated value, which keeps feature weights from becoming too large and makes them more even; combined with replacing the fully connected layer with a deconvolution layer and adding another fully connected layer, this helps to obtain an optimized segmented image.
  • The softmax function is then used for precise segmentation. The softmax function, or normalized exponential function, normalizes the gradient logarithm of a discrete probability distribution over a finite number of items; it maps the outputs of multiple neurons into the (0, 1) interval, which can be interpreted as the probability that the current output belongs to each category, so that the category with the highest probability can be selected as the prediction target. Because softmax uses the exponential, large values become larger and small values smaller, increasing the contrast between categories and making the neural network learn more efficiently.
  • the method of replacing the fully connected layer with the deconvolution layer and adding another fully connected layer to classify each pixel of the image can improve the accuracy of image semantic segmentation.
  • Fig. 4 shows a block diagram of an image segmentation device according to an embodiment of the present application.
  • As shown in Fig. 4, the image segmentation device 400 of an embodiment of the present application includes: an image acquisition unit 402 for acquiring an image to be segmented; a down-sampling processing unit 404 for performing convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets; an up-sampling processing unit 406 for performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented; a mask total score calculation unit 408 for calculating, during the up-sampling processing, the total mask score according to the intersection over union of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented; and an image segmentation unit 410 for segmenting the final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
  • In an implementation of the present application, the up-sampling processing unit 406 includes a first processing unit configured to perform 32x up-sampling processing on the fifth pooling feature set among the five pooling feature sets when the predetermined down-sampling multiple of the image to be segmented is 32.
  • In an implementation of the present application, the up-sampling processing unit 406 includes: a second processing unit configured to perform 2x up-sampling processing on the fifth pooling feature set among the five pooling feature sets when the predetermined down-sampling multiple of the image to be segmented is 16, to obtain a first up-sampling feature set; and a first fusion unit configured to fuse the first up-sampling feature set with the fourth pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  • In an implementation of the present application, the up-sampling processing unit 406 includes: a second processing unit configured to perform 2x up-sampling processing on the fifth pooling feature set among the five pooling feature sets when the predetermined down-sampling multiple of the image to be segmented is 8, to obtain a first up-sampling feature set; a first fusion unit configured to fuse the first up-sampling feature set with the fourth pooling feature set among the five pooling feature sets to obtain a fusion result; a third processing unit configured to perform 2x up-sampling processing on the fusion result to obtain a second up-sampling feature set; and a second fusion unit configured to fuse the second up-sampling feature set with the third pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  • the up-sampling processing includes interpolation processing and deconvolution processing.
  • the image segmentation device 400 uses the solution described in any one of the embodiments shown in FIG. 1 to FIG. 3, and therefore, has all the above technical effects, which will not be repeated here.
  • Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present application.
  • As shown in Fig. 5, an electronic device 500 of an embodiment of the present application includes: at least one memory 502; and a processor 504 communicatively connected to the at least one memory 502, wherein the memory stores instructions executable by the at least one processor 504, and the instructions are configured to execute the solution described in any one of the embodiments of Fig. 1 to Fig. 3. The electronic device 500 therefore has the same technical effects as any one of the embodiments of Figs. 1 to 3, and details are not repeated here.
  • the electronic devices in the embodiments of this application exist in various forms, including but not limited to:
  • Mobile communication equipment: this type of equipment is characterized by mobile communication functions, and its main goal is to provide voice and data communication.
  • Such terminals include smart phones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
  • Ultra-mobile personal computer equipment: this type of equipment belongs to the category of personal computers, has computing and processing functions, and generally also has mobile Internet access.
  • Such terminals include PDA, MID, and UMPC devices, such as the iPad.
  • Portable entertainment equipment: this type of equipment can display and play multimedia content.
  • Such devices include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
  • Server: a device that provides computing services.
  • A server includes a processor, hard disk, memory, system bus, and so on.
  • The architecture of a server is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capacity, stability, reliability, security, scalability, and manageability.
  • an embodiment of the present application provides a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to execute the method flow described in any one of the above-mentioned embodiments in FIGS. 1 to 3.
  • Through this scheme, the output image of the convolutional neural network can be restored in pixel dimensions, thereby facilitating effective classification of the features of the output image and improving the accuracy of image semantic segmentation.
  • Although the terms first, second, and so on may be used to describe pooling feature sets in the embodiments of the present application, the pooling feature sets should not be limited by these terms; the terms are only used to distinguish the pooling feature sets from one another. For example, the first pooling feature set may also be referred to as the second pooling feature set, and similarly, the second pooling feature set may also be referred to as the first pooling feature set.
  • Feature collection can also be used to describe pooling feature sets in the embodiments of the present application.
  • Depending on the context, the word "if" as used herein can be interpreted as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" can be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)".
  • the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the above-mentioned integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium.
  • The above-mentioned software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute some of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.

Abstract

Disclosed are an image segmentation method and apparatus, an electronic device, and a computer-readable storage medium, which relate to the technical field of artificial intelligence. The method comprises: acquiring an image to be segmented (102); performing convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets (104); performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to an up-sampling mode corresponding to a pre-determined down-sampling multiple of the image to be segmented (106); during the up-sampling processing, calculating a total mask score according to the intersection over union of a predicted mask and an actual mask and a mask score of the original network classification of the image to be segmented (108); and segmenting, by means of a smooth L2 loss function and on the basis of the total mask score, a final result of the up-sampling processing to obtain a segmented image (110). By means of the method, an output image of a convolutional neural network is restored in terms of pixel dimensions, thereby improving the accuracy of semantic image segmentation.

Description

Image segmentation method, electronic device, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 5, 2019, with application number 201910602691.5 and invention title "Image Segmentation Method and Apparatus, Electronic Device, and Computer-readable Storage Medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to an image segmentation method and device, an electronic device, and a computer-readable storage medium.
Background
For Convolutional Neural Networks (CNN) used for classification, fully connected layers are often added at the end of the network, and the output of the fully connected layer is processed by a softmax function to obtain category probability information.
However, this category probability information is one-dimensional: it can only identify the category of the entire picture, not the category of each pixel, and the results are particularly unsatisfactory when processing image edges.
Therefore, how to further improve the accuracy of image semantic segmentation has become a technical problem to be solved urgently.
Summary of the Application
The embodiments of the present application provide an image segmentation method and device, an electronic device, and a computer-readable storage medium, which aim to solve the problem of insufficient accuracy of image semantic segmentation in the related art; by replacing the fully connected layer with a deconvolution layer and adding another fully connected layer, each pixel of the image can be classified, further improving the accuracy of image semantic segmentation.
In a first aspect, an embodiment of the present application provides an image segmentation method, including: acquiring an image to be segmented; performing convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets; performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to a predetermined down-sampling multiple of the image to be segmented; during the up-sampling processing, calculating a total mask score according to the intersection over union of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented; and segmenting the final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
In a second aspect, an embodiment of the present application provides an image segmentation device, including: an image acquisition unit for acquiring an image to be segmented; a down-sampling processing unit for performing convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets; an up-sampling processing unit for performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to a predetermined down-sampling multiple of the image to be segmented; a mask total score calculation unit for calculating, during the up-sampling processing, a total mask score according to the intersection over union of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented; and an image segmentation unit for segmenting the final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are configured to execute the method of any one of the above first aspects.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to execute the method flow described in any one of the first aspects.
Through the above technical solutions, and in view of the insufficient accuracy of image semantic segmentation in the related art, each pixel of the image can be classified by replacing the fully connected layer with a deconvolution layer.
Specifically, in this technical solution the convolutional neural network includes a convolutional layer, an activation layer, and a pooling layer, as well as a deconvolution layer that replaces the original fully connected layer. After the image to be segmented is obtained, the convolutional layer classifies the features in the image to be segmented, that is, its pixels, according to different feature types or subjects; the activation layer then highlights the important features in the classification result; and the pooling layer shrinks the parameter matrix of the data from the activation layer, reducing the data volume and the number of parameters to be processed in the next step, which both speeds up computation and prevents overfitting.
In the convolutional neural network of the related art, the size of the output image gradually decreases after each convolution step, and when the fully connected layer is finally reached, the category probability information obtained is one-dimensional: it can only identify the category of the entire image, not the category of each pixel, and the results are particularly unsatisfactory at image edges. Therefore, in the technical solution of the present application, the fully connected layer is replaced by a deconvolution layer. Deconvolution is equivalent to running an ordinary convolution in reverse: for example, with a 2x2 input matrix and a 3x3 convolution kernel, setting the deconvolution parameters pad=0 and stride=1 produces a 4x4 output matrix, which amounts to completely inverting the convolution. Here, convolution corresponds to down-sampling processing, and deconvolution corresponds to up-sampling processing.
Therefore, after each deconvolution, that is, each up-sampling step, the dimensions of the output image are gradually restored, so the features of each pixel become more accurate after every deconvolution. Through the technical solution of the present application, the output image of the convolutional neural network is thus restored in pixel dimensions, which facilitates effective classification of the features of the output image and improves the accuracy of image semantic segmentation.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative work.
Fig. 1 shows a flowchart of an image segmentation method according to an embodiment of the present application;
Fig. 2 shows a schematic diagram of image segmentation according to an embodiment of the present application;
Fig. 3 shows a flowchart of an image segmentation method according to another embodiment of the present application;
Fig. 4 shows a block diagram of an image segmentation device according to an embodiment of the present application;
Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be clear that the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The terms used in the embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. The singular forms "a", "said", and "the" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
Fig. 1 shows a flowchart of an image segmentation method according to an embodiment of the present application.
As shown in Fig. 1, the image segmentation method of an embodiment of the present application includes the following steps.
Step 102: Acquire an image to be segmented.
Step 104: Perform convolution, activation, and pooling processing on the image to be segmented to obtain five pooling feature sets.
The convolutional neural network includes a convolutional layer, an activation layer, and a pooling layer, as well as a deconvolution layer that replaces the original fully connected layer. After the image to be segmented is obtained, the convolutional layer classifies the features in the image, that is, its pixels, according to different feature types or subjects; the activation layer then highlights the important features in the classification result; and the pooling layer shrinks the parameter matrix of the data from the activation layer, reducing the data volume and the number of parameters to be processed in the next step, which both speeds up computation and prevents overfitting.
Step 106: Perform up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented.
In the convolutional neural network of the related art, the size of the output image gradually decreases after each convolution step, and when the fully connected layer is finally reached, the category probability information obtained is one-dimensional: it can only identify the category of the entire image, not the category of each pixel, and the results are particularly unsatisfactory at image edges. Therefore, in the technical solution of the present application, the fully connected layer is replaced by a deconvolution layer. Deconvolution is equivalent to running an ordinary convolution in reverse: for example, with a 2x2 input matrix and a 3x3 convolution kernel, setting the deconvolution parameters pad=0 and stride=1 produces a 4x4 output matrix, which amounts to completely inverting the convolution. Here, convolution corresponds to down-sampling processing, and deconvolution corresponds to up-sampling processing.
Therefore, after each deconvolution, that is, each up-sampling step, the dimensions of the output image are gradually restored, so the features of each pixel become more accurate after every deconvolution. Through the technical solution of the present application, the output image of the convolutional neural network is thus restored in pixel dimensions, which facilitates effective classification of the features of the output image and improves the accuracy of image semantic segmentation.
The up-sampling processing includes interpolation processing and deconvolution processing. Interpolation refers to inserting new elements between pixels, on the basis of the original image pixels, using a suitable interpolation algorithm; deconvolution refers to compressing the basic wavelet to improve the vertical resolution of the data. Both methods can effectively improve the accuracy of the image.
Step 108: During the up-sampling processing, calculate the total mask score according to the intersection over union of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented.
Step 110: Segment the final result of the up-sampling processing based on the total mask score using a smooth L2 loss function to obtain a segmented image.
After each deconvolution, that is, each up-sampling step, a fully connected layer is added to predict the mask IoU, and the smooth L2 loss function is then used to regress the mask IoU; the image segmentation effect is best when the weight of the smooth L2 loss function is set to 1. Specifically, during the up-sampling processing, the total mask score needs to be calculated according to the intersection over union (IoU) of the prediction mask and the ground truth mask and the mask score of the original network classification of the image to be segmented. The intersection over union is the ratio of the intersection to the union of two bounding boxes: if the union of the two boxes is region a and their intersection is region b, the IoU is the ratio of b to a. The total mask score then equals the product of the IoU of the predicted mask and the actual mask and the mask score of the original network classification of the image to be segmented, so that when the classification score is high but the IoU used in the calculation is low, the branch producing the total mask score is penalized. The total mask score can therefore be trained toward an optimum during up-sampling, yielding an optimized up-sampling result.
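As a hedged sketch of this step, the fragment below adds a small fully connected head that predicts the mask IoU from an up-sampled feature map and regresses it with a squared-error loss whose weight is 1, as reported above; the feature dimensions and the exact loss form are assumptions for the example.

```python
import torch
import torch.nn as nn

class MaskIoUHead(nn.Module):
    # fully connected layer added after an up-sampling step to predict the mask IoU
    def __init__(self, in_features):
        super().__init__()
        self.fc = nn.Linear(in_features, 1)

    def forward(self, feature_map):
        return self.fc(feature_map.flatten(1)).squeeze(1)  # one IoU prediction per image

def mask_iou_regression_loss(predicted_iou, measured_iou, weight=1.0):
    # regress the predicted IoU toward the IoU measured against the actual mask
    return weight * torch.mean((predicted_iou - measured_iou) ** 2)

head = MaskIoUHead(in_features=21 * 14 * 14)          # assumed flattened feature size
pred = head(torch.randn(2, 21, 14, 14))
loss = mask_iou_regression_loss(pred, torch.tensor([0.8, 0.6]))
print(pred.shape, loss.item())
```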
Finally, the final result of the up-sampling processing is segmented based on the total mask score through a smooth L2 loss function to obtain a segmented image. The smooth L2 loss function is also known as the least square error: in essence, it minimizes the sum of the squared differences between the target value and the estimated value, which keeps feature weights from becoming too large and makes them more even, helping to obtain an optimized segmented image.
In addition, in one implementation of the present application, the smooth L2 loss function may optionally be combined with the softmax function for image segmentation; that is, the softmax function is applied on top of the segmentation result of the smooth L2 loss function for more precise segmentation. The softmax function, or normalized exponential function, normalizes the gradient logarithm of a discrete probability distribution over a finite number of items; it maps the outputs of multiple neurons into the (0, 1) interval, which can be interpreted as the probability that the current output belongs to each category, so that the category with the highest probability can be selected as the prediction target. Compared with other functions that could be used to select a maximum, softmax uses the exponential, which makes large values larger and small values smaller, increasing the contrast between categories and making the neural network learn more efficiently.
Through the above technical solutions, and in view of the insufficient accuracy of image semantic segmentation in the related art, replacing the fully connected layer with a deconvolution layer and additionally adding another fully connected layer to classify each pixel of the image can improve the accuracy of image semantic segmentation.
Fig. 2 shows a schematic diagram of image segmentation according to an embodiment of the present application.
As shown in Fig. 2, w represents the width and h represents the height. The image to be segmented (image), whose width and height are w and h, is convolved and pooled to generate the first pooling feature set (pool1), with width and height reduced to w/2 and h/2; the first pooling feature set is convolved and pooled to generate the second pooling feature set (pool2), with width and height reduced to w/4 and h/4; the second pooling feature set is convolved and pooled to generate the third pooling feature set (pool3), with width and height reduced to w/8 and h/8; the third pooling feature set is convolved and pooled to generate the fourth pooling feature set (pool4), with width and height reduced to w/16 and h/16; and the fourth pooling feature set is convolved and pooled to generate the fifth pooling feature set (pool5), with width and height reduced to w/32 and h/32. At this point, the resolution of the picture has been greatly reduced along with its width and height, resulting in lower image quality.
Therefore, deconvolution, that is, up-sampling processing, can be used. Deconvolution is equivalent to running an ordinary convolution in reverse: for example, with a 2x2 input matrix and a 3x3 convolution kernel, setting pad=0 and stride=1 outputs a 4x4 matrix, which amounts to completely inverting the convolution. Up-sampling can therefore increase the original resolution, and when applied to the pooling feature sets produced by convolution and pooling, it restores their resolution.
Specifically, when the predetermined down-sampling multiple of the image to be segmented is 32, the fifth pooling feature set among the five pooling feature sets is subjected to 32x up-sampling processing, and the result of the 32x up-sampling is then segmented with softmax, thereby realizing a 32x restoration of the fifth pooling feature set and improving the accuracy of the result obtained by the 32x up-sampling processing.
When the predetermined down-sampling multiple of the image to be segmented is 16, the fifth pooling feature set among the five pooling feature sets is subjected to 2x up-sampling processing to obtain a first up-sampling feature set; the first up-sampling feature set is fused with the fourth pooling feature set of the five pooling feature sets to obtain the final result of the up-sampling processing, and the final result is then segmented with softmax, thereby realizing the restoration of the fourth pooling feature set and improving the accuracy of the result obtained by the 16x up-sampling processing.
Simply restoring the fourth pooling feature set by 16x can improve the accuracy of the result to a certain extent. However, since the fifth pooling feature set has already been generated, that is, since the fourth pooling feature set has already been further filtered and its features highlighted in the 32x down-sampled fifth pooling feature set, the fifth pooling feature set can be used effectively: up-sampling it by 2x restores its width and height to w/16 and h/16, the same as the fourth pooling feature set, so the two can be fused, and 16x up-sampling is performed after the fusion. The fusion referred to here means merging, one by one, the features of the pixels of the fourth pooling feature set with the features of the pixels obtained by 2x up-sampling of the fifth pooling feature set.
Therefore, compared with simply restoring the fourth pooling feature set by 16x, this further improves the accuracy of the up-sampling result, helps to further sharpen the image edges, and improves the accuracy of classification at the image edges.
When the predetermined down-sampling multiple of the image to be segmented is 8, 2-fold up-sampling processing is performed on the fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set; the first up-sampled feature set is fused with the fourth pooling feature set among the five pooling feature sets to obtain a fusion result; 2-fold up-sampling processing is performed on the fusion result to obtain a second up-sampled feature set; the second up-sampled feature set is fused with the third pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing, and softmax segmentation is then performed on the final result. This restores the third pooling feature set and improves the accuracy of the result obtained from the 8-fold up-sampling.
Simply up-sampling the third pooling feature set 8-fold can improve the accuracy of the result to a certain extent. However, the fourth and fifth pooling feature sets have already been generated; that is, the third pooling feature set has been further filtered and highlighted in the 16-fold down-sampled fourth pooling feature set, which in turn has been further filtered and highlighted in the 32-fold down-sampled fifth pooling feature set, so these down-sampling results can be put to use. The fifth pooling feature set is first restored 2-fold to a length and width of w/16 and h/16, the same as the fourth pooling feature set, and can therefore be fused directly with the fourth pooling feature set. This fusion merges, pixel by pixel, the features of the fourth pooling feature set with the pixel features obtained by 2-fold up-sampling of the fifth pooling feature set, which completes a first correction of the pixel features of the fourth pooling feature set and makes them more discriminative for classification. The fused result is then up-sampled 2-fold to a length and width of w/8 and h/8, the same as the third pooling feature set, so that it can be fused with the third pooling feature set; 8-fold up-sampling processing is performed after this second fusion. In this way, the features filtered and highlighted through the fourth and fifth pooling feature sets correct the pixel features of the third pooling feature set, so the pixel features in the final fusion result are more accurate and better suited to classification.
Therefore, compared with simply restoring the third pooling feature set 8-fold, this further improves the accuracy of the up-sampling result, helps to sharpen image edges, and improves the accuracy with which image edges are classified.
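Continuing the same assumptions (additive pixel-by-pixel fusion, bilinear up-sampling), the 8-fold branch with its two fusions can be sketched as:

```python
# Sketch of the 8x branch: pool5 corrects pool4 (first fusion), the result is
# up-sampled 2x and corrects pool3 (second fusion), then up-sampled 8x.
import torch.nn.functional as F

def branch_8x(score_pool3, score_pool4, score_pool5):
    up5 = F.interpolate(score_pool5, scale_factor=2,
                        mode='bilinear', align_corners=False)
    fused45 = up5 + score_pool4             # first fusion, at (h/16, w/16)
    up45 = F.interpolate(fused45, scale_factor=2,
                         mode='bilinear', align_corners=False)
    fused345 = up45 + score_pool3           # second fusion, at (h/8, w/8)
    up = F.interpolate(fused345, scale_factor=8,
                       mode='bilinear', align_corners=False)
    return up.softmax(dim=1)
```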
Fig. 3 shows a flowchart of an image segmentation method according to another embodiment of the present application.
As shown in Fig. 3, the flow of the image segmentation method of another embodiment of the present application includes:
Step 302: obtain an image to be segmented.
Step 304: perform convolution, activation and pooling processing on the image to be segmented to obtain five pooling feature sets.
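The down-sampling side of step 304 can be sketched as a stack of five convolution + activation + pooling stages; the channel counts and input size below are illustrative assumptions, not values from the disclosure:

```python
# Hypothetical backbone sketch: each stage halves the spatial size, so an
# h x w input yields pooling feature sets at h/2, h/4, h/8, h/16 and h/32.
import torch
import torch.nn as nn

def stage(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.ReLU(inplace=True),
                         nn.MaxPool2d(2, 2))

stages = nn.ModuleList([stage(3, 64), stage(64, 128), stage(128, 256),
                        stage(256, 512), stage(512, 512)])

x = torch.randn(1, 3, 224, 224)
pools = []
for s in stages:
    x = s(x)
    pools.append(x)                    # pool1 .. pool5
print([p.shape[-1] for p in pools])    # [112, 56, 28, 14, 7]
```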
Step 306: perform up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented.
Step 308: determine whether the number of fusions performed in the up-sampling processing equals the specified number of fusions for the predetermined down-sampling multiple. If yes, proceed to step 310; if no, return to step 306 and continue the up-sampling processing, including its fusion steps.
With reference to the embodiment shown in Fig. 2, when the predetermined down-sampling multiple of the image to be segmented is 32, there is no subsequent, more refined feature set after the fifth pooling feature set, so a single up-sampling operation suffices and the corresponding specified number of fusions is 0. When the predetermined down-sampling multiple of the image to be segmented is 16, the fourth pooling feature set is followed by the fifth pooling feature set, whose features are more refined, so one fusion with the 2-fold up-sampled fifth pooling feature set is required. Similarly, when the predetermined down-sampling multiple of the image to be segmented is 8, the third pooling feature set is followed by the more refined fourth and fifth pooling feature sets, so two fusions are required.
Therefore, each predetermined down-sampling multiple corresponds to a number of fusions that must be reached. By checking the number of fusions performed during the up-sampling processing, it can be determined whether the up-sampling step may end and the image segmentation step may begin, and outputting an up-sampling result before the fusion count is reached, that is, while the features are insufficiently restored, is avoided. This effective monitoring of the up-sampling processing further guarantees the accuracy of the final result.
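A hypothetical sketch of this check is shown below; the mapping from down-sampling multiple to required fusion count follows the cases discussed above (32: 0 fusions, 16: 1 fusion, 8: 2 fusions):

```python
# Step 308 as a simple check: up-sampling only ends once the number of fusions
# performed matches the count required by the predetermined down-sampling multiple.
REQUIRED_FUSIONS = {32: 0, 16: 1, 8: 2}

def upsampling_finished(downsample_multiple: int, fusions_done: int) -> bool:
    return fusions_done == REQUIRED_FUSIONS[downsample_multiple]
```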
Step 310: segment the final result of the up-sampling processing through the smooth L2 loss function and the softmax function to obtain a segmented image.
After each deconvolution (up-sampling) step, a fully connected layer is added to predict the mask IoU, and the smooth L2 loss function is then used to regress the mask IoU. The image segmentation effect is best when the weight of the smooth L2 loss function is set to 1. Specifically, during the up-sampling processing, the total mask score (mask score) is calculated from the intersection-over-union (IoU) of the prediction mask and the ground-truth mask together with the mask score of the original network classification of the image to be segmented. The IoU is the ratio of the intersection of two bounding boxes to their union: if the union of the two bounding boxes is region a and their intersection is region b, the IoU is b/a. The total mask score then equals the product of the IoU of the prediction mask and the ground-truth mask and the mask score of the original network classification of the image to be segmented, so that a high classification score obtained with a low IoU is penalised in the mask-score branch. In this way, the total mask score can be trained toward its optimum during up-sampling, yielding an optimized up-sampling result.
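A minimal sketch of the mask-score computation described above is given below, assuming binary masks stored as numpy arrays; the function names are illustrative:

```python
# Mask IoU is intersection over union of the prediction mask and the
# ground-truth mask; the total mask score multiplies it by the classification
# score of the original network, so a high class score with a low IoU is penalised.
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def total_mask_score(pred_mask, gt_mask, cls_score: float) -> float:
    return mask_iou(pred_mask, gt_mask) * cls_score
```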
Finally, the smooth L2 loss function is used to segment the final result of the up-sampling processing based on the total mask score, yielding the segmented image. The smooth L2 loss function is also known as the least-square error: it minimises the sum of the squares of the differences between target values and estimated values, which keeps the feature weights from becoming too large and makes them more even. In this way, by replacing the fully connected layers with deconvolution layers and adding one extra fully connected layer, a segmented image with an optimized effect is obtained.
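Read as a least-square error, the loss used to regress the predicted mask IoU can be sketched as follows (a hypothetical illustration; the weight of 1 mirrors the setting stated above):

```python
# Squared-error ("smooth L2" in the text) regression of the predicted mask IoU
# toward its target; minimising it keeps feature weights from growing too large.
def l2_loss(predicted_iou: float, target_iou: float, weight: float = 1.0) -> float:
    return weight * (predicted_iou - target_iou) ** 2
```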
Building on the segmentation result of the smooth L2 loss function, the softmax function is then used for precise segmentation. The softmax function, also called the normalized exponential function, is a gradient log-normalisation of a finite discrete probability distribution. Softmax maps the outputs of multiple neurons into the interval (0, 1); the mapped values can be regarded as the probabilities that the current output belongs to each class, which makes it easy to select the class with the highest probability as the prediction target. Compared with other functions that select a maximum, softmax uses exponentiation, which makes large values larger and small values smaller, increases the contrast between classes, and makes the learning of the neural network more efficient.
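A standard softmax implementation illustrates the behaviour described above; the maximum is subtracted only for numerical stability and does not change the result:

```python
# Softmax maps raw scores into (0, 1) so that they sum to 1; exponentiation
# widens the gap between large and small scores.
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    e = np.exp(scores - scores.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.659, 0.242, 0.099]
```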
In summary, by replacing the fully connected layers with deconvolution layers and adding one extra fully connected layer so that every pixel of the image is classified, the accuracy of image semantic segmentation can be improved.
Fig. 4 shows a block diagram of an image segmentation device according to an embodiment of the present application.
As shown in Fig. 4, the image segmentation device 400 of an embodiment of the present application includes: an image acquisition unit 402, configured to acquire an image to be segmented; a down-sampling processing unit 404, configured to perform convolution, activation and pooling processing on the image to be segmented to obtain five pooling feature sets; an up-sampling processing unit 406, configured to perform up-sampling processing on a specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented; a mask total score calculation unit 408, configured to calculate, during the up-sampling processing, a total mask score according to the intersection-over-union of the prediction mask and the ground-truth mask and the mask score of the original network classification of the image to be segmented; and an image segmentation unit 410, configured to segment the final result of the up-sampling processing based on the total mask score through the smooth L2 loss function to obtain a segmented image.
In the above embodiment of the present application, optionally, the up-sampling processing unit 406 includes: a first processing unit, configured to perform 32-fold up-sampling processing on the fifth pooling feature set among the five pooling feature sets when the predetermined down-sampling multiple of the image to be segmented is 32.
In the above embodiment of the present application, optionally, the up-sampling processing unit 406 includes: a second processing unit, configured to perform 2-fold up-sampling processing on the fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set when the predetermined down-sampling multiple of the image to be segmented is 16; and a first fusion unit, configured to fuse the first up-sampled feature set with the fourth pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
In the above embodiment of the present application, optionally, the up-sampling processing unit 406 includes: a second processing unit, configured to perform 2-fold up-sampling processing on the fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set when the predetermined down-sampling multiple of the image to be segmented is 8; a first fusion unit, configured to fuse the first up-sampled feature set with the fourth pooling feature set among the five pooling feature sets to obtain a fusion result; a third processing unit, configured to perform 2-fold up-sampling processing on the fusion result to obtain a second up-sampled feature set; and a second fusion unit, configured to fuse the second up-sampled feature set with the third pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
In the above embodiment of the present application, optionally, the up-sampling processing includes interpolation processing and deconvolution processing.
The image segmentation device 400 uses the solution described in any one of the embodiments shown in Figs. 1 to 3 and therefore has all of the technical effects described above, which will not be repeated here.
Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present application.
As shown in Fig. 5, an electronic device 500 of an embodiment of the present application includes at least one memory 502 and a processor 504 communicatively connected to the at least one memory 502, wherein the memory stores instructions executable by the at least one processor 504, and the instructions are configured to execute the solution described in any one of the embodiments of Figs. 1 to 3. Therefore, the electronic device 500 has the same technical effects as any one of the embodiments of Figs. 1 to 3, which will not be repeated here.
The electronic devices in the embodiments of the present application exist in various forms, including but not limited to:
(1) Mobile communication devices: characterised by mobile communication functions, with voice and data communication as the main goal. Such terminals include smart phones (e.g. the iPhone), multimedia phones, feature phones and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID and UMPC devices, e.g. the iPad.
(3) Portable entertainment devices: these can display and play multimedia content. Such devices include audio and video players (e.g. the iPod), handheld game consoles, e-book readers, smart toys and portable in-vehicle navigation devices.
(4) Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capacity, stability, reliability, security, scalability and manageability.
(5) Other electronic devices with data interaction functions.
In addition, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the method flow described in any one of the embodiments of Figs. 1 to 3.
The technical solutions of the present application have been described in detail above with reference to the accompanying drawings. Through these technical solutions, the output image of the convolutional neural network is restored in the pixel dimension, which facilitates effective classification of the features of the output image and improves the accuracy of image semantic segmentation.
It should be understood that the term "and/or" used herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean that A exists alone, that A and B exist simultaneously, or that B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
It should be understood that, although the terms first, second and so on may be used in the embodiments of the present application to describe pooling feature sets, these pooling feature sets should not be limited by such terms. The terms are only used to distinguish pooling feature sets from one another. For example, without departing from the scope of the embodiments of the present application, a first pooling feature set may also be called a second pooling feature set and, similarly, a second pooling feature set may also be called a first pooling feature set.
Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
In the several embodiments provided in this application, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of the units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices or units, and may be electrical, mechanical or of another form.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. Such a software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device and so on) or a processor to execute some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The above are only preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (20)

  1. An image segmentation method, characterized by comprising:
    obtaining an image to be segmented;
    performing convolution, activation and pooling processing on the image to be segmented to obtain five pooling feature sets;
    performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to an up-sampling mode corresponding to a predetermined down-sampling multiple of the image to be segmented;
    during the up-sampling processing, calculating a total mask score according to the intersection-over-union of a prediction mask and a ground-truth mask and the mask score of the original network classification of the image to be segmented;
    segmenting a final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
  2. The image segmentation method according to claim 1, wherein the step of performing up-sampling processing on the specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented comprises:
    when the predetermined down-sampling multiple of the image to be segmented is 32, performing 32-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets.
  3. The image segmentation method according to claim 1, wherein the step of performing up-sampling processing on the specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented comprises:
    when the predetermined down-sampling multiple of the image to be segmented is 16, performing 2-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set;
    fusing the first up-sampled feature set with a fourth pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  4. The image segmentation method according to claim 1, wherein the step of performing up-sampling processing on the specified pooling feature set among the five pooling feature sets according to the up-sampling mode corresponding to the predetermined down-sampling multiple of the image to be segmented comprises:
    when the predetermined down-sampling multiple of the image to be segmented is 8, performing 2-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set;
    fusing the first up-sampled feature set with a fourth pooling feature set among the five pooling feature sets to obtain a fusion result;
    performing 2-fold up-sampling processing on the fusion result to obtain a second up-sampled feature set;
    fusing the second up-sampled feature set with a third pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  5. The image segmentation method according to any one of claims 1 to 4, wherein
    the up-sampling processing comprises interpolation processing and deconvolution processing.
  6. An image segmentation device, characterized by comprising:
    an image acquisition unit, configured to acquire an image to be segmented;
    a down-sampling processing unit, configured to perform convolution, activation and pooling processing on the image to be segmented to obtain five pooling feature sets;
    an up-sampling processing unit, configured to perform up-sampling processing on a specified pooling feature set among the five pooling feature sets according to an up-sampling mode corresponding to a predetermined down-sampling multiple of the image to be segmented;
    a mask total score calculation unit, configured to calculate, during the up-sampling processing, a total mask score according to the intersection-over-union of a prediction mask and a ground-truth mask and the mask score of the original network classification of the image to be segmented;
    an image segmentation unit, configured to segment a final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
  7. The image segmentation device according to claim 6, wherein the up-sampling processing unit comprises:
    a first processing unit, configured to perform 32-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets when the predetermined down-sampling multiple of the image to be segmented is 32.
  8. The image segmentation device according to claim 6, wherein the up-sampling processing unit comprises:
    a second processing unit, configured to perform 2-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set when the predetermined down-sampling multiple of the image to be segmented is 16;
    a first fusion unit, configured to fuse the first up-sampled feature set with a fourth pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  9. The image segmentation device according to claim 6, wherein the up-sampling processing unit comprises:
    a second processing unit, configured to perform 2-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set when the predetermined down-sampling multiple of the image to be segmented is 8;
    a first fusion unit, configured to fuse the first up-sampled feature set with a fourth pooling feature set among the five pooling feature sets to obtain a fusion result;
    a third processing unit, configured to perform 2-fold up-sampling processing on the fusion result to obtain a second up-sampled feature set;
    a second fusion unit, configured to fuse the second up-sampled feature set with a third pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  10. The image segmentation device according to any one of claims 6 to 9, wherein
    the up-sampling processing comprises interpolation processing and deconvolution processing.
  11. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, and the instructions are configured to execute the following steps:
    obtaining an image to be segmented;
    performing convolution, activation and pooling processing on the image to be segmented to obtain five pooling feature sets;
    performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to an up-sampling mode corresponding to a predetermined down-sampling multiple of the image to be segmented;
    during the up-sampling processing, calculating a total mask score according to the intersection-over-union of a prediction mask and a ground-truth mask and the mask score of the original network classification of the image to be segmented;
    segmenting a final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
  12. The electronic device according to claim 11, wherein the instructions are configured to execute the following steps:
    when the predetermined down-sampling multiple of the image to be segmented is 32, performing 32-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets.
  13. The electronic device according to claim 11, wherein the instructions are configured to execute the following steps:
    when the predetermined down-sampling multiple of the image to be segmented is 16, performing 2-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set;
    fusing the first up-sampled feature set with a fourth pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  14. The electronic device according to claim 11, wherein the instructions are configured to execute the following steps:
    when the predetermined down-sampling multiple of the image to be segmented is 8, performing 2-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set;
    fusing the first up-sampled feature set with a fourth pooling feature set among the five pooling feature sets to obtain a fusion result;
    performing 2-fold up-sampling processing on the fusion result to obtain a second up-sampled feature set;
    fusing the second up-sampled feature set with a third pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  15. The electronic device according to any one of claims 11 to 14, wherein
    the up-sampling processing comprises interpolation processing and deconvolution processing.
  16. A computer-readable storage medium, characterized in that computer-executable instructions are stored thereon, the computer-executable instructions being used to execute the following steps:
    obtaining an image to be segmented;
    performing convolution, activation and pooling processing on the image to be segmented to obtain five pooling feature sets;
    performing up-sampling processing on a specified pooling feature set among the five pooling feature sets according to an up-sampling mode corresponding to a predetermined down-sampling multiple of the image to be segmented;
    during the up-sampling processing, calculating a total mask score according to the intersection-over-union of a prediction mask and a ground-truth mask and the mask score of the original network classification of the image to be segmented;
    segmenting a final result of the up-sampling processing based on the total mask score through a smooth L2 loss function to obtain a segmented image.
  17. The computer-readable storage medium according to claim 16, wherein the computer-executable instructions are used to execute the following steps:
    when the predetermined down-sampling multiple of the image to be segmented is 32, performing 32-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets.
  18. The computer-readable storage medium according to claim 16, wherein the computer-executable instructions are used to execute the following steps:
    when the predetermined down-sampling multiple of the image to be segmented is 16, performing 2-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set;
    fusing the first up-sampled feature set with a fourth pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  19. The computer-readable storage medium according to claim 16, wherein the computer-executable instructions are used to execute the following steps:
    when the predetermined down-sampling multiple of the image to be segmented is 8, performing 2-fold up-sampling processing on a fifth pooling feature set among the five pooling feature sets to obtain a first up-sampled feature set;
    fusing the first up-sampled feature set with a fourth pooling feature set among the five pooling feature sets to obtain a fusion result;
    performing 2-fold up-sampling processing on the fusion result to obtain a second up-sampled feature set;
    fusing the second up-sampled feature set with a third pooling feature set among the five pooling feature sets to obtain the final result of the up-sampling processing.
  20. The computer-readable storage medium according to any one of claims 16 to 19, wherein
    the up-sampling processing comprises interpolation processing and deconvolution processing.
PCT/CN2019/118294 2019-07-05 2019-11-14 Image segmentation method, electronic device, and computer-readable storage medium WO2021003936A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910602691.5 2019-07-05
CN201910602691.5A CN110490203B (en) 2019-07-05 2019-07-05 Image segmentation method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021003936A1 true WO2021003936A1 (en) 2021-01-14

Family

ID=68546051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118294 WO2021003936A1 (en) 2019-07-05 2019-11-14 Image segmentation method, electronic device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110490203B (en)
WO (1) WO2021003936A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340820B (en) * 2020-02-10 2022-05-17 中国科学技术大学 Image segmentation method and device, electronic equipment and storage medium
CN111340813B (en) * 2020-02-25 2023-09-01 北京字节跳动网络技术有限公司 Image instance segmentation method and device, electronic equipment and storage medium
CN111523548B (en) * 2020-04-24 2023-11-28 北京市商汤科技开发有限公司 Image semantic segmentation and intelligent driving control method and device
CN113744276A (en) * 2020-05-13 2021-12-03 Oppo广东移动通信有限公司 Image processing method, image processing apparatus, electronic device, and readable storage medium
CN112150470B (en) * 2020-09-22 2023-10-03 平安科技(深圳)有限公司 Image segmentation method, device, medium and electronic equipment
CN113160263A (en) * 2021-03-30 2021-07-23 电子科技大学 Improved method based on YOLACT instance segmentation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171663A (en) * 2017-12-22 2018-06-15 哈尔滨工业大学 The image completion system for the convolutional neural networks that feature based figure arest neighbors is replaced
CN108230329A (en) * 2017-12-18 2018-06-29 孙颖 Semantic segmentation method based on multiple dimensioned convolutional neural networks
CN109636807A (en) * 2018-11-27 2019-04-16 宿州新材云计算服务有限公司 A kind of grape disease blade split plot design of image segmentation and pixel recovery
CN109816011A (en) * 2019-01-21 2019-05-28 厦门美图之家科技有限公司 Generate the method and video key frame extracting method of portrait parted pattern
US10304193B1 (en) * 2018-08-17 2019-05-28 12 Sigma Technologies Image segmentation and object detection using fully convolutional neural network
CN109886971A (en) * 2019-01-24 2019-06-14 西安交通大学 A kind of image partition method and system based on convolutional neural networks

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130189A1 (en) * 2017-10-30 2019-05-02 Qualcomm Incorporated Suppressing duplicated bounding boxes from object detection in a video analytics system
CN109584251A (en) * 2018-12-06 2019-04-05 湘潭大学 A kind of tongue body image partition method based on single goal region segmentation
CN109784283B (en) * 2019-01-21 2021-02-09 陕西师范大学 Remote sensing image target extraction method based on scene recognition task
CN109800735A (en) * 2019-01-31 2019-05-24 中国人民解放军国防科技大学 Accurate detection and segmentation method for ship target


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GDTOP818: "Mask Scoring R-CNN[Detailed]", 6 March 2019 (2019-03-06), pages 1 - 10, XP009525484, Retrieved from the Internet <URL:https://blog.csdn.net/weixin_37993251/article/details/88248361> *

Also Published As

Publication number Publication date
CN110490203A (en) 2019-11-22
CN110490203B (en) 2023-11-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19937242; Country of ref document: EP; Kind code of ref document: A1)
122 Ep: pct application non-entry in european phase (Ref document number: 19937242; Country of ref document: EP; Kind code of ref document: A1)