WO2022133627A1 - Image segmentation method and apparatus, and device and storage medium - Google Patents


Info

Publication number
WO2022133627A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
module
segmentation
label
features
Prior art date
Application number
PCT/CN2020/137858
Other languages
French (fr)
Chinese (zh)
Inventor
曹桂平
Original Assignee
广州视源电子科技股份有限公司
Application filed by 广州视源电子科技股份有限公司 filed Critical 广州视源电子科技股份有限公司
Priority to CN202080099096.5A (published as CN115349139A)
Priority to PCT/CN2020/137858 (WO2022133627A1)
Publication of WO2022133627A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/11: Region-based segmentation

Definitions

  • the embodiments of the present application relate to the technical field of image processing, and in particular, to an image segmentation method, apparatus, device, and storage medium.
  • Image segmentation is one of the common techniques in image processing. It is used to accurately extract the region of interest from the image to be processed and to use that region as the target region image, so as to facilitate subsequent processing of the target region image (such as background replacement, matting out the target region, etc.).
  • Portrait-based image segmentation is an important application in the field of image segmentation.
  • Portrait-based image segmentation refers to the accurate separation of the portrait area and the background area in the image to be processed.
  • It is of great significance to perform portrait-based image segmentation on online video data. In scenarios such as online conferences or online live broadcasts, image segmentation is performed on the online video data to accurately separate the portrait area and the background area, and the background area is then replaced with a new background image, so as to achieve the goal of protecting user privacy.
  • Image segmentation methods mainly include threshold-based, region-based, edge-based, and graph-theory- and energy-functional-based methods.
  • The threshold-based method segments according to the grayscale features of the image; its drawback is that it is only suitable for images in which the grayscale values of the portrait area fall uniformly outside the range of grayscale values of the background area.
  • the region-based method divides the image into different regions according to the similarity criterion of the spatial neighborhood, and its disadvantage is that it cannot handle complex images.
  • The edge-based method mainly uses the discontinuity of local image features (such as abrupt pixel changes at the edge of the portrait) to obtain the boundary of the portrait region, and its disadvantage is that the computational complexity is high.
  • The methods based on graph theory and energy functionals mainly use the energy functional of the image to perform portrait segmentation; their disadvantage is that the amount of calculation is huge and artificial prior information is required. Due to these defects, the above techniques cannot be applied to scenarios that require real-time, simple, and accurate image segmentation of online video data.
  • Embodiments of the present application provide an image segmentation method, apparatus, device, and storage medium, so as to solve the technical problem that the above technology cannot accurately perform image segmentation on online video data.
  • an embodiment of the present application provides an image segmentation method, including:
  • acquiring a current frame image in video data, where a target object is displayed in the video data;
  • inputting the current frame image into a trained image segmentation model to obtain a first segmented image based on the target object;
  • performing smoothing processing on the first segmented image to obtain a second segmented image based on the target object;
  • taking the next frame image in the video data as the current frame image, and returning to perform the operation of inputting the current frame image into the trained image segmentation model, until a corresponding second segmented image is obtained for each frame image in the video data.
  • an embodiment of the present application further provides an image segmentation device, including:
  • a data acquisition module, configured to acquire the current frame image in the video data, where the target object is displayed in the video data
  • a first segmentation module for inputting the current frame image into a trained image segmentation model to obtain a first segmented image based on the target object
  • a second segmentation module configured to perform smoothing processing on the first segmented image to obtain a second segmented image based on the target object
  • a repeated segmentation module, used for taking the next frame image in the video data as the current frame image and returning to perform the operation of inputting the current frame image into the trained image segmentation model, until a corresponding second segmented image is obtained for each frame image in the video data.
  • the embodiments of the present application further provide an image segmentation device, including:
  • one or more processors;
  • a memory for storing one or more programs;
  • wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the image segmentation method described in the first aspect.
  • an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the image segmentation method described in the first aspect.
  • The above-mentioned image segmentation method, apparatus, device, and storage medium acquire video data containing the target object, input each frame image of the video data into the image segmentation model to obtain the corresponding first segmented image, and then perform smoothing processing on the first segmented image to obtain the second segmented image. This technical means solves the technical problem that some image segmentation technologies cannot accurately segment online video data.
  • The online video data can be segmented accurately and in real time. Owing to the self-learning of the image segmentation model, the method can be applied to online video data with complex images, and in application it can be used directly by simply deploying the image segmentation model, without artificial prior information, which reduces the complexity of image segmentation and expands the application scenarios of image segmentation methods.
  • FIG. 1 is a flowchart of an image segmentation method provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of another image segmentation method provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an image segmentation model provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an original image provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a segmentation result image provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an edge result image provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another image segmentation model provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an image segmentation apparatus provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an image segmentation device provided by an embodiment of the present application.
  • The terms "first" and "second" are only used to distinguish one entity, operation, or object from another entity, operation, or object, and do not necessarily require or imply any actual relationship or order between these entities, operations, or objects.
  • For example, "first" and "second" in the first segmented image and the second segmented image are only used to distinguish two different segmented images.
  • The image segmentation method provided in this embodiment of the present application may be performed by an image segmentation device, which may be implemented in software and/or hardware, and the image segmentation device may be composed of two or more physical entities or of a single physical entity.
  • the image segmentation device may be a computer, a mobile phone, a tablet, or an interactive smart tablet, and other smart devices with data computing and analysis capabilities.
  • FIG. 1 is a flowchart of an image segmentation method provided by an embodiment of the present application.
  • the image segmentation method specifically includes:
  • Step 110 Acquire the current frame image in the video data, and the target object is displayed in the video data.
  • the video data is the video data for which image segmentation is currently required, which may be online video data or offline video data.
  • the video data includes multiple frames of images, and each frame of images displays a target object, which can be considered as an object that needs to be separated from the background image.
  • The background images of the frame images in the video data may be the same or different, which is not limited in the embodiment. The target object may change as the video data plays, but during the change process the type of the target object does not change.
  • For example, when the target object is a human being, the human image in the video data may change (such as a person being replaced or a new person being added), but the target object in the video data is always a human being.
  • In the embodiment, the target object being a human being is taken as an example for description.
  • the source of the video data is not limited in this embodiment.
  • In one example, the video data is a piece of video shot by an image capture device (such as a webcam, a camera, etc.) connected to the image segmentation device.
  • the video data is a conference screen obtained from a network in a video conference scenario.
  • the video data is a live broadcast image obtained from a network in a live broadcast scenario.
  • performing image segmentation on the video data refers to separating the region where the target object is located in each frame of image in the video data.
  • the target object is exemplarily described as a human being.
  • the processing of the video data is in units of frames, that is, the images in the video data are acquired frame by frame, and the images are processed to obtain a final image segmentation result.
  • the currently processed image is recorded as the current frame image, and the processing of the current frame image is taken as an example for description.
  • Step 120 Input the current frame image into the trained image segmentation model to obtain the first segmented image based on the target object.
  • the image segmentation model is a pre-trained neural network model, which is used to segment the target object in the current frame image and output the segmentation result corresponding to the current frame image.
  • the segmentation result is recorded as the first segmented image
  • the portrait area and the background area of the current frame image can be determined through the first segmented image, wherein the portrait area can be considered as the area where the target object (human) is located.
  • the first segmented image is a binary image, and its pixel values include two types: 0 and 1, wherein, the area with a pixel value of 0 belongs to the background area of the current frame image, and the area with a pixel value of 1 belongs to the current frame image. portrait area.
  • In one embodiment, the pixel values are converted into two types, 0 and 255, before the first segmented image is displayed, where the area with a pixel value of 0 belongs to the background area and the area with a pixel value of 255 belongs to the portrait area.
  • The resolution of the first segmented image is the same as the resolution of the current frame image. It can be understood that when the image segmentation model has a resolution requirement for the input image, that is, when an image of a fixed resolution needs to be input, it is necessary to determine whether the resolution of the current frame image meets the resolution requirement. If it does not, resolution conversion is performed on the current frame image to obtain a current frame image that meets the resolution requirement.
  • Accordingly, resolution conversion is also performed on the first segmented image, so that the resolution of the first segmented image is the same as the resolution of the original current frame image (that is, the current frame image before resolution conversion).
  • If the image segmentation model does not have a resolution requirement for the input image, the current frame image can be directly input into the image segmentation model to obtain a first segmented image with the same resolution.
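  • As a minimal sketch of this resolution handling (assuming OpenCV and a hypothetical fixed model input size of 224×224), the conversion before and after the model could look like this:

```python
import cv2
import numpy as np

MODEL_SIZE = (224, 224)  # assumed fixed input resolution of the model

def segment_frame(frame, model):
    """Resize the frame to the model resolution, run the model,
    then resize the mask back to the original frame resolution."""
    h, w = frame.shape[:2]
    resized = cv2.resize(frame, MODEL_SIZE, interpolation=cv2.INTER_LINEAR)
    mask = model(resized)  # hypothetical call returning a binary mask (values 0/1)
    # restore the mask to the original frame resolution
    mask = cv2.resize(mask.astype(np.uint8), (w, h), interpolation=cv2.INTER_NEAREST)
    return mask
```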
  • the structure and parameters of the image segmentation model can be set according to actual conditions.
  • the image segmentation model adopts an autoencoder structure.
  • autoencoder is a kind of artificial neural network used in semi-supervised learning and unsupervised learning.
  • The autoencoder includes two parts: an encoder and a decoder, where the encoder is used to extract the features in the image, and the decoder is used to decode the extracted features to obtain the learning result (for example, the first segmented image).
  • the encoder adopts a lightweight network to reduce the amount of data processing and calculation when extracting features, and to speed up the processing speed.
  • the decoder can be implemented by residual blocks combined with channel obfuscation, upsampling, etc., to achieve fully automatic real-time image segmentation.
  • The features of the current frame image at different resolutions can be extracted by the encoder, and the decoder then performs operations such as upsampling, fusion, and decoding on each feature to reuse each feature, thereby obtaining an accurate first segmented image.
  • the image segmentation model is deployed under the forward inference framework.
  • the specific type of the forward reasoning framework can be set according to the actual situation, for example, the forward reasoning framework is the openvino framework.
  • When deployed in the forward inference framework, the image segmentation model has a low dependence on the GPU, is relatively portable, and does not occupy a large amount of storage space.
  • Step 130 Smooth the first segmented image to obtain a second segmented image based on the target object.
  • The first segmented image may exhibit edge jaggedness, which can be understood as jagged edges between the portrait area and the background area that make the separation of the two areas look too stiff.
  • the first segmented image is smoothed, that is, the edge jaggedness in the first segmented image is smoothed, so as to obtain a segmented image with smoother edges.
  • the segmented image is denoted as the second segmented image.
  • the second segmented image can also be considered as the final segmented result of the current frame image. It can be understood that the second segmented image is also a binary image, and its pixel values include two types: 0 and 1.
  • The area with a pixel value of 0 belongs to the background area of the current frame image, and the area with a pixel value of 1 belongs to the portrait area of the current frame image.
  • the smoothing processing is implemented by means of Gaussian smoothing filtering.
  • the Gaussian kernel function is used in the Gaussian smoothing filtering to process the first segmented image to obtain the second segmented image.
  • the Gaussian kernel function is a commonly used kernel function.
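  • As an illustrative sketch of the smoothing step (the kernel size, sigma, and re-binarization threshold are assumptions, not values specified here), Gaussian smoothing of the binary mask with OpenCV could look like this:

```python
import cv2
import numpy as np

def smooth_mask(first_segmented, ksize=7, sigma=0):
    """Apply Gaussian smoothing to the first segmented image (values 0/1)
    and re-binarize to obtain the second segmented image."""
    blurred = cv2.GaussianBlur(first_segmented.astype(np.float32), (ksize, ksize), sigma)
    # threshold back to a 0/1 binary image (threshold value is an assumption)
    second_segmented = (blurred >= 0.5).astype(np.uint8)
    return second_segmented
```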
  • Step 140 take the next frame image in the video data as the current frame image, and return to perform the operation of inputting the current frame image to the trained image segmentation model, until each frame image in the video data obtains the corresponding second segmentation image.
  • The processing procedure is to take the next frame of image as the current frame image, repeat steps 110 to 130 to obtain the second segmented image of the new current frame image, then obtain the next frame of image again, and repeat the above process until a corresponding second segmented image is obtained for each frame of image in the video data, at which point image segmentation of the video data is achieved.
  • the current frame image can be processed according to actual needs.
  • In an embodiment, the method further includes: acquiring a target background image, where the target background image contains the target background; and replacing the background of the current frame image according to the target background image and the second segmented image, so as to obtain a new current frame image.
  • the target background refers to a new background used after the background is replaced.
  • the target background image is the image that contains the target background.
  • the target background image and the second segmented image have the same resolution.
  • the target background image may be an image selected by the user of the image segmentation device, or may be a default image of the image segmentation device.
  • the background of the current frame image is replaced to obtain a replaced image.
  • the image after the background replacement is recorded as the new image of the current frame.
  • The background replacement method is: determining, by using the second segmented image, the pixels of the current frame image that belong to the portrait area and the pixels that belong to the background area; after that, retaining the portrait area and replacing the corresponding background area with the target background in the target background image, so as to obtain the new current frame image.
  • Exemplarily, the portrait area can be retained by I × S2 (that is, the current frame image I is multiplied by the second segmented image S2, so that the pixels of the current frame image corresponding to pixels with a value of 1 in the second segmented image are retained), and the background area can be replaced by B × (1 − S2) (that is, the target background image B is multiplied by (1 − S2), so that the pixels of the target background image corresponding to pixels with a value of 0 in the second segmented image are retained). The new current frame image is then I × S2 + B × (1 − S2).
  • In this way, a new current frame image corresponding to each current frame image can be obtained through the second segmented image. After that, the new current frame images can form new video data after background replacement.
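  • A minimal numpy sketch of this background replacement, assuming the current frame image I, the target background image B, and the second segmented image S2 share the same resolution:

```python
import numpy as np

def replace_background(frame, target_background, s2):
    """Compose the new frame as I * S2 + B * (1 - S2).

    frame: current frame image I, shape (H, W, 3)
    target_background: target background image B, shape (H, W, 3)
    s2: second segmented image, shape (H, W), values 0 (background) / 1 (portrait)
    """
    s2 = s2.astype(np.float32)[..., None]  # (H, W, 1) so it broadcasts over channels
    new_frame = frame * s2 + target_background * (1.0 - s2)
    return new_frame.astype(frame.dtype)
```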
  • In the above description, the video data is assumed to include the target object. In practice, some frames of the video data may not include the target object; in this case, the pixel values of the corresponding segmented image are all 0.
  • each frame of the video data is input into the image segmentation model to obtain the corresponding first segmented image, and then the first segmented image is smoothed to obtain the second segmented image
  • the video data can be accurately segmented, especially the online video data, which ensures the processing speed of the online video data.
  • Owing to the self-learning of the image segmentation model, the method can be applied to video data with complex images, and in application it can be used directly by simply deploying the image segmentation model, without artificial prior information, which reduces the complexity of image segmentation.
  • the application scenarios of image segmentation methods are expanded.
  • FIG. 2 is a flowchart of another image segmentation method provided by an embodiment of the present application. This image segmentation method is based on the above-mentioned image segmentation method, and exemplifies the training process of the image segmentation model. Referring to Figure 2, the image segmentation method specifically includes:
  • Step 210 Acquire a training data set, where the training data set includes multiple original images.
  • the training data refers to the data that the image segmentation model learns when training the image segmentation model.
  • the training data is in the form of images, so the training data is referred to as the original image, and the original image and the video data contain the same type of target objects.
  • a training dataset refers to a dataset containing a large number of original images. That is, during the training process, a large number of original images are selected from the training data set for the image segmentation model to learn, so as to improve the accuracy of the image segmentation model.
  • Video data contains a large number of images. If the original images were collected from video data, the images would need to be collected frame by frame, which would consume a lot of labor and production cost, and each collected original image would contain a large amount of repeated content, which is not conducive to training the image segmentation model. Therefore, in the embodiment, the training data set is constructed from independent original images instead of video data. The constructed training data set can then contain original images with different portrait poses in different scenes, where the scene is preferably a natural scene.
  • a plurality of natural scenes are preselected, and in each natural scene, a plurality of images containing human beings are captured by an image acquisition device as original images, wherein the postures of the human beings in the plurality of images are different.
  • Considering that the parameters of the image acquisition device (such as its position, aperture size, and degree of focus) and the lighting in the natural environment affect the performance of the image segmentation model, when constructing the training data set, multiple original images under different lighting conditions and different shooting parameters are collected for the same portrait pose in the same natural scene, so as to ensure the performance of the image segmentation model when processing video data with different scenes, portrait poses, lighting conditions, and shooting parameters.
  • an existing public image data set can also be used as the training data set, for example, the public data set Supervisely can be used as the training data set, or the public data set EG1800 can be used as the training data set.
  • Step 220 Construct a label data set according to the training data set, where the label data set includes a plurality of segmentation label images and a plurality of edge label images, and each original image corresponds to one segmentation label image and one edge label image.
  • the label data can be understood as reference data for determining whether the image segmentation model is accurate, which plays a role of supervision. If the output result of the image segmentation model is more similar to the corresponding label data, it means that the accuracy of the image segmentation model is higher, that is, the performance is better; otherwise, the accuracy of the image segmentation model is lower. It can be understood that the process of training the image segmentation model is the process of making the output result of the image segmentation model more similar to the corresponding label data.
  • the image segmentation model outputs a segmented image and an edge image corresponding to the original image, wherein the segmented image refers to a binary image obtained by performing image segmentation on the target object in the original image.
  • the segmented image output by the image segmentation model in the training process is recorded as the segmentation result image.
  • the edge image refers to a binary image representing the edge between the portrait area and the background area in the original image.
  • the edge image output by the image segmentation model in the training process is recorded as the edge result image.
  • In the embodiment, the label data is set according to the output results of the image segmentation model and includes the segmentation label image and the edge label image, where the segmentation label image corresponds to the segmentation result image and serves as its supervision reference.
  • The edge label image corresponds to the edge result image and serves as its supervision reference.
  • both the edge label image and the segmented label image can be obtained from the above-mentioned original image.
  • the portrait area, background area and edge area are marked in each original image, and then the edge label image and segmentation label image are obtained according to the portrait area, background area and edge area.
  • In another embodiment, manual annotation is used to mark the portrait region and the background region in each original image, a segmentation label image is then obtained according to the portrait region and the background region, and an edge label image is obtained according to the segmentation label image.
  • step 220 includes steps 221-225:
  • Step 221 Acquire an annotation result for the original image.
  • the labeling result refers to the result obtained after labeling the portrait area and background area in the original image.
  • the labeling result is obtained by manual labeling, that is, the portrait area and the background area are manually marked in the above-mentioned original image, and then the image segmentation device obtains the labeling result according to the marked portrait area and background area.
  • Step 222 Obtain a corresponding segmented label image according to the labeling result.
  • the pixel value of each pixel included in the portrait region in the original image is changed to 255, and the pixel value of each pixel included in the background region in the original image is changed to 0, thereby obtaining the segmented label image.
  • the segmented label image is a binary image.
  • Step 223 Perform an erosion operation on the segmented label image to obtain an erosion image.
  • the erosion operation can be understood as reducing and refining the white area (ie, the portrait area) with a pixel value of 255 in the segmented label image.
  • the image obtained by performing the erosion operation on the segmented label image is recorded as the erosion image. It can be understood that the number of pixels occupied by the white area in the eroded image is smaller than the number of pixels occupied by the white area in the segmented label image, and the white area in the segmented label image can completely cover the white area in the eroded image.
  • Step 224 perform a Boolean operation on the segmented label image and the eroded image to obtain an edge label image corresponding to the original image.
  • Boolean operations include union, intersection, and subtraction.
  • a plurality of objects that perform Boolean operations are operation objects.
  • the operation objects include the segmented label image and the eroded image, more specifically, the segmented label image and the white area in the eroded image.
  • the result obtained by the Boolean operation may be recorded as a Boolean object.
  • the Boolean object is an edge label image.
  • union means that the resulting Boolean object contains the volume of the two operands.
  • the Boolean object obtained by combining the segmented label image and the eroded image is the white area in the segmented label image.
  • Intersection means that the resulting Boolean object contains only the common volume of the two operation objects (that is, only the overlapping positions). Since the white area in the segmented label image completely covers the white area in the eroded image, the Boolean object obtained by intersecting the segmented label image with the eroded image is the white area in the eroded image.
  • Subtraction means that the Boolean object contains the volume of the operation object from which the intersection volume is subtracted.
  • The Boolean object obtained by subtracting the eroded image from the segmented label image is the white area that remains after the white area corresponding to the eroded image is removed from the white area of the segmented label image. It can be understood that, since the eroded image is obtained by shrinking the white area of the segmented label image, the edges of the white areas in the eroded image and in the segmented label image are highly similar, so the subtraction yields a white area that represents only the edge, that is, the edge label image.
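  • As an illustrative sketch of the erosion and subtraction steps (the structuring element size and iteration count are assumptions), OpenCV could be used as follows:

```python
import cv2
import numpy as np

def make_edge_label(segmentation_label, ksize=5, iterations=1):
    """Derive the edge label image from a segmentation label image
    (portrait pixels = 255, background pixels = 0)."""
    kernel = np.ones((ksize, ksize), np.uint8)
    eroded = cv2.erode(segmentation_label, kernel, iterations=iterations)  # shrink the white area
    edge_label = cv2.subtract(segmentation_label, eroded)                  # Boolean subtraction
    return edge_label
```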
  • Step 225 Obtain a label data set according to the segmented label image and the edge label image.
  • A label data set is formed by the segmentation label images and the edge label images. It can be understood that the segmentation label images and the edge label images can be considered as ground truth, that is, the correct labels.
  • Step 230 Train the image segmentation model according to the training data set and the label data set.
  • an original image is input to the image segmentation model, and a loss function is constructed according to the output result of the image segmentation model and the corresponding label data in the label dataset, and then the model parameters of the image segmentation model are updated according to the loss function.
  • another original image is input into the updated image segmentation model to construct the loss function again, and the model parameters of the image segmentation model are updated again according to the loss function, and the above training process is repeated until the loss function converges.
  • When the values of the loss function obtained in successive calculations fall within a set range, the loss function can be considered to have converged, and the accuracy of the output results of the image segmentation model can be regarded as stable; therefore, the image segmentation model can be considered to be trained.
  • the specific structure of the image segmentation model can be set according to the actual situation.
  • In the embodiment, an image segmentation model that includes a normalization module, an encoding module, channel confusion modules, residual modules, multiple upsampling modules, an output module, and an edge module is taken as an example for description.
  • the image segmentation model is exemplarily described with the structure shown in FIG. 3 .
  • FIG. 3 is a schematic structural diagram of an image segmentation model provided by an embodiment of the present application. Referring to FIG. 3, the image segmentation model includes a normalization module 21, an encoding module 22, four channel confusion modules 23, three residual modules 24, four multiple upsampling modules 25, an output module 26, and an edge module 27.
  • step 230 includes steps 231-2310:
  • Step 231 Input the original image to the normalization module to obtain a normalized image.
  • FIG. 4 is a schematic diagram of an original image provided by an embodiment of the present application.
  • the original image contains a portrait area, and it should be noted that the original image used in Figure 4 comes from the public dataset Supervisely.
  • normalization refers to the process of performing a series of standard processing and transformation on the image to transform the image into a fixed standard form.
  • the obtained standard image is called a normalized image.
  • Normalization is divided into linear normalization and nonlinear normalization.
  • the original image is processed by means of linear normalization.
  • linear normalization is to normalize the pixel values in each image from [0, 255] to [-1, 1], and the resolution of the obtained normalized image is equal to the resolution of the image before linear normalization .
  • the normalization module is a module that implements a linear normalization operation. After the original image is input to the normalization module, the normalization module outputs a normalized image with a pixel value of [-1, 1].
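  • A minimal sketch of this linear normalization, assuming an 8-bit input image:

```python
import numpy as np

def normalize_image(image_uint8):
    """Linearly map pixel values from [0, 255] to [-1, 1];
    the resolution of the image is unchanged."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0
```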
  • Step 232 Use the encoding module to obtain multi-layer image features of the normalized image, where the resolutions of the image features of each layer are different.
  • the encoding module is used to extract features in the normalized image.
  • the extracted features are recorded as image features.
  • the image features may reflect information such as color features, texture features, shape features, and spatial relationship features in the normalized image, including global information and/or local information.
  • the encoding module is a lightweight network, where the lightweight network refers to a neural network with a small amount of parameters, a small amount of computation, and a short inference time.
  • the type of the lightweight network used by the encoding module can be selected according to the actual situation.
  • In the embodiment, the encoding module 22 being a MobileNetV2 network is taken as an example for description.
  • After the normalized image passes through MobileNetV2, multi-layer image features are output, where the resolutions of the image features of the layers are different and are related by integer multiples; optionally, the resolution of each layer of image features is smaller than the resolution of the original image.
  • The image features of the layers are arranged from top to bottom in order of resolution from high to low, that is, the image features with the highest resolution are located in the highest layer and the image features with the lowest resolution are located in the lowest layer. It can be understood that the number of layers of image features output by the encoding module can be set according to the actual situation. For example, when the resolution of the original image is 224×224, the encoding module outputs four layers of image features. In this case, referring to FIG. 3:
  • the resolution of the highest-layer (first-layer) image features is 112×112 (denoted as Feature112×112 in FIG. 3);
  • the resolution of the second-highest-layer (second-layer) image features is 56×56 (denoted as Feature56×56 in FIG. 3);
  • the resolution of the second-lowest-layer (third-layer) image features is 28×28 (denoted as Feature28×28 in FIG. 3);
  • the resolution of the lowest-layer (fourth-layer) image features is 14×14 (denoted as Feature14×14 in FIG. 3).
  • the image features of each layer contain more and more information from the bottom to the top.
  • the encoding module can be understood as an encoder in the image segmentation model.
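  • A hedged sketch of how such multi-resolution encoder features could be collected from a torchvision MobileNetV2 backbone using forward hooks; the tap indices are assumptions chosen so that a 224×224 input yields 112×112, 56×56, 28×28, and 14×14 feature maps, and are not specified by the embodiment:

```python
import torch
from torchvision.models import mobilenet_v2

backbone = mobilenet_v2(weights=None).features
tap_indices = [1, 3, 6, 13]  # assumed stages producing 112x112, 56x56, 28x28, 14x14 maps
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output  # store the intermediate feature map
    return hook

for idx in tap_indices:
    backbone[idx].register_forward_hook(make_hook(f"layer{idx}"))

x = torch.randn(1, 3, 224, 224)  # normalized image in [-1, 1]
_ = backbone(x)
for name, feat in features.items():
    print(name, tuple(feat.shape))
```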
  • Step 233 Input the image features of each layer into the corresponding channel confusion module respectively to obtain multi-layer confusion features, and each layer of image features corresponds to a channel confusion module.
  • The channel confusion module is used to fuse the features between the channels within a layer, so as to enrich the information contained in the image features of each layer and ensure the accuracy of the image segmentation model without increasing the subsequent calculation amount. It can be understood that each layer of image features corresponds to one channel confusion module. As shown in FIG. 3, the four layers of image features correspond to four channel confusion modules 23, and each channel confusion module 23 is used to fuse the image features between multiple channels in the corresponding layer.
  • In the embodiment, the channel confusion module is composed of a 1×1 convolution layer, a batch normalization (BN) layer, and an activation function layer, where the activation function layer adopts the ReLU activation function.
  • The 1×1 convolution layer is used to realize the confusion of image features between channels, and the BN layer plus the activation function layer can make the confused image features more stable.
  • the features output by the channel confusion module are recorded as confusion features. It can be understood that each layer of image features has corresponding confusion features, and the resolution of the confusion features and image features in the same layer is the same. In one embodiment, except for the confusing features with the lowest resolution, other confusing features are central layer features, that is, other layers may be considered as network central layers.
  • In FIG. 3, the confusion features of the lowest layer are denoted as Decode14×14, and the confusion features of the other layers are denoted as Center28×28, Center56×56, and Center112×112, respectively.
  • the digital part represents the resolution.
  • the confusion feature output by the channel confusion module can also be regarded as the feature obtained after decoding the image feature, that is, the channel confusion module can also realize the function of decoding in addition to the confusion feature.
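  • A sketch of one possible channel confusion module as described above (a 1×1 convolution followed by BN and ReLU); the channel counts in the example are assumptions:

```python
import torch
import torch.nn as nn

class ChannelConfusion(nn.Module):
    """1x1 convolution to mix information across channels, followed by BN + ReLU."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# e.g. confusion of a 96-channel 14x14 encoder feature (channel counts assumed)
confuse = ChannelConfusion(96, 128)
decode_14 = confuse(torch.randn(1, 96, 14, 14))
```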
  • Step 234 Upsample the confusion features of each layer, except the confusion feature with the highest resolution, and fuse each upsampled confusion feature with the confusion feature of the next higher resolution to obtain the fusion feature corresponding to that higher resolution.
  • Upsampling can be understood as enlarging the feature to enlarge the resolution of the feature.
  • the up-sampling is implemented by a linear interpolation method, that is, a suitable interpolation algorithm is used to insert new elements between the obfuscated features, so as to expand the resolution of the obfuscated features.
  • the resolution of the confusion feature can be enlarged by up-sampling, so that the enlarged resolution is equal to the one-level higher resolution.
  • The next-higher resolution refers to the resolution that is higher than, and closest to, the resolution currently being upsampled; correspondingly, the resolution being upsampled can be regarded as the next-lower resolution relative to that next-higher resolution.
  • In the embodiment, the resolution of each layer is one level higher than the resolution of the layer below it. It can be understood that, since the resolution of the confusion feature of any layer is related to its next-higher resolution by an integer multiple, the upsampling factor can be determined according to that multiple.
  • the resolution of the confusion feature of a certain layer is 0.5 times the resolution of the higher level
  • the resolution of the confusion feature of this layer can be enlarged by means of double upsampling.
  • In the embodiment, the confusion feature of the higher resolution is fused with the corresponding upsampled confusion feature of the lower resolution through a skip connection, so as to reuse the confusion features and ensure that more information is used in subsequent processing. It can be understood that image segmentation is a kind of dense pixel prediction (dense prediction); therefore, the image segmentation model requires richer features.
  • In the embodiment, the feature obtained after fusion is recorded as a fusion feature. In this case, except for the confusion feature with the lowest resolution, each layer of confusion features has a corresponding fusion feature.
  • the operation of feature fusion can be understood as a concatenate (vector splicing) operation.
  • Exemplarily, the size of the fusion feature of each layer is the sum of the size of the confusion feature of this layer and the size of the upsampled confusion feature of the lower resolution. For example, C in [N, C, H, W] of the confusion feature of this layer before fusion is 3, and C in [N, C, H, W] of the upsampled lower-resolution confusion feature before fusion is also 3, where N is the batch number, C is the number of channels, H is the height, and W is the width, and H×W can be understood as the resolution. It should be noted that, since the highest resolution has no next-higher resolution, there is no need to upsample the confusion feature with the highest resolution.
  • Referring to FIG. 3, the confusion feature Decode14×14 of the lowest layer is upsampled by a factor of 2 so that its resolution is doubled, that is, a feature with a resolution of 28×28 is obtained.
  • The confusion feature Center28×28 of the next-higher resolution (i.e., the second-lowest layer) is fused, through a skip connection, with the 28×28 feature obtained by double upsampling of the lowest layer, so as to obtain the fusion feature of the second-lowest layer.
  • Similarly, the confusion feature Center28×28 of the second-lowest layer is upsampled by a factor of 2 so that its resolution is doubled, that is, a feature with a resolution of 56×56 is obtained.
  • The confusion feature Center56×56 is fused, through a skip connection, with the 56×56 feature obtained by double upsampling of the second-lowest layer, so as to obtain the fusion feature of the second-highest layer. In the same way, the fusion feature of the highest layer is obtained.
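  • A minimal PyTorch sketch of the double upsampling and skip-connection fusion between adjacent layers; the tensor shapes and channel counts are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fuse_with_higher(lower_confusion, higher_confusion):
    """Upsample the lower-resolution confusion feature by 2x (bilinear interpolation)
    and concatenate it with the next-higher-resolution confusion feature."""
    upsampled = F.interpolate(lower_confusion, scale_factor=2,
                              mode="bilinear", align_corners=False)
    return torch.cat([higher_confusion, upsampled], dim=1)  # concatenate along channels

decode_14 = torch.randn(1, 128, 14, 14)   # assumed channel counts
center_28 = torch.randn(1, 128, 28, 28)
fusion_28 = fuse_with_higher(decode_14, center_28)  # shape (1, 256, 28, 28)
```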
  • Step 235 Input the fusion features of each layer into the corresponding residual modules respectively to obtain multi-layer first decoding features, where each layer of fusion features corresponds to one residual module, and the confusion feature with the lowest resolution is used as the first decoding feature with the lowest resolution.
  • the residual module is used to further extract and decode the fusion features, and the residual module may include one or more residual blocks (Residual Block, RS Block).
  • In the embodiment, the residual module including one residual block is taken as an example for description, and the structure of the residual block can be set according to the actual situation. It can be understood that each layer of fusion features corresponds to one residual module, and the feature output by the residual module has the same resolution as the fusion feature of that layer. Since the residual module further extracts and decodes the fusion feature, that is, the feature output by the residual module is a decoded feature, the feature output by the residual module is recorded as the first decoding feature in the embodiment.
  • Since the confusion feature with the lowest resolution has no corresponding fusion feature, there is no need to set a residual module for the lowest-resolution layer; the confusion feature with the lowest resolution can be directly regarded as the first decoding feature of this layer.
  • In this way, the corresponding first decoding feature of each layer can be obtained.
  • Referring to FIG. 3, the model includes three residual modules 24. The first decoding feature output after the fusion feature of the second-lowest layer is input to its residual module is denoted as RS Block28×28, that is, its resolution is 28×28.
  • The first decoding feature output after the fusion feature of the second-highest layer is input to its residual module is denoted as RS Block56×56, that is, its resolution is 56×56.
  • The first decoding feature output after the fusion feature of the highest layer is input to its residual module is denoted as RS Block112×112, that is, its resolution is 112×112.
  • The first decoding feature of the lowest layer is Decode14×14.
  • Step 236 Input the first decoding feature of each layer into the corresponding multiple upsampling module to obtain multiple second decoding features, where each layer of first decoding features corresponds to one multiple upsampling module, and each second decoding feature has the same resolution as the original image.
  • the multiple upsampling module is configured to perform multiple upsampling on the first decoded feature, so that the resolution after the multiple upsampling is equal to the resolution of the original image.
  • The specific factor of the multiple upsampling can be determined according to the resolution of the first decoding feature and the resolution of the original image. For example, if the resolution of the first decoding feature is 14×14 and the resolution of the original image is 224×224, then the first decoding feature needs to be upsampled by a factor of 16 to obtain a decoded feature with a resolution of 224×224.
  • The final output binary image (segmentation result image) is used to distinguish the foreground (such as the portrait area) from the background; therefore, the segmentation task of the image segmentation model is a two-class segmentation task. In this case, before obtaining the segmentation result image, it is necessary to obtain decoding features with 2 channels.
  • Therefore, in addition to performing multiple upsampling on the first decoding feature, the multiple upsampling module also needs to change the number of channels of the upsampled first decoding feature to 2, because multiple upsampling only changes the resolution of the first decoding feature of each layer and does not change its number of channels.
  • Exemplarily, a 1×1 convolutional layer is set in the multiple upsampling module, that is, after the first decoding feature is upsampled by the corresponding factor, a 1×1 convolutional layer is connected to change the number of channels of the upsampled first decoding feature to 2.
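  • A sketch of one possible multiple upsampling module (upsample to the original resolution, then a 1×1 convolution down to 2 channels); the interpolation mode and input channel count are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultipleUpsample(nn.Module):
    """Upsample a first decoding feature to the original image resolution
    and reduce it to 2 channels (foreground / background)."""
    def __init__(self, in_channels, target_size=(224, 224), num_classes=2):
        super().__init__()
        self.target_size = target_size
        self.proj = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, x):
        x = F.interpolate(x, size=self.target_size, mode="bilinear", align_corners=False)
        return self.proj(x)

# e.g. the 14x14 first decoding feature upsampled 16x to 224x224 (channel count assumed)
up = MultipleUpsample(in_channels=128)
second_decoding = up(torch.randn(1, 128, 14, 14))  # shape (1, 2, 224, 224)
```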
  • the image segmentation model can also perform multi-classification segmentation tasks. At this time, before obtaining the final output image, it is also necessary to obtain decoding features with the number of channels equal to the number of classifications. For example, if the image segmentation model performs a five-category segmentation task, before finally outputting a five-category segmentation result image, it is necessary to obtain 5-channel decoding features.
  • the pixel values of the pixels in the segmentation label image need to be converted from 0 and 255 to 0 and 1, that is, the pixel value of 0 is converted to 0.
  • a pixel with a pixel value of 255 is converted to 1.
  • the feature output by the multiple upsampling module is denoted as the second decoding feature.
  • the first decoding feature of each layer corresponds to a multiple upsampling module
  • the second decoding feature with 2 channels and the same resolution as the original image can be obtained through the multiple upsampling module.
  • the second decoding feature can be considered as a network prediction output obtained after decoding the image features of the current layer.
  • the second decoding feature of each layer can be regarded as a temporary output result obtained after decoding the image feature of the layer.
  • the final segmentation result image can be obtained by temporarily outputting the result.
  • Step 237 Combine the multi-layer second decoding features and input them to the output module to obtain a segmentation result image.
  • The output module integrates the second decoding features of each layer to obtain a segmentation result image (i.e., a binary image).
  • the second decoding features of each layer are first fused (ie, concatenate), so that the output module can obtain more abundant features, thereby restoring a more accurate image.
  • the output module uses the fused second decoding feature to obtain the segmentation result image.
  • The specific process of the output module is: connect the fused second decoding feature to a 1×1 convolutional layer to obtain a 2-channel decoding feature.
  • It can be understood that the fused second decoding feature is merely the second decoding features of the layers concatenated together; through the 1×1 convolutional layer in the output module, the fused second decoding feature can be further decoded, so that the final decoding feature is output with reference to the second decoding feature of each layer.
  • The final decoding feature has 2 channels and describes the result of the binary classification, that is, whether each pixel in the original image belongs to the portrait area or the background area. After that, the decoding feature is passed through the softmax function and the argmax function to obtain the segmentation result image. That is, the output module consists of a 1×1 convolutional layer and an activation function layer.
  • the activation function layer consists of the softmax function and the argmax function.
  • The data processed by the softmax function can be understood as the output of the logits layer; that is, the softmax function interprets the meaning represented by the decoding features output by the 1×1 convolution layer and obtains a probability description.
  • the argmax function is a common function to obtain the output result, that is, the corresponding segmentation result image is output by the argmax function.
  • the four second decoding features are fused and input to the output module 26.
  • In the output module, a 1×1 convolutional layer is first passed through to obtain a 2-channel decoding feature (denoted as Refine224×224 in FIG. 3).
  • Then the segmentation result image is obtained through the activation function layer (denoted as output224×224 in FIG. 3).
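  • A sketch of one possible output module (concatenate the second decoding features, apply a 1×1 convolution, then softmax and argmax); the number of branches and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    """Fuse the per-layer second decoding features and produce the segmentation result image."""
    def __init__(self, num_branches=4, num_classes=2):
        super().__init__()
        # each second decoding feature is assumed to have num_classes channels before fusion
        self.refine = nn.Conv2d(num_branches * num_classes, num_classes, kernel_size=1)

    def forward(self, second_decodings):
        fused = torch.cat(second_decodings, dim=1)  # concatenate along channels
        logits = self.refine(fused)                 # Refine: 2-channel decoding feature
        probs = torch.softmax(logits, dim=1)        # activation function layer (softmax)
        return torch.argmax(probs, dim=1)           # binary segmentation result (0/1)

out_mod = OutputModule()
seg = out_mod([torch.randn(1, 2, 224, 224) for _ in range(4)])  # shape (1, 224, 224)
```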
  • the pixel value of each pixel in the segmentation result image output by the image segmentation model is 0 or 1, wherein, the pixel with a pixel value of 0 is a pixel in the background area, and a pixel with a pixel value of 1 is in the portrait area. pixel.
  • the pixel value of each pixel is multiplied by 255.
  • FIG. 5 is a schematic diagram of a segmentation result image provided by an embodiment of the present application. After inputting the training data shown in FIG. 4 into the image segmentation model shown in FIG. 3, a segmentation result image is obtained. After multiplying by 255, the segmentation result image shown in Figure 5 can be obtained.
  • Step 238 Input the first decoded feature with the highest resolution to the edge module to obtain an edge result image.
  • In the embodiment, an edge module is set in the image segmentation model to perform additional supervision on the first decoding feature with the highest resolution; that is, the edge module acts as a regularization constraint to improve the ability of the image segmentation model to learn edges.
  • the specific structural embodiment of the edge module is not limited.
  • In the embodiment, the edge module being a 1×1 convolutional layer is taken as an example for description. Exemplarily, after the first decoding feature with the highest resolution is input into the edge module, an edge feature with 2 channels and the same resolution as the original image can be obtained, and through this edge feature a binary image that expresses only the edge can be obtained.
  • the binary image expressing the edge is recorded as the edge result image.
  • the pixel value of each pixel in the edge result image is 0 or 1, wherein, the pixel with the pixel value of 1 represents the pixel where the edge is located, and the pixel with the pixel value of 0 represents the pixel where the non-edge is located.
  • the first decoded feature with the highest resolution has richer detailed information, therefore, more accurate edge features can be obtained through the first decoded feature with the highest resolution.
  • Referring to FIG. 3, an edge feature with a resolution of 224×224 can be obtained, which is denoted as edge224×224 in FIG. 3.
  • FIG. 6 is a schematic diagram of an edge result image provided by an embodiment of the present application.
  • the training data shown in FIG. 4 is input into the image segmentation model shown in FIG. 3 to obtain an edge result image. After multiplying by 255, the edge result image shown in Figure 6 can be obtained.
  • Step 239 Construct a loss function according to each second decoding feature, the edge result image, the corresponding segmentation label image and the edge label image, and update the model parameters of the image segmentation model according to the loss function.
  • the loss function of the segmentation network model is composed of a segmentation loss function and an edge loss function.
  • the segmentation loss function can reflect the segmentation ability of the segmentation network model, and the segmentation loss function is obtained according to the second decoding feature of each layer and the segmentation label image.
  • a sub-loss function can be obtained based on the second decoding feature of each layer and the segmentation label image, and the segmentation loss function can be obtained by combining the sub-loss functions of each layer. It can be understood that the calculation method of each sub-loss function is the same.
  • the sub-loss function is calculated by the Iou function, and the Iou function can be defined as: the ratio of the area of the intersection of the predicted pixel region (that is, the second decoding feature) and the label pixel region (that is, the segmented label image) to the area of the union.
  • the Iou function can reflect the overlapping similarity between the binary image corresponding to the second decoding feature and the segmented label image, and at this time, the sub-loss function calculated by the Iou function can reflect the loss of overlapping similarity.
  • the edge loss function can reflect the ability of the segmentation network model to learn edges, and the edge loss function is obtained from the edge result image and the edge label image.
  • In the embodiment, the edge loss function adopts the focal loss, which is a common loss function that can reduce the weight of a large number of easy negative samples in training; it can also be understood as a form of hard example mining.
  • Exemplarily, the loss function of the segmentation network model is expressed as: Loss = loss_1 + loss_2 + ... + loss_n + loss_edge, where:
  • Loss represents the loss function of the segmentation network model;
  • n represents the total number of layers corresponding to the second decoding features;
  • loss_1 represents the sub-loss function calculated from the second decoding feature with the highest resolution and the corresponding segmentation label image, and loss_n represents the sub-loss function calculated from the second decoding feature with the lowest resolution and the corresponding segmentation label image;
  • A_n represents the second decoding feature with the lowest resolution;
  • B represents the corresponding segmentation label image;
  • Iou_n represents the overlap similarity between A_n and B;
  • loss_edge is the focal loss edge loss function.
  • Exemplarily, the image segmentation model has a total of n layers (n ≥ 2), that is, there are n layers of second decoding features.
  • n sub-loss functions can be obtained according to the second decoding features of the n layers and the segmentation label image.
  • The first layer has the highest resolution, and its corresponding sub-loss function is recorded as loss_1; the resolution of the second layer is the second highest, and its corresponding sub-loss function is recorded as loss_2; the resolution of the nth layer is the lowest, and its corresponding sub-loss function is recorded as loss_n. Since each sub-loss function is calculated in the same manner, the embodiment takes the nth-layer sub-loss function as an example for description.
  • Exemplarily, loss_n = 1 - Iou_n, where Iou_n = |A_n ∩ B| / |A_n ∪ B|; that is, loss_n represents the loss of the nth-layer Iou function.
  • A_n represents the second decoding feature of the nth layer, and B represents the corresponding segmentation label image.
  • A_n ∩ B represents the intersection of A_n and B, and A_n ∪ B represents their union.
  • Iou_n represents the overlap similarity between A_n and B; accordingly, 1 - Iou_n represents the loss of overlap similarity.
  • loss_edge represents the edge loss function and is a focal loss function: loss_edge(p_t) = -α_t (1 - p_t)^γ log(p_t).
  • p_t represents the predicted probability that a pixel in the edge result image is an edge.
  • α_t represents the balance weight coefficient, which is used to balance positive and negative samples.
  • γ represents the modulation coefficient, which is used to control the weight of hard and easy-to-classify samples.
  • The values of α_t and γ can be set according to the actual situation.
  • The loss_edge can be obtained from loss_edge(p_t) of each pixel in the edge result image. Specifically, loss_edge(p_t) is summed over all pixels and the mean is calculated; the calculated mean is used as loss_edge.
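  • A minimal PyTorch sketch of this composite loss, written with each sub-loss as 1 - Iou over a per-layer foreground probability map and a mean focal loss over the edge prediction; the α and γ values and the use of probability maps as inputs are assumptions:

```python
import torch

def iou_loss(pred_probs, label, eps=1e-6):
    """1 - IoU between the predicted foreground probability map and the 0/1 label."""
    inter = (pred_probs * label).sum()
    union = pred_probs.sum() + label.sum() - inter
    return 1.0 - (inter + eps) / (union + eps)

def focal_edge_loss(edge_probs, edge_label, alpha=0.25, gamma=2.0, eps=1e-6):
    """Mean focal loss over all pixels of the edge prediction."""
    p_t = torch.where(edge_label > 0.5, edge_probs, 1.0 - edge_probs)
    return (-alpha * (1.0 - p_t) ** gamma * torch.log(p_t + eps)).mean()

def total_loss(second_decodings, seg_label, edge_probs, edge_label):
    """Sum of per-layer IoU sub-losses plus the focal edge loss."""
    loss = sum(iou_loss(p, seg_label) for p in second_decodings)
    return loss + focal_edge_loss(edge_probs, edge_label)
```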
  • the model parameters of the image segmentation model can be updated according to the loss function, so that the performance of the updated image segmentation model is higher.
  • Step 2310 Select the next original image, and return to perform the operation of inputting the original image to the normalization module until the loss function converges.
  • After the image segmentation model stabilizes, the training is determined to be over, and the image segmentation model can then be applied to segment the portraits in the video data.
  • the method further includes: when the image segmentation model is not a network model recognizable by the forward inference framework, converting the image segmentation model into a network model recognizable by the forward inference framework.
  • the image segmentation model is trained in a corresponding framework, which is usually a framework such as tensorflow and pytorch.
  • the pytorch framework is used as an example for description.
  • the pytorch framework is mainly used for model design, training and testing. Since the image segmentation model runs in real time in an application on the image segmentation device, and the pytorch framework occupies a large amount of memory, running the image segmentation model under the pytorch framework inside that application would greatly increase the storage space occupied by the application.
  • the forward inference framework is generally aimed at a specific platform (such as an embedded platform), and different platforms have different hardware configurations. When the forward inference framework is deployed on a platform, it can exploit the hardware configuration of that platform to make reasonable use of resources and accelerate computation; that is, the forward inference framework can optimize and accelerate the models it runs locally.
  • the forward inference framework is mainly used for the prediction process of the model, where the prediction process includes the testing process of the model and the application process of the model, but does not include the training process of the model. The forward inference framework has a low dependence on the GPU and is lightweight, so it does not make the application occupy a large amount of storage space. Therefore, when applying the image segmentation model, the image segmentation model is run in a forward inference framework. In one embodiment, before applying the image segmentation model, it is determined whether the image segmentation model runs in the forward inference framework; if it does, the image segmentation model is applied directly.
  • otherwise, the image segmentation model is converted into a network model recognizable by the forward inference framework.
  • the specific type of the forward reasoning framework can be set according to the actual situation, for example, the forward reasoning framework is an openvino framework.
  • a specific way to convert the image segmentation model under the pytorch framework into an image segmentation model under the openvino framework is: first use an existing pytorch conversion tool to convert the image segmentation model into an Open Neural Network Exchange (ONNX) model, and then use the openvino conversion tool to convert the ONNX model into an image segmentation model under the openvino framework.
  • ONNX is a standard for representing deep learning models, which enables models to be transferred between different frameworks.
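As a hedged illustration of this two-step conversion (not the patent's prescribed commands), a PyTorch model could first be exported to ONNX roughly as follows; the input shape, file names and opset version are assumptions.

```python
import torch

def export_to_onnx(model: torch.nn.Module, onnx_path: str = "segmentation.onnx"):
    """Export the trained segmentation model to ONNX so that the OpenVINO
    Model Optimizer can subsequently convert it to an openvino (IR) network."""
    model.eval()
    dummy = torch.randn(1, 3, 224, 224)     # illustrative fixed input resolution
    torch.onnx.export(model, dummy, onnx_path,
                      input_names=["image"], output_names=["mask"],
                      opset_version=11)

# The resulting ONNX file would then be passed to OpenVINO's converter,
# e.g. something like:  mo --input_model segmentation.onnx
```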
  • the method further includes: deleting the edge module.
  • the advantage of setting the edge module in the training process is to improve the learning ability of the image segmentation model for the edge, thereby ensuring the accuracy of the segmentation result image.
  • after training, the edge module can be deleted, that is, the data processing of the edge module is cancelled when the image segmentation model is applied, so as to reduce the data processing amount of the image segmentation model and improve the processing speed.
  • the encoding module of the image segmentation model adopts a lightweight network, which can reduce the amount of data processing during encoding.
  • the channel confusion module can confuse the image features between channels without significantly increasing the amount of calculation, so as to enrich the feature information within the channels and ensure the accuracy of the image segmentation model.
  • by setting the edge module, the learning ability of the image segmentation model for edges is improved, which further ensures the accuracy of the image segmentation model; in the application process, the edge module is deleted to reduce the calculation amount of the image segmentation model.
  • converting the image segmentation model into an image segmentation model under the forward inference framework can reduce the dependence of the image segmentation model on the GPU and reduce the storage space occupied by the application running the image segmentation model.
  • the trained image segmentation model can accurately segment the portrait area in the video data without human prior information or interaction. After testing, on an ordinary PC with an integrated graphics card, processing each frame of image in the video data takes only about 20 ms, so real-time automatic portrait segmentation can be realized.
  • the image segmentation model further includes: a decoding module.
  • the method further includes: inputting the first decoding feature with the highest resolution to the decoding module to obtain a corresponding new first decoding feature.
  • FIG. 7 is a schematic structural diagram of another image segmentation model provided by an embodiment of the present application. Compared with the image segmentation model shown in FIG. 3 , the image segmentation model shown in FIG. 7 further includes a decoding module 28 .
  • the first decoding feature with the highest resolution is passed through a decoding module for further decoding, that is, a new first decoding feature is obtained.
  • the new first decoding feature can be considered as the first decoding feature finally obtained by the highest-resolution level, and the new first decoding feature is then input into the multiple upsampling module and the edge module set at the highest-resolution level. It can be understood that the number of channels and the resolution of the new first decoding feature are the same as those of the original first decoding feature.
  • the first decoding feature after the decoding module 28 in FIG. 7 is denoted as Refine 112×112, and its resolution is the same as that of RS Block 112×112.
  • the decoding module is a convolutional network, and the number and structure of the convolutional layers are not limited.
  • the accuracy of the first decoding feature of the highest layer can be improved, thereby improving the accuracy of the image segmentation model.
  • among the first decoding features, the lower the resolution, the more high-level semantic features it carries, and the higher the resolution, the richer the detail features it carries.
  • for the first decoding feature with the highest resolution, a sawtooth phenomenon appears if it is up-sampled directly, that is, the detail features become jagged. Therefore, a decoding module is added for it, so that the transition of the finally obtained new first decoding feature is more uniform and jaggedness is avoided.
  • the first decoding features of the other layers basically show no aliasing after up-sampling, and even if no decoding module is set for them, the accuracy of the image segmentation model is not affected; therefore, there is no need to set a decoding module for the other layers. It can be understood that, in practical applications, if aliasing occurs after up-sampling the first decoding features of other layers, a decoding module may also be set for them, so as to improve the accuracy of the image segmentation model.
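Since the patent only states that the decoding module is a convolutional network whose output keeps the channel count and resolution of the highest-resolution first decoding feature, the following is merely one plausible sketch of such a refinement block; the depth and layer choices are assumptions.

```python
import torch.nn as nn

class RefineBlock(nn.Module):
    """Illustrative decoding module: refines the highest-resolution first decoding
    feature while preserving its channel count and spatial resolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)  # same shape in, same shape out
```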
  • the target object is described as a human being, and in practical applications, the target object can also be any other object.
  • FIG. 8 is a schematic structural diagram of an image segmentation apparatus provided by an embodiment of the present application.
  • the image segmentation apparatus includes: a data acquisition module 301 , a first segmentation module 302 , a second segmentation module 303 and a repeated segmentation module 304 .
  • the data acquisition module 301 is used to acquire the current frame image in the video data, and the target object is displayed in the video data;
  • the first segmentation module 302 is used to input the current frame image into the trained image segmentation model, to obtain the first segmented image based on the target object;
  • the second segmentation module 303 is used for smoothing the first segmented image to obtain the second segmented image based on the target object;
  • the repeated segmentation module 304 is used to take the next frame image in the video data as the current frame image, and return to perform the operation of inputting the current frame image into the trained image segmentation model, until each frame image in the video data obtains a corresponding second segmented image.
  • a training acquisition module, for acquiring a training data set, where the training data set includes a plurality of original images; a label construction module, for constructing a label data set according to the training data set, where the label data set contains multiple segmentation label images and multiple edge label images, and one original image corresponds to one segmentation label image and one edge label image; and a model training module, for training the image segmentation model according to the training data set and the label data set.
  • the image segmentation model includes: a normalization module 21, an encoding module 22, a channel confusion module 23, a residual module 24, and a multiple upsampling module 25 , an output module 26 and an edge module 27 .
  • the above model training module includes: a normalization unit, for inputting the original image into the normalization module 21 to obtain a normalized image; an encoding unit, for obtaining, by using the encoding module 22, multi-layer image features of the normalized image, where the resolution of each layer of image features is different;
  • a channel confusion unit, for inputting each layer of image features into the corresponding channel confusion module 23 respectively, so as to obtain multi-layer confusion features, where each layer of image features corresponds to one channel confusion module 23;
  • a fusion unit, for up-sampling each layer of confusion features other than the confusion feature with the highest resolution, and fusing it with the confusion feature of a higher resolution, so as to obtain the fusion feature of the higher-resolution layer;
  • each second decoding feature has the same resolution as the original image;
  • a segmentation output unit, for combining the multi-layer second decoding features and inputting them into the output module 26 to obtain the segmentation result image;
  • an edge output unit, for inputting the first decoding feature with the highest resolution into the edge module 27 to obtain an edge result image;
  • a parameter updating unit, for constructing a loss function according to each second decoding feature, the edge result image, and the corresponding segmentation label image and edge label image, and updating the model parameters of the image segmentation model according to the loss function;
  • an image selection unit, for selecting the next original image and returning to perform the operation of inputting the original image into the normalization module, until the loss function converges.
  • the image segmentation model further includes: a decoding module 28 .
  • the above model training module further includes: a decoding unit, for inputting each layer of fusion features into the corresponding residual module 24 respectively to obtain multi-layer first decoding features, and then inputting the first decoding feature with the highest resolution into the decoding module 28 to obtain the corresponding new first decoding feature.
  • the encoding module includes the MobileNetV2 network.
  • the loss function is expressed as: Loss = loss_iou_1 + loss_iou_2 + ... + loss_iou_n + loss_edge, where Loss represents the loss function, n represents the total number of layers corresponding to the second decoding features, loss_iou_1 represents the sub-loss function calculated from the second decoding feature with the highest resolution and the corresponding segmentation label image, loss_iou_n represents the sub-loss function calculated from the second decoding feature with the lowest resolution and the segmentation label image, A_n represents the second decoding feature with the lowest resolution, B represents the corresponding segmentation label image, Iou_n represents the overlap similarity between A_n and B, and loss_edge is the Focal loss function.
  • an edge deletion module is further included, which is used to delete the edge module after the loss function of the image segmentation model converges.
  • a frame conversion module is further included, which is used, after the image segmentation model is trained according to the training data set and the label data set, to convert the image segmentation model into a network model recognizable by the forward inference framework when the image segmentation model is not a network model recognizable by the forward inference framework.
  • the label construction module includes: a label acquisition unit, for acquiring the labeling result for the original image; a segmentation label obtaining unit, for obtaining the corresponding segmentation label image according to the labeling result; an erosion unit, for performing an erosion operation on the segmentation label image to obtain an eroded image; a Boolean unit, for performing a Boolean operation on the segmentation label image and the eroded image to obtain the edge label image corresponding to the original image; and a data set construction unit, for obtaining the label data set according to the segmentation label images and the edge label images.
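The erosion-plus-Boolean construction of the edge label can be sketched with OpenCV as follows; the structuring-element size is an assumption, and the Boolean operation is taken here to be an exclusive-or of the segmentation label with its eroded version, which leaves a thin band along the portrait/background boundary.

```python
import cv2
import numpy as np

def make_edge_label(seg_label: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """seg_label: binary (0/1) segmentation label image of the portrait region."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(seg_label.astype(np.uint8), kernel, iterations=1)
    # XOR keeps only the pixels removed by erosion, i.e. the portrait/background edge
    return cv2.bitwise_xor(seg_label.astype(np.uint8), eroded)
```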
  • the apparatus further includes: a target background acquisition module, for acquiring a target background image after the first segmented image is smoothed to obtain the second segmented image based on the target object, where the target background image contains a target background; and a background replacement module, for replacing the background of the current frame image according to the target background image and the second segmented image, so as to obtain a new image of the current frame.
  • the image segmentation device provided above can be used to execute the image segmentation method provided by any of the above embodiments, and has corresponding functions and beneficial effects.
  • the units and modules included are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized;
  • the specific names of the functional units are only for the convenience of distinguishing from each other, and are not used to limit the protection scope of the present application.
  • FIG. 9 is a schematic structural diagram of an image segmentation device provided by an embodiment of the present application.
  • the image segmentation device includes a processor 40, a memory 41, an input device 42, and an output device 43; the number of processors 40 in the image segmentation device may be one or more, and one processor 40 is taken as an example in FIG. 9.
  • the processor 40 , the memory 41 , the input device 42 , and the output device 43 in the image segmentation device may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 9 .
  • the memory 41 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the image segmentation method in the embodiments of the present application (for example, the data acquisition module 301, the first segmentation module 302, the second segmentation module 303 and the repeated segmentation module 304 in the image segmentation apparatus).
  • the processor 40 executes various functional applications and data processing of the image segmentation device by running the software programs, instructions and modules stored in the memory 41 , that is, to implement the above-mentioned image segmentation method.
  • the memory 41 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the image dividing apparatus, and the like.
  • memory 41 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the memory 41 may further include memory located remotely from the processor 40, and these remote memories may be connected to the image segmentation apparatus through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the input device 42 may be used to receive input numerical or character information, and to generate key signal input related to user settings and function control of the image segmentation apparatus.
  • the output device 43 may include a display device such as a display screen.
  • the above-mentioned image segmentation device includes the image segmentation apparatus, can be used to execute any of the image segmentation methods, and has corresponding functions and beneficial effects.
  • embodiments of the present application also provide a storage medium containing computer-executable instructions, when the computer-executable instructions are executed by a computer processor, for performing relevant operations in the image segmentation method provided by any embodiment of the present application , and has corresponding functions and beneficial effects.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product.
  • the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • the present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, where the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-persistent memory in computer readable media, random access memory (RAM) and/or non-volatile memory in the form of, for example, read only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media includes both persistent and non-permanent, removable and non-removable media, and storage of information may be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

Abstract

The embodiments of the present application relate to the technical field of image processing. Provided are an image segmentation method and apparatus, and a device and a storage medium. The image segmentation method comprises: acquiring the current frame of image in video data, wherein a target object is displayed in the video data; inputting the current frame of image into a trained image segmentation model, so as to obtain a first segmented image based on the target object; performing smoothing processing on the first segmented image, so as to obtain a second segmented image based on the target object; and taking the next frame of image in the video data as the current frame of image, and returning to carrying out the operation of inputting the current frame of image into the trained image segmentation model, until a corresponding second segmented image is obtained for each frame of image in the video data. By means of the method, the technical problem in the prior art of it not being possible to accurately carry out image segmentation on online video data can be solved.

Description

Image segmentation method, apparatus, device and storage medium

Technical Field
本申请实施例涉及图像处理技术领域,尤其涉及一种图像分割方法、装置、设备及存储介质。The embodiments of the present application relate to the technical field of image processing, and in particular, to an image segmentation method, apparatus, device, and storage medium.
Background Art
图像分割是图像处理中常见的技术之一,其用于精确提取出待处理图像中的感兴趣区域,并将感兴趣区域作为目标区域图像,以便于后续对目标区域图像的处理(如背景替换、扣取目标区域图像等处理)。基于人像的图像分割是图像分割领域中的一项重要应用。基于人像的图像分割是指将待处理图像中的人像区域和背景区域进行准确分离。当前,随着计算机和网络技术的发展,对在线的视频数据进行基于人像的图像分割具有重要的意义。如在线会议或在线直播等场景下,对在线的视频数据进行图像分割,以将视频数据中的人像区域和背景区域进行准确分离,之后,对背景区域进行背景图像的替换,以达到保护用户隐私的目的。Image segmentation is one of the common techniques in image processing, which is used to accurately extract the region of interest in the image to be processed, and use the region of interest as the target region image to facilitate subsequent processing of the target region image (such as background replacement). , deducting the image of the target area, etc.). Portrait-based image segmentation is an important application in the field of image segmentation. Portrait-based image segmentation refers to the accurate separation of the portrait area and the background area in the image to be processed. At present, with the development of computer and network technology, it is of great significance to perform portrait-based image segmentation for online video data. In scenarios such as online conferences or online live broadcasts, image segmentation is performed on the online video data to accurately separate the portrait area and the background area in the video data, and then the background image is replaced in the background area to protect user privacy. the goal of.
发明人在实现本申请的过程中,发现一些图像分割技术存在如下缺陷:图像分割主要包括基于阈值、基于区域、基于边缘以及基于图论和能量泛函的方法。其中,基于阈值的方法需要根据图像中的灰度特征进行分割,其缺陷在于仅适用于人像区域的灰度值均匀分布在背景区域灰度值之外的图像。基于区域的方法是按照空间邻域的相似性准则将图像分割成不同的区域,其缺陷在于无法处理复杂的图像。基于边缘的方法主要利用图像局部特征的不连续性(如人脸边缘的像素突变)得到人像区域的边界,其缺陷在于计算复杂度高。基于图论和能量泛函的方法主要利用图像的能量泛函进行人像分割,其缺陷在于计算量巨大且需要人为先验信息。由于上述技术的缺陷,使其无法适用于对在线的视频数据进行实时、简单、准确的图像分割的场景。In the process of realizing the present application, the inventor found that some image segmentation techniques have the following defects: image segmentation mainly includes methods based on threshold, region-based, edge-based, graph theory and energy functional. Among them, the threshold-based method needs to be segmented according to the grayscale features in the image, and its drawback is that it is only suitable for images in which the grayscale values of the portrait area are evenly distributed outside the grayscale values of the background area. The region-based method divides the image into different regions according to the similarity criterion of the spatial neighborhood, and its disadvantage is that it cannot handle complex images. The edge-based method mainly uses the discontinuity of local image features (such as the pixel mutation of the face edge) to obtain the boundary of the portrait region, and its disadvantage is that the computational complexity is high. The methods based on graph theory and energy functional mainly use the energy functional of the image to perform portrait segmentation, but the disadvantage is that the amount of calculation is huge and artificial prior information is required. Due to the defects of the above technology, it cannot be applied to the scene of real-time, simple and accurate image segmentation for online video data.
In conclusion, how to perform image segmentation on any online video data in real time, simply and accurately has become a technical problem that urgently needs to be solved.
SUMMARY OF THE INVENTION
Embodiments of the present application provide an image segmentation method, apparatus, device and storage medium, so as to solve the technical problem that the above technologies cannot accurately perform image segmentation on online video data.
In a first aspect, an embodiment of the present application provides an image segmentation method, including:
acquiring the current frame image in video data, where a target object is displayed in the video data;
inputting the current frame image into a trained image segmentation model to obtain a first segmented image based on the target object;
smoothing the first segmented image to obtain a second segmented image based on the target object;
taking the next frame image in the video data as the current frame image, and returning to perform the operation of inputting the current frame image into the trained image segmentation model, until each frame image in the video data obtains a corresponding second segmented image.
In a second aspect, an embodiment of the present application further provides an image segmentation apparatus, including:
a data acquisition module, for acquiring the current frame image in video data, where a target object is displayed in the video data;
a first segmentation module, for inputting the current frame image into a trained image segmentation model to obtain a first segmented image based on the target object;
a second segmentation module, for smoothing the first segmented image to obtain a second segmented image based on the target object;
a repeated segmentation module, for taking the next frame image in the video data as the current frame image, and returning to perform the operation of inputting the current frame image into the trained image segmentation model, until each frame image in the video data obtains a corresponding second segmented image.
In a third aspect, an embodiment of the present application further provides an image segmentation device, including:
one or more processors;
a memory, for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the image segmentation method described in the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the image segmentation method described in the first aspect is implemented.
The above image segmentation method, apparatus, device and storage medium acquire video data containing a target object, input each frame image of the video data into an image segmentation model to obtain a corresponding first segmented image, and then smooth the first segmented image to obtain a second segmented image. This technical means solves the technical problem that some image segmentation technologies cannot accurately perform image segmentation on online video data. By using an autoencoder-based image segmentation model together with smoothing, online video data can be segmented accurately in real time; due to the self-learning ability of the image segmentation model, the method can be applied to online video data with complex images; and in the application process, the method can be used directly once the image segmentation model is deployed, without artificial prior information, which simplifies the complexity of image segmentation and expands the application scenarios of image segmentation methods.
Description of Drawings
FIG. 1 is a flowchart of an image segmentation method provided by an embodiment of the present application;
FIG. 2 is a flowchart of another image segmentation method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an image segmentation model provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an original image provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a segmentation result image provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an edge result image provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another image segmentation model provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an image segmentation apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an image segmentation device provided by an embodiment of the present application.
Detailed Description
The present application will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are used to explain the present application rather than to limit it. In addition, it should be noted that, for the convenience of description, the drawings show only some rather than all of the structures related to the present application.
It should be noted that, in this document, relational terms such as first and second are only used to distinguish one entity, operation or object from another entity, operation or object, and do not necessarily require or imply any such actual relationship or order between these entities, operations or objects. For example, the "first" and "second" in the first segmented image and the second segmented image are used to distinguish two different segmented images.
The image segmentation method provided in the embodiments of the present application may be executed by an image segmentation device, which may be implemented by software and/or hardware, and the image segmentation device may be composed of two or more physical entities or of a single physical entity. For example, the image segmentation device may be a computer, a mobile phone, a tablet, an interactive smart tablet, or another smart device with data computing and analysis capabilities.
FIG. 1 is a flowchart of an image segmentation method provided by an embodiment of the present application. Referring to FIG. 1, the image segmentation method specifically includes:
Step 110: Acquire the current frame image in the video data, where a target object is displayed in the video data.
The video data is the video data on which image segmentation currently needs to be performed, and it may be online video data or offline video data. The video data includes multiple frames of images, and each frame of image displays a target object, which can be considered as the object that needs to be separated from the background image. Optionally, the background images of the frames in the video data may be the same or different, which is not limited in the embodiment, and the target object may change as the video data plays, but during the change, the type of the target object does not change. For example, when the target object is a human being, the human image in the video data may change (such as replacing a person or adding a new person), but the target object in the video data is always a human being. In the following embodiments, the target object is a human being as an example. Optionally, the source of the video data is not limited in this embodiment. For example, the video data is a piece of video shot by an image acquisition apparatus (such as a webcam or a camera) connected to the image segmentation device. For another example, the video data is a conference picture obtained from a network in a video conference scenario. For another example, the video data is a live broadcast picture obtained from a network in a live broadcast scenario.
Exemplarily, performing image segmentation on the video data refers to separating out the region where the target object is located in each frame image of the video data; in the embodiment, the target object is exemplarily described as a human being. Exemplarily, the video data is processed in units of frames, that is, the images in the video data are acquired frame by frame and processed to obtain the final image segmentation result. In the embodiment, the image currently being processed is recorded as the current frame image, and the processing of the current frame image is taken as an example for description.
Step 120: Input the current frame image into the trained image segmentation model to obtain a first segmented image based on the target object.
The image segmentation model is a pre-trained neural network model, which is used to segment the target object in the current frame image and output the segmentation result corresponding to the current frame image. In the embodiment, the segmentation result is recorded as the first segmented image, and the portrait area and the background area of the current frame image can be determined through the first segmented image, where the portrait area can be considered as the area where the target object (human) is located. In one embodiment, the first segmented image is a binary image whose pixel values include two types, 0 and 1, where the area with a pixel value of 0 belongs to the background area of the current frame image and the area with a pixel value of 1 belongs to the portrait area of the current frame image. It can be understood that, in order to facilitate the visual display of the first segmented image, the pixel values are converted into 0 and 255 before displaying the first segmented image, where the area with a pixel value of 0 belongs to the background area and the area with a pixel value of 255 belongs to the portrait area. The resolution of the first segmented image is the same as the resolution of the current frame image. It can be understood that when the image segmentation model has a resolution requirement for the input image, that is, when an image of fixed resolution needs to be input, it is necessary to determine first whether the resolution of the current frame image meets the resolution requirement; if not, resolution conversion is performed on the current frame image to obtain a current frame image that meets the resolution requirement. In this case, after the first segmented image is obtained, resolution conversion is also performed on the first segmented image, so that the resolution of the first segmented image is the same as the resolution of the original current frame image (that is, the current frame image before the resolution conversion). When the image segmentation model has no resolution requirement for the input image, the current frame image can be directly input into the image segmentation model to obtain a first segmented image of the same resolution.
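As a non-authoritative sketch of this per-frame inference step, assuming the model expects a fixed 224×224 input and returns a single-channel probability map, the resolution handling described above might look like the following; all names and thresholds are illustrative.

```python
import cv2
import numpy as np

def segment_frame(model, frame: np.ndarray, model_size=(224, 224)) -> np.ndarray:
    """Return the first segmented image (binary 0/1 mask) at the frame's resolution."""
    h, w = frame.shape[:2]
    inp = cv2.resize(frame, model_size)            # satisfy the model's fixed input resolution
    prob = model(inp)                              # assumed callable returning an (H, W) probability map
    mask = (prob > 0.5).astype(np.uint8)           # 1 = portrait region, 0 = background region
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)  # back to original resolution
```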
Exemplarily, the structure and parameters of the image segmentation model can be set according to the actual situation. In the embodiment, the image segmentation model adopts an autoencoder structure. An autoencoder is a class of artificial neural networks used in semi-supervised and unsupervised learning, whose function is to perform representation learning on the input information by taking the input information as the learning target. The autoencoder includes two parts, an encoder and a decoder, where the encoder is used to extract the features in the image and the decoder is used to decode the extracted features to obtain the learning result (for example, the first segmented image in the embodiment). Optionally, the encoder adopts a lightweight network to reduce the amount of data processing and calculation when extracting features and to speed up processing. The decoder can be implemented by residual blocks combined with channel confusion, upsampling and other processes, so as to achieve fully automatic real-time image segmentation. In the embodiment, the encoder extracts features of the current frame image at different resolutions, and the decoder then performs operations such as upsampling, fusion and decoding on the features so as to reuse them, thereby obtaining an accurate first segmented image.
Optionally, the image segmentation model is deployed under a forward inference framework. The specific type of the forward inference framework can be set according to the actual situation; for example, the forward inference framework is the openvino framework. When deployed under the forward inference framework, the image segmentation model has a low dependence on the GPU, is relatively lightweight, and does not occupy a large amount of storage space.
Step 130: Smooth the first segmented image to obtain a second segmented image based on the target object.
In the embodiment, there are different degrees of edge jaggedness in the first segmented image, where edge jaggedness can be understood as the edge between the portrait area and the background area being jagged, which makes the separation of the portrait area and the background area look too abrupt. In the embodiment, in order to reduce the influence of edge jaggedness, the first segmented image is smoothed, that is, the edge jaggedness in the first segmented image is smoothed, so as to obtain a segmented image with smoother edges. In the embodiment, the smoothed segmented image is recorded as the second segmented image. The second segmented image can also be considered as the final segmentation result of the current frame image. It can be understood that the second segmented image is also a binary image whose pixel values include two types, 0 and 1, where the area with a pixel value of 0 belongs to the background area of the current frame image and the area with a pixel value of 1 belongs to the portrait area of the current frame image.
The technical means used in the smoothing processing can be set according to the actual situation. In the embodiment, the smoothing processing is implemented by means of Gaussian smoothing filtering. Exemplarily, a Gaussian kernel function is used in the Gaussian smoothing filtering to process the first segmented image to obtain the second segmented image. The Gaussian kernel function is a commonly used kernel function, and the smoothing processing can be expressed as S_2 = S_1 * G, where S_2 represents the second segmented image, S_1 represents the first segmented image, and G represents the Gaussian kernel function.
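A minimal sketch of the S_2 = S_1 * G smoothing with OpenCV follows; the kernel size and sigma are illustrative, and after filtering the mask takes intermediate values near the boundary, which is what softens the jagged edge.

```python
import cv2

def smooth_mask(first_segmented, ksize=5, sigma=0):
    # S2 = S1 * G: convolve the binary mask with a Gaussian kernel
    return cv2.GaussianBlur(first_segmented.astype("float32"), (ksize, ksize), sigma)
```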
Step 140: Take the next frame image in the video data as the current frame image, and return to perform the operation of inputting the current frame image into the trained image segmentation model, until each frame image in the video data obtains a corresponding second segmented image.
Exemplarily, after the second segmented image is obtained, it can be considered that the image segmentation of the current frame image has been completed, and therefore the next frame image in the video data can be processed. The processing procedure is to take the next frame image as the current frame image and repeat steps 110 to 130 to obtain the second segmented image of the new current frame image, then acquire the next frame image again and repeat the above process, until each frame image in the video data obtains a corresponding second segmented image, thereby realizing image segmentation.
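Putting the frame loop together, a hedged end-to-end sketch could look like the following; `segment_frame` and `smooth_mask` refer to the illustrative helpers sketched earlier, `model` is assumed to be the trained image segmentation model loaded beforehand, and the video source is an assumption.

```python
import cv2

cap = cv2.VideoCapture(0)                 # camera index or a video file path (assumed source)
while True:
    ok, frame = cap.read()
    if not ok:
        break                             # every frame has been processed
    first_seg = segment_frame(model, frame)      # step 120: first segmented image
    second_seg = smooth_mask(first_seg)          # step 130: second segmented image
    # second_seg can now be used, e.g. for background replacement (see below)
cap.release()
```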
It can be understood that, after the second segmented image is obtained, the current frame image can be processed according to actual needs. In the embodiment, background replacement is taken as an example. In this case, after smoothing the first segmented image to obtain the second segmented image based on the target object, the method further includes: acquiring a target background image, where the target background image contains a target background; and replacing the background of the current frame image according to the target background image and the second segmented image, so as to obtain a new image of the current frame.
The target background refers to the new background used after the background is replaced, and the target background image refers to an image containing the target background. Optionally, the target background image and the second segmented image have the same resolution. The target background image may be an image selected by the user of the image segmentation device, or a default image of the image segmentation device. Exemplarily, after the target background image is acquired, background replacement is performed on the current frame image to obtain a replaced image. In the embodiment, the image after background replacement is recorded as the new image of the current frame. Exemplarily, the background replacement is performed as follows: the pixels where the portrait area is located and the pixels where the background area is located in the current frame image are determined through the second segmented image; after that, the portrait area is retained and the corresponding background area is replaced with the relevant target background in the target background image, so as to obtain the new image of the current frame. The background replacement can be expressed as I' = I × S_2 + (1 - S_2) × B, where S_2 represents the second segmented image, I' represents the new image of the current frame, I represents the current frame image, and B represents the target background image. In the above formula, the portrait area is retained by I × S_2 (that is, after the current frame image is multiplied by the second segmented image, the pixels of the current frame image that correspond to pixels with value 1 in the second segmented image are retained), and the background area is replaced by (1 - S_2) × B (that is, after the target background image is multiplied by (1 - S_2), the pixels of the target background image that correspond to pixels with value 0 in the second segmented image are retained). It can be understood that, when performing image segmentation on the video data, after the second segmented image of each frame is obtained, the new image of the current frame corresponding to that frame can be obtained through its second segmented image. Afterwards, the new images of the frames can form the new video data after background replacement.
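The formula I' = I × S_2 + (1 - S_2) × B maps directly to array arithmetic; a small sketch follows, where the channel broadcasting, resizing and dtype handling are the only additions beyond the formula itself.

```python
import cv2
import numpy as np

def replace_background(frame: np.ndarray, mask: np.ndarray, target_background: np.ndarray) -> np.ndarray:
    """I' = I*S2 + (1 - S2)*B, with mask values in [0, 1] at the frame's resolution."""
    bg = cv2.resize(target_background, (frame.shape[1], frame.shape[0]))
    m = mask.astype(np.float32)[..., None]               # (H, W, 1) so it broadcasts over the color channels
    out = frame.astype(np.float32) * m + bg.astype(np.float32) * (1.0 - m)
    return out.astype(np.uint8)
```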
It can be understood that, in the embodiment, in order to facilitate the understanding of the technical solution, the video data is limited to containing the target object. In practical applications, the video data may not contain the target object; in this case, the first segmented image obtained by the above method is a segmented image whose pixel values are all 0.
In the above, by acquiring video data containing the target object, inputting each frame image of the video data into the image segmentation model to obtain the corresponding first segmented image, and then smoothing the first segmented image to obtain the second segmented image, the technical problem that some image segmentation technologies cannot accurately perform image segmentation on online video data is solved. By using the autoencoder-based image segmentation model together with smoothing, video data, and especially online video data, can be segmented accurately, and the processing speed of online video data is guaranteed. Moreover, due to the self-learning ability of the image segmentation model, the method is applicable to video data with complex images, and in the application process the model can be applied directly once deployed, without artificial prior information, which simplifies the complexity of image segmentation and expands the application scenarios of the image segmentation method.
It can be understood that the above image segmentation method can be regarded as the application process of the image segmentation model. In practical applications, the performance of the image segmentation model directly affects the result of image segmentation; therefore, in addition to the application of the image segmentation model, the training process of the image segmentation model is also an important part. Exemplarily, FIG. 2 is a flowchart of another image segmentation method provided by an embodiment of the present application. This image segmentation method is based on the above image segmentation method and exemplarily describes the training process of the image segmentation model. Referring to FIG. 2, the image segmentation method specifically includes:
Step 210: Acquire a training data set, where the training data set contains multiple original images.
The training data refers to the data from which the image segmentation model learns when the image segmentation model is trained. In the embodiment, the training data is in the form of images, so the training data is referred to as original images, and the original images contain the same type of target object as the video data. Exemplarily, the training data set refers to a data set containing a large number of original images; that is, during training, a large number of original images are selected from the training data set for the image segmentation model to learn, so as to improve the accuracy of the image segmentation model.
Exemplarily, the video data contains a large number of images. If the original images were collected from the video data, images would need to be collected frame by frame, which would consume a large amount of labor and production cost, and the collected original images would contain a large amount of repeated content, which is not conducive to training the image segmentation model. Therefore, in the embodiment, independent original images are used instead of video data to construct the training data set. In this case, the constructed training data set can contain original images with different portrait poses in different scenes, where the scenes are preferably natural scenes. That is, multiple natural scenes are selected in advance, and in each natural scene an image acquisition apparatus is used to shoot multiple images containing humans as the original images, where the human poses in the multiple images are different. Optionally, in order to reduce the influence of the shooting parameters of the image acquisition apparatus (such as its position, aperture size and degree of focus) and of the lighting in the natural environment on the performance of the image segmentation model, when constructing the training data set, multiple original images under different lighting and different shooting parameters are collected for the same portrait pose in the same natural scene, so as to guarantee the performance of the image segmentation model when processing video data under different scenes, different portrait poses, different lighting and different shooting parameters.
It can be understood that an existing public image data set may also be used as the training data set, for example the public data set Supervisely, or the public data set EG1800.
Step 220: Construct a label data set according to the training data set, where the label data set contains multiple segmentation label images and multiple edge label images, and one original image corresponds to one segmentation label image and one edge label image.
Exemplarily, the label data can be understood as reference data for determining whether the image segmentation model is accurate, and it plays a supervising role. The more similar the output of the image segmentation model is to the corresponding label data, the higher the accuracy of the image segmentation model, that is, the better its performance; otherwise, the lower the accuracy of the image segmentation model. It can be understood that the process of training the image segmentation model is the process of making the output of the image segmentation model increasingly similar to the corresponding label data.
一个实施例中,将原始图像输入图像分割模型后,图像分割模型输出原始图像对应的分割图像和边缘图像,其中,分割图像是指对原始图像中的目标对象进行图像分割后得到的二值图像,实施例中,将训练过程中图像分割模型输出的分割图像记为分割结果图像。边缘图像是指表示原始图像中人像区域和背景区域之间边缘的二值图像,实施例中,将训练过程中图像分割模型输出的边缘图像记为边缘结果图像。为了对图像分割模型进行精准训练,实施例中,根据图像分割模块输出结果设置标签数据包含分割标签图像和边缘标签图像,其中,分割标签图像对应于分割结果图像,用于对分割结果图像起到参考作用,边缘标签图像对应于边缘结果图像,用于对边缘结果图像起到参考作用。每张原始图像均存在对应的分割标签图像和边缘标签图像,各分割标签图像和边缘标签图像组成标签数据集。In one embodiment, after the original image is input into the image segmentation model, the image segmentation model outputs a segmented image and an edge image corresponding to the original image, wherein the segmented image refers to a binary image obtained by performing image segmentation on the target object in the original image. , in the embodiment, the segmented image output by the image segmentation model in the training process is recorded as the segmentation result image. The edge image refers to a binary image representing the edge between the portrait area and the background area in the original image. In the embodiment, the edge image output by the image segmentation model in the training process is recorded as the edge result image. In order to accurately train the image segmentation model, in the embodiment, the label data is set according to the output result of the image segmentation module, including the segmentation label image and the edge label image, wherein the segmentation label image corresponds to the segmentation result image, and is used to play a role in the segmentation result image. For reference, the edge label image corresponds to the edge result image, and is used for reference to the edge result image. Each original image has corresponding segmented label images and edge label images, and each segmented label image and edge label image constitutes a label dataset.
示例性的,边缘标签图像和分割标签图像均可以通过上述的原始图像得到。例如,采用人工标注的方式,在各原始图像中标记出人像区域和背景区域以及边缘区域,之后,根据人像区域、背景区域和边缘区域得到边缘标签图像和分割标签图像。又如,采用人工标注的方式,在各原始图像中标记出人像区域和背景区域,之后,根据人像区域和背景区域得到分割标签图像,并根据分割标签图像得到边缘标签图像。Exemplarily, both the edge label image and the segmented label image can be obtained from the above-mentioned original image. For example, by using manual labeling, the portrait area, background area and edge area are marked in each original image, and then the edge label image and segmentation label image are obtained according to the portrait area, background area and edge area. For another example, a human-marking method is used to mark a portrait region and a background region in each original image, and then a segmented label image is obtained according to the portrait region and the background region, and an edge label image is obtained according to the segmented label image.
实施例中,以通过人工标注得到分割标签图像并通过分割标签图像得到边缘标签图像的方式进行示例性描述,在本实施例中,步骤220包括步骤221-步骤225:In the embodiment, an exemplary description is given in the manner of obtaining a segmented label image by manual annotation and obtaining an edge label image by segmenting the label image. In this embodiment, step 220 includes steps 221-225:
Step 221: Acquire an annotation result for the original image.
The annotation result refers to the result obtained after marking the portrait area and the background area in the original image. In the embodiment, the annotation result is obtained by manual annotation, that is, the portrait area and the background area are manually marked in the original image, and the image segmentation device then obtains the annotation result from the marked portrait area and background area.
Step 222: Obtain the corresponding segmentation label image according to the annotation result.
Exemplarily, according to the annotation result, the pixel value of each pixel belonging to the portrait area in the original image is set to 255, and the pixel value of each pixel belonging to the background area is set to 0, thereby obtaining the segmentation label image. It can be understood that the segmentation label image is a binary image.
Step 223: Perform an erosion operation on the segmentation label image to obtain an eroded image.
The erosion operation can be understood as shrinking and thinning the white area (that is, the portrait area) whose pixel value is 255 in the segmentation label image. In the embodiment, the image obtained by performing the erosion operation on the segmentation label image is recorded as the eroded image. It can be understood that the number of pixels occupied by the white area in the eroded image is smaller than the number of pixels occupied by the white area in the segmentation label image, and the white area in the segmentation label image completely covers the white area in the eroded image.
Step 224: Perform a Boolean operation on the segmentation label image and the eroded image to obtain the edge label image corresponding to the original image.
Boolean operations include union, intersection, and subtraction. The objects on which a Boolean operation is performed are the operands; in the embodiment, the operands are the segmentation label image and the eroded image, more specifically the white areas in the two images. The result of a Boolean operation can be recorded as a Boolean object; in the embodiment, the Boolean object is the edge label image. Exemplarily, union means that the resulting Boolean object contains the volume of both operands. Since the white area in the segmentation label image completely covers the white area in the eroded image, the Boolean object obtained by the union of the two images is the white area in the segmentation label image. Intersection means that the resulting Boolean object contains only the volume common to both operands (that is, only the overlapping positions); since the white area in the segmentation label image completely covers the white area in the eroded image, the Boolean object obtained by intersecting the two images is the white area in the eroded image. Subtraction means that the Boolean object contains the volume of one operand with the intersecting volume removed; for example, subtracting the eroded image from the segmentation label image yields the white area obtained by removing, from the white area of the segmentation label image, the white area corresponding to the eroded image. It can be understood that, since the eroded image is obtained by shrinking the white area of the segmentation label image, the edges of the white areas in the eroded image and in the segmentation label image are highly similar; therefore, after the Boolean subtraction of the two images, a white area representing only the edge is obtained, that is, the edge label image. At this time, the edge label image can be expressed as GT_edge = GT - GT_erode, where GT_edge represents the edge label image, GT represents the segmentation label image, and GT_erode represents the eroded image. It can be understood that the edge label image is a binary image whose resolution is equal to that of the segmentation label image.
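The following is a minimal sketch of steps 223 and 224, assuming OpenCV is available; the 5x5 erosion kernel is an assumption, since the embodiment does not specify the erosion parameters.

```python
import cv2
import numpy as np

def make_edge_label(seg_label: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Derive an edge label image from a binary segmentation label (values 0/255).

    GT_edge = GT - GT_erode: erode the portrait (white) area, then subtract the
    eroded mask from the original mask so that only a thin band along the
    portrait/background boundary remains.
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(seg_label, kernel, iterations=1)   # shrink the white area
    edge_label = cv2.subtract(seg_label, eroded)          # Boolean subtraction of the two masks
    return edge_label
```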
Step 225: Obtain the label data set according to the segmentation label images and the edge label images.
After the segmentation label image and the edge label image of each original image are obtained according to the above steps, the label data set is composed of all segmentation label images and all edge label images. It can be understood that the segmentation label images and the edge label images can be regarded as the ground truth, that is, the correct labels.
Step 230: Train the image segmentation model according to the training data set and the label data set.
Exemplarily, one original image is input to the image segmentation model, a loss function is constructed from the output of the image segmentation model and the corresponding label data in the label data set, and the model parameters of the image segmentation model are then updated according to the loss function. After that, another original image is input to the updated image segmentation model to construct the loss function again, and the model parameters are updated again accordingly. The above training process is repeated until the loss function converges. When the values of the loss function computed over several consecutive iterations fall within a set range, the loss function can be considered to have converged and the accuracy of the output of the image segmentation model can be considered stable; therefore, the image segmentation model can be considered fully trained.
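A minimal training-loop sketch of the above procedure is shown below, assuming the pytorch framework. The names `model`, `dataset`, and `compute_loss` are hypothetical stand-ins for the image segmentation model, the paired (original image, label) data, and the loss described in step 239; the optimizer, learning rate, and convergence tolerance are assumptions.

```python
import torch

def train(model, dataset, compute_loss, lr=1e-3, tol=1e-4, patience=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    recent = []
    for image, seg_label, edge_label in dataset:           # one original image per iteration
        outputs = model(image)                              # second decoding features and edge result
        loss = compute_loss(outputs, seg_label, edge_label)
        optimizer.zero_grad()
        loss.backward()                                     # update model parameters from the loss
        optimizer.step()
        recent.append(loss.item())
        # loss values over consecutive iterations stay within a set range: converged
        if len(recent) >= patience and max(recent[-patience:]) - min(recent[-patience:]) < tol:
            break
    return model
```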
Exemplarily, the specific structure of the image segmentation model can be set according to the actual situation. In the embodiment, the image segmentation model is described as including a normalization module, an encoding module, channel confusion modules, residual modules, multiple upsampling modules, an output module, and an edge module. For ease of understanding, the image segmentation model is exemplarily described with the structure shown in FIG. 3 , which is a schematic structural diagram of an image segmentation model provided by an embodiment of the present application. Referring to FIG. 3 , the image segmentation model includes a normalization module 21, an encoding module 22, four channel confusion modules 23, three residual modules 24, four multiple upsampling modules 25, an output module 26, and an edge module 27. In this embodiment, step 230 includes steps 231 to 2310:
Step 231: Input the original image into the normalization module to obtain a normalized image.
In the embodiment, an original image with a resolution of 224×224 is taken as an example. For example, FIG. 4 is a schematic diagram of an original image provided by an embodiment of the present application. Referring to FIG. 4 , the original image contains one portrait area; it should be noted that the original image used in FIG. 4 comes from the public data set Supervisely.
Exemplarily, normalization refers to the process of applying a series of standard transformations to an image so that it takes a fixed standard form; the resulting standard image is called a normalized image. Normalization is divided into linear normalization and nonlinear normalization; in the embodiment, the original image is processed by linear normalization. Linear normalization maps the pixel values of each image from [0, 255] to [-1, 1], and the resolution of the resulting normalized image is equal to the resolution of the image before normalization. It can be understood that the normalization module is the module that implements the linear normalization operation; after the original image is input to the normalization module, the normalization module outputs a normalized image with pixel values in [-1, 1].
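A minimal sketch of the linear normalization described above is given below; it assumes the input is an 8-bit image array and leaves the resolution unchanged.

```python
import numpy as np

def normalize(image: np.ndarray) -> np.ndarray:
    """Linearly map pixel values from [0, 255] to [-1, 1]; resolution is unchanged."""
    return image.astype(np.float32) / 127.5 - 1.0
```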
Step 232: Use the encoding module to obtain multi-layer image features of the normalized image, where the image features of each layer have different resolutions.
The encoding module is used to extract features from the normalized image; in the embodiment, the extracted features are recorded as image features. It can be understood that the image features can reflect information such as color features, texture features, shape features, and spatial relationship features in the normalized image, including global information and/or local information. Exemplarily, the encoding module is a lightweight network, where a lightweight network refers to a neural network with few parameters, a small amount of computation, and a short inference time. The type of lightweight network used by the encoding module can be selected according to the actual situation; in the embodiment, referring to FIG. 3 , the encoding module 22 is described as a MobileNetV2 network.
In one embodiment, the normalized image passes through MobileNetV2 to produce multi-layer image features, where the resolutions of the image features of the layers differ and are related by multiples; optionally, the resolution of each layer of image features is smaller than the resolution of the original image. In one embodiment, the layers of image features are arranged from top to bottom in order of decreasing resolution, that is, the image features with the highest resolution are in the highest layer and those with the lowest resolution are in the lowest layer. It can be understood that the number of feature layers output by the encoding module can be set according to the actual situation. For example, when the resolution of the original image is 224×224, the encoding module outputs four layers of image features. In this case, referring to FIG. 3 , among the four layers of image features output by the encoding module 22, the resolution of the highest-layer (first-layer) image features is 112×112 (denoted Feature112×112 in FIG. 3 ), the resolution of the second-highest-layer (second-layer) image features is 56×56 (denoted Feature56×56 in FIG. 3 ), the resolution of the second-lowest-layer (third-layer) image features is 28×28 (denoted Feature28×28 in FIG. 3 ), and the resolution of the lowest-layer (fourth-layer) image features is 14×14 (denoted Feature14×14 in FIG. 3 ). It can be understood that the image features of the layers contain more and more information from bottom to top. Adjacent layers of image features are related by the same multiple, and the resolution of each layer of image features is smaller than the resolution of the original image. It can be understood that the correspondence between resolutions and layers in the embodiment is only intended to explain the image segmentation model, not to limit it.
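A sketch of extracting four feature maps at 1/2, 1/4, 1/8, and 1/16 of the input resolution from torchvision's MobileNetV2 is shown below. The tapped layer indices are an assumption about where each stride change occurs; the embodiment does not specify which internal MobileNetV2 blocks are used.

```python
import torch
from torchvision.models import mobilenet_v2

backbone = mobilenet_v2().features
taps = {1: "112x112", 3: "56x56", 6: "28x28", 13: "14x14"}  # assumed tap points

def encode(x: torch.Tensor):
    """Return the multi-layer image features for a 1x3x224x224 normalized image."""
    features = []
    for idx, layer in enumerate(backbone):
        x = layer(x)
        if idx in taps:
            features.append(x)
        if idx == max(taps):
            break
    return features  # ordered from highest (112x112) to lowest (14x14) resolution
```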
It should be noted that the embodiment does not limit the number of channels contained in each layer of image features.
It can be understood that the encoding module can be regarded as the encoder in the image segmentation model.
Step 233: Input the image features of each layer into the corresponding channel confusion module to obtain multi-layer confusion features, where each layer of image features corresponds to one channel confusion module.
The channel confusion module is used to fuse the features across the channels within a layer, so as to enrich the information contained in the image features of each layer and to ensure the accuracy of the image segmentation model without increasing the subsequent amount of computation. It can be understood that each layer of image features corresponds to one channel confusion module; as shown in FIG. 3 , the four layers of image features correspond to four channel confusion modules 23, and each channel confusion module 23 is used to fuse the image features across the multiple channels of the corresponding layer.
In one embodiment, the channel confusion module consists of a 1×1 convolution layer, a batch normalization (BN) layer, and an activation function layer, where the activation function layer uses the ReLU activation function. The 1×1 convolution layer realizes the mixing of image features across channels, and the BN layer plus the activation function layer makes the mixed image features more stable. It can be understood that the above structure of the channel confusion module is only an exemplary description; in practical applications, other structures may also be used for the channel confusion module.
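A minimal pytorch sketch of this 1×1 conv + BN + ReLU structure is shown below; the output channel count is an assumption, since the embodiment does not fix the channel numbers.

```python
import torch.nn as nn

class ChannelConfusion(nn.Module):
    """Channel confusion module sketch: 1x1 convolution + batch normalization + ReLU."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # mix features across channels
            nn.BatchNorm2d(out_channels),                                      # stabilize the mixed features
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)  # resolution is unchanged
```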
Exemplarily, the features output by the channel confusion module are recorded as confusion features. It can be understood that each layer of image features has corresponding confusion features, and within the same layer the confusion features and the image features have the same resolution. In one embodiment, except for the confusion features with the lowest resolution, the other confusion features are center-layer features, that is, the other layers can be regarded as network center layers. Taking FIG. 3 as an example, after passing through the respective channel confusion modules 23, the confusion features of the lowest layer are denoted Decode14×14, and the confusion features of the other layers are denoted Center28×28, Center56×56, and Center112×112, where the numeric part indicates the resolution.
It can be understood that the confusion features output by the channel confusion module can also be regarded as features obtained by decoding the image features; that is, in addition to mixing features across channels, the channel confusion module also plays a decoding role.
Step 234: Except for the confusion features of the highest-resolution layer, upsample the confusion features of every other layer and fuse them with the confusion features of the next-higher resolution to obtain the fusion features corresponding to that higher resolution.
Upsampling can be understood as enlarging a feature to increase its resolution. In the embodiment, upsampling is implemented by linear interpolation, that is, a suitable interpolation algorithm is used to insert new elements between the confusion features so as to enlarge their resolution.
In this step, upsampling enlarges the resolution of the confusion features so that the enlarged resolution equals the next-higher resolution. The next-higher resolution refers to the resolution that is higher than, and immediately above, the resolution currently being upsampled; correspondingly, the resolution being upsampled can be regarded as the next-lower resolution of that higher resolution. For example, in FIG. 3 , except for the lowest layer, the resolution of each layer is one level higher than the resolution of the layer below it. It can be understood that, since the resolution of the confusion features of any layer and the next-higher resolution are related by a multiple, the upsampling factor can be determined from that multiple. For example, if the resolution of the confusion features of a layer is 0.5 times the next-higher resolution, then double upsampling can be used to enlarge the resolution of that layer's confusion features. After that, the confusion features of the higher resolution are fused, via a skip connection, with the upsampled confusion features from the next-lower resolution, so as to reuse the confusion features and ensure that richer features are available in subsequent processing. It can be understood that image segmentation is a form of dense prediction, so the image segmentation model requires richer features. In the embodiment, the fused features are recorded as fusion features; thus, except for the confusion features with the lowest resolution, the confusion features of every layer have corresponding fusion features. The feature fusion operation can be understood as a concatenate operation. It can be understood that the size of the fusion features of each layer is the sum of the sizes of this layer's confusion features and the upsampled confusion features from the next-lower resolution. For example, if C in [NCHW] is 3 for this layer's confusion features before fusion, and C in [NCHW] is 3 for the upsampled lower-resolution confusion features before fusion, then C in [NCHW] of the resulting fusion features is 6, while N, H, and W remain unchanged. Here N is the batch size, C is the number of channels, H is the height, W is the width, and H×W can be understood as the resolution. It should be noted that, since there is no resolution higher than the highest one, the confusion features with the highest resolution do not need to be upsampled. A fusion sketch is given after the following example.
For example, referring to the image segmentation model shown in FIG. 3 , the lowest-layer confusion features Decode14×14 are upsampled by a factor of two so that their resolution doubles, yielding features with a resolution of 28×28; the confusion features Center28×28 of the next-higher resolution (the second-lowest layer) are then fused, via a skip connection, with the doubly upsampled 28×28 features of the lowest layer to obtain the fusion features of the second-lowest layer. Similarly, the confusion features Center28×28 of the second-lowest layer are upsampled by a factor of two to obtain features with a resolution of 56×56, and the confusion features Center56×56 of the next-higher resolution (the second-highest layer) are fused, via a skip connection, with the doubly upsampled 56×56 features of the second-lowest layer to obtain the fusion features of the second-highest layer. The fusion features of the highest layer are obtained in the same way.
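A minimal sketch of this upsample-and-concatenate fusion is shown below; bilinear interpolation is assumed as the linear interpolation method, and the variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

def fuse(higher_res_confusion: torch.Tensor, lower_res_confusion: torch.Tensor) -> torch.Tensor:
    """Skip-connection fusion: upsample the lower-resolution confusion features by 2x
    and concatenate them with the next-higher-resolution confusion features along
    the channel dimension of the NCHW tensor (channel counts add up)."""
    upsampled = F.interpolate(lower_res_confusion, scale_factor=2,
                              mode="bilinear", align_corners=False)
    return torch.cat([higher_res_confusion, upsampled], dim=1)

# e.g. fuse(center_28, decode_14) -> fusion features of the second-lowest layer (28x28)
```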
Step 235: Input the fusion features of each layer into the corresponding residual module to obtain multi-layer first decoding features, where each layer of fusion features corresponds to one residual module, and the confusion features with the lowest resolution serve as the first decoding features with the lowest resolution.
The residual module is used to further extract and decode the fusion features. The residual module may contain one or more residual blocks (RS Block); in the embodiment, a residual module containing one residual block is taken as an example, and the structure of the residual block can be set according to the actual situation (a sketch is given after the example below). It can be understood that each layer of fusion features corresponds to one residual module, and the features output by the residual module have the same resolution as that layer's fusion features. Since the residual module further extracts and decodes the fusion features, that is, the features output by the residual module are decoded features, the features output by the residual module are recorded in the embodiment as the first decoding features.
It can be understood that, since the confusion features with the lowest resolution have no corresponding fusion features, there is no need to set a residual module in the lowest-resolution layer; the confusion features with the lowest resolution can be regarded directly as the first decoding features of that layer. Correspondingly, the fusion features of the other layers pass through their corresponding residual modules to obtain the corresponding first decoding features.
Taking FIG. 3 as an example, it contains three residual modules 24. The first decoding features output after the fusion features of the second-lowest layer are input to the residual module are denoted RS Block28×28, that is, their resolution is 28×28. The first decoding features output after the fusion features of the second-highest layer are input to the residual module are denoted RS Block56×56, that is, their resolution is 56×56. The first decoding features output after the fusion features of the highest layer are input to the residual module are denoted RS Block112×112, that is, their resolution is 112×112. The first decoding features of the lowest layer are Decode14×14.
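Since the embodiment leaves the exact structure of the residual block open, the following is only one possible sketch; the two 3×3 conv + BN layers and the 1×1 projection shortcut are assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A generic residual block (RS Block) sketch; resolution is preserved."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))  # residual addition
```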
Step 236: Input the first decoding features of each layer into the corresponding multiple upsampling module to obtain multi-layer second decoding features, where each layer of first decoding features corresponds to one multiple upsampling module, and each set of second decoding features has the same resolution as the original image.
Exemplarily, the multiple upsampling module is used to upsample the first decoding features by a multiple so that the resulting resolution equals the resolution of the original image. The specific multiple is determined by the resolution of the first decoding features and the resolution of the original image; for example, if the resolution of the first decoding features is 14×14 and the resolution of the original image is 224×224, the first decoding features need to be upsampled by a factor of 16 to obtain decoding features with a resolution of 224×224.
It can be understood that, for a binary-classification image segmentation model, the binary image it finally outputs (the segmentation result image) is used to distinguish the foreground (for example, the portrait area) from the background; therefore, the segmentation task of the image segmentation model is a binary segmentation task, and before the segmentation result image is obtained, decoding features with 2 channels need to be obtained first. In the embodiment, in addition to upsampling the first decoding features, the multiple upsampling module also needs to change the number of channels of the upsampled first decoding features to 2. For the first decoding features of each layer, upsampling only changes the resolution, not the number of channels; therefore, in the embodiment, a 1×1 convolution layer is set in the multiple upsampling module, that is, the first decoding features are upsampled and then passed through a 1×1 convolution layer, which changes the number of channels of the upsampled first decoding features to 2. In practical applications, the image segmentation model can also perform multi-class segmentation tasks; in that case, before the final output image is obtained, decoding features whose number of channels equals the number of classes likewise need to be obtained. For example, if the image segmentation model performs a five-class segmentation task, 5-channel decoding features need to be obtained before the five-class segmentation result image is output. It should be noted that, when the corresponding segmentation label image is used for supervision, in order to facilitate computing the loss function, the pixel values in the segmentation label image need to be converted from 0 and 255 to 0 and 1, that is, pixels with value 0 are converted to 0 and pixels with value 255 are converted to 1. In this case, when training the segmentation network model, in order for the image segmentation model to finally output 2-channel decoding features, the ground truth of the segmentation label image needs to be converted into one-hot form, that is, each class has one channel, and a pixel in a channel has value 1 when it belongs to that class and value 0 in the other channels.
In the embodiment, for convenience of description, the features output by the multiple upsampling module are recorded as the second decoding features. It can be understood that the first decoding features of each layer correspond to one multiple upsampling module, and the multiple upsampling module produces second decoding features with 2 channels and the same resolution as the original image. The second decoding features can be regarded as the network prediction output obtained after decoding the image features of the current layer.
For example, referring to FIG. 3 , after the four layers of first decoding features pass through their corresponding multiple upsampling modules 25, four sets of second decoding features with a resolution of 224×224 and 2 channels are obtained; in FIG. 3 , the four sets of second decoding features are all denoted 224×224.
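A minimal sketch of the multiple upsampling module described above is given below; bilinear interpolation and the 224×224 output size are assumptions matching the example resolution in the embodiment.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultipleUpsampling(nn.Module):
    """Upsample the first decoding features to the original image resolution, then
    use a 1x1 convolution to reduce the channel count to the number of classes
    (2 for portrait/background)."""
    def __init__(self, in_channels: int, num_classes: int = 2, out_size: int = 224):
        super().__init__()
        self.out_size = out_size
        self.proj = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, x):
        x = F.interpolate(x, size=(self.out_size, self.out_size),
                          mode="bilinear", align_corners=False)
        return self.proj(x)  # N x 2 x 224 x 224 second decoding features
```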
It can be understood that the second decoding features of each layer can be regarded as an intermediate output obtained after decoding that layer's image features, and the final segmentation result image is obtained from these intermediate outputs.
Step 237: Concatenate the multi-layer second decoding features and input them into the output module to obtain the segmentation result image.
Since the image segmentation model ultimately needs to output one segmentation result image, after the second decoding features are obtained, the output module integrates the second decoding features of all layers to obtain one segmentation result image (that is, a binary image). Exemplarily, the second decoding features of all layers are first concatenated, so that the output module can draw on richer features and thereby recover a more accurate image; the output module then uses the concatenated second decoding features to obtain the segmentation result image. The specific procedure of the output module is as follows: the concatenated second decoding features are passed through a 1×1 convolution layer to obtain 2-channel decoding features. It can be understood that concatenation merely merges the second decoding features together, whereas the 1×1 convolution layer in the output module further decodes the concatenated second decoding features, so that the final decoding features are output with reference to the second decoding features of all layers; these final decoding features have 2 channels and describe the binary-classification result, that is, whether each pixel in the original image belongs to the portrait area or the background area. The decoding features are then passed through the softmax function and the argmax function to obtain the segmentation result image. That is, the output module consists of a 1×1 convolution layer and an activation function layer, where the activation function layer consists of the softmax function and the argmax function. The data processed by the softmax function can be understood as the output of the logical layer, that is, the meaning represented by the decoding features output by the 1×1 convolution layer is interpreted to obtain the logical-layer description. When the labels are in one-hot form, the argmax function is the usual function for obtaining the output result, that is, the argmax function outputs the corresponding segmentation result image.
For example, referring to FIG. 3 , the four sets of second decoding features are concatenated and input into the output module 26; they first pass through a 1×1 convolution layer to obtain 2-channel decoding features (denoted Refine224×224 in FIG. 3 ), and then pass through the activation function layer to obtain the segmentation result image (denoted output224×224 in FIG. 3 ).
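A minimal sketch of the output module described above is shown below; the input channel count of the 1×1 convolution depends on the concatenated second decoding features and is left as a parameter.

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    """Output module sketch: 1x1 convolution over the concatenated second decoding
    features, softmax over the 2 class channels, then argmax to produce the binary
    segmentation result image (0 = background, 1 = portrait)."""
    def __init__(self, in_channels: int, num_classes: int = 2):
        super().__init__()
        self.refine = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, second_decoding_features):
        x = torch.cat(second_decoding_features, dim=1)   # concatenate along channels
        logits = self.refine(x)                          # Refine224x224
        probs = torch.softmax(logits, dim=1)
        return torch.argmax(probs, dim=1)                # output224x224, values 0 or 1
```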
It can be understood that the pixel values in the segmentation result image output by the image segmentation model are 0 or 1, where pixels with value 0 belong to the background area and pixels with value 1 belong to the portrait area. To facilitate visualization of the segmentation result image, the pixel value of each pixel is multiplied by 255 when the segmentation result image is displayed. For example, FIG. 5 is a schematic diagram of a segmentation result image provided by an embodiment of the present application: the training data shown in FIG. 4 is input into the image segmentation model shown in FIG. 3 to obtain the segmentation result image, and after each pixel value of the segmentation result image is multiplied by 255, the segmentation result image shown in FIG. 5 is obtained.
Step 238: Input the first decoding features with the highest resolution into the edge module to obtain the edge result image.
In order to improve the ability of the image segmentation model to learn the edge between the portrait area and the background area, an edge module is set in the image segmentation model in the embodiment, so that the edge module provides additional supervision of the first decoding features with the highest resolution; it acts as a regularization constraint and improves the model's ability to learn edges. The specific structure of the edge module is not limited by the embodiment; in the embodiment, an edge module consisting of a single 1×1 convolution layer is taken as an example. Exemplarily, after the first decoding features with the highest resolution are input into the edge module, edge features with 2 channels and the same resolution as the original image are obtained, from which a binary image representing only the edge can be obtained. In the embodiment, this binary image representing the edge is recorded as the edge result image. It can be understood that the pixel values in the edge result image are 0 or 1, where pixels with value 1 are edge pixels and pixels with value 0 are non-edge pixels. It should be noted that the first decoding features with the highest resolution contain richer detail information; therefore, more accurate edge features can be obtained from them.
For example, as shown in FIG. 3 , after the highest-layer first decoding features RS Block112×112 pass through the edge module 27, edge features with a resolution of 224×224 are obtained, denoted edge224×224 in FIG. 3 .
To facilitate visualization of the edge result image, the pixel value of each pixel is multiplied by 255 when the edge result image is displayed. For example, FIG. 6 is a schematic diagram of an edge result image provided by an embodiment of the present application: the training data shown in FIG. 4 is input into the image segmentation model shown in FIG. 3 to obtain the edge result image, and after each pixel value of the edge result image is multiplied by 255, the edge result image shown in FIG. 6 is obtained.
It can be understood that, except for the normalization module and the encoding module, the other modules can be regarded as modules that make up the decoder.
Step 239: Construct a loss function according to each set of second decoding features, the edge result image, the corresponding segmentation label image, and the edge label image, and update the model parameters of the image segmentation model according to the loss function.
The loss function of the segmentation network model consists of a segmentation loss function and an edge loss function. The segmentation loss function reflects the segmentation ability of the segmentation network model and is obtained from the second decoding features of each layer and the segmentation label image: a sub-loss function is obtained from each layer's second decoding features and the segmentation label image, and the sub-loss functions of all layers are combined to obtain the segmentation loss function. It can be understood that every sub-loss function is computed in the same way. In one embodiment, the sub-loss function is computed with the IoU function, which can be defined as the ratio of the area of the intersection of the predicted pixel region (the second decoding features) and the label pixel region (the segmentation label image) to the area of their union; that is, the IoU function reflects the overlap similarity between the binary image corresponding to the second decoding features and the segmentation label image, so the sub-loss function computed from the IoU function reflects the loss of overlap similarity. Exemplarily, the edge loss function reflects the ability of the segmentation network model to learn edges and is obtained from the edge result image and the edge label image. In one embodiment, since edge pixels account for only a small proportion of the pixels in the whole original image, the edge loss function uses the focal loss, a common loss function that reduces the weight of the large number of easy negative samples during training and can also be understood as a form of hard example mining.
Exemplarily, the loss function of the segmentation network model is expressed as:
Loss = loss_iou^1 + loss_iou^2 + ... + loss_iou^n + loss_edge
where Loss represents the loss function of the segmentation network model, n represents the total number of layers corresponding to the second decoding features, loss_iou^1 represents the sub-loss function computed from the second decoding features with the highest resolution and the corresponding segmentation label image, loss_iou^n represents the sub-loss function computed from the second decoding features with the lowest resolution and the segmentation label image, loss_iou^n = 1 - Iou_n, Iou_n = (A_n ∩ B)/(A_n ∪ B), A_n represents the second decoding features with the lowest resolution, B represents the corresponding segmentation label image, Iou_n represents the overlap similarity between A_n and B, and loss_edge is the focal loss function.
Exemplarily, the image segmentation model has n layers in total (n ≥ 2), that is, there are n layers of second decoding features, so n sub-loss functions are obtained from the n layers of second decoding features and the segmentation label image. The first layer has the highest resolution, and its sub-loss function is denoted loss_iou^1; the second layer has the second-highest resolution, and its sub-loss function is denoted loss_iou^2; and so on, with the n-th layer having the lowest resolution and its sub-loss function denoted loss_iou^n. Since all sub-loss functions are computed in the same way, the embodiment describes the n-th-layer sub-loss function as an example. Exemplarily,
loss_iou^n = 1 - Iou_n
which represents the loss of the n-th-layer IoU function, where
Iou_n = (A_n ∩ B)/(A_n ∪ B)
A_n represents the second decoding features of the n-th layer, B represents the corresponding segmentation label image, A_n ∩ B represents the intersection of A_n and B, A_n ∪ B represents the union of A_n and B, and Iou_n represents the overlap similarity between A_n and B; thus 1 - Iou_n represents the loss of overlap similarity. It can be understood that the more similar the binary image corresponding to the second decoding features and the segmentation label image are, the smaller the corresponding sub-loss function, the better the segmentation ability of the image segmentation model, and the higher the segmentation accuracy. Exemplarily, loss_edge represents the edge loss function; in the embodiment, loss_edge is the focal loss:
loss_edge(p_t) = -α_t (1 - p_t)^γ log(p_t)
where p_t represents the predicted probability that a pixel in the edge result image is an edge pixel, α_t represents a balancing weight coefficient used to balance positive and negative samples, and γ represents a modulation coefficient used to control the weight of hard and easy samples. The values of α_t and γ can be set according to the actual situation. loss_edge is obtained from loss_edge(p_t) of each pixel in the edge result image, specifically by summing loss_edge(p_t) over all pixels, computing the mean, and taking the computed mean as loss_edge.
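A minimal sketch of this loss, assuming the pytorch framework, is given below. The values of alpha and gamma are assumptions (the embodiment leaves them to be set per the actual situation), and the softmax-based probability maps and the soft intersection/union used here are one possible differentiable realization of the IoU sub-loss.

```python
import torch

def iou_loss(pred_probs: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-layer sub-loss: 1 - IoU between the predicted foreground probability map
    and the 0/1 segmentation label."""
    inter = (pred_probs * target).sum()
    union = (pred_probs + target - pred_probs * target).sum()
    return 1.0 - inter / (union + eps)

def focal_loss(edge_probs: torch.Tensor, edge_target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Edge loss: mean focal loss over all pixels of the edge result."""
    p_t = torch.where(edge_target == 1, edge_probs, 1 - edge_probs)
    a_t = torch.where(edge_target == 1, alpha * torch.ones_like(p_t), (1 - alpha) * torch.ones_like(p_t))
    return (-a_t * (1 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-6))).mean()

def total_loss(second_decodings, edge_logits, seg_label, edge_label):
    """Loss = sum of per-layer IoU sub-losses + focal edge loss."""
    seg_loss = sum(iou_loss(torch.softmax(d, dim=1)[:, 1], seg_label) for d in second_decodings)
    edge_probs = torch.softmax(edge_logits, dim=1)[:, 1]
    return seg_loss + focal_loss(edge_probs, edge_label)
```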
After the loss function is obtained, the model parameters of the image segmentation model can be updated according to the loss function, so that the updated image segmentation model performs better.
Step 2310: Select the next original image and return to the operation of inputting the original image into the normalization module, until the loss function converges.
It can be understood that, after the model parameters of the image segmentation model have been modified according to the loss function, one round of training can be considered complete. At this point, another original image and the corresponding segmentation label image and edge label image are selected to train the image segmentation model, the loss function is computed again, and the model parameters are modified accordingly. After many rounds of training, if the values of the loss function computed over the current consecutive rounds fall within a preset numerical range, the loss function is considered to have converged, that is, the image segmentation model is stable. It can be understood that the specific bounds of the preset numerical range can be set according to the actual situation.
Once the image segmentation model is stable, training is determined to be complete; the image segmentation model can then be applied to segment the portraits in video data.
On the basis of the above embodiment, after the image segmentation model is trained according to the training data set and the label data set, the method further includes: when the image segmentation model is not a network model recognizable by a forward inference framework, converting the image segmentation model into a network model recognizable by the forward inference framework.
The image segmentation model is trained in a corresponding framework, usually tensorflow, pytorch, or the like; in the embodiment, the pytorch framework is taken as an example. The pytorch framework is mainly used for the design, training, and testing of models. Since the image segmentation model runs in real time in the image segmentation device, and the pytorch framework occupies a large amount of memory, running a pytorch-based image segmentation model inside an application of the image segmentation device would greatly increase the storage space occupied by that application. At the same time, running the image segmentation model under the pytorch framework depends heavily on a graphics processing unit (GPU); if no GPU is installed in the image segmentation device, the image segmentation model will run slowly. A forward inference framework, in contrast, generally targets a specific platform (such as an embedded platform); different platforms have different hardware configurations, and when the forward inference framework is deployed on a platform it can exploit the platform's hardware configuration, make reasonable use of resources, and perform optimization and acceleration, that is, the forward inference framework can optimize and accelerate the models deployed inside it when running them. A forward inference framework is mainly used for the prediction stage of a model, which includes testing and prediction (application) but not training; moreover, a forward inference framework depends little on the GPU and is lightweight, so it does not make the application occupy a large amount of storage space. Therefore, when the image segmentation model is applied, it is run in a forward inference framework. In one embodiment, before applying the image segmentation model, it is first determined whether the image segmentation model runs in the forward inference framework. If it does, the image segmentation model is applied directly; if it does not, the image segmentation model is converted into a network model recognizable by the forward inference framework. Exemplarily, the specific type of forward inference framework can be set according to the actual situation; for example, the forward inference framework is the openvino framework.
In this case, a specific way of converting the image segmentation model under the pytorch framework into an image segmentation model under the openvino framework is: use the existing pytorch conversion tool to convert the image segmentation model into an Open Neural Network Exchange (ONNX) model, and then use the openvino conversion tool to convert the ONNX model into an image segmentation model under the openvino framework. ONNX is a standard for representing deep learning models that allows models to be transferred between different frameworks.
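A minimal sketch of the pytorch-to-ONNX step of this conversion is given below, assuming the trained model takes a 1×3×224×224 input; the output file name and opset version are assumptions. The resulting ONNX file would then be passed to the openvino conversion tool to obtain the model under the openvino framework.

```python
import torch

# `model` is the trained image segmentation model (hypothetical variable name).
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "image_segmentation.onnx",   # assumed file name
    input_names=["image"],
    output_names=["segmentation"],
    opset_version=11,            # assumed opset
)
```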
On the basis of the above embodiment, after the loss function of the image segmentation model converges, the method further includes: deleting the edge module.
It can be understood that the benefit of setting the edge module during training is to improve the ability of the image segmentation model to learn edges, thereby ensuring the accuracy of the segmentation result image. During application of the image segmentation model, only the first segmented image needs to be output and the edge result image is not needed, and the image segmentation model has already acquired the ability to learn edges; therefore, when the image segmentation model is applied, the edge module can be deleted, that is, the data processing of the edge module is omitted during application, which reduces the amount of data processed by the image segmentation model and improves the processing speed.
In summary, collecting original images from different scenes avoids the heavy workload and production cost of collecting original images frame by frame from video data, and original images from different scenes contain little repeated content, which helps improve the learning ability of the image segmentation model. The encoding module of the image segmentation model adopts a lightweight network, which reduces the amount of data processed during encoding; at the same time, the channel confusion module mixes image features across channels without significantly increasing the amount of computation, enriching the feature information in each channel and thereby ensuring the accuracy of the image segmentation model. Moreover, upsampling the confusion features and fusing them with the confusion features of the next-higher resolution enriches the detail features at different resolutions and further ensures the accuracy of the image segmentation model. In addition, using the fusion features and the second decoding features allows the features of each layer to be reused and deeply supervised, which improves the utilization of the information contained in the features, strengthens the transfer of information and enhances the supervisory effect of the label data. Providing the edge module improves the model's ability to learn edges and further ensures its accuracy, and during application the edge module is deleted to reduce the amount of computation. Converting the image segmentation model into a model under a forward inference framework reduces its dependence on the GPU and reduces the storage space occupied by the application that runs it. During application, the trained image segmentation model can accurately segment the portrait region in video data without any human prior or interaction; in tests on an ordinary PC with an integrated graphics card, the processing time for each frame of video is only about 20 ms, so real-time automatic portrait segmentation can be achieved.
On the basis of the above embodiments, the image segmentation model further includes a decoding module. Correspondingly, after step 235, the method further includes: inputting the first decoding feature with the highest resolution into the decoding module to obtain a corresponding new first decoding feature.
FIG. 7 is a schematic structural diagram of another image segmentation model provided by an embodiment of the present application. Compared with the image segmentation model shown in FIG. 3, the image segmentation model shown in FIG. 7 further includes a decoding module 28.
Exemplarily, after the first decoding feature with the highest resolution is obtained through the residual module, it is passed through a decoding module for further decoding, yielding a new first decoding feature. This new first decoding feature can be regarded as the final first decoding feature of the highest-resolution level, and it is then input to the multiple upsampling module and the edge module provided at that level. It can be understood that the channel number and resolution of the new first decoding feature are the same as those of the original first decoding feature. For example, the first decoding feature obtained after the decoding module 28 in FIG. 7 is denoted Refine112×112, and its resolution is the same as that of RS Block112×112. In one embodiment, the decoding module is a convolutional network; the number and structure of its convolutional layers are not limited in the embodiments. The decoding module improves the accuracy of the highest-level first decoding feature and therefore the accuracy of the image segmentation model. It should be noted that, for the first decoding features, the lower the resolution, the more high-level the semantic features, and the higher the resolution, the richer the detail features. If the first decoding feature with the highest resolution is upsampled directly, jagged (aliasing) artifacts appear in the detail features; a decoding module is therefore added for it so that the resulting new first decoding feature transitions more smoothly and aliasing is avoided. The first decoding features of the other levels show essentially no aliasing after upsampling, and adding decoding modules for them would have little effect on the accuracy of the image segmentation model, so no decoding module needs to be provided for the other levels. It can be understood that, in practical applications, if aliasing appears after upsampling the first decoding features of other levels, decoding modules may also be provided for them to improve the accuracy of the image segmentation model.
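Since the disclosure only states that the decoding (refine) module is a convolutional network whose layer count and structure are not limited, the following is merely one assumed shape for it, keeping the channel count and resolution unchanged as required.

```python
import torch.nn as nn

class RefineBlock(nn.Module):
    """Assumed refine/decoding module: preserves channels and resolution, smooths details."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # output has the same shape as the input first decoding feature
        return self.body(x)
```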
It can be understood that the above image segmentation methods are described with a human as the target object; in practical applications, the target object may be any other object.
FIG. 8 is a schematic structural diagram of an image segmentation apparatus provided by an embodiment of the present application. Referring to FIG. 8, the image segmentation apparatus includes: a data acquisition module 301, a first segmentation module 302, a second segmentation module 303 and a repeated segmentation module 304.
The data acquisition module 301 is configured to acquire the current frame image in video data in which a target object is displayed; the first segmentation module 302 is configured to input the current frame image into a trained image segmentation model to obtain a first segmented image based on the target object; the second segmentation module 303 is configured to smooth the first segmented image to obtain a second segmented image based on the target object; and the repeated segmentation module 304 is configured to take the next frame image in the video data as the current frame image and return to the operation of inputting the current frame image into the trained image segmentation model, until a corresponding second segmented image has been obtained for every frame image in the video data.
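The per-frame flow these modules implement can be sketched as a plain loop; the OpenCV capture calls and the Gaussian kernel used for smoothing below are illustrative assumptions, not details from the disclosure.

```python
import cv2

def segment_video(video_path, segment_frame):
    """Run the trained model (wrapped in `segment_frame`) on every frame of a video."""
    cap = cv2.VideoCapture(video_path)
    results = []
    while True:
        ok, frame = cap.read()            # current frame image
        if not ok:
            break                         # no more frames in the video data
        first_mask = segment_frame(frame)                      # first segmented image
        second_mask = cv2.GaussianBlur(first_mask, (5, 5), 0)  # smoothed second segmented image
        results.append(second_mask)
    cap.release()
    return results
```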
On the basis of the above embodiment, the apparatus further includes: a training acquisition module, configured to acquire a training data set containing a plurality of original images; a label construction module, configured to construct a label data set from the training data set, the label data set containing a plurality of segmentation label images and a plurality of edge label images, one original image corresponding to one segmentation label image and one edge label image; and a model training module, configured to train the image segmentation model according to the training data set and the label data set.
On the basis of the above embodiment, taking the image segmentation model shown in FIG. 3 as an example, the image segmentation model includes: a normalization module 21, an encoding module 22, channel confusion modules 23, residual modules 24, multiple upsampling modules 25, an output module 26 and an edge module 27. In this case, the model training module includes: a normalization unit, configured to input the original image into the normalization module 21 to obtain a normalized image; an encoding unit, configured to use the encoding module 22 to obtain multiple layers of image features of the normalized image, the image features of each layer having a different resolution; a channel confusion unit, configured to input the image features of each layer into the corresponding channel confusion module 23 to obtain multiple layers of confusion features, each layer of image features corresponding to one channel confusion module 23; a fusion unit, configured to upsample the confusion features of every layer except the highest-resolution one and fuse them with the confusion features of the next-higher resolution to obtain the fusion features of that higher resolution (as in FIG. 3, except for the confusion feature of the highest layer, the confusion feature of every other layer is upsampled and then fused with the confusion feature of the layer above it to obtain the fusion feature of that upper layer); a residual unit, configured to input the fusion features of each layer into the corresponding residual module 24 to obtain multiple layers of first decoding features, each layer of fusion features corresponding to one residual module 24, and the lowest-resolution confusion feature serving as the lowest-resolution first decoding feature; a multiple upsampling unit, configured to input the first decoding feature of each layer into the corresponding multiple upsampling module 25 to obtain multiple layers of second decoding features, the first decoding feature of each layer corresponding to one multiple upsampling module and each second decoding feature having the same resolution as the original image; a segmentation output unit, configured to combine the multiple layers of second decoding features and input them to the output module 26 to obtain a segmentation result image; an edge output unit, configured to input the highest-resolution first decoding feature to the edge module 27 to obtain an edge result image; a parameter updating unit, configured to construct a loss function according to each second decoding feature, the edge result image, the corresponding segmentation label image and the edge label image, and to update the model parameters of the image segmentation model according to the loss function; and an image selection unit, configured to select the next original image and return to the operation of inputting the original image to the normalization module, until the loss function converges.
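For the fusion unit in particular, a compact sketch of the upsample-and-fuse step between adjacent resolution levels might look as follows; element-wise addition (and therefore matching channel counts), a fixed scale factor of 2 and bilinear interpolation are all assumptions, since the disclosure does not pin down the fusion operation.

```python
import torch.nn.functional as F

def fuse_levels(confusion_feats):
    """confusion_feats: channel-confused features ordered highest resolution first.
    Returns one fused feature per level; the lowest level is passed through as-is."""
    fused = []
    for i in range(len(confusion_feats) - 1):
        up = F.interpolate(confusion_feats[i + 1], scale_factor=2,
                           mode="bilinear", align_corners=False)  # upsample lower-resolution feature
        fused.append(confusion_feats[i] + up)   # assumed element-wise fusion with the higher level
    fused.append(confusion_feats[-1])           # lowest resolution: used directly
    return fused
```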
On the basis of the above embodiment, referring to FIG. 7, the image segmentation model further includes a decoding module 28. Correspondingly, the model training module further includes: a decoding unit, configured to input the highest-resolution first decoding feature to the decoding module 28 after the fusion features of each layer have been input into the corresponding residual modules 24 to obtain the multiple layers of first decoding features, so as to obtain a corresponding new first decoding feature.
On the basis of the above embodiment, the encoding module includes a MobileNetV2 network.
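One hedged way to obtain multi-resolution image features from a MobileNetV2 backbone is to tap intermediate layers of the torchvision implementation; the tap indices below are assumptions chosen to give one feature map per downsampling stage and are not specified by the disclosure.

```python
import torch
import torchvision

class MobileNetV2Encoder(torch.nn.Module):
    """Taps intermediate MobileNetV2 feature maps at several resolutions (assumed tap points)."""
    def __init__(self, tap_indices=(1, 3, 6, 13, 18)):
        super().__init__()
        # pretrained weights can be loaded if desired; omitted here for brevity
        self.features = torchvision.models.mobilenet_v2().features
        self.tap_indices = set(tap_indices)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.tap_indices:
                feats.append(x)           # one image-feature map per resolution level
        return feats
```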
On the basis of the above embodiment, the loss function is expressed by the formula shown in the published figure (Figure PCTCN2020137858-appb-000013), where Loss denotes the loss function, n denotes the total number of layers corresponding to the second decoding features, the sub-loss function of Figure PCTCN2020137858-appb-000014 is calculated from the highest-resolution second decoding feature and the corresponding segmentation label image, the sub-loss function of Figure PCTCN2020137858-appb-000015 is calculated from the lowest-resolution second decoding feature and the segmentation label image (see also Figures PCTCN2020137858-appb-000016 and PCTCN2020137858-appb-000017), A_n denotes the lowest-resolution second decoding feature, B denotes the corresponding segmentation label image, Iou_n denotes the overlap similarity between A_n and B, and loss_edge is a Focal loss function.
On the basis of the above embodiment, the apparatus further includes: an edge deletion module, configured to delete the edge module after the loss function of the image segmentation model converges.
On the basis of the above embodiment, the apparatus further includes: a framework conversion module, configured to, after the image segmentation model has been trained according to the training data set and the label data set, convert the image segmentation model into a network model recognizable by the forward inference framework when the image segmentation model is not such a network model.
On the basis of the above embodiment, the label construction module includes: an annotation acquisition unit, configured to acquire the annotation result for the original image; a segmentation label acquisition unit, configured to obtain the corresponding segmentation label image from the annotation result; an erosion unit, configured to perform an erosion operation on the segmentation label image to obtain an eroded image; a Boolean unit, configured to perform a Boolean operation on the segmentation label image and the eroded image to obtain the edge label image corresponding to the original image; and a data set construction unit, configured to obtain the label data set from the segmentation label image and the edge label image.
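The erosion-then-Boolean construction of the edge label can be illustrated with OpenCV; the kernel size and the choice of XOR as the Boolean operation are assumptions for the sketch, not values fixed by the disclosure.

```python
import cv2
import numpy as np

def make_edge_label(seg_label, kernel_size=5):
    """seg_label: binary segmentation label image (uint8, values 0/255)."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(seg_label, kernel)              # eroded image
    edge_label = cv2.bitwise_xor(seg_label, eroded)    # Boolean op keeps only the boundary rim
    return edge_label
```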
On the basis of the above embodiment, the apparatus further includes: a target background acquisition module, configured to acquire a target background image containing a target background after the first segmented image has been smoothed to obtain the second segmented image based on the target object; and a background replacement module, configured to perform background replacement on the current frame image according to the target background image and the second segmented image, so as to obtain a new current frame image.
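Background replacement with the smoothed second segmented image can be sketched as simple alpha compositing; resizing the target background to the frame size and treating the mask as an 8-bit alpha map are assumptions made for illustration.

```python
import cv2
import numpy as np

def replace_background(frame, mask, background):
    """Composite the target-object region of `frame` over `background` using `mask`."""
    background = cv2.resize(background, (frame.shape[1], frame.shape[0]))
    alpha = mask.astype(np.float32) / 255.0      # smoothed second segmented image as alpha
    if alpha.ndim == 2:
        alpha = alpha[:, :, None]                # broadcast over the color channels
    composed = alpha * frame + (1.0 - alpha) * background
    return composed.astype(np.uint8)             # new current frame image
```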
The image segmentation apparatus provided above can be used to execute the image segmentation method provided by any of the above embodiments, and has the corresponding functions and beneficial effects.
It is worth noting that, in the above embodiments of the image segmentation apparatus, the units and modules included are divided only according to functional logic; the division is not limited to the above as long as the corresponding functions can be realized. In addition, the specific names of the functional units are used only to distinguish them from each other and are not intended to limit the protection scope of the present application.
FIG. 9 is a schematic structural diagram of an image segmentation device provided by an embodiment of the present application. As shown in FIG. 9, the image segmentation device includes a processor 40, a memory 41, an input device 42 and an output device 43; the number of processors 40 in the image segmentation device may be one or more, and one processor 40 is taken as an example in FIG. 9. The processor 40, the memory 41, the input device 42 and the output device 43 in the image segmentation device may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 9.
As a computer-readable storage medium, the memory 41 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the image segmentation method in the embodiments of the present application (for example, the data acquisition module 301, the first segmentation module 302, the second segmentation module 303 and the repeated segmentation module 304 in the image segmentation apparatus). The processor 40 executes the various functional applications and data processing of the image segmentation device by running the software programs, instructions and modules stored in the memory 41, that is, implements the above image segmentation method.
The memory 41 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the image segmentation device, and the like. In addition, the memory 41 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 41 may further include memories located remotely relative to the processor 40, and these remote memories may be connected to the image segmentation device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 42 may be used to receive input numerical or character information and to generate key signal inputs related to user settings and function control of the image segmentation device. The output device 43 may include a display device such as a display screen.
The above image segmentation device includes the image segmentation apparatus and can be used to execute any of the image segmentation methods, with the corresponding functions and beneficial effects.
In addition, the embodiments of the present application further provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the relevant operations in the image segmentation method provided by any embodiment of the present application, with the corresponding functions and beneficial effects.
As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product.
Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage and the like) containing computer-usable program code. The present application is described with reference to flowcharts and/or block diagrams of the methods, devices (systems) and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory. The memory may include non-persistent memory, random access memory (RAM) and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present application. Therefore, although the present application has been described in some detail through the above embodiments, the present application is not limited to the above embodiments and may include more other equivalent embodiments without departing from the concept of the present application; the scope of the present application is determined by the scope of the appended claims.

Claims (13)

  1. 一种图像分割方法,其中,包括:An image segmentation method, comprising:
    获取视频数据中的当前帧图像,所述视频数据中显示有目标对象;Obtain the current frame image in the video data, and the target object is displayed in the video data;
    将所述当前帧图像输入至训练好的图像分割模型,以得到基于所述目标对象的第一分割图像;The current frame image is input into the trained image segmentation model to obtain the first segmented image based on the target object;
    对所述第一分割图像进行平滑处理,以得到基于所述目标对象的第二分割图像;smoothing the first segmented image to obtain a second segmented image based on the target object;
    Take the next frame image in the video data as the current frame image, and return to perform the operation of inputting the current frame image into the trained image segmentation model, until a corresponding second segmented image is obtained for each frame image in the video data.
  2. 根据权利要求1所述的图像分割方法,其中,还包括:The image segmentation method according to claim 1, wherein, further comprising:
    获取训练数据集,所述训练数据集包含多张原始图像;Obtain a training data set, the training data set includes a plurality of original images;
    根据所述训练数据集构建标签数据集,所述标签数据集包含多张分割标签图像和多张边缘标签图像,一张所述原始图像对应一张分割标签图像和一张边缘标签图像;Build a label data set according to the training data set, the label data set includes a plurality of segmentation label images and a plurality of edge label images, and one of the original images corresponds to a segmentation label image and an edge label image;
    根据所述训练数据集和所述标签数据集训练所述图像分割模型。The image segmentation model is trained based on the training dataset and the label dataset.
  3. 根据权利要求2所述的图像分割方法,其中,所述图像分割模型包括:归一化模块、编码模块、通道混淆模块、残差模块、多倍上采样模块、输出模块以及边缘模块;The image segmentation method according to claim 2, wherein the image segmentation model comprises: a normalization module, an encoding module, a channel confusion module, a residual module, a multiple upsampling module, an output module and an edge module;
    所述根据所述训练数据集和所述标签数据集训练所述图像分割模型包括:The training of the image segmentation model according to the training data set and the label data set includes:
    将所述原始图像输入至所述归一化模块,以得到归一化图像;Inputting the original image to the normalization module to obtain a normalized image;
    利用所述编码模块得到所述归一化图像的多层图像特征,且每层所述图像特征的分辨率不同;Using the encoding module to obtain the multi-layer image features of the normalized image, and the resolution of the image features of each layer is different;
    分别将各层所述图像特征输入至对应的通道混淆模块,以得到多层混淆特征,每层所述图像特征对应一个通道混淆模块;The image features of each layer are respectively input to the corresponding channel confusion module to obtain multi-layer confusion features, and each layer of the image features corresponds to a channel confusion module;
    除分辨率最高的混淆特征外,将其他每层的混淆特征进行上采样,并与高一级分辨率的混淆特征进行融合以得到高一级分辨率对应的融合特征;In addition to the confusion features with the highest resolution, the confusion features of each other layer are up-sampled, and fused with the confusion features of the higher resolution to obtain the fusion features corresponding to the higher resolution;
    Input the fusion features of each layer into the corresponding residual modules respectively to obtain multiple layers of first decoding features, wherein each layer of fusion features corresponds to one residual module and the confusion feature with the lowest resolution serves as the first decoding feature with the lowest resolution;
    Input the first decoding features of each layer into the corresponding multiple upsampling modules respectively to obtain multiple layers of second decoding features, wherein the first decoding feature of each layer corresponds to one multiple upsampling module and each second decoding feature has the same resolution as the original image;
    将多层所述第二解码特征联合后输入至所述输出模块,以得到分割结果图像;Combining multiple layers of the second decoding features and inputting them to the output module to obtain a segmentation result image;
    将分辨率最高的第一解码特征输入至所述边缘模块,以得到边缘结果图像;Inputting the first decoded feature with the highest resolution to the edge module to obtain an edge result image;
    根据各所述第二解码特征、边缘结果图像、对应的分割标签图像和边缘标签图像构造损失函数,并根据所述损失函数更新所述图像分割模型的模型参数;Construct a loss function according to each of the second decoding features, the edge result image, the corresponding segmentation label image and the edge label image, and update the model parameters of the image segmentation model according to the loss function;
    选择下一原始图像,并返回执行将所述原始图像输入至所述归一化模块的操作,直到所述损失函数收敛为止。The next original image is selected, and the operation of inputting the original image to the normalization module is performed back until the loss function converges.
  4. 根据权利要求3所述的图像分割方法,其中,所述图像分割模型还包括:解码模块;The image segmentation method according to claim 3, wherein the image segmentation model further comprises: a decoding module;
    所述分别将各层所述融合特征输入至对应的残差模块,以得到多层第一解码特征之后,还包括:After inputting the fusion features of each layer into the corresponding residual modules to obtain the multi-layer first decoding features, the method further includes:
    将分辨率最高的第一解码特征输入至所述解码模块,以得到对应的新的第一解码特征。The first decoding feature with the highest resolution is input to the decoding module to obtain a corresponding new first decoding feature.
  5. 根据权利要求3所述的图像分割方法,其中,所述编码模块包括MobileNetV2网络。The image segmentation method according to claim 3, wherein the encoding module comprises a MobileNetV2 network.
  6. The image segmentation method according to claim 3, wherein the loss function is expressed by the formula shown in Figure PCTCN2020137858-appb-100001, where Loss denotes the loss function, n denotes the total number of layers corresponding to the second decoding features, the sub-loss function of Figure PCTCN2020137858-appb-100002 is calculated from the second decoding feature with the highest resolution and the corresponding segmentation label image, the sub-loss function of Figure PCTCN2020137858-appb-100003 is calculated from the second decoding feature with the lowest resolution and the segmentation label image (see also Figures PCTCN2020137858-appb-100004 and PCTCN2020137858-appb-100005), A_n denotes the second decoding feature with the lowest resolution, B denotes the corresponding segmentation label image, Iou_n denotes the overlap similarity between A_n and B, and loss_edge is a Focal loss function.
  7. 根据权利要求3所述的图像分割方法,其中,所述图像分割模型的损失函数收敛之后,还包括:The image segmentation method according to claim 3, wherein after the loss function of the image segmentation model converges, the method further comprises:
    删除所述边缘模块。Delete the edge module.
  8. 根据权利要求2所述的图像分割方法,其中,所述根据所述训练数据集和所述标签数据集训练图像分割模型之后,还包括:The image segmentation method according to claim 2, wherein after the image segmentation model is trained according to the training data set and the label data set, the method further comprises:
    在所述图像分割模型不是前向推理框架可识别的网络模型时,将所述图像分割模型转换成所述前向推理框架可识别的网络模型。When the image segmentation model is not a network model recognizable by the forward inference framework, the image segmentation model is converted into a network model recognizable by the forward inference framework.
  9. 根据权利要求2所述的图像分割方法,其中,所述根据所述训练数据集构建标签数据集包括:The image segmentation method according to claim 2, wherein the constructing a label data set according to the training data set comprises:
    获取针对所述原始图像的标注结果;obtaining an annotation result for the original image;
    根据所述标注结果得到对应的分割标签图像;Obtain a corresponding segmented label image according to the labeling result;
    对所述分割标签图像进行腐蚀操作,以得到腐蚀图像;performing an erosion operation on the segmented label image to obtain an eroded image;
    对所述分割标签图像和所述腐蚀图像进行布尔操作,以得到所述原始图像对应的边缘标签图像;performing a Boolean operation on the segmented label image and the eroded image to obtain an edge label image corresponding to the original image;
    根据所述分割标签图像和所述边缘标签图像得到标签数据集。A label dataset is obtained according to the segmented label image and the edge label image.
  10. 根据权利要求1所述的图像分割方法,其中,所述对所述第一分割图像进行平滑处理,以得到基于所述目标对象的第二分割图像之后,还包括:The image segmentation method according to claim 1, wherein after performing smoothing processing on the first segmented image to obtain a second segmented image based on the target object, the method further comprises:
    获取目标背景图像,所述目标背景图像中包含目标背景;obtaining a target background image, where the target background image includes the target background;
    根据所述目标背景图像和所述第二分割图像对所述当前帧图像进行背景替换,以得到当前帧新图像。Background replacement is performed on the current frame image according to the target background image and the second segmented image to obtain a new image of the current frame.
  11. 一种图像分割装置,其中,包括:An image segmentation device, comprising:
    数据获取模块,用于获取视频数据中的当前帧图像,所述视频数据中显示有目标对象;a data acquisition module for acquiring the current frame image in the video data, where the target object is displayed;
    第一分割模块,用于将所述当前帧图像输入至训练好的图像分割模型,以得到基于所述目标对象的第一分割图像;a first segmentation module, for inputting the current frame image into a trained image segmentation model to obtain a first segmented image based on the target object;
    第二分割模块,用于对所述第一分割图像进行平滑处理,以得到基于所述目标对象的第二分割图像;a second segmentation module, configured to perform smoothing processing on the first segmented image to obtain a second segmented image based on the target object;
    a repeated segmentation module, configured to take the next frame image in the video data as the current frame image and return to perform the operation of inputting the current frame image into the trained image segmentation model, until a corresponding second segmented image has been obtained for each frame image in the video data.
  12. 一种图像分割设备,其中,包括:An image segmentation device, comprising:
    一个或多个处理器one or more processors
    存储器,用于存储一个或多个程序;memory for storing one or more programs;
    当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求1-10中任一所述的图像分割方法。When the one or more programs are executed by the one or more processors, the one or more processors implement the image segmentation method according to any one of claims 1-10.
  13. 一种计算机可读存储介质,其上存储有计算机程序,其中,该程序被处理器执行时实现如权利要求1-10中任一所述的图像分割方法。A computer-readable storage medium on which a computer program is stored, wherein when the program is executed by a processor, the image segmentation method according to any one of claims 1-10 is implemented.
PCT/CN2020/137858 2020-12-21 2020-12-21 Image segmentation method and apparatus, and device and storage medium WO2022133627A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080099096.5A CN115349139A (en) 2020-12-21 2020-12-21 Image segmentation method, device, equipment and storage medium
PCT/CN2020/137858 WO2022133627A1 (en) 2020-12-21 2020-12-21 Image segmentation method and apparatus, and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/137858 WO2022133627A1 (en) 2020-12-21 2020-12-21 Image segmentation method and apparatus, and device and storage medium

Publications (1)

Publication Number Publication Date
WO2022133627A1 true WO2022133627A1 (en) 2022-06-30

Family

ID=82157066

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137858 WO2022133627A1 (en) 2020-12-21 2020-12-21 Image segmentation method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN115349139A (en)
WO (1) WO2022133627A1 (en)

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN114882076A (en) * 2022-07-11 2022-08-09 中国人民解放军国防科技大学 Lightweight video object segmentation method based on big data memory storage
CN115277452A (en) * 2022-07-01 2022-11-01 中铁第四勘察设计院集团有限公司 ResNet self-adaptive acceleration calculation method based on edge-end cooperation and application
CN116189194A (en) * 2023-04-27 2023-05-30 北京中昌工程咨询有限公司 Drawing enhancement segmentation method for engineering modeling
CN117237397A (en) * 2023-07-13 2023-12-15 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN116824308B (en) * 2023-08-30 2024-03-22 腾讯科技(深圳)有限公司 Image segmentation model training method and related method, device, medium and equipment

Citations (3)

Publication number Priority date Publication date Assignee Title
CN110827292A (en) * 2019-10-23 2020-02-21 中科智云科技有限公司 Video instance segmentation method and device based on convolutional neural network
CN110910391A (en) * 2019-11-15 2020-03-24 安徽大学 Video object segmentation method with dual-module neural network structure
WO2020170167A1 (en) * 2019-02-21 2020-08-27 Sony Corporation Multiple neural networks-based object segmentation in a sequence of color image frames

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
WO2020170167A1 (en) * 2019-02-21 2020-08-27 Sony Corporation Multiple neural networks-based object segmentation in a sequence of color image frames
CN110827292A (en) * 2019-10-23 2020-02-21 中科智云科技有限公司 Video instance segmentation method and device based on convolutional neural network
CN110910391A (en) * 2019-11-15 2020-03-24 安徽大学 Video object segmentation method with dual-module neural network structure

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN115277452A (en) * 2022-07-01 2022-11-01 中铁第四勘察设计院集团有限公司 ResNet self-adaptive acceleration calculation method based on edge-end cooperation and application
CN115277452B (en) * 2022-07-01 2023-11-28 中铁第四勘察设计院集团有限公司 ResNet self-adaptive acceleration calculation method based on edge-side coordination and application
CN114882076A (en) * 2022-07-11 2022-08-09 中国人民解放军国防科技大学 Lightweight video object segmentation method based on big data memory storage
CN116189194A (en) * 2023-04-27 2023-05-30 北京中昌工程咨询有限公司 Drawing enhancement segmentation method for engineering modeling
CN116189194B (en) * 2023-04-27 2023-07-14 北京中昌工程咨询有限公司 Drawing enhancement segmentation method for engineering modeling
CN117237397A (en) * 2023-07-13 2023-12-15 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion

Also Published As

Publication number Publication date
CN115349139A (en) 2022-11-15

Similar Documents

Publication Publication Date Title
WO2022133627A1 (en) Image segmentation method and apparatus, and device and storage medium
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
Gautam et al. Realistic river image synthesis using deep generative adversarial networks
US11651477B2 (en) Generating an image mask for a digital image by utilizing a multi-branch masking pipeline with neural networks
CN114008663A (en) Real-time video super-resolution
DE102016005407A1 (en) Joint depth estimation and semantic labeling of a single image
US11393100B2 (en) Automatically generating a trimap segmentation for a digital image by utilizing a trimap generation neural network
CN111832570A (en) Image semantic segmentation model training method and system
CN112990222B (en) Image boundary knowledge migration-based guided semantic segmentation method
CA3137297C (en) Adaptive convolutions in neural networks
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
WO2022109922A1 (en) Image matting implementation method and apparatus, and device and storage medium
CN112070040A (en) Text line detection method for video subtitles
KR20210029692A (en) Method and storage medium for applying bokeh effect to video images
CN117237648B (en) Training method, device and equipment of semantic segmentation model based on context awareness
CN112364933A (en) Image classification method and device, electronic equipment and storage medium
JP2022090633A (en) Method, computer program product and computer system for improving object detection within high-resolution image
CN115205624A (en) Cross-dimension attention-convergence cloud and snow identification method and equipment and storage medium
Tran et al. Encoder–decoder network with guided transmission map: Robustness and applicability
Lin et al. Deep asymmetric extraction and aggregation for infrared small target detection
Zhang Detect forgery video by performing transfer learning on deep neural network
Yetiş Auto-conversion from 2D drawing to 3D model with deep learning
EP3401843A1 (en) A method, an apparatus and a computer program product for modifying media content
Deshpande et al. Fusion of handcrafted edge and residual learning features for image colorization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20966208

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20966208

Country of ref document: EP

Kind code of ref document: A1