WO2022133627A1 - Image segmentation method and apparatus, and device and storage medium - Google Patents

Image segmentation method and apparatus, and device and storage medium

Info

Publication number
WO2022133627A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
module
segmentation
label
features
Application number
PCT/CN2020/137858
Other languages
English (en)
Chinese (zh)
Inventor
曹桂平
Original Assignee
广州视源电子科技股份有限公司
Application filed by 广州视源电子科技股份有限公司 filed Critical 广州视源电子科技股份有限公司
Priority to PCT/CN2020/137858 priority Critical patent/WO2022133627A1/fr
Priority to CN202080099096.5A priority patent/CN115349139A/zh
Publication of WO2022133627A1 publication Critical patent/WO2022133627A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation

Definitions

  • the embodiments of the present application relate to the technical field of image processing, and in particular, to an image segmentation method, apparatus, device, and storage medium.
  • Image segmentation is one of the common techniques in image processing. It is used to accurately extract the region of interest from the image to be processed and treat it as the target region image, so as to facilitate subsequent processing of the target region image (such as background replacement, extraction of the target region image, etc.).
  • Portrait-based image segmentation is an important application in the field of image segmentation.
  • Portrait-based image segmentation refers to the accurate separation of the portrait area and the background area in the image to be processed.
  • It is of great significance to perform portrait-based image segmentation on online video data. In scenarios such as online conferences or online live broadcasts, image segmentation is performed on the online video data to accurately separate the portrait area and the background area in the video data, and the background area is then replaced with a background image, thereby achieving the goal of protecting user privacy.
  • Image segmentation methods mainly include threshold-based, region-based, edge-based, and graph-theory- and energy-functional-based methods.
  • The threshold-based method segments according to the grayscale features in the image; its drawback is that it is only suitable for images in which the grayscale values of the portrait area fall clearly outside the grayscale values of the background area.
  • The region-based method divides the image into different regions according to a similarity criterion over spatial neighborhoods; its disadvantage is that it cannot handle complex images.
  • The edge-based method mainly uses the discontinuity of local image features (such as the abrupt pixel change at the edge of a face) to obtain the boundary of the portrait region; its disadvantage is high computational complexity.
  • The methods based on graph theory and energy functionals mainly use the energy functional of the image to perform portrait segmentation; their disadvantage is that the amount of calculation is huge and artificial prior information is required. Due to these defects, the above techniques cannot be applied to scenarios that require real-time, simple and accurate image segmentation of online video data.
  • Embodiments of the present application provide an image segmentation method, apparatus, device, and storage medium, so as to solve the technical problem that the above technology cannot accurately perform image segmentation on online video data.
  • an embodiment of the present application provides an image segmentation method, including:
  • the current frame image is input into the trained image segmentation model to obtain the first segmented image based on the target object;
  • an embodiment of the present application further provides an image segmentation device, including:
  • a data acquisition module, configured to acquire the current frame image in the video data, in which the target object is displayed;
  • a first segmentation module for inputting the current frame image into a trained image segmentation model to obtain a first segmented image based on the target object
  • a second segmentation module configured to perform smoothing processing on the first segmented image to obtain a second segmented image based on the target object
  • a repeated segmentation module, configured to take the next frame image in the video data as the current frame image and return to the operation of inputting the current frame image into the trained image segmentation model, until a corresponding second segmented image is obtained for each frame image in the video data.
  • the embodiments of the present application further provide an image segmentation device, including:
  • one or more processors;
  • a memory for storing one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the image segmentation method as described in the first aspect.
  • an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the image segmentation method described in the first aspect.
  • In the above-mentioned image segmentation method, apparatus, device and storage medium, video data containing the target object is acquired, each frame image of the video data is input into the image segmentation model to obtain the corresponding first segmented image, and the first segmented image is then smoothed to obtain the second segmented image. This technical means solves the technical problem that some image segmentation technologies cannot accurately segment online video data.
  • Online video data can thus be segmented accurately and in real time. Thanks to the self-learning of the image segmentation model, the method can be applied to online video data with complex images, and in the application process it can be used directly by simply deploying the image segmentation model, without artificial prior information, which reduces the complexity of image segmentation and expands the application scenarios of the image segmentation method.
  • FIG. 1 is a flowchart of an image segmentation method provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of another image segmentation method provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an image segmentation model provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an original image provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a segmentation result image provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an edge result image provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another image segmentation model provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an image segmentation apparatus provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an image segmentation device provided by an embodiment of the present application.
  • The terms "first" and "second" are only used to distinguish one entity, operation or object from another, and do not necessarily require or imply any actual relationship or order between these entities, operations or objects.
  • For example, "first" and "second" in the first segmented image and the second segmented image are used to distinguish two different segmented images.
  • The image segmentation method provided in this embodiment of the present application may be performed by an image segmentation device, which may be implemented in software and/or hardware, and the image segmentation device may be composed of two or more physical entities, or of a single physical entity.
  • The image segmentation device may be a computer, a mobile phone, a tablet, an interactive smart tablet, or another smart device with data computing and analysis capabilities.
  • FIG. 1 is a flowchart of an image segmentation method provided by an embodiment of the present application.
  • the image segmentation method specifically includes:
  • Step 110 Acquire the current frame image in the video data, and the target object is displayed in the video data.
  • the video data is the video data for which image segmentation is currently required, which may be online video data or offline video data.
  • the video data includes multiple frames of images, and each frame of images displays a target object, which can be considered as an object that needs to be separated from the background image.
  • The background images of the frame images in the video data may be the same or different, which is not limited in the embodiment, and the target object may change with the playback of the video data, but during the change process, the type of the target object does not change.
  • the target object is a human being
  • the human image in the video data may change (such as replacing a person or adding a new person, etc.), but the target object in the video data is always a human being.
  • This embodiment takes the case where the target object is a human being as an example.
  • the source of the video data is not limited in this embodiment.
  • In one embodiment, the video data is a piece of video shot by an image capture device (such as a camera, a video camera, etc.) connected to the image segmentation device.
  • the video data is a conference screen obtained from a network in a video conference scenario.
  • the video data is a live broadcast image obtained from a network in a live broadcast scenario.
  • performing image segmentation on the video data refers to separating the region where the target object is located in each frame of image in the video data.
  • the target object is exemplarily described as a human being.
  • the processing of the video data is in units of frames, that is, the images in the video data are acquired frame by frame, and the images are processed to obtain a final image segmentation result.
  • the currently processed image is recorded as the current frame image, and the processing of the current frame image is taken as an example for description.
  • Step 120 Input the current frame image into the trained image segmentation model to obtain the first segmented image based on the target object.
  • the image segmentation model is a pre-trained neural network model, which is used to segment the target object in the current frame image and output the segmentation result corresponding to the current frame image.
  • the segmentation result is recorded as the first segmented image
  • the portrait area and the background area of the current frame image can be determined through the first segmented image, wherein the portrait area can be considered as the area where the target object (human) is located.
  • The first segmented image is a binary image, and its pixel values include two types: 0 and 1, wherein the area with a pixel value of 0 belongs to the background area of the current frame image, and the area with a pixel value of 1 belongs to the portrait area of the current frame image.
  • In one embodiment, the pixel values are converted into two types, 0 and 255, before displaying the first segmented image, wherein the area with a pixel value of 0 belongs to the background area, and the area with a pixel value of 255 belongs to the portrait area.
  • The resolution of the first segmented image is the same as the resolution of the current frame image. It can be understood that when the image segmentation model has a resolution requirement for the input image, that is, when a fixed-resolution image needs to be input, it is necessary to determine whether the resolution of the current frame image meets the resolution requirement; if it does not, resolution conversion is performed on the current frame image to obtain a current frame image that meets the requirement.
  • In this case, resolution conversion is also performed on the first segmented image, so that the resolution of the first segmented image is the same as the resolution of the original current frame image (that is, the current frame image before the resolution conversion).
  • If the image segmentation model does not have a resolution requirement for the input image, the current frame image can be directly input into the image segmentation model to obtain a first segmented image with the same resolution.
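  • A minimal sketch of this resolution handling is given below, assuming an OpenCV-based pipeline in which the model expects a fixed 224×224 input and returns a binary mask as a NumPy array; the function and variable names are illustrative only.

```python
import cv2
import numpy as np

def segment_frame(frame, model, model_input_size=(224, 224)):
    """Resize the frame to the model's expected resolution (if any), run the
    segmentation model, and resize the resulting mask back to the frame size."""
    original_h, original_w = frame.shape[:2]
    resized = cv2.resize(frame, model_input_size, interpolation=cv2.INTER_LINEAR)
    mask = model(resized)  # assumed to return a binary mask at model resolution
    # Nearest-neighbor interpolation keeps the mask binary when scaling back.
    mask = cv2.resize(mask.astype(np.uint8), (original_w, original_h),
                      interpolation=cv2.INTER_NEAREST)
    return mask
```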
  • the structure and parameters of the image segmentation model can be set according to actual conditions.
  • the image segmentation model adopts an autoencoder structure.
  • An autoencoder is a kind of artificial neural network used in semi-supervised learning and unsupervised learning.
  • The autoencoder includes two parts, an encoder and a decoder, wherein the encoder is used to extract the features in the image, and the decoder is used to decode the extracted features to obtain the learning result (for example, the first segmented image).
  • In one embodiment, the encoder adopts a lightweight network to reduce the amount of data processing and calculation when extracting features and to speed up processing.
  • The decoder can be implemented by residual blocks combined with channel confusion, upsampling, etc., to achieve fully automatic real-time image segmentation.
  • In this way, the features of the current frame image at different resolutions can be extracted by the encoder, and then the decoder performs operations such as upsampling, fusion and decoding on each feature to reuse the features, thereby obtaining an accurate first segmented image.
  • the image segmentation model is deployed under the forward inference framework.
  • the specific type of the forward reasoning framework can be set according to the actual situation, for example, the forward reasoning framework is the openvino framework.
  • When deployed in the forward inference framework, the image segmentation model has a low dependence on the GPU, is relatively portable, and does not occupy a large amount of storage space.
  • Step 130 Smooth the first segmented image to obtain a second segmented image based on the target object.
  • The first segmented image may exhibit edge jaggedness, which can be understood as jagged edges between the portrait area and the background area that make the separation of the portrait area and the background area too stiff.
  • the first segmented image is smoothed, that is, the edge jaggedness in the first segmented image is smoothed, so as to obtain a segmented image with smoother edges.
  • the segmented image is denoted as the second segmented image.
  • the second segmented image can also be considered as the final segmented result of the current frame image. It can be understood that the second segmented image is also a binary image, and its pixel values include two types: 0 and 1.
  • The area with a pixel value of 0 belongs to the background area of the current frame image, and the area with a pixel value of 1 belongs to the portrait area of the current frame image.
  • the smoothing processing is implemented by means of Gaussian smoothing filtering.
  • the Gaussian kernel function is used in the Gaussian smoothing filtering to process the first segmented image to obtain the second segmented image.
  • the Gaussian kernel function is a commonly used kernel function.
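  • A possible OpenCV-based sketch of this Gaussian smoothing step is shown below; the 5×5 kernel size and the sigma value are illustrative choices and are not fixed by the embodiment.

```python
import cv2
import numpy as np

def smooth_segmentation(first_segmented_image, kernel_size=5, sigma=0):
    """Apply Gaussian smoothing to the binary segmentation mask to soften the
    jagged edges between the portrait area and the background area."""
    mask = first_segmented_image.astype(np.float32)
    smoothed = cv2.GaussianBlur(mask, (kernel_size, kernel_size), sigma)
    # The soft values near the edges can be kept for smoother blending,
    # or re-binarized if a strict 0/1 second segmented image is required.
    return smoothed
```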
  • Step 140 take the next frame image in the video data as the current frame image, and return to perform the operation of inputting the current frame image to the trained image segmentation model, until each frame image in the video data obtains the corresponding second segmentation image.
  • Specifically, the processing procedure is to take the next frame of image as the current frame image and repeat steps 110 to 130 to obtain the second segmented image of that current frame image, then acquire the next frame of image, and repeat the above process until a corresponding second segmented image is obtained for each frame of image in the video data, thereby achieving image segmentation of the video data.
  • the current frame image can be processed according to actual needs.
  • In one embodiment, the method further includes: acquiring a target background image, where the target background image includes the target background; and replacing the background of the current frame image according to the target background image and the second segmented image, so as to obtain a new current frame image.
  • the target background refers to a new background used after the background is replaced.
  • the target background image is the image that contains the target background.
  • the target background image and the second segmented image have the same resolution.
  • the target background image may be an image selected by the user of the image segmentation device, or may be a default image of the image segmentation device.
  • the background of the current frame image is replaced to obtain a replaced image.
  • the image after the background replacement is recorded as the new image of the current frame.
  • In one embodiment, the background replacement method is: determining, by using the second segmented image, the pixels where the portrait area is located and the pixels where the background area is located in the current frame image; after that, retaining the portrait area and replacing the corresponding background area with the target background in the target background image, to obtain a new current frame image.
  • Exemplarily, denoting the current frame image as I, the second segmented image as S2 and the target background image as B, the portrait area can be retained by I × S2 (that is, after the current frame image is multiplied by the second segmented image, the pixels of the current frame image corresponding to pixels with value 1 in the second segmented image are retained), and the background area can be replaced by (1 - S2) × B (that is, after the target background image is multiplied by (1 - S2), the pixels of the target background image corresponding to pixels with value 0 in the second segmented image are retained).
  • In this way, a new current frame image corresponding to each current frame image can be obtained through the second segmented image; the new current frame images then form new video data after background replacement.
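  • The background replacement I × S2 + (1 - S2) × B can be sketched as follows; the variable names are illustrative, and the target background image is assumed to have already been resized to the frame resolution.

```python
import numpy as np

def replace_background(current_frame, second_segmented_image, target_background):
    """Keep the portrait pixels of the current frame and fill the remaining
    pixels with the target background: I * S2 + (1 - S2) * B."""
    s2 = second_segmented_image.astype(np.float32)
    if s2.ndim == 2:  # broadcast the single-channel mask over the RGB channels
        s2 = s2[..., np.newaxis]
    new_frame = (current_frame.astype(np.float32) * s2
                 + target_background.astype(np.float32) * (1.0 - s2))
    return new_frame.astype(np.uint8)
```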
  • The above description takes the case where the video data includes the target object as an example.
  • In other embodiments, the video data may not include the target object; in that case, all pixel values of the segmented image are 0.
  • each frame of the video data is input into the image segmentation model to obtain the corresponding first segmented image, and then the first segmented image is smoothed to obtain the second segmented image
  • the video data can be accurately segmented, especially the online video data, which ensures the processing speed of the online video data.
  • Due to the self-learning of the image segmentation model, the method can be applied to video data with complex images, and in the application process it can be used directly by simply deploying the image segmentation model, without artificial prior information, which simplifies image segmentation.
  • the application scenarios of image segmentation methods are expanded.
  • FIG. 2 is a flowchart of another image segmentation method provided by an embodiment of the present application. This image segmentation method is based on the above-mentioned image segmentation method, and exemplifies the training process of the image segmentation model. Referring to Figure 2, the image segmentation method specifically includes:
  • Step 210 Acquire a training data set, where the training data set includes multiple original images.
  • the training data refers to the data that the image segmentation model learns when training the image segmentation model.
  • the training data is in the form of images, so the training data is referred to as the original image, and the original image and the video data contain the same type of target objects.
  • a training dataset refers to a dataset containing a large number of original images. That is, during the training process, a large number of original images are selected from the training data set for the image segmentation model to learn, so as to improve the accuracy of the image segmentation model.
  • The video data contains a large number of images. If the original images were collected from the video data, the images would need to be collected frame by frame, which would consume a lot of labor and production cost, and each collected original image would contain a large amount of repeated content, which is not conducive to the training of the image segmentation model. Therefore, in the embodiment, the training data set is constructed from independent original images instead of video data. The constructed training data set can then contain original images with different portrait poses in different scenes, wherein the scene is preferably a natural scene.
  • a plurality of natural scenes are preselected, and in each natural scene, a plurality of images containing human beings are captured by an image acquisition device as original images, wherein the postures of the human beings in the plurality of images are different.
  • Considering the influence of the parameters of the image acquisition device (such as its position, aperture size and degree of focus) and of the lighting in the natural environment on the performance of the image segmentation model, when constructing the training data set, multiple original images are collected for the same portrait pose in the same natural scene under different lighting and different shooting parameters, so as to ensure the performance of the image segmentation model when processing video data with different scenes, portrait poses, lighting and shooting parameters.
  • an existing public image data set can also be used as the training data set, for example, the public data set Supervisely can be used as the training data set, or the public data set EG1800 can be used as the training data set.
  • Step 220 Construct a label data set according to the training data set, where the label data set includes a plurality of segmentation label images and a plurality of edge label images, and each original image corresponds to one segmentation label image and one edge label image.
  • the label data can be understood as reference data for determining whether the image segmentation model is accurate, which plays a role of supervision. If the output result of the image segmentation model is more similar to the corresponding label data, it means that the accuracy of the image segmentation model is higher, that is, the performance is better; otherwise, the accuracy of the image segmentation model is lower. It can be understood that the process of training the image segmentation model is the process of making the output result of the image segmentation model more similar to the corresponding label data.
  • the image segmentation model outputs a segmented image and an edge image corresponding to the original image, wherein the segmented image refers to a binary image obtained by performing image segmentation on the target object in the original image.
  • the segmented image output by the image segmentation model in the training process is recorded as the segmentation result image.
  • the edge image refers to a binary image representing the edge between the portrait area and the background area in the original image.
  • the edge image output by the image segmentation model in the training process is recorded as the edge result image.
  • Accordingly, the label data is set according to the output results of the image segmentation model and includes the segmentation label image and the edge label image, wherein the segmentation label image corresponds to the segmentation result image and serves as its reference.
  • The edge label image corresponds to the edge result image and serves as its reference.
  • both the edge label image and the segmented label image can be obtained from the above-mentioned original image.
  • the portrait area, background area and edge area are marked in each original image, and then the edge label image and segmentation label image are obtained according to the portrait area, background area and edge area.
  • In another embodiment, a manual marking method is used to mark a portrait region and a background region in each original image; a segmentation label image is then obtained according to the portrait region and the background region, and an edge label image is obtained according to the segmentation label image.
  • step 220 includes steps 221-225:
  • Step 221 Acquire an annotation result for the original image.
  • the labeling result refers to the result obtained after labeling the portrait area and background area in the original image.
  • the labeling result is obtained by manual labeling, that is, the portrait area and the background area are manually marked in the above-mentioned original image, and then the image segmentation device obtains the labeling result according to the marked portrait area and background area.
  • Step 222 Obtain a corresponding segmented label image according to the labeling result.
  • the pixel value of each pixel included in the portrait region in the original image is changed to 255, and the pixel value of each pixel included in the background region in the original image is changed to 0, thereby obtaining the segmented label image.
  • the segmented label image is a binary image.
  • Step 223 Perform an erosion operation on the segmented label image to obtain an erosion image.
  • the erosion operation can be understood as reducing and refining the white area (ie, the portrait area) with a pixel value of 255 in the segmented label image.
  • the image obtained by performing the erosion operation on the segmented label image is recorded as the erosion image. It can be understood that the number of pixels occupied by the white area in the eroded image is smaller than the number of pixels occupied by the white area in the segmented label image, and the white area in the segmented label image can completely cover the white area in the eroded image.
  • Step 224 perform a Boolean operation on the segmented label image and the eroded image to obtain an edge label image corresponding to the original image.
  • Boolean operations include union, intersection, and subtraction.
  • The objects on which a Boolean operation is performed are called operation objects.
  • the operation objects include the segmented label image and the eroded image, more specifically, the segmented label image and the white area in the eroded image.
  • the result obtained by the Boolean operation may be recorded as a Boolean object.
  • the Boolean object is an edge label image.
  • union means that the resulting Boolean object contains the volume of the two operands.
  • the Boolean object obtained by combining the segmented label image and the eroded image is the white area in the segmented label image.
  • Intersection means that the resulting Boolean object only contains the common volume of the two operation objects (that is, only the overlapping positions). Since the white area in the segmented label image can completely cover the white area in the eroded image, the Boolean object resulting from intersecting the segmented label image and the eroded image is the white area in the eroded image.
  • Subtraction means that the Boolean object contains the volume of the operation object from which the intersection volume is subtracted.
  • The Boolean object obtained by subtracting the eroded image from the segmented label image is the white area that remains after removing, from the white area of the segmented label image, the white area corresponding to the eroded image.
  • It can be understood that, since the eroded image is obtained by shrinking the white area of the segmented label image, the edges of the two white areas are highly similar; after the subtraction, a white area that represents only the edge is obtained, that is, the edge label image.
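  • Steps 222 to 224 can be sketched with OpenCV morphology as below; the erosion kernel size and iteration count are assumptions, since the embodiment only requires that the eroded white area stays inside the original white area.

```python
import cv2
import numpy as np

def build_edge_label(segmentation_label, kernel_size=5, iterations=2):
    """From a binary segmentation label (portrait = 255, background = 0),
    derive the edge label image by eroding the portrait area and subtracting."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(segmentation_label, kernel, iterations=iterations)
    # The subtraction keeps only the thin band between the original and the
    # eroded portrait areas, i.e. the edge label image.
    edge_label = cv2.subtract(segmentation_label, eroded)
    return edge_label
```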
  • Step 225 Obtain a label data set according to the segmented label image and the edge label image.
  • In the embodiment, a label data set is formed by the segmentation label images and the edge label images. It can be understood that the segmentation label image and the edge label image can be considered as the Ground Truth, that is, the correct labels.
  • Step 230 Train the image segmentation model according to the training data set and the label data set.
  • an original image is input to the image segmentation model, and a loss function is constructed according to the output result of the image segmentation model and the corresponding label data in the label dataset, and then the model parameters of the image segmentation model are updated according to the loss function.
  • another original image is input into the updated image segmentation model to construct the loss function again, and the model parameters of the image segmentation model are updated again according to the loss function, and the above training process is repeated until the loss function converges.
  • When the values of the loss function obtained in successive calculations are within a set range, the loss function can be considered to have converged, which means that the accuracy of the output of the image segmentation model has stabilized; the image segmentation model can therefore be considered trained.
  • the specific structure of the image segmentation model can be set according to the actual situation.
  • This embodiment is described by taking as an example an image segmentation model that includes a normalization module, an encoding module, a channel confusion module, a residual module, a multiple upsampling module, an output module and an edge module.
  • the image segmentation model is exemplarily described with the structure shown in FIG. 3 .
  • FIG. 3 is a schematic structural diagram of an image segmentation model provided by an embodiment of the present application. Referring to FIG. 3, the image segmentation model includes a normalization module 21, an encoding module 22, four channel confusion modules 23, three residual modules 24, four multiple upsampling modules 25, an output module 26 and an edge module 27.
  • step 230 includes steps 231-2310:
  • Step 231 Input the original image to the normalization module to obtain a normalized image.
  • FIG. 4 is a schematic diagram of an original image provided by an embodiment of the present application.
  • the original image contains a portrait area, and it should be noted that the original image used in Figure 4 comes from the public dataset Supervisely.
  • normalization refers to the process of performing a series of standard processing and transformation on the image to transform the image into a fixed standard form.
  • the obtained standard image is called a normalized image.
  • Normalization is divided into linear normalization and nonlinear normalization.
  • the original image is processed by means of linear normalization.
  • linear normalization is to normalize the pixel values in each image from [0, 255] to [-1, 1], and the resolution of the obtained normalized image is equal to the resolution of the image before linear normalization .
  • the normalization module is a module that implements a linear normalization operation. After the original image is input to the normalization module, the normalization module outputs a normalized image with a pixel value of [-1, 1].
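  • A minimal sketch of the linear normalization from [0, 255] to [-1, 1] described above (a NumPy version; the resolution is left unchanged):

```python
import numpy as np

def normalize_image(image):
    """Linearly map pixel values from [0, 255] to [-1, 1]; the resolution
    of the image is unchanged."""
    return image.astype(np.float32) / 127.5 - 1.0
```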
  • Step 232 Use the encoding module to obtain multi-layer image features of the normalized image, where the image features of the layers have different resolutions.
  • the encoding module is used to extract features in the normalized image.
  • the extracted features are recorded as image features.
  • the image features may reflect information such as color features, texture features, shape features, and spatial relationship features in the normalized image, including global information and/or local information.
  • the encoding module is a lightweight network, where the lightweight network refers to a neural network with a small amount of parameters, a small amount of computation, and a short inference time.
  • the type of the lightweight network used by the encoding module can be selected according to the actual situation.
  • This embodiment takes the case where the encoding module 22 is a MobileNetV2 network as an example.
  • The normalized image outputs multi-layer image features after passing through MobileNetV2, wherein the resolutions of the image features of the layers are different and have a multiple relationship; optionally, the resolution of the image features of each layer is smaller than the resolution of the original image.
  • The image features of the layers are arranged from top to bottom in order of decreasing resolution, that is, the image features with the highest resolution are located in the highest layer, and the image features with the lowest resolution are located in the lowest layer. It can be understood that the number of layers of image features output by the encoding module can be set according to the actual situation. For example, when the resolution of the original image is 224×224, the encoding module outputs four layers of image features. In this case, referring to FIG. 3:
  • the resolution of the highest-layer (first-layer) image features is 112×112 (denoted as Feature112×112 in FIG. 3);
  • the resolution of the second-highest-layer (second-layer) image features is 56×56 (denoted as Feature56×56 in FIG. 3);
  • the resolution of the next-lowest-layer (third-layer) image features is 28×28 (denoted as Feature28×28 in FIG. 3);
  • the resolution of the lowest-layer (fourth-layer) image features is 14×14 (denoted as Feature14×14 in FIG. 3).
  • the image features of each layer contain more and more information from the bottom to the top.
  • the encoding module can be understood as an encoder in the image segmentation model.
  • Step 233 Input the image features of each layer into the corresponding channel confusion module respectively to obtain multi-layer confusion features, and each layer of image features corresponds to a channel confusion module.
  • The channel confusion module is used to fuse the features between the channels within a layer, so as to enrich the information contained in the image features of each layer and ensure the accuracy of the image segmentation model without increasing the subsequent calculation amount. It can be understood that each layer of image features corresponds to one channel confusion module. As shown in FIG. 3, the four layers of image features correspond to four channel confusion modules 23, and each channel confusion module 23 is used to fuse the image features between the multiple channels of the corresponding layer.
  • the channel confusion module is composed of a 1 ⁇ 1 convolution layer, a batch normalization (BN) layer, and an activation function layer, wherein the activation function layer adopts a Relu activation function.
  • the 1 ⁇ 1 convolution layer is used to realize the confusion of image features between channels, and the BN layer + activation function layer can make the confused image features more stable.
  • In the embodiment, the features output by the channel confusion module are recorded as confusion features. It can be understood that each layer of image features has corresponding confusion features, and the confusion features and the image features in the same layer have the same resolution. In one embodiment, except for the confusion features with the lowest resolution, the other confusion features are central-layer features, that is, the other layers may be considered as network central layers.
  • The confusion features of the lowest layer are denoted as Decode14×14, and the confusion features of the other layers are denoted as Center28×28, Center56×56 and Center112×112, respectively.
  • The digital part represents the resolution.
  • It should be noted that the confusion feature output by the channel confusion module can also be regarded as the feature obtained after decoding the image feature, that is, in addition to channel confusion, the channel confusion module also realizes a decoding function.
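  • Based on the structure described above (1×1 convolution, BN and ReLU), a channel confusion module could look like the following PyTorch sketch; the channel counts are left as parameters since the embodiment does not fix them.

```python
import torch.nn as nn

class ChannelConfusion(nn.Module):
    """1x1 convolution + batch normalization + ReLU, used to mix ("confuse")
    the features across channels within one resolution level."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```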
  • Step 234 Upsample the confusion features of each layer except the layer with the highest resolution, and fuse each upsampled feature with the confusion feature of the one-level-higher resolution to obtain the fusion feature corresponding to that higher resolution.
  • Upsampling can be understood as enlarging the feature to enlarge the resolution of the feature.
  • the up-sampling is implemented by a linear interpolation method, that is, a suitable interpolation algorithm is used to insert new elements between the obfuscated features, so as to expand the resolution of the obfuscated features.
  • the resolution of the confusion feature can be enlarged by up-sampling, so that the enlarged resolution is equal to the one-level higher resolution.
  • The one-level-higher resolution refers to the smallest resolution that is higher than the resolution currently being upsampled.
  • Conversely, the resolution being upsampled can be considered the one-level-lower resolution of that higher resolution.
  • In other words, the resolution of each layer is one level higher than the resolution of the layer below it. It can be understood that, since the resolution of the confusion feature of any layer has a fixed multiple relationship with its one-level-higher resolution, the upsampling factor can be determined according to that multiple.
  • the resolution of the confusion feature of a certain layer is 0.5 times the resolution of the higher level
  • the resolution of the confusion feature of this layer can be enlarged by means of double upsampling.
  • the confusing features of the higher resolution are fused with the corresponding up-sampled confusion features of the lower resolution through Skip Connection, so as to reuse the confusing features and ensure the use of more information in the subsequent processing process.
  • It can be understood that image segmentation is a kind of dense pixel prediction (Dense Prediction); therefore, the image segmentation model requires richer features.
  • In the embodiment, the feature obtained after fusion is recorded as a fusion feature. In this case, except for the confusion feature with the lowest resolution, each layer of confusion features has a corresponding fusion feature.
  • the operation of feature fusion can be understood as a concatenate (vector splicing) operation.
  • The size of the fusion feature of each layer is the sum of the size of the confusion feature of this layer and the size of the upsampled confusion feature of the lower-resolution layer.
  • For example, the channel number C in [N, C, H, W] of the confusion feature of this layer before fusion is 3, and the channel number C of the upsampled lower-resolution confusion feature before fusion is also 3.
  • In [N, C, H, W], N is the batch number, C is the number of channels, H is the height, W is the width, and H × W can be understood as the resolution.
  • It should be noted that, since the highest resolution has no one-level-higher resolution, there is no need to upsample the confusion feature with the highest resolution.
  • Exemplarily, the confusion feature Decode14×14 of the lowest layer is double-upsampled so that its resolution is doubled, that is, a feature with a resolution of 28×28 is obtained.
  • The confusion feature Center28×28 of the one-level-higher resolution (i.e. the second-lowest layer) is then fused, through a skip connection, with the 28×28 feature obtained by double-upsampling the lowest layer, to obtain the fusion feature of the second-lowest layer.
  • Similarly, the confusion feature Center28×28 of the second-lowest layer is double-upsampled to double its resolution, that is, a feature with a resolution of 56×56 is obtained.
  • The confusion feature Center56×56 is fused, through a skip connection, with the 56×56 feature obtained by double-upsampling the second-lowest layer, to obtain the fusion feature of the second-highest layer. In the same way, the fusion feature of the highest layer is obtained.
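  • The per-level operation of upsampling a confusion feature and concatenating it with the confusion feature one resolution level higher can be sketched as follows; bilinear interpolation is assumed as the linear-interpolation upsampling.

```python
import torch
import torch.nn.functional as F

def fuse_with_higher_level(lower_confusion_feature, higher_confusion_feature):
    """Double-upsample the lower-resolution confusion feature and concatenate it
    with the confusion feature one resolution level higher (skip connection)."""
    upsampled = F.interpolate(lower_confusion_feature, scale_factor=2,
                              mode="bilinear", align_corners=False)
    # Concatenate along the channel dimension C of [N, C, H, W].
    return torch.cat([higher_confusion_feature, upsampled], dim=1)
```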
  • Step 235 Input the fusion features of each layer into the corresponding residual modules to obtain multi-layer first decoding features, where each layer of fusion features corresponds to one residual module, and the confusion feature with the lowest resolution is used as the first decoding feature with the lowest resolution.
  • the residual module is used to further extract and decode the fusion features, and the residual module may include one or more residual blocks (Residual Block, RS Block).
  • This embodiment takes the case where the residual module includes one residual block as an example, and the structure of the residual block can be set according to the actual situation. It can be understood that each layer of fusion features corresponds to a residual module, and the feature output by the residual module has the same resolution as the fusion feature of this layer. Since the residual module further extracts and decodes the fusion feature, that is, the feature output by the residual module is a decoded feature, the feature output by the residual module is recorded in the embodiment as the first decoding feature.
  • Since the confusion feature with the lowest resolution has no corresponding fusion feature, there is no need to set a residual module for the lowest-resolution layer.
  • The confusion feature with the lowest resolution can be directly regarded as the first decoding feature of this layer.
  • the corresponding first decoding features can be obtained.
  • Referring to FIG. 3, the model includes three residual modules 24, and the first decoding feature output after the fusion feature of the second-lowest layer is input into its residual module is denoted as RS Block28×28, that is, the resolution of this first decoding feature is 28×28.
  • The first decoding feature output after the fusion feature of the second-highest layer is input into its residual module is denoted as RS Block56×56, that is, the resolution of this first decoding feature is 56×56.
  • the first decoded feature output after the fusion feature of the highest layer is input to the residual module is denoted as RS Block112 ⁇ 112, that is, the resolution of the first decoded feature is 112 ⁇ 112.
  • the first decoding feature of the lowest layer is Decode14 ⁇ 14.
  • Step 236 Input the first decoding feature of each layer into the corresponding multiple upsampling module to obtain multiple second decoding features, where each layer's first decoding feature corresponds to one multiple upsampling module, and each second decoding feature has the same resolution as the original image.
  • the multiple upsampling module is configured to perform multiple upsampling on the first decoded feature, so that the resolution after the multiple upsampling is equal to the resolution of the original image.
  • the specific multiple of multiple upsampling can be determined according to the resolution of the first decoding feature and the resolution of the original image. For example, the resolution of the first decoding feature is 14 ⁇ 14, and the resolution of the original image is 224 ⁇ 224, then, the first decoded feature needs to be upsampled by 16 times to obtain a decoded feature with a resolution of 224 ⁇ 224.
  • In the embodiment, the final output binary image (segmentation result image) is used to distinguish the foreground (such as the portrait area) from the background; therefore, the segmentation task of the image segmentation model is a two-class segmentation task, and before obtaining the segmentation result image, it is necessary to obtain a decoding feature with 2 channels.
  • Therefore, in addition to performing multiple upsampling on the first decoding feature, the multiple upsampling module also needs to change the number of channels of the upsampled first decoding feature to 2, because multiple upsampling alone only changes the resolution of each layer's first decoding feature, not its number of channels.
  • To this end, a 1×1 convolutional layer is set in the multiple upsampling module, that is, after the multiple upsampling of the first decoding feature, a 1×1 convolutional layer is connected to change the number of channels of the upsampled first decoding feature to 2.
  • the image segmentation model can also perform multi-classification segmentation tasks. At this time, before obtaining the final output image, it is also necessary to obtain decoding features with the number of channels equal to the number of classifications. For example, if the image segmentation model performs a five-category segmentation task, before finally outputting a five-category segmentation result image, it is necessary to obtain 5-channel decoding features.
  • the pixel values of the pixels in the segmentation label image need to be converted from 0 and 255 to 0 and 1, that is, the pixel value of 0 is converted to 0.
  • a pixel with a pixel value of 255 is converted to 1.
  • the feature output by the multiple upsampling module is denoted as the second decoding feature.
  • the first decoding feature of each layer corresponds to a multiple upsampling module
  • the second decoding feature with 2 channels and the same resolution as the original image can be obtained through the multiple upsampling module.
  • the second decoding feature can be considered as a network prediction output obtained after decoding the image features of the current layer.
  • the second decoding feature of each layer can be regarded as a temporary output result obtained after decoding the image feature of the layer.
  • the final segmentation result image can be obtained by temporarily outputting the result.
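  • A possible PyTorch sketch of the multiple upsampling module, combining the x-fold upsampling with a 1×1 convolution that brings the channel count to 2 for the two-class case described above (the upsampling mode is an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F

class MultipleUpsampling(nn.Module):
    """Upsample a first decoding feature to the original image resolution and
    reduce its channels to 2 (the two-class case) with a 1x1 convolution."""
    def __init__(self, in_channels, scale_factor, num_classes=2):
        super().__init__()
        self.scale_factor = scale_factor
        self.to_classes = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale_factor,
                          mode="bilinear", align_corners=False)
        return self.to_classes(x)  # second decoding feature, 2 channels
```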
  • Step 237 Combine the multi-layer second decoding features and input them to the output module to obtain a segmentation result image.
  • the output module integrates the second decoding features of each layer to obtain a segmentation result image (ie, a binary image).
  • the second decoding features of each layer are first fused (ie, concatenate), so that the output module can obtain more abundant features, thereby restoring a more accurate image.
  • the output module uses the fused second decoding feature to obtain the segmentation result image.
  • the specific process of the output module is: connect the fused second decoding feature to a 1 ⁇ 1 convolutional layer to obtain a 2-channel decoding feature.
  • It can be understood that the fused second decoding feature is simply the second decoding features of the layers merged together; through the 1×1 convolutional layer in the output module, the fused second decoding feature can be further decoded, so that the final decoding feature is output after referring to the second decoding feature of each layer.
  • This final decoding feature has 2 channels and is used to describe the result of the two-class classification, that is, whether each pixel in the original image belongs to the portrait area or the background area. After the decoding feature is passed through the softmax function and the argmax function, the segmentation result image is obtained. That is, the output module consists of a 1×1 convolutional layer and an activation function layer.
  • the activation function layer consists of the softmax function and the argmax function.
  • the data processed by the softmax function can be understood as the output data of the logic layer, that is, the meaning represented by the decoding features output by the 1 ⁇ 1 convolution layer is interpreted, and the description of the logic layer is obtained.
  • the argmax function is a common function to obtain the output result, that is, the corresponding segmentation result image is output by the argmax function.
  • the four second decoding features are fused and input to the output module 26.
  • A 1×1 convolutional layer is first applied to obtain the 2-channel decoding feature (denoted as Refine224×224 in FIG. 3),
  • and then the segmentation result image is obtained through the activation function layer (denoted as output224×224 in FIG. 3).
  • The pixel value of each pixel in the segmentation result image output by the image segmentation model is 0 or 1, wherein a pixel with a pixel value of 0 is a pixel in the background area, and a pixel with a pixel value of 1 is a pixel in the portrait area.
  • the pixel value of each pixel is multiplied by 255.
  • FIG. 5 is a schematic diagram of a segmentation result image provided by an embodiment of the present application. After inputting the training data shown in FIG. 4 into the image segmentation model shown in FIG. 3, a segmentation result image is obtained. After multiplying by 255, the segmentation result image shown in Figure 5 can be obtained.
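  • The output module described above (concatenation of the second decoding features, a 1×1 convolution, then softmax and argmax) might be sketched as follows; the four-level, two-class configuration is taken from the example in FIG. 3.

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    """Fuse the per-level second decoding features and produce the binary
    segmentation result image (0 = background, 1 = portrait)."""
    def __init__(self, num_levels=4, num_classes=2):
        super().__init__()
        # Each second decoding feature has num_classes channels before fusion.
        self.refine = nn.Conv2d(num_levels * num_classes, num_classes, kernel_size=1)

    def forward(self, second_decoding_features):
        fused = torch.cat(second_decoding_features, dim=1)  # concatenate on channels
        logits = self.refine(fused)                         # 2-channel decoding feature
        probs = torch.softmax(logits, dim=1)
        return torch.argmax(probs, dim=1)                   # segmentation result image
```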
  • Step 238 Input the first decoded feature with the highest resolution to the edge module to obtain an edge result image.
  • In the embodiment, an edge module is set in the image segmentation model to perform additional supervision on the first decoding feature with the highest resolution; that is, the edge module acts as a regularization constraint to improve the ability of the image segmentation model to learn edges.
  • The specific structure of the edge module is not limited in the embodiment.
  • This embodiment takes the case where the edge module is a 1×1 convolutional layer as an example. Exemplarily, after the first decoding feature with the highest resolution is input into the edge module, an edge feature with 2 channels and the same resolution as the original image can be obtained, from which a binary image that expresses only the edge can be obtained.
  • the binary image expressing the edge is recorded as the edge result image.
  • the pixel value of each pixel in the edge result image is 0 or 1, wherein, the pixel with the pixel value of 1 represents the pixel where the edge is located, and the pixel with the pixel value of 0 represents the pixel where the non-edge is located.
  • the first decoded feature with the highest resolution has richer detailed information, therefore, more accurate edge features can be obtained through the first decoded feature with the highest resolution.
  • In this way, an edge feature with a resolution of 224×224 can be obtained, which is denoted as edge224×224 in FIG. 3.
  • FIG. 6 is a schematic diagram of an edge result image provided by an embodiment of the present application.
  • the training data shown in FIG. 4 is input into the image segmentation model shown in FIG. 3 to obtain an edge result image. After multiplying by 255, the edge result image shown in Figure 6 can be obtained.
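  • As a sketch, the edge module described above reduces to a 1×1 convolution mapping the highest-resolution first decoding feature to a 2-channel edge feature; the upsampling step below is an assumption added so that the output matches the stated edge224×224 size, since the highest-resolution first decoding feature in the example is 112×112.

```python
import torch.nn as nn
import torch.nn.functional as F

class EdgeModule(nn.Module):
    """1x1 convolution producing a 2-channel edge / non-edge feature; the
    upsampling to the original resolution is an added assumption."""
    def __init__(self, in_channels, num_classes=2, scale_factor=2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.scale_factor = scale_factor

    def forward(self, x):
        x = self.conv(x)
        return F.interpolate(x, scale_factor=self.scale_factor,
                             mode="bilinear", align_corners=False)
```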
  • Step 239 Construct a loss function according to each second decoding feature, the edge result image, the corresponding segmentation label image and the edge label image, and update the model parameters of the image segmentation model according to the loss function.
  • the loss function of the segmentation network model is composed of a segmentation loss function and an edge loss function.
  • the segmentation loss function can reflect the segmentation ability of the segmentation network model, and the segmentation loss function is obtained according to the second decoding feature of each layer and the segmentation label image.
  • a sub-loss function can be obtained based on the second decoding feature of each layer and the segmentation label image, and the segmentation loss function can be obtained by combining the sub-loss functions of each layer. It can be understood that the calculation method of each sub-loss function is the same.
  • the sub-loss function is calculated by the Iou function, and the Iou function can be defined as: the ratio of the area of the intersection of the predicted pixel region (that is, the second decoding feature) and the label pixel region (that is, the segmented label image) to the area of the union.
  • the Iou function can reflect the overlapping similarity between the binary image corresponding to the second decoding feature and the segmented label image, and at this time, the sub-loss function calculated by the Iou function can reflect the loss of overlapping similarity.
  • the edge loss function can reflect the ability of the segmentation network model to learn edges, and the edge loss function is obtained from the edge result image and the edge label image.
  • In the embodiment, the edge loss function adopts the Focal loss, which is a common loss function that can reduce the weight of a large number of simple negative samples in training;
  • it can also be understood as a form of hard example mining.
  • In one embodiment, the loss function of the segmentation network model is expressed as: Loss = loss_Iou_1 + loss_Iou_2 + … + loss_Iou_n + loss_edge, where:
  • Loss represents the loss function of the segmentation network model;
  • n represents the total number of layers corresponding to the second decoding features;
  • loss_Iou_1 represents the sub-loss function calculated from the second decoding feature with the highest resolution and the corresponding segmentation label image, and loss_Iou_n represents the sub-loss function calculated from the second decoding feature with the lowest resolution;
  • A_n represents the second decoding feature with the lowest resolution;
  • B represents the corresponding segmentation label image;
  • Iou_n represents the overlap similarity between A_n and B;
  • loss_edge is the Focal loss function.
  • Specifically, the image segmentation model has a total of n layers (n ≥ 2), that is, there are n layers of second decoding features.
  • Accordingly, n sub-loss functions can be obtained according to the second decoding features of the n layers and the segmentation label image.
  • The first layer has the highest resolution, and its corresponding sub-loss function is recorded as loss_Iou_1;
  • the resolution of the second layer is the second highest, and its corresponding sub-loss function is recorded as loss_Iou_2;
  • the resolution of the n-th layer is the lowest, and its corresponding sub-loss function is recorded as loss_Iou_n. Since each sub-loss function is calculated in the same manner, the embodiment takes the n-th layer sub-loss function as an example for description.
  • Exemplarily, for the n-th layer, Iou_n = |A_n ∩ B| / |A_n ∪ B|, and loss_Iou_n represents the loss of the n-th layer Iou function.
  • A_n represents the second decoding feature of the n-th layer;
  • B represents the corresponding segmentation label image;
  • A_n ∩ B represents the intersection of A_n and B;
  • A_n ∪ B represents the union of A_n and B;
  • Iou_n represents the overlap similarity between A_n and B; at this time, loss_Iou_n represents the loss of the overlap similarity.
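  • A sketch of such a per-level Iou sub-loss is given below, assuming the common form loss_Iou = 1 - Iou (the text above describes it only as the loss of the overlap similarity) and taking the foreground probability from the softmax of the 2-channel second decoding feature; the epsilon term is added purely for numerical stability.

```python
import torch

def iou_sub_loss(second_decoding_feature, segmentation_label, eps=1e-6):
    """Per-level Iou sub-loss: 1 - |A ∩ B| / |A ∪ B|, computed on soft
    foreground probabilities against a 0/1 segmentation label image."""
    # Foreground probability from the 2-channel second decoding feature [N, 2, H, W]
    pred = torch.softmax(second_decoding_feature, dim=1)[:, 1, ...]
    label = segmentation_label.float()                    # [N, H, W], values 0 or 1
    intersection = (pred * label).sum(dim=(1, 2))
    union = (pred + label - pred * label).sum(dim=(1, 2))
    return (1.0 - intersection / (union + eps)).mean()
```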
  • loss_edge represents the edge loss function;
  • loss_edge is a Focal loss function:
  • loss_edge(p_t) = -α_t · (1 - p_t)^γ · log(p_t).
  • p_t represents the predicted probability value that a pixel in the edge result image is an edge;
  • α_t represents the balance weight coefficient, which is used to balance positive and negative samples;
  • γ represents the modulation coefficient, which is used to control the weight of hard-to-classify and easy-to-classify samples.
  • The values of α_t and γ can be set according to the actual situation.
  • loss_edge can be obtained from the loss_edge(p_t) of each pixel in the edge result image; specifically, the loss_edge(p_t) values of all the pixels are summed and then averaged, and the calculated mean is used as loss_edge.
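  • A sketch of this edge loss is shown below; it follows the common Focal loss formulation in which p_t is the probability assigned to a pixel's true class, and the default α = 0.25 and γ = 2 are illustrative values only, since the embodiment leaves them configurable. The per-level Iou sub-losses sketched earlier and this edge loss can then be summed to form the overall loss.

```python
import torch

def focal_edge_loss(edge_logits, edge_label, alpha=0.25, gamma=2.0, eps=1e-6):
    """loss_edge(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over pixels."""
    prob_edge = torch.softmax(edge_logits, dim=1)[:, 1, ...]   # predicted edge probability
    label = edge_label.float()                                 # 0/1 edge label image
    # p_t: probability of each pixel's true class; alpha_t balances positives/negatives.
    p_t = prob_edge * label + (1.0 - prob_edge) * (1.0 - label)
    alpha_t = alpha * label + (1.0 - alpha) * (1.0 - label)
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t + eps)
    return loss.mean()
```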
  • the model parameters of the image segmentation model can be updated according to the loss function, so that the performance of the updated image segmentation model is higher.
  • Step 2310 Select the next original image, and return to perform the operation of inputting the original image to the normalization module until the loss function converges.
  • After the image segmentation model is stabilized, it is determined that the training is over, and the image segmentation model can then be applied to segment the portraits in the video data.
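A minimal training-loop sketch of this update/convergence cycle; the optimizer choice, learning rate, data loader format, model outputs and convergence test are all assumptions, and total_loss refers to the sketch above:

```python
import torch

def train_until_converged(model, dataloader, max_epochs: int = 100, tol: float = 1e-4):
    """Update the model parameters from the loss and stop once the loss stops changing."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    previous = float("inf")
    for _ in range(max_epochs):
        running = 0.0
        for image, seg_label, edge_label in dataloader:
            second_preds, edge_logits = model(image)                 # assumed model outputs
            loss = total_loss(second_preds, seg_label, edge_logits, edge_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        running /= max(len(dataloader), 1)
        if abs(previous - running) < tol:                            # simple convergence criterion
            break
        previous = running
    return model
```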
  • the method further includes: when the image segmentation model is not a network model recognizable by the forward inference framework, converting the image segmentation model into a network model recognizable by the forward inference framework.
  • the image segmentation model is trained in a corresponding framework, usually a framework such as tensorflow or pytorch.
  • the pytorch framework is used as an example for description.
  • the pytorch framework is mainly used for model design, training and testing. Since the image segmentation model needs to run in real time in the image segmentation device, and the pytorch framework occupies a large amount of memory, running the image segmentation model under the pytorch framework in an application of the image segmentation device would greatly increase the storage space occupied by the application.
  • the forward inference framework is generally aimed at a specific platform (such as an embedded platform), and different platforms have different hardware configurations. When the forward inference framework is deployed on a platform, it can make reasonable use of resources for optimization and acceleration according to the hardware configuration of that platform; that is, the forward inference framework can optimize and accelerate the models it runs locally.
  • the forward inference framework is mainly used for the prediction process of the model, where the prediction process includes the testing process and the prediction (application) process of the model but does not include the training process. The forward inference framework has low dependence on the GPU, is lightweight, and does not make the application occupy a large amount of storage space. Therefore, when applying the image segmentation model, the image segmentation model is run in the forward inference framework. In one embodiment, before applying the image segmentation model, it is determined whether the image segmentation model can run in the forward inference framework; if so, the image segmentation model is applied directly.
  • the image segmentation model is converted into a network model recognizable in the forward inference framework.
  • the specific type of the forward reasoning framework can be set according to the actual situation, for example, the forward reasoning framework is an openvino framework.
  • a specific way to convert the image segmentation model under the pytorch framework into an image segmentation model under the openvino framework is: use an existing pytorch conversion tool to convert the image segmentation model into an Open Neural Network Exchange (ONNX) model, and then use the openvino conversion tool to convert the ONNX model into an image segmentation model under the openvino framework.
  • ONNX is a standard for representing deep learning models, which enables models to be transferred between different frameworks.
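As an illustration of this conversion route (not the exact commands of the embodiment; the input resolution, tensor names and file names are assumptions, and openvino command-line flags can differ between versions):

```python
import torch

# 'model' is assumed to be the trained image segmentation model (a torch.nn.Module).
# Step 1: export the trained pytorch model to an ONNX model.
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)       # example input size, assumed here
torch.onnx.export(model, dummy_input, "segmentation.onnx",
                  input_names=["image"], output_names=["mask"],
                  opset_version=11)

# Step 2: convert the ONNX model with the openvino conversion tool (Model Optimizer),
# typically run from a shell:
#   mo --input_model segmentation.onnx --output_dir openvino_model
```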
  • the method further includes: deleting the edge module.
  • the advantage of setting the edge module in the training process is to improve the learning ability of the image segmentation model for the edge, thereby ensuring the accuracy of the segmentation result image.
  • the edge module can be deleted, that is, the data processing of the edge module is cancelled when the image segmentation model is applied, so as to reduce the data processing amount of the image segmentation model and improve the processing speed.
  • the encoding module of the image segmentation model adopts a lightweight network, which can reduce the amount of data processing during encoding.
  • the channel confusion module can confuse the image features between channels without significantly increasing the amount of calculation, so as to enrich the feature information in the channels and ensure the accuracy of the image segmentation model.
  • By setting the edge module, the learning ability of the image segmentation model for edges is improved, which further ensures the accuracy of the image segmentation model. In the application process, the edge module is deleted to reduce the calculation amount of the image segmentation model.
  • Converting the image segmentation model into an image segmentation model under the forward inference framework can reduce the dependence of the image segmentation model on the GPU and reduce the storage space occupied by the application running the image segmentation model.
  • the trained image segmentation model can accurately segment the portrait area in the video data without human prior or interaction. After testing, in an environment with an ordinary PC integrated graphics card, the processing time of each frame image in the video data is only about 20 ms, realizing real-time automatic portrait segmentation.
  • the image segmentation model further includes: a decoding module.
  • the method further includes: inputting the first decoding feature with the highest resolution to the decoding module to obtain a corresponding new first decoding feature.
  • FIG. 7 is a schematic structural diagram of another image segmentation model provided by an embodiment of the present application. Compared with the image segmentation model shown in FIG. 3 , the image segmentation model shown in FIG. 7 further includes a decoding module 28 .
  • the first decoding feature with the highest resolution is passed through a decoding module for further decoding, so that a new first decoding feature is obtained.
  • the new first decoding feature can be considered as the first decoding feature finally obtained at the highest-resolution level, and the new first decoding feature is then input to the multiple upsampling module and the edge module set at the highest-resolution level. It can be understood that the number of channels and the resolution of the new first decoding feature are the same as those of the original first decoding feature.
  • the first decoding feature after the decoding module 28 in FIG. 7 is denoted as Refine 112×112, and its resolution is the same as that of RS Block 112×112.
  • the decoding module is a convolutional network, and the number and structure of the convolutional layers are not limited.
  • the accuracy of the first decoding feature of the highest layer can be improved, thereby improving the accuracy of the image segmentation model.
  • for the first decoding features, the lower the resolution, the more advanced the semantic features they carry; the higher the resolution, the richer the detail features.
  • if the first decoding feature with the highest resolution is directly up-sampled, a sawtooth phenomenon will appear, that is, the detail features will show jaggedness. Therefore, a decoding module is added for it, so that the transition of the finally obtained new first decoding feature is more uniform and jaggedness is avoided.
  • the first decoding features of the other layers basically do not exhibit aliasing after up-sampling, and even if a decoding module were set for them, the accuracy of the image segmentation model would not be affected; therefore, there is no need to set a decoding module for the other layers. It can be understood that, in practical applications, if the aliasing phenomenon occurs after up-sampling the first decoding features of other layers, a decoding module may also be set for them, so as to improve the accuracy of the image segmentation model.
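The embodiment only states that the decoding module is a convolutional network whose number of layers and structure are not limited; one possible sketch that keeps the channel count and resolution unchanged (as required for the new first decoding feature) is:

```python
import torch.nn as nn

class RefineBlock(nn.Module):
    """Illustrative decoding module: a small conv stack that preserves channels and resolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Output has the same number of channels and the same resolution as the input,
        # matching the description of the new first decoding feature.
        return self.refine(x)
```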
  • the target object is described as a human being, and in practical applications, the target object can also be any other object.
  • FIG. 8 is a schematic structural diagram of an image segmentation apparatus provided by an embodiment of the present application.
  • the image segmentation apparatus includes: a data acquisition module 301 , a first segmentation module 302 , a second segmentation module 303 and a repeated segmentation module 304 .
  • the data acquisition module 301 is used to acquire the current frame image in the video data, where the target object is displayed in the video data;
  • the first segmentation module 302 is used to input the current frame image into the trained image segmentation model, so as to obtain a first segmented image based on the target object;
  • the second segmentation module 303 is used to perform smoothing processing on the first segmented image, so as to obtain a second segmented image based on the target object;
  • the repeated segmentation module 304 is used to take the next frame image in the video data as the current frame image, and return to perform the operation of inputting the current frame image into the trained image segmentation model, until a corresponding second segmented image is obtained for each frame image in the video data (a sketch of this per-frame loop is given below).
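A sketch of how these four modules cooperate on a video stream; the predict_mask wrapper and the use of a Gaussian blur as the smoothing step are assumptions of this sketch, since the embodiment does not fix the smoothing method here:

```python
import cv2

def segment_video(video_path: str, predict_mask):
    """Per-frame loop: acquire a frame, segment it, smooth the result, then move to the next frame.

    predict_mask(frame) is assumed to wrap the trained image segmentation model and to
    return the first segmented image (a float mask in [0, 1]) for a BGR frame.
    """
    cap = cv2.VideoCapture(video_path)                       # data acquisition module
    second_segmented_images = []
    while True:
        ok, frame = cap.read()
        if not ok:                                           # no more frames in the video data
            break
        first_seg = predict_mask(frame)                      # first segmentation module
        second_seg = cv2.GaussianBlur(first_seg, (5, 5), 0)  # second segmentation module (smoothing)
        second_segmented_images.append(second_seg)           # repeated segmentation: next frame
    cap.release()
    return second_segmented_images
```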
  • a training acquisition module, used to acquire a training data set, where the training data set includes a plurality of original images; a label construction module, used to construct a label data set according to the training data set, where the label data set contains multiple segmentation label images and multiple edge label images, and one original image corresponds to one segmentation label image and one edge label image; and a model training module, used to train the image segmentation model according to the training data set and the label data set.
  • the image segmentation model includes: a normalization module 21, an encoding module 22, a channel confusion module 23, a residual module 24, and a multiple upsampling module 25 , an output module 26 and an edge module 27 .
  • the above model training module includes: a normalization unit, used to input the original image into the normalization module 21 to obtain a normalized image; an encoding unit, used to obtain multi-layer image features of the normalized image by using the encoding module 22, where the resolution of each layer of image features is different;
  • the channel confusion unit is used to input the image features of each layer into the corresponding channel confusion module 23 respectively, so as to obtain multi-layer confusion features, where each layer of image features corresponds to one channel confusion module 23;
  • the fusion unit is used to up-sample the confusion features of each layer except the confusion features with the highest resolution, and fuse them with the confusion features of a higher resolution, so as to obtain multi-layer fusion features, from which the multi-layer second decoding features are obtained; each second decoding feature has the same resolution as the original image;
  • the segmentation output unit is used to combine the multi-layer second decoding features and input them to the output module 26 to obtain the segmentation result image;
  • the edge output unit is used to input the first decoding feature with the highest resolution to the edge module 27 to obtain an edge result image;
  • the parameter updating unit is used to construct a loss function according to each of the second decoding features, the edge result image, and the corresponding segmentation label image and edge label image, and to update the model parameters of the image segmentation model according to the loss function;
  • the image selection unit is used to select the next original image, and return to perform the operation of inputting the original image to the normalization module until the loss function converges.
  • the image segmentation model further includes: a decoding module 28 .
  • the above-mentioned model training module also includes: a decoding unit, used to input the fusion features of each layer into the corresponding residual module 24 respectively, so as to obtain the multi-layer first decoding features, and then input the first decoding feature with the highest resolution to the decoding module 28 to obtain the corresponding new first decoding feature.
  • the encoding module includes the MobileNetV2 network.
  • the loss function is expressed as: Loss = loss_Iou_1 + loss_Iou_2 + ... + loss_Iou_n + loss_edge, where Loss represents the loss function, n represents the total number of layers corresponding to the second decoding features, loss_Iou_1 represents the sub-loss function calculated from the second decoding feature with the highest resolution and the corresponding segmentation label image, loss_Iou_n represents the sub-loss function calculated from the second decoding feature with the lowest resolution and the segmentation label image, A_n represents the second decoding feature with the lowest resolution, B represents the corresponding segmentation label image, Iou_n represents the overlap similarity between A_n and B, and loss_edge is the Focal loss function.
  • an edge deletion module is further included, which is used to delete the edge module after the loss function of the image segmentation model converges.
  • a framework conversion module is further included, which is used to, after the image segmentation model is trained according to the training data set and the label data set, convert the image segmentation model into a network model recognizable by the forward inference framework when the image segmentation model is not a network model recognizable by the forward inference framework.
  • the label construction module includes: a label acquisition unit, used to obtain the labeling result for the original image; a segmentation label obtaining unit, used to obtain a corresponding segmentation label image according to the labeling result; an erosion unit, used to perform an erosion operation on the segmentation label image to obtain an eroded image; a Boolean unit, used to perform a Boolean operation on the segmentation label image and the eroded image to obtain the edge label image corresponding to the original image; and a data set construction unit, used to obtain the label data set according to the segmentation label images and the edge label images (a sketch of this construction is given below).
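A sketch of the erosion and Boolean units using OpenCV; the kernel size and the choice of XOR as the Boolean operation are assumptions, the idea being simply that the ring between the label and its eroded version forms the edge label:

```python
import cv2
import numpy as np

def build_edge_label(seg_label: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Derive an edge label image from a binary segmentation label image (values 0/255)."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(seg_label, kernel, iterations=1)   # erosion unit: shrink the labeled region
    edge_label = cv2.bitwise_xor(seg_label, eroded)       # Boolean unit: keep only the boundary ring
    return edge_label
```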
  • the apparatus further includes: a target background acquisition module, used to acquire a target background image after the second segmented image based on the target object is obtained by smoothing the first segmented image, where the target background image contains a target background; and a background replacement module, used to replace the background of the current frame image according to the target background image and the second segmented image, so as to obtain a new current frame image.
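A sketch of this background replacement step, assuming the second segmented image is a soft mask in [0, 1] with the same height and width as the current frame, and that the target background image has already been resized to the frame size:

```python
import numpy as np

def replace_background(frame: np.ndarray, second_seg: np.ndarray,
                       target_background: np.ndarray) -> np.ndarray:
    """Composite the current frame over the target background using the second segmented image as mask."""
    alpha = second_seg.astype(np.float32)
    if alpha.ndim == 2:
        alpha = alpha[..., None]                 # broadcast the mask over the color channels
    new_frame = alpha * frame.astype(np.float32) + (1.0 - alpha) * target_background.astype(np.float32)
    return new_frame.astype(np.uint8)            # new image of the current frame
```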
  • the image segmentation device provided above can be used to execute the image segmentation method provided by any of the above embodiments, and has corresponding functions and beneficial effects.
  • the units and modules included are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized;
  • the specific names of the functional units are only for the convenience of distinguishing from each other, and are not used to limit the protection scope of the present application.
  • FIG. 9 is a schematic structural diagram of an image segmentation device provided by an embodiment of the present application.
  • the image segmentation device includes a processor 40, a memory 41, an input device 42 and an output device 43; the number of processors 40 in the image segmentation device may be one or more, and one processor 40 is taken as an example in FIG. 9.
  • the processor 40 , the memory 41 , the input device 42 , and the output device 43 in the image segmentation device may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 9 .
  • the memory 41 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the image segmentation method in the embodiments of the present application (for example, the data acquisition module 301, the first segmentation module 302, the second segmentation module 303 and the repeated segmentation module 304 in the image segmentation apparatus).
  • the processor 40 executes various functional applications and data processing of the image segmentation device by running the software programs, instructions and modules stored in the memory 41 , that is, to implement the above-mentioned image segmentation method.
  • the memory 41 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the image dividing apparatus, and the like.
  • memory 41 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the memory 41 may further include memory located remotely from the processor 40, and these remote memories may be connected to the image segmentation apparatus through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the input device 42 may be used to receive input numerical or character information, and to generate key signal input related to user settings and function control of the image segmentation apparatus.
  • the output device 43 may include a display device such as a display screen.
  • the above-mentioned image segmentation device includes an image segmentation apparatus, which can be used to execute any image segmentation method and has corresponding functions and beneficial effects.
  • embodiments of the present application also provide a storage medium containing computer-executable instructions, when the computer-executable instructions are executed by a computer processor, for performing relevant operations in the image segmentation method provided by any embodiment of the present application , and has corresponding functions and beneficial effects.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product.
  • the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • the present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-persistent memory, random access memory (RAM) and/or non-volatile memory among computer-readable media, for example read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present invention relate to the technical field of image processing, and provide an image segmentation method and apparatus, a device and a storage medium. The image segmentation method comprises: acquiring the current frame image in video data, a target object being displayed in the video data; inputting the current frame image into a trained image segmentation model so as to obtain a first segmented image based on the target object; performing smoothing processing on the first segmented image so as to obtain a second segmented image based on the target object; and taking the next frame image in the video data as the current frame image, and returning to performing the operation of inputting the current frame image into the trained image segmentation model, until a corresponding second segmented image is obtained for each frame image in the video data. By means of the method, the technical problem in the prior art that image segmentation cannot be accurately performed on online video data can be solved.
PCT/CN2020/137858 2020-12-21 2020-12-21 Procédé et appareil de segmentation d'image, et dispositif et support de stockage WO2022133627A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/137858 WO2022133627A1 (fr) 2020-12-21 2020-12-21 Procédé et appareil de segmentation d'image, et dispositif et support de stockage
CN202080099096.5A CN115349139A (zh) 2020-12-21 2020-12-21 图像分割方法、装置、设备及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/137858 WO2022133627A1 (fr) 2020-12-21 2020-12-21 Procédé et appareil de segmentation d'image, et dispositif et support de stockage

Publications (1)

Publication Number Publication Date
WO2022133627A1 true WO2022133627A1 (fr) 2022-06-30

Family

ID=82157066

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137858 WO2022133627A1 (fr) 2020-12-21 2020-12-21 Procédé et appareil de segmentation d'image, et dispositif et support de stockage

Country Status (2)

Country Link
CN (1) CN115349139A (fr)
WO (1) WO2022133627A1 (fr)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824308B (zh) * 2023-08-30 2024-03-22 腾讯科技(深圳)有限公司 图像分割模型训练方法与相关方法、装置、介质及设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827292A (zh) * 2019-10-23 2020-02-21 中科智云科技有限公司 一种基于卷积神经网络的视频实例分割方法及设备
CN110910391A (zh) * 2019-11-15 2020-03-24 安徽大学 一种双模块神经网络结构视频对象分割方法
WO2020170167A1 (fr) * 2019-02-21 2020-08-27 Sony Corporation Segmentation d'objet basée sur des réseaux neuronaux multiples dans une séquence de trames d'image couleur


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277452A (zh) * 2022-07-01 2022-11-01 中铁第四勘察设计院集团有限公司 基于边端协同的ResNet自适应加速计算方法及应用
CN115277452B (zh) * 2022-07-01 2023-11-28 中铁第四勘察设计院集团有限公司 基于边端协同的ResNet自适应加速计算方法及应用
CN114882076A (zh) * 2022-07-11 2022-08-09 中国人民解放军国防科技大学 一种基于大数据记忆存储的轻量型视频对象分割方法
CN116189194A (zh) * 2023-04-27 2023-05-30 北京中昌工程咨询有限公司 一种用于工程建模的图纸增强分割方法
CN116189194B (zh) * 2023-04-27 2023-07-14 北京中昌工程咨询有限公司 一种用于工程建模的图纸增强分割方法
CN117237397A (zh) * 2023-07-13 2023-12-15 天翼爱音乐文化科技有限公司 基于特征融合的人像分割方法、系统、设备及存储介质
CN117237397B (zh) * 2023-07-13 2024-05-28 天翼爱音乐文化科技有限公司 基于特征融合的人像分割方法、系统、设备及存储介质

Also Published As

Publication number Publication date
CN115349139A (zh) 2022-11-15

Similar Documents

Publication Publication Date Title
WO2022133627A1 (fr) Procédé et appareil de segmentation d'image, et dispositif et support de stockage
Ng et al. Actionflownet: Learning motion representation for action recognition
US11393100B2 (en) Automatically generating a trimap segmentation for a digital image by utilizing a trimap generation neural network
WO2022109922A1 (fr) Procédé et appareil d'implémentation de matage d'image, dispositif, et support de stockage
CN112084859B (zh) 一种基于稠密边界块和注意力机制的建筑物分割方法
CN112990222B (zh) 一种基于图像边界知识迁移的引导语义分割方法
CN115512103A (zh) 多尺度融合遥感图像语义分割方法及系统
CN112070040A (zh) 一种用于视频字幕的文本行检测方法
CA3137297A1 (fr) Circonvolutions adaptatrices dans les reseaux neuronaux
CN112883887B (zh) 一种基于高空间分辨率光学遥感图像的建筑物实例自动提取方法
JP2022090633A (ja) 高解像度画像内の物体検出を改善するための方法、コンピュータ・プログラム製品、およびコンピュータ・システム
CN110852199A (zh) 一种基于双帧编码解码模型的前景提取方法
KR20210029692A (ko) 비디오 영상에 보케 효과를 적용하는 방법 및 기록매체
CN113822794A (zh) 一种图像风格转换方法、装置、计算机设备和存储介质
CN112364933A (zh) 图像分类方法、装置、电子设备和存储介质
CN115565043A (zh) 结合多表征特征以及目标预测法进行目标检测的方法
CN117237648B (zh) 基于上下文感知的语义分割模型的训练方法、装置和设备
WO2024041235A1 (fr) Procédé et appareil de traitement d'image, dispositif, support d'enregistrement et produit-programme
Li et al. Inductive guided filter: Real-time deep image matting with weakly annotated masks on mobile devices
CN115187831A (zh) 模型训练及烟雾检测方法、装置、电子设备及存储介质
CN115205624A (zh) 一种跨维度注意力聚合的云雪辩识方法、设备及存储介质
Tran et al. Encoder–decoder network with guided transmission map: Robustness and applicability
Lin et al. Deep asymmetric extraction and aggregation for infrared small target detection
Zhang Detect forgery video by performing transfer learning on deep neural network
Yetiş Auto-conversion from2D drawing to 3D model with deep learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20966208

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20966208

Country of ref document: EP

Kind code of ref document: A1