CN115249306A - Image segmentation model training method, image processing device and storage medium - Google Patents

Image segmentation model training method, image processing device and storage medium

Info

Publication number
CN115249306A
Authority
CN
China
Prior art keywords
image
avatar
segmentation model
segmentation
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211111509.4A
Other languages
Chinese (zh)
Other versions
CN115249306B (en)
Inventor
曾颖森
沈招益
郑天航
杨思庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211111509.4A priority Critical patent/CN115249306B/en
Publication of CN115249306A publication Critical patent/CN115249306A/en
Application granted granted Critical
Publication of CN115249306B publication Critical patent/CN115249306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image segmentation model training method, an image processing device and a storage medium. The embodiments of the invention can achieve more accurate image segmentation of avatar images. The method can be widely applied in information processing for scenarios that require image anti-blocking processing, such as artificial intelligence, intelligent transportation, assisted driving, and audio and video.

Description

Image segmentation model training method, image processing device and storage medium
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to an image segmentation model training method, an image processing apparatus, and a storage medium.
Background
Currently, in the field of image segmentation, a semantic segmentation model is often used to segment an input image, so as to locate a target object in the input image and generate a segmentation mask corresponding to that target object.
Because non-avatars share similar image characteristics, the semantic segmentation models of the related art can segment non-avatar images fairly accurately, so that the non-avatar in a non-avatar image can be located precisely. For avatars, however, different creators have different drawing styles, so the same avatar may exhibit several distinct types of image features. When a semantic segmentation model of the related art is used to segment avatar images, problems such as overfitting and poor generalization capability often arise, so accurate image segmentation cannot be performed on avatar images.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiments of the invention provide an image segmentation model training method, an image processing device and a storage medium, which can achieve image segmentation of avatar images more accurately.
In one aspect, an embodiment of the present invention provides an image segmentation model training method, including the following steps:
acquiring a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, and a general image sample and a saliency segmentation label;
training an initial image segmentation model by using the non-avatar image sample to obtain a first image segmentation model, wherein in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label;
training the first image segmentation model by using the avatar image sample and the general image sample to obtain a target image segmentation model, wherein when the first image segmentation model is trained by using the avatar image sample, parameters of the first image segmentation model are corrected according to the avatar segmentation label; and when the first image segmentation model is trained by utilizing the general image sample, correcting parameters of the first image segmentation model according to the saliency segmentation labels.
In another aspect, an embodiment of the invention further provides an image processing method, which includes the following steps:
acquiring an image to be processed;
inputting the image to be processed into a target image segmentation model for image segmentation to obtain a first segmentation image;
performing image anti-blocking processing by using the first segmentation image;
wherein, the target image segmentation model is obtained by training through the image segmentation model training method.
In another aspect, an embodiment of the present invention further provides an image segmentation model training device, including:
the system comprises a sample acquisition unit, a segmentation unit and a segmentation unit, wherein the sample acquisition unit is used for acquiring a non-virtual image sample and a non-virtual image segmentation label, a virtual image sample and a virtual image segmentation label, a general image sample and a saliency segmentation label;
the first training unit is used for training an initial image segmentation model by using the non-avatar image sample to obtain a first image segmentation model, wherein in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label;
the second training unit is used for training the first image segmentation model by using the avatar image sample and the general image sample to obtain a target image segmentation model, wherein when the first image segmentation model is trained by using the avatar image sample, parameters of the first image segmentation model are corrected according to the avatar segmentation label; and when the first image segmentation model is trained by utilizing the general image sample, correcting parameters of the first image segmentation model according to the saliency segmentation labels.
Optionally, the non-avatar image samples comprise non-avatar dynamic image samples of different resolutions; the first training unit is further configured to:
acquiring the non-avatar dynamic image samples with different time sequence lengths under different resolutions;
and training the initial image segmentation model by using the non-avatar dynamic image samples of each time sequence length under each resolution to obtain the first image segmentation model.
Optionally, the first training unit is further configured to:
performing iterative model training on the initial image segmentation model by using the non-avatar dynamic image samples with different time sequence lengths under the same resolution to obtain a second image segmentation model;
and training the second image segmentation model by using the non-avatar dynamic image samples with different time sequence lengths under different resolutions to obtain the first image segmentation model.
Optionally, the different time sequence lengths at the same resolution include a first time sequence length and a second time sequence length; the first training unit is further configured to:
training the initial image segmentation model by using the non-avatar dynamic image sample with the first time sequence length to obtain a third image segmentation model;
and training the third image segmentation model by using the non-avatar dynamic image sample with the second time sequence length to obtain a second image segmentation model.
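For illustration, the following is a minimal PyTorch-style sketch of such a progressive training schedule. The model, the optimizer, the make_clip_loader helper, and the concrete resolutions and clip lengths are assumptions made for this example, not values fixed by the disclosure.

```python
import torch.nn.functional as F

def train_stage(model, loader, optimizer, epochs=1):
    """One supervised pass: the segmentation labels correct the model parameters."""
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:      # clips: [B, T, C, H, W]; labels: [B, T, 1, H, W]
            logits = model(clips)         # predicted segmentation maps
            loss = F.binary_cross_entropy_with_logits(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

def progressive_training(model, make_clip_loader, optimizer,
                         resolutions=((160, 160), (256, 256), (384, 384)),
                         clip_lengths=(5, 15)):
    """Sweep the time sequence lengths at each resolution, starting from the lowest
    resolution, so the model adapts gradually to longer clips and larger inputs."""
    for height, width in resolutions:
        for clip_len in clip_lengths:     # first, then second time sequence length
            loader = make_clip_loader(resolution=(height, width), clip_len=clip_len)
            model = train_stage(model, loader, optimizer)
    return model
```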
Optionally, the image segmentation model training apparatus further includes:
the system comprises a first acquisition unit, a second acquisition unit and a display unit, wherein the first acquisition unit is used for acquiring a virtual environment image material, an avatar image material and a transparency channel map corresponding to the avatar image material;
the first fusion unit is used for performing image fusion on the avatar image material and the virtual environment image material to obtain the avatar image sample;
and the second fusion unit is used for performing image fusion on the transparency channel map and the virtual environment image material to obtain the avatar segmentation label.
Optionally, the first fusion unit is further configured to:
performing at least one of geometric transformation, color transformation or random noise addition on the avatar image material to obtain a plurality of target image materials;
and performing image fusion on each target image material and the virtual environment image material to obtain a plurality of avatar image samples.
Optionally, the second fusion unit is further configured to:
performing at least one of geometric transformation, color transformation or random noise addition on the transparency channel map to obtain a plurality of target channel maps, wherein the target channel maps correspond to the target image materials one to one;
and performing image fusion on each target channel map and the virtual environment image material to obtain a plurality of avatar segmentation labels corresponding to the avatar image samples.
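As an illustration of this foreground-background synthesis, the following is a minimal OpenCV/NumPy sketch. It assumes the avatar material is loaded as a four-channel image whose fourth channel is the transparency channel map; the rotation angle and scale stand in for the geometric transformation, and the same transform is applied to the color channels and to the transparency channel so that the resulting avatar image sample and avatar segmentation label stay aligned.

```python
import numpy as np
import cv2

def synthesize_sample(avatar_bgra, background_bgr, angle=15.0, scale=1.0):
    """Fuse an avatar image material onto a virtual environment image material.

    Returns (avatar image sample, avatar segmentation label)."""
    h, w = background_bgr.shape[:2]
    avatar_bgra = cv2.resize(avatar_bgra, (w, h))

    # Shared geometric transformation (rotation and scaling about the image center).
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    color = cv2.warpAffine(avatar_bgra[:, :, :3], matrix, (w, h))
    alpha = cv2.warpAffine(avatar_bgra[:, :, 3], matrix, (w, h))

    # Alpha blending: sample = alpha * avatar + (1 - alpha) * background.
    a = (alpha.astype(np.float32) / 255.0)[:, :, None]
    sample = (a * color + (1.0 - a) * background_bgr).astype(np.uint8)

    # The transformed transparency channel, placed on the background canvas,
    # serves as the avatar segmentation label for this sample.
    label = (alpha > 127).astype(np.uint8) * 255
    return sample, label
```

Calling the same routine with different angles, scales, color jitters or added noise yields a plurality of target image materials and target channel maps, and hence a plurality of paired avatar image samples and avatar segmentation labels.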
Optionally, the image segmentation model training apparatus further includes:
the second acquisition unit is used for acquiring a non-avatar image material and a non-avatar material segmentation label corresponding to the non-avatar image material;
the virtual stylization unit is used for performing virtual stylization on the non-avatar image material to obtain the avatar image sample;
a label determination unit for using the non-avatar material segmentation label as the avatar segmentation label.
In another aspect, an embodiment of the present invention further provides an image processing apparatus, including:
the image acquisition unit is used for acquiring an image to be processed;
the image segmentation unit is used for inputting the image to be processed into a target image segmentation model for image segmentation to obtain a first segmentation image;
the image anti-blocking unit is used for performing image anti-blocking processing by using the first segmentation image;
wherein the target image segmentation model is obtained by training through the image segmentation model training device.
Optionally, the image anti-blocking unit is further configured to:
performing Gaussian blur on the first segmentation image to obtain a second segmentation image;
binarizing the second segmentation image according to a preset threshold value to obtain a binarized image;
performing connected domain detection on the binarized image to obtain a connected domain in the binarized image;
obtaining a mask image according to the connected domain;
and performing image anti-blocking processing on the image to be processed according to the mask image.
Optionally, the image anti-blocking unit is further configured to:
filling holes in the connected domain to obtain a filled image;
and vectorizing the filled image to obtain a mask image.
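A minimal OpenCV sketch of this post-processing chain is given below. The blur kernel size, binarization threshold, and minimum connected-domain area are illustrative assumptions rather than values specified by the disclosure, and the final vectorization step is only indicated by a comment.

```python
import numpy as np
import cv2

def build_mask(first_segmentation, threshold=128, min_area=500):
    """Turn the model's first segmentation image into a mask image."""
    # 1) Gaussian blur -> second segmentation image (smooths jagged edges).
    blurred = cv2.GaussianBlur(first_segmentation, (5, 5), 0)

    # 2) Binarize with a preset threshold -> binarized image.
    _, binary = cv2.threshold(blurred, threshold, 255, cv2.THRESH_BINARY)

    # 3) Connected-domain detection; keep sufficiently large components.
    num, comp_labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    mask = np.zeros_like(binary)
    for i in range(1, num):                      # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            mask[comp_labels == i] = 255

    # 4) Fill holes inside the kept connected domains by flood-filling the
    #    background from a border pixel (assumed to lie outside the mask).
    flood = mask.copy()
    h, w = mask.shape
    ff_mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(flood, ff_mask, (0, 0), 255)
    filled = cv2.bitwise_or(mask, cv2.bitwise_not(flood))

    # 5) (Optional) vectorize the filled image, e.g. by extracting contours,
    #    before using it as the mask image for image anti-blocking processing.
    # contours, _ = cv2.findContours(filled, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return filled
```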
In another aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor;
at least one memory for storing at least one program;
at least one of the programs, when executed by at least one of the processors, implements an image segmentation model training method as previously described, or implements an image processing method as previously described.
In another aspect, the present invention further provides a computer-readable storage medium, in which a computer program executable by a processor is stored, and the computer program executable by the processor is used to implement the image segmentation model training method as described above, or implement the image processing method as described above.
In another aspect, the embodiment of the present invention further provides a computer program product, which includes a computer program or a computer instruction, where the computer program or the computer instruction is stored in a computer-readable storage medium, and a processor of an electronic device reads the computer program or the computer instruction from the computer-readable storage medium, and the processor executes the computer program or the computer instruction, so that the electronic device executes the image segmentation model training method as described above or executes the image processing method as described above.
The embodiment of the invention at least includes the following beneficial effects. After a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, and a general image sample and a saliency segmentation label are obtained, an initial image segmentation model is trained by using the non-avatar image sample to obtain a first image segmentation model, and in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label; in this way, the first image segmentation model can quickly adapt to the data distribution, and its image segmentation capability is preliminarily improved. The first image segmentation model is then trained by using the avatar image sample and the general image sample to obtain a target image segmentation model; when the avatar image sample is used to train the first image segmentation model, parameters of the first image segmentation model are corrected according to the avatar segmentation label, and when the general image sample is used, the parameters are corrected according to the saliency segmentation label. The target image segmentation model can therefore expand the segmentation object from non-avatars to avatars, which improves the image segmentation accuracy for avatar images. In addition, because the first image segmentation model is trained with the general image sample and the saliency segmentation label, the target image segmentation model acquires the capability of saliency detection; since saliency detection is not sensitive to the particular style of an avatar, image segmentation of various avatar images can be achieved without using a large number of avatar image samples for model training, so that image segmentation of avatar images is achieved more accurately and the training efficiency of the target image segmentation model is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
FIG. 1 is a schematic diagram of an implementation environment provided by embodiments of the invention;
FIG. 2 is a schematic diagram of another implementation environment provided by embodiments of the invention;
FIG. 3 is a schematic diagram of another implementation environment provided by embodiments of the invention;
FIG. 4 is a flowchart of an image segmentation model training method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a model structure of an image segmentation model according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of a foreground and background synthesis method provided in an embodiment of the present invention;
FIG. 7 is a schematic flow chart illustrating a style migration method according to an embodiment of the present invention;
FIG. 8 is a flow chart of an image processing method according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating an overall method for training an image segmentation model and a mask image acquisition method according to an embodiment of the present invention;
FIG. 10 is a flow chart of model training of an image segmentation model provided by a specific example of the present invention;
FIG. 11 is a flow chart of data post-processing of model output results provided by a specific example of the present invention;
FIG. 12 (a) is a schematic diagram of a mask image provided by an embodiment of the present invention;
FIG. 12 (b) is a schematic illustration of another mask image provided by an embodiment of the present invention;
FIG. 13 (a) is a schematic diagram of an image after being subjected to an image anti-blocking process according to an embodiment of the present invention;
FIG. 13 (b) is a schematic diagram of another image after being subjected to image anti-blocking processing according to the embodiment of the present invention;
FIG. 14 (a) is a schematic diagram of a mask image obtained by the image processing method according to the embodiment of the present invention;
FIG. 14 (b) is a diagram illustrating a mask image obtained by a semantic segmentation-based method according to the related art;
FIG. 14 (c) is a schematic view of another mask image obtained by the image processing method according to the embodiment of the present invention;
FIG. 14 (d) is a diagram showing another mask image obtained by a semantic segmentation-based method according to the related art;
FIG. 15 is a diagram of an image segmentation model training apparatus according to an embodiment of the present invention;
fig. 16 is a schematic diagram of an image processing apparatus according to an embodiment of the present invention;
fig. 17 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the drawings and the specific examples. The described embodiments should not be considered as limiting the invention, and all other embodiments obtained by a person skilled in the art without making any inventive step are within the scope of protection of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions mentioned in the embodiments of the present invention are explained; the following explanations apply to the terms and expressions mentioned in the embodiments of the present invention.
1) An avatar (Virtual Character) generally refers to a synthesized character image. In terms of structure, an avatar may be the image of a three-dimensional model or a planar (two-dimensional) image. An avatar may be an image formed by imitating a human figure, an image formed by imitating an animal figure, or an image based on a figure in an animation or comic. An avatar is generally a character image, which may be a person, an animal, and the like.
2) A non-avatar is a character image as opposed to an avatar. A non-avatar generally refers to the image of a real character in a real environment, which may be, for example, a person, an animal, or the like.
3) A general image refers to a generic image that includes a target object. The target object may be a non-avatar such as a person, an animal or a plant, or may be an avatar corresponding to such a non-avatar.
4) Semantic segmentation technology, a deep learning algorithm that associates a label or class with each pixel in an image. Semantic segmentation techniques may be used to identify a set of pixels that constitute a distinguishable category.
5) Saliency detection refers to a technology that simulates human vision through an intelligent algorithm and extracts the salient region (that is, the region of interest) in an image.
6) Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like. With the research and development of artificial intelligence technology, the artificial intelligence technology is developed and researched in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical services, smart customer service and the like.
7) Computer Vision technology (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as target recognition and measurement, and further performs image processing so that the processed image is more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
8) Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.
9) The block chain (Blockchain) is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The blockchain is essentially a decentralized database, which is a string of data blocks associated by using cryptography, each data block contains information of a batch of network transactions, and the information is used for verifying the validity (anti-counterfeiting) of the information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
10) An Intelligent Transportation System (ITS) is a comprehensive transportation system that effectively and comprehensively applies advanced scientific technologies (information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operational research, artificial intelligence, etc.) to transportation, service control and vehicle manufacturing, and strengthens the connection among vehicles, roads and users, thereby forming a comprehensive transportation system that ensures safety, improves efficiency, improves the environment and saves energy.
Image segmentation technology can be used to implement an image anti-blocking function for an input image, for example a bullet-screen anti-blocking function, so as to improve the viewing experience of the user. The input image may include a static image and a dynamic image; the static image may be a picture or a static three-dimensional image built through a three-dimensional technology, and the dynamic image may be a moving picture or a video. In the related art, a semantic segmentation model is often used to perform image segmentation on the input image, so as to locate a target object in the input image and generate a segmentation mask corresponding to the target object, and the segmentation mask can then be used to implement the image anti-blocking function for the target object. In order for the semantic segmentation model to segment the input image accurately, a large number of image samples are required to train it. However, the related art pays more attention to semantic segmentation of non-avatars (e.g., real-person characters), so the sample sets provided in the related art are basically image samples dominated by non-avatars; for example, the COCO (Common Objects in Context) data set and the PASCAL-VOC data set provided in the related art are image sample data sets dominated by real-person characters. Because non-avatars have similar image features, the semantic segmentation model in the related art can often segment non-avatar images fairly accurately, so that the non-avatar in a non-avatar image can be located precisely. For avatars, however, different authors have different creation styles, so the same avatar may have a plurality of different types of image features. If the semantic segmentation model in the related art is used to segment avatar images, problems such as overfitting and poor generalization capability are easily produced, so accurate image segmentation cannot be performed on avatar images, and the image anti-blocking function for avatar images cannot be effectively implemented.
In order to improve the accuracy of image segmentation of avatar images, the embodiments of the invention provide an image segmentation model training method, an image processing method, an image segmentation model training device, an image processing device, an electronic device, a computer-readable storage medium and a computer program product. After a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, and a general image sample and a saliency segmentation label are obtained, an initial image segmentation model is trained by using the non-avatar image sample to obtain a first image segmentation model, and in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label; in this way, the first image segmentation model can quickly adapt to the data distribution, and its image segmentation capability is preliminarily improved. The first image segmentation model is then trained by using the avatar image sample and the general image sample to obtain a target image segmentation model; when the avatar image sample is used to train the first image segmentation model, parameters of the first image segmentation model are corrected according to the avatar segmentation label, and when the general image sample is used, the parameters are corrected according to the saliency segmentation label. The target image segmentation model can therefore expand the segmentation object from non-avatars to avatars, which improves the image segmentation accuracy for avatar images. In addition, because the first image segmentation model is trained with the general image sample and the saliency segmentation label, the target image segmentation model acquires the capability of saliency detection; since saliency detection is not sensitive to the particular style of an avatar, image segmentation of various avatar images can be achieved without using a large number of avatar image samples for model training, so that image segmentation of avatar images is achieved more accurately and the training efficiency of the target image segmentation model is improved.
The solution provided by the embodiments of the invention involves technologies such as machine learning in artificial intelligence, which are specifically described in the following embodiments.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the invention. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 are connected directly or indirectly through wired or wireless communication, where the terminal 101 and the server 102 may be nodes in a block chain, and this embodiment is not particularly limited thereto.
The terminal 101 may include, but is not limited to, a smart phone, a computer, a smart voice interaction device, a smart appliance, a vehicle-mounted terminal, an aircraft, and other smart devices having a display screen. Alternatively, the terminal 101 may be installed with a player application for playing a still image or a moving image, and when the user views the still image or the moving image through the player application, the user may issue content such as a text bullet screen and an image bullet screen superimposed on the still image or the moving image, may also view content such as a text bullet screen and an image bullet screen superimposed on the still image or the moving image issued by another user, and may also view content such as a system message superimposed on the still image or the moving image issued by the server 102.
The terminal 101 has at least functions of initiating a request and displaying an image, for example, the terminal 101 can send an acquisition request for a mask image to the server 102 in response to an operation of opening an image anti-blocking function by a user, perform image anti-blocking processing according to the mask image after receiving the mask image fed back by the server 102 according to the acquisition request, and display the image subjected to the image anti-blocking processing. For another example, the terminal 101 can request the server 102 for the target video with the mask image in response to an operation of playing the video with the image anti-blocking function, and after receiving the target video with the mask image fed back by the server 102, display the target video with the mask image for implementing the image anti-blocking function. In addition, the terminal 101 may also download the trained image segmentation model from the server 102, and in response to the operation of playing the video, the terminal 101 may input the video frame image to be played to the image segmentation model for image segmentation to obtain a corresponding mask image, then perform image anti-occlusion processing on the video frame image to be played by using the mask image, and then display the video frame image subjected to the image anti-occlusion processing. It should be noted that the mask image is an image for performing global or local occlusion on the image to be processed, and the mask image may cause content superimposed on the image to be processed not to be displayed in the global range or the local position of the image to be processed, and in addition, the mask image does not affect the normal display of the image to be processed.
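For intuition only, the mask image can be viewed as gating the overlaid content (for example a bullet-screen layer): where the mask covers the target object, the overlay is hidden, while the image to be processed itself is always displayed normally. The sketch below is an assumed compositing scheme used to illustrate this behavior, not a procedure taken from the disclosure.

```python
import numpy as np

def apply_anti_blocking(frame, danmaku_layer, danmaku_alpha, mask):
    """Composite a bullet-screen layer onto a frame, hiding it where mask == 255.

    frame, danmaku_layer: uint8 images of the same size; danmaku_alpha: per-pixel
    overlay opacity in [0, 1]; mask: uint8 mask image (255 over the target object).
    """
    # Zero the overlay opacity wherever the mask covers the target object.
    gate = (mask.astype(np.float32) / 255.0)[..., None]
    a = danmaku_alpha[..., None] * (1.0 - gate)
    out = a * danmaku_layer.astype(np.float32) + (1.0 - a) * frame.astype(np.float32)
    return out.astype(np.uint8)  # the underlying frame is never occluded by the mask
```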
The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform.
The server 102 has at least functions of training an image segmentation model, performing image segmentation on an image to be processed by using the trained image segmentation model, and the like, and for example, can train an initial image segmentation model by using a non-avatar image sample to obtain a first image segmentation model after obtaining the non-avatar image sample and a non-avatar segmentation label, the avatar image sample and the avatar segmentation label, and a common image sample and a saliency segmentation label, wherein parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label during training the initial image segmentation model by using the non-avatar image sample, and then the first image segmentation model is trained by using the avatar image sample and the common image sample to obtain a target image segmentation model, wherein parameters of the first image segmentation model are corrected according to the avatar segmentation label during training the first image segmentation model by using the avatar image sample, and parameters of the first image segmentation model are corrected according to the saliency label during training the first image segmentation model by using the common image sample. For another example, the image to be processed can be segmented according to the acquisition request for the mask image from the terminal 101 and the trained image segmentation model to obtain the mask image, and then the mask image is sent to the terminal 101, so that the terminal 101 performs image anti-blocking processing on the image to be processed according to the mask image and displays the image subjected to the image anti-blocking processing; or, the image segmentation can be performed on the video frame image in the target video according to the acquisition request for the target video with the mask image from the terminal 101 and the trained image segmentation model to obtain the mask image, then the mask image and the video frame image are fused to obtain the target video with the mask image, and then the target video with the mask image is sent to the terminal 101, so that the terminal 101 can directly display the target video with the mask image.
Referring to fig. 1, in an application scenario, it is assumed that the terminal 101 is a smart phone, and the terminal 101 is installed with a player application (e.g., player software or social media playing software) for playing a still image or a moving image. In response to the fact that the user triggers an image anti-blocking function in the process of watching a video, the terminal 101 sends an acquisition request for a mask image to the server 102; in response to receiving the acquisition request, the server 102 inputs a video frame image in a video currently played by the terminal 101 to the trained image segmentation model for image segmentation to obtain a mask image corresponding to a target object in the video frame image, and then sends the mask image to the terminal 101; in response to receiving the mask image, the terminal 101 performs image anti-blocking processing on the video frame image to be played according to the mask image, and displays the video frame image subjected to the image anti-blocking processing. In the process of training the image segmentation model, after obtaining the non-avatar image sample and the non-avatar segmentation label, the avatar image sample and the avatar segmentation label, and the generic image sample and the saliency segmentation label, the server 102 trains the initial image segmentation model by using the non-avatar image sample to obtain a first image segmentation model, wherein in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label, and then the first image segmentation model is trained by using the avatar image sample and the generic image sample to obtain a target image segmentation model, wherein in the process of training the first image segmentation model by using the avatar image sample, parameters of the first image segmentation model are corrected according to the avatar segmentation label, and in the process of training the first image segmentation model by using the generic image sample, parameters of the first image segmentation model are corrected according to the saliency segmentation label.
Referring to fig. 2, in another application scenario, it is assumed that the terminal 101 is a smartphone, and the terminal 101 is installed with a player application (e.g., player software or social media playing software) for playing a still image or a moving image. In response to the user watching the video with the image anti-blocking function, the terminal 101 requests the server 102 for a target video subjected to image anti-blocking processing; in response to receiving request information for a target video subjected to image anti-blocking processing, the server 102 inputs a video frame image to be played to a trained image segmentation model for image segmentation to obtain a mask image corresponding to a target object in the video frame image, then performs image anti-blocking processing on the video frame image to be played according to the mask image to obtain the target video subjected to image anti-blocking processing, and then sends the target video subjected to image anti-blocking processing to the terminal 101; in response to receiving the target video subjected to the image anti-blocking processing, the terminal 101 displays the target video subjected to the image anti-blocking processing. In the process of training the image segmentation model, after obtaining the non-avatar image sample and the non-avatar segmentation label, the avatar image sample and the avatar segmentation label, and the generic image sample and the saliency segmentation label, the server 102 trains the initial image segmentation model by using the non-avatar image sample to obtain a first image segmentation model, wherein in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label, and then the first image segmentation model is trained by using the avatar image sample and the generic image sample to obtain a target image segmentation model, wherein in the process of training the first image segmentation model by using the avatar image sample, parameters of the first image segmentation model are corrected according to the avatar segmentation label, and in the process of training the first image segmentation model by using the generic image sample, parameters of the first image segmentation model are corrected according to the saliency segmentation label.
Referring to fig. 3, in an application scenario, it is assumed that the terminal 101 is a vehicle-mounted terminal, and the terminal 101 is installed with a player application (for example, player software or social media playing software) for playing a still image or a moving image, and in addition, the terminal 101 downloads a trained image segmentation model from the server 102 in advance. In response to the fact that the user triggers an image anti-blocking function in the process of watching a video, the terminal 101 inputs a video frame image to be played into the trained image segmentation model for image segmentation to obtain a mask image corresponding to a target object in the video frame image, then performs image anti-blocking processing on the video frame image to be played according to the mask image, and then displays the video frame image subjected to the image anti-blocking processing. The method comprises the steps that in the process of training an image segmentation model, after a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, and a general image sample and a saliency segmentation label are obtained, an initial image segmentation model is trained by the non-avatar image sample to obtain a first image segmentation model, wherein in the process of training the initial image segmentation model by the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label, then the first image segmentation model is trained by the avatar image sample and the general image sample to obtain a target image segmentation model, wherein in the process of training the first image segmentation model by the avatar image sample, parameters of the first image segmentation model are corrected according to the avatar segmentation label, and in the process of training the first image segmentation model by the general image sample, parameters of the first image segmentation model are corrected according to the saliency segmentation label.
In each embodiment of the present invention, when data related to the characteristics of an object (for example, a user), such as attribute information or attribute information sets of the object, is subjected to relevant processing, the permission or consent of the corresponding object is obtained, and the collection, use and processing of the data comply with the relevant laws, regulations and standards of the relevant countries and regions. In addition, when an embodiment of the present invention needs to acquire attribute information of an object, individual permission or individual consent of the corresponding object may be obtained by means of a pop-up window or by jumping to a confirmation page, and only after the individual permission or individual consent of the corresponding object is explicitly obtained are the object-related data necessary for the normal operation of the embodiment acquired.
The embodiments of the invention can be applied to various scenarios in which images require image anti-blocking processing, including but not limited to image anti-blocking scenarios in fields such as video and live streaming.
Fig. 4 is a flowchart of an image segmentation model training method provided in an embodiment of the present invention, where the image segmentation model training method may be executed by a terminal or a server, or may be executed by both the terminal and the server. Referring to fig. 4, the image segmentation model training method includes, but is not limited to, steps 110 to 130.
Step 110: Obtaining a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, and a general image sample and a saliency segmentation label.
In this step, the non-avatar segmentation label is label information corresponding to the non-avatar image sample; when the non-avatar image sample is used to train the image segmentation model, the non-avatar segmentation label can serve as the label information for correcting model parameters in the model training process. The avatar segmentation label is label information corresponding to the avatar image sample; when the avatar image sample is used to train the image segmentation model, the avatar segmentation label can serve as the label information for correcting model parameters in the model training process. The saliency segmentation label is label information corresponding to the general image sample; when the general image sample is used to train the image segmentation model, the saliency segmentation label can serve as the label information for correcting model parameters in the model training process. It should be noted that the non-avatar image sample, the avatar image sample, and the general image sample may all include static images and dynamic images, where a static image may be a picture or a static three-dimensional image built through a three-dimensional technology, and a dynamic image may be a moving picture or a video.
In one possible embodiment, the non-avatar image sample and the non-avatar segmentation label may be obtained from a data set provided by the related art, for example, different data sets such as a COCO data set or a PASCAL-VOC data set provided by the related art, which is not limited herein. The COCO data set is a data set commonly used in the art and can be used for image recognition, and includes a training set, a verification set and a test set. The PASCAL-VOC data set is a data set of a PASCAL-VOC challenge race and can be applied to the aspects of target classification, target detection, target segmentation, human body layout, action recognition and the like in image recognition. In addition, the non-avatar image sample may be obtained by acquiring a public image on a network, in which case, the non-avatar segmentation label may be obtained by manually labeling the non-avatar image sample, or may be obtained by performing image segmentation on the non-avatar image sample by using a semantic segmentation model, which is not specifically limited herein. The semantic segmentation model may be composed of a deep neural network or a deep convolutional neural network, and is not limited in this respect.
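As one hedged example of the pseudo-labeling option, person-type non-avatar segmentation labels could be produced with an off-the-shelf semantic segmentation model such as torchvision's DeepLabV3; the choice of model and the Pascal-VOC person class index below are assumptions of this sketch, not requirements of the disclosure.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()            # resizing + ImageNet normalization

PERSON_CLASS = 15                            # Pascal VOC class index for "person"

@torch.no_grad()
def pseudo_label(image):                     # image: PIL.Image of a non-avatar sample
    batch = preprocess(image).unsqueeze(0)   # [1, 3, H, W]
    logits = model(batch)["out"]             # [1, 21, H, W]
    classes = logits.argmax(dim=1)[0]        # per-pixel class prediction
    return (classes == PERSON_CLASS).to(torch.uint8) * 255   # non-avatar segmentation label
```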
In a possible embodiment, the avatar image sample may be obtained by collecting public images on a network; in this case, the avatar segmentation label may be obtained by manually labeling the avatar image sample, or by performing image segmentation on the avatar image sample with a semantic segmentation model, which is not specifically limited herein. In addition, the avatar image sample and the avatar segmentation label may also be obtained by a foreground-background synthesis method or a style migration method, which is not specifically limited herein. It should be noted that the specific contents of the foreground-background synthesis method and the style migration method will be given in detail below.
In a possible embodiment, the generic image sample may be obtained from a different data set, such as a COCO data set or a PASCAL-VOC data set, or may be obtained by acquiring a public image on a network, and is not limited in this respect. The saliency segmentation labels can be obtained by performing saliency detection on a general image sample by using a conventional saliency detection model, wherein the saliency detection model can be a deep learning target detection model based on region suggestion or a deep learning target detection model based on regression, and the like, and is not limited in detail here. The deep learning target detection model based on the Region suggestion may include a Region Convolutional Neural Network (R-CNN) model, a Spatial Pyramid Pooling (SPP-Net) model, or a Region Fully Convolutional Neural Network (R-FCN) model, which is not specifically limited herein. The regression-based deep learning target detection model may include a YOLO (young Only Look one) model, a Single shot multi-box target detection (SSD) model, or a Non Maximum Suppression (NMS) model, and the like, and is not limited herein. It should be noted that the R-CNN model, SPP-Net model, R-FCN model, YOLO model, SSD model, NMS model, etc. are all common models in the art, and the specific model structures of these models may refer to the related descriptions in the related art, and are not described herein again.
Step 120: Training the initial image segmentation model by using the non-avatar image sample to obtain a first image segmentation model, wherein in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label.
In this step, since the non-avatar image sample and the non-avatar segmentation label are obtained in step 110, the non-avatar image sample may be used to train the initial image segmentation model to obtain the first image segmentation model, so that the first image segmentation model may adapt to data distribution quickly, and the image segmentation capability of the first image segmentation model is initially improved. The non-avatar segmentation label can be used as label information for correcting model parameters in a model training process, so that in the process of training an initial image segmentation model by using a non-avatar image sample, when the initial image segmentation model outputs image segmentation information corresponding to the non-avatar image sample, a loss value can be calculated according to the non-avatar segmentation label and the image segmentation information output by the initial image segmentation model, and then parameters of the initial image segmentation model are corrected according to the calculated loss value until the loss value is smaller than a preset loss threshold value. The preset loss threshold may be appropriately selected according to the actual application, and is not particularly limited herein.
In a possible embodiment, assuming that the non-avatar image sample is a non-avatar video sample, when the initial image segmentation model is trained by using the non-avatar image sample, a first number (e.g., 15 frames, etc.) of video sample frames may be continuously taken each time to train the initial image segmentation model, where the video sample frames may be low-resolution video frames, so that the initial image segmentation model can rapidly segment the video sample frames in the training process, the initial image segmentation model can rapidly adapt to data distribution, and a segmentation task for the non-avatar image sample is initially implemented, thereby initially improving the image segmentation capability of the first image segmentation model.
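A small sketch of this clip sampling is shown below; the clip length of 15 frames follows the example above, while the reduced resolution is an assumed value.

```python
import cv2

def sample_clips(video_path, clip_len=15, size=(160, 160)):
    """Yield consecutive low-resolution clips of `clip_len` frames from a video sample."""
    capture = cv2.VideoCapture(video_path)
    clip = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        clip.append(cv2.resize(frame, size))   # downscale for fast segmentation
        if len(clip) == clip_len:
            yield clip                          # one training clip of consecutive frames
            clip = []
    capture.release()
```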
Step 130: training the first image segmentation model by using the virtual image sample and the general image sample to obtain a target image segmentation model, wherein when the first image segmentation model is trained by using the virtual image sample, parameters of the first image segmentation model are corrected according to the virtual image segmentation label; and when the first image segmentation model is trained by using the general image sample, correcting the parameters of the first image segmentation model according to the saliency segmentation labels.
In this step, since the avatar image sample and the avatar segmentation label, and the general image sample and the saliency segmentation label, are obtained in step 110, and the first image segmentation model is obtained in step 120, the first image segmentation model can be trained with the avatar image sample and the general image sample to obtain the target image segmentation model, so that the target image segmentation model can expand its segmentation object from the non-avatar to the avatar, thereby improving the image segmentation accuracy for avatar images. In addition, because the first image segmentation model is trained with the general image sample, the target image segmentation model gains the capability of saliency detection. Saliency detection identifies the most visually distinct region in a picture and segments that region according to its edge; in other words, by analyzing the structure of the picture, saliency detection can distinguish the image foreground (i.e., the avatar) from the image background, and is therefore insensitive to the particular appearance of the avatar. As a result, image segmentation of various avatar images can be achieved without using a large number of avatar image samples for training, which also improves the training efficiency of the target image segmentation model.
It should be noted that the first image segmentation model is trained by using the avatar image sample and the general image sample, the first image segmentation model may be trained by using the avatar image sample and the general image sample respectively, or the first image segmentation model may be trained by using the avatar image sample and the general image sample as an integrated sample set, which is not limited herein. When the avatar image sample and the general image sample are respectively used for training the first image segmentation model, the avatar image sample can be used for training the first image segmentation model, after the training is finished, the general image sample is used for training the trained first image segmentation model, and after the training is finished, the target image segmentation model can be obtained. Or, the first image segmentation model may be trained by using the general image sample, and after the training is completed, the trained first image segmentation model is trained by using the avatar image sample, and after the training is completed, the target image segmentation model may be obtained. The avatar segmentation label and the saliency segmentation label can be used as label information for correcting model parameters in a model training process, so that in the process of training a first image segmentation model by using an avatar image sample, when the first image segmentation model outputs image segmentation information corresponding to the avatar image sample, a loss value can be calculated according to the avatar segmentation label and the image segmentation information output by the first image segmentation model, and then the parameters of the first image segmentation model are corrected according to the calculated loss value until the loss value is smaller than a preset loss threshold value; in the process of training the first image segmentation model by using the common image sample, when the first image segmentation model outputs image segmentation information corresponding to the common image sample, a loss value may be calculated according to the saliency segmentation label and the image segmentation information output by the first image segmentation model, and then a parameter of the first image segmentation model is corrected according to the calculated loss value until the loss value is smaller than a preset loss threshold.
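As a sketch of the two training options described above (training on each sample source in turn, or treating both sources as one sample set), the following reuses the hypothetical train_until_converged loop from the earlier sketch; the loaders and the separate flag are illustrative assumptions:

```python
from itertools import chain

def train_first_model(first_model, avatar_loader, general_loader, separate=True):
    """Obtain the target model by fine-tuning on avatar and general samples,
    either one data set after the other or as one merged sample set."""
    if separate:
        train_until_converged(first_model, avatar_loader)    # loss against avatar segmentation labels
        train_until_converged(first_model, general_loader)   # loss against saliency segmentation labels
    else:
        merged = list(chain(avatar_loader, general_loader))  # both sources as an integral sample set
        train_until_converged(first_model, merged)
    return first_model
```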
In a possible embodiment, assuming that the initial image segmentation model has been previously trained by using a non-avatar video sample, after the training of the initial image segmentation model is completed to obtain the first image segmentation model, the first image segmentation model may be trained by using an avatar image sample and a general image sample in the form of picture data, and both the avatar image sample and the general image sample may be picture data with low resolution, so that the first image segmentation model can rapidly perform image segmentation on the picture data in the training process, so that the obtained target image segmentation model can expand a segmentation object from the non-avatar to the avatar, thereby improving the image segmentation accuracy of the avatar image.
In a possible embodiment, the image segmentation models (i.e., the initial image segmentation model, the first image segmentation model, and the target image segmentation model) may include an input module, a plurality of encoding modules, a plurality of Gated Recurrent Unit (GRU) modules, a plurality of upsampling modules, a plurality of decoding modules, an output module, and the like, where the GRU module is a type of recurrent neural network that can handle long-term memory and alleviate gradient problems in back propagation. For example, the model structure of the image segmentation model may be as shown in fig. 5: the image segmentation model includes 4 encoding modules, 4 GRU modules, 4 upsampling modules, 3 decoding modules and an output module. The output of an encoding module includes a first output and a second output; the first output is connected to the corresponding decoding module, and the second output is connected to the next encoding module. The output of the next (deeper) encoding module, after being processed by its corresponding GRU module and upsampling module, is concatenated in a feature-series manner with the first output of the previous (shallower) encoding module and then fed into the decoding module corresponding to that previous encoding module. That is, the overall structure of the image segmentation model is a deep-to-shallow structure in which the output of a deeper encoding module is first concatenated with the features of a shallower encoding module and then connected to the decoding module corresponding to the shallower encoding module. It should be noted that the deepest encoding module has only one output; after being processed by its corresponding GRU module and upsampling module, this output can be concatenated with the output of the previous encoding module in a feature-series manner and does not need to be processed by a decoding module of its own. In the image segmentation model, the encoding module, the decoding module and the output module can be multilayer convolutional neural networks, where a multilayer convolutional neural network can include a convolution layer, a batch normalization layer and a ReLU activation layer connected in sequence, and can extract features of the picture. In addition, the GRU module can keep the motion information of the video frames, so that the output of the network is smoother and the expressive capability of the features is improved. In an embodiment, the input to the image segmentation model may be continuous video frames; after being processed by each module in the image segmentation model, a single-channel segmentation map is output, where the value of each pixel in the output segmentation map is between 0 and 1 and represents the confidence that the pixel belongs to the foreground of interest.
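The following is a simplified PyTorch sketch of such an encoder-GRU-decoder structure with deep-to-shallow skip connections. It only illustrates the wiring described above: the channel counts, the convolutional GRU cell and the layer depth per module are assumptions and do not reproduce the structure of fig. 5 exactly (input height and width are assumed divisible by 16):

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """Convolution + batch normalization + ReLU, as in the multilayer convolution blocks above."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class ConvGRUCell(nn.Module):
    """A minimal convolutional GRU cell used to carry motion information across frames."""
    def __init__(self, channels):
        super().__init__()
        self.zr = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.hc = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.hc(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

class SegModelSketch(nn.Module):
    """Simplified encoder-GRU-decoder with deep-to-shallow feature concatenation."""
    def __init__(self, channels=(16, 32, 64, 128)):
        super().__init__()
        self.encoders, self.grus = nn.ModuleList(), nn.ModuleList()
        self.ups, self.decoders = nn.ModuleList(), nn.ModuleList()
        c_prev = 3
        for c in channels:                         # 4 encoding modules, each downsampling by 2
            self.encoders.append(ConvBNReLU(c_prev, c, stride=2))
            self.grus.append(ConvGRUCell(c))       # 4 GRU modules, one per encoder stage
            c_prev = c
        for i in range(len(channels) - 1, 0, -1):  # 3 decoding modules fed by upsampled deeper features
            self.ups.append(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))
            self.decoders.append(ConvBNReLU(channels[i] + channels[i - 1], channels[i - 1]))
        self.out = nn.Sequential(                  # output module: single-channel map in [0, 1]
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels[0], 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, frames):
        """frames: (T, 3, H, W) consecutive video frames -> (T, 1, H, W) segmentation maps."""
        hidden = [None] * len(self.encoders)
        outputs = []
        for frame in frames:
            x = frame.unsqueeze(0)
            skips = []
            for i, enc in enumerate(self.encoders):
                x = enc(x)
                if hidden[i] is None:
                    hidden[i] = torch.zeros_like(x)
                hidden[i] = self.grus[i](x, hidden[i])  # keep temporal information per stage
                skips.append(hidden[i])
                x = hidden[i]
            y = skips[-1]
            for j, (up, dec) in enumerate(zip(self.ups, self.decoders)):
                y = dec(torch.cat([up(y), skips[-2 - j]], dim=1))  # feature-series with shallower stage
            outputs.append(self.out(y))
        return torch.cat(outputs, dim=0)
```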
It should be noted that, in a possible embodiment, the encoding module and the decoding module in the image segmentation model may also be implemented by using a Residual Network (ResNet), a VGG Network (Visual Geometry Group Network), a MobileNet, or another type of deep convolutional neural Network, which is not limited specifically here. The ResNet, the VGG network, the MobileNet, and the like are all neural networks commonly used in the art, and specific structures and related descriptions of these neural networks may refer to related descriptions in related technologies, and are not described herein again.
In addition, in a possible implementation manner, the image segmentation model in this embodiment may also be implemented with other models that can be used for the saliency detection task, such as U2-Net. U2-Net is a network model with a two-level nested U-shaped structure; in U2-Net, ReSidual U-blocks (RSU) are used to fuse features from receptive fields of different sizes, so that more contextual information at different scales can be captured, and the pooling operations used inside the RSU blocks allow the depth of the whole architecture to be increased without significantly increasing the computational cost.
In this embodiment, with the image segmentation model training method including the foregoing steps 110 to 130, after the non-avatar image sample and non-avatar segmentation label, the avatar image sample and avatar segmentation label, and the general image sample and saliency segmentation label are obtained, the initial image segmentation model is first trained with the non-avatar image sample to obtain the first image segmentation model, and in this process the parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label, so that the first image segmentation model can quickly adapt to the data distribution and its image segmentation capability is initially established. The first image segmentation model is then trained with the avatar image sample and the general image sample to obtain the target image segmentation model; when training with the avatar image sample, the parameters of the first image segmentation model are corrected according to the avatar segmentation label, and when training with the general image sample, the parameters are corrected according to the saliency segmentation label, so that the target image segmentation model can expand its segmentation object from the non-avatar to the avatar and the image segmentation accuracy for avatar images is improved. In addition, since the first image segmentation model is trained with the general image sample and the saliency segmentation label, the target image segmentation model gains the capability of saliency detection. Because saliency detection identifies the most visually distinct region in a picture and segments it according to its edge, that is, it distinguishes the image foreground (i.e., the avatar) from the image background by analyzing the structure of the picture, it is insensitive to the particular appearance of the avatar; therefore, image segmentation of various avatar images can be realized without using a large number of avatar image samples for model training, the avatar images can be segmented more accurately, and the training efficiency of the target image segmentation model is improved.
In a possible embodiment, the non-avatar image samples may include non-avatar dynamic image samples with different resolutions. In this case, when the initial image segmentation model is trained with the non-avatar image samples to obtain the first image segmentation model, non-avatar dynamic image samples with different time sequence lengths at different resolutions may be obtained first, and the non-avatar dynamic image samples of each time sequence length at each resolution may then be used to train the initial image segmentation model to obtain the first image segmentation model. For example, assuming that the non-avatar image samples include high-resolution and low-resolution non-avatar dynamic image samples, when acquiring non-avatar dynamic image samples with different time sequence lengths at different resolutions, several kinds of samples may be taken at random, such as 15 consecutive low-resolution frames, 6 consecutive high-resolution frames, or 40 consecutive low-resolution frames, and the initial image segmentation model is then trained with these samples, so that the trained first image segmentation model can adapt to images with different resolutions and to long and short time sequence information, and thus better segments images of various sizes and time sequence lengths (e.g., videos of various sizes and durations).
In a possible embodiment, when the initial image segmentation model is trained by using the non-avatar dynamic image samples of each time sequence length at each resolution to obtain the first image segmentation model, model iteration training may be performed on the initial image segmentation model by using the non-avatar dynamic image samples of different time sequence lengths at the same resolution to obtain the second image segmentation model, and then training is performed on the second image segmentation model by using the non-avatar dynamic image samples of different time sequence lengths at different resolutions to obtain the first image segmentation model. The model iterative training of the initial image segmentation model by using the non-avatar dynamic image samples with different time sequence lengths under the same resolution ratio means that the non-avatar dynamic image samples with a certain time sequence length are used for training the initial image segmentation model, model parameters of the initial image segmentation model are corrected until the trained initial image segmentation model is suitable for the non-avatar dynamic image samples with the time sequence length, and then the non-avatar dynamic image samples with another time sequence length under the same resolution ratio are used for further training the trained initial image segmentation model. The method for training the second image segmentation model by using the non-avatar dynamic image samples with different time sequence lengths under different resolutions is characterized in that the non-avatar dynamic image samples with different time sequence lengths under different resolutions are used as an integral sample set, and then the integral sample set is used for training the second image segmentation model. Model iteration training is carried out on the initial image segmentation model by using the non-virtual image dynamic image samples with different time sequence lengths under the same resolution, so that the second image segmentation model obtained after training can adapt to different time sequence information, and the segmentation effect on the images with different time sequence lengths can be improved. In addition, the second image segmentation model is trained by using the non-avatar dynamic image samples with different time sequence lengths under different resolutions, so that the first image segmentation model obtained after training can adapt to images with different resolutions and long and short time sequence information, and the segmentation effect on images with various sizes and various time sequence lengths (such as videos with various sizes and various time lengths) can be better realized.
In a possible embodiment, the different timing lengths at the same resolution may include a first timing length and a second timing length, in this case, when model iterative training is performed on the initial image segmentation model by using the non-avatar dynamic image samples at different timing lengths at the same resolution to obtain the second image segmentation model, the initial image segmentation model may be trained by using the non-avatar dynamic image samples at the first timing length to obtain a third image segmentation model, and then the third image segmentation model may be trained by using the non-avatar dynamic image samples at the second timing length to obtain the second image segmentation model. The first timing length may be greater than the second timing length, or may be smaller than the second timing length, which is not limited herein. For example, in an embodiment, assuming that the non-avatar dynamic image sample with the first timing length is a continuous 15 frames of video frames with low resolution, and the non-avatar dynamic image sample with the second timing length is a continuous 50 frames of video frames with low resolution, the initial image segmentation model may be trained by using the continuous 15 frames of video frames with low resolution, so that the trained third image segmentation model can be quickly adapted to data distribution, the image segmentation capability of the third image segmentation model is initially improved, and then the third image segmentation model is trained by using the continuous 50 frames of video frames with low resolution, so that the trained second image segmentation model can adapt to long-time-series information, and a better image segmentation effect can be achieved by using the previous video frames more fully.
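A small sketch of this staged training on clips of different timing lengths is given below; it assumes in-memory video samples and reuses the hypothetical train_until_converged loop from the earlier sketch, with the clip lengths (15 and 50 frames) taken from the example above:

```python
import random

def sample_clip(frames, labels, clip_len):
    """Take `clip_len` consecutive frames (and their labels) starting at a random position."""
    start = random.randint(0, len(frames) - clip_len)
    return frames[start:start + clip_len], labels[start:start + clip_len]

def clip_batches(videos, clip_len, steps):
    """Build `steps` random clips of the requested timing length from a list of
    (frames, labels) video samples (illustrative helper)."""
    batches = []
    for _ in range(steps):
        frames, labels = random.choice(videos)
        batches.append(sample_clip(frames, labels, clip_len))
    return batches

# Stage one: short low-resolution clips so the model quickly adapts to the data distribution;
# stage two: longer clips at the same resolution so it adapts to long time-sequence information.
# third_model = train_until_converged(initial_model, clip_batches(low_res_videos, 15, steps=1000))
# second_model = train_until_converged(third_model, clip_batches(low_res_videos, 50, steps=1000))
```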
In a possible embodiment, in order to reduce the labor cost for acquiring the avatar image sample through the public image on the network and the labor cost for obtaining the avatar segmentation label through manual labeling of the avatar image sample, the avatar image sample and the avatar segmentation label may be obtained by a foreground-background synthesis method or a style migration method.
As shown in fig. 6, fig. 6 is a schematic flowchart of a foreground-background synthesis method provided in an embodiment. When the avatar image sample and the avatar segmentation label are obtained through the foreground-background synthesis method, the virtual environment image material, the avatar image material, and the transparency channel map corresponding to the avatar image material may be obtained first; the avatar image material is then image-fused with the virtual environment image material to obtain the avatar image sample, and the transparency channel map is image-fused with the virtual environment image material to obtain the avatar segmentation label. The virtual environment image material, the avatar image material and the transparency channel map can all be obtained by collecting public images on the network. It should be noted that, because the avatar image sample and the avatar segmentation label are obtained by image fusion from these materials, only a small amount of material information needs to be collected from public images on the network; the number of avatar image samples and avatar segmentation labels can then be expanded through image fusion, which reduces both the labor cost of collecting avatar image samples and the labor cost of manually labeling them to obtain avatar segmentation labels. It should also be noted that, when the avatar image material is fused with the virtual environment image material, the avatar image material can be overlaid and displayed on the virtual environment image material, thereby obtaining the avatar image sample; when the transparency channel map is fused with the virtual environment image material, the transparency channel map can be overlaid and displayed on the virtual environment image material, thereby obtaining the avatar segmentation label. The transparency channel map is a special layer for recording image transparency information; it can protect a selected area, store the selected area as a grayscale image, and the selected area can be modified by editing the transparency channel map. The transparency channel map can include 3 colors: white represents a selectable region belonging to an opaque solid-color region; black represents an unselectable region containing no pixel information; gray represents a partially selectable region, i.e., a feathered region.
In a possible implementation manner, when the avatar image material and the virtual environment image material are image-fused to obtain the avatar image sample, at least one of geometric transformation, color transformation or random noise addition may first be performed on the avatar image material to obtain a plurality of target image materials, and each target image material is then image-fused with the virtual environment image material to obtain a plurality of avatar image samples. Likewise, when the transparency channel map and the virtual environment image material are image-fused to obtain the avatar segmentation labels, at least one of geometric transformation, color transformation or random noise addition may first be performed on the transparency channel map to obtain a plurality of target channel maps, where the target channel maps correspond one-to-one with the target image materials, and each target channel map is then image-fused with the virtual environment image material to obtain a plurality of avatar segmentation labels corresponding to the avatar image samples. When there is one virtual environment image material, the avatar image samples obtained after image fusion share the same environment background but have different avatar foregrounds, and the avatar segmentation labels obtained after image fusion share the same environment background but have different foreground segmentation labels; when there are multiple virtual environment image materials, the avatar image samples obtained after image fusion can have different environment backgrounds and different avatar foregrounds, and the avatar segmentation labels can have different environment backgrounds and different foreground segmentation labels. That is, a large number of target image materials can be obtained by applying data enhancement (at least one of geometric transformation, color transformation or random noise addition) to a small amount of avatar image material, and these target image materials are then image-fused with various virtual environment image materials, so that a large number of avatar image samples covering different forms, painting styles and environment backgrounds can be obtained quickly, which greatly reduces the labor cost of collecting avatar image samples from public images on the network. In addition, since the segmentation label of an avatar image material can be obtained from its transparency channel map, the avatar segmentation labels corresponding to the avatar image samples can be obtained quickly by applying the same data enhancement to the transparency channel map as to the corresponding avatar image material and then image-fusing the enhanced channel map with the corresponding virtual environment image materials, without additional manual labeling, which greatly reduces the workload and time for constructing the data set.
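A minimal sketch of this synthesis and augmentation, assuming OpenCV and NumPy and an avatar material already resized to the background size; here the segmentation label is simply the (binarized) transparency channel placed over the background, and the rotation angle, scale and noise level are illustrative parameters:

```python
import cv2
import numpy as np

def augment(material_bgr, alpha, angle=15.0, scale=1.1, noise_std=5.0):
    """Apply the same geometric transform to the avatar material and its transparency
    channel map so they stay aligned, then add random noise to the material only."""
    h, w = material_bgr.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    material_bgr = cv2.warpAffine(material_bgr, m, (w, h))
    alpha = cv2.warpAffine(alpha, m, (w, h))
    noisy = material_bgr.astype(np.float32) + np.random.normal(0, noise_std, material_bgr.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8), alpha

def compose(material_bgr, alpha, background_bgr):
    """Overlay the avatar material on a virtual-environment background to obtain an
    avatar image sample, and use the transparency channel as the segmentation label."""
    a = (alpha.astype(np.float32) / 255.0)[..., None]
    sample = (a * material_bgr + (1 - a) * background_bgr).astype(np.uint8)
    label = (alpha > 127).astype(np.uint8)   # binary avatar segmentation label
    return sample, label
```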
As shown in fig. 7, fig. 7 is a flowchart illustrating a style migration method according to an embodiment. When the avatar image sample and the avatar segmentation label are obtained by the style migration method, the non-avatar image material and the non-avatar material segmentation label corresponding to the non-avatar image material can be obtained first, and then avatar stylization is performed on the non-avatar image material to obtain the avatar image sample; at this point the non-avatar material segmentation label can be used directly as the avatar segmentation label. The non-avatar image material and the non-avatar material segmentation label can be obtained from different data sets provided by the related art, such as a COCO data set or a PASCAL-VOC data set, which is not specifically limited here. In addition, the non-avatar image material can also be obtained by collecting public images on the network; in this case the non-avatar material segmentation label can be obtained by manually labeling the non-avatar image material, or by performing image segmentation on the non-avatar image material with a conventional semantic segmentation model, which is not limited here. When avatar stylization is performed on the non-avatar image material to obtain the avatar image sample, different methods such as an image analogy method, an image filtering method or a machine learning method can be used, which is not specifically limited here. The image analogy method mainly learns a mapping relation between a pair of source and target images, and then obtains the stylized image according to this mapping relation in a supervised learning manner. The image filtering method mainly renders a given image with certain combinations of image filters (such as a bilateral filter, a Gaussian filter, etc.), so that the given image is stylized into an avatar style. The machine learning method mainly uses a trained neural network model to perform avatar stylization on the non-avatar image material, and the output of the neural network model is the avatar image sample corresponding to the non-avatar image material. When the neural network model is trained, the model extracts the content features and the style features of the image samples, recombines the content features and the style features to generate an avatar image, computes an image difference value between the generated avatar image and the target image serving as label information, corrects the model parameters according to this difference value, and repeatedly reconstructs the avatar image until the image difference value between the generated avatar image and the target image meets a preset threshold requirement.
Since the non-avatar image material and the non-avatar material segmentation labels can be easily obtained from different data sets such as COCO data sets or PASCAL-VOC data sets provided by related technologies, after a large number of non-avatar image materials and non-avatar material segmentation labels are obtained, a large number of avatar image samples and avatar segmentation labels can be obtained by performing avatar stylization without additional manual labeling, so that the workload and time for constructing the data set can be greatly reduced.
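As one concrete instance of the machine-learning style-transfer approach mentioned above, the following sketch uses the classical optimization-based formulation (matching VGG content features and Gram-matrix style statistics); the layer choice, loss weights and iteration count are assumptions, and in practice pretrained ImageNet weights and an avatar-style reference image would be used:

```python
import torch
import torch.nn.functional as F
from torchvision import models

def gram(feat):
    """Gram matrix of a feature map, used as a style statistic."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_transfer(content_img, style_img, steps=300, content_w=1.0, style_w=1e3):
    """Optimize a generated image so its content features match the non-avatar material
    and its style (Gram) features match an avatar-style reference.
    content_img, style_img: (1, 3, H, W) float tensors in [0, 1]."""
    vgg = models.vgg16().features.eval()       # pretrained ImageNet weights would be loaded in practice
    for p in vgg.parameters():
        p.requires_grad_(False)
    layers = {3, 8, 15, 22}                    # ReLU outputs used as feature layers (illustrative choice)

    def features(x):
        feats = []
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layers:
                feats.append(x)
        return feats

    gen = content_img.clone().requires_grad_(True)
    opt = torch.optim.Adam([gen], lr=0.01)
    c_feats = features(content_img)
    s_grams = [gram(f) for f in features(style_img)]
    for _ in range(steps):
        g_feats = features(gen)
        c_loss = F.mse_loss(g_feats[-1], c_feats[-1])                            # content difference
        s_loss = sum(F.mse_loss(gram(f), g) for f, g in zip(g_feats, s_grams))   # style difference
        loss = content_w * c_loss + style_w * s_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gen.detach()
```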
Referring to fig. 8, fig. 8 is a flowchart of an image processing method according to an embodiment of the present invention, where the image processing method may be executed by a terminal or a server, or may be executed by both the terminal and the server. Referring to fig. 8, the image processing method includes, but is not limited to, steps 810 to 830.
Step 810: and acquiring an image to be processed.
In this step, the image to be processed may be a static image (such as a picture) or a dynamic image (such as a video), which is not limited here. For example, suppose a user enables the image anti-blocking function while browsing pictures uploaded by other users through social media software on a terminal, in order to prevent bullet screen information from blocking the pictures; the terminal may then request the pictures subjected to image anti-blocking processing from the server, and the server may first obtain the pictures from the database according to the request, so that the subsequent steps can obtain the corresponding segmentation images from the pictures and implement image anti-blocking processing on them. For another example, suppose a user enables the image anti-blocking function while watching a network video through a video playing platform on a terminal, in order to prevent bullet screen information from blocking the video; the terminal may then request the video subjected to image anti-blocking processing from the server, and the server may first obtain the video from the database according to the request, so that the subsequent steps can obtain the corresponding segmentation images from the video and implement image anti-blocking processing on it.
Step 820: inputting an image to be processed into a target image segmentation model for image segmentation to obtain a first segmentation image, wherein the target image segmentation model is obtained by training through an image segmentation model training method.
In this step, since the image to be processed is obtained in step 810 and the target image segmentation model has been trained by the foregoing image segmentation model training method, the image to be processed may be input to the target image segmentation model for image segmentation to obtain a first segmentation image, so that the subsequent steps can perform image anti-blocking processing on the image to be processed by using the first segmentation image. It should be noted that the first segmentation image obtained by the target image segmentation model is the segmentation image corresponding to the region of interest in the image to be processed. For example, if the region of interest in the image to be processed is a non-avatar region (e.g., a portrait region), the first segmentation image is the corresponding non-avatar segmentation image; if the region of interest is an avatar region (e.g., a cartoon character region), the first segmentation image is the corresponding avatar segmentation image; and if the region of interest is a plant region, the first segmentation image is the corresponding plant segmentation image. Because the target image segmentation model was trained with the general image sample and the saliency segmentation label, it has the capability of saliency detection, and saliency detection is insensitive to the particular appearance of the avatar, so image segmentation of avatar images can be realized more accurately; thus, when the region of interest in the image to be processed is an avatar region, an accurate first segmentation image can be obtained for the subsequent anti-blocking processing.
Step 830: and performing image anti-blocking processing by using the first segmentation image.
In this step, since the first segmentation image corresponding to the image to be processed is obtained in step 820, the image to be processed may be subjected to image anti-blocking processing using the first segmentation image. The image anti-blocking processing performed with the first segmentation image may cover several different kinds of content; taking the case where the region of interest in the image to be processed is an avatar region as an example, the image anti-blocking processing may be performed to prevent the avatar in the image to be processed from being blocked by bullet screen information, by gift images, or by notification information, which is not specifically limited here.
In a possible implementation manner, when the first segmentation image is used for image anti-blocking processing, gaussian blurring may be performed on the first segmentation image to obtain a second segmentation image, binarization is performed on the second segmentation image according to a preset threshold to obtain a binarized image, connected domain detection is performed on the binarized image to obtain a connected domain in the binarized image, after the connected domain in the binarized image is obtained, a mask image is obtained according to the connected domain, and then image anti-blocking processing is performed on the image to be processed according to the mask image. When the first segmentation image is subjected to Gaussian blur, the first segmentation image can be subjected to Gaussian blur through a Gaussian convolution kernel, the edge of the first segmentation image is subjected to smoothing processing, after the first segmentation image is subjected to Gaussian blur, the first segmentation image subjected to Gaussian blur is subjected to binarization through a preset threshold value, so that the pixel value of the first segmentation image subjected to binarization is 0 or 1, wherein the pixel value of 0 represents a segmentation background, and the pixel value of 1 represents a segmentation foreground. It should be noted that, in the first segmented image output by the image segmentation model, the value of each pixel is between 0 and 1, and after the first segmented image is subjected to gaussian blurring and binarization, the value of the pixel of the binarized first segmented image may be 0 or 1, so as to more accurately segment the image foreground (i.e., the avatar) and the image background.
It should be noted that, in some cases, the values at some pixel positions in the image foreground may be close or equal to those at some pixel positions in the image background, so that after binarization and connected domain detection, hole points may appear within the obtained connected domain and affect the accuracy of the subsequently obtained mask image. Hole filling is therefore performed on the connected domain to remove the hole points within its range, so that the whole connected domain is complete. In addition, when the filled image is vectorized, a picture vectorization tool such as Potrace can be used to vectorize the filled image and obtain the vectorized mask image, so that the mask image can be used to perform image anti-blocking processing on the image to be processed in the subsequent steps. Potrace is a common tool capable of converting a pixel bitmap into a vector graphic, and can generate a vector graphic in the corresponding SVG format from the data of the pixel bitmap. SVG (Scalable Vector Graphics) is an interactive and dynamic graphic format, and SVG images can be enlarged arbitrarily without losing image quality.
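The post-processing chain (Gaussian blur, binarization, connected domain detection, and hole filling) can be sketched as follows, assuming OpenCV and NumPy; the kernel size, binarization threshold and the minimum-area filter for discarding tiny spurious regions are illustrative assumptions, and the final vectorization step (e.g., with Potrace) is not included:

```python
import cv2
import numpy as np

def postprocess(seg_map, blur_ksize=11, threshold=0.5, min_area=100):
    """Gaussian blur -> binarize -> connected-domain detection -> hole filling.
    seg_map is the single-channel model output with values in [0, 1]."""
    blurred = cv2.GaussianBlur(seg_map.astype(np.float32), (blur_ksize, blur_ksize), 0)
    binary = (blurred > threshold).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    mask = np.zeros_like(binary)
    for i in range(1, num):                                # component 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:         # illustrative filter for tiny regions
            mask[labels == i] = 1
    # Hole filling: background pixels not reachable from the image border are holes
    # inside a connected domain, so they are set to foreground.
    inv = 1 - mask
    _, bg_labels = cv2.connectedComponents(inv, connectivity=4)
    border_labels = set(bg_labels[0, :]) | set(bg_labels[-1, :]) | set(bg_labels[:, 0]) | set(bg_labels[:, -1])
    holes = (inv == 1) & ~np.isin(bg_labels, list(border_labels))
    mask[holes] = 1
    return mask   # vectorization (e.g., with Potrace) would follow to obtain the final mask image
```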
It should be noted that the mask image is a layer used to control transparency: in the mask image, a position with a pixel value of 1 means that the transparency at that position is 0%, and a position with a pixel value of 0 means that the transparency at that position is 100%; that is, content is displayed at positions where the mask pixel value is 1 and not displayed where it is 0. Therefore, when performing image anti-blocking processing on the image to be processed according to the mask image, the mask image and the image to be processed can be fused and rendered such that the pixel value of the mask image is 0 at positions corresponding to the region of interest (e.g., the avatar region) in the image to be processed and 1 at positions outside the region of interest. In this way, content such as bullet screen information, gift images or notification information is not displayed within the region of interest (e.g., the avatar region) of the image to be processed, and is only displayed at positions outside that region. Visually, such content can still move across the picture or video: when bullet screen information, gift images or notification information passes over the avatar in the picture or video, it is not displayed, and it is displayed again only when it moves outside the avatar, thereby realizing the image anti-blocking processing of the image to be processed.
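A small rendering sketch of this fusion, assuming NumPy arrays; here avatar_mask is the foreground mask from the post-processing sketch above (1 inside the avatar region), i.e., the complement of the mask convention described in the text, and danmaku_alpha is a hypothetical per-pixel opacity of the bullet-screen layer:

```python
import numpy as np

def render_anti_blocking(frame_bgr, danmaku_bgr, danmaku_alpha, avatar_mask):
    """Overlay the bullet-screen layer on the frame, but hide it wherever the
    avatar mask marks foreground, so the avatar is never blocked."""
    show = (danmaku_alpha * (1.0 - avatar_mask))[..., None]   # danmaku visible only outside the avatar
    return (show * danmaku_bgr + (1.0 - show) * frame_bgr).astype(np.uint8)
```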
The principles of the image segmentation model training method and the image processing method provided by the embodiments of the present invention are further explained below with some specific examples.
Referring to fig. 9, fig. 9 is an overall flowchart of an image segmentation model training method and a mask image acquisition method according to an embodiment of the present invention, where the overall flow specifically includes, but is not limited to, the following steps 910 to 940.
Step 910: and constructing a training sample set for training the image segmentation model.
In this step, the training sample set for training the image segmentation model may include a non-avatar image sample and a non-avatar segmentation label that correspond to each other, an avatar image sample and an avatar segmentation label that correspond to each other, and a general image sample and a saliency segmentation label that correspond to each other. When the image segmentation model is trained by adopting a non-virtual image sample, the non-virtual image segmentation label can be used as label information for correcting model parameters in the model training process; when the image segmentation model is trained by adopting the virtual image sample, the virtual image segmentation label can be used as label information for correcting model parameters in the model training process; when the image segmentation model is trained by adopting the general image sample, the saliency segmentation labels can be used as label information for correcting model parameters in the model training process.
In this step, the non-avatar image sample and the non-avatar segmentation label may be obtained from different data sets such as a COCO data set or a PASCAL-VOC data set provided by the related art, which is not specifically limited herein. The avatar image sample and the avatar segmentation label may be obtained by the foregoing foreground-background synthesis method or style migration method, and are not particularly limited herein. The universal image sample can be obtained from different data sets such as a COCO data set or a PASCAL-VOC data set, and can also be obtained by acquiring public images on a network, and the universal image sample is not specifically limited herein; the saliency segmentation labels can be obtained by saliency detection of a general image sample by using a conventional saliency detection model.
Step 920: and constructing an image segmentation model based on deep learning.
In this step, an image segmentation model having a model structure as shown in fig. 5 may be constructed, so that after the training of the image segmentation model is completed in the subsequent step, the image segmentation model may be used to perform accurate image segmentation on the image to be processed.
Step 930: and performing model training of four stages on the image segmentation model by using a training sample set to obtain the trained image segmentation model, namely the target image segmentation model.
In this step, because the training sample set is constructed in step 910 and the image segmentation model is constructed in step 920, the image segmentation model may be subjected to model training in four stages by using the training sample set, so as to obtain a trained target image segmentation model.
A specific procedure of performing four-stage model training on the image segmentation model by using the training sample set can be as shown in fig. 10. In fig. 10, four stages of model training are performed on the image segmentation model by using the training sample set, which may specifically include, but is not limited to, the following steps 1010 to 1040.
Step 1010: and training the image segmentation model by using the low-resolution non-avatar image sample of the first time sequence length and the non-avatar segmentation label.
In the step, the image segmentation model is trained by using the low-resolution non-virtual image sample with the first time sequence length and the non-virtual image segmentation label, so that the image segmentation model can quickly adapt to data distribution, and a task of segmenting the non-virtual image sample by the image segmentation model is initially realized. For example, assuming that the non-avatar image sample is a video sample including a human character, in this step, 15 frames of low-resolution video frames may be continuously taken each time to train the image segmentation model, so that the image segmentation model can initially implement a task of segmenting the human character in the video frames.
Step 1020: and training the image segmentation model by using the low-resolution non-avatar image sample with the second time sequence length and the non-avatar segmentation label.
In the step, the image segmentation model is trained by using the low-resolution non-avatar image sample with the second time sequence length and the non-avatar segmentation label, so that the image segmentation model can adapt to the long time sequence information, and the better image segmentation effect can be realized by more fully using the previous image sample. For example, if the non-avatar image sample is a video sample including a real character, 50 frames of low-resolution video frames may be continuously taken each time to train the image segmentation model in this step, so as to improve the adaptability of the image segmentation model to long-time sequence information, thereby improving the image segmentation capability of the image segmentation model to the real character in the video frames.
Step 1030: and training the image segmentation model by using the low-resolution non-avatar image sample and the non-avatar segmentation label of the third time sequence length and the high-resolution non-avatar image sample and the non-avatar segmentation label of the fourth time sequence length.
In the step, the image segmentation model is trained by using the low-resolution non-avatar image sample and the non-avatar segmentation label of the third time sequence length and the high-resolution non-avatar image sample and the non-avatar segmentation label of the fourth time sequence length, so that the image segmentation model can adapt to high-resolution images, low-resolution images, long time sequence information and short time sequence information, and the segmentation effect of the non-avatar image samples of various sizes and various durations can be better realized. For example, if the non-avatar image sample is a video sample including a human character, in this step, 40 frames of low-resolution video frames and 6 frames of high-resolution video frames may be taken at random and continuously at a time to train the image segmentation model, so as to improve the adaptability of the image segmentation model to high-resolution and low-resolution pictures and long-short time sequence information, thereby better achieving the effect of segmenting the human character in videos of various sizes and durations.
Step 1040: and training the image segmentation model by using the avatar image sample and the avatar segmentation label, the general image sample and the saliency segmentation label.
In this step, the avatar image sample and the general image sample may both be picture samples, and the avatar image sample and the avatar segmentation label, the general image sample and the saliency segmentation label are used to train the image segmentation model, so that the segmentation object of the target image segmentation model obtained after training can be extended from a non-avatar to an avatar, thereby realizing the training of the target image segmentation model based on saliency detection. For example, if the avatar image sample is a picture sample including a cartoon character, and the general image sample is a picture sample, in this step, 1 frame of low-resolution avatar image sample and 1 frame of low-resolution general image sample may be taken each time to train the image segmentation model, and the segmentation object of the target image segmentation model obtained after training is extended from a real human character to a cartoon character, thereby implementing training of the target image segmentation model capable of supporting significance detection on the cartoon character.
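The four stages can be summarized schematically as follows; the frame counts and the resolution split are the illustrative values used in the examples above, and each stage would reuse the hypothetical training-loop and clip-sampling sketches given earlier:

```python
# Schematic of the four-stage training schedule (illustrative values from the examples above).
TRAINING_STAGES = [
    {"stage": 1, "samples": "non-avatar video, low resolution",        "clip_len": 15,
     "labels": "non-avatar segmentation labels"},
    {"stage": 2, "samples": "non-avatar video, low resolution",        "clip_len": 50,
     "labels": "non-avatar segmentation labels"},
    {"stage": 3, "samples": "non-avatar video, low + high resolution", "clip_len": "40 low-res / 6 high-res",
     "labels": "non-avatar segmentation labels"},
    {"stage": 4, "samples": "avatar pictures + general pictures",      "clip_len": 1,
     "labels": "avatar + saliency segmentation labels"},
]
```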
Through the steps 1010 to 1040, the trained target image segmentation model can support accurate segmentation of the virtual image, so that image anti-blocking processing can be performed on the image to be processed by using the model output result of the target image segmentation model in the subsequent steps.
Step 940: and performing image segmentation on the image to be processed by using the trained target image segmentation model to obtain a model output result, performing data post-processing on the model output result, and outputting a mask image corresponding to the image to be processed.
In this step, since the trained target image segmentation model is obtained in step 930, the trained target image segmentation model may be used to perform image segmentation on the image to be processed to obtain a model output result, and then the model output result is subjected to data post-processing to output a mask image corresponding to the image to be processed, so that the mask image may be used to perform image anti-blocking processing on the image to be processed in the subsequent step.
A specific flow of performing data post-processing on the output result of the model may be as shown in fig. 11. In fig. 11, the post-processing of the data on the output result of the model may specifically include, but is not limited to, the following steps 1110 to 1150.
Step 1110: and performing Gaussian blur on the output result of the model.
In this step, the output result of the model is a single-channel image with the resolution consistent with that of the image to be processed, and the value of each pixel is between 0 and 1. In order to perform data post-processing on the model output result to obtain an accurate mask image, in this step, the gaussian blur may be performed on the model output result by gaussian convolution kernel, and the edge of the model output result may be smoothed to provide a data basis for the subsequent binarization processing.
Step 1120: and carrying out binarization on the output result of the model after Gaussian blur.
In this step, after the gaussian blurring of the model output result is completed, the model output result after the gaussian blurring may be binarized through a preset threshold value, so that a pixel value of the model output result after the binarization is 0 or 1, where the pixel value of 0 represents a segmentation background, and the pixel value of 1 represents a segmentation foreground. It should be noted that, in the model output result output by the image segmentation model, the value of each pixel is between 0 and 1, and after the model output result is subjected to gaussian blurring and binarization, the value of the pixel of the binarized model output result may be 0 or 1, so as to more accurately segment the image foreground (i.e., the avatar) and the image background.
Step 1130: and carrying out connected domain detection on the output result of the binarized model to obtain a connected domain.
In this step, after the binarization of the model output result is completed, connected domain detection may be performed on the binarized model output result to obtain a connected domain, so that an accurate mask image may be obtained based on the connected domain in the subsequent step.
Step 1140: and filling holes in the connected domain to obtain a filled image.
In this step, in some cases, the values of some pixel positions in the image foreground may be close to or equal to those of some pixel positions in the image background, and after binarization and connected domain detection are performed, void points may appear in the obtained connected domain range to affect the accuracy of a subsequently obtained mask image, so that a filled image may be obtained by filling the connected domain with voids first, for example, the pixel value of the void point in the connected domain may be set to 1, and the void points existing in the connected domain range are removed, so that the whole connected domain is complete.
Step 1150: and vectorizing the filled image to obtain a mask image.
In this step, because the filled image is obtained in step 1140, the filled image may be vectorized to obtain a mask image, so that the mask image may be used to perform image anti-blocking processing on the image to be processed in the subsequent step. For example, a picture vectorization tool such as Potrace may be used to vectorize the filled image, so as to obtain the vectorized mask image. As shown in fig. 12 (a) and 12 (b), in fig. 12 (a), the image to be processed includes 3 cartoon characters, so the obtained mask image corresponds to the 3 cartoon characters; specifically, the mask image includes an image foreground and an image background, the range of the image foreground is consistent with the range of the 3 cartoon characters, and the parts of the mask image other than the image foreground all belong to the image background. In fig. 12 (b), the image to be processed includes 1 cartoon character, so the obtained mask image corresponds to that cartoon character, that is, the range of the image foreground in the mask image corresponds to the range of the 1 cartoon character.
In addition, after the mask image is obtained, the mask image and the image to be processed may be subjected to fusion rendering, so that a pixel value of the mask image at a position corresponding to the avatar region in the image to be processed is 0, and a pixel value of the mask image at a position corresponding to the image to be processed other than the avatar region is 1, and thus, the contents such as the bullet screen type information, the gift type image, or the notification type information are not displayed in the range of the avatar region in the image to be processed, and the contents such as the bullet screen type information, the gift type image, or the notification type information are only displayed at the position in the image to be processed other than the avatar region. For example, as shown in fig. 13 (a) and 13 (b), in fig. 13 (a), the barrage type information is only displayed in the image background and is not displayed in a superimposed manner in 3 cartoon characters, and similarly, in fig. 13 (b), the barrage type information is also only displayed in the image background and is not displayed in a superimposed manner in the cartoon characters.
By adopting the image segmentation model training method and the whole flow steps of the mask image acquisition method provided by the specific example, the image segmentation accuracy of the virtual image can be effectively improved, so that a more accurate mask image can be obtained, and the mask image is favorable for performing image anti-shielding treatment on the image to be treated. As shown in fig. 14 (a), 14 (b), 14 (c) and 14 (d), fig. 14 (a) is a mask image obtained by using the image segmentation model training method and the overall flow of the mask image acquisition method provided in the present specific example, fig. 14 (b) is a mask image obtained by using the semantic segmentation method in the related art, and in fig. 14 (a) and 14 (b), the solid line region part is the image foreground and the dotted line region part is the image background, so that, as can be seen from fig. 14 (a) and 14 (b), the mask image obtained by using the semantic segmentation method in the related art cannot be matched with all cartoon characters, whereas the mask image obtained by using the image segmentation model training method and the overall flow of the mask image acquisition method provided in the present specific example can be accurately matched with all cartoon characters. Fig. 14 (c) is another mask image obtained by using the image segmentation model training method and the overall flow of the mask image acquisition method provided in this specific example, and fig. 14 (d) is another mask image obtained by using the semantic segmentation based method in the related art, and in fig. 14 (c) and fig. 14 (d), the solid line region part is the image foreground and the dotted line region part is the image background, so it can be known from fig. 14 (c) and fig. 14 (d) that the cartoon character cannot be segmented from the mask image obtained by using the semantic segmentation based method in the related art, whereas the cartoon character can be accurately segmented from the mask image obtained by using the image segmentation model training method and the overall flow of the mask image acquisition method provided in this specific example.
The following is a practical example to illustrate the application scenario of the embodiment of the present invention.
It should be noted that the image processing method provided in the embodiment of the present invention may be applied to different application scenes, such as a scene for watching a video or a scene for watching a live broadcast, and the following description takes the scene for watching a video and the scene for watching a live broadcast as examples.
Scene one
The image processing method provided by the embodiment of the invention can be applied to a scene of watching a video, and particularly, when a user watches a network video through a video playing platform by using a terminal such as a smart phone or a vehicle-mounted terminal, the user finds that bullet screen information can shield a target object (such as an avatar such as a cartoon character) in the video, so that the watching experience is influenced, when the user starts an image anti-shielding function in the video playing platform, the device such as the smart phone or the vehicle-mounted terminal sends a request for acquiring the video frame image subjected to the image anti-shielding processing to a server, when the server receives the request, the server firstly inputs the video frame image to be played to a trained image segmentation model for image segmentation, so as to obtain a first segmentation image corresponding to the target object (such as the avatar such as the cartoon character) in the video frame image, performing Gaussian blur on the first divided image to obtain a second divided image, performing binarization on the second divided image according to a preset threshold value to obtain a binarized image, performing connected domain detection on the binarized image to obtain a connected domain in the binarized image, performing void filling on the connected domain to obtain a filled image after the connected domain in the binarized image is obtained, performing vectorization on the filled image to obtain a mask image, performing image anti-blocking processing on a target object (such as a virtual image such as a cartoon character) in a video frame image to be played according to the mask image by the server after the mask image is obtained, then transmitting the video frame image subjected to image anti-blocking processing to the terminal, and displaying the video frame image subjected to image anti-blocking processing after the terminal receives the video frame image subjected to image anti-blocking processing, at this time, in the process of watching the network video, the user finds that the barrage information does not block a target object (e.g., an avatar such as a cartoon character) in the network video. The method comprises the steps that in the process of training an image segmentation model, after a non-virtual image sample, a non-virtual image segmentation label, an virtual image sample, a virtual image segmentation label and a common image sample and a saliency segmentation label are obtained, the non-virtual image sample is used for training the initial image segmentation model to obtain a first image segmentation model, in the process of training the initial image segmentation model by using the non-virtual image sample, parameters of the initial image segmentation model are corrected according to the non-virtual image segmentation label, then the virtual image sample and the common image sample are used for training the first image segmentation model to obtain a target image segmentation model, in the process of training the first image segmentation model by using the virtual image sample, parameters of the first image segmentation model are corrected according to the virtual image segmentation label, and in the process of training the first image segmentation model by using the common image sample, parameters of the first image segmentation model are corrected according to the saliency segmentation label.
Scene two
The image processing method provided by the embodiment of the invention can also be applied to a scene of watching a live broadcast. Specifically, when a user watches a live video on a social media live broadcast platform through a terminal such as a smart phone or a vehicle-mounted terminal, the user may find that bullet-screen comments or gift images sent by other users block a target object (for example, an avatar such as a cartoon character) in the live video, which degrades the viewing experience. When the user enables the image anti-occlusion function of the social media live broadcast platform, the terminal such as the smart phone or the vehicle-mounted terminal sends the server a request for video frame images that have undergone image anti-occlusion processing. On receiving the request, the server first inputs the video frame image to be played into the trained image segmentation model for image segmentation to obtain a first segmented image corresponding to the target object (for example, the avatar such as the cartoon character) in the video frame image, performs Gaussian blur on the first segmented image to obtain a second segmented image, binarizes the second segmented image according to a preset threshold to obtain a binarized image, performs connected domain detection on the binarized image to obtain a connected domain in the binarized image, performs hole filling on the connected domain to obtain a filled image, and vectorizes the filled image to obtain a mask image. After the mask image is obtained, the server performs image anti-occlusion processing on the target object (for example, the avatar such as the cartoon character) in the video frame image to be played according to the mask image, and then sends the processed video frame image to the terminal. After receiving it, the terminal displays the video frame image subjected to image anti-occlusion processing; at this point, while watching the live video, the user finds that the bullet-screen comments and gift images no longer block the target object (for example, the avatar such as the cartoon character) in the live video.
In the process of training the image segmentation model, after a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, and a general image sample and a saliency segmentation label are obtained, the initial image segmentation model is first trained with the non-avatar image sample to obtain a first image segmentation model, and during this training the parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label. The first image segmentation model is then trained with the avatar image sample and the general image sample to obtain a target image segmentation model; when the first image segmentation model is trained with the avatar image sample, its parameters are corrected according to the avatar segmentation label, and when it is trained with the general image sample, its parameters are corrected according to the saliency segmentation label.
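For illustration only, the staged training procedure described above can be sketched as follows. PyTorch is assumed here purely as an example (the embodiment does not prescribe a framework), and the network, datasets, and loss are synthetic placeholders; only the ordering of the stages and the labels used to correct the parameters in each stage follow the description above.

```python
# Minimal sketch of the two-stage training schedule (PyTorch assumed; the model,
# data, and loss are illustrative placeholders, not taken from the embodiment).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_synthetic_set(n, size=64):
    """Stand-in for a real dataset: images (N, 3, H, W) with binary masks (N, 1, H, W)."""
    images = torch.rand(n, 3, size, size)
    masks = (torch.rand(n, 1, size, size) > 0.5).float()
    return TensorDataset(images, masks)

# Placeholder segmentation network; any encoder-decoder could be substituted.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_on(loader, epochs=1):
    """One training stage: correct the parameters against the labels paired with the samples."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

# Stage 1: non-avatar image samples, corrected by non-avatar segmentation labels
# -> "first image segmentation model".
train_on(DataLoader(make_synthetic_set(32), batch_size=8))

# Stage 2: avatar image samples (avatar segmentation labels) and general image
# samples (saliency segmentation labels) -> "target image segmentation model".
train_on(DataLoader(make_synthetic_set(32), batch_size=8))   # avatar samples
train_on(DataLoader(make_synthetic_set(32), batch_size=8))   # general samples
```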
It is to be understood that, although the steps in the flowcharts described above are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated in the embodiment, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
Referring to fig. 15, an embodiment of the present invention further discloses an image segmentation model training apparatus 1500, where the image segmentation model training apparatus 1500 is capable of implementing the image segmentation model training method according to the foregoing embodiment, and the image segmentation model training apparatus 1500 includes:
a sample acquiring unit 1510 for acquiring a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, a general image sample and a saliency segmentation label;
a first training unit 1520, configured to train the initial image segmentation model using the non-avatar image sample, resulting in a first image segmentation model, wherein parameters of the initial image segmentation model are modified according to the non-avatar segmentation labels during the training of the initial image segmentation model using the non-avatar image sample;
a second training unit 1530 for training the first image segmentation model by using the avatar image sample and the general image sample to obtain a target image segmentation model, wherein when the first image segmentation model is trained by using the avatar image sample, parameters of the first image segmentation model are corrected according to the avatar segmentation labels; and when the first image segmentation model is trained by using the general image sample, correcting parameters of the first image segmentation model according to the saliency segmentation labels.
In one embodiment, the non-avatar image samples include non-avatar dynamic image samples of different resolutions; the first training unit 1520 is further configured to:
acquiring non-avatar dynamic image samples with different time sequence lengths under different resolutions;
and training the initial image segmentation model by using the non-avatar dynamic image samples of each time sequence length under each resolution to obtain a first image segmentation model.
In an embodiment, the first training unit 1520 is further configured to:
carrying out model iterative training on the initial image segmentation model by using non-avatar dynamic image samples with different time sequence lengths under the same resolution to obtain a second image segmentation model;
and training the second image segmentation model by using the non-avatar dynamic image samples with different time sequence lengths under different resolutions to obtain a first image segmentation model.
In one embodiment, the different time sequence lengths under the same resolution include a first time sequence length and a second time sequence length; the first training unit 1520 is further configured to:
training the initial image segmentation model by using the non-avatar dynamic image sample with the first time sequence length to obtain a third image segmentation model;
and training the third image segmentation model by using the non-avatar dynamic image sample with the second time sequence length to obtain a second image segmentation model.
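The progressive schedule above, first the different clip lengths at one resolution and then at the remaining resolutions, amounts to a fixed ordering of training rounds. The sketch below only illustrates that ordering; the resolutions, clip lengths, and the train_round callback are assumed placeholders rather than values from the embodiment.

```python
# Illustrative ordering of the progressive training rounds described above; the
# concrete resolutions, clip lengths, and the train_round callback are placeholders.
from typing import Callable, List, Tuple

def progressive_schedule(
    base_resolution: Tuple[int, int],
    clip_lengths: List[int],                  # e.g. [first_length, second_length]
    other_resolutions: List[Tuple[int, int]],
    train_round: Callable[[Tuple[int, int], int], None],
) -> None:
    # Rounds at one (base) resolution, the first clip length then the second:
    # initial model -> third model -> second model.
    for length in clip_lengths:
        train_round(base_resolution, length)
    # Then the different clip lengths at the remaining resolutions:
    # second model -> first model.
    for resolution in other_resolutions:
        for length in clip_lengths:
            train_round(resolution, length)

if __name__ == "__main__":
    progressive_schedule(
        base_resolution=(288, 512),
        clip_lengths=[4, 8],
        other_resolutions=[(360, 640), (720, 1280)],
        train_round=lambda res, length: print(f"train round at {res}, clip length {length}"),
    )
```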
In an embodiment, the image segmentation model training apparatus 1500 further includes:
a first acquisition unit, configured to acquire a virtual environment image material, an avatar image material, and a transparency channel map corresponding to the avatar image material;
a first fusion unit, configured to perform image fusion on the avatar image material and the virtual environment image material to obtain an avatar image sample;
and a second fusion unit, configured to perform image fusion on the transparency channel map and the virtual environment image material to obtain the avatar segmentation label.
In an embodiment, the first fusion unit is further configured to:
performing at least one of geometric transformation, color transformation or random noise addition on the avatar image material to obtain a plurality of target image materials;
and carrying out image fusion on each target image material and the virtual environment image material to obtain a plurality of avatar image samples.
In an embodiment, the second fusion unit is further configured to:
performing at least one of geometric transformation, color transformation or random noise addition on the transparency channel map to obtain a plurality of target channel images, wherein the target channel images correspond to the target image materials one to one;
and carrying out image fusion on each target channel image and the virtual environment image material to obtain a plurality of avatar segmentation labels corresponding to the avatar image samples.
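A minimal sketch of this sample-synthesis step is given below, assuming NumPy; the array shapes, the 0.5 binarization of the fused transparency channel, and the horizontal flip (standing in for the geometric, color, and noise transformations above) are illustrative assumptions rather than the specific implementation of the embodiment.

```python
# Sketch of avatar sample / label synthesis via alpha compositing (NumPy assumed;
# names, shapes, and the flip augmentation are illustrative).
from typing import Tuple
import numpy as np

def composite(avatar_rgb: np.ndarray,      # (H, W, 3) avatar image material, floats in [0, 1]
              alpha: np.ndarray,           # (H, W) transparency channel map, floats in [0, 1]
              background_rgb: np.ndarray   # (H, W, 3) virtual environment image material
              ) -> Tuple[np.ndarray, np.ndarray]:
    """Fuse the avatar onto the background; the fused transparency channel gives the label."""
    a = alpha[..., None]
    sample = a * avatar_rgb + (1.0 - a) * background_rgb     # avatar image sample
    label = (alpha > 0.5).astype(np.float32)                 # avatar segmentation label
    return sample, label

def augment_pair(avatar_rgb: np.ndarray, alpha: np.ndarray, flip: bool):
    """Apply the same transformation to the material and its channel map (one-to-one pairing)."""
    if flip:
        return avatar_rgb[:, ::-1], alpha[:, ::-1]
    return avatar_rgb, alpha

if __name__ == "__main__":
    h, w = 128, 128
    avatar = np.random.rand(h, w, 3)
    alpha = (np.random.rand(h, w) > 0.7).astype(np.float32)
    background = np.random.rand(h, w, 3)
    pairs = [composite(*augment_pair(avatar, alpha, flip), background) for flip in (False, True)]
    print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```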
In an embodiment, the image segmentation model training apparatus 1500 further includes:
a second acquisition unit, configured to acquire a non-avatar image material and a non-avatar material segmentation label corresponding to the non-avatar image material;
an avatar stylization unit, configured to perform avatar stylization on the non-avatar image material to obtain an avatar image sample;
a label determination unit for using the non-avatar material segmentation label as an avatar segmentation label.
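For this alternative, the bookkeeping can be sketched as below; stylize_to_avatar is a placeholder (a simple color posterization standing in for whatever avatar-stylization model is actually used), and the essential point is only that the existing non-avatar material segmentation label is reused unchanged as the avatar segmentation label.

```python
# Sketch of deriving avatar samples by stylizing non-avatar material and reusing
# its segmentation label; stylize_to_avatar is an assumed placeholder model.
import numpy as np

def stylize_to_avatar(image: np.ndarray) -> np.ndarray:
    """Placeholder stylization: posterize colors to mimic a cartoon-like look."""
    return np.round(image * 4.0) / 4.0

def make_avatar_pair(non_avatar_image: np.ndarray, non_avatar_label: np.ndarray):
    """The stylized image becomes the avatar image sample; since stylization does not
    move pixels, the non-avatar material segmentation label is reused as the avatar label."""
    return stylize_to_avatar(non_avatar_image), non_avatar_label

if __name__ == "__main__":
    image = np.random.rand(64, 64, 3)
    label = (np.random.rand(64, 64) > 0.5).astype(np.float32)
    avatar_sample, avatar_label = make_avatar_pair(image, label)
    print(avatar_sample.shape, np.array_equal(avatar_label, label))
```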
It should be noted that, since the image segmentation model training apparatus 1500 of the present embodiment can implement the image segmentation model training method of the foregoing embodiment, the apparatus and the method share the same technical principle and the same beneficial effects, which are not repeated here to avoid redundancy.
Referring to fig. 16, an embodiment of the present invention further discloses an image processing apparatus, the image processing apparatus 1600 being capable of implementing the image processing method according to the foregoing embodiment, the image processing apparatus 1600 including:
an image acquisition unit 1610 configured to acquire an image to be processed;
the image segmentation unit 1620 is configured to input the image to be processed to the target image segmentation model for image segmentation, so as to obtain a first segmented image;
an image anti-blocking unit 1630 configured to perform image anti-blocking processing using the first segmented image;
the target image segmentation model is trained by the image segmentation model training apparatus 1500 as described above.
In an embodiment, the image anti-blocking unit 1630 is further configured to:
carrying out Gaussian blur on the first segmentation image to obtain a second segmentation image;
binarizing the second segmentation image according to a preset threshold value to obtain a binarized image;
carrying out connected domain detection on the binarized image to obtain a connected domain in the binarized image;
obtaining a mask image according to the connected domain;
and performing image anti-blocking treatment on the image to be processed according to the mask image.
In an embodiment, the image anti-blocking unit 1630 is further configured to:
filling holes in the connected domain to obtain a filled image;
and vectorizing the filled image to obtain a mask image.
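These post-processing steps map onto standard image operations; a minimal sketch using OpenCV and NumPy (assumed here) is shown below. The threshold, kernel size, largest-component choice, and the way the bullet-screen layer is composited behind the avatar are illustrative assumptions, and the vectorization step is approximated by extracting polygon contours from the filled image.

```python
# Minimal sketch of the chain: Gaussian blur -> binarization -> connected-domain
# detection -> hole filling -> (polygonal) vectorization -> mask-based anti-occlusion.
# OpenCV/NumPy assumed; threshold, kernel size, and compositing are illustrative.
import cv2
import numpy as np

def build_mask(first_seg: np.ndarray, threshold: int = 128) -> np.ndarray:
    """first_seg: single-channel uint8 segmentation map (0-255) of the avatar."""
    second_seg = cv2.GaussianBlur(first_seg, (5, 5), 0)                        # second segmented image
    _, binary = cv2.threshold(second_seg, threshold, 255, cv2.THRESH_BINARY)   # binarized image
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)           # connected domains
    if num > 1:  # keep the largest non-background connected domain (illustrative choice)
        largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
        binary = np.where(labels == largest, 255, 0).astype(np.uint8)
    # Hole filling approximated by morphological closing -> filled image.
    filled = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))
    # Vectorization approximated by polygon contours, then rasterized as the mask image.
    contours, _ = cv2.findContours(filled, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(filled)
    cv2.fillPoly(mask, [cv2.approxPolyDP(c, 2.0, True) for c in contours], 255)
    return mask

def compose_anti_occlusion(frame: np.ndarray, barrage: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep the bullet-screen layer only where the avatar mask is empty (assumed compositing)."""
    out = barrage.copy()
    out[mask > 0] = frame[mask > 0]
    return out

if __name__ == "__main__":
    frame = np.random.randint(0, 256, (240, 320, 3), np.uint8)
    seg = np.zeros((240, 320), np.uint8)
    cv2.circle(seg, (160, 120), 60, 255, -1)   # fake avatar segmentation for the demo
    barrage = frame.copy()
    cv2.putText(barrage, "danmaku", (40, 120), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
    result = compose_anti_occlusion(frame, barrage, build_mask(seg))
    print(result.shape)
```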
It should be noted that, since the image processing apparatus 1600 of the present embodiment can implement the image processing method of the foregoing embodiment, the apparatus and the method share the same technical principle and the same beneficial effects, which are not repeated here to avoid redundancy.
The above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Referring to fig. 17, an embodiment of the present invention further discloses an electronic device, where the electronic device 1700 includes:
at least one processor 1701;
at least one memory 1702 for storing at least one program;
when executed by the at least one processor 1701, the at least one program implements an image segmentation model training method as previously described, or implements an image processing method as previously described.
The embodiment of the present invention also discloses a computer-readable storage medium in which a computer program executable by a processor is stored; when executed by the processor, the computer program implements the image segmentation model training method described above or the image processing method described above.
Embodiments of the present invention also disclose a computer program product, which includes a computer program or computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer program or computer instructions from the computer-readable storage medium and executes them, so that the electronic device performs the image segmentation model training method described above or the image processing method described above.
The terms "first," "second," "third," "fourth," and the like in the description of the invention and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It is to be understood that, in the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be single or plural.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present invention that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The step numbers in the above method embodiments are set for convenience of illustration only, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

Claims (15)

1. An image segmentation model training method is characterized by comprising the following steps:
acquiring a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, and a general image sample and a saliency segmentation label;
training an initial image segmentation model by using the non-avatar image sample to obtain a first image segmentation model, wherein in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation label;
training the first image segmentation model by using the avatar image sample and the general image sample to obtain a target image segmentation model, wherein when the first image segmentation model is trained by using the avatar image sample, parameters of the first image segmentation model are corrected according to the avatar segmentation label; and when the first image segmentation model is trained by utilizing the general image sample, correcting parameters of the first image segmentation model according to the saliency segmentation labels.
2. The image segmentation model training method according to claim 1, wherein the non-avatar image samples include non-avatar dynamic image samples of different resolutions;
the training of the initial image segmentation model by using the non-virtual image sample to obtain a first image segmentation model comprises the following steps:
acquiring the non-avatar dynamic image samples with different time sequence lengths under different resolutions;
and training the initial image segmentation model by using the non-avatar dynamic image samples of each time sequence length at each resolution to obtain a first image segmentation model.
3. The method according to claim 2, wherein the training the initial image segmentation model with the non-avatar dynamic image samples of each time sequence length at each resolution to obtain a first image segmentation model comprises:
performing model iterative training on the initial image segmentation model by using the non-avatar dynamic image samples with different time sequence lengths under the same resolution to obtain a second image segmentation model;
and training the second image segmentation model by using the non-avatar dynamic image samples with different time sequence lengths under different resolutions to obtain a first image segmentation model.
4. The image segmentation model training method according to claim 3, wherein the different time sequence lengths under the same resolution include a first time sequence length and a second time sequence length;
performing model iterative training on the initial image segmentation model by using the non-avatar dynamic image samples with different time sequence lengths under the same resolution to obtain a second image segmentation model, wherein the model iterative training comprises the following steps:
training the initial image segmentation model by using the non-avatar dynamic image sample with the first time sequence length to obtain a third image segmentation model;
and training the third image segmentation model by using the non-avatar dynamic image sample with the second time sequence length to obtain a second image segmentation model.
5. The image segmentation model training method according to claim 1, wherein the avatar image sample and the avatar segmentation label are obtained by:
acquiring a virtual environment image material, an avatar image material and a transparency channel map corresponding to the avatar image material;
carrying out image fusion on the avatar image material and the virtual environment image material to obtain the avatar image sample;
and carrying out image fusion on the transparency channel map and the virtual environment image material to obtain the avatar segmentation label.
6. The method for training an image segmentation model according to claim 5, wherein the image fusion of the avatar image material and the virtual environment image material to obtain the avatar image sample comprises:
performing at least one of geometric transformation, color transformation or random noise addition on the avatar image material to obtain a plurality of target image materials;
and carrying out image fusion on each target image material and the virtual environment image material to obtain a plurality of avatar image samples.
7. The method for training the image segmentation model according to claim 6, wherein the image fusion of the transparency channel map and the virtual environment image material to obtain the avatar segmentation label comprises:
performing at least one of geometric transformation, color transformation or random noise addition on the transparency channel map to obtain a plurality of target channel images, wherein the target channel images correspond to the target image materials one to one;
and carrying out image fusion on each target channel image and the virtual environment image material to obtain a plurality of avatar segmentation labels corresponding to the avatar image samples.
8. The image segmentation model training method according to claim 1, wherein the avatar image sample and the avatar segmentation label are obtained by:
acquiring a non-avatar image material and a non-avatar material segmentation label corresponding to the non-avatar image material;
performing avatar stylization on the non-avatar image material to obtain an avatar image sample;
and taking the non-avatar material segmentation label as the avatar segmentation label.
9. An image processing method, characterized by comprising the steps of:
acquiring an image to be processed;
inputting the image to be processed into a target image segmentation model for image segmentation to obtain a first segmentation image;
carrying out image anti-shielding treatment by using the first segmentation image;
wherein the target image segmentation model is obtained by training through the image segmentation model training method according to any one of claims 1 to 8.
10. The image processing method according to claim 9, wherein the carrying out image anti-shielding treatment by using the first segmentation image comprises:
performing Gaussian blur on the first segmentation image to obtain a second segmentation image;
carrying out binarization on the second segmentation image according to a preset threshold value to obtain a binarized image;
carrying out connected domain detection on the binarized image to obtain a connected domain in the binarized image;
obtaining a mask image according to the connected domain;
and carrying out image anti-shielding treatment on the image to be processed according to the mask image.
11. The image processing method according to claim 10, wherein the obtaining a mask image according to the connected component comprises:
filling holes in the connected domain to obtain a filled image;
and vectorizing the filled image to obtain a mask image.
12. An image segmentation model training device, comprising:
a sample acquisition unit, configured to acquire a non-avatar image sample and a non-avatar segmentation label, an avatar image sample and an avatar segmentation label, and a general image sample and a saliency segmentation label;
the first training unit is used for training an initial image segmentation model by using the non-avatar image sample to obtain a first image segmentation model, wherein in the process of training the initial image segmentation model by using the non-avatar image sample, parameters of the initial image segmentation model are corrected according to the non-avatar segmentation labels;
the second training unit is used for training the first image segmentation model by using the avatar image sample and the general image sample to obtain a target image segmentation model, wherein when the first image segmentation model is trained by using the avatar image sample, parameters of the first image segmentation model are corrected according to the avatar segmentation label; and when the first image segmentation model is trained by utilizing the general image sample, correcting parameters of the first image segmentation model according to the saliency segmentation labels.
13. An image processing apparatus characterized by comprising:
the image acquisition unit is used for acquiring an image to be processed;
the image segmentation unit is used for inputting the image to be processed into a target image segmentation model for image segmentation to obtain a first segmentation image;
the image anti-blocking unit is used for carrying out image anti-blocking processing by utilizing the first segmentation image;
wherein the target image segmentation model is trained by the image segmentation model training device according to claim 12.
14. An electronic device, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, implements the image segmentation model training method according to any one of claims 1 to 8 or the image processing method according to any one of claims 9 to 11.
15. A computer-readable storage medium, in which a computer program executable by a processor is stored, the computer program executable by the processor being adapted to implement the image segmentation model training method according to any one of claims 1 to 8 or to implement the image processing method according to any one of claims 9 to 11 when the computer program is executed by the processor.
CN202211111509.4A 2022-09-13 2022-09-13 Image segmentation model training method, image processing device and storage medium Active CN115249306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211111509.4A CN115249306B (en) 2022-09-13 2022-09-13 Image segmentation model training method, image processing device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211111509.4A CN115249306B (en) 2022-09-13 2022-09-13 Image segmentation model training method, image processing device and storage medium

Publications (2)

Publication Number Publication Date
CN115249306A true CN115249306A (en) 2022-10-28
CN115249306B CN115249306B (en) 2022-12-02

Family

ID=83700329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211111509.4A Active CN115249306B (en) 2022-09-13 2022-09-13 Image segmentation model training method, image processing device and storage medium

Country Status (1)

Country Link
CN (1) CN115249306B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115775024A (en) * 2022-12-09 2023-03-10 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN115937626A (en) * 2022-11-17 2023-04-07 郑州轻工业大学 Automatic generation method of semi-virtual data set based on instance segmentation
CN115953559A (en) * 2023-01-09 2023-04-11 支付宝(杭州)信息技术有限公司 Virtual object processing method and device
CN116664873A (en) * 2023-07-27 2023-08-29 腾讯科技(深圳)有限公司 Image information processing method, device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558901A (en) * 2018-11-16 2019-04-02 北京市商汤科技开发有限公司 A kind of semantic segmentation training method and device, electronic equipment, storage medium
US20190244107A1 (en) * 2018-02-06 2019-08-08 Hrl Laboratories, Llc Domain adaption learning system
CN110363201A (en) * 2019-07-10 2019-10-22 上海交通大学 Weakly supervised semantic segmentation method and system based on Cooperative Study
CN110812845A (en) * 2019-10-31 2020-02-21 腾讯科技(深圳)有限公司 Plug-in detection method, plug-in recognition model training method and related device
US20200160114A1 (en) * 2017-07-25 2020-05-21 Cloudminds (Shenzhen) Robotics Systems Co., Ltd. Method for generating training data, image semantic segmentation method and electronic device
CN112862840A (en) * 2021-03-04 2021-05-28 腾讯科技(深圳)有限公司 Image segmentation method, apparatus, device and medium
CN113971727A (en) * 2021-10-21 2022-01-25 京东鲲鹏(江苏)科技有限公司 Training method, device, equipment and medium of semantic segmentation model
CN114612658A (en) * 2022-02-24 2022-06-10 南京工业大学 Image semantic segmentation method based on dual-class-level confrontation network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160114A1 (en) * 2017-07-25 2020-05-21 Cloudminds (Shenzhen) Robotics Systems Co., Ltd. Method for generating training data, image semantic segmentation method and electronic device
US20190244107A1 (en) * 2018-02-06 2019-08-08 Hrl Laboratories, Llc Domain adaption learning system
CN109558901A (en) * 2018-11-16 2019-04-02 北京市商汤科技开发有限公司 A kind of semantic segmentation training method and device, electronic equipment, storage medium
CN110363201A (en) * 2019-07-10 2019-10-22 上海交通大学 Weakly supervised semantic segmentation method and system based on Cooperative Study
CN110812845A (en) * 2019-10-31 2020-02-21 腾讯科技(深圳)有限公司 Plug-in detection method, plug-in recognition model training method and related device
CN112862840A (en) * 2021-03-04 2021-05-28 腾讯科技(深圳)有限公司 Image segmentation method, apparatus, device and medium
CN113971727A (en) * 2021-10-21 2022-01-25 京东鲲鹏(江苏)科技有限公司 Training method, device, equipment and medium of semantic segmentation model
CN114612658A (en) * 2022-02-24 2022-06-10 南京工业大学 Image semantic segmentation method based on dual-class-level confrontation network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
RUOQI SUN et al.: "Not All Areas Are Equal: Transfer Learning for Semantic Segmentation via Hierarchical Region Selection", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
SUNGMIN CHA et al.: "SSUL: Semantic Segmentation with Unknown Label for Exemplar-based Class-Incremental Learning", 《ARXIV》 *
SUVASH SHARMA et al.: "Semantic Segmentation with Transfer Learning for Off-Road Autonomous Driving", 《SENSORS》 *
江岩 et al.: "Color Image Segmentation Using a Multi-Level Object Semantic Framework", 《电视技术》 (Video Engineering) *
韩铮 et al.: "Weakly Supervised Image Semantic Segmentation Based on Texton Forests and Saliency Priors", 《电子与信息学报》 (Journal of Electronics & Information Technology) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937626A (en) * 2022-11-17 2023-04-07 郑州轻工业大学 Automatic generation method of semi-virtual data set based on instance segmentation
CN115937626B (en) * 2022-11-17 2023-08-08 郑州轻工业大学 Automatic generation method of paravirtual data set based on instance segmentation
CN115775024A (en) * 2022-12-09 2023-03-10 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN115775024B (en) * 2022-12-09 2024-04-16 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN115953559A (en) * 2023-01-09 2023-04-11 支付宝(杭州)信息技术有限公司 Virtual object processing method and device
CN115953559B (en) * 2023-01-09 2024-04-12 支付宝(杭州)信息技术有限公司 Virtual object processing method and device
CN116664873A (en) * 2023-07-27 2023-08-29 腾讯科技(深圳)有限公司 Image information processing method, device and storage medium
CN116664873B (en) * 2023-07-27 2024-04-26 腾讯科技(深圳)有限公司 Image information processing method, device and storage medium

Also Published As

Publication number Publication date
CN115249306B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN115249306B (en) Image segmentation model training method, image processing device and storage medium
US10614574B2 (en) Generating image segmentation data using a multi-branch neural network
Castillo Camacho et al. A comprehensive review of deep-learning-based methods for image forensics
CN111754396A (en) Face image processing method and device, computer equipment and storage medium
CN111078940B (en) Image processing method, device, computer storage medium and electronic equipment
Mirzaei et al. Laterf: Label and text driven object radiance fields
CN103946865B (en) Method and apparatus for contributing to the text in detection image
CN112101344B (en) Video text tracking method and device
CN110781980A (en) Training method of target detection model, target detection method and device
Paulin et al. Review and analysis of synthetic dataset generation methods and techniques for application in computer vision
Gangan et al. Distinguishing natural and computer generated images using Multi-Colorspace fused EfficientNet
Jin et al. Vehicle license plate recognition for fog‐haze environments
Huang et al. DS-UNet: A dual streams UNet for refined image forgery localization
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
Qin et al. Face inpainting network for large missing regions based on weighted facial similarity
CN117636131A (en) Yolo-I model-based small target identification method and related device
CN113570615A (en) Image processing method based on deep learning, electronic equipment and storage medium
CN113705301A (en) Image processing method and device
CN115760886B (en) Land parcel dividing method and device based on unmanned aerial vehicle aerial view and related equipment
CN116798041A (en) Image recognition method and device and electronic equipment
Veeravasarapu et al. Model-driven simulations for computer vision
CN114283087A (en) Image denoising method and related equipment
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN114511702A (en) Remote sensing image segmentation method and system based on multi-scale weighted attention
CN116415019A (en) Virtual reality VR image recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant