CN114723760B - Portrait segmentation model training method and device and portrait segmentation method and device - Google Patents

Publication number
CN114723760B
Authority
CN
China
Prior art keywords
image
feature map
model
module
network module
Prior art date
Legal status
Active
Application number
CN202210543469.4A
Other languages
Chinese (zh)
Other versions
CN114723760A (en)
Inventor
胡志伟
高原
白锦峰
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202210543469.4A priority Critical patent/CN114723760B/en
Publication of CN114723760A publication Critical patent/CN114723760A/en
Application granted
Publication of CN114723760B publication Critical patent/CN114723760B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Abstract

The present disclosure provides a training method and apparatus for a portrait segmentation model, and a portrait segmentation method and apparatus, wherein the training method for the portrait segmentation model comprises: acquiring a training sample set, wherein the training sample set comprises a sample image and label information of each pixel point in the sample image; inputting a sample image into a model to be trained to obtain a binary mask image corresponding to the sample image, wherein an encoder network of the model to be trained comprises a plurality of lightweight network modules and a Transformer conversion network module; calculating a loss function value of the model to be trained according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map; and updating network parameters of an encoder network and a decoder network in the model to be trained according to the loss function value to carry out iterative training until the loss function value of the model to be trained is less than or equal to a preset value, thereby obtaining the portrait segmentation model. The scheme can realize real-time and accurate portrait segmentation of the mobile terminal, and achieves a good portrait segmentation effect.

Description

Portrait segmentation model training method and device and portrait segmentation method and device
Technical Field
The disclosure relates to the technical field of deep learning, in particular to a training method and device of a portrait segmentation model and a portrait segmentation method and device.
Background
The portrait segmentation technology is a technology for separating a portrait from the background of an original image. With the continuous popularization of mobile terminal intelligent equipment, the portrait segmentation technology has wide application in a plurality of scenes such as mobile terminal live broadcast, real-time communication, interactive entertainment and the like.
With the rapid development of deep learning algorithms and the accelerated real-world deployment of artificial-intelligence technologies in recent years, researchers have proposed a number of deep learning segmentation algorithms that can meet real-time requirements, and these have been successfully deployed in fields related to mobile-terminal portrait segmentation.
However, although existing mobile-terminal portrait segmentation algorithms can basically meet the real-time requirement, it is difficult for them to also meet the accuracy requirement of portrait segmentation.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, embodiments of the present disclosure provide a method and an apparatus for training a portrait segmentation model, and a portrait segmentation method and an apparatus.
According to an aspect of the present disclosure, there is provided a training method of a human image segmentation model, including:
acquiring a training sample set, wherein the training sample set comprises a sample image and label information of each pixel point in the sample image;
inputting the sample image into a model to be trained, extracting a coding feature map of the sample image through an encoder network of the model to be trained, and decoding the coding feature map through a decoder network of the model to be trained to obtain a binary mask map corresponding to the sample image, wherein the encoder network comprises a plurality of lightweight network modules and a Transformer conversion network module;
calculating a loss function value of the model to be trained according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map;
and updating network parameters of the encoder network and the decoder network in the model to be trained according to the loss function value to carry out iterative training until the loss function value of the model to be trained is less than or equal to a preset value, thereby obtaining the portrait segmentation model.
According to another aspect of the present disclosure, there is provided a human image segmentation method based on a human image segmentation model, the human image segmentation model being trained by the training method of the human image segmentation model according to the previous aspect, the method including:
acquiring an image to be detected;
inputting the image to be detected into the portrait segmentation model to obtain a binary mask map of the image to be detected;
and segmenting a portrait image from the image to be detected according to the image to be detected and the binarized mask map, for example as sketched below.
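For illustration only, the following is a minimal sketch of how such a trained model might be applied at inference time, assuming a PyTorch model that outputs a single-channel portrait logit map; the function and variable names are hypothetical and not part of the claimed method.

```python
import torch
import numpy as np

def segment_portrait(model, image_bgr):
    """Hypothetical inference helper: image_bgr is an HxWx3 uint8 array."""
    x = torch.from_numpy(image_bgr).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        prob = torch.sigmoid(model(x))                 # (1, 1, H, W) portrait probability
    mask = (prob[0, 0] > 0.5).numpy().astype(np.uint8) * 255   # binarized mask: 0 = background, 255 = portrait
    portrait = image_bgr * (mask[..., None] // 255)            # keep portrait pixels, zero out background
    return mask, portrait
```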
According to another aspect of the present disclosure, there is provided a training apparatus for a human image segmentation model, including:
the sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a sample image and label information of each pixel point in the sample image;
the prediction result obtaining module is used for inputting the sample image into a model to be trained, extracting a coding feature map of the sample image through an encoder network of the model to be trained, and decoding the coding feature map through a decoder network of the model to be trained to obtain a binary mask map corresponding to the sample image, wherein the encoder network comprises a plurality of lightweight network modules and a Transformer conversion network module;
the calculation module is used for calculating a loss function value of the model to be trained according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map;
and the parameter updating module is used for updating the network parameters of the encoder network and the decoder network in the model to be trained according to the loss function value to carry out iterative training until the loss function value of the model to be trained is less than or equal to a preset value, so as to obtain the portrait segmentation model.
According to another aspect of the present disclosure, there is provided a human image segmentation apparatus based on a human image segmentation model, the human image segmentation model being trained by the training method of the human image segmentation model according to the previous aspect, the apparatus including:
the image acquisition module is used for acquiring an image to be detected;
the input module is used for inputting the image to be detected into the portrait segmentation model so as to obtain a binary mask image of the image to be detected;
and the processing module is used for segmenting a portrait image from the image to be detected according to the image to be detected and the binary mask image.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method of training a portrait segmentation model according to the one aspect described above or to perform the method of portrait segmentation based on a portrait segmentation model according to the other aspect described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to execute the method of training a portrait segmentation model according to the one aspect or the method of portrait segmentation based on the portrait segmentation model according to the other aspect.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the method for training a person image segmentation model according to the aforementioned one aspect or implements the method for person image segmentation based on a person image segmentation model according to the aforementioned another aspect.
According to one or more technical solutions provided in the embodiments of the present disclosure, the features of an image are extracted through an encoder network comprising a plurality of lightweight network modules and a Transformer conversion network module. This preserves both the speed of feature extraction and the global information of the image, so that the binarized mask map determined from the feature map extracted by the encoder network yields a real-time, high-quality portrait mask, and the trained portrait segmentation model can meet the real-time and high-precision requirements of portrait segmentation on a mobile terminal.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a flow diagram of a training method of a human image segmentation model according to an example embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a network structure of a model to be trained in an exemplary embodiment of the present disclosure;
FIG. 3 shows a flow chart of a method of training a human image segmentation model according to another exemplary embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating a network structure of a convergence module in a decoder network according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a flowchart of a method of training a human image segmentation model according to yet another example embodiment of the present disclosure;
FIG. 6 shows a flowchart of a portrait segmentation method based on a portrait segmentation model according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a schematic block diagram of a training apparatus for a human image segmentation model according to an exemplary embodiment of the present disclosure;
FIG. 8 shows a schematic block diagram of a person segmentation apparatus based on a person segmentation model according to an exemplary embodiment of the present disclosure;
FIG. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The following describes training methods and apparatuses of a portrait segmentation model and portrait segmentation methods and apparatuses provided by the present disclosure with reference to the drawings.
With the continuous popularization of mobile terminal intelligent equipment, the portrait segmentation technology has wide application in a plurality of scenes such as mobile terminal live broadcast, real-time communication, interactive entertainment and the like. In order to guarantee the real-time performance and accuracy of the portrait segmentation result, higher requirements are put forward on the performance of the portrait segmentation technology, however, the current portrait segmentation algorithm of the mobile terminal is difficult to meet the requirements of real-time performance and high precision at the same time.
In order to solve the above problems, the present disclosure provides a training method for a portrait segmentation model. In the portrait segmentation model structure, the encoder network includes a plurality of lightweight network modules and a Transformer conversion network module: the lightweight network modules can satisfy the deployment and real-time requirements of a mobile terminal, while the Transformer conversion network module can acquire global information of the image. The encoder network therefore preserves both the speed of feature extraction and the global information of the image, so that the binarized mask map determined from the feature map extracted by the encoder network is a real-time, high-quality portrait mask, and the trained portrait segmentation model can meet the real-time and high-precision requirements of portrait segmentation on a mobile terminal. As a result, a portrait mask of an image to be detected can be obtained using the pre-trained portrait segmentation model, and a portrait image can be separated from the image to be detected according to the portrait mask, realizing real-time and accurate portrait segmentation on the mobile terminal and achieving a good portrait segmentation effect.
Fig. 1 is a flowchart illustrating a method for training a portrait segmentation model according to an exemplary embodiment of the present disclosure, where the method may be performed by a device for training a portrait segmentation model, where the device may be implemented by software and/or hardware, and may be generally integrated in an electronic device, and the electronic device may be a terminal device such as a computer, and may also be a server. As shown in fig. 1, the training method of the portrait segmentation model includes:
step 101, a training sample set is obtained, wherein the training sample set comprises a sample image and label information of each pixel point in the sample image.
In the embodiment of the disclosure, a plurality of sample images containing a portrait may be obtained from data sets published on the Internet or collected offline. The obtained sample images are then annotated: each pixel point in a sample image is labeled with label information indicating portrait or background, which may be represented by "1" and "0" respectively, where a pixel point labeled "1" belongs to the portrait and a pixel point labeled "0" belongs to the background, thereby obtaining the training sample set (a sketch of such a sample set appears after the following examples). It can be understood that the portrait contained in a sample image may be a half-body portrait or a whole-body portrait, and sample images may be collected according to actual needs for model training to obtain the portrait segmentation model.
For example, when a portrait segmentation model that can recognize a portrait of a half body is obtained by training, sample data including the portrait of the half body may be acquired to form a training sample set.
Illustratively, open-source data sets published on the web can be cleaned, and low-resolution data removed, to obtain the training sample set. For example, high-resolution half-body portrait images may be extracted from the open-source data set AI Segment to form the training sample set.
Exemplarily, unlabeled half-body portrait images meeting the requirements can be collected and predicted with a high-precision open-source segmentation algorithm; since the prediction result contains label information for each pixel point, high-quality annotation data can be obtained through screening, thereby constructing the training sample set. An image that meets the requirements may be, for example, an image with high definition in which the half-body portrait occupies a large proportion of the area.
Illustratively, high-precision sample data can be obtained by adopting a manual labeling mode aiming at a half-length portrait scene, and a training sample set is further obtained.
Exemplarily, the portrait foreground image and the background image may be subjected to data synthesis to obtain artificial image data, and then the artificial image data is labeled to obtain a training sample set.
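As an illustration of the data layout described above (per-pixel labels where "1" denotes portrait and "0" denotes background), the following is a hedged sketch of a possible training-sample container in PyTorch; the class name, file layout, and the 127 threshold on the stored label images are assumptions, not part of the disclosure.

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class PortraitDataset(Dataset):
    """Hypothetical sample set: each image has a per-pixel label map (1 = portrait, 0 = background)."""
    def __init__(self, image_paths, label_paths):
        self.image_paths = image_paths
        self.label_paths = label_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = np.array(Image.open(self.image_paths[idx]).convert("RGB"), dtype=np.float32) / 255.0
        label = (np.array(Image.open(self.label_paths[idx]).convert("L")) > 127).astype(np.float32)
        image = torch.from_numpy(image).permute(2, 0, 1)   # (3, H, W)
        label = torch.from_numpy(label)                    # (H, W), values 0 or 1
        return image, label
```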
In an optional implementation manner of the present disclosure, the training sample set obtained in the above examples may be further preprocessed, and the sample images in the training sample set may be augmented to obtain more diverse data and improve the robustness of the trained model. Data augmentation strategies include, but are not limited to, random cropping, random horizontal flipping, random rotation, and photometric distortion, as sketched below.
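A minimal sketch of such an augmentation pipeline, assuming PyTorch/torchvision tensors and applying the same geometric transform to image and label map; the specific parameter ranges (rotation angle, crop size, brightness factor) are illustrative assumptions.

```python
import random
import torchvision.transforms.functional as TF

def augment(image, label):
    """Hypothetical augmentation applying the same geometric transform to image (3,H,W) and label (H,W)."""
    if random.random() < 0.5:                              # random horizontal flip
        image, label = TF.hflip(image), TF.hflip(label)
    angle = random.uniform(-15.0, 15.0)                    # random rotation
    image = TF.rotate(image, angle)
    label = TF.rotate(label.unsqueeze(0), angle).squeeze(0)
    # random crop to 256x256 (assumes the images are at least that large)
    top = random.randint(0, max(image.shape[1] - 256, 0))
    left = random.randint(0, max(image.shape[2] - 256, 0))
    image = TF.crop(image, top, left, 256, 256)
    label = TF.crop(label, top, left, 256, 256)
    image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))  # simple photometric distortion
    return image, label
```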
102, inputting the sample image into a model to be trained, extracting a coding feature map of the sample image through an encoder network of the model to be trained, and decoding the coding feature map through a decoder network of the model to be trained to obtain a binary mask map corresponding to the sample image, wherein the encoder network comprises a plurality of lightweight network modules and a Transformer conversion network module.
In the embodiment of the disclosure, the constructed model to be trained comprises an encoder network and a decoder network, and the encoder network comprises a plurality of lightweight network modules and a Transformer conversion network module. The lightweight network modules may adopt the MobileNetV2 network structure, which improves the accuracy and speed of the whole network, while the Transformer conversion network module can extract the global information of the input image, which helps to obtain a high-quality portrait segmentation result. After the training sample set is obtained, the sample images in the training sample set may be input to the model to be trained in turn; the encoder network of the model to be trained extracts feature information of an input sample image to obtain a coding feature map, and the obtained coding feature map is then input to the decoder network for decoding to obtain a binarized mask map corresponding to the sample image.
The binarized mask map output by the model to be trained has the same size as the input sample image and contains the same number of pixel points as the sample image; the gray value of each pixel point in the binarized mask map is 0 or 255, corresponding to the background and the portrait of the sample image respectively. In the binarized mask map output by the model to be trained, each pixel point carries a corresponding prediction result, which indicates the predicted probability that the corresponding pixel point belongs to the portrait, for example as sketched below.
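As a small illustration of this output convention, the following hedged sketch converts a per-pixel portrait probability map into a binarized mask with gray values 0 and 255; the 0.5 threshold is an assumption.

```python
import torch

def to_binary_mask(prob_map, threshold=0.5):
    """prob_map: (H, W) tensor of predicted portrait probabilities in [0, 1]."""
    mask = torch.zeros_like(prob_map, dtype=torch.uint8)
    mask[prob_map > threshold] = 255   # 255 = portrait, 0 = background
    return mask
```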
103, calculating a loss function value of the model to be trained according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map.
The higher the prediction probability is, the higher the possibility that the corresponding pixel point is a pixel point on the portrait area is.
In the embodiment of the present disclosure, after the binarized mask map of the sample image output by the model to be trained is obtained, the loss function value of the model to be trained may be calculated according to the label information of each pixel point in the input sample image and the prediction result of the corresponding pixel point in the binarized mask map.
Illustratively, a preset loss function can be used to determine the loss function value of the model to be trained by calculating the difference between the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map.
It can be understood that the size of the binarized mask image is the same as that of the input sample image, and the pixel points in the two images are in one-to-one correspondence, so that the loss function value of the model to be trained can be calculated by using the difference between the corresponding pixel points in the two images.
And step 104, updating network parameters of the encoder network and the decoder network in the model to be trained according to the loss function value, and performing iterative training until the loss function value of the model to be trained is smaller than or equal to a preset value, so as to obtain the portrait segmentation model.
The preset value may be preset, for example, the preset value is set to 0.01, 0.001, and the like.
It can be understood that training the model is an iterative process: after each iteration the loss function value is calculated, and while the loss function value is greater than the preset value, the network parameters of the model are adjusted and iterative training continues, until the overall loss function value of the model is less than or equal to the preset value, or the loss no longer changes (or changes only slowly) and the model converges, at which point the trained model is obtained.
In the embodiment of the disclosure, the loss function value obtained by each calculation is compared with a preset value, if the loss function value is greater than the preset value, the network parameters of the encoder network and the decoder network in the model to be trained are updated to perform iterative training, the binarized mask map of the sample image is re-obtained based on the model to be trained after the network parameters are updated, the loss function value after the iteration is re-calculated and compared with the preset value again, and the iteration is performed until the loss function value of the model to be trained is less than or equal to the preset value, so that the trained portrait segmentation model is obtained.
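A hedged sketch of such an iterative training loop in PyTorch is shown below; the optimizer choice, learning rate, and the use of BCEWithLogitsLoss as the per-pixel loss are illustrative assumptions, and the maximum-epoch guard corresponds to the optional iteration-count threshold discussed below.

```python
import torch

def train(model, loader, preset_value=0.001, max_epochs=100, lr=1e-3):
    """Hypothetical training loop: iterate until the loss falls to the preset value or below."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.BCEWithLogitsLoss()
    for epoch in range(max_epochs):                       # iteration-count guard (assumed)
        epoch_loss = 0.0
        for image, label in loader:
            logits = model(image)                         # (N, 1, H, W) mask logits
            loss = criterion(logits.squeeze(1), label)    # compare with (N, H, W) labels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= max(len(loader), 1)
        if epoch_loss <= preset_value:                    # convergence criterion from the text
            break
    return model
```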
The training method of the portrait segmentation model of the embodiment of the disclosure acquires a training sample set comprising a sample image and label information of each pixel point in the sample image; inputs the sample image into a model to be trained, extracts a coding feature map of the sample image through an encoder network of the model to be trained, and decodes the coding feature map through a decoder network of the model to be trained to obtain a binary mask map corresponding to the sample image, wherein the encoder network comprises a plurality of lightweight network modules and a Transformer conversion network module; calculates a loss function value of the model to be trained according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map; and updates network parameters of the encoder network and the decoder network in the model to be trained according to the loss function value for iterative training, until the loss function value of the model to be trained is less than or equal to a preset value, thereby obtaining the portrait segmentation model. With this scheme, image features are extracted through an encoder network comprising a plurality of lightweight network modules and a Transformer conversion network module, which preserves both the speed of feature extraction and the global information of the image; the binarized mask map determined from the feature map extracted by the encoder network is therefore a real-time, high-quality portrait mask, so that the trained portrait segmentation model can meet the real-time and high-precision requirements of portrait segmentation on a mobile terminal, providing favorable conditions for real-time and accurate portrait segmentation on the mobile terminal.
Optionally, in the process of training the model to be trained, the training iteration number may be counted, and when the training iteration number of the model to be trained reaches the iteration number threshold, the model to be trained is considered to be converged, so as to obtain the trained portrait segmentation model. The threshold of the number of iterations may be determined according to the number of sample images in the training sample set. And when the loss function value of the model to be trained in the training process is larger than the preset value but the training iteration number reaches the iteration number threshold value, ending the training to obtain the well-trained portrait segmentation model. By setting the iteration number threshold, when the training iteration number of the model to be trained reaches the iteration number threshold, the well-trained portrait segmentation model is obtained, the model training process can be ended in time when the loss function value of the model to be trained cannot be converged, and the condition that the model training cannot be ended due to the fact that the loss function value cannot be converged is avoided.
In an optional implementation manner of the present disclosure, the encoder network of the model to be trained may be formed by serially connecting a convolutional network module, at least one group of lightweight network module groups, a Transformer conversion network module, and a lightweight network module, where the lightweight network module group includes at least one lightweight network module.
It should be noted that, when the encoder network includes multiple lightweight network module groups, the number of the lightweight network modules included in each lightweight network module group may be the same or different, and may be specifically set according to actual needs, which is not limited in this disclosure.
Therefore, in the embodiment of the present disclosure, when the coding feature map of the sample image is extracted through the encoder network of the model to be trained, the sample image may be input into the convolutional network module, the at least one group of lightweight network module groups, the Transformer conversion network module, and the first lightweight network module, which are connected in series, to obtain the coding feature map.
In an optional embodiment of the present disclosure, the encoder network of the model to be trained may include three lightweight network module groups, that is, at least one lightweight network module group includes three lightweight network module groups, which are a first lightweight network module group, a second lightweight network module group, and a third lightweight network module group, respectively, so that, when the encoding feature map of the sample image is extracted through the encoder network of the model to be trained, the sample image may be input into the convolutional network module, the first lightweight network module group, the second lightweight network module group, the third lightweight network module group, the Transformer conversion network module, and the first lightweight network module, which are connected in series, to obtain the encoding feature map.
In the embodiment of the disclosure, the encoder network is designed to include three lightweight network module groups, each group is used for extracting different features, and gradually transits from shallow features to deep features, so that feature expression content is enriched.
Further, in an optional implementation manner of the present disclosure, the first group of lightweight network module groups includes a second lightweight network module and a third lightweight network module, the second group of lightweight network module groups includes a fourth lightweight network module, a fifth lightweight network module and a sixth lightweight network module, and the third group of lightweight network module groups includes a seventh lightweight network module and an eighth lightweight network module.
The step length of each lightweight network module can be set according to actual requirements, which is not limited by the disclosure.
Therefore, in the embodiment of the present disclosure, the encoder network of the model to be trained may be formed by serially connecting a convolutional network module, a second lightweight network module, a third lightweight network module, a fourth lightweight network module, a fifth lightweight network module, a sixth lightweight network module, a seventh lightweight network module, an eighth lightweight network module, a Transformer conversion network module, and a first lightweight network module, where the output of each network module serves as the input of the next network module. Thus, when the coding feature map of the sample image is extracted by the encoder network of the model to be trained, the sample image can be input into these serially connected modules in the above order: each module takes the output of the preceding module as its input, and the first lightweight network module finally outputs the coding feature map.
Exemplarily, fig. 2 shows a schematic diagram of the network structure of a model to be trained in an exemplary embodiment of the present disclosure. By training the model shown in fig. 2 with the training method of the portrait segmentation model provided in the embodiments of the present disclosure, a portrait segmentation model can be obtained that can be deployed on a mobile terminal and meets the requirements of real-time performance and high accuracy. As shown in fig. 2, in the embodiment of the present disclosure, the encoder network is formed by stacking one convolutional network module (E1), eight MobileNetV2 lightweight network modules (M2-M9), and one Transformer conversion network module (Transformer). For an input image (a sample image during model training), shallow information of the image is first extracted through the convolutional network module (E1), which downsamples the image to 1/2 of the original input; the convolution kernel size of this convolutional network is 3 x 3, the stride is 2, and the number of channels is 16. Next, shallow features of the image are further extracted through two MobileNetV2 network modules (M2 and M3), and the learned semantics become more complex; M2 has an expansion rate of 2, a stride of 1 and 16 channels, and M3 has an expansion rate of 2, a stride of 2 and 24 channels, so that after M3 the image is downsampled to 1/4 of the original input. Then, deep information of the image, such as semantic information and position information, is gradually extracted by three consecutive MobileNetV2 network modules (M4-M6); the position information is mainly reflected in the pixel values, while the semantic information refers to category judgments, for example whether a pixel point belongs to the portrait. M4 has an expansion rate of 2, a stride of 1 and 24 channels; M5 has an expansion rate of 2, a stride of 1 and 24 channels; M6 has an expansion rate of 2, a stride of 2 and 48 channels, so that after M6 the image is downsampled to 1/8 of the original input. Feature extraction then continues through two further MobileNetV2 network modules (M7 and M8), where M7 has an expansion rate of 2, a stride of 1 and 48 channels, and M8 has an expansion rate of 2, a stride of 2 and 64 channels; after M8 the image is downsampled to 1/16 of the original input. The feature map output by M8 is then input to the Transformer conversion network module, whose number of channels is 80; the Transformer conversion network module enlarges the receptive field, acquires global correlations and enhances the feature expression capability. Finally, the output of the Transformer conversion network module is fed into a MobileNetV2 network module (M9), which completes the extraction of visual features and yields the coding feature map, whose size is 1/16 of the input image; M9 has an expansion rate of 2, a stride of 1 and 80 channels.
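The following is a hedged sketch of an encoder with this layer arrangement, written in PyTorch. It reuses torchvision's InvertedResidual block as a stand-in for the MobileNetV2 lightweight module; the Transformer wrapper (including its 64-to-80 channel projection, head count, and feed-forward width) and the choice of activations in E1 are assumptions, since the disclosure does not specify them.

```python
import torch
import torch.nn as nn
from torchvision.models.mobilenetv2 import InvertedResidual

class TokenTransformer(nn.Module):
    """Hypothetical Transformer conversion module: projects to 80 channels, applies
    self-attention over the flattened spatial positions, and restores the spatial layout."""
    def __init__(self, in_ch=64, out_ch=80, heads=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, 1)
        self.block = nn.TransformerEncoderLayer(d_model=out_ch, nhead=heads,
                                                dim_feedforward=2 * out_ch, batch_first=True)

    def forward(self, x):
        x = self.proj(x)
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (N, H*W, C): one token per spatial position
        tokens = self.block(tokens)                # global self-attention -> global context
        return tokens.transpose(1, 2).reshape(n, c, h, w)

class Encoder(nn.Module):
    """Sketch of the encoder: E1 conv, eight MobileNetV2 blocks (M2-M9), one Transformer module."""
    def __init__(self):
        super().__init__()
        self.e1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1),
                                nn.BatchNorm2d(16), nn.ReLU6(inplace=True))   # 1/2
        self.m2 = InvertedResidual(16, 16, stride=1, expand_ratio=2)
        self.m3 = InvertedResidual(16, 24, stride=2, expand_ratio=2)          # 1/4
        self.m4 = InvertedResidual(24, 24, stride=1, expand_ratio=2)
        self.m5 = InvertedResidual(24, 24, stride=1, expand_ratio=2)
        self.m6 = InvertedResidual(24, 48, stride=2, expand_ratio=2)          # 1/8
        self.m7 = InvertedResidual(48, 48, stride=1, expand_ratio=2)
        self.m8 = InvertedResidual(48, 64, stride=2, expand_ratio=2)          # 1/16
        self.transformer = TokenTransformer(64, 80)
        self.m9 = InvertedResidual(80, 80, stride=1, expand_ratio=2)

    def forward(self, x):
        x = self.e1(x)
        x2 = self.m2(x)
        x3 = self.m3(x2)                  # second feature map, later fused in the decoder
        x = self.m6(self.m5(self.m4(x3)))
        x = self.m8(self.m7(x))
        x = self.transformer(x)
        return self.m9(x), x3             # coding feature map (1/16) and skip feature (1/4)
```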
In the embodiment of the disclosure, the sample image is input into the serially connected convolutional network module, at least one group of lightweight network modules, Transformer conversion network module and lightweight network module to obtain the coding feature map. By introducing the Transformer module, global information of the image can be obtained, which is beneficial for generating a high-quality segmentation result; by using lightweight networks, a real-time segmentation effect on the mobile terminal can be achieved.
Fig. 3 shows a flowchart of a training method of a portrait segmentation model according to another exemplary embodiment of the present disclosure. As shown in fig. 3, on the basis of the foregoing embodiment, in step 102, decoding the coding feature map through the decoder network of the model to be trained to obtain the binarized mask map corresponding to the sample image may include the following sub-steps:
step 201, a first feature map is obtained by upsampling the coding feature map through a first upsampling module of the decoder network, wherein the size of the first feature map is the same as that of a second feature map output by a target lightweight network module, and the target lightweight network module is any lightweight network module in the at least one group of lightweight network module groups. In the embodiment of the present disclosure, the encoding characteristic diagram output by the encoder network may be up-sampled to obtain the first characteristic diagram having the same size as the second characteristic diagram output by the target lightweight network module. The target lightweight network module is any one of at least one lightweight network module group in the encoder network, and can be determined according to the network structure of the model. For example, in the model structure shown in fig. 2, the target lightweight network module is the third lightweight network module in the first group of lightweight network modules, and the second feature map is the feature map output by M3 in fig. 2.
It can be understood that, in the embodiment of the present disclosure, the encoder network includes at least one group of lightweight network module groups, each group includes at least one lightweight network module, and the stride of each lightweight network module determines the downsampling factor of its output feature map relative to its input. For example, if the stride of a lightweight network module is 1, the feature map it outputs has the same size as the image input to it; if the stride is 2, the feature map it outputs is 1/2 the size of the image input to it. In the embodiment of the present disclosure, any lightweight network module in the at least one group of lightweight network module groups may be used as the target lightweight network module; it is only necessary to ensure that the first feature map obtained by upsampling the coding feature map output by the encoder network has the same size as the second feature map output by the target lightweight network module.
In an optional implementation manner of the present disclosure, the encoder network includes a first lightweight network module group, a second lightweight network module group, and a third lightweight network module group. The first lightweight network module group is located at the front of the encoder network and can extract shallow features of the input image; it includes the second lightweight network module and the third lightweight network module, and the third lightweight network module can extract relatively complete detail features without containing too much noise. Therefore, in the embodiment of the present disclosure, the third lightweight network module in the first lightweight network module group may be used as the target lightweight network module, and its output may be fused with the feature map obtained by upsampling the coding feature map, so that the fused feature map retains the shallow detail information of the image and the model precision is improved.
Illustratively, in the model structure shown in fig. 2, the target lightweight network module is the third lightweight network module M3, after passing through the third lightweight network module M3, the image is down-sampled to 1/4 of the original input image, and the encoded feature map output by the first lightweight network module M9 is 1/16 of the original input image, so that the encoded feature map can be up-sampled by 4 times by the first up-sampling module of the decoder network to obtain a first feature map, and the size of the first feature map is 1/4 of the original input image and is the same as the size of the second feature map output by the third lightweight network module M3.
It can be understood that, in the embodiment of the present disclosure, the upsampling multiple applied to the coding feature map may be preset according to the pre-designed structure of the model to be trained. For example, in fig. 2, the coding feature map finally output by the lightweight network module M9 is 1/16 of the original input image; in the decoder network, after the coding feature map is upsampled by the first upsampling module (Upsample1 in fig. 2), it is fused with the second feature map output by the lightweight network module M3, which is 1/4 of the original input image, so the upsampling multiple of Upsample1 may be preset to 4. On the other hand, if in the decoder network the coding feature map were instead fused, after being upsampled by Upsample1, with the feature map output by E1, which is 1/2 of the original input image, the upsampling multiple of Upsample1 could be preset to 8.
Step 202, inputting the first feature map and the second feature map into a fusion module of the decoder network for feature fusion to obtain a fusion feature map.
In the embodiment of the disclosure, the coding feature map is output to a decoder network of a model to be trained, first, upsampling is performed through a first upsampling module of the decoder network to obtain a first feature map with the same size as a second feature map output by a target lightweight network module, then, the first feature map can be input to a fusion module of the decoder network, and in the fusion module, feature fusion is performed on the first feature map and the second feature map to obtain a fusion feature map.
In an alternative embodiment of the present disclosure, the first feature map and the second feature map may be directly subjected to feature fusion to obtain a fused feature map. For example, the first feature map and the second feature map may be channel fused to obtain a fused feature map.
In an optional embodiment of the present disclosure, after the first feature map is processed, feature fusion may be performed according to the new feature map obtained by the processing, the first feature map, and the second feature map, so as to obtain a fused feature map. Specifically, when the fusion module performs feature fusion on the first feature map and the second feature map to obtain a fusion feature map, the first feature map may be input to a plurality of convolution networks connected in series in the fusion module to perform convolution processing, so as to obtain an attention feature map, and then a fusion image is generated according to the attention feature map, the first feature map and the second feature map, and then the fusion image is input to the convolution layer to perform convolution processing, so as to obtain a fusion feature map.
The convolution layer to which the fused image is input may be a 1 x 1 convolution layer.
In an alternative embodiment of the present disclosure, the plurality of convolution networks may be three, and the convolution kernels of three series-connected convolution networks are 1 × 1, 3 × 3, and 1 × 1, respectively. By setting the convolution kernel of the middle convolution network to 3 × 3, the edge feature expression capability of the feature map can be enhanced.
Illustratively, fig. 4 shows a schematic diagram of the network structure of a fusion module in the decoder network provided by an exemplary embodiment of the present disclosure. As shown in fig. 4, the fusion module in the decoder network includes three convolution networks connected in series, and the first feature map is convolved by these three serially connected convolution networks to obtain the attention feature map. The convolution kernel of the first convolution network is 1 x 1, with a stride of 1 and 24 channels, and it reduces the number of channels of the input feature map; the convolution kernel of the second convolution network is 3 x 3, with a stride of 1 and 24 channels, and it is used to enhance the expression capability of edge features; the convolution kernel of the third convolution network is 1 x 1, with a stride of 1 and 1 channel, yielding the attention feature map. Then, taking the attention feature map as the weight, the first feature map and the second feature map are weighted and summed to obtain the fused image, where the fused image is calculated as shown in formula (1). Finally, the weighted fused image is input into a convolution network with a 1 x 1 convolution kernel, a stride of 1 and 16 channels, which integrates the fused features to obtain the fusion feature map.
F = α ⊙ E₃ + (1 − α) ⊙ E₉    (1)
wherein F represents the fused image, α represents the attention feature map, E₃ represents the second feature map output by the target lightweight network module, E₉ represents the first feature map obtained by up-sampling the coding feature map output by the first lightweight network module, and ⊙ denotes element-wise multiplication.
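A hedged sketch of such a fusion module follows. The ReLU/Sigmoid activations, the 1 x 1 projection used to align the 80-channel deep feature with the 24-channel shallow feature before the weighted sum, and the exact form of the weighted summation are assumptions where the text and the reconstructed formula (1) leave them open.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Hypothetical fusion module: three stacked convolutions produce a single-channel
    attention map from the up-sampled deep feature, which then weights the fusion of
    the shallow (E3) and deep (E9) features."""
    def __init__(self, deep_ch=80, shallow_ch=24, mid_ch=24, out_ch=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(deep_ch, mid_ch, 1), nn.ReLU(inplace=True),            # 1x1: reduce channels
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),  # 3x3: edge features
            nn.Conv2d(mid_ch, 1, 1), nn.Sigmoid())                           # 1x1: attention map in [0, 1]
        self.proj_deep = nn.Conv2d(deep_ch, shallow_ch, 1)   # assumed: align channel counts for the sum
        self.out = nn.Conv2d(shallow_ch, out_ch, 1)          # integrate fused features (1x1, 16 channels)

    def forward(self, e9_up, e3):
        alpha = self.attn(e9_up)
        fused = alpha * e3 + (1.0 - alpha) * self.proj_deep(e9_up)  # weighted summation, as in reconstructed formula (1)
        return self.out(fused)
```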
It should be noted that, in the embodiment of the present disclosure, as shown in fig. 2, after the encoding feature map output by the lightweight network module M9 is up-sampled, the encoding feature map is fused with the feature map output by the lightweight network module M3, one is to focus on semantic information corresponding to the lightweight network module M9, and the other is to focus on information contained in the feature map output by the lightweight network module M3, so as to supplement the encoding feature map, but it should be understood that fusing the encoding feature map and the feature map output by the lightweight network module M3 is only a preferred embodiment to explain the content of the present disclosure, and is not to be taken as a limitation of the present disclosure. The encoding feature map may also be fused with feature maps output by other network modules, and a scheme of fusing with feature maps output by other network modules should also belong to the disclosure.
In the embodiment of the disclosure, when feature fusion is performed, the first feature map is input into the plurality of convolution networks connected in series in the fusion module for convolution processing to obtain the attention feature map, then the fusion image is generated according to the attention feature map, the first feature map and the second feature map, and then the fusion image is input into the convolution layer for convolution processing to obtain the fusion feature map.
And 203, performing upsampling on the fusion feature map through a second upsampling module of the decoder network to obtain a binarization mask map corresponding to the sample image.
The upsampling multiple of the second upsampling module may be determined according to a network structure of the model, if the fusion feature map output by the fusion module is 1/4 times of the original input image, the upsampling multiple of the second upsampling module may be preset to be 4 times, and if the fusion feature map output by the fusion module is 1/8 times of the original input image, the upsampling multiple of the second upsampling module may be preset to be 8 times, which may be set according to actual requirements, which is not limited by the present disclosure.
In the embodiment of the disclosure, after the fusion module outputs the fusion feature map, the fusion feature map is input to a second upsampling module of the decoder network, and after upsampling is performed by the second upsampling module, a binarization mask map corresponding to the input sample image is obtained.
Exemplarily, as shown in fig. 2, the decoder network of the model to be trained includes three parts: a first upsampling module (Upsample1), a fusion module and a second upsampling module (Upsample2). After the coding feature map output by M9 is upsampled by Upsample1, it is fused in the fusion module with the second feature map output by M3 to obtain the fusion feature map, and the fusion feature map output by the fusion module is then upsampled by Upsample2 to obtain an output image with the same size as the original input image, which is the binarized mask map corresponding to the input image.
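A hedged sketch of this decoder pipeline, reusing the FusionModule sketched above; the bilinear interpolation mode and the final 1-channel prediction head are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder: Upsample1 -> fusion with the M3 feature -> Upsample2."""
    def __init__(self, fusion):
        super().__init__()
        self.fusion = fusion
        self.head = nn.Conv2d(16, 1, 1)   # assumed prediction head producing a 1-channel mask logit

    def forward(self, enc_feat, m3_feat, out_size):
        x = F.interpolate(enc_feat, size=m3_feat.shape[-2:],
                          mode='bilinear', align_corners=False)   # Upsample1: 1/16 -> 1/4
        x = self.fusion(x, m3_feat)                               # fuse with the second feature map
        x = F.interpolate(x, size=out_size,
                          mode='bilinear', align_corners=False)   # Upsample2: back to input size
        return self.head(x)   # thresholding the sigmoid of this logit yields the binarized mask
```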
In the training method of the portrait segmentation model of the embodiment of the disclosure, the coding feature map is upsampled through the first upsampling module of the decoder network to obtain a first feature map with the same size as the second feature map output by the target lightweight network module; the first feature map and the second feature map are then input into the fusion module of the decoder network for feature fusion to obtain a fusion feature map, and the fusion feature map is upsampled through the second upsampling module of the decoder network to obtain the binarized mask map corresponding to the sample image. By fusing the upsampled coding feature map with the feature map output by the target lightweight network module, the shallow detail information of the image is retained while the deep semantic information of the image is attended to, which enhances the feature expression of the portrait foreground region, suppresses background noise, and improves the accuracy of portrait mask prediction.
In the embodiment of the disclosure, the calculation mode of the loss function value of the model to be trained is redefined, and in the calculation, not only is the final prediction result supervised, but also the output of the target lightweight network module is subjected to auxiliary supervision, so as to ensure the accuracy of the model in predicting the whole region of the portrait. Therefore, in the embodiment of the present disclosure, the loss function value of the model to be trained is determined by two parts, which will be described in detail below with reference to fig. 5.
In a possible implementation manner of the embodiment of the present disclosure, as shown in fig. 5, on the basis of the foregoing embodiment, step 103 may include the following sub-steps:
step 301, down-sampling the sample image to obtain a sample image, wherein the size of the sample image is the same as that of the second feature map.
As mentioned above, the second feature map is a feature map output by the target lightweight network module, and in the embodiment of the present disclosure, the sampling multiple of the down-sampling of the sample image may be determined according to a size ratio of the second feature map to the sample image. For example, if the second feature map is 1/4 times the input sample image, the sample image may be downsampled by 4 times to obtain a sample image, so that the sample image and the second feature map are the same size and 1/4 times the sample image.
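A minimal sketch of this label down-sampling step, assuming PyTorch tensors; nearest-neighbor interpolation is an assumption chosen to keep the labels hard 0/1 values.

```python
import torch
import torch.nn.functional as F

def downsample_labels(label, feat_hw):
    """Hypothetical helper: down-sample the (N, H, W) ground-truth label map to the
    spatial size of the second feature map so the auxiliary loss can be computed."""
    label = label.unsqueeze(1).float()                          # (N, 1, H, W)
    small = F.interpolate(label, size=feat_hw, mode='nearest')  # keep hard 0/1 labels
    return small.squeeze(1)                                     # (N, h, w)
```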
Step 302, based on a binary cross entropy loss function, determining a first loss function value according to label information of each pixel point in the sample image and a prediction result of a corresponding pixel point in the binary mask map.
The higher the prediction probability is, the higher the possibility that the corresponding pixel point is the portrait is.
In the embodiment of the disclosure, after the binary mask image output by the model to be trained is obtained, the first loss function value can be calculated based on the preset binary cross entropy loss function according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask image.
Step 303, determining a second loss function value according to the sampled image and the second feature map based on the binary cross entropy loss function.
In the embodiment of the disclosure, in addition to calculating the first loss function value based on the binary cross entropy loss function according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map, a second loss function value is calculated based on the binary cross entropy loss function according to the sampled image and the second feature map.
It can be understood that each pixel point in the sample image is labeled in advance with corresponding label information indicating whether the pixel point belongs to the portrait, and each pixel point in the sampled image obtained by down-sampling the sample image likewise corresponds to label information. In the embodiment of the present disclosure, when the second loss function value is calculated according to the sampled image and the second feature map based on the binary cross entropy loss function, it is calculated, based on the binary cross entropy loss function, from the difference between the label information corresponding to each pixel point in the sampled image and the prediction result of the corresponding pixel point in the second feature map. The prediction result of each pixel point in the second feature map is predicted by the target lightweight network module and represents the predicted probability that the corresponding pixel point belongs to the portrait; the higher the prediction probability, the higher the possibility that the corresponding pixel point belongs to the portrait.
Step 304, determining a loss function value of the model to be trained according to the first loss function value and the second loss function value.
In the embodiment of the present disclosure, after the first loss function value and the second loss function value are obtained through calculation, the loss function value of the model to be trained can be calculated from these two loss values.
For example, the first loss function value and the second loss function value may be directly summed, or weighted and then summed, to obtain the loss function value of the model to be trained.
For example, the average of the first loss function value and the second loss function value may be calculated, and the resulting average may be used as the loss function value for the model to be trained.
In an alternative embodiment of the present disclosure, the loss function value of the model to be trained can be calculated by the following formula (2).
$$\mathrm{Loss} = \sum_{i \in P} \mathrm{CEloss}\left(y_i,\ y^{pred}_i\right) + \lambda \sum_{j \in Q} \mathrm{CEloss}\left(y^{d}_j,\ y^{d,pred}_j\right) \tag{2}$$

Wherein, in the above formula (2), CEloss denotes the commonly used binary cross entropy loss function; P is the set of all pixel points contained in one input sample image, and i ∈ P denotes any pixel point in P; y_i denotes the actual label information of the i-th pixel point in the sample image, and y^{pred}_i denotes the prediction result of the i-th pixel point in the binary mask map; Q denotes the set of pixel points contained in the sampled image obtained by down-sampling the sample image, and j ∈ Q denotes any pixel point in Q; y^{d}_j denotes the actual label information of the j-th pixel point in the sampled image, and y^{d,pred}_j denotes the prediction result of the j-th pixel point in the second feature map; λ is the weight coefficient of the auxiliary supervision branch, a preset constant coefficient whose value usually lies between 0 and 1. By way of example, in the embodiments of the present disclosure, λ may be set to 0.4.
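Purely for illustration, formula (2) could be implemented as follows in PyTorch (a sketch under the assumption that both predictions are already probabilities in [0, 1]; the function and tensor names are not taken from the disclosure, and mean reduction over pixels is used here whereas formula (2) is written as a per-pixel sum):

```python
import torch
import torch.nn.functional as F

def model_loss(pred_mask, labels, aux_pred, ds_labels, lam=0.4):
    """Two-term loss of formula (2): a main binary cross entropy loss on the
    final binary mask plus a lambda-weighted auxiliary binary cross entropy
    loss on the prediction derived from the second feature map.

    pred_mask : (N, 1, H, W)     decoder output, probabilities in [0, 1]
    labels    : (N, 1, H, W)     per-pixel ground-truth labels (0 or 1)
    aux_pred  : (N, 1, H/4, W/4) auxiliary prediction from the second feature map
    ds_labels : (N, 1, H/4, W/4) labels down-sampled to the same size
    lam       : weight of the auxiliary supervision branch, e.g. 0.4
    """
    main_loss = F.binary_cross_entropy(pred_mask, labels.float())
    aux_loss = F.binary_cross_entropy(aux_pred, ds_labels.float())
    return main_loss + lam * aux_loss
```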
According to the training method of the portrait segmentation model of the embodiment of the disclosure, a sampled image whose size is the same as that of the second feature map is obtained by down-sampling the sample image; a first loss function value is determined, based on a binary cross entropy loss function, from the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map; a second loss function value is determined, based on the same binary cross entropy loss function, from the sampled image and the second feature map; and the loss function value of the model to be trained is then determined from the first loss function value and the second loss function value. In this way, when the overall loss function value of the model is calculated, the final output of the model is supervised and the output of the third lightweight network module in the encoder network is subjected to auxiliary supervision, which improves the accuracy of the model in predicting the whole portrait region.
In the embodiment of the present disclosure, the portrait segmentation model obtained by training in the foregoing embodiments may be used to segment the portrait in an image and separate the portrait from the image. Fig. 6 is a flowchart of a portrait segmentation method based on a portrait segmentation model according to an exemplary embodiment of the present disclosure, wherein the portrait segmentation model may be obtained by training with the training method of the portrait segmentation model described in the foregoing embodiments. The method may be executed by a portrait segmentation apparatus based on the portrait segmentation model, and the apparatus may be implemented by software and/or hardware and may generally be integrated in an electronic device. As shown in fig. 6, the portrait segmentation method based on the portrait segmentation model may include the following steps:
Step 401, obtaining an image to be detected.
In the embodiment of the present disclosure, the image to be detected may be any image on which portrait segmentation needs to be performed, for example, an image whose background needs to be processed to protect the privacy of a user, or a video image in a video. The video may be a video stream generated in real time during a live broadcast or a video conference, or may be a video that has already been recorded.
For example, after a teacher records a course video, the teacher may want to replace the actual background in the course video with an uploaded picture related to the course. In this case, the video to be processed is the recorded course video, the images to be detected are the multiple frames of video images included in the video to be processed, and the background picture is the course-related picture selected and uploaded by the teacher.
Step 402, inputting an image to be detected into a portrait segmentation model to obtain a binary mask map of the image to be detected.
In the embodiment of the disclosure, for the acquired image to be detected, the image to be detected can be input into the trained portrait segmentation model, and the portrait segmentation model outputs the binary mask image of the portrait contained in the image to be detected.
Step 403, segmenting a portrait image from the image to be detected according to the image to be detected and the binary mask image.
It can be understood that the binary mask map output by the portrait segmentation model is a binary image and cannot be used directly, so the predicted portrait needs to be extracted from the original image.
In the embodiment of the disclosure, the portrait image can be segmented from the image to be detected according to the image to be detected and the binary mask image.
Exemplarily, the binarized mask map output by the portrait segmentation model can first be normalized: pixel points of the background region remain 0, and pixel points of the portrait region are normalized to 1. The normalized mask is then multiplied element-wise with the image to be detected, thereby obtaining the segmented portrait image. It can be understood that the pixel value of each pixel point in the background region of the portrait image is still 0, while the pixel value of each pixel point in the portrait region equals the pixel value of the corresponding pixel point in the image to be detected.
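A hedged sketch of the normalization and per-pixel multiplication described above (NumPy is used only for illustration; the function and variable names are not from the disclosure):

```python
import numpy as np

def extract_portrait(image, binary_mask):
    """image: (H, W, 3) uint8 image to be detected; binary_mask: (H, W) mask
    whose background pixels are 0 and whose portrait pixels are non-zero.
    Returns the portrait image with all background pixels set to 0."""
    mask = (binary_mask > 0).astype(np.float32)          # normalize portrait pixels to 1
    return (image.astype(np.float32) * mask[..., None]).astype(np.uint8)
```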
Considering that a black background is not attractive, in some application scenarios the background of the obtained portrait image can be replaced, which protects the privacy of the user while keeping the portrait visually pleasing. Therefore, when the portrait is separated from the image to be detected, the background can also be replaced with a specified or preset background, so as to meet the individual requirements of the user.
Illustratively, the composite image after background replacement can be obtained according to formula (3).
$$Out = pred \cdot fg + (1 - pred) \cdot bg \tag{3}$$
Wherein, in formula (3), Out represents the finally obtained composite image, pred represents the image obtained by normalizing the binary mask map output by the portrait segmentation model, fg represents the image to be detected, and bg represents the background image to be substituted in.
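A minimal compositing sketch, assuming formula (3) is the alpha blend Out = pred·fg + (1 − pred)·bg as reconstructed above (NumPy, illustrative names only):

```python
import numpy as np

def replace_background(pred, fg, bg):
    """pred: (H, W) normalized mask with values in [0, 1]; fg: (H, W, 3) image
    to be detected; bg: (H, W, 3) replacement background of the same size.
    Returns the composite image with the original background replaced."""
    alpha = pred.astype(np.float32)[..., None]   # broadcast the mask over the RGB channels
    out = alpha * fg.astype(np.float32) + (1.0 - alpha) * bg.astype(np.float32)
    return out.astype(np.uint8)
```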
According to the portrait segmentation method based on the portrait segmentation model of the embodiment of the present disclosure, the image to be detected is acquired and input into the portrait segmentation model to obtain the binary mask map of the image to be detected, and the portrait image is then segmented from the image to be detected according to the image to be detected and the binary mask map.
The exemplary embodiment of the present disclosure also provides a training device of the portrait segmentation model. Fig. 7 shows a schematic block diagram of a training apparatus for a human figure segmentation model according to an exemplary embodiment of the present disclosure, and as shown in fig. 7, the training apparatus 50 for a human figure segmentation model includes: a sample set obtaining module 501, a prediction result obtaining module 502, a calculating module 503 and a parameter updating module 504.
The sample set obtaining module 501 is used for obtaining a training sample set, wherein the training sample set comprises a sample image and label information of each pixel point in the sample image;
a prediction result obtaining module 502, configured to input the sample image into a model to be trained, extract a coding feature map of the sample image through a coder network of the model to be trained, and decode the coding feature map through a decoder network of the model to be trained to obtain a binarization mask map corresponding to the sample image, where the coder network includes multiple lightweight network modules and a transform conversion network module;
a calculating module 503, configured to calculate a loss function value of the model to be trained according to label information of each pixel point in the sample image and a prediction result of a corresponding pixel point in the binarized mask map;
a parameter updating module 504, configured to update network parameters of the encoder network and the decoder network in the model to be trained according to the loss function value, and perform iterative training until the loss function value of the model to be trained is less than or equal to a preset value, so as to obtain the portrait segmentation model.
Optionally, the prediction result obtaining module 502 is further configured to:
and inputting the sample image into a convolutional network module, at least one group of lightweight network module groups, a Transformer conversion network module and a first lightweight network module which are connected in series to obtain the coding feature map, wherein the lightweight network module groups comprise at least one lightweight network module.
Optionally, the at least one group of lightweight network module groups includes three groups of lightweight network module groups, which are a first group of lightweight network module groups, a second group of lightweight network module groups, and a third group of lightweight network module groups, respectively.
Optionally, the first group of lightweight network module groups includes a second lightweight network module and a third lightweight network module, the second group of lightweight network module groups includes a fourth lightweight network module, a fifth lightweight network module and a sixth lightweight network module, and the third group of lightweight network module groups includes a seventh lightweight network module and an eighth lightweight network module.
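The internal structure of each lightweight module and of the Transformer conversion module is defined elsewhere in the disclosure; purely to illustrate the serial connection described above, a skeleton of the encoder might look like the following (every block below is an nn.Identity placeholder, and the stem channel count is an assumption, not the patent's implementation):

```python
import torch
import torch.nn as nn

class EncoderSkeleton(nn.Module):
    """Serial encoder: stem convolution -> three groups of lightweight modules
    -> Transformer conversion module -> final lightweight module (placeholders)."""
    def __init__(self, block=nn.Identity):
        super().__init__()
        self.stem = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)  # convolutional network module
        self.group1 = nn.Sequential(block(), block())                     # 2nd and 3rd lightweight modules
        self.group2 = nn.Sequential(block(), block(), block())            # 4th to 6th lightweight modules
        self.group3 = nn.Sequential(block(), block())                     # 7th and 8th lightweight modules
        self.transformer = block()                                        # Transformer conversion module
        self.last = block()                                               # 1st lightweight module

    def forward(self, x):
        x = self.stem(x)
        second_feature = self.group1(x)        # output of the 3rd lightweight module ("second feature map")
        x = self.group3(self.group2(second_feature))
        coding_feature = self.last(self.transformer(x))
        return coding_feature, second_feature

encoder = EncoderSkeleton()
coding_feature, second_feature = encoder(torch.randn(1, 3, 224, 224))
```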
Optionally, the prediction result obtaining module 502 includes:
a first up-sampling unit, configured to up-sample the encoded feature map by using a first up-sampling module of the decoder network to obtain a first feature map, where the first feature map has a size same as that of a second feature map output by a target lightweight network module, and the target lightweight network module is any lightweight network module in the at least one group of lightweight network module groups;
the feature fusion unit is used for inputting the first feature map and the second feature map into a fusion module of the decoder network for feature fusion to obtain a fusion feature map;
and the second up-sampling unit is used for up-sampling the fusion characteristic graph through a second up-sampling module of the decoder network to obtain a binarization mask graph corresponding to the sample image.
Optionally, the target lightweight network module is the third lightweight network module.
Optionally, the feature fusion unit is further configured to:
inputting the first feature map into a plurality of convolution networks connected in series in the fusion module for convolution processing to obtain an attention feature map;
generating a fused image according to the attention feature map, the first feature map and the second feature map;
and inputting the fusion image into a convolution layer for convolution processing to obtain the fusion characteristic diagram.
Optionally, the number of convolution networks is three, and the convolution kernels of the three convolution networks are 1 × 1, 3 × 3 and 1 × 1, respectively.
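Under the reading above (an attention map produced by three serial convolutions with 1 × 1, 3 × 3 and 1 × 1 kernels, then an attention-weighted summation of the two feature maps followed by a convolution layer), a hedged PyTorch sketch of the fusion module might be as follows; the channel count, the sigmoid on the attention map, and the exact weighting form are assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Fuse the up-sampled first feature map with the second feature map from the encoder."""
    def __init__(self, channels=64):
        super().__init__()
        # Three serial convolutions (1x1, 3x3, 1x1) producing the attention feature map
        self.attention = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, first_map, second_map):
        att = self.attention(first_map)                        # attention feature map from the first feature map
        fused = att * first_map + (1.0 - att) * second_map     # attention-weighted summation -> fused image
        return self.out_conv(fused)                            # convolution layer -> fusion feature map
```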
Optionally, the calculating module 503 is further configured to:
down-sampling the sample image to obtain a sampled image, wherein the size of the sampled image is the same as that of the second feature map;
determining a first loss function value according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map based on a binary cross entropy loss function;
determining a second loss function value according to the sampling image and the second feature map based on the binary cross entropy loss function;
and determining a loss function value of the model to be trained according to the first loss function value and the second loss function value.
The training apparatus for the portrait segmentation model provided by the embodiment of the present disclosure can execute any training method of the portrait segmentation model provided by the embodiments of the present disclosure that is applicable to electronic devices such as terminals, and has the functional modules and beneficial effects corresponding to the executed method. For content not described in detail in the apparatus embodiments of the present disclosure, reference may be made to the description of any method embodiment of the present disclosure.
The exemplary embodiment of the present disclosure further provides a portrait segmentation apparatus based on a portrait segmentation model, where the portrait segmentation model can be obtained by training using the training method of the portrait segmentation model described in the foregoing embodiment. Fig. 8 shows a schematic block diagram of a person image segmentation apparatus based on a person image segmentation model according to an exemplary embodiment of the present disclosure, and as shown in fig. 8, the person image segmentation apparatus 60 based on a person image segmentation model may include: an image acquisition module 601, an input module 602, and a processing module 603.
The image acquisition module 601 is used for acquiring an image to be detected;
an input module 602, configured to input the image to be detected to the portrait segmentation model, so as to obtain a binary mask map of the image to be detected;
and the processing module 603 is configured to segment a portrait image from the image to be detected according to the image to be detected and the binarized mask image.
The portrait segmentation apparatus based on the portrait segmentation model provided by the embodiment of the present disclosure can execute any portrait segmentation method based on the portrait segmentation model provided by the embodiments of the present disclosure that is applicable to electronic devices such as terminals, and has the functional modules and beneficial effects corresponding to the executed method. For content not described in detail in the apparatus embodiments of the present disclosure, reference may be made to the description of any method embodiment of the present disclosure.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a training method of a portrait segmentation model or a portrait segmentation method based on a portrait segmentation model according to embodiments of the present disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is configured to cause the computer to perform a training method of a portrait segmentation model or a portrait segmentation method based on the portrait segmentation model according to the disclosed embodiments.
Exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when being executed by a processor of a computer, is adapted to cause the computer to carry out a training method of a portrait segmentation model or a portrait segmentation method based on a portrait segmentation model according to embodiments of the present disclosure.
Referring to fig. 9, a block diagram of an electronic device 1100 will now be described; the electronic device may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 1100 includes a computing unit 1101, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the device 1100 may also be stored. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
A number of components in electronic device 1100 connect to I/O interface 1105, including: an input unit 1106, an output unit 1107, a storage unit 1108, and a communication unit 1109. The input unit 1106 may be any type of device capable of inputting information to the electronic device 1100, and the input unit 1106 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. Output unit 1107 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. Storage unit 1108 may include, but is not limited to, a magnetic or optical disk. The communication unit 1109 allows the electronic device 1100 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1101 can be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 1101 performs the respective methods and processes described above. For example, in some embodiments, the training method of the portrait segmentation model or the portrait segmentation method based on the portrait segmentation model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1100 via the ROM1102 and/or the communication unit 1109. In some embodiments, the computing unit 1101 may be configured to perform a training method of a portrait segmentation model or a portrait segmentation method based on a portrait segmentation model in any other suitable manner (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (12)

1. A method of training a portrait segmentation model, the method comprising:
acquiring a training sample set, wherein the training sample set comprises a sample image and label information of each pixel point in the sample image;
inputting the sample image into a model to be trained, extracting a coding feature map of the sample image through a coder network of the model to be trained, and decoding the coding feature map through a decoder network of the model to be trained to obtain a binary mask map corresponding to the sample image, wherein the coder network comprises a plurality of lightweight network modules and a Transformer conversion network module;
calculating a loss function value of the model to be trained according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map;
updating network parameters of the encoder network and the decoder network in the model to be trained according to the loss function value to carry out iterative training until the loss function value of the model to be trained is smaller than or equal to a preset value, so as to obtain the portrait segmentation model;
decoding the coding feature map through the decoder network of the model to be trained to obtain a binarization mask map corresponding to the sample image, wherein the method comprises the following steps:
the coding feature map is up-sampled through a first up-sampling module of the decoder network to obtain a first feature map, wherein the size of the first feature map is the same as that of a second feature map output by a target lightweight network module, and the target lightweight network module is any lightweight network module in at least one group of lightweight network module groups included in the encoder network;
inputting the first feature map and the second feature map into a fusion module of the decoder network for feature fusion to obtain a fusion feature map, inputting the first feature map into a plurality of convolution networks connected in series in the fusion module for convolution processing to obtain an attention feature map, taking the attention feature map as a weight, performing weighted summation on the first feature map and the second feature map to generate a fusion image, inputting the fusion image into a convolution layer for convolution processing to obtain the fusion feature map;
and upsampling the fusion characteristic graph through a second upsampling module of the decoder network to obtain a binarization mask graph corresponding to the sample image.
2. The training method of the human image segmentation model according to claim 1, wherein the extracting the coding feature map of the sample image through the encoder network of the model to be trained comprises:
and inputting the sample image into a convolutional network module, at least one group of lightweight network module groups, a Transformer conversion network module and a first lightweight network module which are connected in series to obtain the coding feature map, wherein the lightweight network module groups comprise at least one lightweight network module.
3. The method for training a portrait segmentation model according to claim 2, wherein the at least one set of lightweight network module groups includes three sets of lightweight network module groups, which are a first set of lightweight network module groups, a second set of lightweight network module groups, and a third set of lightweight network module groups.
4. A method for training a portrait segmentation model according to claim 3, wherein the first set of lightweight network module groups includes a second lightweight network module and a third lightweight network module, the second set of lightweight network module groups includes a fourth lightweight network module, a fifth lightweight network module and a sixth lightweight network module, and the third set of lightweight network module groups includes a seventh lightweight network module and an eighth lightweight network module.
5. The method of training a human image segmentation model of claim 1, wherein the target lightweight network module is the third lightweight network module.
6. A method of training a human image segmentation model according to claim 1, wherein the number of convolution networks is three, and the convolution kernels of the three convolution networks are 1 x 1, 3 x 3 and 1 x 1, respectively.
7. The method for training the portrait segmentation model according to any one of claims 1 to 6, wherein the calculating the loss function value of the model to be trained according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binarized mask image includes:
down-sampling the sample image to obtain a sampled image, wherein the size of the sampled image is the same as that of the second feature map;
determining a first loss function value according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map based on a binary cross entropy loss function;
determining a second loss function value according to the sampling image and the second feature map based on the binary cross entropy loss function;
and determining a loss function value of the model to be trained according to the first loss function value and the second loss function value.
8. A human image segmentation method based on a human image segmentation model, wherein the human image segmentation model is obtained by training with a human image segmentation model training method according to any one of claims 1 to 7, and the method comprises the following steps:
acquiring an image to be detected;
inputting the image to be detected into the portrait segmentation model to obtain a binary mask map of the image to be detected;
and segmenting a portrait image from the image to be detected according to the image to be detected and the binarization mask image.
9. A training device for a portrait segmentation model comprises:
the sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a sample image and label information of each pixel point in the sample image;
the prediction result obtaining module is used for inputting the sample image into a model to be trained, extracting a coding feature map of the sample image through a coder network of the model to be trained, and decoding the coding feature map through a decoder network of the model to be trained to obtain a binary mask map corresponding to the sample image, wherein the coder network comprises a plurality of lightweight network modules and a Transformer conversion network module;
the calculation module is used for calculating a loss function value of the model to be trained according to the label information of each pixel point in the sample image and the prediction result of the corresponding pixel point in the binary mask map;
a parameter updating module, configured to update network parameters of the encoder network and the decoder network in the model to be trained according to the loss function value, and perform iterative training until the loss function value of the model to be trained is less than or equal to a preset value, so as to obtain the portrait segmentation model;
the prediction result obtaining module comprises:
a first upsampling unit, configured to upsample the encoded feature map through a first upsampling module of the decoder network to obtain a first feature map, where the size of the first feature map is the same as that of a second feature map output by a target lightweight network module, and the target lightweight network module is any lightweight network module in at least one group of lightweight network module groups included in the encoder network;
a feature fusion unit, configured to input the first feature map and the second feature map into a fusion module of the decoder network for feature fusion to obtain a fusion feature map, where the first feature map is input into a plurality of convolution networks connected in series in the fusion module for convolution processing to obtain an attention feature map, the attention feature map is used as a weight, the first feature map and the second feature map are weighted and summed to generate a fusion image, and the fusion image is input into a convolution layer for convolution processing to obtain the fusion feature map;
and the second up-sampling unit is used for up-sampling the fusion characteristic graph through a second up-sampling module of the decoder network to obtain a binarization mask graph corresponding to the sample image.
10. A portrait segmentation apparatus based on a portrait segmentation model, wherein the portrait segmentation model is obtained by training through the training method of the portrait segmentation model according to any one of claims 1 to 7, the apparatus comprising:
the image acquisition module is used for acquiring an image to be detected;
the input module is used for inputting the image to be detected into the portrait segmentation model so as to obtain a binary mask image of the image to be detected;
and the processing module is used for segmenting a portrait image from the image to be detected according to the image to be detected and the binary mask image.
11. An electronic device, comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method of training a portrait segmentation model according to any one of claims 1-7 or to carry out the method of portrait segmentation based on a portrait segmentation model according to claim 8.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of training a portrait segmentation model according to any one of claims 1-7 or the method of portrait segmentation based on a portrait segmentation model according to claim 8.
CN202210543469.4A 2022-05-19 2022-05-19 Portrait segmentation model training method and device and portrait segmentation method and device Active CN114723760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210543469.4A CN114723760B (en) 2022-05-19 2022-05-19 Portrait segmentation model training method and device and portrait segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210543469.4A CN114723760B (en) 2022-05-19 2022-05-19 Portrait segmentation model training method and device and portrait segmentation method and device

Publications (2)

Publication Number Publication Date
CN114723760A CN114723760A (en) 2022-07-08
CN114723760B true CN114723760B (en) 2022-08-23

Family

ID=82231363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210543469.4A Active CN114723760B (en) 2022-05-19 2022-05-19 Portrait segmentation model training method and device and portrait segmentation method and device

Country Status (1)

Country Link
CN (1) CN114723760B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170934B (en) * 2022-09-05 2022-12-23 粤港澳大湾区数字经济研究院(福田) Image segmentation method, system, equipment and storage medium
CN116486196B (en) * 2023-03-17 2024-01-23 哈尔滨工业大学(深圳) Focus segmentation model training method, focus segmentation method and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111986181A (en) * 2020-08-24 2020-11-24 中国科学院自动化研究所 Intravascular stent image segmentation method and system based on double-attention machine system
CN113239954A (en) * 2021-04-01 2021-08-10 河海大学 Attention mechanism-based image semantic segmentation feature fusion method
CN113361546A (en) * 2021-06-18 2021-09-07 合肥工业大学 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934397B (en) * 2017-03-13 2020-09-01 北京市商汤科技开发有限公司 Image processing method and device and electronic equipment
BR112020008021A2 (en) * 2017-10-24 2020-10-27 L'ORéAL S.A. computing devices, method for generating a cnn trained to process images and methods for processing an image
CN111161290B (en) * 2019-12-27 2023-04-18 西北大学 Image segmentation model construction method, image segmentation method and image segmentation system
CN111311629B (en) * 2020-02-21 2023-12-01 京东方科技集团股份有限公司 Image processing method, image processing device and equipment
CN111402258A (en) * 2020-03-12 2020-07-10 Oppo广东移动通信有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN111950467B (en) * 2020-08-14 2021-06-25 清华大学 Fusion network lane line detection method based on attention mechanism and terminal equipment
CN111932546A (en) * 2020-08-20 2020-11-13 展讯通信(上海)有限公司 Image segmentation model training method, image segmentation method, device, equipment and medium
CN112183360B (en) * 2020-09-29 2022-11-08 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
US11694301B2 (en) * 2020-09-30 2023-07-04 Alibaba Group Holding Limited Learning model architecture for image data semantic segmentation
CN112330681A (en) * 2020-11-06 2021-02-05 北京工业大学 Attention mechanism-based lightweight network real-time semantic segmentation method
CN112489056A (en) * 2020-12-01 2021-03-12 叠境数字科技(上海)有限公司 Real-time human body matting method suitable for mobile terminal
CN113724286A (en) * 2021-08-09 2021-11-30 浙江大华技术股份有限公司 Method and device for detecting saliency target and computer-readable storage medium
CN113673420B (en) * 2021-08-19 2022-02-15 清华大学 Target detection method and system based on global feature perception
CN113487618B (en) * 2021-09-07 2022-03-08 北京世纪好未来教育科技有限公司 Portrait segmentation method, portrait segmentation device, electronic equipment and storage medium
CN113888744A (en) * 2021-10-14 2022-01-04 浙江大学 Image semantic segmentation method based on Transformer visual upsampling module
CN114511703A (en) * 2022-01-21 2022-05-17 苏州医智影科技有限公司 Migration learning method and system for fusing Swin Transformer and UNet and oriented to segmentation task
CN114419468A (en) * 2022-01-26 2022-04-29 江西农业大学 Paddy field segmentation method combining attention mechanism and spatial feature fusion algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111986181A (en) * 2020-08-24 2020-11-24 中国科学院自动化研究所 Intravascular stent image segmentation method and system based on double-attention machine system
CN113239954A (en) * 2021-04-01 2021-08-10 河海大学 Attention mechanism-based image semantic segmentation feature fusion method
CN113361546A (en) * 2021-06-18 2021-09-07 合肥工业大学 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism

Also Published As

Publication number Publication date
CN114723760A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN110119757B (en) Model training method, video category detection method, device, electronic equipment and computer readable medium
CN110969124B (en) Two-dimensional human body posture estimation method and system based on lightweight multi-branch network
CN114723760B (en) Portrait segmentation model training method and device and portrait segmentation method and device
CN110517329B (en) Deep learning image compression method based on semantic analysis
CN113487618B (en) Portrait segmentation method, portrait segmentation device, electronic equipment and storage medium
CN109948721B (en) Video scene classification method based on video description
Tang et al. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition
CN112950471A (en) Video super-resolution processing method and device, super-resolution reconstruction model and medium
WO2023284401A1 (en) Image beautification processing method and apparatus, storage medium, and electronic device
CN112270246B (en) Video behavior recognition method and device, storage medium and electronic equipment
CN112258436A (en) Training method and device of image processing model, image processing method and model
CN112633234A (en) Method, device, equipment and medium for training and applying face glasses-removing model
Sun et al. Masked lip-sync prediction by audio-visual contextual exploitation in transformers
CN113379606B (en) Face super-resolution method based on pre-training generation model
CN116012232A (en) Image processing method and device, storage medium and electronic equipment
Mishra et al. Multi-scale network (MsSG-CNN) for joint image and saliency map learning-based compression
CN112200817A (en) Sky region segmentation and special effect processing method, device and equipment based on image
CN116935166A (en) Model training method, image processing method and device, medium and equipment
CN115690238A (en) Image generation and model training method, device, equipment and storage medium
CN112069877B (en) Face information identification method based on edge information and attention mechanism
CN111950496B (en) Mask person identity recognition method
CN115115972A (en) Video processing method, video processing apparatus, computer device, medium, and program product
CN110853040B (en) Image collaborative segmentation method based on super-resolution reconstruction
CN114463734A (en) Character recognition method and device, electronic equipment and storage medium
CN114066841A (en) Sky detection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant