CN112492323A - Live broadcast mask generation method, readable storage medium and computer equipment - Google Patents

Live broadcast mask generation method, readable storage medium and computer equipment

Info

Publication number
CN112492323A
Authority
CN
China
Prior art keywords
sample
frame data
mask
feature map
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910862862.8A
Other languages
Chinese (zh)
Other versions
CN112492323B (en)
Inventor
张抗抗
时英选
刘若衡
高龙文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN201910862862.8A priority Critical patent/CN112492323B/en
Publication of CN112492323A publication Critical patent/CN112492323A/en
Application granted granted Critical
Publication of CN112492323B publication Critical patent/CN112492323B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4314Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations for fitting data in a restricted space on the screen, e.g. EPG data in a rectangular grid
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Abstract

The invention discloses a method for generating a live broadcast mask, which belongs to the technical field of communication and comprises the following steps: establishing an image segmentation model, and training the image segmentation model to obtain a target model; and acquiring a video stream, and processing the frame data in the video stream one by one with the target model to obtain mask frame data.

Description

Live broadcast mask generation method, readable storage medium and computer equipment
Technical Field
The invention relates to the technical field of communication, in particular to a live broadcast mask generation method, a readable storage medium and computer equipment.
Background
With the development of network technology, network live broadcast platforms are used more and more widely. In addition to the traditional comment function, the bullet screen has become an interactive feature of the live broadcast interface; bullet screen technology plays an indispensable role in improving the viewing experience of users and enhancing interactivity and interest. However, as the number of bullet screens increases, the picture in the live broadcast interface can no longer be seen clearly, which affects normal video viewing.
In the prior art, in order to improve the viewing experience of the user, a mask is generally inserted into the video to control the bullet screen so that it is displayed only in a set area. The mask is generated using image semantic segmentation technology. Current mainstream methods for image semantic segmentation adopt a convolutional neural network as the main framework; convolutional neural networks achieve image classification well and have made great progress on segmentation problems.
However, in order to achieve high segmentation accuracy, a large number of convolutional layers are used and the structure of the network framework is often complex, so a large amount of calculation time and cost is required. This is an obstacle in fields with high real-time requirements, such as application in a live broadcast scene, so a method capable of improving the operation speed is needed for the generation of a live broadcast mask.
Disclosure of Invention
Aiming at the problems in the prior art that the semantic segmentation technology, in order to achieve a good effect, requires a large amount of computation and runs slowly, and thus cannot meet the requirements of a real-time live broadcast scene, a live broadcast mask generation method, a readable storage medium and computer equipment are provided.
The invention provides a method for generating a live broadcast mask, which comprises the following steps:
establishing an image segmentation model, and training the image segmentation model to obtain a target model;
acquiring a video stream, and processing the frame data in the video stream one by one by adopting the target model to acquire mask frame data.
Preferably, the image segmentation model includes a first feature extraction module, a second feature extraction module, and a fusion processing module.
Preferably, the training of the image segmentation model to obtain the target model includes the following steps:
obtaining a training sample;
and training an image segmentation model according to the training sample to obtain a target model.
Preferably, before the training sample is obtained, the method comprises the following steps:
acquiring training video data, and acquiring example mask data corresponding to a time stamp of the video data by adopting an example segmentation model;
screening the example mask data, and filtering abnormal mask frame data to obtain at least one mask sample frame data set;
and matching the mask sample frame data set with sample frame data in corresponding training video data according to the time stamp to generate a training sample.
Preferably, the training an image segmentation model according to the training samples to obtain a target model includes the following steps:
acquiring at least one sample pair of data in the training sample, wherein the sample pair of data comprises mask sample frame data and sample frame data in training video data corresponding to the mask sample frame data;
inputting the sample frame data into the image segmentation model;
adopting the first feature extraction module to perform down-sampling on the sample frame data, and performing feature extraction on the down-sampled sample frame data to obtain a first sample feature map;
performing feature extraction on the sample frame data by adopting a second feature extraction module to obtain a second sample feature map;
fusing the first sample feature map and the second sample feature map, and processing the fused feature map to obtain an image segmentation result;
mapping the first sample characteristic diagram to obtain a first segmentation result, and mapping the second sample characteristic diagram to obtain a second segmentation result;
adjusting parameter values in the first feature extraction module, the second feature extraction module and the fusion processing module based on the comparison between the first segmentation result, the second segmentation result and the image segmentation result and the mask sample frame data respectively;
and obtaining the target model until the training of the image segmentation model is completed.
Preferably, the fusing the first sample feature map and the second sample feature map includes the following steps:
up-sampling the first sample characteristic diagram until the first sample characteristic diagram mapping image is consistent with the second sample characteristic diagram mapping image in size, and obtaining a processed first sample characteristic diagram;
and obtaining a fused feature map based on the processed first sample feature map and the second sample feature map.
Preferably, after the obtaining of the fused feature map, the method includes the following steps:
acquiring previous frame sample frame data corresponding to the sample frame data in a training sample;
acquiring a fused feature map corresponding to the previous frame of sample frame data;
and monitoring the fused feature map corresponding to the sample frame data based on the fused feature map corresponding to the previous frame of sample frame data.
Preferably, the processing of the frame data in the video stream one by one by adopting the target model to obtain the mask frame data includes the following steps:
acquiring at least one frame data in the video stream;
after the frame data are subjected to down-sampling processing, inputting the down-sampled frame data into a feature extraction unit for feature extraction to obtain a first feature map;
performing feature extraction on the frame data to obtain a second feature map;
and fusing the first characteristic diagram and the second characteristic diagram, and acquiring mask frame data based on the fused characteristic diagram.
Preferably, the fusing the first feature map and the second feature map includes the following steps:
the first feature map is subjected to up-sampling until the first feature map mapping image is consistent with the second feature map mapping image in size, and a processed first feature map is obtained;
and obtaining a fused feature map based on the processed first feature map and the second feature map.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as described above.
The present invention also provides a computer device, comprising:
a memory for storing executable program code; and
a processor for calling said executable program code in said memory, the executing step comprising the method for generating a live mask as described above.
The beneficial effects of the above technical scheme are that:
in the technical scheme, a target model is obtained through training of an image segmentation model, mask frame data are obtained in real time through the target model, different depths of video frame data are respectively extracted, characteristics are extracted, and then the mask frame data are obtained through fusion processing.
Drawings
FIG. 1 is a block diagram of one embodiment of a system architecture diagram of the present invention;
FIG. 2 is a flowchart of a method for generating a live mask according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating training of the image segmentation model to obtain a target model in an embodiment of the live broadcast mask generation method according to the present invention;
FIG. 4 is a flowchart of a method for generating a live mask according to an embodiment of the present invention before training samples are obtained;
FIG. 5 is a flowchart illustrating training an image segmentation model according to the training sample to obtain a target model in an embodiment of the live broadcast mask generation method according to the present invention;
fig. 6 is a flowchart of fusing the first sample feature map and the second sample feature map in an embodiment of the method for generating a live broadcast mask according to the present invention;
FIG. 7 is a flowchart illustrating a process of obtaining a fused feature map according to an embodiment of the method for generating a live broadcast mask;
fig. 8 is a flowchart of processing frame data in the video stream one by one using the target model to obtain mask frame data in an embodiment of the live broadcast mask generation method according to the present invention;
fig. 9 is a flowchart illustrating the fusing of the first feature map and the second feature map in an embodiment of the method for generating a live broadcast mask according to the present invention;
FIG. 10 is a block diagram of one embodiment of a system for generating a live mask of the present invention;
fig. 11 is a schematic hardware structure diagram of a computer device of a live broadcast mask generation method according to an embodiment of the present invention.
Detailed Description
The advantages of the invention are further illustrated in the following description of specific embodiments in conjunction with the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
In the description of the present invention, it should be understood that the numerical labels before the steps do not indicate the order in which the steps are performed; they merely facilitate the description of the invention and distinguish the steps, and should not be construed as limiting the present invention.
The display terminal to which the live broadcast mask of the embodiments of the present application is applied may be a large video playing device, a game machine, a desktop computer, a smart phone, a tablet computer, a laptop computer, an e-book reader, or any other terminal that can be used for live broadcast display.
The video in the embodiments of the present application can be applied not only in a dedicated live broadcast application program but also in any application scenario that can present a live broadcast effect, for example certain programs. The embodiments of the present application take the application of the live broadcast interface to large live broadcast scene display as an example, but are not limited thereto.
In the embodiment of the present application, please refer to fig. 1, which is a system architecture diagram for the live broadcast mask provided in the embodiment of the present application. The anchor end (stream pushing end) sends a live video stream to the video cloud source station, the video cloud source station sends a transcoding request to the mask control end, and the mask control end forwards the transcoding request to the mask scheduling end. After receiving the transcoding request, the mask scheduling end sends a task allocation request to the mask cluster and queries whether an idle AI machine exists in the cluster; the AI machines are mask identification instances, and each AI machine serves one live broadcast room. If no idle AI machine exists, an abnormal-state callback is fed back to the mask control end. If an idle AI machine exists, it pulls an RTMP (Real Time Messaging Protocol) video stream from the video cloud source station, identifies each frame image in the video stream to generate mask frame data, and pushes the mask frame data to the video cloud source station. The video cloud source station synthesizes the mask frame data with the frame images of the source video stream to generate a video stream carrying mask frame data, and pushes it to a CDN (Content Delivery Network) node. When a user watches the live video, the playing end (stream pulling end) requests a play link from the configuration background; after receiving the play link request, the configuration background queries the mask control end for the enabling state, and the mask control end queries the database (DB) to determine whether the live broadcast room is allowed to enable the mask service and returns the result. If the live broadcast room accessed by the user allows the mask service to be enabled, the user's playing end pulls the video stream carrying the mask frame data through the CDN, parses the video stream, plays the video information and renders the mask bullet screen, so that the video image, the mask frame and the bullet screen information are displayed on the display screen of the playing end, with the bullet screen displayed only in the area outside the mask frame, which improves the user's viewing experience.
Only two configuration backgrounds, one playing end and one anchor end are shown here; the application scenario may further include a plurality of configuration backgrounds, a plurality of playing ends and a plurality of anchor ends. The video cloud source station may be a cloud server or a local server. The devices at the playing end and the anchor end may be mobile devices or other intelligent terminals capable of uploading video.
In order to solve the problems that the semantic segmentation technology in the prior art requires a large amount of computation, runs slowly and cannot meet the requirements of a real-time live broadcast scene, an embodiment of the present invention provides a live broadcast mask generation method. Fig. 2 is a schematic flow diagram of a live broadcast mask generation method according to a preferred embodiment of the present invention; the method is applied on the server side. As can be seen from fig. 2, the method includes the following steps:
S1: establishing an image segmentation model, and training the image segmentation model to obtain a target model;
Specifically, the image segmentation model includes a first feature extraction module, a second feature extraction module and a fusion processing module. In this embodiment, the first feature extraction module processes the frame data input into the image segmentation model and then performs feature extraction with a lightweight convolutional neural network to obtain a feature map. In this process, the processing down-samples the frame data by a factor of 16; this factor is chosen mainly to balance accuracy and speed. Passing the down-sampled frame data through the more complex convolutional neural network reduces the amount of computation and improves the computation speed, and the feature map output by the first feature extraction module is a 16-fold down-sampling of its input image, which keeps the processing speed of the first feature extraction module high while obtaining a feature map with higher accuracy. The complex convolutional neural network uses mobilenetv2, an existing lightweight convolutional neural network. In this application scenario, to meet the 16-fold down-sampling requirement, the standard mobilenetv2 is modified as follows: a deeplab + mobilenet structure is adopted, the stride of the last block of the standard mobilenetv2 is set to 1, a dilated convolution is added, and the input image is down-sampled. In this application scenario, the ASPP layer of deeplabv3 is not used, considering that the scale variation of the main body handled by the first feature extraction module is not severe and to limit the amount of computation during feature extraction.
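As an illustration only (not the patent's own code), the following PyTorch sketch shows the kind of deeplab-style modification described above: torchvision's MobileNetV2 backbone with the stride of its last down-sampling block set to 1 and replaced by a dilated convolution, so that the backbone output is 1/16 of its input resolution. The block index 14 and the dilation rate are assumptions about torchvision's layer layout, not values taken from the patent.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

def build_heavy_branch():
    """First feature extraction module: MobileNetV2 backbone modified in the
    deeplab style so that its output stride is 16 instead of 32."""
    features = mobilenet_v2().features  # pretrained weights can be loaded if desired
    # In torchvision's MobileNetV2 the last stride-2 stage starts around index 14
    # (an assumption about the exact index); set that stride to 1 and dilate the
    # 3x3 depthwise convolution so the receptive field is roughly preserved.
    for m in features[14:].modules():
        if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3) and m.stride == (2, 2):
            m.stride = (1, 1)
            m.dilation = (2, 2)
            m.padding = (2, 2)
    return features  # outputs a feature map at 1/16 of its input resolution
```

No ASPP layer is attached, matching the choice described above.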
The second feature extraction module performs feature extraction on the frame data input into the image segmentation model to obtain a feature map. A simpler convolutional neural network, specifically a five-layer convolutional network, is adopted in the second feature extraction module, and the amount of computation is reduced by reducing the number of convolutional layers. To achieve better segmentation accuracy, the feature maps obtained by the first feature extraction module and the second feature extraction module are fused by the fusion processing module to obtain a combined feature map, which reduces the feature loss incurred when the frame data passes through the first or the second feature extraction module alone.
Apart from mobilenetv2 and the five-layer convolutional neural network, other existing complex or simple neural networks can be substituted for them respectively, as long as the down-sampled frame data passes through the complex neural network and the original frame data passes through the simple neural network, so that the amount of computation is reduced and the operation speed is improved.
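For illustration, the sketch below gives one plausible shape for the second, shallower branch. The five convolutional layers follow the description above, but the channel counts and strides are assumptions, chosen so that a 320×320 frame produces the 20×20 feature map used in the worked example later in this description.

```python
import torch.nn as nn

def build_light_branch(in_channels: int = 3):
    """Second feature extraction module: a plain five-layer convolutional network
    applied to the full-resolution frame (channel counts and strides are illustrative)."""
    def conv_bn_relu(cin, cout, stride):
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )
    return nn.Sequential(
        conv_bn_relu(in_channels, 16, 2),   # 1/2
        conv_bn_relu(16, 32, 2),            # 1/4
        conv_bn_relu(32, 64, 2),            # 1/8
        conv_bn_relu(64, 128, 2),           # 1/16
        conv_bn_relu(128, 128, 1),          # stays at 1/16 of the input frame
    )
```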
Specifically, the training of the image segmentation model to obtain the target model, referring to fig. 3, includes the following steps:
S11: obtaining a training sample;
before the training samples are obtained, referring to fig. 4, the following steps are included:
S10-1: acquiring training video data, and acquiring example mask data corresponding to a time stamp of the video data by adopting an example segmentation model;
Specifically, each frame image in the video data is identified with the example segmentation model, the main body area in the frame image is obtained, and example mask frame data is generated according to the main body area; the timestamp of each example mask frame data corresponds to the timestamp of its corresponding frame image.
Wherein the main body region may be selected from at least one of:
a person area range, an animal area range, a landscape area range, a building area range, an artwork area range, a text area range, and a background area range distinguished from a person, an animal, a building, and an art.
It should be noted that, in this embodiment, the training video data is live video of an offline event venue, and the main body area corresponding to the mask frame data is the person area range.
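The patent does not name a specific example segmentation model. As a stand-in only, the sketch below uses torchvision's pre-trained Mask R-CNN to produce a per-frame binary person mask of the kind described above; the score threshold is an assumption.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

detector = maskrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def person_mask(frame_tensor, score_thresh=0.5):
    """Return a binary person mask for one video frame tensor (C, H, W in [0, 1]).
    Mask R-CNN is only a stand-in for the example segmentation model."""
    out = detector([frame_tensor])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)  # COCO label 1 = person
    if keep.sum() == 0:
        return torch.zeros(frame_tensor.shape[1:], dtype=torch.bool)
    masks = out["masks"][keep, 0] > 0.5                           # (K, H, W) instance masks
    return masks.any(dim=0)                                       # union over all person instances
```

Each generated mask keeps the timestamp of the frame it came from, as stated above.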
S10-2: screening the example mask data, and filtering abnormal mask frame data to obtain at least one mask sample frame data set;
In this implementation step, the example mask data is screened according to a preset rule and abnormal data is filtered out, for example mask data sets whose area is too small, whose jitter is severe or whose duration is too short; the preset rule can be adjusted according to the usage scenario. Screening and filtering the example mask data strengthens the training samples, yields stable training samples, and improves the effect of the target model obtained after training the image segmentation model with those samples.
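A minimal sketch of such a screening step, assuming three illustrative rules (minimum area ratio, maximum frame-to-frame jitter, minimum clip length); the patent gives no concrete thresholds, so the values below are placeholders.

```python
import numpy as np

def filter_mask_set(mask_frames, min_area_ratio=0.02, max_jitter=0.3, min_length=30):
    """mask_frames: list of binary (H, W) numpy arrays for one candidate clip.
    Returns the clip if it passes the screening rules, otherwise None."""
    if len(mask_frames) < min_length:                      # clip too short
        return None
    areas = np.array([m.mean() for m in mask_frames])      # foreground area ratio per frame
    if areas.min() < min_area_ratio:                       # mask area too small
        return None
    jitter = np.abs(np.diff(areas)) / (areas[:-1] + 1e-6)  # relative frame-to-frame change
    if jitter.max() > max_jitter:                          # severe jitter between frames
        return None
    return mask_frames
```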
S10-3: and matching the mask sample frame data set with sample frame data in corresponding training video data according to the time stamp to generate a training sample.
Specifically, in this implementation step, each mask sample frame data is paired with the sample frame data in the corresponding training video data according to the timestamp. The specific implementation is to calibrate the time axes of the example mask frame data set and the training video data, cut out the video segment containing the corresponding frame data, and combine each mask frame data with its corresponding video frame data into one piece of sample pair data.
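A hedged sketch of that timestamp pairing; the (timestamp, data) list layout and the matching tolerance are assumptions, not details from the patent.

```python
def build_sample_pairs(mask_frames, video_frames, tolerance_ms=20):
    """mask_frames / video_frames: lists of (timestamp_ms, data) sorted by time.
    Pairs each mask frame with the video frame whose timestamp matches within
    a small tolerance, yielding (sample frame data, mask sample frame data) pairs."""
    pairs, j = [], 0
    for ts, mask in mask_frames:
        while j < len(video_frames) and video_frames[j][0] < ts - tolerance_ms:
            j += 1
        if j < len(video_frames) and abs(video_frames[j][0] - ts) <= tolerance_ms:
            pairs.append((video_frames[j][1], mask))
    return pairs
```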
Through this embodiment, the effect of the example segmentation model can be transferred to the image segmentation model via the training samples. Example segmentation not only classifies at the pixel level but also distinguishes different instances on the basis of specific categories. Training the image segmentation model with training samples obtained from the example segmentation model to obtain the target model can greatly improve the image segmentation effect of the target model, meet the needs of application in real-time scenarios, and improve the accuracy of the mask frame data obtained through the target model.
S12: and training an image segmentation model according to the training sample to obtain a target model.
Specifically, the training of the image segmentation model according to the training sample to obtain the target model as described above with reference to fig. 5 includes the following steps:
S121: acquiring at least one piece of sample pair data in the training sample;
the sample pair data comprises mask sample frame data and the sample frame data in the training video data corresponding to the mask sample frame data;
the sample frame data is input into the image segmentation model to obtain an image segmentation result, the image segmentation result is compared with the mask sample frame data for learning, and the image segmentation model is trained until the target model is obtained.
S122: inputting the sample frame data into the image segmentation model;
s123: adopting the first feature extraction module to perform down-sampling on the sample frame data, and performing feature extraction on the down-sampled sample frame data to obtain a first sample feature map;
in this embodiment, the down-sampling is specifically to perform 1/2 down-sampling on the sample frame data, so that on one hand, characteristic information with identification can be retained, on the other hand, the calculation amount of the input convolutional neural network is reduced, the operation speed is increased, and the specific down-sampling multiple can be matched with the used convolutional neural network.
S124: mapping the first sample feature map to obtain a first segmentation result;
s125: performing feature extraction on the sample frame data by adopting a second feature extraction module to obtain a second sample feature map;
s126: mapping the second sample feature map to obtain a second segmentation result;
s127: fusing the first sample characteristic diagram and the second sample characteristic diagram, and processing the fused characteristic diagram to obtain an image segmentation result;
specifically, after the fused feature map is obtained, single-layer convolution processing is performed on the fused feature map, that is, the fused feature map is processed by using a single-layer convolution neural network, so as to obtain an image segmentation result.
S128: adjusting parameter values in the first feature extraction module, the second feature extraction module and the fusion processing module based on the comparison between the first segmentation result, the second segmentation result and the image segmentation result and the mask sample frame data respectively;
s129: and obtaining the target model until the training of the image segmentation model is completed.
In the training process, a first loss function is calculated by comparing the first segmentation result with the mask sample frame data, a second loss function is calculated by comparing the second segmentation result with the mask sample frame data, and a third loss function is calculated by comparing the image segmentation result with the mask sample frame data; the three loss functions are adjusted continuously during training until training is completed.
In the above embodiment, the first segmentation result and the second segmentation result are obtained from the first sample feature map and the second sample feature map respectively, implemented in practice by a softmax function. After the first and second segmentation results are obtained, they are compared with the mask sample frame data respectively, so that the first and second feature extraction modules can be supervised. When the image segmentation model is trained, segmentation results are generated from the features output by the first and second feature extraction modules and their respective loss functions are calculated; this supervises each feature extraction module to learn its own information, and makes the trained target model more stable.
The first segmentation result and the second segmentation result are mapped to the image segmentation result, and the image segmentation result is supplemented and adjusted based on them, so that the image segmentation result is closer to the mask sample frame data and a better training effect is obtained.
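Putting the three comparisons together, one training step could compute a loss for each segmentation result and combine them. The sketch below assumes cross-entropy losses and a fixed auxiliary weight; the patent only states that three loss functions are calculated against the mask sample frame data and adjusted jointly.

```python
import torch.nn.functional as F

def training_loss(first_logits, second_logits, fused_logits, mask_target, aux_weight=0.4):
    """mask_target: (N, H, W) integer mask from the mask sample frame data.
    Cross-entropy and the auxiliary weight of 0.4 are assumptions."""
    def seg_loss(logits):
        # Resize logits to the mask resolution before comparing.
        logits = F.interpolate(logits, size=mask_target.shape[-2:],
                               mode="bilinear", align_corners=False)
        return F.cross_entropy(logits, mask_target)
    loss_main = seg_loss(fused_logits)   # third loss: fused image segmentation result vs mask
    loss_aux1 = seg_loss(first_logits)   # first loss: first segmentation result vs mask
    loss_aux2 = seg_loss(second_logits)  # second loss: second segmentation result vs mask
    return loss_main + aux_weight * (loss_aux1 + loss_aux2)
```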
Specifically, the fusing the first sample feature map and the second sample feature map, referring to fig. 6, includes the following steps:
S127-1: up-sampling the first sample feature map until the image mapped by the first sample feature map is consistent in size with the image mapped by the second sample feature map, obtaining a processed first sample feature map;
Taking an image size of 320px × 320px in the sample frame data as an example, the down-sampled image size is 160px × 160px, the feature map output by mobilenetv2 is 10px × 10px, and the feature map output by the five-layer convolution network is 20px × 20px; the 10px × 10px feature map therefore needs to be up-sampled to 20px × 20px. Up-sampling and down-sampling are common technical means, so the effect is stable and the operation is convenient.
S127-2: obtaining a fused feature map based on the processed first sample feature map and the second sample feature map.
Specifically, after the processed first sample feature map and the second sample feature map are obtained, the features are fused using concat. Concat is an operation commonly used for feature union: it fuses features extracted by multiple convolutional neural networks, or fuses information from the output layers of convolutional neural networks.
In the above embodiment, fusing the first feature map and the second feature map allows features missing from either map to be supplemented; specifically, the second sample feature map may be adjusted or supplemented based on the processed first sample feature map, or the processed first sample feature map may be adjusted and supplemented based on the second sample feature map.
In the above embodiment, after the features are fused by concat, an adaptive attention mechanism is used for adjustment, that is, channel attention is added after the concat. Specifically, two layers of convolutional neural networks (CNNs) are used to learn the channel weights. Each CNN layer outputs a feature map of C (channel) × H (height) × W (width), and the concat yields a feature map of C × H × W. For each channel, global pooling is applied to the corresponding H × W spatial features to obtain the context information of that channel; the resulting 1 × C dimensional features are input into two consecutive fully connected layers to obtain the weight of each channel, and finally the channel weights are multiplied back onto the original channels. The fusion of the feature maps is adjusted by this adaptive attention mechanism, so that a more accurate fused feature map is obtained.
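A PyTorch sketch of such a fusion module, combining the up-sampling, the concat, the channel attention (global pooling followed by two fully connected layers whose outputs are multiplied back onto the channels) and the single-layer convolution head mentioned earlier; the channel sizes, reduction ratio and class count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Fusion processing module: upsample + concat + channel attention + 1x1 conv head."""
    def __init__(self, c_first=1280, c_second=128, num_classes=2, reduction=4):
        super().__init__()
        c = c_first + c_second
        self.fc = nn.Sequential(                 # two consecutive fully connected layers
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c), nn.Sigmoid(),
        )
        self.head = nn.Conv2d(c, num_classes, kernel_size=1)  # single-layer convolution

    def forward(self, feat_first, feat_second):
        # Up-sample the first feature map until it matches the second one in size.
        feat_first = F.interpolate(feat_first, size=feat_second.shape[-2:],
                                   mode="bilinear", align_corners=False)
        fused = torch.cat([feat_first, feat_second], dim=1)       # concat along channels
        w = self.fc(F.adaptive_avg_pool2d(fused, 1).flatten(1))   # global pooling -> channel weights
        fused = fused * w.unsqueeze(-1).unsqueeze(-1)             # multiply weights back per channel
        return self.head(fused), fused                            # (segmentation logits, fused map)
```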
Specifically, after the fused feature map is obtained, referring to fig. 7, the method includes the following steps:
S127-31: acquiring the previous frame sample frame data corresponding to the sample frame data in the training sample;
Specifically, the training sample comprises a plurality of mask sample frame data sets and the corresponding video segments; the training sample input into the image segmentation model is divided into a plurality of video segments, and the fused feature maps corresponding to adjacent frame data in any video segment can be obtained on this basis.
S127-32: acquiring the fused feature map corresponding to the previous frame of sample frame data;
S127-33: monitoring the fused feature map corresponding to the sample frame data based on the fused feature map corresponding to the previous frame of sample frame data.
In the above embodiment, the fused feature maps corresponding to the current sample frame data and the previous frame sample frame data are compared and a loss function is calculated; the fused feature map corresponding to the current sample frame data is adjusted according to the fused feature map corresponding to the previous frame, and the deviation between the two is prevented from becoming too large. This further improves the stability of the trained target model and reduces the influence of factors such as the lighting and angle of the image corresponding to the input frame data on the accuracy of the mask frame data when the target model is used.
It should be noted that, in this embodiment, the supervision between the previous frame sample frame data and the current frame sample data is implemented by adding an l2 penalty term to the loss function, that is, constraining certain parameters of the loss function to obtain the required effect.
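A minimal sketch of that penalty term, assuming it is an l2 (mean squared error) distance between the fused feature maps of consecutive sample frames, added to the training loss with a small weight; the weight value is an assumption.

```python
import torch.nn.functional as F

def temporal_penalty(fused_current, fused_previous, weight=0.1):
    """l2 penalty keeping the fused feature map of the current sample frame
    close to that of the previous sample frame (weight is an assumption)."""
    return weight * F.mse_loss(fused_current, fused_previous.detach())

# total_loss = training_loss(...) + temporal_penalty(fused_t, fused_t_minus_1)
```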
S2: acquiring a video stream, and processing the frame data in the video stream one by one by adopting the target model to acquire mask frame data.
Specifically, processing the frame data in the video stream one by one by adopting the target model to acquire mask frame data, referring to fig. 8, includes the following steps:
S21: acquiring at least one frame data in the video stream;
S22: after the frame data are subjected to down-sampling processing, inputting the down-sampled frame data into a feature extraction unit for feature extraction to obtain a first feature map;
S23: performing feature extraction on the frame data to obtain a second feature map;
S24: fusing the first feature map and the second feature map, and acquiring mask frame data based on the fused feature map.
In the above embodiment, the server side obtains the video stream and inputs the frame data in the video stream into the target model frame by frame. Taking a video frame data A as an example, to identify the portrait area in the image of video frame data A, the frame is down-sampled by 1/2 and features are extracted to obtain a first feature map, while features are also extracted directly from the frame without down-sampling to obtain a second feature map. The first feature map and the second feature map are processed with concat and channel attention to obtain a fused feature map, which is then processed by a single-layer convolution to identify the portrait area features and obtain the mask frame data. In this process, the down-sampled frame data passes through a lightweight convolutional neural network and the non-down-sampled frame data passes through a five-layer convolution network, so that the small-size frame data undergoes deep convolution and the large-size frame data undergoes shallow convolution. The features are thus obtained while the running speed is improved, meeting the needs of application in real-time scenarios.
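A sketch of that per-frame inference step, reusing the module names from the earlier sketches (which are illustrative, not the patent's API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def frame_to_mask(frame, heavy_branch, light_branch, fusion):
    """frame: (1, 3, H, W) tensor for one decoded video frame.
    Returns a binary mask at the frame resolution."""
    small = F.interpolate(frame, scale_factor=0.5, mode="bilinear",
                          align_corners=False)       # 1/2 down-sampling
    first_map = heavy_branch(small)                  # deep features, small spatial size
    second_map = light_branch(frame)                 # shallow features, larger spatial size
    logits, _ = fusion(first_map, second_map)        # concat + channel attention + 1x1 conv
    logits = F.interpolate(logits, size=frame.shape[-2:], mode="bilinear",
                           align_corners=False)
    return logits.argmax(dim=1)[0].to(torch.uint8)   # 1 = portrait area, 0 = background
```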
Fusing the first feature map and the second feature map, referring to fig. 9, includes the following steps:
S231: up-sampling the first feature map until the image mapped by the first feature map is consistent in size with the image mapped by the second feature map, obtaining a processed first feature map;
S232: obtaining a fused feature map based on the processed first feature map and the second feature map.
In the above embodiment, the first feature map and the second feature map need to be the same size for the fusion processing to be possible, so the first feature map obtained from the down-sampled frame data needs to be processed. After the processing of the first feature map is completed, the second feature map may be adjusted and supplemented based on the processed first feature map to obtain the fused feature map, or the processed first feature map may be reprocessed based on the second feature map to obtain the fused feature map.
After the mask frame data is generated, the mask frame data and the frame data in the video stream can be matched according to the timestamp and synthesized into a video stream carrying the mask frame data. The client obtains the video stream carrying the mask frame data together with the bullet screen stream and displays the bullet screen data on the display interface; where the bullet screen information passes through the area covered by the mask frame data, the mask frame data is displayed instead, that is, the bullet screen is hidden when it passes through the portrait area. This reduces the blocking of the live broadcast interface by the bullet screen during live broadcast and its impact on the user's viewing experience.
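A simplified sketch of how a playing end could use the mask while rendering: the bullet screen layer is composited onto the video only where the mask is not set, so the portrait area stays visible. The array shapes and alpha handling below are assumptions about the client implementation.

```python
import numpy as np

def composite(video_frame, barrage_layer, mask, barrage_alpha):
    """video_frame, barrage_layer: (H, W, 3) uint8; mask: (H, W), 1 inside the
    portrait area; barrage_alpha: (H, W) in [0, 1]. The barrage is hidden
    wherever the mask is set."""
    alpha = barrage_alpha * (mask == 0)          # suppress the barrage inside the mask
    alpha = alpha[..., None]                     # broadcast over the RGB channels
    out = video_frame * (1 - alpha) + barrage_layer * alpha
    return out.astype(np.uint8)
```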
A system for generating a live mask, as shown in fig. 10, includes:
the training unit 31 is configured to establish an image segmentation model, train the image segmentation model, and obtain a target model;
The training unit 31 further includes an obtaining module, configured to obtain a training sample;
and a training module, configured to train the image segmentation model according to the training sample to obtain the target model.
The generating unit 32 is configured to acquire a video stream, process frame data in the video stream one by using the target model, and acquire mask frame data.
The training unit 31 obtains a training sample and trains the image segmentation model;
the image segmentation model comprises a first feature extraction module, a second feature extraction module and a fusion processing module.
In the generation system of the live-broadcast mask, the training unit 31 trains the image segmentation model, and includes the following steps:
acquiring at least one sample pair of data in the training sample, wherein the sample pair of data comprises mask sample frame data and sample frame data in training video data corresponding to the mask sample frame data;
inputting the sample frame data into the image segmentation model;
adopting the first feature extraction module to perform down-sampling on the sample frame data, and performing feature extraction on the down-sampled sample frame data to obtain a first sample feature map;
performing feature extraction on the sample frame data by adopting a second feature extraction module to obtain a second sample feature map;
fusing the first sample characteristic diagram and the second sample characteristic diagram, and performing convolution processing on the fused characteristic diagram to obtain an image segmentation result;
mapping the first sample characteristic diagram to obtain a first segmentation result, and mapping the second sample characteristic diagram to obtain a second segmentation result;
adjusting parameter values in the first feature extraction module, the second feature extraction module and the fusion processing module based on the comparison between the first segmentation result, the second segmentation result and the image segmentation result and the mask sample frame data respectively;
and obtaining the target model until the training of the image segmentation model is completed.
The generating unit 32 generates mask frame data by the target model;
adopting the target model to process the frame data in the video stream one by one to obtain mask frame data, and comprising the following steps:
acquiring at least one frame data in the video stream;
after the frame data are subjected to down-sampling processing, inputting the down-sampled frame data into a feature extraction unit for feature extraction to obtain a first feature map;
performing feature extraction on the frame data to obtain a second feature map;
and fusing the first characteristic diagram and the second characteristic diagram, and performing convolution processing on the fused characteristic diagram to obtain mask frame data.
As shown in fig. 11, a computer device 4, the computer device 4 comprising:
a memory 41 for storing executable program code; and
and a processor 42 for calling the executable program code in the memory 41, wherein the execution steps include the above-mentioned generation method of the live mask.
One processor 42 is illustrated in fig. 11.
The memory 41, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules (e.g., the training unit 31 and the generating unit 32 shown in fig. 10) corresponding to the live mask generation method in the embodiment of the present application. The processor 42 executes various functional applications and data processing of the computer device 4, namely, implements the method for generating the live mask of the above-described method embodiment, by running the nonvolatile software program, instructions and modules stored in the memory 41.
The memory 41 may include a program storage area and a data storage area, wherein the program storage area may store an application program required for at least one function of the operating system; the storage data area may store skin data information of the user at the computer device 4. Further, the memory 41 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 41 optionally includes memory 41 remotely located from the processor 42, and these remote memories 41 may be connected to the generation system of the live mask over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 41, and when executed by the one or more processors 42, perform the method for generating a live mask in any of the above-described method embodiments, for example, perform the above-described method steps S1 to S2 in fig. 2, S11 to S12 in fig. 3, S10-1 to S10-3 in fig. 4, S121 to S129 in fig. 5, S127-1 to S127-2 in fig. 6, S127-31 to S127-33 in fig. 7, S21 to S24 in fig. 8, and S231 to S232 in fig. 9, to implement the functions of the training unit 31 and the generating unit 32 shown in fig. 10.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The computer device 4 of the embodiments of the present application exists in various forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include: smart phones (e.g., iphones), multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as ipads.
(3) A portable entertainment device: such devices can display and play multimedia content. This type of device comprises: audio, video players (e.g., ipods), handheld game consoles, electronic books, and smart toys and portable car navigation devices.
(4) A server: a device that provides computing services, comprising a processor, a hard disk, memory, a system bus and the like. A server is similar to a general-purpose computer in architecture, but has higher requirements on processing capacity, stability, reliability, security, scalability, manageability and the like, because it needs to provide highly reliable services.
(5) And other electronic devices with data interaction functions.
Embodiments of the present application provide a non-transitory computer-readable storage medium storing computer-executable instructions, which are executed by one or more processors, such as one processor 42 in fig. 11, and enable the one or more processors 42 to perform the method for generating a live mask in any of the method embodiments, such as performing method steps S1 to S2 in fig. 2, method steps S11 to S12 in fig. 3, method steps S10-1 to S10-3 in fig. 4, method steps S121 to S129 in fig. 5, method steps S127-1 to S127-2 in fig. 6, method steps S127-31 to S127-33 in fig. 7, method steps S21 to S24 in fig. 8, method steps S231 to S232 in fig. 9, the functions of the training unit 31 and the generating unit 32 shown in fig. 10 are realized.
Based on the above embodiment, the first practical application process includes:
The anchor broadcasts live through a live broadcast end, and the live broadcast end sends the video stream to the server. The server processes the frame data in the received video stream frame by frame: the frame data passes through the target model, features are extracted after down-sampling to obtain a first feature map, features are extracted from the original frame data to obtain a second feature map, and the first feature map and the second feature map are fused to generate mask frame data of the portrait area. The mask frame data and the video frame data are matched according to the timestamp to generate a video stream carrying the mask. The user end sends a bullet screen stream to the server, and the server pushes the video stream carrying the mask and the bullet screen stream to a CDN node. The user end obtains the video stream carrying the mask and the bullet screen stream from the CDN node, so that the video frame data, the corresponding mask frame data and the corresponding bullet screen information are displayed on the display screen simultaneously at the moment corresponding to the timestamp. When the bullet screen information passes through the mask frame data, the mask frame data is displayed, preventing the bullet screen information from blocking the portrait area and affecting the user experience.
Based on the above embodiment, the second practical application process includes:
A certain platform holds an offline event: it records real-time video by on-site shooting and transmits it to the on-site display screen while also live broadcasting online. The server side obtains the real-time video stream and inputs the frame data in the video stream into the target model to obtain mask frame data, then matches the mask frame data with the video frame data according to the timestamp to generate a video stream carrying the mask. The server side obtains the bullet screen information of the watching user ends and pushes the video stream carrying the mask and the bullet screen stream to a CDN node. The user end obtains the video stream carrying the mask and the bullet screen stream from the CDN node, so that the video frame data, the corresponding mask frame data and the corresponding bullet screen information are displayed and played on the display screen simultaneously at the moment corresponding to the timestamp. When the bullet screen information passes through the mask frame data, the mask frame data is displayed, preventing the bullet screen information from blocking the portrait area and affecting the user experience.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A method for generating a live broadcast mask is characterized by comprising the following steps:
establishing an image segmentation model, and training the image segmentation model to obtain a target model;
and acquiring a video stream, and processing frame data in the video stream one by one by adopting the target model to acquire mask frame data.
2. The method for generating a live mask according to claim 1, wherein:
the image segmentation model comprises a first feature extraction module, a second feature extraction module and a fusion processing module.
3. The method of generating a live mask as claimed in claim 2, wherein:
the training of the image segmentation model to obtain the target model comprises the following steps:
obtaining a training sample;
and training an image segmentation model according to the training sample to obtain a target model.
4. The method of generating a live mask as claimed in claim 3, wherein:
before the training sample is obtained, the method comprises the following steps:
acquiring training video data, and acquiring example mask data corresponding to a time stamp of the video data by adopting an example segmentation model;
screening the example mask data, and filtering abnormal mask frame data to obtain at least one mask sample frame data set;
and matching the mask sample frame data set with sample frame data in corresponding training video data according to the time stamp to generate a training sample.
5. The method of generating a live mask as claimed in claim 3, wherein:
the method for training the image segmentation model according to the training sample to obtain the target model comprises the following steps:
acquiring at least one sample pair of data in the training sample, wherein the sample pair of data comprises mask sample frame data and sample frame data in training video data corresponding to the mask sample frame data;
inputting the sample frame data into the image segmentation model;
adopting the first feature extraction module to perform down-sampling on the sample frame data, and performing feature extraction on the down-sampled sample frame data to obtain a first sample feature map;
performing feature extraction on the sample frame data by adopting a second feature extraction module to obtain a second sample feature map;
fusing the first sample characteristic diagram and the second sample characteristic diagram, and processing the fused characteristic diagram to obtain an image segmentation result;
mapping the first sample characteristic diagram to obtain a first segmentation result, and mapping the second sample characteristic diagram to obtain a second segmentation result;
adjusting parameter values in the first feature extraction module, the second feature extraction module and the fusion processing module based on the comparison between the first segmentation result, the second segmentation result and the image segmentation result and the mask sample frame data respectively;
and obtaining the target model until the training of the image segmentation model is completed.
6. The method of generating a live mask as claimed in claim 5, wherein:
the fusing the first sample feature map and the second sample feature map comprises the following steps:
up-sampling the first sample characteristic diagram until the first sample characteristic diagram mapping image is consistent with the second sample characteristic diagram mapping image in size, and obtaining a processed first sample characteristic diagram;
and obtaining a fused feature map based on the processed first sample feature map and the second sample feature map.
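Under one reading of claims 5 and 6, the two branch outputs and the fused output are each compared against the mask sample, and the fusion itself is the upsample-then-concatenate step already shown in the sketch after claim 2. A hedged training-step sketch along those lines follows; the binary cross-entropy losses, their equal weighting, and the optimizer usage are illustrative assumptions rather than details from the patent.

    import torch.nn.functional as F

    def training_step(model, optimizer, frame, mask):
        # frame: (N, 3, H, W) float tensor; mask: (N, 1, H, W) float tensor in {0, 1}
        seg1, seg2, seg_fused, _ = model(frame)          # model as sketched after claim 2
        mask_small = F.interpolate(mask, size=seg1.shape[-2:], mode="nearest")
        loss = (F.binary_cross_entropy_with_logits(seg1, mask_small)    # first segmentation result vs mask
                + F.binary_cross_entropy_with_logits(seg2, mask)        # second segmentation result vs mask
                + F.binary_cross_entropy_with_logits(seg_fused, mask))  # fused segmentation result vs mask
        optimizer.zero_grad()
        loss.backward()      # adjusts parameters of both extraction modules and the fusion processing module
        optimizer.step()
        return loss.item()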
7. The method of generating a live mask of claim 6,
after the fused feature map is obtained, the method comprises the following steps:
acquiring, from the training sample, the previous frame of sample frame data corresponding to the sample frame data;
acquiring the fused feature map corresponding to the previous frame of sample frame data;
and supervising the fused feature map corresponding to the sample frame data based on the fused feature map corresponding to the previous frame of sample frame data.
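Claim 7's use of the previous frame's fused feature map to supervise the current one could, for example, be realized as a consistency penalty added to the training loss; the L2 form, the detach on the previous features, and the weight below are assumptions for illustration only.

    import torch.nn.functional as F

    def temporal_consistency_loss(fused_current, fused_previous, weight=0.1):
        # penalize large changes between the fused feature maps of consecutive
        # sample frames; gradients flow only through the current frame's features
        return weight * F.mse_loss(fused_current, fused_previous.detach())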
8. The method for generating a live mask according to claim 1, wherein:
the processing of the frame data in the video stream one by one using the target model to obtain mask frame data comprises the following steps:
acquiring at least one frame data in the video stream;
after the frame data are subjected to down-sampling processing, inputting the down-sampled frame data into a feature extraction unit for feature extraction to obtain a first feature map;
performing feature extraction on the frame data to obtain a second feature map;
and fusing the first feature map and the second feature map, and acquiring mask frame data based on the fused feature map.
9. The method of generating a live mask as claimed in claim 8, wherein:
the fusing the first feature map and the second feature map comprises the following steps:
up-sampling the first feature map until the image mapped by the first feature map is consistent in size with the image mapped by the second feature map, to obtain a processed first feature map;
and obtaining a fused feature map based on the processed first feature map and the second feature map.
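At inference time (claims 8 and 9) the same two-branch model can be applied to the decoded video stream frame by frame; the sketch below uses OpenCV for decoding and a 0.5 threshold on the fused output, both of which are illustrative choices rather than details from the patent.

    import cv2
    import torch

    def generate_mask_frames(model, video_path, threshold=0.5):
        # decode the stream frame by frame and yield one binary mask per frame,
        # using the fused output of the model sketched after claim 2
        model.eval()
        cap = cv2.VideoCapture(video_path)
        with torch.no_grad():
            while True:
                ok, frame = cap.read()                   # frame: H x W x 3, BGR, uint8
                if not ok:
                    break
                rgb = frame[:, :, ::-1].copy()           # BGR -> RGB
                x = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
                _, _, seg_fused, _ = model(x)            # fused segmentation result
                mask = torch.sigmoid(seg_fused)[0, 0] > threshold
                yield mask.numpy().astype("uint8") * 255
        cap.release()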
10. A computer-readable storage medium having stored thereon a computer program, characterized in that:
the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 9.
11. A computer device, characterized by: the computer device includes:
a memory for storing executable program code; and
a processor for calling the executable program code in the memory to perform the steps of the method for generating a live broadcast mask according to any one of claims 1 to 9.
CN201910862862.8A 2019-09-12 2019-09-12 Live broadcast mask generation method, readable storage medium and computer equipment Active CN112492323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910862862.8A CN112492323B (en) 2019-09-12 2019-09-12 Live broadcast mask generation method, readable storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910862862.8A CN112492323B (en) 2019-09-12 2019-09-12 Live broadcast mask generation method, readable storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN112492323A true CN112492323A (en) 2021-03-12
CN112492323B CN112492323B (en) 2022-07-19

Family

ID=74920659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910862862.8A Active CN112492323B (en) 2019-09-12 2019-09-12 Live broadcast mask generation method, readable storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN112492323B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1678370A (en) * 2002-07-02 2005-10-05 特兰施钮罗尼克斯股份有限公司 Gastric stimulator apparatus and method for installing
WO2018036293A1 (en) * 2016-08-26 2018-03-01 杭州海康威视数字技术股份有限公司 Image segmentation method, apparatus, and fully convolutional network system
CN108345827A (en) * 2017-01-24 2018-07-31 富士通株式会社 Identify method, system and the neural network in document direction
CN107240066A (en) * 2017-04-28 2017-10-10 天津大学 Image super-resolution rebuilding algorithm based on shallow-layer and deep layer convolutional neural networks
CN107909093A (en) * 2017-10-27 2018-04-13 浙江大华技术股份有限公司 A kind of method and apparatus of Articles detecting
CN108197584A (en) * 2018-01-12 2018-06-22 武汉大学 A kind of recognition methods again of the pedestrian based on triple deep neural network
CN108268870A (en) * 2018-01-29 2018-07-10 重庆理工大学 Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study
CN109583517A (en) * 2018-12-26 2019-04-05 华东交通大学 A kind of full convolution example semantic partitioning algorithm of the enhancing suitable for small target deteection
CN109886073A (en) * 2018-12-26 2019-06-14 深圳云天励飞技术有限公司 A kind of image detecting method and device
CN109862414A (en) * 2019-03-22 2019-06-07 武汉斗鱼鱼乐网络科技有限公司 A kind of masking-out barrage display methods, device and server

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034648A (en) * 2021-04-30 2021-06-25 北京字节跳动网络技术有限公司 Image processing method, device, equipment and storage medium
CN113949922A (en) * 2021-10-14 2022-01-18 海南车智易通信息技术有限公司 Mask picture generation method, computing device and storage medium

Also Published As

Publication number Publication date
CN112492323B (en) 2022-07-19

Similar Documents

Publication Publication Date Title
US20200404219A1 (en) Immersive interactive remote participation in live entertainment
US10270825B2 (en) Prediction-based methods and systems for efficient distribution of virtual reality media content
US10681391B2 (en) Computerized system and method for automatic highlight detection from live streaming media and rendering within a specialized media player
US9274595B2 (en) Coherent presentation of multiple reality and interaction models
CN110188719B (en) Target tracking method and device
CN114746159B (en) Artificial Intelligence (AI) controlled camera view generator and AI broadcaster
Feng et al. LiveDeep: Online viewport prediction for live virtual reality streaming using lifelong deep learning
US20130238778A1 (en) Self-architecting/self-adaptive model
CN114746158B (en) Artificial Intelligence (AI) controlled camera view generator and AI broadcaster
US11451858B2 (en) Method and system of processing information flow and method of displaying comment information
CN112492323B (en) Live broadcast mask generation method, readable storage medium and computer equipment
WO2018205643A1 (en) Method and apparatus for determining quality of experience of vr multi-media
CN114463470A (en) Virtual space browsing method and device, electronic equipment and readable storage medium
CN108421240A (en) Court barrage system based on AR
CN114139491A (en) Data processing method, device and storage medium
JP6721727B1 (en) Information processing apparatus control program, information processing apparatus control method, and information processing apparatus
WO2017087641A1 (en) Recognition of interesting events in immersive video
US11134310B1 (en) Custom content service
CN113408452A (en) Expression redirection training method and device, electronic equipment and readable storage medium
US20240062456A1 (en) Variable update adaptation and simulation delay in multiuser virtual reality application
WO2022249536A1 (en) Information processing device and information processing method
US20240037859A1 (en) Use of 3d/ai models to generate 3d representations of video stream users based on scene lighting not satisfying one or more criteria
CN208212471U (en) Court barrage system based on AR
CN114172953A (en) Cloud navigation method of MR mixed reality scenic spot based on cloud computing
CN116783894A (en) Method and system for reconciling uncoordinated content by data filtering and synchronization based on multi-modal metadata to generate a composite media asset

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant