CN113313730B - Method and device for acquiring image foreground area in live scene - Google Patents
- Publication number: CN113313730B (application CN202110853914.2A)
- Authority
- CN
- China
- Prior art keywords
- foreground
- mask
- image frame
- segmentation
- skin color
- Prior art date
- Legal status: Active
Classifications
- G06T7/194: Segmentation; edge detection involving foreground-background segmentation
- G06T7/11: Region-based segmentation
Abstract
The application relates to a method and a device for acquiring an image foreground region in a live-broadcast scene. The method includes: downsampling an original image frame to scale it to a thumbnail; performing foreground segmentation on the thumbnail according to preset foreground segmentation region parameters to form a first foreground segmentation mask; enlarging the first foreground segmentation mask to form a mask of the original image frame; and performing a dot multiplication operation on the original image frame and the mask of the original image frame to obtain the image foreground region. Through downsampling and a preset segmentation region, the method can segment video image frames rapidly and meet the real-time requirement of a live scene.
Description
Technical Field
The present application relates to the field of online live broadcasting, and in particular, to a method and an apparatus for acquiring a foreground region of an image in a live broadcast scene.
Background
In an online classroom or live-video scene, a user may need to hide the background of the live room for personal privacy, or to replace the background because no green screen is available. Both require extracting (matting out) the foreground region, and the mainstream technique with the best results is based on deep-learning algorithms.
However, training a deep-learning model requires a large amount of high-quality labeled data, and inference is computationally intensive, placing very high demands on hardware performance, especially the GPU (Graphics Processing Unit) configuration. Models are therefore mostly deployed on the server side, which is unfavorable for protecting data privacy. Taking the deployment requirements of the Ali (Alibaba) portrait-matting AI (Artificial Intelligence) model product as an example, the minimum GPU required for deployment is an NVIDIA GTX 1060 or better, and an ordinary personal computer in a live-broadcast scene cannot meet such high hardware requirements.
Disclosure of Invention
For an individual user in a live-broadcast scene, the matting effect need not be perfectly accurate, but the computational load must be reduced to what an ordinary personal computer (PC) can bear. With this application requirement in mind, the application provides an image foreground region acquisition technique for real-time live scenes whose computational cost is low enough for deployment on a personal PC.
The application provides a technical solution for acquiring the image foreground region in a live scene. Under the premise of local deployment in a personal, consumer-grade PC CPU environment, it solves the problems that common image segmentation algorithms take too long to process a single video frame at high resolution and cannot meet the real-time requirement of a live scene.
According to a first aspect of the present invention, there is provided a method for acquiring a foreground region of an image in a live scene, including:
downsampling an original image frame to scale it to a thumbnail;
performing foreground segmentation on the thumbnail according to preset foreground segmentation region parameters to form a first foreground segmentation mask;
enlarging the first foreground segmentation mask to form a mask of the original image frame; and
performing a dot multiplication operation on the original image frame and the mask of the original image frame to obtain the image foreground region.
According to a second aspect of the present invention, there is provided an apparatus for acquiring a foreground region of an image in a live scene, including:
a scaling unit, configured to downsample an original image frame and scale it to a thumbnail;
a foreground segmentation unit, configured to perform foreground segmentation on the thumbnail according to preset foreground segmentation region parameters to form a first foreground segmentation mask;
an enlarging unit, configured to enlarge the first foreground segmentation mask to form a mask of the original image frame; and
an image foreground region acquisition unit, configured to perform a dot multiplication operation on the original image frame and the mask of the original image frame to obtain the image foreground region.
According to a third aspect of the present invention, there is provided an electronic apparatus comprising:
a processor; and
a memory storing computer instructions which, when executed by the processor, cause the processor to perform the method of the first aspect.
According to a fourth aspect of the present invention, there is provided a non-transitory computer storage medium storing a computer program which, when executed by one or more processors, causes the processor(s) to perform the method of the first aspect.
According to the method, device, electronic equipment and non-transitory computer storage medium for acquiring the image foreground region in a live scene: first, video image frames can be segmented rapidly by downsampling and presetting the segmentation region, meeting the real-time requirement of a live scene. Second, skin color detection, motion estimation and the like are added, and the segmentation result is adjusted by feedback, yielding higher stability and inter-frame consistency. Moreover, the scheme has low computing-power requirements, supports local deployment in a personal, consumer-grade PC CPU environment, and meets the real-time requirement. Finally, compared with traditional ML models and traditional image segmentation algorithms, the scheme has higher segmentation accuracy and resolves the inter-frame consistency and stability shortcomings of traditional segmentation algorithms.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below represent only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without departing from the protection scope of the present application.
Fig. 1 is a flowchart of a method for acquiring a foreground region of an image in a live scene according to an embodiment of the present invention.
FIG. 2 shows a graph of the segmentation effect of WaterShed + Canny scheme.
FIG. 3 shows a graph of the segmentation effect of the K-Means scheme.
Fig. 4 shows a segmentation effect diagram of the MOG2 scheme.
Fig. 5 shows a graph of the segmentation effect of the scheme of the present application.
FIG. 6 is a graph showing the effect of Mask-RCNN on segmentation.
Fig. 7 is a schematic diagram of an apparatus for acquiring a foreground region of an image in a live scene according to an embodiment of the present invention.
Fig. 8 is a block diagram of an electronic device provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
According to the scheme of the application, the original image is downsampled so that processing of a high-resolution original image becomes processing of a low-resolution thumbnail, and the segmentation region is preset according to the characteristics of live broadcasting, so rapid image processing is achievable in a personal, consumer-grade PC CPU environment. In addition, skin color detection, motion estimation and the like are added, and the segmentation result is adjusted by feedback to obtain higher stability and inter-frame consistency. Finally, image morphology operations are performed on the segmented image mask to enhance the visual effect of the processed image.
It should be noted that the live scenes described in the present application include real-time scenes such as live video and online classroom.
According to one aspect of the invention, a method for acquiring an image foreground region in a live scene is provided. Fig. 1 is a flowchart of a method for acquiring a foreground region of an image in a live scene according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps.
In step S101, downsampling processing is performed on an original image frame to scale the original image frame to a thumbnail.
The original image frame is scaled to a thumbnail by downsampling, and subsequent processing is performed on the thumbnail. Suitable downsampling methods include cubic interpolation, among others.
Since the original image frame with high resolution is converted into the thumbnail with low resolution, the complexity and the time length of image processing are reduced.
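As a rough illustration of why step S101 pays off, the sketch below downsamples a 480P frame by a factor of 4, shrinking the pixel count 16x before segmentation. It uses a simple block-average in NumPy as a stand-in for the cubic interpolation the patent mentions (a real implementation would more likely call `cv2.resize(..., interpolation=cv2.INTER_CUBIC)`); the function name and factor are illustrative assumptions.

```python
import numpy as np

def downsample(frame: np.ndarray, factor: int) -> np.ndarray:
    """Block-average downsample as a stand-in for cubic interpolation.
    Reduces the pixel count by factor**2, which is what makes the
    later segmentation cheap enough for a consumer CPU."""
    h, w = frame.shape[:2]
    h2, w2 = h - h % factor, w - w % factor          # crop to a multiple of factor
    blocks = frame[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor, -1)
    return blocks.mean(axis=(1, 3)).astype(frame.dtype)

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
thumb = downsample(frame, 4)        # 480x640 -> 120x160: 16x fewer pixels to segment
```

All subsequent steps (GrabCut, skin and motion detection) would then operate on `thumb` rather than `frame`.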
And S102, performing foreground segmentation on the thumbnail according to preset foreground segmentation area parameters to form a first foreground segmentation mask.
Many image segmentation algorithms exist at present; among the better-performing foreground segmentation algorithms is the GrabCut algorithm.
1. GrabCut algorithm overview
GrabCut is an image segmentation method based on graph cuts. It has proven effective in practice and is included in the image-processing library OpenCV. However, owing to the characteristics of the algorithm, computation on large-pixel-size images is time-consuming, so it cannot be used directly in live scenes with strict real-time requirements.
2. GrabCut algorithm principle
Starting from a user-specified bounding box around the object to be segmented, the algorithm estimates the color distributions of the target object and the background with a Gaussian mixture model. These distributions are used to build a Markov random field over the pixel labels, with an energy function that prefers connected regions to share the same label, and a graph-cut-based optimization is run to infer the labels. Since this estimate is likely more accurate than the original bounding-box estimate, the two-step procedure is repeated until convergence. Because the segmentation process requires solving a Gaussian mixture model, the larger the pixel size, the higher the time consumption, and this time consumption grows exponentially.
3. GrabCut algorithm flow
(1) The user defines a foreground rectangular region in the current image; its interior must contain the foreground object to be segmented. Everything outside the rectangle is treated by default as background, while the interior contains the foreground and background to be distinguished;
(2) and performing foreground and background modeling by using a Gaussian Mixture Model (GMM), wherein an absolute probability pixel area which can be calculated by the GMM is marked as a determined foreground and a determined background, and an undefined pixel area is marked as a possible foreground and a possible background.
(3) Each pixel is regarded as being connected with surrounding pixels through a virtual edge, and the probability that each edge belongs to the foreground and the background is adjusted based on the color similarity with the surrounding pixels. During the calculation, each pixel is connected to a foreground or background node.
In steps (2) and (3), the GMM is solved for each pixel, and the result is a value in the closed interval [0, 1], i.e., the probability that the pixel is foreground. If the probability is greater than a set threshold, the pixel is judged a possible foreground; if smaller, a possible background. The solution result for a determined foreground is 1 and for a determined background is 0; the manually set external default background region is automatically initialized to 0 at initialization, i.e., determined background.
(4) After the nodes are connected, edges that join a foreground node to a background node are cut, completing the segmentation at those positions.
(5) And completing the calculation of all the pixels, and completing the segmentation of the whole image.
4. GrabCut algorithm deficiencies
(1) The algorithm requires interactively defining a rectangular region. Manual frame-by-frame interaction is impossible in real-time live video, so the problem of selecting an initial foreground region must be solved;
(2) the algorithm performs Gaussian modeling for every pixel, and the computational load grows exponentially with image size; applying it to real-time live video requires solving the problem of long computation times on high-resolution images;
(3) the segmentation of different frames can differ. When applied to a live video scene, excessive inter-frame differences appear as visually unstable flicker, so the inter-frame flicker problem must be solved in practice.
In summary, because the GrabCut algorithm is time-consuming on large images and requires an interactively defined rectangular region, it is not directly suitable for a consumer-grade PC CPU environment or a live scene.
In the present application, the original image frame is scaled to a thumbnail, which reduces the complexity and time of GrabCut processing and makes the scheme applicable to a consumer-grade PC CPU environment. Furthermore, during live broadcasting the host stays within the central region of the picture most of the time, i.e., the probability that the foreground segmentation target reaches an edge position is extremely low. Preset foreground segmentation region parameters therefore tell the GrabCut algorithm which region or regions are the initial foreground, for example that the central region of each original image frame is the initial foreground region, so that the edge region can be selected by default as the background region for initializing the parameters of the Gaussian model. Because the foreground segmentation region parameters are preset, GrabCut needs no interaction when segmenting each image frame, so GrabCut-based image foreground segmentation can be applied in a live environment.
Step S103, carrying out interpolation processing on the first foreground segmentation mask to form a mask of an original image frame.
After the thumbnail has been segmented, interpolation is applied to the resulting first foreground segmentation mask to enlarge it to the original image size, forming the mask of the original image frame. Suitable interpolation methods include cubic interpolation, among others.
The real-time requirement dominates in a live scene, and fast segmentation matters more than pixel-level precision. Therefore, before segmentation the image is scaled and converted to floating point, segmentation is performed on the scaled image, and a probability distribution map of foreground and background is obtained. This probability map is at the reduced size and must be interpolated and enlarged back to the original image size.
And step S104, performing dot product operation on the original image frame and the mask of the original image frame to obtain an image foreground area.
And carrying out dot product operation on the original image frame and the mask of the original image frame, and matting the foreground image area to obtain the image foreground area.
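Steps S103 and S104 can be sketched together: enlarge the thumbnail probability mask back to frame size and multiply it element-wise with the original frame. Nearest-neighbour repetition stands in for the cubic interpolation the patent names (a real build would use `cv2.resize`); the function name and shapes are illustrative assumptions.

```python
import numpy as np

def apply_mask(frame: np.ndarray, prob_mask: np.ndarray, factor: int) -> np.ndarray:
    """Enlarge the thumbnail probability mask by `factor` (nearest-neighbour
    repeat as a stand-in for cubic interpolation), then dot-multiply it with
    the original frame to cut out the foreground region."""
    full = np.repeat(np.repeat(prob_mask, factor, axis=0), factor, axis=1)
    full = full[:frame.shape[0], :frame.shape[1]]     # crop any overshoot
    return (frame * full[..., None]).astype(frame.dtype)

frame = np.full((8, 8, 3), 200, dtype=np.uint8)       # toy "original frame"
mask = np.zeros((4, 4), dtype=np.float32)             # toy thumbnail mask
mask[1:3, 1:3] = 1.0                                  # foreground in the centre
fg = apply_mask(frame, mask, 2)                       # background pixels go to 0
```

Background pixels are multiplied by 0 and foreground pixels by 1, which is the dot-multiplication matting described above.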
After the image foreground region is obtained, various operations may be performed on the image foreground region, for example, the image foreground region may be superimposed on the background image, and an output image with which the background replacement is completed may be obtained.
The embodiment illustrated in Fig. 1 shows how to rapidly segment the image foreground region of a live scene in a personal, consumer-grade PC CPU environment, meeting the real-time requirement. However, the segmentation process can still exhibit problems such as inter-frame flicker, aliasing, and abrupt foreground-background transitions. The method for acquiring the image foreground region in a live scene therefore further includes the following steps.
A motion state detection step: and executing motion state detection on the thumbnail to obtain the ratio of the motion pixels in the thumbnail.
After the thumbnail is obtained in step S101, it may be converted to a single channel, i.e., from RGB to a gray-scale image. A frame-difference method is applied to the gray-scale image, and Gaussian filtering with a specific threshold parameter is applied to the difference image to suppress the noise of small motion traces, so that only large-scale motion is monitored.
During motion-state detection, each pixel can be assigned a value (non-zero for moving pixels), and the number of non-zero pixels is then counted to obtain the ratio of moving pixels in the thumbnail.
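The motion-state detection step can be sketched as a frame difference followed by a threshold and a count. The threshold value and function name are illustrative assumptions; the patent's additional Gaussian filtering of the difference image is omitted here for brevity.

```python
import numpy as np

MOTION_THRESHOLD = 25   # illustrative grey-level threshold, not from the patent

def motion_ratio(prev_gray: np.ndarray, cur_gray: np.ndarray,
                 threshold: int = MOTION_THRESHOLD) -> float:
    """Frame-difference motion detection on the greyscale thumbnail:
    pixels whose absolute difference exceeds the threshold count as
    moving; return the fraction of such pixels."""
    diff = np.abs(cur_gray.astype(np.int16) - prev_gray.astype(np.int16))
    moving = diff > threshold
    return float(moving.sum() / moving.size)

prev = np.zeros((4, 4), dtype=np.uint8)
cur = prev.copy()
cur[:2, :] = 255                     # the top half of the thumbnail "moves"
```

The returned ratio is later compared with the preset ratio in steps S6 and S7.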
Skin color detection: and carrying out skin color detection on the thumbnail to form a first skin color mask.
Segmenting a small-size image inevitably reduces precision, and faces may be classified as background. This is a serious problem, because in a live scene the face region should almost always be foreground. To limit the computational load, only skin color detection is performed rather than full face recognition, and the skin color region is corrected before and after segmentation. The skin color detection technique adopts an elliptical skin color model; combined with a split K-means clustering approach, it is more robust to changes in ambient illumination, noise interference, and the like.
A skin color mask morphology processing step: and performing image morphology operation on the first skin color mask to form a second skin color mask.
Because skin color detection misses non-skin facial areas such as the eyes and nostrils, an image morphology opening operation is performed on the first skin color mask to connect these non-skin facial areas; the face region is treated by default as determined foreground during the live broadcast, and a second skin color mask is formed.
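The two steps above can be sketched as a skin test plus a hole-filling morphology pass. The Cr/Cb ranges are a commonly used simplification standing in for the patent's elliptical skin model, and a morphological closing (dilate then erode) is used here as the standard way to fill eye/nostril holes; all names and values are illustrative assumptions, not the patent's exact operations.

```python
import numpy as np

def skin_mask(ycrcb: np.ndarray) -> np.ndarray:
    """Simplified stand-in for the elliptical skin-colour model: accept
    pixels whose (Cr, Cb) fall inside a commonly used skin range."""
    cr, cb = ycrcb[..., 1].astype(int), ycrcb[..., 2].astype(int)
    return ((cr > 133) & (cr < 173) & (cb > 77) & (cb < 127)).astype(np.uint8)

def _neighbourhood(mask, op, pad_value):
    # 3x3 neighbourhood max (dilation) or min (erosion) via shifted views
    p = np.pad(mask, 1, constant_values=pad_value)
    h, w = mask.shape
    views = [p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
             for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return op(np.stack(views), axis=0)

def close(mask: np.ndarray) -> np.ndarray:
    """Morphological closing: dilation then erosion. Fills small holes
    such as eyes and nostrils so the whole face is one skin region."""
    return _neighbourhood(_neighbourhood(mask, np.max, 0), np.min, 1)

face = np.ones((5, 5), dtype=np.uint8)   # toy skin blob
face[2, 2] = 0                           # an "eye" hole inside it
filled = close(face)                     # second skin colour mask

px = np.zeros((1, 1, 3), dtype=np.uint8)
px[0, 0] = (120, 150, 100)               # (Y, Cr, Cb) inside the skin range
```

A production build would instead call `cv2.morphologyEx` with a suitable structuring element.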
It should be noted that the motion state detection step and the skin color detection step are two independent processing steps, and they may be executed successively or in parallel.
After step S102, on the basis of the skin color detection and the morphological processing of the skin color mask, the following processing is further performed on the formed first foreground segmentation mask:
s1: determining a skin color region and a non-skin color region in the first foreground segmentation mask according to the second skin color mask;
s2: and setting the skin color area as a foreground to form a corrected first foreground segmentation mask.
After the second skin color mask is obtained through skin color detection and morphological processing, the foreground segmentation mask is corrected according to the skin color and non-skin color regions indicated by the second skin color mask, making the image foreground segmentation result more accurate. The skin color region of the first foreground segmentation mask, as determined by the second skin color mask, is set as foreground, for example by setting the probability of its pixels to 1; the non-skin region is not modified, preserving the first foreground mask's existing judgment, for example the previous probability values of those pixels.
After the corrected first foreground segmentation mask is obtained, the following processing is further included:
s3: and calculating the foreground rate in the corrected first foreground segmentation mask.
And counting the number of foreground pixels in the corrected first foreground segmentation mask, and calculating to obtain the foreground rate in the corrected first foreground segmentation mask.
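Steps S1 through S3 can be sketched in a few lines: force skin pixels to probability 1, leave the rest untouched, and compute the foreground rate. The 0.5 foreground threshold and function name are illustrative assumptions.

```python
import numpy as np

def correct_and_rate(prob_mask: np.ndarray, skin: np.ndarray,
                     threshold: float = 0.5):
    """S1-S2: set detected skin regions to probability 1 (foreground)
    without touching non-skin probabilities. S3: the foreground rate is
    the fraction of pixels above a threshold (0.5 is illustrative)."""
    out = prob_mask.copy()
    out[skin > 0] = 1.0
    return out, float((out > threshold).mean())

prob = np.zeros((4, 4), dtype=np.float32)    # segmentation says "all background"
skin = np.zeros((4, 4), dtype=np.uint8)
skin[0, :] = 1                               # but the top row is detected skin
corrected, fg_rate = correct_and_rate(prob, skin)
```

The foreground rate is what step S5 later compares against the average of previous frames.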
After the corrected first foreground segmentation mask is obtained, the following processing procedures are further included:
s4: and performing image morphology operation on the corrected first foreground segmentation mask to form a second foreground segmentation mask.
An image morphology operation is performed on the corrected first foreground segmentation mask: small connected components are eliminated, the connected components of the human-body region are joined, and a dilation operation expands the edge of the corrected first foreground segmentation mask.
After the second foreground segmentation mask is obtained, the following processing procedures are further included:
s5: and calculating the offset of the foreground rate in the corrected first foreground segmentation mask relative to the average foreground rate of a plurality of continuous frames before.
S6: determining the second foreground segmentation mask as a current image frame foreground mask in response to the offset being smaller than a preset offset value or in response to the offset being greater than or equal to a preset offset value and the ratio of the moving pixels in the thumbnail being greater than or equal to a preset ratio; otherwise, the second foreground segmentation mask of the previous frame is determined as the foreground mask of the current image frame.
Specifically, an average foreground rate of a plurality of frames (for example, 5 frames) before the current frame is calculated, the foreground rate in the modified first foreground segmentation mask is compared with the average foreground rate to obtain a difference value therebetween, that is, an offset, if the difference value is not large, the difference value is smaller than a preset offset value, or the difference value is large, but at the same time, in the previous motion state detection step (the ratio of the motion pixels in the thumbnail is greater than or equal to a preset ratio), it is shown that the difference value is large due to large-amplitude motion, and the results of the two match, indicating that the current segmentation result is authentic, and the second foreground segmentation mask can be sampled. Otherwise, it indicates that there may be errors in the current segmentation result, then the foreground mask segmentation result of the previous frame is adopted.
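The acceptance logic of steps S5 and S6 reduces to a small predicate. The threshold values below are illustrative assumptions; the patent only says they are preset.

```python
def accept_mask(fg_rate: float, history_rates: list, motion_ratio: float,
                max_offset: float = 0.1, min_motion: float = 0.02) -> bool:
    """S5-S6: trust the current segmentation if its foreground rate is
    close to the average of the previous frames, OR if a large offset is
    explained by large-scale motion. Thresholds are illustrative."""
    offset = abs(fg_rate - sum(history_rates) / len(history_rates))
    return offset < max_offset or motion_ratio >= min_motion
```

When the predicate is false, the caller keeps the previous frame's foreground mask instead of the new one.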
To further reduce jitter, motion monitoring refers to the temporally preceding monitoring results to exclude some false detections. If there is almost no motion, that is, the ratio of moving pixels in the thumbnail is smaller than the preset ratio, the mask (represented as a probability distribution map, or probability mask) need not be updated, and the previous frame's mask result is reused. Considering that a moving object's contour changes little over several frames (for example, 5 frames) in a live scene, a mask calculated in one frame can be reused across several consecutive frames. In addition, the correction model retains a memory of the temporally preceding probability distribution maps and corrects the current frame's mask accordingly, so that foreground and background change more smoothly and softly over time, producing a visual fade-in/fade-out effect.
In addition, to handle local motion more finely, the original image can be divided into several small image units, for example using a 3 x 3 nine-square grid, with motion monitored independently in each unit. Even if motion monitoring goes wrong in one small unit, only that region of the final probability distribution map is affected, further reducing temporal inter-frame jitter.
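The per-cell monitoring idea can be sketched by reusing the frame-difference test on each cell of a 3 x 3 split. The grid size and threshold are illustrative; the patent only specifies the nine-square grid as an example.

```python
import numpy as np

def grid_motion(prev_gray: np.ndarray, cur_gray: np.ndarray,
                grid: int = 3, threshold: int = 25) -> np.ndarray:
    """Split the thumbnail into grid x grid cells (the nine-square grid
    when grid=3) and compute a motion ratio per cell, so a false
    detection in one cell only affects that region of the mask."""
    h, w = cur_gray.shape
    ratios = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            ys, ye = i * h // grid, (i + 1) * h // grid
            xs, xe = j * w // grid, (j + 1) * w // grid
            d = np.abs(cur_gray[ys:ye, xs:xe].astype(int) -
                       prev_gray[ys:ye, xs:xe].astype(int))
            ratios[i, j] = (d > threshold).mean()
    return ratios

prev = np.zeros((6, 6), dtype=np.uint8)
cur = prev.copy()
cur[0:2, 0:2] = 255                  # motion confined to the top-left cell
ratios = grid_motion(prev, cur)
```

Each cell's ratio can then drive the per-region update decision independently.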
S7: and under the condition that the ratio of the motion pixels in the thumbnail is smaller than the preset ratio, correcting the probability that the corresponding pixel point of the foreground mask of the current image frame is the foreground or the background according to the probability that the same pixel point in a plurality of previous continuous frames is the foreground or the background.
When the ratio of moving pixels in the thumbnail is smaller than the preset ratio, the image changes little during the live broadcast. The probability that each pixel of the current image frame's foreground mask is foreground or background can then be corrected according to the probability of the same pixel in several preceding consecutive frames, making the foreground/background probabilities of the current frame's mask more accurate and providing a more accurate basis for correcting subsequent image frames.
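One simple way to realize step S7's temporal correction is to blend the current probability mask with the average of the preceding frames, which yields the fade-in/fade-out behaviour described above. The blending weight and function name are illustrative assumptions; the patent does not specify the exact correction formula.

```python
import numpy as np

def smooth_mask(cur_prob: np.ndarray, prev_probs: list,
                alpha: float = 0.5) -> np.ndarray:
    """Blend the current probability mask with the average of the
    previous frames' masks so foreground/background transitions change
    gradually over time. alpha is an illustrative blending weight."""
    prev_avg = sum(prev_probs) / len(prev_probs)
    return alpha * cur_prob + (1 - alpha) * prev_avg

cur = np.ones((2, 2), dtype=np.float32)              # sudden all-foreground frame
history = [np.zeros((2, 2), dtype=np.float32)] * 3   # previous frames: background
blended = smooth_mask(cur, history)                  # transition is halved
```

Instead of jumping from 0 to 1 in one frame, each pixel's probability moves gradually, which reads visually as a fade.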
After the first foreground segmentation mask is processed through S1-S6 to determine the foreground mask of the current image frame, step S103 is specifically the following steps:
s8: and carrying out interpolation processing on the foreground mask of the current image frame to form a mask of the original image frame.
And performing interpolation processing on the foreground mask of the current image frame, and amplifying the foreground mask to the size of the original image to form the mask of the original image frame.
After obtaining the mask of the original image frame, the method further comprises the following processing steps:
s9: and carrying out image morphology processing on the mask of the original image frame.
S10: and rejecting connected domains lower than a preset connected domain ratio threshold value according to the connected domain ratios of the masks subjected to image morphological processing.
S11: and performing probability interpolation on the segmentation boundary of the mask with the removed connected component.
Step S9 eliminates small connected components and smooths jagged edges; step S10 calculates the area ratio of each connected component, sets a threshold from empirical parameters, and rejects the smaller components; step S11 interpolates the segmentation boundary, applying probability interpolation at the edges so that the final result has a feathered effect.
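Step S10's connected-component rejection can be sketched with a plain BFS flood fill. The area-ratio threshold is an illustrative empirical value; a production build would more likely use `cv2.connectedComponentsWithStats`, and step S11's feathering could be approximated by blurring the mask edge.

```python
from collections import deque
import numpy as np

def remove_small_components(mask: np.ndarray, min_ratio: float = 0.01) -> np.ndarray:
    """Drop 4-connected foreground regions whose area ratio falls below
    min_ratio (illustrative empirical threshold), keeping only the
    large human-body components."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    out = np.zeros_like(mask)
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                q, comp = deque([(sy, sx)]), []
                seen[sy, sx] = True
                while q:                         # BFS over one component
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) / (h * w) >= min_ratio:
                    for y, x in comp:            # keep only large components
                        out[y, x] = 1
    return out

mask = np.zeros((10, 10), dtype=np.uint8)
mask[1:6, 1:6] = 1          # large human-body component (25% of pixels)
mask[8, 8] = 1              # tiny spurious component (1%)
cleaned = remove_small_components(mask, min_ratio=0.05)
```

Only the large component survives; the isolated speck is rejected as below the area-ratio threshold.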
According to the method for acquiring the image foreground region in a live scene: first, video image frames can be segmented rapidly by downsampling and presetting the segmentation region, meeting the real-time requirement of a live scene. Second, skin color detection, motion estimation and the like are added, and the segmentation result is adjusted by feedback, yielding higher stability and inter-frame consistency. Moreover, the scheme has low computing-power requirements, supports local deployment in a personal, consumer-grade PC CPU environment, and meets the real-time requirement. Finally, compared with traditional ML models and traditional image segmentation algorithms, the scheme has higher segmentation accuracy and resolves the inter-frame consistency and stability shortcomings of traditional segmentation algorithms.
The effects of the present invention will be described below with specific test procedures and results.
1. Test environment and configuration description
The test machine was configured with an Intel(R) Core(TM) i5-9500 CPU and 8 GB RAM, without a discrete GPU. The algorithm model was built in C++ and deployed on Windows. The comparison algorithm models (the original GrabCut algorithm prototype and the deep-learning algorithm Mask-RCNN) were deployed on the same machine configuration.
2. Compared with the traditional segmentation algorithm
2.1 segmentation accuracy
Classic traditional image segmentation algorithms and background modeling methods such as WaterShed, K-Means, Canny, MOG2 and GMG cannot complete segmentation in a complex background environment. The segmentation edge produced by the technique of this application is essentially consistent, on the original image, with that of the best-performing traditional ML model, the original GrabCut algorithm, so the accuracy of the segmentation result is assured. Fig. 2, Fig. 3 (K = 4), Fig. 4 (the MOG and MOG2 algorithms only model moving objects) and Fig. 5 show the segmentation results of WaterShed + Canny, K-Means, MOG2 and the scheme of the present application, respectively.
2.2 computational resource expenditure
The CPU consumption of the GrabCut original algorithm model when continuously processing single 480P images stays above 50%, while the CPU consumption of the proposed algorithm when processing 480P live video is below 20%, saving 2-3 times the computing resource expenditure; the resource consumption is also smaller than that of classical traditional image segmentation algorithms such as WaterShed, K-Means, Canny, MOG2 and GMG.
2.3 computing elapsed time
The average time for the GrabCut original algorithm model to process a single 480P image is more than 1 s, so it is only applicable to non-real-time image segmentation. The average time for the proposed algorithm to process a single frame of 480P live video on a single thread is about 50 ms, i.e. up to 20 fps; the computation is more than 50 times faster and meets the real-time requirement of live broadcast.
3. Comparison with deep learning algorithm Performance
3.1 segmentation effect comparison
In a live broadcast scene the difference in segmentation accuracy is not large: the segmentation edge accuracy of the trained Mask-RCNN is basically indistinguishable from that of the model provided by the invention. FIG. 6 shows the segmentation effect of Mask-RCNN.
3.2 comparison of computational expenses
Under the test machine configuration, the CPU consumption of CPU-deployed Mask-RCNN processing 480P video stays above 70%, while the CPU consumption of the proposed algorithm model processing 480P live video is below 20%, saving 3-4 times the computation cost.
3.3 calculating elapsed time
Under the same configuration and deployment mode on the test machine hardware, Mask-RCNN takes more than 40 s to process a single 480P frame on a pure CPU, while the proposed algorithm model takes about 50 ms per single frame of 480P live video; the computation is more than 800 times faster.
According to another aspect of the invention, a device for acquiring a foreground area of an image in a live scene is provided. Fig. 7 is a schematic diagram of an apparatus for acquiring a foreground region of an image in a live scene according to an embodiment of the present invention. As shown in fig. 7, the apparatus includes the following units.
A scaling unit 701, configured to perform downsampling processing on the original image frame and scale the original image frame to a thumbnail.
The original image frame is scaled to a thumbnail by applying down-sampling, and subsequent processing is performed on the thumbnail. The down-sampling may use algorithms such as cubic interpolation, bilinear interpolation and nearest-neighbor interpolation.
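As a minimal sketch of this down-sampling step (using nearest-neighbor subsampling for simplicity; the function name, frame size and scale factor are illustrative and not fixed by the patent):

```python
import numpy as np

def downsample_nearest(frame: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor down-sampling: keep every `factor`-th pixel."""
    return frame[::factor, ::factor]

# A hypothetical 480x640 RGB frame scaled by 4 yields a 120x160 thumbnail.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
thumb = downsample_nearest(frame, 4)
```

In practice a bilinear or cubic resize would give smoother thumbnails; nearest-neighbor is shown only because it is dependency-free.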
Since the high-resolution original image frame is converted into a low-resolution thumbnail, the complexity and duration of image processing are reduced.
A foreground segmentation unit 702, configured to perform foreground segmentation on the thumbnail according to preset foreground segmentation area parameters to form a first foreground segmentation mask.
Many image segmentation algorithms exist at present; among the better image foreground segmentation algorithms is the GrabCut algorithm.
The GrabCut algorithm is time-consuming on large images and requires an interactively delimited rectangular region, so it is not directly suitable for a PC-level user CPU environment or a live scene.
In the present application, the original image frame is scaled to a thumbnail, which reduces the complexity and time of GrabCut processing, so the scheme can be applied in a PC-level user CPU environment. In addition, during live broadcast the host is in most cases active within the central range of the picture, i.e. the probability that the foreground segmentation target reaches an edge position is extremely low. The preset foreground segmentation area parameters tell the GrabCut algorithm which area or areas are the initialization foreground area, for example the central range of each original image frame, so the edge area can be selected by default as the background area for parameter initialization of the Gaussian model. Because the foreground segmentation area parameters are preset, the GrabCut algorithm needs no interaction when segmenting each image frame, and the image foreground segmentation performed by the GrabCut algorithm can therefore be applied in a live broadcast environment.
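The preset-region idea above can be sketched as building an initialization label mask without any user interaction. The label values below follow OpenCV's GrabCut convention (0 = sure background, 3 = probable foreground), which the patent does not mandate; the margin fraction is an assumed example value:

```python
import numpy as np

# Illustrative GrabCut-style labels (OpenCV convention, assumed here).
BGD, PR_FGD = 0, 3

def init_mask_from_region(h: int, w: int, margin_frac: float = 0.2) -> np.ndarray:
    """Build an initialization mask from preset segmentation-region
    parameters: central range = probable foreground, edge band = default
    background."""
    mask = np.full((h, w), BGD, dtype=np.uint8)
    mh, mw = int(h * margin_frac), int(w * margin_frac)
    mask[mh:h - mh, mw:w - mw] = PR_FGD
    return mask

mask = init_mask_from_region(120, 160)
```

Such a mask could be passed to a GrabCut implementation in mask-initialization mode, replacing the interactive rectangle for every frame.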
An enlarging unit 703, configured to perform an enlarging process on the first foreground segmentation mask to form a mask of an original image frame.
After the thumbnail is segmented, interpolation is applied to the segmented first foreground segmentation mask, enlarging it to the size of the original image to form the mask of the original image frame. The interpolation may be, for example, a cubic interpolation algorithm.
The real-time requirement is high in a live broadcast scene, and fast segmentation matters more than pixel-level segmentation precision; therefore, before segmentation the image is scaled and converted to floating point, segmentation is performed on the scaled image, and a probability distribution map of foreground and background is obtained. This probability distribution map is reduced in size and must be interpolated and enlarged back to the original image size.
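A minimal sketch of enlarging the small probability map back to the original size. The patent mentions cubic interpolation; nearest-neighbor repetition is used here only to keep the example dependency-free:

```python
import numpy as np

def upscale_nearest(prob_mask: np.ndarray, factor: int) -> np.ndarray:
    """Enlarge a small probability map by nearest-neighbor repetition
    (an illustrative stand-in for the cubic interpolation named above)."""
    return np.repeat(np.repeat(prob_mask, factor, axis=0), factor, axis=1)

small = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
big = upscale_nearest(small, 2)   # 2x2 map -> 4x4 map
```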
An image foreground region obtaining unit 704, configured to perform a dot product operation on the original image frame and the mask of the original image frame, and obtain an image foreground region.
A dot product operation is performed on the original image frame and the mask of the original image frame, matting out the foreground so as to obtain the image foreground area.
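The dot multiplication step can be sketched as a pixel-wise product of the frame with its foreground probability mask, which zeroes out background pixels (array shapes are illustrative):

```python
import numpy as np

def apply_mask(frame: np.ndarray, prob_mask: np.ndarray) -> np.ndarray:
    """Pixel-wise (dot) multiplication of the frame with its foreground
    probability mask; background pixels go to zero."""
    return (frame.astype(np.float32) * prob_mask[..., None]).astype(np.uint8)

frame = np.full((2, 2, 3), 200, dtype=np.uint8)
prob = np.array([[1.0, 0.0],
                 [0.0, 1.0]])     # diagonal pixels are foreground
fg = apply_mask(frame, prob)
```

The resulting foreground region could then be composited over a replacement background by adding `background * (1 - prob_mask)`.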
After the image foreground region is obtained, various operations may be performed on the image foreground region, for example, the image foreground region may be superimposed on the background image, and an output image with which the background replacement is completed may be obtained.
The embodiment illustrated in fig. 7 shows how to quickly segment the image foreground region in a live scene in a personal PC-level user CPU environment, meeting the real-time requirements of the live scene. However, during image foreground region segmentation there may still be problems such as inter-frame flicker, aliasing, and abrupt foreground-background transitions. Therefore, the device for acquiring the image foreground area in the live broadcast scene further comprises the following units.
A motion state detection unit: used for performing motion state detection on the thumbnail and obtaining the ratio of motion pixels in the thumbnail.
After the thumbnail is obtained by the scaling unit 701, the thumbnail may be converted to a single channel (from RGB to grayscale), a frame difference method is applied to the grayscale image, and Gaussian filtering with a specific threshold parameter is performed on the frame-difference image to eliminate the noise of small motion traces and monitor only large-range motion.
In the motion state detection, each pixel can be assigned a value, and the number of non-zero pixels is then counted to obtain the ratio of motion pixels in the thumbnail.
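The frame-difference step can be sketched as follows; the binarization threshold is an assumed example value, and the Gaussian filtering mentioned above is omitted to keep the sketch short:

```python
import numpy as np

def motion_ratio(prev_gray: np.ndarray, cur_gray: np.ndarray,
                 thresh: int = 25) -> float:
    """Frame-difference motion detection: binarize |cur - prev| with an
    illustrative threshold and return the ratio of non-zero (moving)
    pixels in the thumbnail."""
    diff = np.abs(cur_gray.astype(np.int16) - prev_gray.astype(np.int16))
    moving = diff > thresh
    return float(moving.sum()) / moving.size

prev = np.zeros((4, 4), dtype=np.uint8)
cur = prev.copy()
cur[:2, :] = 100            # the top half of the thumbnail moved
ratio = motion_ratio(prev, cur)
```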
A skin color detection unit: used for performing skin color detection on the thumbnail to form a first skin color mask.
Segmenting a small-size image inevitably reduces precision, and a human face may be processed as background. This is a serious problem, since in a live broadcast scene the face region should in most cases default to foreground. To limit the amount of computation, only skin color detection is performed rather than separate face recognition, and the skin color area is corrected before and after segmentation. The skin color detection technique adopts an elliptical skin color model; a skin color detection method based on split K-means clustering has higher robustness to ambient illumination changes, noise interference and the like.
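An elliptical skin-color model classifies a pixel as skin if its chroma falls inside an ellipse in the CbCr plane. The sketch below uses the BT.601 RGB-to-CbCr conversion; the ellipse centre and axes are assumed example values, not parameters taken from the patent:

```python
import numpy as np

# Illustrative ellipse in the CbCr plane (assumed values).
CB0, CR0, A, B = 113.5, 155.6, 25.0, 15.0

def skin_mask(rgb: np.ndarray) -> np.ndarray:
    """Classify pixels as skin (1) or non-skin (0) with an elliptical
    CbCr skin-color model."""
    r, g, b = [rgb[..., i].astype(np.float32) for i in range(3)]
    cb = 128 - 0.1687 * r - 0.3313 * g + 0.5 * b      # BT.601 chroma
    cr = 128 + 0.5 * r - 0.4187 * g - 0.0813 * b
    inside = ((cb - CB0) / A) ** 2 + ((cr - CR0) / B) ** 2 <= 1.0
    return inside.astype(np.uint8)

pixels = np.array([[[220, 170, 140],      # skin-like pixel
                    [0, 255, 0]]],        # pure green, non-skin
                  dtype=np.uint8)
mask = skin_mask(pixels)
```

A production model would fit the ellipse parameters to training data rather than hard-code them.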
A skin color mask morphology processing unit: used for performing an image morphology operation on the first skin color mask to form a second skin color mask.
Because skin color detection does not detect non-skin-color face areas such as the eyes and nostrils, an image morphology opening operation is performed on the first skin color mask to connect those areas, the face region being taken as a determined foreground during live broadcast, forming the second skin color mask.
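As a sketch of the morphology step, the following builds binary dilation and erosion from NumPy window shifts. The hole-filling behaviour shown is that of a closing (dilation followed by erosion), which connects small non-skin holes such as eyes or nostrils; note the patent's text names the step an opening operation, so this sketch illustrates the hole-filling effect rather than the exact operation named. The structuring-element size is illustrative:

```python
import numpy as np

def dilate(m: np.ndarray, k: int = 1) -> np.ndarray:
    """Binary dilation with a (2k+1)x(2k+1) square structuring element."""
    p = np.pad(m, k, constant_values=0)
    out = np.zeros_like(m)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out |= p[dy:dy + m.shape[0], dx:dx + m.shape[1]]
    return out

def erode(m: np.ndarray, k: int = 1) -> np.ndarray:
    """Binary erosion, the dual of dilation."""
    p = np.pad(m, k, constant_values=1)
    out = np.ones_like(m)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out &= p[dy:dy + m.shape[0], dx:dx + m.shape[1]]
    return out

def close_mask(m: np.ndarray, k: int = 1) -> np.ndarray:
    """Closing: dilate then erode; fills small interior holes."""
    return erode(dilate(m, k), k)

m = np.ones((5, 5), dtype=np.uint8)
m[2, 2] = 0                      # a one-pixel "nostril" hole
closed = close_mask(m)           # the hole is filled
```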
It should be noted that the processing procedures executed by the motion state detection unit and the skin color detection unit are two independent processing steps, and the two processing steps may be executed successively or in parallel.
After processing by the foreground segmentation unit 702, the apparatus further comprises the following units, which process the formed first foreground segmentation mask based on the skin color detection and the morphological processing of the skin color mask:
a first determination unit: for determining skin tone regions and non-skin tone regions in the first foreground segmentation mask from the second skin tone mask;
a first foreground segmentation mask forming unit: used for setting the skin color area as foreground to form the corrected first foreground segmentation mask.
After the second skin color mask is obtained through skin color detection and morphological processing of the skin color mask, the foreground segmentation mask is corrected according to the skin color area and non-skin color area indicated by the second skin color mask, making the image foreground segmentation result more accurate. The skin color area in the first foreground segmentation mask, as determined by the second skin color mask, is set as foreground, for example by setting the probability of skin color pixels to 1; the non-skin color area is not modified and keeps the previous judgment result of the first foreground mask, for example the previous probability values of its pixels.
After obtaining the corrected first foreground segmentation mask, performing processing further includes:
a first calculation unit: used for calculating the foreground rate in the corrected first foreground segmentation mask.
The number of foreground pixels in the corrected first foreground segmentation mask is counted, and the foreground rate of the corrected first foreground segmentation mask is calculated from it.
After obtaining the corrected first foreground segmentation mask, performing processing further includes:
a second foreground segmentation mask forming unit: used for performing an image morphology operation on the corrected first foreground segmentation mask to form a second foreground segmentation mask.
An image morphology operation is performed on the corrected first foreground segmentation mask to eliminate fine connected domains and connect the human-body-region connected domains; a dilation operation then expands the edge of the corrected first foreground segmentation mask.
After obtaining the second foreground segmentation mask, performing processing further comprises:
a second calculation unit: for calculating an offset of the foreground rate in the modified first foreground segmentation mask from an average foreground rate of a plurality of previous consecutive frames.
A second determination unit: used for determining the second foreground segmentation mask as the foreground mask of the current image frame in response to the offset being smaller than a preset offset value, or in response to the offset being greater than or equal to the preset offset value while the ratio of motion pixels in the thumbnail is greater than or equal to a preset ratio; otherwise, the second foreground segmentation mask of the previous frame is determined as the foreground mask of the current image frame.
Specifically, the average foreground rate of several frames (for example, 5 frames) before the current frame is calculated, and the foreground rate in the corrected first foreground segmentation mask is compared with this average to obtain a difference value, i.e. the offset. If the difference is small (smaller than the preset offset value), or the difference is large but the preceding motion state detection found large motion (the ratio of motion pixels in the thumbnail is greater than or equal to the preset ratio), then the large difference is explained by large-amplitude motion; the two results match, the current segmentation result is credible, and the second foreground segmentation mask can be adopted. Otherwise, the current segmentation result may contain errors, and the foreground mask segmentation result of the previous frame is adopted instead.
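The feedback check just described can be sketched as follows; the threshold values and the 5-frame history are illustrative, not fixed by the patent:

```python
import numpy as np

def choose_mask(cur_mask, prev_mask, history_rates, motion_ratio,
                max_offset=0.1, min_motion=0.05):
    """Accept the current segmentation mask only if its foreground rate
    stays near the recent average, or if a large change is explained by
    large-scale motion; otherwise reuse the previous frame's mask."""
    fg_rate = float(cur_mask.mean())                 # fraction of foreground pixels
    offset = abs(fg_rate - float(np.mean(history_rates)))
    if offset < max_offset or motion_ratio >= min_motion:
        return cur_mask
    return prev_mask

hist = [0.30, 0.31, 0.29, 0.30, 0.30]   # foreground rates of 5 prior frames
cur = np.ones((10, 10), dtype=np.uint8)  # suspicious jump: rate 1.0
prev = np.zeros((10, 10), dtype=np.uint8)

picked = choose_mask(cur, prev, hist, motion_ratio=0.0)   # no motion: reject jump
picked2 = choose_mask(cur, prev, hist, motion_ratio=0.2)  # big motion: accept
```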
To further reduce jitter, the motion monitoring refers to the preceding monitoring results to exclude part of the false detections. If there is almost no motion, i.e. the ratio of motion pixels in the thumbnail is smaller than the preset ratio, the mask (represented as a probability distribution map, or probability mask) may be left un-updated, and the mask result of the previous frame is reused. Since the contour of a moving object in a live scene does not change noticeably within several frames (for example, 5 frames), the mask calculated for one frame can be reused over several consecutive frames. In addition, a memory of the temporally preceding probability distribution maps is added to the correction model, and the mask of the current frame is corrected according to those preceding results, so the change of foreground and background over time is smoother and softer, producing a visual fade-in/fade-out effect.
In addition, to handle local motion more finely, the original image can be divided into several small image units, for example a 3 × 3 nine-square grid, with each small unit performing motion monitoring independently; thus even if motion monitoring fails on one small unit, only that region of the final probability distribution map is affected, further reducing temporal inter-frame jitter.
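The per-block monitoring can be sketched by splitting the frame-difference map into a grid and computing one motion ratio per block (grid size and threshold are illustrative):

```python
import numpy as np

def blockwise_motion(prev_gray, cur_gray, grid=3, thresh=25):
    """Split the frame into a grid x grid pattern (3 x 3 here) and compute
    a per-block motion-pixel ratio, so a false detection in one block only
    affects that region of the probability map."""
    h, w = prev_gray.shape
    diff = np.abs(cur_gray.astype(np.int16) - prev_gray.astype(np.int16)) > thresh
    ratios = np.zeros((grid, grid))
    for by in range(grid):
        for bx in range(grid):
            block = diff[by * h // grid:(by + 1) * h // grid,
                         bx * w // grid:(bx + 1) * w // grid]
            ratios[by, bx] = block.mean()
    return ratios

prev = np.zeros((9, 9), dtype=np.uint8)
cur = prev.copy()
cur[0:3, 0:3] = 100                 # motion only in the top-left block
ratios = blockwise_motion(prev, cur)
```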
A correction unit: used for correcting the probability that each pixel of the current image frame's foreground mask is foreground or background according to the probability that the same pixel is foreground or background in several previous consecutive frames, when the ratio of motion pixels in the thumbnail is smaller than the preset ratio.
When the ratio of motion pixels in the thumbnail is smaller than the preset ratio, the image changes little during the live broadcast, so the probability that each pixel of the current image frame's foreground mask is foreground or background can be corrected according to the probability of the same pixel in several previous consecutive frames. This makes the foreground/background probabilities of the current frame's foreground mask more accurate and provides a more accurate basis for correcting the pixels of subsequent image frames.
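One simple way to realize this temporal correction is to blend the current probability map with the history average, which yields the fade-in/fade-out behaviour described above. The blending weight `alpha` is an assumed example value, not a parameter from the patent:

```python
import numpy as np

def temporal_correct(cur_prob, prev_probs, alpha=0.5):
    """Blend the current foreground probability map with the average of
    the previous frames' maps, so foreground/background transitions fade
    rather than flicker when there is little motion."""
    history = np.mean(prev_probs, axis=0)
    return alpha * cur_prob + (1.0 - alpha) * history

prev_maps = [np.full((2, 2), 1.0) for _ in range(5)]   # stable foreground
cur = np.zeros((2, 2))                                  # sudden flip to background
smoothed = temporal_correct(cur, prev_maps)             # softened transition
```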
After the first foreground segmentation mask is processed by the first determining unit, the first foreground segmentation mask forming unit, the first calculating unit, the second foreground segmentation mask forming unit, the second calculating unit and the second determining unit to determine the foreground mask of the current image frame, the enlarging unit 703 is further configured to:
and carrying out interpolation processing on the foreground mask of the current image frame to form a mask of the original image frame.
And performing interpolation processing on the foreground mask of the current image frame, and amplifying the foreground mask to the size of the original image to form the mask of the original image frame.
After obtaining the mask for the original image frame, performing processing further comprises:
an image morphology processing unit: and performing image morphology processing on the mask of the original image frame.
A rejection unit: and the method is used for eliminating the connected domains which are lower than a preset connected domain ratio threshold value according to the connected domain ratios of the masks subjected to image morphology processing.
An interpolation unit: for probabilistic interpolation on the segmentation boundaries of the mask with the culled connected components.
Processing by the image morphology processing unit eliminates small connected domains and smooths edge aliasing; the rejection unit calculates the area ratio of each connected domain, sets a threshold according to empirical parameters, and removes the smaller connected domains; the interpolation unit performs probability interpolation on the segmentation boundary, so the final result produces a feathering effect at the edges.
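The connected-domain culling step can be sketched with a BFS labeling pass; the 4-connectivity and the area-ratio threshold `min_frac` are illustrative stand-ins for the empirical parameters mentioned above, and the boundary feathering is omitted:

```python
import numpy as np
from collections import deque

def cull_small_components(mask: np.ndarray, min_frac: float = 0.05) -> np.ndarray:
    """Label 4-connected foreground components by BFS and zero out any
    component whose area ratio is below the threshold."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    out = mask.copy()
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                comp, q = [], deque([(y, x)])
                seen[y, x] = True
                while q:
                    cy, cx = q.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) < min_frac * h * w:   # too small: reject
                    for cy, cx in comp:
                        out[cy, cx] = 0
    return out

m = np.zeros((10, 10), dtype=np.uint8)
m[1:8, 1:8] = 1      # large body region (49 px)
m[9, 9] = 1          # isolated 1-px speck, below the 5-px threshold
cleaned = cull_small_components(m)
```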
According to the device for acquiring the image foreground area in a live broadcast scene, first, video image frames can be rapidly segmented by means of down-sampling and a preset segmentation area, meeting the real-time requirement of a live broadcast scene; second, skin color detection, motion estimation and the like are added, and feedback adjustment is applied to the segmentation result so as to obtain higher stability and inter-frame consistency; moreover, the scheme of the invention requires little computing power, supports local deployment in a personal PC-level user CPU environment, and still meets the real-time requirement; finally, compared with traditional ML models and traditional image segmentation algorithms, the device has higher segmentation accuracy, and overcomes the shortcomings of traditional segmentation algorithms in inter-frame consistency and stability.
Referring to fig. 8, fig. 8 provides an electronic device comprising a processor and a memory storing computer instructions which, when executed by the processor, cause the processor to carry out the method and refinement scheme as shown in fig. 3.
It should be understood that the above-described device embodiments are merely exemplary, and that the devices disclosed herein may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.
In addition, unless otherwise specified, each functional unit/module in each embodiment of the present invention may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.
If the integrated unit/module is implemented in hardware, the hardware may be digital circuits, analog circuits, etc. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the processor or chip may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, etc. Unless otherwise specified, the on-chip cache, the off-chip memory, and the memory may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (eDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.
The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Embodiments of the present application also provide a non-transitory computer storage medium storing a computer program, which when executed by a plurality of processors causes the processors to perform the method and refinement scheme as shown in fig. 1.
The foregoing detailed description of the embodiments has illustrated the principles and implementations of the present application; the description of the embodiments is intended only to help understand the methods and core concepts of the present application. Meanwhile, a person skilled in the art may, following the ideas of the present application, make changes to the specific embodiments and the application scope. In view of the above, this description should not be construed as limiting the application.
Claims (7)
1. A method of obtaining an image foreground region in a live scene, comprising:
carrying out downsampling processing on an original image frame, and zooming the original image frame to a thumbnail;
performing motion state detection on the thumbnail to obtain the ratio of motion pixels in the thumbnail;
performing skin color detection on the thumbnail to form a first skin color mask;
performing image morphology operation on the first skin color mask to form a second skin color mask;
performing foreground segmentation on the thumbnail according to preset foreground segmentation region parameters to form a first foreground segmentation mask;
determining a skin color region and a non-skin color region in the first foreground segmentation mask according to the second skin color mask;
setting the skin color area as a foreground to form a corrected first foreground segmentation mask;
calculating a foreground rate in the corrected first foreground segmentation mask;
performing image morphology operation on the corrected first foreground segmentation mask to form a second foreground segmentation mask;
calculating the offset of the foreground rate in the corrected first foreground segmentation mask relative to the average foreground rate of a plurality of continuous frames before;
determining the second foreground segmentation mask as the foreground mask of the current image frame in response to the offset being smaller than a preset offset value, or in response to the offset being greater than or equal to the preset offset value and the ratio of motion pixels in the thumbnail being greater than or equal to a preset ratio; otherwise, determining the second foreground segmentation mask of the previous frame as the foreground mask of the current image frame;
amplifying the foreground mask of the current image frame to form a mask of an original image frame; and
and performing dot multiplication operation on the original image frame and the mask of the original image frame to obtain an image foreground area.
2. The method of claim 1, further comprising:
and under the condition that the ratio of the motion pixels in the thumbnail is smaller than the preset ratio, correcting the probability that the corresponding pixel point of the foreground mask of the current image frame is the foreground or the background according to the probability that the same pixel point in a plurality of previous continuous frames is the foreground or the background.
3. The method of claim 1, wherein the magnifying the first foreground segmentation mask, forming a mask of an original image frame, comprises:
and carrying out interpolation processing on the foreground mask of the current image frame to form a mask of the original image frame.
4. The method of claim 3, further comprising:
performing image morphology processing on a mask of the original image frame;
removing connected domains lower than a preset connected domain ratio threshold value according to the connected domain ratios of the masks subjected to image morphological processing; and
and performing probability interpolation on the segmentation boundary of the mask with the removed connected component.
5. A device for obtaining an image foreground region in a live scene, comprising:
the zooming unit is used for carrying out downsampling processing on the original image frame and zooming the original image frame to the thumbnail;
a motion state detection unit for performing motion state detection on the thumbnail to obtain a ratio of motion pixels in the thumbnail;
the skin color detection unit is used for carrying out skin color detection on the thumbnail to form a first skin color mask;
a skin color mask morphology processing unit, configured to perform image morphology operation on the first skin color mask to form a second skin color mask;
the foreground segmentation unit is used for performing foreground segmentation on the thumbnail according to preset foreground segmentation region parameters to form a first foreground segmentation mask;
a first determining unit, configured to determine a skin color region and a non-skin color region in the first foreground segmentation mask according to the second skin color mask;
a first foreground segmentation mask forming unit, configured to set the skin color region as a foreground and form a corrected first foreground segmentation mask;
a first calculating unit, configured to calculate a foreground rate in the modified first foreground segmentation mask;
a second foreground segmentation mask forming unit, configured to perform an image morphological operation on the corrected first foreground segmentation mask to form a second foreground segmentation mask;
a second calculating unit, configured to calculate a shift amount of a foreground rate in the modified first foreground segmentation mask from an average foreground rate of a plurality of previous frames;
a second determining unit, configured to determine the second foreground segmentation mask as the foreground mask of the current image frame in response to the offset being smaller than a preset offset value, or in response to the offset being greater than or equal to the preset offset value and the ratio of motion pixels in the thumbnail being greater than or equal to a preset ratio; otherwise, determine the second foreground segmentation mask of the previous frame as the foreground mask of the current image frame;
the amplifying unit is used for amplifying the current image frame foreground mask to form a mask of an original image frame; and
and the image foreground region acquisition unit is used for performing dot multiplication operation on the original image frame and the mask of the original image frame to acquire an image foreground region.
6. An electronic device, comprising:
a processor; and
a memory storing computer instructions that, when executed by the processor, cause the processor to perform the method of any of claims 1-4.
7. A non-transitory computer storage medium storing a computer program that, when executed by a plurality of processors, causes the processors to perform the method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110853914.2A CN113313730B (en) | 2021-07-28 | 2021-07-28 | Method and device for acquiring image foreground area in live scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113313730A CN113313730A (en) | 2021-08-27 |
CN113313730B true CN113313730B (en) | 2021-10-08 |
Family
ID=77381579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110853914.2A Active CN113313730B (en) | 2021-07-28 | 2021-07-28 | Method and device for acquiring image foreground area in live scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113313730B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114567794B (en) * | 2022-03-11 | 2023-06-30 | 浙江理工大学 | Live video background replacement method |
CN114693556B (en) * | 2022-03-25 | 2023-06-27 | 英特灵达信息技术(深圳)有限公司 | High-altitude parabolic frame difference method moving object detection and smear removal method |
CN117956214A (en) * | 2022-10-28 | 2024-04-30 | 影石创新科技股份有限公司 | Video display method, device, video display equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108564597A (en) * | 2018-03-05 | 2018-09-21 | 华南理工大学 | A kind of video foreground target extraction method of fusion gauss hybrid models and H-S optical flow methods |
CN109934843A (en) * | 2019-01-28 | 2019-06-25 | 北京华捷艾米科技有限公司 | A kind of real-time profile, which refines, scratches image space method and storage medium |
CN111989711A (en) * | 2018-04-20 | 2020-11-24 | 索尼公司 | Object segmentation in a sequence of color image frames based on adaptive foreground mask upsampling |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11195283B2 (en) * | 2019-07-15 | 2021-12-07 | Google LLC | Video background subtraction using depth |
- 2021-07-28: CN application CN202110853914.2A, patent CN113313730B (en), status: Active
Non-Patent Citations (1)
Title |
---|
Research on Intelligent Detection Theory and Key Technologies of Pedestrian Traffic Information Based on Computer Vision; Wang Aili; China Doctoral Dissertations Full-text Database, Information Science and Technology series; 2016-12-15 (No. 12); main text pp. 75-76 *
Also Published As
Publication number | Publication date |
---|---|
CN113313730A (en) | 2021-08-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113313730B (en) | Method and device for acquiring image foreground area in live scene | |
CN110176027B (en) | Video target tracking method, device, equipment and storage medium | |
Li et al. | Low-light image and video enhancement using deep learning: A survey | |
US11595737B2 (en) | Method for embedding advertisement in video and computer device | |
CN107274419B (en) | Deep learning saliency detection method based on global prior and local context | |
Lee et al. | Temporally coherent video matting | |
JP5435382B2 (en) | Method and apparatus for generating morphing animation | |
CN110490896B (en) | Video frame image processing method and device | |
CN107452010A (en) | Automatic image matting algorithm and device | |
US8698796B2 (en) | Image processing apparatus, image processing method, and program | |
US20200334894A1 (en) | 3D motion effect from a 2D image | |
KR20180066160A (en) | Method and apparatus for facial image processing, and storage medium | |
CN107516319A (en) | High-accuracy simple interactive image matting method, storage device and terminal | |
CN109657583A (en) | Facial key point detection method and apparatus, computer equipment and storage medium | |
Li et al. | Motion-aware KNN Laplacian for video matting | |
CN105956995B (en) | Face appearance editing method based on real-time video intrinsic decomposition | |
Wang et al. | Simultaneous matting and compositing | |
RU2697627C1 (en) | Method of correcting illumination of an object on an image in a sequence of images and a user's computing device which implements said method | |
CN111243051A (en) | Portrait photo-based stroke generating method, system and storage medium | |
Yuan et al. | Watershed-based superpixels with global and local boundary marching | |
Dev et al. | Localizing adverts in outdoor scenes | |
CN113177526B (en) | Image processing method, device, equipment and storage medium based on face recognition | |
CN113379623A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN105741277A (en) | Background subtraction method based on the ViBe (Visual Background Extractor) algorithm and SLIC (Simple Linear Iterative Clustering) superpixels | |
EP3018626B1 (en) | Apparatus and method for image segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |