CN112533024A - Face video processing method and device and storage medium - Google Patents

Face video processing method and device and storage medium

Info

Publication number
CN112533024A
Authority
CN
China
Prior art keywords
image
face
processed
video
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011354376.4A
Other languages
Chinese (zh)
Inventor
蔡晓霞
张元尊
黄晓政
闻兴
于冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011354376.4A
Publication of CN112533024A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117Filters, e.g. for pre-processing or post-processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N25/00Circuitry of solid-state image sensors [SSIS]; Control thereof
    • H04N25/60Noise processing, e.g. detecting, correcting, reducing or removing noise
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/14Picture signal circuitry for video frequency region
    • H04N5/21Circuitry for suppressing or minimising disturbance, e.g. moiré or halo

Abstract

The application discloses a face video processing method and apparatus and a storage medium, relating to the field of image processing and aiming to reduce defects such as color blocks, noise, and artifacts at the edge of a face caused by the image quality degradation introduced when a mobile terminal encodes an uploaded video. According to the method, face detection is performed on each frame of a video to be processed to determine the face region of the image; the determined face region is filtered to obtain a processed image; the original image and the processed image are fused to obtain an optimized image; and the optimized video is determined from the optimized frames. In this way, the image quality problems introduced by mobile-terminal encoding of the uploaded video can be reduced, and the defects such as color blocks, noise, and artifacts at the face edge can be alleviated.

Description

Face video processing method and device and storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a method and an apparatus for processing a face video, and a storage medium.
Background
With the rapid development of the live broadcast industry, more and more people are becoming anchors. Recently, live broadcasting on mobile devices has become a popular social application scenario and has received wide attention. However, in the related art, because the live video stream uploaded by the anchor client is encoded under the limitations of the mobile device's processing capacity and the available bandwidth, the video uploaded to the server generally suffers from various image quality problems, so that defects such as color blocks, noise, and artifacts appear at the edge of the face.
Disclosure of Invention
The embodiments of the present application provide a face video processing method, apparatus, and storage medium, which are used to reduce defects such as color blocks, noise, and artifacts at the edge of a face arising from the image quality degradation introduced when a mobile terminal encodes an uploaded video.
According to a first aspect of the embodiments of the present application, there is provided a face video processing method, including:
acquiring a video to be processed; the video to be processed comprises at least two frames of images to be processed, and the video to be processed is a coded video;
carrying out face detection on each frame of image to be processed to determine a face area of the image to be processed;
carrying out image filtering processing on the image where the face area is located to obtain an optimized image;
and taking the video formed by the optimized images of each frame as an optimized video.
In a possible implementation manner, the performing face detection on the image to be processed and determining a face region of the image to be processed includes:
performing key point detection on the image to be processed to obtain a face key point of the image to be processed;
and carrying out face edge positioning processing according to the face key points to obtain a face area of the image to be processed.
In a possible implementation manner, the performing face edge location processing according to the face key points to obtain a face region of the image to be processed includes:
determining a cheek key point from the face key points according to the coordinates of the face key points;
and carrying out face edge positioning processing on the cheek key points to obtain a face contour of the image to be processed.
In a possible implementation manner, after performing face detection on the image to be processed and determining a face region of the image to be processed, the method further includes:
and carrying out skin color detection on the image to be processed, and excluding non-skin areas of the face area.
In a possible implementation manner, the performing image filtering processing on the image where the face region is located to obtain an optimized image includes:
performing multi-level median filtering processing on the image where the face area is located to obtain a filtered image;
and carrying out image fusion on the filtered image and the image to be processed to obtain the optimized image.
According to a second aspect of the embodiments of the present application, there is provided a face video processing apparatus, including:
an acquisition module configured to perform acquiring a video to be processed; the video to be processed comprises at least two frames of images to be processed, and the video to be processed is a coded video;
the detection module is configured to execute face detection on each frame of image to be processed and determine a face area of the image to be processed;
the processing module is configured to perform image filtering processing on the image where the face area is located to obtain an optimized image;
and the determining module is configured to take the video formed by the optimized images of each frame as the optimized video.
In one possible implementation, the detection module includes:
the detection unit is configured to perform key point detection on the image to be processed to obtain a face key point of the image to be processed;
and the positioning unit is configured to perform face edge positioning processing according to the face key points to obtain a face area of the image to be processed.
In one possible implementation, the positioning unit includes:
a determining subunit configured to perform determining cheek key points from the face key points according to coordinates of the face key points;
and the positioning subunit is configured to perform face edge positioning processing on the cheek key points to obtain a face contour of the image to be processed.
In one possible implementation, the apparatus further includes:
and the exclusion module is configured to, after the detection module performs face detection on the image to be processed and the face area of the image to be processed is determined, perform skin color detection on the image to be processed to exclude a non-skin area of the face area.
In one possible implementation, the processing module includes:
the filtering unit is configured to perform multilevel median filtering processing on the image where the face area is located to obtain a filtered image;
and the image fusion unit is configured to perform image fusion on the filtered image and the image to be processed to obtain the optimized image.
According to a third aspect of embodiments of the present application, there is provided an electronic apparatus, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a face video processing method.
According to a fourth aspect of the embodiments of the present application, there is provided a storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a face video processing method.
According to a fifth aspect of the embodiments of the present application, there is provided a computer program product comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the face video processing method provided by the embodiments of the present application.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
the application provides a face video processing method, a face video processing device and a storage medium, wherein a face area of an image to be processed is determined by performing face detection on each frame of image of the video to be processed, the determined face area is subjected to filtering processing to obtain a processed image, an original image and the processed image are subjected to image fusion to obtain an optimized image, and the optimized video is determined according to each frame of the optimized image. Therefore, through the processing, the defects of color blocks, noise, artifacts and the like at the edge of the face caused by the image quality problem of the uploaded video caused by encoding of the mobile terminal can be reduced.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram of a live link in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a color block effect of an object edge in an upload stream obtained by live broadcast shooting according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of a face video processing method in an embodiment of the present application;
fig. 4 is a schematic flowchart of a complete face video processing method in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a face video processing apparatus in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a terminal device in an embodiment of the present application.
Detailed Description
In order to reduce defects such as color blocks, noise, and artifacts at the edge of a human face caused by the image quality degradation introduced when a mobile terminal encodes an uploaded video, the embodiments of the present application provide a face video processing method, apparatus, and storage medium. For a better understanding of the technical solution provided by the embodiments of the present application, the basic principle of the solution is briefly described below:
it should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The technical scheme provided by the embodiment of the application is described below with reference to the accompanying drawings.
With the rapid development of the live broadcast industry, more and more people are becoming anchors. Recently, live broadcasting on mobile devices has become a popular social application scenario and has received wide attention. Fig. 1 is a schematic diagram of a live link. In general, a full live link consists roughly of the following parts:
The first step comprises shooting, pre-processing, encoding, and the like. After the original video is shot, a series of pre-processing operations such as image quality enhancement, color beautification, and filtering may be performed, and the video then enters a hardware encoder. The encoder adaptively encodes according to the code rate set by the platform and the bandwidth of the local network, ensuring that the code stream can be stably uploaded to the server.
The second step is stream pushing, i.e., the anchor client uploads the encoded video to the server.
The third step is server-side processing, which includes pre-processing, transcoding, and the like. At the server, the received code stream is transcoded according to different distribution strategies to reduce the code rate of the video stream and save bandwidth.
Finally comes stream pulling: the video is delivered to the viewer's client and played there.
However, in the related art, because the live video stream uploaded by the anchor client is encoded under the limitations of the mobile device's processing capacity and the available bandwidth, the video uploaded to the server generally suffers from various image quality problems, so that defects such as color blocks, noise, and artifacts appear at the edge of the face. Fig. 2 is a schematic diagram of the color block effect at object edges in an upload stream from live broadcast shooting.
In view of the above, to solve these problems, the present application provides a face video processing method: face detection is performed on each frame of the video to be processed to determine the face region of the image; the determined face region is filtered to obtain a processed image; the original image and the processed image are fused to obtain an optimized image; and the optimized video is determined from the optimized frames. Through this processing, defects such as color blocks, noise, and artifacts at the edge of the face, caused by the image quality degradation introduced when the mobile terminal encodes the uploaded video, can be reduced.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification. It should be understood that the preferred embodiments described herein are merely for illustrating and explaining the present application and are not intended to limit it, and that the embodiments and the features of the embodiments may be combined with each other without conflict.
The face video processing method provided in the embodiments of the present application is further explained below. As shown in fig. 3, the method includes the following steps.
In step S31, a video to be processed is acquired; the video to be processed comprises at least two frames of images to be processed, and the video to be processed is coded video.
In the embodiment of the application, the video to be processed is an encoded video uploaded by the client.
In step S32, for each frame of image to be processed, face detection is performed on the image to be processed, and a face region of the image to be processed is determined.
In the embodiment of the application, frames are extracted from the video to be processed to obtain each frame of image to be processed. For example, if a video consists of 30 frames, each of the 30 frames is processed separately when the video is processed.
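As a concrete illustration, a minimal frame-extraction sketch in Python follows; the patent does not prescribe a decoder, so OpenCV's VideoCapture is assumed here.

```python
import cv2

def extract_frames(video_path):
    """Decode a video file into a list of per-frame images (the step S32 input)."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()  # ok becomes False once the stream is exhausted
        if not ok:
            break
        frames.append(frame)
    capture.release()
    return frames
```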
During live broadcasting, most of the audience's attention is focused on the anchor's face, so color blocks, noise, artifacts, and similar phenomena in the face region of the video are the first problems to be solved. Therefore, the region of the face in the video needs to be determined first during processing, which facilitates the subsequent optimization.
In the embodiment of the application, the face area in the image to be processed is determined by determining the key points of the face in the image to be processed. Specifically, the method can be implemented as steps A1-A2:
step A1: and carrying out key point detection on the image to be processed to obtain the face key points of the image to be processed.
The image to be processed may be input into a pre-trained key point detection model to obtain the face key points; alternatively, the face key points may be determined through algorithmic processing such as feature extraction.
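By way of illustration only, the sketch below uses dlib's publicly available 68-point landmark predictor as a stand-in for the unspecified pre-trained key point detection model; the model file name is that of dlib's standard distribution, not anything named in the patent.

```python
import cv2
import dlib

# dlib's standard 68-point landmark model, used purely as a stand-in for the
# patent's unspecified key point detection model; the .dat file must be
# downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_face_keypoints(image_bgr):
    """Step A1: return a list of (x, y) key points for each detected face."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    keypoints_per_face = []
    for rect in detector(gray):
        shape = predictor(gray, rect)
        keypoints_per_face.append(
            [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)])
    return keypoints_per_face
```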
Step A2: and carrying out face edge positioning processing according to the face key points to obtain a face area of the image to be processed.
In the embodiment of the application, after the face key points are obtained, the position of the face in the image is determined according to their coordinates, and the face region is determined from the key points at the face edge. Determining the region of the face in the video in this way facilitates the subsequent optimization processing.
As can be seen from fig. 2, color blocks, noise, artifacts, and similar phenomena are severe at the edges of the face in the image; therefore, the face contour needs to be determined to facilitate image optimization. This can be implemented as follows: cheek key points are determined from the face key points according to the coordinates of the face key points, and face edge positioning is performed on the cheek key points to obtain the face contour of the image to be processed.
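Continuing the sketch above, one way to rasterize the located contour into a face-region mask is shown below; the index ranges assume dlib's 68-point layout (points 0-16 trace the jaw and cheeks), which is an illustration, not the patent's key point scheme.

```python
import cv2
import numpy as np

def face_region_mask(image_shape, keypoints):
    """Step A2: rasterize a face-region mask from edge key points."""
    jaw_cheek = keypoints[0:17]   # jawline/cheek points, ear to ear
    brow = keypoints[26:16:-1]    # eyebrow points, reversed to close the polygon
    contour = np.array(jaw_cheek + brow, dtype=np.int32)
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [contour], 255)  # 255 inside the face region, 0 outside
    return mask
```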
To further refine the face region, after the above steps, skin color detection can be used to determine whether an occlusion exists in the face region. This can be implemented as follows: skin color detection is performed on the image to be processed, and non-skin areas are excluded from the face region.
In the embodiment of the present application, skin color detection may be performed according to the RGB color model or based on an elliptical skin model.
In the RGB color model, pixels whose values fall within a defined skin color range are found; pixels within the range are treated as skin pixels and pixels outside the range as non-skin pixels, thereby determining the skin and non-skin areas within the face region.
In the elliptical skin model, the skin information is mapped into the YCrCb color space, where skin pixels are approximately distributed within an ellipse in the two-dimensional CrCb plane. Therefore, once this CrCb ellipse is obtained, a pixel need only be tested against it: if its value falls inside the ellipse (including the boundary), it is judged to be a skin pixel; otherwise it is a non-skin pixel.
During viewing, the audience is largely focused on the skin area; therefore, the non-skin areas are excluded.
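A minimal sketch of the elliptical CrCb test follows; the ellipse centre, axes, and rotation are illustrative placeholders (the patent specifies no parameters), so in practice they would need to be fitted or tuned.

```python
import cv2
import numpy as np

def skin_mask_ellipse(image_bgr, centre=(155.0, 113.0),
                      axes=(23.0, 15.0), angle_deg=43.0):
    """Mark pixels whose (Cr, Cb) values fall inside a skin ellipse.

    All ellipse parameters are assumed placeholders, not patent values.
    """
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    cr = ycrcb[:, :, 1] - centre[0]
    cb = ycrcb[:, :, 2] - centre[1]
    theta = np.deg2rad(angle_deg)
    # Rotate into the ellipse's own axes, then apply the ellipse equation.
    u = cr * np.cos(theta) + cb * np.sin(theta)
    v = cb * np.cos(theta) - cr * np.sin(theta)
    inside = (u / axes[0]) ** 2 + (v / axes[1]) ** 2 <= 1.0  # boundary included
    return inside.astype(np.uint8) * 255
```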
In step S33, an image in which the face region is located is subjected to image filtering processing, so as to obtain an optimized image.
In the embodiment of the application, after the face region is obtained, it is filtered to remove the color blocks, noise, artifacts, and similar phenomena in the image. Specifically, this can be implemented as steps B1-B2:
step B1: and carrying out multi-level median filtering processing on the image where the face area is located to obtain a filtered image.
Median filtering is a non-linear digital filtering technique often used to remove noise from images or other signals. The idea is to examine each sample in the input signal and judge whether it is representative of the signal, using an observation window consisting of an odd number of samples. The values in the window are sorted and the middle (median) value is taken as the output; the oldest value is then discarded, a new sample is taken in, and the computation is repeated.
Multi-level median filtering is a non-linear filtering method used in image processing. It finds wide application because it can smooth noise while protecting image details from corruption. Through multi-level median filtering, the color blocks, noise, artifacts, and similar phenomena in the face image can be eliminated.
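One common multistage formulation is sketched below: directional medians (horizontal, vertical, and both diagonals) are combined with the centre pixel, which preserves thin lines and edges better than a plain square-window median. The window size and the exact variant are assumptions, and the unoptimized Python loops are for clarity only.

```python
import numpy as np

def multilevel_median(gray, radius=2):
    """Step B1: multi-level (multistage) median filter on a grayscale image."""
    h, w = gray.shape
    pad = np.pad(gray, radius, mode="edge").astype(np.float32)
    out = np.empty((h, w), dtype=np.float32)
    offsets = range(-radius, radius + 1)
    for i in range(h):
        for j in range(w):
            ci, cj = i + radius, j + radius
            c = pad[ci, cj]
            z1 = np.median([pad[ci, cj + k] for k in offsets])      # horizontal
            z2 = np.median([pad[ci + k, cj] for k in offsets])      # vertical
            z3 = np.median([pad[ci + k, cj + k] for k in offsets])  # diagonal
            z4 = np.median([pad[ci + k, cj - k] for k in offsets])  # anti-diagonal
            # Combine the directional medians with the centre pixel in two stages.
            out[i, j] = np.median([np.median([z1, z2, c]),
                                   np.median([z3, z4, c]), c])
    return out.astype(gray.dtype)
```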
Step B2: and carrying out image fusion on the filtered image and the image to be processed to obtain the optimized image.
Image fusion refers to processing image data of the same target collected from multiple source channels by means of image processing and computer techniques, so that the useful information in each channel is extracted to the greatest extent and finally synthesized into a high-quality image.
In the embodiment of the application, the filtered face image is fused with the original image to be processed, so that the face contour and the face image can be optimized while ensuring that no detail is lost.
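A minimal fusion sketch follows, assuming both frames are 3-channel images of the same size; the Gaussian feathering of the mask edge is an added assumption (the patent only states that the two images are fused) to keep the transition between processed and untouched pixels seamless.

```python
import cv2
import numpy as np

def fuse_images(original, filtered, face_mask, feather=15):
    """Step B2: blend the filtered face region back into the original frame."""
    alpha = face_mask.astype(np.float32) / 255.0
    alpha = cv2.GaussianBlur(alpha, (feather, feather), 0)  # soften mask edge
    alpha = alpha[:, :, None]                               # broadcast over channels
    fused = alpha * filtered.astype(np.float32) \
        + (1.0 - alpha) * original.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)
```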
In the embodiment of the application, to further optimize the face contour and the face image, image weighting is applied on top of the multi-level median filtering of the face image.
For example, when the face edge contour is optimized, multi-level median filtering needs to be applied to the pixels near the edge contour. In this case the pixels on the edge contour and the pixels near it are weighted separately before the multi-level median filtering is performed, with the contour pixels receiving a higher weight than the nearby pixels. The weights may be preset or computed when the image is optimized. For instance, suppose six pixels near the edge contour are considered, pixel 1 through pixel 6, of which pixels 1 and 2 lie on the edge contour itself and the remaining pixels lie near it. Then, when multi-level median filtering is applied to these six pixels, a weighting factor of 0.8 is assigned to pixels 1 and 2, and a weighting factor of 0.2 to pixels 3, 4, 5, and 6.
It should be noted that the weighting factors assigned to pixels of the same type may be the same or different. For example, pixels 1 and 2 on the edge contour may both be given a factor of 0.8; alternatively, pixel 1 may be given 0.8 while pixel 2 is given 0.7.
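The weighting described above can be expressed as a per-pixel weight map; the sketch below marks contour pixels with the higher factor (0.8 in the example) and a small surrounding band with the lower one (0.2). The band width is an assumed parameter, and the resulting map could then serve as the blending factor in the fusion step.

```python
import cv2
import numpy as np

def edge_weight_map(face_mask, band=3, w_edge=0.8, w_near=0.2):
    """Weight map: edge-contour pixels vs. pixels in a band around the contour."""
    kernel3 = np.ones((3, 3), np.uint8)
    # Morphological gradient (dilation minus erosion) yields the contour pixels.
    contour = cv2.morphologyEx(face_mask, cv2.MORPH_GRADIENT, kernel3)
    near = cv2.dilate(contour, np.ones((2 * band + 1, 2 * band + 1), np.uint8))
    weights = np.zeros(face_mask.shape, np.float32)
    weights[near > 0] = w_near      # band around the contour gets the lower weight
    weights[contour > 0] = w_edge   # contour pixels themselves get the higher weight
    return weights
```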
This combination can effectively remove the defects such as color blocks, noise, and artifacts produced by encoding, has strong real-time performance, and supports transcoding of more than 15 concurrent 720p live video streams.
In step S34, the video composed of the optimized frames is taken as the optimized video.
The embodiment of the application also discloses a complete face video processing method, whose flow chart is shown in fig. 4. The method specifically comprises the following steps:
step 41: acquiring a video to be processed; the video to be processed comprises at least two frames of images to be processed, and the video to be processed is coded video.
Step 42: and aiming at each frame of image to be processed, carrying out key point detection on the image to be processed to obtain the face key points of the image to be processed.
Step 43: and determining the cheek key points from the face key points according to the coordinates of the face key points.
Step 44: and carrying out face edge positioning processing on the cheek key points to obtain a face contour of the image to be processed.
Step 45: and carrying out skin color detection on the image to be processed, and excluding non-skin areas of the face area.
Step 46: and carrying out multi-level median filtering processing on the image where the face area is located to obtain a filtered image.
Step 47: and carrying out image fusion on the filtered image and the image to be processed to obtain an optimized image.
Step 48: And taking the video formed by the optimized images of each frame as an optimized video.
Therefore, through the above processing, defects such as color blocks, noise, and artifacts at the edge of the face, caused by the image quality degradation introduced when the mobile terminal encodes the uploaded video, can be reduced. Moreover, because only the face edge region is processed, side effects of the algorithm on other regions are effectively avoided.
Based on the same inventive concept, the present application further provides a face video processing apparatus. Fig. 5 is a schematic diagram of the face video processing apparatus provided by the present application. The apparatus includes:
an obtaining module 501 configured to perform obtaining a video to be processed; the video to be processed comprises at least two frames of images to be processed, and the video to be processed is a coded video;
a detection module 502 configured to perform face detection on each frame of image to be processed, and determine a face region of the image to be processed;
the processing module 503 is configured to perform image filtering processing on the image where the face region is located, so as to obtain an optimized image;
a determining module 504 configured to perform video composed of the optimized images of the frames as the optimized video.
In one possible implementation, the detection module 502 includes:
the detection unit is configured to perform key point detection on the image to be processed to obtain a face key point of the image to be processed;
and the positioning unit is configured to perform face edge positioning processing according to the face key points to obtain a face area of the image to be processed.
In one possible implementation, the positioning unit includes:
a determining subunit configured to perform determining cheek key points from the face key points according to coordinates of the face key points;
and the positioning subunit is configured to perform face edge positioning processing on the cheek key points to obtain a face contour of the image to be processed.
In one possible implementation, the apparatus further includes:
and the excluding module is configured to execute the detecting module 502 to perform face detection on the image to be processed, and after the face region of the image to be processed is determined, perform skin color detection on the image to be processed to exclude a non-skin region of the face region.
In one possible implementation, the processing module 503 includes:
the filtering unit is configured to perform multilevel median filtering processing on the image where the face area is located to obtain a filtered image;
and the image fusion unit is configured to perform image fusion on the filtered image and the image to be processed to obtain the optimized image.
As shown in fig. 6, based on the same technical concept, the embodiment of the present application further provides an electronic device 60, which may include a memory 601 and a processor 602.
The memory 601 is used to store the computer programs executed by the processor 602. The memory 601 may mainly include a program storage area and a data storage area; the program storage area may store an operating system, application programs required for at least one function, and the like, while the data storage area may store data created through use of the device, and the like. The processor 602 may be a central processing unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 601 and the processor 602 is not limited in the embodiments of the present application. In fig. 6, the memory 601 and the processor 602 are connected by a bus 603, represented by a thick line; the connection manner between other components is merely illustrative and not limiting. The bus 603 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean there is only one bus or one type of bus.
The memory 601 may be a volatile memory, such as a random-access memory (RAM); the memory 601 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 601 may also be a combination of the above memories.
The processor 602 is configured to execute the method performed by the device in the embodiment shown in fig. 3 when invoking the computer program stored in the memory 601.
In some possible embodiments, various aspects of the methods provided herein may also be implemented in the form of a program product that includes program code. When the program product runs on a computer device, the program code causes the computer device to perform the steps of the methods according to the various exemplary embodiments described above in this specification; for example, the computer device may perform the methods performed by the devices in the embodiments shown in figs. 1-3.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application. Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A face video processing method is characterized by comprising the following steps:
acquiring a video to be processed; the video to be processed comprises at least two frames of images to be processed, and the video to be processed is a coded video;
carrying out face detection on each frame of image to be processed to determine a face area of the image to be processed;
carrying out image filtering processing on the image where the face area is located to obtain an optimized image;
and taking the video formed by the optimized images of each frame as an optimized video.
2. The method according to claim 1, wherein the performing face detection on the image to be processed and determining the face region of the image to be processed comprises:
performing key point detection on the image to be processed to obtain a face key point of the image to be processed;
and carrying out face edge positioning processing according to the face key points to obtain a face area of the image to be processed.
3. The method according to claim 2, wherein the performing face edge location processing according to the face key points to obtain the face region of the image to be processed comprises:
determining a cheek key point from the face key points according to the coordinates of the face key points;
and carrying out face edge positioning processing on the cheek key points to obtain a face contour of the image to be processed.
4. The method according to claim 2, wherein after the face detection is performed on the image to be processed and the face region of the image to be processed is determined, the method further comprises:
and carrying out skin color detection on the image to be processed, and excluding non-skin areas of the face area.
5. The method according to claim 1, wherein the image filtering processing on the image where the face region is located to obtain an optimized image comprises:
performing multi-level median filtering processing on the image where the face area is located to obtain a filtered image;
and carrying out image fusion on the filtered image and the image to be processed to obtain the optimized image.
6. A face video processing apparatus, the apparatus comprising:
an acquisition module configured to perform acquiring a video to be processed; the video to be processed comprises at least two frames of images to be processed, and the video to be processed is a coded video;
the detection module is configured to execute face detection on each frame of image to be processed and determine a face area of the image to be processed;
the processing module is configured to perform image filtering processing on the image where the face area is located to obtain an optimized image;
and the determining module is configured to take the video formed by the optimized images of each frame as the optimized video.
7. The apparatus of claim 6, wherein the detection module comprises:
the detection unit is configured to perform key point detection on the image to be processed to obtain a face key point of the image to be processed;
and the positioning unit is configured to perform face edge positioning processing according to the face key points to obtain a face area of the image to be processed.
8. The apparatus of claim 7, wherein the positioning unit comprises:
a determining subunit configured to perform determining cheek key points from the face key points according to coordinates of the face key points;
and the positioning subunit is configured to perform face edge positioning processing on the cheek key points to obtain a face contour of the image to be processed.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the face video processing method of any of claims 1 to 5.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the face video processing method of any one of claims 1 to 5.
CN202011354376.4A 2020-11-26 2020-11-26 Face video processing method and device and storage medium Pending CN112533024A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011354376.4A CN112533024A (en) 2020-11-26 2020-11-26 Face video processing method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011354376.4A CN112533024A (en) 2020-11-26 2020-11-26 Face video processing method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112533024A true CN112533024A (en) 2021-03-19

Family

ID=74994108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011354376.4A Pending CN112533024A (en) 2020-11-26 2020-11-26 Face video processing method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112533024A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327207A (en) * 2021-06-03 2021-08-31 广州光锥元信息科技有限公司 Method and device applied to image face optimization

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104574306A (en) * 2014-12-24 2015-04-29 掌赢信息科技(上海)有限公司 Face beautifying method for real-time video and electronic equipment
CN106447627A (en) * 2016-09-08 2017-02-22 阔地教育科技有限公司 Image processing method and apparatus
CN107749062A (en) * 2017-09-18 2018-03-02 深圳市朗形网络科技有限公司 Image processing method and device
CN110086997A (en) * 2019-05-20 2019-08-02 北京百度网讯科技有限公司 Facial image exposes luminance compensation method and device
WO2020207423A1 (en) * 2019-04-12 2020-10-15 虹软科技股份有限公司 Skin type detection method, skin type grade classification method and skin type detection apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104574306A (en) * 2014-12-24 2015-04-29 掌赢信息科技(上海)有限公司 Face beautifying method for real-time video and electronic equipment
CN106447627A (en) * 2016-09-08 2017-02-22 阔地教育科技有限公司 Image processing method and apparatus
CN107749062A (en) * 2017-09-18 2018-03-02 深圳市朗形网络科技有限公司 Image processing method and device
WO2020207423A1 (en) * 2019-04-12 2020-10-15 虹软科技股份有限公司 Skin type detection method, skin type grade classification method and skin type detection apparatus
CN110086997A (en) * 2019-05-20 2019-08-02 北京百度网讯科技有限公司 Facial image exposes luminance compensation method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327207A (en) * 2021-06-03 2021-08-31 广州光锥元信息科技有限公司 Method and device applied to image face optimization
CN113327207B (en) * 2021-06-03 2023-12-08 广州光锥元信息科技有限公司 Method and device applied to image face optimization

Similar Documents

Publication Publication Date Title
US8606042B2 (en) Blending of exposure-bracketed images using weight distribution functions
US20190180454A1 (en) Detecting motion dragging artifacts for dynamic adjustment of frame rate conversion settings
CN110263699B (en) Video image processing method, device, equipment and storage medium
CN113034384A (en) Video processing method, video processing device, electronic equipment and storage medium
EP1964006A2 (en) Selecting key frames from video frames
KR102182697B1 (en) Apparatus and method for processing image
JP2009147911A (en) Video data compression preprocessing method, video data compression method employing the same and video data compression system
CN110620924B (en) Method and device for processing coded data, computer equipment and storage medium
CN111047532B (en) Low-illumination video enhancement method based on 3D convolutional neural network
CN112150400B (en) Image enhancement method and device and electronic equipment
CN111445424A (en) Image processing method, image processing device, mobile terminal video processing method, mobile terminal video processing device, mobile terminal video processing equipment and mobile terminal video processing medium
CN111800630A (en) Method and system for reconstructing video super-resolution and electronic equipment
TWI513291B (en) Method and apparatus for image processing and computer readable medium
CN110378860B (en) Method, device, computer equipment and storage medium for repairing video
CN111192213A (en) Image defogging adaptive parameter calculation method, image defogging method and system
US20220067417A1 (en) Bandwidth limited context based adaptive acquisition of video frames and events for user defined tasks
Men et al. Visual quality assessment for interpolated slow-motion videos based on a novel database
CN112533024A (en) Face video processing method and device and storage medium
US11095901B2 (en) Object manipulation video conference compression
US20230325974A1 (en) Image processing method, apparatus, and non-transitory computer-readable medium
CN114302226B (en) Intelligent cutting method for video picture
CN115471413A (en) Image processing method and device, computer readable storage medium and electronic device
CN111382772B (en) Image processing method and device and terminal equipment
CN112118446B (en) Image compression method and device
CN113935910A (en) Image fuzzy length measuring method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210319

RJ01 Rejection of invention patent application after publication