CN111669647B - Real-time video processing method, device and equipment and storage medium


Info

Publication number
CN111669647B
CN111669647B
Authority
CN
China
Prior art keywords
frame
video
style conversion
face
type
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010537321.0A
Other languages
Chinese (zh)
Other versions
CN111669647A (en)
Inventor
李鑫
李甫
林天威
何栋梁
张赫男
孙昊
文石磊
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010537321.0A
Publication of CN111669647A
Application granted
Publication of CN111669647B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application discloses a real-time video processing method, apparatus, device and storage medium, relating to the technical fields of digital image processing and deep learning. The specific implementation scheme is as follows: adding video frames acquired in real time to a video frame set, and acquiring a current processing frame from the video frame set; inputting the current processing frame into a face style conversion model, and acquiring the first-type style conversion frame output by the model; taking the current processing frame and the first-type style conversion frame as starting points, generating corresponding second-type style conversion frames according to the positional relationship between the face key points in each of a set number of subsequent video frames in the video frame set and those in its preceding video frame; and after acquiring a new current processing frame from the video frame set, returning to the operation of inputting the current processing frame into the face style conversion model, with the first-type and second-type style conversion frames taken as the real-time video processing result. The technical solution of the embodiments of the application can generate a matched style face in real time based on the real face in a video.

Description

Real-time video processing method, device, equipment and storage medium
Technical Field
Embodiments of the application relate to image processing and deep learning technologies, in particular to digital image processing technology, and specifically to a real-time video processing method, apparatus, device and storage medium.
Background
With the continuous improvement of living standards, users' entertainment demands have become increasingly diversified, and transforming the real faces in videos into cartoon-style faces has attracted growing attention and affection from users.
In the prior art, a matched style face is usually generated offline from the real face in a video, or a pre-generated fixed style face is driven in real time by the real face in the video to produce a style face with the same expression as the real face. However, neither approach can generate, in real time, a style face matched to the real face in the video.
Disclosure of Invention
The embodiments of the application provide a real-time video processing method, apparatus, device and storage medium, which realize real-time generation of a style face matched with the real face in a video.
In a first aspect, an embodiment of the present application provides a method for processing a real-time video, including:
adding a video frame acquired in real time into a video frame set, and acquiring a current processing frame from the video frame set, wherein the video frame comprises a real face;
inputting the current processing frame into a face style conversion model, and acquiring a first type of style conversion frame output by the face style conversion model, wherein the style conversion frame comprises a style face;
generating a set number of second-class style conversion frames by taking a current processing frame and a first-class style conversion frame as starting points according to the position relation between the key points of each face in a set number of subsequent video frames and a previous video frame in a video frame set;
and after acquiring a new current processing frame from the video frame set, returning to execute the operation of inputting the current processing frame into the face style conversion model, and taking the first type style conversion frame and the second type style conversion frame as real-time video processing results.
In a second aspect, an embodiment of the present application further provides a device for processing a real-time video, including:
the acquisition module is used for adding the video frames acquired in real time into the video frame set and acquiring the current processing frame from the video frame set, wherein the video frames comprise real faces;
the first conversion module is used for inputting the current processing frame into the face style conversion model and acquiring a first type of style conversion frame output by the face style conversion model, wherein the style conversion frame comprises a style face;
the second transformation module is used for generating the second type of style transformation frames with the set number according to the position relation between the key points of each face in the subsequent video frames and the previous video frames with the set number in the video frame set by taking the current processing frame and the first type of style transformation frames as starting points;
and the circulating module is used for returning and executing the operation of inputting the current image frame into the face style conversion model after acquiring a new current processing frame from the video frame set, and taking the first type style conversion frame and the second type style conversion frame as real-time video processing results.
In a third aspect, an embodiment of the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for processing real-time video as provided in any of the embodiments of the present application.
In a fourth aspect, the embodiments of the present application further provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the processing method of real-time video provided in any of the embodiments of the present application.
According to the technical scheme of the embodiment of the application, the video frames acquired in real time are added into the video frame set, and the current processing frame is acquired from the video frame set, wherein the video frames comprise real faces; inputting the current processing frame into a face style conversion model, and acquiring a first type of style conversion frame output by the face style conversion model, wherein the style conversion frame comprises a style face; generating a set number of second-class style conversion frames by taking a current processing frame and a first-class style conversion frame as starting points according to the position relation between the key points of each face in a set number of subsequent video frames and a previous video frame in a video frame set; after a new current processing frame is acquired from the video frame set, the operation of inputting the current processing frame into the face style conversion model is returned to be executed, and the first type style conversion frame and the second type style conversion frame are used as real-time video processing results, so that a style face matched with the real face is generated in real time based on the real face in the video, and the problem that the style face matched with the real face in the video cannot be generated in real time in the prior art is solved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a flowchart of a method for processing real-time video according to a first embodiment of the present application;
fig. 2a is a flow chart of a method for processing real-time video according to a second embodiment of the present application;
FIG. 2b is a flow chart of a process for real-time video according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of a real-time video processing apparatus according to a third embodiment of the present application;
fig. 4 is a block diagram of an electronic device for implementing a method for processing real-time video according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
First embodiment
Fig. 1 is a flowchart of a real-time video processing method in a first embodiment of the present application. The technical solution of this embodiment is suitable for generating, in real time, a style face matched with the real face in a video. The method may be performed by a real-time video processing apparatus, which may be implemented in software and/or hardware and is generally integrated in an electronic device, for example a terminal device. The method of this embodiment specifically includes the following steps:
and step 110, adding the video frames acquired in real time into a video frame set, and acquiring a current processing frame from the video frame set.
Wherein, the video frame comprises a real face.
In this embodiment, a video frame may be an image frame acquired while a terminal device shoots a video; for example, it may be a frame acquired in real time during video recording, or a frame acquired in real time during a live video broadcast. A video frame set is a set for storing the video frames acquired in real time; usually all video frames of one video are put into one video frame set, so that video frames of different videos are distinguished by different video frame sets.
Optionally, adding the video frames acquired in real time to the video frame set may include: and in the video recording process, responding to the face style conversion request, and adding the video frames acquired in real time into the video frame set.
In this alternative embodiment, the video recording process refers to shooting video without playing the shot content in real time. During video recording, if the user clicks a face style conversion option on the current shooting page, the face style conversion request is responded to: the video frames acquired in real time after the click are added to the video frame set, or alternatively all video frames acquired in real time since recording began are added to the video frame set.
Optionally, adding the video frames acquired in real time to the video frame set may include: in the video live broadcast process, responding to a face style conversion request, and adding the video frames collected in real time into a video frame set.
In this alternative embodiment, the live video broadcast process refers to playing the shot content in real time while shooting. During a live broadcast, if the user clicks a face style conversion option on the current live page, the face style conversion request is responded to: the video frames acquired in real time after the click are added to the video frame set, or alternatively all video frames acquired in real time since capture began are added to the video frame set.
In this embodiment, after the video frames acquired in real time are added to the video frame set, the earliest-acquired video frame that has not yet undergone face style conversion is taken from the video frame set, in order of acquisition time, and used as the current processing frame.
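For illustration only, the frame-set bookkeeping described above could be organized as in the following minimal Python sketch; the class and method names are assumptions, not part of the patent.

```python
import time
from collections import deque

class VideoFrameSet:
    """Stores frames in acquisition order and hands out the earliest
    frame that has not yet been style-converted (illustrative sketch)."""

    def __init__(self):
        self._frames = deque()  # (timestamp, frame) in acquisition order

    def add(self, frame):
        self._frames.append((time.time(), frame))

    def next_unprocessed(self):
        """Return the earliest unprocessed frame, or None if the set is empty."""
        if self._frames:
            return self._frames.popleft()[1]
        return None
```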
And 120, inputting the current processing frame into a face style conversion model, and acquiring a first type of style conversion frame output by the face style conversion model.
Wherein the style conversion frame comprises a style face.
In this embodiment, the face style conversion model is configured to generate a style face matched with the real face in the input video frame and to output a first-type style conversion frame containing that style face. The first-type style conversion frame is a video frame obtained by replacing the real face in the current processing frame with the style face. Each style face matches a unique real face, i.e., each real face has its own exclusive style face, and the style face is consistent with the real face in size and facial features.
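The patent does not prescribe a concrete model interface. As a minimal sketch, assuming the face style conversion model is an image-to-image network exposed as a PyTorch module, step 120 could look like this (the function name and tensor layout are assumptions):

```python
import torch

def first_type_style_frame(model, frame_tensor):
    """Run the face style conversion model on the current processing frame.

    frame_tensor: a (1, 3, H, W) float tensor holding the current frame.
    Returns the first-type style conversion frame, in which the real face
    has been replaced by its matched style face.
    """
    model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        return model(frame_tensor)
```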
And step 130, taking the current processing frame and the first-class style conversion frame as starting points, and generating a set number of second-class style conversion frames according to the position relation between the key points of each face in the subsequent video frames and the previous video frames in the set number of video frames.
In this embodiment, it is considered that using the face style conversion model alone to style-convert the real faces in all video frames acquired in real time involves a large amount of computation and is too slow to achieve a real-time effect. Therefore, style conversion via the face style conversion model is combined with style conversion based on the positional variation of the face key points between two adjacent video frames, and the two approaches are used alternately to generate, in real time, style faces matched with the real faces in the video frames.
In this embodiment, after the first-type style conversion frame matched with the current processing frame is obtained through the face style conversion model, a set number of second-type style conversion frames, respectively matched with the set number of subsequent video frames after the current processing frame in the video frame set, are generated according to the positional variation of the face key points between adjacent video frames. The above process is then repeated to generate the first-type or second-type style conversion frame corresponding to each subsequent video frame in the video frame set.
Optionally, generating the second type of style transformation frames in the set number according to the position relationship between the key points of the face in the previous video frame and the subsequent video frames in the set number with the current processing frame and the first type of style transformation frame as starting points may include: taking the current processing frame as a processing starting point frame, and acquiring a subsequent video frame of the processing starting point frame in the video frame set; generating a face key point transformation matrix according to the image positions of each face key point in the subsequent video frame and the processing starting point frame in the corresponding video frame; generating a second type style conversion frame of a next video frame according to the face key point conversion matrix and the first type style conversion frame or the second type style conversion frame matched with the processing starting point frame; and taking the next video frame as a processing starting point frame, returning and executing the operation of obtaining the next video frame of the processing starting point frame in the video frame set until the processing quantity of the next video frame reaches the set quantity.
In this optional embodiment, a specific manner is provided for generating the second type of style conversion frames in the set number according to the position relationship between the key points of each face in the set number of subsequent video frames and the previous video frame in the video frame set, with the currently processed frame and the first type of style conversion frame as starting points, and the specific process is as follows:
First, the current processing frame is taken as the processing start frame, and its subsequent video frame is obtained from the video frame set. Then, the coordinate variation of each face key point between the two adjacent video frames is computed from the position coordinates of the face key points in the subsequent video frame and in the processing start frame, yielding a face key point transformation matrix. Next, according to this transformation matrix, the coordinates of the face key points of the style face in the style conversion frame matched with the processing start frame are adjusted to generate the second-type style conversion frame of the subsequent video frame. It is then judged whether the number of consecutively generated second-type style conversion frames equals the set number: if so, the operation on the video frames after this subsequent frame stops; if not, the subsequent video frame is taken as the new processing start frame, and the operation of obtaining its subsequent video frame from the video frame set is executed again, so that each further second-type style conversion frame is generated by adjusting, according to the face key point transformation matrix between adjacent frames, the style face in the first-type or second-type style conversion frame matched with the current processing start frame.
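As an illustrative realization of this step (an assumption; the patent only specifies adjusting key point coordinates by a transformation matrix), the matrix can be estimated from the key point motion between adjacent frames and applied to the previous style conversion frame, for example with OpenCV:

```python
import cv2
import numpy as np

def second_type_style_frame(prev_keypoints, next_keypoints, prev_style_frame):
    """Warp the previous style conversion frame so the style face follows
    the motion of the real-face key points between adjacent frames.

    prev_keypoints, next_keypoints: (N, 2) arrays of face key point
    coordinates in the processing start frame and the subsequent frame.
    """
    # Estimate a similarity/affine transform from the key point motion;
    # this plays the role of the face key point transformation matrix.
    matrix, _ = cv2.estimateAffinePartial2D(
        prev_keypoints.astype(np.float32), next_keypoints.astype(np.float32))
    h, w = prev_style_frame.shape[:2]
    # Apply the transform to the matched style conversion frame to obtain
    # the second-type style conversion frame of the subsequent video frame.
    return cv2.warpAffine(prev_style_frame, matrix, (w, h))
```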
It should be noted that the set number in the embodiments of the application may be 1, 2, or another value set as required. When a second-type style conversion frame is generated from the face key point transformation matrix, only the key point positions of the style face in the style conversion frame matched with the previous video frame need to be adjusted; the computation is small and fast, so the larger the set number, the faster face style conversion is achieved for the video frames in the video frame set. However, to keep the style face highly matched to the real face, the set number should not be too large.
And step 140, after acquiring a new current processing frame from the video frame set, returning to execute the operation of inputting the current processing frame into the face style conversion model, and taking the first type style conversion frame and the second type style conversion frame as real-time video processing results.
In this embodiment, after the number of second-type style conversion frames consecutively generated from the face key point transformation matrices of adjacent video frames reaches the set number, a new current processing frame is acquired from the video frame set and the operation of inputting the current processing frame into the face style conversion model is executed again: a first-type style conversion frame matched with the new current processing frame is obtained through the model, and a set number of second-type style conversion frames, matched with the set number of subsequent video frames after that current processing frame in the video frame set, are then generated from the face key point transformation matrices of adjacent video frames.
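Putting steps 110 to 140 together, the alternation between model inference and key point warping can be sketched as follows, reusing the illustrative helpers above (`VideoFrameSet`, `first_type_style_frame`, `second_type_style_frame`) plus an assumed `detect_keypoints` function; tensor/array conversions are omitted for brevity:

```python
def process_stream(frame_set, model, detect_keypoints, set_number=2):
    """Illustrative main loop: one model inference, then `set_number`
    cheaper key-point-warped frames, repeated until the set is drained."""
    results = []
    current = frame_set.next_unprocessed()
    while current is not None:
        style = first_type_style_frame(model, current)      # step 120
        results.append(style)
        start, start_style = current, style
        for _ in range(set_number):                         # step 130
            nxt = frame_set.next_unprocessed()
            if nxt is None:
                return results
            start_style = second_type_style_frame(
                detect_keypoints(start), detect_keypoints(nxt), start_style)
            results.append(start_style)
            start = nxt
        current = frame_set.next_unprocessed()              # step 140
    return results
```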
Optionally, the processing, with the first type of style transformation frame and the second type of style transformation frame as the real-time video processing result, may include: and recording and playing the first type style conversion frame and the second type style conversion frame in real time, and generating a recorded video.
In this optional embodiment, if the video frames included in the video frame set are acquired in real time during the video recording process, the first type of style conversion frames generated according to the face style conversion model and the second type of style conversion frames generated according to the face key point conversion matrix are recorded and played in real time, and a recorded video is generated.
Optionally, the processing, with the first-class style transformation frame and the second-class style transformation frame as the real-time video processing result, may include: and generating a live broadcast video stream according to the first type style conversion frame and the second type style conversion frame, and sending the live broadcast video stream to a live broadcast server for video live broadcast.
In this optional embodiment, if the video frames included in the video frame set are acquired in real time during the live video broadcast process, a live video stream is generated according to the first type of style conversion frame generated by the face style conversion model and the second type of style conversion frame generated by the face key point conversion matrix, and the live video stream is sent to a live broadcast server for live video broadcast.
According to the technical scheme of the embodiment of the application, the video frames collected in real time are added into the video frame set, and the current processing frame including the real face is obtained from the video frame set; inputting the current processing frame into a face style conversion model, and acquiring a first type of style conversion frame output by the face style conversion model, wherein the style conversion frame comprises a style face; taking a current processing frame and a first type of style conversion frame as starting points, and generating a set number of second type of style conversion frames according to the position relationship between the key points of each face in a set number of subsequent video frames and a previous video frame in a video frame set; after a new current processing frame is acquired from the video frame set, the operation of inputting the current processing frame into the face style conversion model is returned to be executed, and the first type style conversion frame and the second type style conversion frame are used as real-time video processing results, so that a style face matched with the real face is generated in real time based on the real face in the video, and the problem that the style face matched with the real face in the video cannot be generated in real time in the prior art is solved.
Second embodiment
Fig. 2a is a flowchart of a real-time video processing method in a second embodiment of the present application, which is further detailed based on the above embodiments, and provides a specific step of determining a face style conversion model before a current processing frame is input into the face style conversion model, a specific step of recording and playing a first type of style conversion frame and a second type of style conversion frame in real time, and a specific step of generating a live video stream according to the first type of style conversion frame and the second type of style conversion frame, and sending the live video stream to a live server. A method for processing a real-time video according to a second embodiment of the present application is described below with reference to fig. 2a, which includes the following steps:
and step 210, determining a face style conversion model.
In this embodiment, a preset machine learning model is subjected to model training to generate a face style conversion model that can generate a style face matched with a real face according to the real face in an input video frame, and output a first type of style conversion frame including the style face.
Optionally, determining the face style conversion model may include: acquiring a training sample set, wherein the training sample set comprises a plurality of sample image pairs, and each sample image pair comprises an original image and a transformed image; setting a machine learning model for training by using each sample image in the training sample set to obtain the face style conversion model; the original image comprises a real face, and the transformed image comprises a style face matched with the real face.
In this optional embodiment, a plurality of high-quality sample image pairs are obtained to form a training sample set, each sample image pair includes an original image and a transformed image, the original image includes a real face, such as a face in the current processing frame in fig. 2b, and the transformed image includes a style face matched with the real face, such as a face in the first style transformation frame in fig. 2 b. And then, training a set machine learning model by using each sample image in the training sample set, so that the set machine learning model learns and generates a style face matched with the real face in the transformed image according to the real face in the input original image, and the trained machine learning model is a face style conversion model.
Optionally, the machine learning model includes: a generative adversarial network.
In this alternative embodiment, the generative adversarial network includes a generator and a discriminator. The generator is configured to generate a style face matching the real face in the original image sample, and the discriminator is configured to judge whether the style face generated by the generator is the style face in the transformed image sample. The generative adversarial network is trained until the style face generated by the generator is the same as the style face in the transformed image sample and the discriminator can no longer find any difference between them, i.e., the discriminator judges them to be the same; at this point the adversarial game between the generator and the discriminator has converged, and the trained generative adversarial network is the face style conversion model.
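A minimal paired-training sketch follows, assuming PyTorch generator/discriminator modules and a binary cross-entropy adversarial loss; the patent names only a generative adversarial network, not a specific objective, so the loss and function names are assumptions:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(generator, discriminator, g_opt, d_opt,
                              original, transformed):
    """One training step on a sample pair (original image, transformed image)."""
    # Discriminator step: real transformed images -> 1, generated faces -> 0.
    d_opt.zero_grad()
    fake = generator(original).detach()  # block gradients into the generator
    real_pred = discriminator(transformed)
    fake_pred = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_pred,
                                                 torch.ones_like(real_pred))
              + F.binary_cross_entropy_with_logits(fake_pred,
                                                   torch.zeros_like(fake_pred)))
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator output 1 for fakes.
    g_opt.zero_grad()
    gen_pred = discriminator(generator(original))
    g_loss = F.binary_cross_entropy_with_logits(gen_pred,
                                                torch.ones_like(gen_pred))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```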
In this embodiment, the style face in the first-type style conversion frame output by the face style conversion model is highly matched with the real face and consistent with it in size and facial features, so the real face can be vividly transformed into a style face, for example a cartoon-style anime face, attracting the user's attention and increasing the user's interest.
And step 220, adding the video frames acquired in real time into the video frame set, and acquiring the current processing frame from the video frame set.
As can be seen from the explanation of step 110 in the first embodiment, adding the video frames collected in real time to the video frame set may refer to adding the video frames collected in real time to the video frame set in response to the face style conversion request during the video recording process, or may refer to adding the video frames collected in real time to the video frame set in response to the face style conversion request during the video live broadcast process.
And step 230, inputting the current processing frame into the face style conversion model, and acquiring a first type style conversion frame output by the face style conversion model.
For example, as shown in fig. 2b, the current processing frame is input into the face style conversion model; an anime-style face matched with the real face in the current processing frame is generated by the trained model, and a first-type style conversion frame containing that anime-style face is obtained.
And step 240, taking the current processing frame as a processing starting point frame, and acquiring a video frame subsequent to the processing starting point frame in the video frame set.
And step 250, generating a face key point transformation matrix according to the image positions of each face key point in the subsequent video frame and the processing starting point frame in the corresponding video frame.
For example, as shown in fig. 2b, based on the position coordinates of each face key point in the subsequent video frame in the video frame and the position coordinates of each face key point in the processing start frame in the video frame, the coordinate variation of each face key point in the subsequent video frame with respect to each face key point in the processing start frame, for example, the coordinate variation of each key point of the nose, the coordinate variation of each key point of the face contour, and the like, is calculated, and the face key point transformation matrix is generated.
And step 260, generating a second-class style conversion frame of the next video frame according to the face key point conversion matrix and the first-class style conversion frame or the second-class style conversion frame matched with the processing starting point frame.
In this embodiment, coordinates of each face key point of the style face in the first type of style conversion frame matched with the processing start point frame are correspondingly adjusted according to the coordinate variation of each face key point in the face key point conversion matrix, so as to generate a second type of style conversion frame of the subsequent video frame.
Step 270, determining whether the number of the second-type style conversion frames generated continuously is equal to the set number, if yes, executing step 290, otherwise, executing step 280.
Step 280, the next video frame is taken as the processing starting point frame, and the step 240 is executed again.
Step 290, obtaining a new current processing frame from the video frame set, and returning to execute step 230.
And step 2110, taking the first-class style transformation frame and the second-class style transformation frame as a real-time video processing result.
In this embodiment, after a set number of second-type style conversion frames are continuously generated according to the face keypoint conversion matrix, a new current processing frame is obtained from the video frame set, and the process returns to execute step 230, and the above process is repeated to generate a first-type style conversion frame or a second-type style conversion frame matched with the unprocessed video frames in the video frame set.
In this embodiment, step 290 and step 2110 may be executed in parallel by two threads: one thread obtains a new current processing frame from the video frame set and obtains the first-type style conversion frame matched with it through the face style conversion model, while, once the generated first-type and second-type style conversion frames satisfy the processing condition, another thread processes them at the same time.
In this embodiment, taking the first-type style transformation frame and the second-type style transformation frame as the real-time video processing result may include: and recording and playing the first type style conversion frame and the second type style conversion frame in real time to generate a recorded video, or generating a live broadcast video stream according to the first type style conversion frame and the second type style conversion frame, and sending the live broadcast video stream to a live broadcast server for live broadcast.
Optionally, recording and playing the first-type and second-type style conversion frames in real time may include: sequentially storing the first-type and second-type style conversion frames generated in real time into a set buffer queue; and when the buffer queue meets a preset hard delay condition, sequentially acquiring the first-type or second-type style conversion frames from the buffer queue for real-time recording and playing.
In this alternative embodiment, in a video recording scene, the first-type and second-type style conversion frames generated in real time may be stored into the set buffer queue in order of generation time. When the number of style conversion frames in the buffer queue reaches a preset number, for example 20, or when the time elapsed since style conversion frames were last fetched from the buffer queue for playing reaches a preset interval, for example 3 seconds, the first-type or second-type style conversion frames are fetched from the buffer queue in sequence for real-time recording and playing.
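The hard delay condition described here (queue length threshold, or elapsed time since the last fetch) applies to both the recording and the live-broadcast scenes and could be sketched as follows; the threshold values mirror the examples in the text, and all names are illustrative:

```python
import queue
import time

BUFFER_THRESHOLD = 20        # assumed value; the text only says "e.g. 20"
MAX_INTERVAL_SECONDS = 3.0   # assumed value; the text only says "e.g. 3 seconds"

style_frame_queue = queue.Queue()
_last_fetch_time = time.monotonic()

def hard_delay_satisfied():
    """The queue holds enough frames, or enough time has passed since
    style conversion frames were last fetched for playback."""
    return (style_frame_queue.qsize() >= BUFFER_THRESHOLD
            or time.monotonic() - _last_fetch_time >= MAX_INTERVAL_SECONDS)

def fetch_for_playback():
    """Drain frames for real-time recording/playback once the condition holds."""
    global _last_fetch_time
    frames = []
    if hard_delay_satisfied():
        while not style_frame_queue.empty():
            frames.append(style_frame_queue.get())
        _last_fetch_time = time.monotonic()
    return frames
```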
Optionally, generating a live video stream according to the first-type and second-type style conversion frames and sending it to a live server may include: sequentially storing the first-type and second-type style conversion frames generated in real time into a set buffer queue; and when the buffer queue meets the preset hard delay condition, sequentially acquiring the first-type or second-type style conversion frames from the buffer queue, generating a live video stream, and sending the live video stream to a live server.
In this alternative embodiment, in a live video scene, the first-type and second-type style conversion frames generated in real time may be stored into the set buffer queue in order of generation time. When the number of style conversion frames in the buffer queue reaches a preset number, for example 20, or when the time elapsed since style conversion frames were last fetched from the buffer queue reaches a preset interval, for example 3 seconds, the first-type or second-type style conversion frames are fetched from the buffer queue in sequence to generate a live video stream, which is sent to a live server.
In this embodiment, storing style conversion frames and fetching them from the buffer queue for playing may be executed in parallel by two threads: as first-type or second-type style conversion frames are obtained, one thread stores them into the buffer queue in sequence, and when the number of style conversion frames in the buffer queue reaches the preset number, or the interval since the last fetch from the buffer queue equals the preset time, another thread simultaneously fetches the style conversion frames from the buffer queue in sequence.
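The two-thread split could be wired up as below, reusing the queue helpers sketched above (`style_frame_queue`, `fetch_for_playback`); the pipeline function and `play` callback are illustrative assumptions:

```python
import threading
import time

def start_playback_pipeline(style_frame_source, play):
    """One thread stores frames as they are generated; another fetches
    batches for playback once the hard delay condition holds."""
    def producer():
        for frame in style_frame_source:        # first- and second-type frames
            style_frame_queue.put(frame)

    def consumer():
        while True:
            for frame in fetch_for_playback():  # empty until condition holds
                play(frame)
            time.sleep(0.01)                    # avoid busy-waiting

    threading.Thread(target=producer, daemon=True).start()
    threading.Thread(target=consumer, daemon=True).start()
```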
According to the technical scheme, the face style conversion model capable of generating the style face highly matched with the real face is obtained by performing model training on the preset machine learning model, then the face style conversion mode according to the face style conversion model is combined with the face style conversion mode according to the face key point transformation matrix, and the style face matched with the real face in the video frame is generated in real time by alternately using the two modes, so that the problem that the style face matched with the real face in the video cannot be generated in real time in the prior art is solved.
Third embodiment
Fig. 3 is a schematic structural diagram of a real-time video processing apparatus in a third embodiment of the present application, where the real-time video processing apparatus includes: an acquisition module 310, a first transformation module 320, a second transformation module 330, and a loop module 340.
An obtaining module 310, configured to add a video frame acquired in real time to a video frame set, and obtain a current processing frame from the video frame set, where the video frame includes a real face;
the first transformation module 320 is configured to input the current processed frame into a face style transformation model, and obtain a first type of style transformation frame output by the face style transformation model, where the style transformation frame includes a style face;
a second transformation module 330, configured to use the current processing frame and the first-class style transformation frame as starting points, and generate a set number of second-class style transformation frames according to a position relationship between each face key point in a set number of subsequent video frames and a previous video frame in a video frame set;
and the loop module 340 is configured to return to execute an operation of inputting the current image frame into the face style conversion model after acquiring a new current processing frame from the video frame set, and use the first-type style conversion frame and the second-type style conversion frame as real-time video processing results.
According to the technical scheme of the embodiment of the application, the video frames collected in real time are added into the video frame set, and the current processing frame including the real face is obtained from the video frame set; inputting the current processing frame into a face style conversion model, and acquiring a first type of style conversion frame output by the face style conversion model, wherein the style conversion frame comprises a style face; generating a set number of second-class style conversion frames by taking a current processing frame and a first-class style conversion frame as starting points according to the position relation between the key points of each face in a set number of subsequent video frames and a previous video frame in a video frame set; after a new current processing frame is acquired from the video frame set, the operation of inputting the current processing frame into the face style conversion model is returned to be executed, and the first type style conversion frame and the second type style conversion frame are used as real-time video processing results, so that a style face matched with the real face is generated in real time based on the real face in the video, and the problem that the style face matched with the real face in the video cannot be generated in real time in the prior art is solved.
Optionally, the second transforming module 330 includes:
a subsequent video frame acquisition unit, configured to take the current processing frame as a processing start frame, and acquire a subsequent video frame of the processing start frame in the video frame set;
the transformation matrix generating unit is used for generating a face key point transformation matrix according to the image positions of each face key point in the next video frame and the processing starting point frame in the corresponding video frame;
the second type style conversion frame generating unit is used for generating a second type style conversion frame of the next video frame according to the face key point conversion matrix and the first type style conversion frame or the second type style conversion frame matched with the processing starting point frame;
and the cyclic operation unit is used for taking the next video frame as a processing starting point frame, returning and executing the operation of acquiring the next video frame of the processing starting point frame in the video frame set until the processing quantity of the next video frame reaches the set quantity.
Optionally, the method further includes: the model training module is used for acquiring a training sample set before inputting a current processing frame into a face style conversion model, wherein the training sample set comprises a plurality of sample image pairs, and each sample image pair comprises an original image and a transformed image;
setting a machine learning model for training by using each sample image in the training sample set to obtain the face style conversion model;
the original image comprises a real face, and the transformed image comprises a style face matched with the real face.
Optionally, the machine learning model includes: a generative adversarial network.
Optionally, the obtaining module 310 includes:
the recording acquisition unit is used for responding to a face style conversion request in the video recording process and adding the video frames acquired in real time into a video frame set;
wherein, the circulation module 340 includes:
and the recording and playing unit is used for recording and playing the first type of style conversion frame and the second type of style conversion frame in real time and generating a recorded video.
Optionally, the recording and playing unit is specifically configured to:
sequentially storing the first-class style conversion frame and the second-class style conversion frame which are generated in real time in a set buffer queue;
and when the preset hard delay condition is met in the cache queue, sequentially acquiring the first type of style conversion frames or the second type of style conversion frames from the cache queue to record and play in real time.
Optionally, the obtaining module 310 includes:
the live broadcast acquisition unit is used for responding to a face style conversion request in the live broadcast process of the video and adding the video frames acquired in real time into a video frame set;
wherein, the circulation module 340 includes:
and the live broadcast playing unit is used for generating a live broadcast video stream according to the first type style conversion frame and the second type style conversion frame and sending the live broadcast video stream to a live broadcast server so as to carry out live video broadcast.
Optionally, the live broadcast unit is specifically configured to:
sequentially storing the first-class style conversion frame and the second-class style conversion frame which are generated in real time in a set buffer queue;
and when the preset hard delay condition is met in the cache queue, sequentially acquiring a first type of style conversion frame or a second type of style conversion frame from the cache queue, generating a live broadcast video stream, and sending the live broadcast video stream to a live broadcast server.
The real-time video processing device provided by the embodiment of the application can execute the real-time video processing method provided by any embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method.
Fourth embodiment
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 4, the electronic apparatus includes: one or more processors 401, memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). One processor 401 is illustrated in fig. 4.
Memory 402 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for processing real-time video provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the processing method of real-time video provided by the present application.
The memory 402, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the processing method of real-time video in the embodiment of the present application (for example, the obtaining module 310, the first transforming module 320, the second transforming module 330, and the circulating module 340 shown in fig. 3). The processor 401 executes various functional applications of the server and data processing, i.e., implements the processing method of the real-time video in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 402.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the processing electronics of the real-time video, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 402 may optionally include memory located remotely from processor 401, which may be connected to real-time video processing electronics over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method for processing a real-time video may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing electronics of the real-time video, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment of the application, the video frames acquired in real time are added into the video frame set, and the current processing frame including the real face is acquired from the video frame set; inputting the current processing frame into a face style conversion model, and acquiring a first type of style conversion frame output by the face style conversion model, wherein the style conversion frame comprises a style face; generating a set number of second-class style conversion frames by taking a current processing frame and a first-class style conversion frame as starting points according to the position relation between the key points of each face in a set number of subsequent video frames and a previous video frame in a video frame set; after a new current processing frame is acquired from the video frame set, the operation of inputting the current processing frame into the face style conversion model is returned to be executed, and the first type style conversion frame and the second type style conversion frame are used as real-time video processing results, so that a style face matched with the real face is generated in real time based on the real face in the video, and the problem that the style face matched with the real face in the video cannot be generated in real time in the prior art is solved.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, and no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described specific embodiments do not limit the protection scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its protection scope.

Claims (18)

1. A method of processing real-time video, comprising:
adding a video frame acquired in real time into a video frame set, and acquiring a current processing frame from the video frame set, wherein the video frame comprises a real face;
inputting the current processing frame into a face style conversion model, and acquiring a first-type style conversion frame output by the face style conversion model, wherein the size and facial features of the style face included in the style conversion frame are consistent with those of the real face in the current processing frame;
taking the current processing frame and the first-type style conversion frame as starting points, generating a set number of second-type style conversion frames according to the positional relationship between the face key points in each of a set number of subsequent video frames in the video frame set and those in its previous video frame;
and after acquiring a new current processing frame from the video frame set, returning to execute the operation of inputting the current processing frame into the face style conversion model, and taking the first-type style conversion frames and the second-type style conversion frames as the real-time video processing result.
2. The method of claim 1, wherein generating a set number of second-type style conversion frames, taking the current processing frame and the first-type style conversion frame as starting points, according to the positional relationship between the face key points in a set number of subsequent video frames in the video frame set and those in the previous video frame, comprises:
taking the current processing frame as a processing starting point frame, and acquiring a subsequent video frame of the processing starting point frame in the video frame set;
generating a face key point transformation matrix according to the image positions of each face key point in the subsequent video frame and in the processing starting point frame;
generating a second-type style conversion frame for the subsequent video frame according to the face key point transformation matrix and the first-type or second-type style conversion frame matched with the processing starting point frame;
and taking the subsequent video frame as the processing starting point frame, returning to execute the operation of acquiring a subsequent video frame of the processing starting point frame in the video frame set, until the number of processed subsequent video frames reaches the set number.
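One plausible realization of this propagation step, sketched with OpenCV under the assumption that the face key point transformation matrix is an affine (similarity) transform estimated from key point motion. This is an illustration, not the patent's actual method; propagate_style_frame is a hypothetical name matching the propagate callable in the earlier sketch.

    import cv2
    import numpy as np

    def propagate_style_frame(prev_kps, next_kps, prev_style_frame):
        # prev_kps / next_kps: (K, 2) arrays of face key point image
        # positions in the processing starting point frame and in the
        # subsequent video frame, respectively.
        matrix, _ = cv2.estimateAffinePartial2D(
            np.asarray(prev_kps, dtype=np.float32),
            np.asarray(next_kps, dtype=np.float32))
        # Warp the style conversion frame matched with the starting point
        # frame by the estimated face key point transformation matrix to
        # obtain the second-type style conversion frame.
        h, w = prev_style_frame.shape[:2]
        return cv2.warpAffine(prev_style_frame, matrix, (w, h))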
3. The method of claim 1, further comprising, prior to inputting the current processed frame into the face style conversion model:
acquiring a training sample set, wherein the training sample set comprises a plurality of sample image pairs, and each sample image pair comprises an original image and a transformed image;
training a set machine learning model by using each sample image pair in the training sample set, to obtain the face style conversion model;
the original image comprises a real face, and the transformed image comprises a style face matched with the real face.
4. The method of claim 3, wherein the machine learning model comprises: a generative adversarial network.
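For illustration only, a compact sketch of training such a model as a generative adversarial network on (original image, transformed image) pairs, in the spirit of paired image-to-image translation. The tiny PyTorch networks and loss weights below are placeholder assumptions, not the architecture claimed here.

    import torch
    import torch.nn as nn

    # Placeholder generator and conditional discriminator (assumptions).
    generator = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
    discriminator = nn.Sequential(
        nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, 4, stride=2, padding=1))

    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    adv_loss, l1_loss = nn.BCEWithLogitsLoss(), nn.L1Loss()

    def train_step(original, transformed):
        # original: batch of real-face images; transformed: the matching
        # style-face images from the sample image pairs.
        fake = generator(original)

        # Discriminator: tell real pairs apart from generated pairs.
        d_real = discriminator(torch.cat([original, transformed], dim=1))
        d_fake = discriminator(torch.cat([original, fake.detach()], dim=1))
        d_loss = (adv_loss(d_real, torch.ones_like(d_real)) +
                  adv_loss(d_fake, torch.zeros_like(d_fake)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Generator: fool the discriminator while staying close to the
        # target style face (L1 weight of 100 is an assumed value).
        d_fake = discriminator(torch.cat([original, fake], dim=1))
        g_loss = (adv_loss(d_fake, torch.ones_like(d_fake)) +
                  100 * l1_loss(fake, transformed))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()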
5. The method of any one of claims 1-4, wherein adding the video frame acquired in real time into the video frame set comprises:
in the video recording process, responding to a face style conversion request, and adding a video frame acquired in real time into a video frame set;
wherein taking the first-type style conversion frame and the second-type style conversion frame as the real-time video processing result comprises:
and recording and playing the first-type style conversion frames and the second-type style conversion frames in real time, and generating a recorded video.
6. The method of claim 5, wherein recording and playing the first-type style conversion frames and the second-type style conversion frames in real time comprises:
sequentially storing the first-type style conversion frames and the second-type style conversion frames generated in real time into a set buffer queue;
and when a preset hard delay condition is met in the buffer queue, sequentially acquiring the first-type or second-type style conversion frames from the buffer queue for real-time recording and playing.
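The claims do not pin down the "preset hard delay condition"; a common reading is that playback begins only after the queue holds a minimum number of frames, so the consumer never starves. A small sketch under that assumption (the class name and threshold are illustrative):

    import collections
    import threading

    class StyleFrameQueue:
        # Sketch of the set buffer queue: frames are buffered until an
        # assumed hard-delay threshold is met, then drained in order.
        def __init__(self, hard_delay_frames=30):
            self.frames = collections.deque()
            self.hard_delay_frames = hard_delay_frames
            self.ready = threading.Event()

        def put(self, frame):
            # Producer side: first-type and second-type style conversion
            # frames are stored sequentially as they are generated.
            self.frames.append(frame)
            if len(self.frames) >= self.hard_delay_frames:
                self.ready.set()   # hard delay condition met

        def get(self):
            # Consumer side: block until the delay condition has been met
            # once, then hand out frames in generation order (None once
            # the queue is drained).
            self.ready.wait()
            return self.frames.popleft() if self.frames else None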
7. The method of any one of claims 1-4, wherein adding the video frame acquired in real time into the video frame set comprises:
in the video live broadcast process, responding to a face style conversion request, and adding a video frame acquired in real time into a video frame set;
wherein taking the first-type style conversion frame and the second-type style conversion frame as the real-time video processing result comprises:
and generating a live video stream according to the first-type style conversion frames and the second-type style conversion frames, and sending the live video stream to a live broadcast server for live video broadcast.
8. The method of claim 7, wherein generating a live video stream from the first-type style conversion frames and the second-type style conversion frames and sending the live video stream to a live broadcast server comprises:
sequentially storing the first-type style conversion frames and the second-type style conversion frames generated in real time into a set buffer queue;
and when a preset hard delay condition is met in the buffer queue, sequentially acquiring the first-type or second-type style conversion frames from the buffer queue, generating a live video stream, and sending the live video stream to the live broadcast server.
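The live path can reuse the same buffering idea. The sketch below assumes the StyleFrameQueue above and a hypothetical send_to_live_server() callable standing in for encoding and pushing the live video stream (in practice typically H.264 over a protocol such as RTMP):

    def stream_live(queue, send_to_live_server):
        # Drain style conversion frames in generation order once the
        # hard delay condition has been met, and push them to the live
        # broadcast server as the live video stream.
        while True:
            frame = queue.get()
            if frame is None:          # queue drained; stream ended
                break
            send_to_live_server(frame)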
9. A device for processing real-time video, comprising:
the acquisition module is used for adding the video frames acquired in real time into the video frame set and acquiring the current processing frame from the video frame set, wherein the video frames comprise real faces;
the first transformation module is used for inputting the current processing frame into the face style conversion model and acquiring a first-type style conversion frame output by the face style conversion model, wherein the size and facial features of the style face included in the style conversion frame are consistent with those of the real face in the current processing frame;
the second transformation module is used for generating a set number of second-type style conversion frames, taking the current processing frame and the first-type style conversion frame as starting points, according to the positional relationship between the face key points in each of a set number of subsequent video frames in the video frame set and those in its previous video frame;
and the circulation module is used for returning to execute the operation of inputting the current processing frame into the face style conversion model after acquiring a new current processing frame from the video frame set, and taking the first-type style conversion frames and the second-type style conversion frames as the real-time video processing result.
10. The apparatus of claim 9, wherein the second transformation module comprises:
a subsequent video frame acquisition unit, configured to take the current processing frame as a processing start frame, and acquire a subsequent video frame of the processing start frame in the video frame set;
the transformation matrix generating unit is used for generating a face key point transformation matrix according to the image positions of each face key point in the subsequent video frame and in the processing starting point frame;
the second-type style conversion frame generating unit is used for generating a second-type style conversion frame for the subsequent video frame according to the face key point transformation matrix and the first-type or second-type style conversion frame matched with the processing starting point frame;
and the cyclic operation unit is used for taking the subsequent video frame as the processing starting point frame, and returning to execute the operation of acquiring a subsequent video frame of the processing starting point frame in the video frame set, until the number of processed subsequent video frames reaches the set number.
11. The apparatus of claim 9, further comprising:
the model training module is used for acquiring a training sample set before a current processing frame is input into the face style conversion model, wherein the training sample set comprises a plurality of sample image pairs, and each sample image pair comprises an original image and a transformed image;
and for training a set machine learning model by using each sample image pair in the training sample set, to obtain the face style conversion model;
the original image comprises a real face, and the transformed image comprises a style face matched with the real face.
12. The apparatus of claim 11, wherein the machine learning model comprises: a generative adversarial network.
13. The apparatus of any one of claims 9-12, wherein the acquisition module comprises:
the recording acquisition unit is used for responding to a face style conversion request in the video recording process and adding the video frames acquired in real time into a video frame set;
wherein the circulation module comprises:
and the recording and playing unit is used for recording and playing the first-type style conversion frames and the second-type style conversion frames in real time, and generating a recorded video.
14. The apparatus according to claim 13, wherein the recording and playing unit is specifically configured to:
sequentially storing the first-type style conversion frames and the second-type style conversion frames generated in real time into a set buffer queue;
and when a preset hard delay condition is met in the buffer queue, sequentially acquiring the first-type or second-type style conversion frames from the buffer queue for real-time recording and playing.
15. The apparatus of any one of claims 9-12, wherein the acquisition module comprises:
the live broadcast acquisition unit is used for responding to a face style conversion request in the live broadcast process of the video and adding the video frames acquired in real time into a video frame set;
wherein the circulation module comprises:
and the live broadcast playing unit is used for generating a live video stream according to the first-type style conversion frames and the second-type style conversion frames, and sending the live video stream to a live broadcast server for live video broadcast.
16. The apparatus of claim 15, wherein the live playback unit is specifically configured to:
sequentially storing the first-type style conversion frames and the second-type style conversion frames generated in real time into a set buffer queue;
and when a preset hard delay condition is met in the buffer queue, sequentially acquiring the first-type or second-type style conversion frames from the buffer queue, generating a live video stream, and sending the live video stream to the live broadcast server.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202010537321.0A 2020-06-12 2020-06-12 Real-time video processing method, device and equipment and storage medium Active CN111669647B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537321.0A CN111669647B (en) 2020-06-12 2020-06-12 Real-time video processing method, device and equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010537321.0A CN111669647B (en) 2020-06-12 2020-06-12 Real-time video processing method, device and equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111669647A CN111669647A (en) 2020-09-15
CN111669647B true CN111669647B (en) 2022-11-25

Family

ID=72387518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010537321.0A Active CN111669647B (en) 2020-06-12 2020-06-12 Real-time video processing method, device and equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111669647B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082300B (en) * 2022-07-22 2022-12-30 中国科学技术大学 Training method of image generation model, image generation method and device
CN116112761B (en) * 2023-04-12 2023-06-27 海马云(天津)信息技术有限公司 Method and device for generating virtual image video, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106920212A (en) * 2015-12-24 2017-07-04 掌赢信息科技(上海)有限公司 A kind of method and electronic equipment for sending stylized video
CN108564127B (en) * 2018-04-19 2022-02-18 腾讯科技(深圳)有限公司 Image conversion method, image conversion device, computer equipment and storage medium
US10607065B2 (en) * 2018-05-03 2020-03-31 Adobe Inc. Generation of parameterized avatars
CN109147017A (en) * 2018-08-28 2019-01-04 百度在线网络技术(北京)有限公司 Dynamic image generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111669647A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN111225236B (en) Method and device for generating video cover, electronic equipment and computer-readable storage medium
CN111860167B (en) Face fusion model acquisition method, face fusion model acquisition device and storage medium
CN111277912B (en) Image processing method and device and electronic equipment
CN112584077B (en) Video frame interpolation method and device and electronic equipment
CN111586459B (en) Method and device for controlling video playing, electronic equipment and storage medium
CN111669647B (en) Real-time video processing method, device and equipment and storage medium
CN112233210A (en) Method, device, equipment and computer storage medium for generating virtual character video
CN112527115B (en) User image generation method, related device and computer program product
CN111246257B (en) Video recommendation method, device, equipment and storage medium
CN110806865A (en) Animation generation method, device, equipment and computer readable storage medium
CN111738910A (en) Image processing method and device, electronic equipment and storage medium
CN111539897A (en) Method and apparatus for generating image conversion model
US20220076476A1 (en) Method for generating user avatar, related apparatus and computer program product
CN111935502A (en) Video processing method, video processing device, electronic equipment and storage medium
CN111726682A (en) Video clip generation method, device, equipment and computer storage medium
CN114187392A (en) Virtual even image generation method and device and electronic equipment
CN111918073B (en) Live broadcast room management method and device
CN112988100A (en) Video playing method and device
CN111970560A (en) Video acquisition method and device, electronic equipment and storage medium
CN110798736B (en) Video playing method, device, equipment and medium
CN112435313A (en) Method and device for playing frame animation, electronic equipment and readable storage medium
CN111638787A (en) Method and device for displaying information
CN113542802B (en) Video transition method and device
US20220328076A1 (en) Method and apparatus of playing video, electronic device, and storage medium
CN113327309A (en) Video playing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant