CN111464828A - Virtual special effect display method, device, terminal and storage medium - Google Patents

Virtual special effect display method, device, terminal and storage medium Download PDF

Info

Publication number
CN111464828A
CN111464828A (application CN202010408745.7A)
Authority
CN
China
Prior art keywords
live video
video frame
position information
special effect
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010408745.7A
Other languages
Chinese (zh)
Inventor
陈文琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202010408745.7A priority Critical patent/CN111464828A/en
Publication of CN111464828A publication Critical patent/CN111464828A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431 Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312 Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788 Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting

Abstract

The application discloses a virtual special effect display method, apparatus, terminal and storage medium, and belongs to the technical field of live broadcast. The method comprises the following steps: after the push streaming client acquires a live video frame, it first performs face recognition on the live video frame and determines the face position information corresponding to the facial feature regions in the live video frame; the live video frame and the face position information are sent to the pull streaming client through the streaming server; after the pull streaming client acquires the live video frame and the face position information, if a virtual special effect adding instruction is received, it can perform special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and display the live video frame after the special effect rendering. Virtual special effects can thus be added to live video frames in real time, display delay of the virtual special effect is avoided, and the display efficiency of the virtual special effect is improved.

Description

Virtual special effect display method, device, terminal and storage medium
Technical Field
The embodiment of the application relates to the technical field of live broadcast, in particular to a virtual special effect display method, a virtual special effect display device, a virtual special effect display terminal and a storage medium.
Background
With the development of network live broadcast technology, interaction between the anchor and the audience through the live broadcast platform has become increasingly rich, including audiences presenting virtual special-effect gifts to the anchor; the live broadcast picture rendered with the virtual special-effect gift is then displayed correspondingly in the live broadcast interfaces of the user client and the anchor client.
In the related art, the virtual special-effect gift is displayed as follows: after receiving a presentation instruction for the virtual special-effect gift, the user client sends the presentation information of the virtual special-effect gift to the anchor client; after the anchor client performs special effect rendering on the current live video frame, the live video stream obtained after rendering is pushed to the user client, so that the user client can display the live video frame after special effect rendering according to the received live video stream.
Obviously, with this display mode of the virtual special-effect gift, there is a delay between the user client receiving the presentation instruction of the virtual special-effect gift and displaying the live video frame after special effect rendering, so the real-time performance is poor.
Disclosure of Invention
The embodiment of the application provides a virtual special effect display method, a virtual special effect display device, a terminal and a storage medium. The technical scheme is as follows:
In one aspect, an embodiment of the present application provides a virtual special effect display method, where the method is applied to a push streaming client, and the method includes:
acquiring a live video frame;
carrying out facial recognition on the live video frame, and determining facial position information corresponding to a facial feature area in the live video frame, wherein the facial position information is used for indicating the position of the facial feature area in the live video frame;
and sending the face position information and the live video frame to a pull streaming client through a streaming server, wherein the pull streaming client is used for rendering a special effect on a face in the live video frame according to the face position information when receiving a virtual special effect adding instruction, and displaying the live video frame after the special effect rendering.
On the other hand, an embodiment of the present application provides a virtual special effect display method, where the method is applied to a pull streaming client, and the method includes:
acquiring a live video frame and face position information sent by a stream pushing client through a stream server, wherein the face position information is used for indicating the position of a face feature area in the live video frame, and the face position information is obtained by carrying out face identification on the live video frame by the stream pushing client;
receiving a virtual special effect adding instruction;
and performing special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and displaying the live video frame after the special effect rendering.
On the other hand, an embodiment of the present application provides a virtual special effect display device, where the device is applied to a push streaming client, and the device includes:
the first acquisition module is used for acquiring a live video frame;
the first determining module is used for carrying out facial recognition on the live video frame and determining facial position information corresponding to a facial feature area in the live video frame, wherein the facial position information is used for indicating the position of the facial feature area in the live video frame;
and the sending module is used for sending the face position information and the live video frame to a pull streaming client through a streaming server, and the pull streaming client is used for performing special effect rendering on the face in the live video frame according to the face position information and displaying the live video frame after the special effect rendering when receiving a virtual special effect adding instruction.
On the other hand, an embodiment of the present application provides a virtual special effect display apparatus, where the apparatus is applied to a pull streaming client, and the apparatus includes:
the second acquisition module is used for acquiring a live video frame and face position information sent by a stream pushing client through a stream server, wherein the face position information is used for indicating the position of a face feature area in the live video frame, and the face position information is obtained by performing face identification on the live video frame by the stream pushing client;
the receiving module is used for receiving a virtual special effect adding instruction;
and the rendering display module is used for performing special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information and displaying the live video frame after the special effect rendering.
In another aspect, an embodiment of the present application provides a terminal, where the terminal includes a processor and a memory; the memory stores at least one instruction for execution by the processor to implement a virtual special effects display method as described in the above aspect.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, where at least one instruction is stored, and the at least one instruction is used for being executed by a processor to implement the virtual special effects display method according to the above aspect.
In another aspect, an embodiment of the present application further provides a computer program product, where at least one instruction is stored, and the at least one instruction is loaded and executed by the processor to implement the virtual special effect display method according to the above aspect.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
After the push streaming client acquires a live video frame, it first performs face recognition on the live video frame and determines the face position information corresponding to the facial feature regions in the live video frame (i.e. where the facial feature regions are located in the live video frame), and then sends the live video frame and the face position information to the pull streaming client through the streaming server. After the pull streaming client acquires the live video frame and the face position information, if a virtual special effect adding instruction is received, it can perform special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and display the live video frame after the special effect rendering. In the embodiment of the application, face recognition is performed at the push streaming client and the face recognition result (the face position information) is sent to the pull streaming client, so that when the pull streaming client receives the virtual special effect adding instruction, it can accurately render and display the live video frame according to the face position information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 shows a schematic structural diagram of a live broadcast system according to an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of a virtual special effects display method provided by an exemplary embodiment of the present application;
FIG. 3 illustrates a flow chart of a virtual special effects display method provided by another exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a method of virtual special effects display shown in an exemplary embodiment of the present application;
FIG. 5 illustrates a flow chart of a method of face position information determination shown in an exemplary embodiment of the present application;
FIG. 6 illustrates a flow chart of a virtual special effects display method, shown in another exemplary embodiment of the present application;
fig. 7 is a block diagram illustrating a structure of a virtual special effect display apparatus in a push streaming client according to an exemplary embodiment of the present application;
fig. 8 is a block diagram illustrating a structure of a virtual special effect display apparatus in a pull stream client according to an exemplary embodiment of the present application;
fig. 9 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In the related art, there are two ways of displaying virtual special effects. In the first, after the stream pulling client receives a virtual special effect adding instruction, it performs face recognition on the current live video frame to determine the face position information in the current live video frame, and renders the frame according to the virtual special effect adding area corresponding to the instruction; for example, if the virtual special effect adding area is the eyes, the current live video frame is rendered according to the recognized eye positions. In the second, after the stream pulling client receives the virtual special effect adding instruction, the virtual special effect adding information is sent to the stream pushing client, and after the stream pushing client performs special effect rendering on the current live video frame, the live video stream obtained after rendering is pushed to the stream pulling client, so that the stream pulling client can display the live video frame after special effect rendering according to the received live video stream.
Obviously, the related-art methods have drawbacks. On the one hand, the push streaming client already performs face recognition when it beautifies or otherwise processes the face in the live video frame before pushing the stream, yet the pull streaming client performs face recognition again before virtual special effect rendering; this repeats face recognition, complicates the operation, and the time consumed by face recognition results in a virtual special effect display delay. On the other hand, when the push streaming client performs face special effect rendering after receiving the virtual special effect adding instruction and then pushes the rendered result to the pull streaming client, the intermediate operations such as stream pushing and stream pulling introduce a display delay in the virtual special effect display.
Different from a virtual special effect display method in the related art, the embodiment of the application provides a virtual special effect display method. Referring to fig. 1, a schematic structural diagram of a live broadcast system according to an exemplary embodiment of the present application is shown, where the live broadcast system includes: a first terminal 101, a streaming server 102 and a second terminal 103.
A push streaming client (anchor client) used by a network anchor is installed and operated in the first terminal 101. The network anchor can register a live broadcast room in the stream pushing client, and can perform interaction such as audio, video, desktop sharing, document sharing and the like with other users watching the live broadcast through the live broadcast room. In this embodiment of the application, the first terminal 101 may perform facial recognition on the obtained live video frame, determine face position information of each facial feature area in the live video frame, encode and package the live video frame and the corresponding face position information into a live video stream, and send the live video stream to the streaming server 102.
The first terminal 101 is connected to the streaming server 102 through a wireless network or a wired network.
The streaming server 102 is a relay for exchanging information between live broadcast rooms in the live broadcast system. It receives the live video stream sent by the push streaming client and pushes it to the pull streaming client (user client) used by the user watching the live broadcast, or receives information from the pull streaming client and pushes it to the push streaming client, so as to transmit real-time interactive information between the pull streaming client and the push streaming client. The streaming server 102 may be a single server, a server cluster formed by a plurality of servers, or a cloud computing center. In this embodiment of the application, the streaming server 102 may receive the live video stream sent by the first terminal 101 (the live video stream includes live video frames and the corresponding face position information) and push it to the pull streaming client installed in the second terminal 103; optionally, the streaming server 102 may further send the virtual special effect adding information sent by the pull streaming client to the push streaming client.
The streaming server 102 is connected to the second terminal 103 through a wireless network or a wired network.
A pull streaming client (user client) used by a user watching the live broadcast is installed and runs in the second terminal 103. The user can select the live broadcast room to enter from the pull streaming client, and can like, follow, send messages and give the anchor virtual gifts in the live broadcast room. In this embodiment, the second terminal 103 may receive the live video stream sent by the streaming server 102 and obtain from it the face position information corresponding to the facial feature regions in each live video frame; after the second terminal 103 receives a virtual special effect adding instruction, it performs special effect rendering on the live video frame according to the face position information and the virtual gift special effect information, and displays the rendered live video frame.
In this embodiment, after the push streaming client acquires a live video frame, it first performs face recognition on the live video frame and determines the face position information corresponding to the facial feature regions in the live video frame (i.e. where the facial feature regions are located in the live video frame), and then sends the live video frame and the face position information to the pull streaming client through the streaming server. After the pull streaming client acquires the live video frame and the face position information, if a virtual special effect adding instruction is received, it can perform special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and display the live video frame after the special effect rendering. Because face recognition is performed at the push streaming client and the face recognition result (the face position information) is sent to the pull streaming client, when the pull streaming client receives the virtual special effect adding instruction it can accurately render and display the live video frame according to the face position information.
Referring to fig. 2, a flowchart of a virtual special effect display method provided in an exemplary embodiment of the present application is shown, and the embodiment of the present application takes application of the method to the live broadcast system shown in fig. 1 as an example.
The method comprises the following steps:
step 201, the stream pushing client acquires a live video frame.
The live video frame is acquired by a push streaming client through a camera, the content contained in the live video frame changes along with acquisition time, and the content of the live video frame is related to the type of the anchor, for example, if the type of the anchor is a singing anchor, the live video frame may contain an anchor face image.
In a possible implementation manner, the stream pushing client acquires a live broadcast image through an image sensor, and acquires a sound signal through a sound sensor to acquire a live broadcast video frame.
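By way of illustration only (the patent does not prescribe any particular capture API), a minimal sketch of how a push streaming client might acquire live video frames from a camera is shown below, using OpenCV; the device index and resolution are illustrative assumptions, and the audio capture path mentioned above is omitted.

```python
import cv2

# Open the default camera (index 0); index and resolution are illustrative assumptions.
capture = cv2.VideoCapture(0)
capture.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
capture.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

def next_live_video_frame():
    """Grab one live video frame from the camera, or None if the capture failed."""
    ok, frame = capture.read()
    return frame if ok else None
```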
Step 202, the stream pushing client performs face recognition on the live video frame to determine face position information corresponding to the face feature area in the live video frame.
In one possible implementation, the push streaming client may invoke a face recognition Software Development Kit (SDK) to perform face recognition on the live video frame and determine the face position information corresponding to the facial feature regions in the live video frame, where the face position information is used to indicate the positions of the facial feature regions in the live video frame, for example, the positions of feature regions such as the eyes, lips, nose and ears in the live video frame.
Alternatively, the face position information may be expressed in the form of two-dimensional coordinates.
Alternatively, the face position information may also be determined by setting a deep neural network model, and the embodiment of the present application does not limit the manner of face identification.
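Since the embodiment does not limit the recognition engine (an SDK or a deep neural network model may be used), the following is only a minimal sketch of producing face position information as two-dimensional coordinates, using OpenCV's bundled Haar cascades as a stand-in detector; a production push streaming client would more likely use a dedicated face-landmark SDK.

```python
import cv2

# Pre-trained Haar cascades shipped with OpenCV; used here purely as an illustrative detector.
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_position_info(frame):
    """Return face position info: feature regions as (x, y, w, h) boxes in frame coordinates."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    info = {"faces": [], "eyes": []}
    for (x, y, w, h) in faces:
        info["faces"].append([int(x), int(y), int(w), int(h)])
        roi = gray[y:y + h, x:x + w]
        for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(roi):
            # Eye boxes are converted back to full-frame coordinates.
            info["eyes"].append([int(x + ex), int(y + ey), int(ew), int(eh)])
    return info
```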
And step 203, the stream pushing client sends the face position information and the live video frame to the stream pulling client through the stream server.
In a possible implementation manner, after the push streaming client performs face recognition on a live video frame and determines face position information corresponding to the live video frame, in order to avoid repeated face recognition by the pull streaming client, the push streaming client needs to push the live video frame and the face position information together to the pull streaming client through the streaming server.
It should be noted that, because the face position information may change from one live video frame to the next, face recognition needs to be performed on each live video frame, and each live video frame is sent to the pull streaming client together with its corresponding face position information.
And step 204, the stream pulling client acquires the live video frame and the face position information sent by the stream pushing client through the stream server.
In a possible implementation manner, the push streaming client sends the live video frame and the face position information to the streaming server, the streaming server is configured to send the live video frame and the face position information to the pull streaming client, and accordingly, the pull streaming client receives the live video frame and the corresponding face position information.
In step 205, the pull streaming client receives a virtual special effect adding instruction.
The virtual special effect adding instruction indicates that a virtual special effect needs to be added to a facial feature region in the live video frame, for example, that a cat-ear special effect is to be added above the anchor's head.
In one possible implementation, the virtual special effect adding instruction is triggered when the user of the pull streaming client clicks a face-gift giving control, where a face gift refers to a virtual special-effect gift that needs to be added on the anchor's face, that is, the special effect must be added based on face recognition; for example, when the user clicks the cat ears among the face gifts, the facial contour of the anchor's face needs to be recognized, and the virtual special effect corresponding to the cat ears is added on the anchor's face according to the facial contour.
Optionally, after the pull streaming client receives the virtual special effect adding instruction, the push streaming (anchor) client needs to be notified, so that the anchor client can perform local special effect rendering according to the virtual special effect adding instruction and display the rendered live video frame in the live interface corresponding to the anchor client, thereby implementing interaction between the anchor corresponding to the anchor client and the user corresponding to the pull streaming client.
Optionally, after the local special effect rendering is performed by the stream pushing client, the live video frame after the special effect rendering does not need to be pushed to the stream pulling client, and the display of the stream pulling client is not affected.
And step 206, the stream pulling client performs special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and displays the live video frame after the special effect rendering.
Because the push streaming client synchronously sends the live video frame and the corresponding face position information to the pull streaming client, after the pull streaming client receives the virtual special effect adding instruction, the pull streaming client does not need to perform face recognition on the live video frame again, can directly perform special effect rendering on the live video frame according to the obtained face position information and the virtual special effect, and displays the live video frame after the special effect rendering, so that the power consumption of the pull streaming client can be saved (the face recognition is not needed), and the time delay of the pull streaming client in performing virtual special effect rendering is shortened.
Illustratively, if the virtual special effect adding instruction indicates that a glasses special effect is to be added to the anchor's eyes, after the pull streaming client acquires the instruction, it adds the virtual "glasses" special effect to the area indicated by the eye position information according to the corresponding special effect area (the eye area) and the acquired eye position information in the live video frame, and displays the synthesized live video frame in the live interface corresponding to the pull streaming client.
To sum up, in the embodiment of the present application, after the push streaming client acquires a live video frame, it first performs face recognition on the live video frame, determines the face position information corresponding to the facial feature regions in the live video frame (i.e. where the facial feature regions are located in the live video frame), and sends the live video frame and the face position information to the pull streaming client through the streaming server. After the pull streaming client acquires the live video frame and the face position information, if a virtual special effect adding instruction is received, special effect rendering may be performed on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and the live video frame after the special effect rendering is displayed. Because face recognition is performed at the push streaming client and the face recognition result (the face position information) is sent to the pull streaming client, when the pull streaming client receives the virtual special effect adding instruction it can accurately render and display the live video frame according to the face position information.
In one possible implementation, live data is transmitted between the push streaming client and the pull streaming client by encoding and encapsulating live video frames into a live video stream. To transmit a live video frame and its corresponding face position information to the pull streaming client at the same time, the live video frame and the face position information therefore need to be encoded and encapsulated together before the live data is transmitted.
Referring to fig. 3, a flowchart of a virtual special effect display method according to another exemplary embodiment of the present application is shown, and the embodiment of the present application takes application of the method to the live broadcast system shown in fig. 1 as an example. The method comprises the following steps:
step 301, the stream pushing client acquires a live video frame.
Step 302, the stream pushing client performs facial recognition on the live video frame to determine facial position information corresponding to the facial feature area in the live video frame.
Step 201 and step 202 may be referred to in the implementation of step 301 and step 302, which is not described herein again in this embodiment.
And 303, the stream pushing client encodes and encapsulates the face position information and the live video frame to obtain a live video stream.
Because the data volume of the live video frames acquired by the push streaming client is large, the live video frames generally need to be encoded and encapsulated into a live video stream to facilitate their transmission, and the live video stream is then used for live data transmission. Therefore, in one possible implementation, in order for the pull streaming client to synchronously acquire the live video frame and the corresponding face position information, the face position information and the live video frame are encoded and encapsulated together into a live video stream, and the live video stream is transmitted.
Since a live video frame and the face position information are not the same type of data (the live video frame is image-type information while the face position information is text-type information), in order for the live video stream to carry the face position information, in one possible implementation the face position information is carried in a Supplemental Enhancement Information (SEI) message; SEI provides a mechanism for adding additional information to the live video stream.
Illustratively, the generation of the live video stream may include the following steps:
firstly, an SEI message is generated according to the face position information.
In one possible implementation, the face position information is added to the SEI, resulting in an SEI message carrying the face position information.
And secondly, encoding and packaging the live video frame and the SEI message to obtain the live video stream.
In a possible implementation manner, a live video frame is encoded to obtain encoded video data, an SEI message carrying facial position information is added to the encoded video data, and then the encoded video data and the SEI message are encapsulated to form a live video stream.
The live video frame may be encoded with H.264 or H.265, both of which support carrying SEI messages. Optionally, another encoding mode that supports SEI messages may also be adopted, which is not limited in this application.
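As a rough illustration of how face position information could ride in an H.264 user_data_unregistered SEI message (payload type 5), the sketch below serializes the information as JSON and wraps it in a raw SEI NAL unit. The UUID is an arbitrary assumption, and the emulation-prevention byte insertion required by the standard is omitted for brevity; a real push streaming client would normally hand this payload to its encoder or muxer rather than splice bytes by hand.

```python
import json
import uuid

# Arbitrary application-defined UUID identifying "face position info" SEI payloads (an assumption).
FACE_INFO_UUID = uuid.UUID("5f7e1a2b-3c4d-5e6f-7081-92a3b4c5d6e7").bytes

def build_face_info_sei(face_position_info: dict) -> bytes:
    """Build a raw H.264 SEI NAL unit (user_data_unregistered, payload type 5) carrying JSON.

    Note: emulation-prevention bytes are intentionally not inserted here; a real
    implementation must add them before multiplexing the NAL unit into the stream.
    """
    payload = FACE_INFO_UUID + json.dumps(face_position_info).encode("utf-8")
    sei = bytearray()
    sei.append(0x06)              # NAL header: forbidden_zero_bit=0, nal_ref_idc=0, nal_unit_type=6 (SEI)
    sei.append(0x05)              # SEI payload type 5: user_data_unregistered
    size = len(payload)
    while size >= 255:            # payload size is coded in 255-byte chunks
        sei.append(0xFF)
        size -= 255
    sei.append(size)
    sei += payload
    sei.append(0x80)              # rbsp_trailing_bits
    return b"\x00\x00\x00\x01" + bytes(sei)   # Annex-B start code + NAL unit
```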
And step 304, the stream pushing client sends the live video stream to the stream pulling client through the stream server.
Since the SEI message may be lost during transmission, decapsulation or decoding of the live video stream, in order for the face position information carried in the SEI message to reach the pull streaming client, the streaming server must preserve the original SEI message when it decapsulates, decodes, re-encodes and re-encapsulates the live video stream.
In a possible implementation manner, a stream pushing client sends an encapsulated live video stream to a stream server, and the stream server can determine whether to directly push the live video stream to a stream pulling client according to the requirement of the stream pulling client, wherein if the network condition of the stream pulling client is poor, the stream server may need to transcode the live video stream, that is, perform the processes of decapsulation, decoding, recoding and repackaging, and in the process, the stream server needs to ensure that the transcoded live video stream still carries an SEI message; and if the network condition of the stream pulling client is good, the stream server directly pushes the received live video stream to the stream pulling client without transcoding.
And 305, the pull streaming client receives the live video stream sent by the push streaming client through the streaming server.
In one possible implementation, the stream pulling client pulls the stream through the stream server, that is, receives the live video stream sent by the stream server.
Step 306, the stream pulling client decodes the live video stream to obtain the live video frame and the SEI message.
In a possible implementation manner, after receiving a live video stream sent by a streaming server, a stream pulling client decapsulates and decodes the live video stream to obtain a live video frame and face position information corresponding to the live video frame.
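Purely as an illustration of this decapsulation and decoding step, and assuming the SEI layout sketched above for the push streaming side, the following fragment extracts the JSON-encoded face position information from a user_data_unregistered SEI NAL unit; it assumes emulation-prevention bytes have already been removed by the demuxer/decoder.

```python
import json

FACE_INFO_UUID_LEN = 16  # user_data_unregistered payloads begin with a 16-byte UUID

def parse_face_info_sei(nal_unit):
    """Parse face position info from an Annex-B framed SEI NAL unit, or return None.

    Assumes the layout produced by build_face_info_sei on the push streaming side.
    """
    body = nal_unit.lstrip(b"\x00")[1:]          # drop the 00 00 00 01 start code
    if not body or body[0] != 0x06:              # not an SEI NAL unit
        return None
    i = 1
    payload_type = 0
    while body[i] == 0xFF:                       # payload type, 255-coded
        payload_type += 255
        i += 1
    payload_type += body[i]; i += 1
    payload_size = 0
    while body[i] == 0xFF:                       # payload size, 255-coded
        payload_size += 255
        i += 1
    payload_size += body[i]; i += 1
    if payload_type != 5:                        # only user_data_unregistered is handled here
        return None
    payload = body[i:i + payload_size]
    return json.loads(payload[FACE_INFO_UUID_LEN:].decode("utf-8"))
```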
In step 307, the pull streaming client receives a virtual special effect adding instruction.
And 308, the stream pulling client performs special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and displays the live video frame after the special effect rendering.
Step 307 and step 308 may refer to step 205 and step 206, and this embodiment is not described herein again.
In this embodiment, the SEI message carries the face position information and is encapsulated together with the encoded live video frame to form a live video stream, which is sent to the pull streaming client through the streaming server, so that after receiving the live video stream, the pull streaming client can obtain the live video frame and the corresponding face position information by decapsulating and decoding it; the live video frame and the face position information are thus transmitted to the pull streaming client synchronously.
In a possible implementation manner, after the live video frame and the corresponding facial position information are acquired by the streaming client, special effect rendering can be performed on the live video frame according to the facial feature region corresponding to the virtual special effect adding instruction, and the live video frame after the special effect rendering is displayed.
Referring to fig. 4, a flowchart of a method for displaying a virtual special effect according to an exemplary embodiment of the present application is shown, and the embodiment of the present application takes application of the method to the pull streaming client shown in fig. 1 as an example. The method comprises the following steps:
in step 401, a special effect area indicated by the virtual special effect adding instruction is determined.
In a possible embodiment, when the pull streaming client receives a virtual special effect adding instruction, a special effect region is first determined according to the virtual special effect adding instruction, for example, if the virtual special effect adding instruction indicates that the glasses special effect is added to the anchor eye, the special effect region indicated by the virtual special effect adding instruction is an eye region.
Step 402, determining face position information corresponding to the special effect region according to the face feature region to which the special effect region belongs.
Because the pull streaming client acquires the face position information corresponding to each face region in the live video frame in advance, after the pull streaming client acquires the special effect region corresponding to the virtual special effect adding instruction, the face position information corresponding to the special effect region can be determined according to the face feature region indicated by the special effect region.
Illustratively, if the special effect region is the eye region, that is, the left-eye region and the right-eye region of the anchor, the position information corresponding to the left-eye region and the right-eye region may be determined from the face position information.
And 403, performing special effect rendering on the facial feature region in the live video frame according to the facial position information and the virtual special effect, and displaying the live video frame after the special effect rendering.
In a possible implementation manner, after the face position information indicated by the virtual special effect adding instruction is determined, special effect rendering can be performed on a face feature region in a live video frame according to a virtual special effect corresponding to the virtual special effect adding instruction, that is, an animation corresponding to the virtual special effect is synthesized with the live video frame to obtain a live video frame after the special effect rendering, and the live video frame is displayed in a live interface of a pull streaming client.
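The synthesis step can be pictured with the minimal alpha-blending sketch below, which composites a virtual special effect image (e.g. a glasses overlay with an alpha channel) onto the feature region indicated by the face position information; the asset file name and the region layout are illustrative assumptions, not part of the claimed method.

```python
import cv2
import numpy as np

def render_effect(frame, effect_bgra, region):
    """Alpha-blend a 4-channel effect image onto `frame` at the (x, y, w, h) feature region."""
    x, y, w, h = region
    effect = cv2.resize(effect_bgra, (w, h))                  # fit the effect to the feature region
    overlay, alpha = effect[:, :, :3], effect[:, :, 3:] / 255.0
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    frame[y:y + h, x:x + w] = (alpha * overlay + (1.0 - alpha) * roi).astype(np.uint8)
    return frame

# Hypothetical usage: "glasses.png" is the gift asset, the eye region comes from the SEI data.
# glasses = cv2.imread("glasses.png", cv2.IMREAD_UNCHANGED)   # 4-channel image with alpha (assumption)
# frame = render_effect(frame, glasses, face_position_info["eyes"][0])
```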
In this embodiment, after receiving the virtual special effect adding instruction, the stream pulling client may determine, from the pre-obtained facial position information, facial position information corresponding to the special effect region in the live video frame according to the special effect region (facial feature region) indicated by the virtual special effect adding instruction, and synthesize the virtual special effect and the live video frame at the facial position information, to obtain a live video frame after rendering the special effect, and display the live video frame.
It should be noted that, after receiving the virtual special effect adding instruction, the pull streaming client sends the virtual special effect adding information to the push streaming client through the streaming server, so that the anchor client can perform local rendering according to the virtual special effect adding information; since the push streaming client performs face recognition on the current live video frame in real time and determines the corresponding face position information, it can perform local special effect rendering and display according to the virtual special effect adding information.
Because the acquisition time interval between two adjacent live video frames is small, the possibility that the corresponding face position information changes is also relatively small. Therefore, in order to reduce the power consumption of real-time face recognition on the push streaming client, in one possible implementation it may be determined, based on the similarity between two adjacent live video frames, whether face position information needs to be determined anew for the subsequent live video frame.
Referring to fig. 5, a flowchart of a method for determining face position information according to an exemplary embodiment of the present application is shown, and the embodiment of the present application takes an example in which the method is applied to the push streaming client shown in fig. 1. The method comprises the following steps:
step 501, obtaining similarity between adjacent live video frames, where the adjacent live video frames include a first live video frame and a second live video frame.
The similarity between adjacent live video frames may be determined in the same way as the similarity between two pictures, for example by the peak signal-to-noise ratio (PSNR), which compares the errors of pixels at the same positions, or by the structural similarity (SSIM) index, which measures image similarity in terms of luminance, contrast and structure. The embodiment of the present application does not limit the manner of determining the similarity.
In one possible implementation, after the push streaming client acquires two consecutive live video frames, it compares them and determines the similarity between the adjacent live video frames.
Alternatively, the similarity may be expressed in percentage, for example, the similarity between two adjacent live video frames is 98%.
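A PSNR-based similarity check consistent with the per-pixel comparison mentioned above could look like the sketch below; expressing the result in decibels rather than as a percentage, and the 40 dB threshold, are illustrative assumptions only.

```python
import numpy as np

def psnr(frame_a, frame_b):
    """Peak signal-to-noise ratio (dB) between two 8-bit frames of identical shape."""
    mse = np.mean((frame_a.astype(np.float64) - frame_b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10((255.0 ** 2) / mse)

def frames_similar(frame_a, frame_b, psnr_threshold=40.0):
    """Treat adjacent live video frames as 'similar' above an assumed PSNR threshold."""
    return psnr(frame_a, frame_b) >= psnr_threshold
```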
Step 502, in response to the similarity being higher than the similarity threshold, performing face recognition on the first live video frame to determine first face position information corresponding to the face feature region in the first live video frame.
The similarity threshold may be preset by a developer, for example, the similarity threshold is 95%.
In a possible implementation manner, if the similarity between adjacent live video frames is greater than a similarity threshold, for example, the similarity between adjacent live video frames is 98% and greater than the similarity threshold 95%, which indicates that the difference between two adjacent live video frames is small and negligible, at this time, only face recognition needs to be performed on the first live video frame, first face position information corresponding to the first live video frame is determined, and the first face position information is also determined as second face position information corresponding to the second live video frame.
Step 503, determining the first face position information as second face position information corresponding to the second live video frame.
Because the similarity between two adjacent live videos is higher than the similarity threshold, it indicates that the difference between the first live video frame and the second live video frame is small, and the difference between the corresponding face position information should also be small, and at this time, in order to avoid performing repeated face recognition on the second live video frame and determining face position information, the power consumption of the stream pushing client is increased, and the first face position information may be directly determined as the second face position information corresponding to the second live video frame.
Optionally, if the stream pushing client determines that the similarity between adjacent live video frames is lower than the similarity threshold, facial recognition needs to be performed on the second live video frame again to determine second facial position information corresponding to the second live video frame.
In this embodiment, the similarity between adjacent live video frames is obtained and compared with a preset similarity threshold; when the similarity is higher than the similarity threshold, face recognition only needs to be performed on the first live video frame to determine the first face position information, repeated face recognition on the second live video frame is not needed, and the first face position information is determined as the second face position information, so that the power consumption of the push streaming client can be reduced.
In another possible implementation, if the similarity between adjacent live video frames is lower than the similarity threshold, it indicates that the difference between adjacent live video frames is large, but even if the difference is large, there may be an area where the face position information is similar, for example, the position information of the eye area in the adjacent live video frames is similar, and at this time, when carrying the face position information, only the face position information with the difference may be carried, so as to reduce the amount of data transmitted.
Schematically, fig. 6 shows a flowchart of a virtual special effect display method according to another exemplary embodiment of the present application; the embodiment takes the application of the method to the live broadcast system shown in fig. 1 as an example. The method includes:
step 601, the stream pushing client acquires a live video frame.
Step 602, the stream pushing client obtains a similarity between adjacent live video frames, where the adjacent live video frames include a first live video frame and a second live video frame.
The implementation manners of step 601 and step 602 may refer to the above embodiments, which are not described herein.
Step 603, in response to the similarity being lower than the similarity threshold, the stream pushing client performs face recognition on the first live video frame to determine first face position information.
In one possible implementation, when the push streaming client determines that the similarity between adjacent live video frames is lower than the similarity threshold, for example a similarity of 60% that is lower than the threshold of 95%, this indicates that there is a noticeable difference between the adjacent live video frames; in this case, face recognition needs to be performed on each live video frame to determine the face position information corresponding to each live video frame, that is, the first face position information corresponding to the first live video frame is determined.
And step 604, the stream pushing client performs facial recognition on the second live video frame to determine second facial position information corresponding to the facial feature area in the second live video frame.
Correspondingly, after the position information of the first face corresponding to the first live video frame is determined, face recognition needs to be continuously performed on the second live video frame, so that the position information of the second face corresponding to the second live video frame is determined.
Step 605, the stream pushing client generates a first SEI message according to the first face position information, and generates a second SEI message according to the face position information that differs between the first face position information and the second face position information.
Since the live video stream may include a plurality of live video frames, and if each live video frame carries the same amount of face position information, it is obvious that the data amount of the live video stream will be increased, so in a possible implementation manner, a difference storage manner may be adopted, that is, the first face position information is stored in the first SEI message, and the difference position information between the second face position information and the first face position information is stored in the second SEI message, so that the data transmission amount may be appropriately reduced.
Illustratively, if the stream pushing client identifies that only the mouth position information is different between the first live video frame and the second live video frame, and other face position information is the same, the first SEI message corresponding to the first live video frame carries face position information such as eyes, ears, mouths, and noses, and correspondingly, the second SEI message corresponding to the second live video frame only carries the mouth position information, and the stream pushing client does not need to carry information the same as the first face position information, so that the data transmission amount can be reduced.
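Assuming the per-feature dictionary layout used in the earlier sketches, the "difference" face position information carried in the second SEI message could be computed as follows; this is only a sketch of the difference-storage idea described above.

```python
def diff_face_position_info(first_info: dict, second_info: dict) -> dict:
    """Keep only the feature regions whose positions changed between adjacent frames."""
    return {key: value for key, value in second_info.items() if first_info.get(key) != value}

# Example: if only the mouth moved, the second SEI message carries just the mouth entry.
first = {"eyes": [[100, 120, 40, 20]], "mouth": [[110, 200, 60, 30]]}
second = {"eyes": [[100, 120, 40, 20]], "mouth": [[112, 205, 60, 30]]}
assert diff_face_position_info(first, second) == {"mouth": [[112, 205, 60, 30]]}
```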
And 606, the stream pushing client sends the first SEI message and the first live video frame to the stream pulling client through the stream server, and sends the second SEI message and the second live video frame to the stream pulling client.
In one possible implementation, the push streaming client encodes the first live video frame and encapsulates it with the first SEI message, encodes the second live video frame and encapsulates it with the second SEI message to form a live video stream, and sends the live video stream to the pull streaming client through the streaming server.
Step 607, the stream pulling client obtains the first live video frame and the first SEI message sent by the stream pushing client through the stream server, and obtains the second live video frame and the second SEI message sent by the stream pushing client.
In a possible implementation manner, the stream pulling client performs stream pulling from the stream server, receives a live video stream sent by the stream server, and decapsulates and decodes the live video stream to obtain a first live video frame and a corresponding first SEI message, and a second live video frame and a corresponding second SEI message.
In step 608, the stream pulling client determines first face position information corresponding to the first live video frame according to the first SEI message.
And because the first SEI message carries the first face position information, the stream pulling client decapsulates the live video stream to obtain the first face position information corresponding to the first live video frame.
Illustratively, the face position information carried in the first SEI message may include: eye position information, ear position information, mouth position information, nose position information, and the like.
And step 609, the stream pulling client determines second face position information corresponding to the second live video frame according to the second SEI message and the first SEI message.
Since the second SEI message is generated from the distinctive location information between the second face location information and the first face location information, in one possible implementation, the second face location information corresponding to the second live video frame may be determined according to the first SEI message and the second SEI message.
Illustratively, if only mouth position information is carried in the second SEI message, this indicates that only the mouth position differs between the first live video frame and the second live video frame while the other face position information is the same; in that case, the eye position information, ear position information, nose position information and so on in the first face position information (i.e. the face position information other than the mouth position information), together with the mouth position information carried in the second SEI message, are determined as the second face position information corresponding to the second live video frame. That is, the face position information corresponding to the second live video frame is determined according to the first SEI message and the second SEI message.
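The pull-side counterpart of the difference encoding is then a simple merge of the first frame's face position information with the difference carried in the second SEI message, again under the assumed dictionary layout; a minimal sketch:

```python
def merge_face_position_info(first_info: dict, diff_info: dict) -> dict:
    """Reconstruct the second frame's face position info from the first frame's info plus the diff."""
    merged = dict(first_info)   # start from the first frame's positions
    merged.update(diff_info)    # overwrite only the features that actually changed
    return merged

# Continuing the earlier example: first-frame info + mouth-only diff yields the full second-frame info.
second_info = merge_face_position_info(
    {"eyes": [[100, 120, 40, 20]], "mouth": [[110, 200, 60, 30]]},
    {"mouth": [[112, 205, 60, 30]]},
)
```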
In step 610, the pull streaming client receives a virtual special effect adding instruction.
And 611, the pull streaming client performs special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and displays the live video frame after the special effect rendering.
The implementation manner of step 610 and step 611 may refer to the above embodiments, which are not described herein.
In this embodiment, the similarity between adjacent live video frames is determined, and when the similarity is lower than the similarity threshold and an SEI message is generated for each live video frame, the SEI message for the second live video frame only needs to carry the face position information that differs between the two frames, so that the data volume of the video stream transmission can be reduced.
In another possible implementation, if the similarity between adjacent live video frames is higher than the similarity threshold, the face position information corresponding to the first live video frame and the second live video frame is the same; in this case the second live video frame may carry no corresponding SEI message, that is, the push streaming client pushes the first live video frame, the first SEI message, and the second live video frame to the pull streaming client through the streaming server, and accordingly the pull streaming client can obtain the first face position information from the first SEI message and determine it as the second face position information corresponding to the second live video frame.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 7, a block diagram of a virtual special effect display apparatus in a stream pushing client according to an exemplary embodiment of the present application is shown. The apparatus can be implemented by software, hardware, or a combination of the two as all or part of the stream pushing client in fig. 1, and the apparatus includes: a first obtaining module 701, a first determining module 702, and a sending module 703.
A first obtaining module 701, configured to obtain a live video frame;
a first determining module 702, configured to perform facial recognition on the live video frame, and determine facial position information corresponding to a facial feature area in the live video frame, where the facial position information is used to indicate a position of the facial feature area in the live video frame;
A sending module 703, configured to send the face position information and the live video frame to a stream pulling client through a streaming server, where the stream pulling client is configured to perform special effect rendering on the face in the live video frame according to the face position information when receiving a virtual special effect adding instruction, and display the live video frame after the special effect rendering.
Optionally, the sending module 703 includes:
the first processing unit is used for encoding and packaging the face position information and the live video frame to obtain a live video stream;
and the first sending unit is used for sending the live video stream to the pull stream client through the stream server.
Optionally, the first processing unit is further configured to:
generating an SEI message according to the face position information;
and encoding and packaging the live video frame and the SEI message to obtain the live video stream.
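As one possible way to realize the "generate an SEI message according to the face position information" step above, the sketch below packs the positions into an H.264 user-data-unregistered SEI NAL unit (NAL type 6, payload type 5); the UUID and the JSON payload layout are assumptions for illustration, and start codes and emulation-prevention bytes are omitted for brevity.

```python
import json
import uuid

# Assumed layout: a "user data unregistered" SEI (payload type 5) whose body is a
# 16-byte UUID followed by JSON-encoded face positions. This application only states
# that an SEI message is generated; the concrete encoding below is illustrative.

FACE_SEI_UUID = uuid.UUID("00000000-0000-0000-0000-000000000001").bytes  # placeholder UUID

def build_face_sei_nal(face_positions: dict) -> bytes:
    body = FACE_SEI_UUID + json.dumps(face_positions).encode("utf-8")
    out = bytearray([0x06, 0x05])       # NAL type 6 (SEI), SEI payload type 5
    size = len(body)
    while size >= 255:                  # payload size uses 0xFF continuation bytes
        out.append(0xFF)
        size -= 255
    out.append(size)
    out += body
    out.append(0x80)                    # rbsp_trailing_bits
    return bytes(out)
```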
Optionally, the first determining module 702 includes:
A first obtaining unit, configured to obtain a similarity between adjacent live video frames, where the adjacent live video frames include a first live video frame and a second live video frame;
a first determining unit, configured to perform face recognition on the first live video frame in response to that the similarity is higher than a similarity threshold, and determine first face position information corresponding to a face feature region in the first live video frame;
a second determining unit, configured to determine the first face position information as second face position information corresponding to the second live video frame.
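The embodiments above do not fix how the similarity between adjacent live video frames is computed; the following is one simple illustrative measure (one minus the normalized mean absolute difference over downsampled grayscale frames), offered only as an example.

```python
import numpy as np

# One possible similarity measure between adjacent frames; this application does not
# prescribe a specific metric, so this is an illustrative assumption.

def frame_similarity(frame_a: np.ndarray, frame_b: np.ndarray, step: int = 8) -> float:
    """frame_a / frame_b: HxW or HxWx3 uint8 arrays of identical shape."""
    a = frame_a[::step, ::step].astype(np.float32)   # downsample for speed
    b = frame_b[::step, ::step].astype(np.float32)
    if a.ndim == 3:                                   # collapse colour channels to grayscale
        a, b = a.mean(axis=2), b.mean(axis=2)
    mad = np.abs(a - b).mean() / 255.0                # 0.0 = identical, 1.0 = maximally different
    return 1.0 - mad
```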
Optionally, the apparatus further comprises:
a second determining module, configured to perform face recognition on the first live video frame in response to the similarity being lower than the similarity threshold, and determine the first face position information;
a third determining module, configured to perform facial recognition on the second live video frame, and determine second facial position information corresponding to a facial feature region in the second live video frame;
the sending module 703 further includes:
a generating unit, configured to generate a first SEI message according to the first face position information, and generate a second SEI message according to the differential face position information between the first face position information and the second face position information;
and the second sending unit is used for sending the first SEI message and the first live video frame to the stream pulling client through a stream server, and sending the second SEI message and the second live video frame to the stream pulling client.
In the embodiment of the present application, after the stream pushing client acquires a live video frame, it first performs facial recognition on the live video frame and determines the face position information corresponding to the facial feature region in the live video frame (that is, the position of the facial feature region in the live video frame), and then sends the live video frame and the face position information to the stream pulling client through the streaming server. After the stream pulling client obtains the live video frame and the face position information, if a virtual special effect adding instruction is received, it can perform special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and display the live video frame after the special effect rendering. Because facial recognition is performed at the stream pushing client and the recognition result (the face position information) is sent to the stream pulling client, the stream pulling client can accurately render and display the live video frame according to the face position information when it receives the virtual special effect adding instruction.
Referring to fig. 8, a block diagram of a virtual special effects display apparatus in a pull stream client according to an exemplary embodiment of the present application is shown. The apparatus may be implemented by software, hardware or a combination of the two as all or part of the pull client in fig. 1, and includes: a second acquisition module 801, a receiving module 802 and a rendering display module 803.
A second obtaining module 801, configured to obtain, through a stream server, a live video frame and face position information sent by a stream pushing client, where the face position information is used to indicate a position of a facial feature area in the live video frame, and the face position information is obtained by the stream pushing client performing facial recognition on the live video frame;
a receiving module 802, configured to receive a virtual special effect adding instruction;
and a rendering and displaying module 803, configured to perform special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and display the live video frame after the special effect rendering.
Optionally, the rendering and displaying module 803 includes:
a third determination unit configured to determine a special effect region indicated by the virtual special effect addition instruction;
a fourth determining unit, configured to determine, according to the facial feature region to which the special effect region belongs, the facial position information corresponding to the special effect region;
and the rendering display unit is used for performing special effect rendering on the facial feature region in the live video frame according to the facial position information and the virtual special effect and displaying the live video frame after the special effect rendering.
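A minimal sketch of the three rendering units above is given next: look up the facial feature region indicated by the virtual special effect, fetch the corresponding face position information, and draw the effect there. The effect catalogue and the draw_overlay callback are placeholders rather than parts of this application.

```python
# Sketch of the region-lookup flow described above; the effect catalogue and
# draw_overlay() are illustrative assumptions.

EFFECT_REGIONS = {            # assumed mapping: effect id -> facial feature region
    "cat_ears": "ears",
    "sunglasses": "eyes",
    "red_lips": "mouth",
}

def render_effect(frame, face_positions: dict, effect_id: str, draw_overlay):
    region = EFFECT_REGIONS[effect_id]          # special effect region indicated by the instruction
    anchor_points = face_positions[region]      # face position info for that region
    return draw_overlay(frame, effect_id, anchor_points)  # render the effect at that region
```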
Optionally, the second obtaining module 801 includes:
a receiving unit, configured to receive, through the streaming server, a live video stream sent by the stream pushing client;
and the second processing unit is used for decoding the live video stream to obtain the live video frame and an SEI message, wherein the SEI message is generated by the face position information.
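Assuming the SEI layout sketched earlier for the push side, the pull side could recover the face position information as follows; a real decoder would additionally strip emulation-prevention bytes before parsing.

```python
import json

# Counterpart of the SEI construction sketch above; the UUID + JSON layout is an assumption.

def parse_face_sei_nal(nal: bytes) -> dict:
    assert nal[0] & 0x1F == 6 and nal[1] == 0x05   # SEI NAL, user-data-unregistered payload
    i, size = 2, 0
    while nal[i] == 0xFF:                           # accumulate 0xFF continuation bytes
        size += 255
        i += 1
    size += nal[i]
    i += 1
    body = nal[i:i + size]
    return json.loads(body[16:].decode("utf-8"))    # skip the 16-byte UUID, parse the JSON
```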
Optionally, the second obtaining module 801 further includes:
a second obtaining unit, configured to obtain, through the streaming server, a first live video frame and a first SEI message sent by the stream pushing client, and obtain a second live video frame and a second SEI message sent by the stream pushing client, where the first live video frame and the second live video frame are adjacent live video frames, and a similarity between the first live video frame and the second live video frame is lower than a similarity threshold;
a fifth determining unit, configured to determine, according to the first SEI message, first face position information corresponding to the first live video frame;
a sixth determining unit, configured to determine, according to the second SEI message and the first SEI message, second face position information corresponding to the second live video frame, where the second SEI message is generated from position information of a difference between the second face position information and the first face position information.
In the embodiment of the present application, after the stream pushing client acquires a live video frame, it first performs facial recognition on the live video frame and determines the face position information corresponding to the facial feature region in the live video frame (that is, the position of the facial feature region in the live video frame), and then sends the live video frame and the face position information to the stream pulling client through the streaming server. After the stream pulling client obtains the live video frame and the face position information, if a virtual special effect adding instruction is received, it can perform special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and display the live video frame after the special effect rendering. Because facial recognition is performed at the stream pushing client and the recognition result (the face position information) is sent to the stream pulling client, the stream pulling client can accurately render and display the live video frame according to the face position information when it receives the virtual special effect adding instruction.
It should be noted that when the apparatus provided in the above embodiments implements its functions, the division into the above functional modules is merely used as an example for description; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided in the above embodiments belong to the same concept; for their specific implementation, refer to the method embodiments, and details are not repeated here.
Referring to fig. 9, a block diagram of a terminal 900 according to an exemplary embodiment of the present application is shown. The terminal 900 may be an electronic device such as a smart phone, a tablet computer, a portable personal computer, or the like, in which a live application is installed and operated. Terminal 900 in the present application may include one or more of the following components: a processor 902, a memory 901, and a touch display 903.
The processor 902 may include one or more processing cores. The processor 902 connects various parts of the terminal 900 through various interfaces and lines, and performs the various functions of the terminal 900 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 901 and by invoking the data stored in the memory 901. Optionally, the processor 902 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 902 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, the user interface, application programs, and so on; the GPU is responsible for rendering and drawing the content to be displayed on the touch display 903; and the modem is used to handle wireless communication. It can be understood that the modem may also not be integrated into the processor 902 and may instead be implemented by a separate communication chip.
The memory 901 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 901 includes a non-transitory computer-readable medium. The memory 901 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 901 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the above method embodiments, and so on. The operating system may be an Android system (including systems developed in depth based on the Android system), an iOS system developed by Apple Inc. (including systems developed in depth based on the iOS system), or another system. The data storage area may store data created by the terminal 900 during use (such as a phone book, audio and video data, and chat logs).
The touch display 903 is used for receiving touch operations performed by a user on or near it with a finger, a stylus, or any other suitable object, and for displaying the user interface of each application. The touch display is generally arranged on the front panel of the terminal 900. The touch display may be designed as a full screen, a curved screen, or a special-shaped screen, or as a combination of a full screen and a curved screen or of a special-shaped screen and a curved screen, which is not limited in this embodiment of the present application.
In addition, those skilled in the art will appreciate that the structure of the terminal 900 shown in the above figure does not constitute a limitation on the terminal 900; the terminal may include more or fewer components than shown, combine some components, or use a different arrangement of components. For example, the terminal 900 may further include a radio frequency circuit, a shooting component, a sensor, an audio circuit, a Wireless Fidelity (WiFi) component, a power supply, a Bluetooth component, and other components, which are not described here again.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, in which a computer program is stored, which when executed by a processor, implements the above-described virtual special effects display method.
In an exemplary embodiment, a computer program product is also provided, which, when executed by a processor, implements the above virtual special effect display method.
It should be understood that "a plurality of" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates an "or" relationship between the preceding and following associated objects. In addition, the step numbers described herein merely show one possible execution order of the steps by way of example; in some other embodiments, the steps may be performed out of the numbered order, for example, two steps with different numbers may be performed simultaneously, or two steps with different numbers may be performed in an order opposite to that shown in the figures, which is not limited in this embodiment of the present application.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A virtual special effect display method, wherein the method is applied to a stream pushing client, and the method comprises the following steps:
acquiring a live video frame;
carrying out facial recognition on the live video frame, and determining facial position information corresponding to a facial feature area in the live video frame, wherein the facial position information is used for indicating the position of the facial feature area in the live video frame;
and sending the face position information and the live video frame to a pull streaming client through a streaming server, wherein the pull streaming client is used for rendering a special effect on a face in the live video frame according to the face position information when receiving a virtual special effect adding instruction, and displaying the live video frame after the special effect rendering.
2. The method of claim 1, wherein sending the facial position information and the live video frame to a pull streaming client via a streaming server comprises:
encoding and packaging the face position information and the live video frame to obtain a live video stream;
and sending the live video stream to the pull stream client through the stream server.
3. The method of claim 2, wherein said encoding and encapsulating the face position information and the live video frame to obtain a live video stream comprises:
generating a supplemental enhancement information SEI message according to the face position information;
and encoding and packaging the live video frame and the SEI message to obtain the live video stream.
4. The method according to any one of claims 1 to 3, wherein the performing facial recognition on the live video frame to determine facial position information corresponding to a facial feature region in the live video frame includes:
acquiring similarity between adjacent live video frames, wherein the adjacent live video frames comprise a first live video frame and a second live video frame;
performing face recognition on the first live video frame in response to the similarity being higher than a similarity threshold value, and determining first face position information corresponding to a face feature region in the first live video frame;
and determining the first face position information as second face position information corresponding to the second live video frame.
5. The method of claim 4, wherein after obtaining the similarity between adjacent live video frames, the method further comprises:
performing face recognition on the first live video frame in response to the similarity being lower than the similarity threshold, and determining the first face position information;
performing facial recognition on the second live video frame to determine second facial position information corresponding to a facial feature area in the second live video frame;
the sending the face position information and the live video frame to a pull streaming client through a streaming server comprises:
generating a first SEI message according to the first face position information, and generating a second SEI message according to the different face position information between the first face position information and the second face position information;
and sending the first SEI message and the first live video frame to the stream pulling client through a stream server, and sending the second SEI message and the second live video frame to the stream pulling client.
6. A virtual special effect display method is applied to a pull flow client side, and comprises the following steps:
acquiring a live video frame and face position information sent by a stream pushing client through a stream server, wherein the face position information is used for indicating the position of a face feature area in the live video frame, and the face position information is obtained by carrying out face identification on the live video frame by the stream pushing client;
receiving a virtual special effect adding instruction;
and performing special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and displaying the live video frame after the special effect rendering.
7. The method according to claim 6, wherein performing special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information, and displaying the live video frame after special effect rendering comprises:
determining a special effect area indicated by the virtual special effect adding instruction;
determining the face position information corresponding to the special effect area according to the face feature area to which the special effect area belongs;
and according to the face position information and the virtual special effect, carrying out special effect rendering on the face feature region in the live video frame, and displaying the live video frame after the special effect rendering.
8. The method of claim 6, wherein the acquiring, through the stream server, the live video frame and the face position information sent by the stream pushing client comprises:
receiving, through the stream server, a live video stream sent by the stream pushing client;
and decoding the live video stream to obtain the live video frame and an SEI message, wherein the SEI message is generated by the face position information.
9. The method according to any one of claims 6 to 8, wherein the acquiring, through the stream server, the live video frame and the face position information sent by the stream pushing client comprises:
acquiring, through the stream server, a first live video frame and a first SEI message sent by the stream pushing client, and acquiring a second live video frame and a second SEI message sent by the stream pushing client, wherein the first live video frame and the second live video frame are adjacent live video frames, and a similarity between the first live video frame and the second live video frame is lower than a similarity threshold;
determining first face position information corresponding to the first live video frame according to the first SEI message;
and determining second face position information corresponding to the second live video frame according to the second SEI message and the first SEI message, wherein the second SEI message is generated by the different position information between the second face position information and the first face position information.
10. A virtual special effect display apparatus, wherein the apparatus is applied to a stream pushing client, the apparatus comprising:
the first acquisition module is used for acquiring a live video frame;
the first determining module is used for carrying out facial recognition on the live video frame and determining facial position information corresponding to a facial feature area in the live video frame, wherein the facial position information is used for indicating the position of the facial feature area in the live video frame;
and the sending module is used for sending the face position information and the live video frame to a pull streaming client through a streaming server, and the pull streaming client is used for performing special effect rendering on the face in the live video frame according to the face position information and displaying the live video frame after the special effect rendering when receiving a virtual special effect adding instruction.
11. A virtual special effects display apparatus, the apparatus being applied to a pull streaming client, the apparatus comprising:
the second acquisition module is used for acquiring a live video frame and face position information sent by a stream pushing client through a stream server, wherein the face position information is used for indicating the position of a face feature area in the live video frame, and the face position information is obtained by performing face identification on the live video frame by the stream pushing client;
the receiving module is used for receiving a virtual special effect adding instruction;
and the rendering display module is used for performing special effect rendering on the face in the live video frame according to the virtual special effect adding instruction and the face position information and displaying the live video frame after the special effect rendering.
12. A terminal, characterized in that the terminal comprises a processor and a memory; the memory stores at least one instruction for execution by the processor to implement a virtual special effects display method according to any one of claims 1 to 5, or to implement a virtual special effects display method according to any one of claims 6 to 9.
13. A computer-readable storage medium, wherein the storage medium stores at least one instruction for execution by a processor to implement the virtual special effects display method of any of claims 1 to 5, or to implement the virtual special effects display method of any of claims 6 to 9.
CN202010408745.7A 2020-05-14 2020-05-14 Virtual special effect display method, device, terminal and storage medium Pending CN111464828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010408745.7A CN111464828A (en) 2020-05-14 2020-05-14 Virtual special effect display method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010408745.7A CN111464828A (en) 2020-05-14 2020-05-14 Virtual special effect display method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN111464828A true CN111464828A (en) 2020-07-28

Family

ID=71678305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010408745.7A Pending CN111464828A (en) 2020-05-14 2020-05-14 Virtual special effect display method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111464828A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018128996A1 (en) * 2017-01-03 2018-07-12 Clipo, Inc. System and method for facilitating dynamic avatar based on real-time facial expression detection
US20190394443A1 (en) * 2017-03-07 2019-12-26 Bitmanagement Software GmbH Apparatus and method for representing a spatial image of an object in a virtual environment
CN109547786A (en) * 2017-09-22 2019-03-29 阿里巴巴集团控股有限公司 Video coding and the decoded method, apparatus of video
CN107995155A (en) * 2017-10-11 2018-05-04 上海聚力传媒技术有限公司 Video data encoding, decoding, methods of exhibiting, video system and storage medium
CN107911709A (en) * 2017-11-17 2018-04-13 广州酷狗计算机科技有限公司 live interface display method, device and terminal
CN110418155A (en) * 2019-08-08 2019-11-05 腾讯科技(深圳)有限公司 Living broadcast interactive method, apparatus, computer readable storage medium and computer equipment
CN110493630A (en) * 2019-09-11 2019-11-22 广州华多网络科技有限公司 The treating method and apparatus of virtual present special efficacy, live broadcast system
CN110557649A (en) * 2019-09-12 2019-12-10 广州华多网络科技有限公司 Live broadcast interaction method, live broadcast system, electronic equipment and storage medium
CN111145082A (en) * 2019-12-23 2020-05-12 五八有限公司 Face image processing method and device, electronic equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112218107A (en) * 2020-09-18 2021-01-12 广州虎牙科技有限公司 Live broadcast rendering method and device, electronic equipment and storage medium
CN112218107B (en) * 2020-09-18 2022-07-08 广州虎牙科技有限公司 Live broadcast rendering method and device, electronic equipment and storage medium
CN114501041A (en) * 2021-04-06 2022-05-13 北京字节跳动网络技术有限公司 Special effect display method, device, equipment, storage medium and product
CN114630138A (en) * 2022-03-14 2022-06-14 上海哔哩哔哩科技有限公司 Configuration information issuing method and system
CN114630138B (en) * 2022-03-14 2023-12-08 上海哔哩哔哩科技有限公司 Configuration information issuing method and system
WO2024051536A1 (en) * 2022-09-06 2024-03-14 北京字跳网络技术有限公司 Livestreaming special effect rendering method and apparatus, device, readable storage medium, and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200728