CN112770173A - Live broadcast picture processing method and device, computer equipment and storage medium

Info

Publication number
CN112770173A
Authority
CN
China
Prior art keywords
video data
video frame
image
video
data
Legal status
Pending
Application number
CN202110120986.6A
Other languages
Chinese (zh)
Inventor
刘平
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110120986.6A
Publication of CN112770173A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431 Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312 Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4316 Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations for displaying supplemental content in a region of the screen, e.g. an advertisement in a separate window
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4781 Games
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/485 End-user interface for client configuration
    • H04N21/4858 End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows

Abstract

The application discloses a live broadcast picture processing method and apparatus, a computer device, and a storage medium, and belongs to the field of computer technology. According to the method and apparatus, a function for adjusting the background image of second video data is provided in a live broadcast scene that contains two channels of video data. The adjustment mode for the background image in the second video data is determined according to the recognition result of any one of a first video frame in the first video data, a second video frame in the second video data, and voice data, so that the background image of the second video data can be adjusted flexibly. This improves the display effect of the second video data and the overall visual effect of the live broadcast picture, and prevents the second video data from degrading the display effect of the first video picture.

Description

Live broadcast picture processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a live view processing method and apparatus, a computer device, and a storage medium.
Background
With the development of Internet and multimedia technology, live webcasting has become an important form of entertainment, and live content is increasingly diverse. Currently, two channels of video data can be displayed simultaneously in a live broadcast interface. Taking game live streaming as an example, first video data obtained by recording the running picture of a game can be displayed in the live broadcast interface, and second video data obtained by shooting the anchor user can also be displayed.
In general, the second video data is superimposed on the first video data for display, so the second video data blocks part of the picture in the first video data, and the display effect of the second video data also affects the visual effect of the live broadcast picture as a whole. Therefore, how to process the live broadcast picture so as to reduce the influence of the second video data on the display effect of the first video data, and thereby improve the visual effect of the second video data and of the overall live broadcast picture, is an important research direction.
Disclosure of Invention
The embodiment of the application provides a live broadcast picture processing method and device, computer equipment and a storage medium, which can improve the display effect of video data and the overall visual effect of a live broadcast picture. The technical scheme is as follows:
in one aspect, a method for processing a live broadcast frame is provided, and the method includes:
acquiring first video data and second video data, wherein the first video data is obtained by recording a current display interface, and the second video data is obtained by real-time acquisition;
identifying at least one of a first video frame in the first video data, a second video frame in the second video data or voice data in the second video data to obtain an identification result;
adjusting the background image in the second video frame based on the adjustment mode corresponding to the identification result;
and generating a live broadcast picture according to the first video frame and the adjusted second video frame, and displaying the live broadcast picture in a live broadcast interface.
In one aspect, a live view processing apparatus is provided, the apparatus including:
the acquisition module is used for acquiring first video data and second video data, wherein the first video data is obtained by recording a current display interface, and the second video data is obtained by real-time acquisition;
the identification module is used for identifying at least one of a first video frame in the first video data, a second video frame in the second video data or voice data in the second video data to obtain an identification result;
the adjusting module is used for adjusting the background image in the second video frame based on the adjusting mode corresponding to the identification result;
and the generation module is used for generating a live broadcast picture according to the first video frame and the adjusted second video frame and displaying the live broadcast picture in a live broadcast interface.
In one possible implementation, the image capture unit is configured to:
acquiring target position information, wherein the target position information is used for indicating the position of the target area in the first video frame image;
and intercepting the target image of the target area from the first video frame image based on the target position information.
In one possible implementation, the apparatus further includes a model training module to:
acquiring at least two first sample images, wherein the first sample images carry position marking information and matching parameter marking information, the position marking information is used for indicating the position of a target area containing key information in the first sample images, and the matching parameter marking information is used for indicating whether the target image of the target area includes the key information to be identified;
and training the first recognition model based on the target image of the target area in the at least two first sample images to obtain the trained first recognition model.
In one possible implementation, the identification module includes at least one of:
the second recognition submodule is used for performing gesture recognition on the second video frame to obtain a gesture included in the second video frame, and the gesture is used for indicating to adjust a background image in the second video frame;
and the third recognition submodule is used for performing voice recognition on the voice data to obtain a voice instruction, and the voice instruction is used for indicating to adjust the background image in the second video frame.
In one possible implementation, the adjusting module is configured to perform any one of:
removing the background image in the second video frame in response to the identification result corresponding to the first adjustment mode;
responding to the identification result corresponding to a second adjustment mode, and performing fuzzy processing on the background image in the second video frame;
and replacing the background image in the second video frame with the reference image in response to the identification result corresponding to the third adjustment mode.
In one possible implementation, the apparatus further includes:
and the sending module is used for sending the live broadcast picture to a server, and the server is used for sending the live broadcast picture to terminals of audience users.
In one possible implementation, the identification module is configured to:
and in response to the background adjusting function being in an on state, at least one of a first video frame in the first video data, a second video frame in the second video data or voice data in the second video data is identified to obtain an identification result, and the background adjusting function is used for indicating whether to allow the background image of the second video data to be adjusted.
In one possible implementation, the identification module is configured to:
in response to the background adjustment function being in an on state and the automatic background adjustment function being in an on state, identifying a first video frame in the first video data to obtain a first identification result;
in response to the background adjustment function being in an on state and the gesture background adjustment function being in an on state, identifying the video frame image in the second video data to obtain a second identification result;
and in response to the background adjustment function being in an on state and the voice background adjustment function being in an on state, identifying the live voice data to obtain a third identification result.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one computer program stored therein, the at least one computer program being loaded and executed by the one or more processors to perform operations performed by the live view processing method.
In one aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement operations performed by the live view processing method.
In one aspect, a computer program product is provided that includes at least one computer program stored in a computer readable storage medium. The processor of the computer device reads the at least one computer program from the computer-readable storage medium, and executes the at least one computer program to cause the computer device to perform the operations performed by the live view processing method.
According to the technical solution provided by the embodiments of the present application, a function for adjusting the background image of the second video data is provided in a live broadcast scene that contains two channels of video data. The adjustment mode for the background image in the second video data is determined according to the recognition result of any one of the first video frame in the first video data, the second video frame in the second video data, and the voice data, so that the background image of the second video data can be adjusted flexibly. This improves the display effect of the second video data and the overall visual effect of the live broadcast picture, and prevents the second video data from degrading the display effect of the first video picture.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a live view processing method according to an embodiment of the present application;
fig. 2 is a schematic view of a live interface provided in an embodiment of the present application;
fig. 3 is a flowchart of a live view processing method according to an embodiment of the present application;
fig. 4 is a schematic view of a live interface provided in an embodiment of the present application;
fig. 5 is a flowchart of a live view processing method according to an embodiment of the present application;
fig. 6 is a schematic diagram of a live configuration interface provided in an embodiment of the present application;
FIG. 7 is a diagram illustrating a background adjustment configuration interface according to an embodiment of the present disclosure;
fig. 8 is a schematic view of a live interface provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of an image segmentation provided by an embodiment of the present application;
fig. 10 is a schematic diagram of a live view provided in an embodiment of the present application;
fig. 11 is a schematic diagram of a live view processing procedure provided in an embodiment of the present application;
fig. 12 is a schematic structural diagram of a live view processing apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the following will describe embodiments of the present application in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in this application are used to distinguish between identical or similar items that have substantially the same function. It should be understood that "first," "second," and "nth" imply no logical or temporal dependency and place no limitation on number or order of execution.
The present application relates to Artificial Intelligence (AI) technology. Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning. The present application mainly relates to the machine learning, computer vision, and natural language processing technologies of artificial intelligence: an image recognition model, an image segmentation model, and a voice recognition model carried by the terminal are trained based on machine learning, so that the terminal has image recognition, image segmentation, and voice recognition capabilities. Illustratively, during live broadcasting, the terminal invokes the image recognition function and the voice recognition function to recognize images and voice data in the live broadcast data so as to determine a processing instruction for the background image in the live broadcast data, then invokes the image segmentation function to segment the background image in the live broadcast data, and performs blurring, replacement, or other processing on the background image.
The following explains the terms related to the embodiments of the present application:
virtual camera: the software camera can be simulated to be a real camera, and can be used in any application supporting a camera.
A trigger: the method includes that a control module triggers a next step when a certain condition is met, and in the embodiment of the application, a trigger is deployed in a first application run by a first terminal and used for identifying an instruction for adjusting a background image.
And (4) live game: the method is characterized in that the running interface of the game is played while the game is running by applying the Internet and the streaming media technology, so that audience users can see the game running interface presented on the terminal of the anchor user.
And (3) voice recognition: refers to a computer device recognizing human voice content as corresponding words.
Gesture recognition: refers to a means for computer equipment to understand human body language, aiming at recognizing human gestures through mathematical algorithms.
Image recognition: refers to a technique for processing, analyzing and understanding images using a computer device to recognize various different patterns of objects and objects.
Fig. 1 is a schematic diagram of an implementation environment of a live view processing method according to an embodiment of the present application, and the implementation environment illustratively includes a first terminal 101 and a server 102.
The first terminal 101 is a terminal used by the anchor user. A first application supporting live video streaming, for example a live broadcast assistant application, is installed and run on the first terminal 101; the first terminal 101 can generate a live broadcast picture through the first application and push it to the server 102 as a data stream. Optionally, the first terminal 101 may be a smart phone, a tablet computer, a notebook computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, a smart television, a smart in-vehicle device, or the like, which is not limited in the embodiments of the present application. In a possible implementation manner, the first terminal 101 is configured with a camera, or is connected to a camera, and the first terminal 101 collects video data through the camera. In a possible implementation manner, the first terminal 101 may also be connected to other terminals to obtain video data acquired by those terminals. For example, when the first terminal 101 is a notebook computer, the notebook computer may be connected to a mobile phone through wired or wireless communication, and the mobile phone may acquire video data through its camera, or record its screen to generate video data, and send the video data to the notebook computer.
The server 102 may be a background server of the first application, and is configured to forward the data stream transmitted by the first terminal 101 to a second terminal used by the viewer user. Optionally, the server 102 is an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, which is not limited in this embodiment of the present application.
Optionally, the first terminal 101 and the server 102 are directly or indirectly connected through wired or wireless communication. Those skilled in the art will appreciate that the number of the terminals may be greater or smaller, for example, the number of the terminals may be only one, or may be several tens or hundreds, or greater. The embodiment of the present application does not limit the number of terminals and the device type in the implementation environment.
The live broadcast picture processing method provided by the embodiments of the present application can be combined with various types of live broadcast scenes. Take a game live broadcast scene as an example. In a game live broadcast scene, in order to let audience users see the state of the anchor user during the game more intuitively, two channels of video data are usually collected: one channel presents the game running interface, and the other channel presents a picture of the anchor user. That is, in addition to the game running interface, the live broadcast interface also displays the picture of the anchor user. Referring to fig. 2, which is a schematic diagram of a live broadcast interface provided by an embodiment of the present application, the live broadcast interface includes a game running interface 201 and a picture 202 of the anchor user. As shown in fig. 2, the picture 202 of the anchor user is superimposed on the game running interface 201 for display, which blocks part of the game running interface 201; moreover, when the environment presented in the picture 202 of the anchor user is cluttered, the overall visual effect of the live broadcast interface is affected. By applying the present solution in this situation, the first terminal can automatically adjust the background image in the picture of the anchor user, for example remove the background image or replace it with a background image having a better visual effect, so that the picture of the anchor user is prevented from affecting the display effect of the game running interface and the overall visual effect of the live broadcast interface is improved. Of course, the technical solution provided by the embodiments of the present application may also be applied to other types of live broadcast scenes, which is not limited in the embodiments of the present application.
Fig. 3 is a flowchart of a live view processing method according to an embodiment of the present application. The method may be applied to the foregoing implementation environment, in this embodiment of the present application, the first terminal is used as an execution subject, and the live view processing method is described, referring to fig. 3, in a possible implementation manner, the embodiment includes the following steps:
301. A first terminal acquires first video data and second video data, where the first video data is obtained by recording a current display interface and the second video data is acquired in real time.
For example, the first video data is obtained by the first terminal recording an operation interface of a second application currently running, or may be obtained by other terminals recording an operation interface of a third application currently running, and the other terminals send the first video data to the first terminal. The second application and the third application may be any type of application, for example, the second application and the third application are game-type applications, which is not limited in this embodiment of the present application. The second video data is shot by the first terminal in real time through a built-in camera or an external camera, and may also be obtained through a virtual camera. It should be noted that, in the embodiment of the present application, the content and the obtaining method of the first video data and the second video data are not limited.
302. The first terminal identifies at least one of a first video frame in the first video data, a second video frame in the second video data or voice data in the second video data to obtain an identification result.
In a possible implementation manner, the first terminal performs image recognition on the first video frame, and determines whether the relevant key information in the first video frame is blocked; or, performing image recognition on the second video frame in the second video data, for example, recognizing an action such as a gesture of the anchor user; or, recognizing the voice data in the second video data, and judging whether the anchor user triggers the voice command. Of course, the first terminal may identify other information of the first video data and the second video data, and the method for acquiring the identification result is not limited in the embodiment of the present application. In the embodiment of the application, by setting multiple identification modes, namely setting multiple modes for triggering background image adjustment, the anchor user can flexibly adjust the background image of the video data, so that the convenience of man-machine interaction is improved, and the man-machine interaction efficiency is improved.
303. And the first terminal adjusts the background image in the second video frame based on the adjustment mode corresponding to the identification result.
The method for adjusting the background image includes removing the background image, blurring the background image, replacing the background image, and the like.
In one possible implementation manner, the first terminal determines, based on the recognition result, whether to adjust the background image of the second video frame and which adjustment manner to use. For example, if the first terminal recognizes that relevant key information in the first video frame is blocked, the background image in the second video frame may be removed; if the first terminal recognizes a target gesture, the background image is adjusted in the manner indicated by the target gesture; and if the first terminal recognizes a voice instruction, the background image is adjusted in the manner indicated by the voice instruction.
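By way of illustration only, the following Python sketch (using OpenCV and NumPy; the function name, the foreground mask, and the mode strings are assumptions made for this example and do not come from the patent) shows one way the three adjustment manners described above, namely removing, blurring, or replacing the background image, could be applied to a second video frame once a foreground mask separating the anchor user from the background is available.

```python
import cv2
import numpy as np

def adjust_background(frame, fg_mask, mode, reference_image=None, blur_ksize=31):
    """Apply one of the three background adjustment manners to a second video frame.

    frame:      BGR second video frame (H x W x 3, uint8).
    fg_mask:    foreground mask in [0, 1] (H x W); 1 = anchor user, 0 = background.
    mode:       "remove", "blur", or "replace" (illustrative labels).
    """
    mask3 = np.dstack([fg_mask] * 3).astype(np.float32)

    if mode == "remove":
        # Keep only the foreground; the background becomes black (or transparent
        # if an alpha channel is added before compositing).
        background = np.zeros_like(frame)
    elif mode == "blur":
        # Gaussian-blur the original background (kernel size must be odd).
        background = cv2.GaussianBlur(frame, (blur_ksize, blur_ksize), 0)
    elif mode == "replace":
        # Replace the background with a reference image resized to the frame.
        background = cv2.resize(reference_image, (frame.shape[1], frame.shape[0]))
    else:
        return frame  # unknown mode: leave the frame unchanged

    return (frame * mask3 + background * (1.0 - mask3)).astype(np.uint8)
```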
304. And the first terminal generates a live broadcast picture according to the first video frame and the adjusted second video frame, and displays the live broadcast picture in a live broadcast interface.
In a possible implementation manner, the first terminal merges the first video data and the adjusted second video data, and displays the merged video data on the live interface, that is, a live frame displayed on the live interface is obtained by merging the first video frame and the adjusted second video frame. Fig. 4 is a schematic diagram of a live interface provided in an embodiment of the present application, and as shown in fig. 4, the live interface displays a first video frame 401 and a second video frame 402 with a background removed.
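As an illustrative sketch only (the function and parameter names are assumptions, not part of the patent), merging the first video frame with the adjusted second video frame into a live broadcast picture can be thought of as overlaying the second frame, with an optional per-pixel opacity mask, onto a region of the first frame:

```python
import numpy as np

def compose_live_frame(first_frame, second_frame, top_left, alpha_mask=None):
    """Overlay the adjusted second video frame onto the first video frame.

    top_left:   (y, x) position of the overlay inside the first frame; the overlay
                is assumed to fit entirely within the first frame.
    alpha_mask: optional per-pixel opacity in [0, 1]; when the background of the
                second frame has been removed, this is its foreground mask.
    """
    live = first_frame.copy()
    h, w = second_frame.shape[:2]
    y, x = top_left
    roi = live[y:y + h, x:x + w]

    if alpha_mask is None:
        roi[:] = second_frame                       # opaque overlay, as in fig. 2
    else:
        a = np.dstack([alpha_mask] * 3)
        roi[:] = (second_frame * a + roi * (1.0 - a)).astype(roi.dtype)
    return live
```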
According to the technical solution provided by the embodiments of the present application, a function for adjusting the background image of the second video data is provided in a live broadcast scene that contains two channels of video data. The adjustment mode for the background image in the second video data is determined according to the recognition result of any one of the first video frame in the first video data, the second video frame in the second video data, and the voice data, so that the background image of the second video data can be adjusted flexibly. This improves the display effect of the second video data and the overall visual effect of the live broadcast picture, and prevents the second video data from degrading the display effect of the first video picture.
The above embodiment is a brief description of the technical solution of the present application. The following describes the live broadcast picture processing method in detail with reference to fig. 5.
501. The first terminal displays a live broadcast configuration interface and acquires configuration information from the live broadcast configuration interface.
In the embodiments of the present application, the first terminal runs a first application supporting live video streaming, and a user can configure, through the first application, the live broadcast materials displayed on the live broadcast interface and the material collection devices corresponding to those materials. For example, the live broadcast interface displays at least one live broadcast material. The live broadcast materials may include video materials such as a mobile-phone game screen, a computer game screen, and a camera picture, and may also include images, text, and the like. The material collection device corresponding to the computer game screen and to images may be the first terminal; the material collection device corresponding to the mobile-phone game screen may be another terminal connected to the first terminal; the material collection device corresponding to the camera picture may be the first terminal or a camera connected to the first terminal; and the material collection device corresponding to text may be the first terminal. It should be noted that the embodiments of the present application do not limit the live broadcast materials displayed on the live broadcast interface or the manner in which each live broadcast material is obtained.
Illustratively, the first terminal displays a live broadcast configuration interface, which provides a function for selecting live broadcast materials and a function for configuring the material collection device corresponding to each live broadcast material. Fig. 6 is a schematic diagram of a live broadcast configuration interface provided by an embodiment of the present application. As shown in (a) of fig. 6, the live broadcast configuration interface includes a live broadcast material selection area 601 and a material collection device configuration area 602, where the live broadcast material selection area 601 displays a selection control 603 corresponding to at least one live broadcast material. Taking the selection control corresponding to the mobile-phone game screen as an example, as shown in (b) of fig. 6, the material collection device configuration area displays a plurality of configuration items used to configure information about the target mobile phone running the game; illustratively, the material collection device configuration area displays a mobile-phone system selection item, a sound playback mode selection item, and the like. In a possible implementation manner, the material collection device configuration area also displays a two-dimensional code, and the target mobile phone can establish a connection with the first terminal by scanning the two-dimensional code, so as to send video data and the like to the first terminal. It should be noted that the above description of the method for establishing a connection between the target mobile phone and the first terminal is only an exemplary description of one possible implementation manner, and the embodiments of the present application do not limit which method is specifically used to establish the data connection between devices. Taking the selection control corresponding to the camera picture in the live broadcast material selection area as an example, as shown in (c) of fig. 6, the material collection device configuration area displays the device name of at least one camera connected to the first terminal, and the user can select any one of the cameras as the device used for this live broadcast. Of course, the material collection device configuration area can also be used to configure the resolution, the audio output mode, whether to enable beauty filters, and the like, which is not limited in the embodiments of the present application.
It should be noted that the above description of the live configuration interface is only an exemplary description, and the content included in the live configuration interface is not limited in the embodiment of the present application. In the embodiment of the application, the first terminal can acquire the configuration information of the live broadcast configuration interface, determine at least one live broadcast material displayed on the live broadcast interface in the live broadcast process based on the configuration information, and acquire the material acquisition equipment applied to each live broadcast material.
502. The first terminal acquires the first video data and the second video data based on the configuration information.
In the embodiments of the present application, the live broadcast interface displays at least two live broadcast materials, both of which are video materials. The first terminal determines the material collection device corresponding to each video material based on the configuration information, and obtains the first video data and the second video data through those material collection devices. In one possible implementation manner, the first video data is video data obtained by recording a current display interface, and the second video data is video data acquired in real time. For example, the first video data is obtained by recording the current display interface of the first terminal, or by recording the current display interface of another terminal connected to the first terminal, for example by recording the running interface of a mobile game. In one possible implementation manner, the second video data includes image data and voice data. The image data in the second video data may be shot in real time by a camera built into the first terminal or by a camera connected to the first terminal, for example video data obtained by shooting the anchor user with the camera; the image data may also be video data acquired by a virtual camera, for example video data containing the anchor user or video data containing a virtual avatar corresponding to the anchor user. The voice data in the second video data may be collected by a voice collection device built into the first terminal, or by a microphone connected to the first terminal. It should be noted that the embodiments of the present application do not limit the specific contents or the acquisition manner of the first video data and the second video data.
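A minimal sketch of acquiring the two channels of video data is given below; it assumes Python with OpenCV for the camera and the third-party mss library for screen recording, which are illustrative choices rather than components named by the patent.

```python
import cv2
import numpy as np
from mss import mss

camera = cv2.VideoCapture(0)          # built-in or connected camera (second video data)
screen_grabber = mss()
monitor = screen_grabber.monitors[1]  # current display interface (first video data)

def grab_frames():
    """Return one frame of first video data and one frame of second video data."""
    screen = np.array(screen_grabber.grab(monitor))[:, :, :3]  # BGRA to BGR
    ok, camera_frame = camera.read()
    return screen, (camera_frame if ok else None)
```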
503. The first terminal detects a state of a background adjustment function corresponding to the second video data, and performs the following steps 504 to 506 based on the state of the background adjustment function.
In one possible implementation manner, the first application run by the first terminal can provide a function for adjusting the background image of the video data, for example removing the background image from the video data, blurring the background image, or replacing the background image. In the embodiments of the present application, taking the adjustment of the background image of the second video data as an example, the first terminal displays a background adjustment configuration interface, which includes an on/off control for the background adjustment function. The background adjustment function is used to indicate whether adjusting the background image of the second video data is allowed: if the background adjustment function is in an on state, the first terminal is allowed to adjust the background image of the second video data; otherwise, it is not allowed to do so. In one possible implementation manner, the adjustment of the background image of the second video data may be triggered in at least one triggering manner. Illustratively, the background adjustment configuration interface displays an on/off control for an automatic background adjustment function, an on/off control for a gesture background adjustment function, and an on/off control for a voice background adjustment function. The automatic background adjustment function means that the first terminal intelligently identifies whether the background image of the second video data needs to be adjusted; for example, if the first terminal identifies that the second video data blocks the first video data, the first terminal can automatically remove the background image of the second video data. The gesture background adjustment function means that a specific gesture triggers the adjustment of the background image of the second video data; for example, different gestures correspond to different background image adjustment manners. The voice background adjustment function means that a voice instruction triggers the adjustment of the background image of the second video data; for example, the anchor user may issue a voice instruction during live broadcasting, and if the first terminal recognizes the voice instruction, the background image of the second video data is adjusted based on that instruction. Fig. 7 is a schematic diagram of a background adjustment configuration interface provided by an embodiment of the present application; referring to fig. 7, the background adjustment configuration interface displays the on/off controls corresponding to the above functions, and further displays several background adjustment effect illustrations 701. It should be noted that the above description of the background adjustment configuration interface is only an exemplary description, and the embodiments of the present application do not limit the specific style of the background adjustment configuration interface.
In the embodiments of the present application, in response to the background adjustment function being in the on state and the automatic background adjustment function being in the on state, the first terminal identifies the first video frame in the first video data to obtain an identification result; that is, the first terminal performs the following step 504 to determine whether the second video data blocks the first video data. In response to the background adjustment function being in the on state and the gesture background adjustment function being in the on state, the first terminal identifies the video frame image in the second video data to obtain an identification result; that is, the first terminal performs the following step 505 to identify a gesture instruction of the anchor user. In response to the background adjustment function being in the on state and the voice background adjustment function being in the on state, the first terminal identifies the live voice data to obtain an identification result; that is, the first terminal performs the following step 506 to identify a voice instruction of the anchor user. In the embodiments of the present application, multiple manners of triggering background image adjustment are provided, including AI (Artificial Intelligence) intelligent triggering, gesture triggering, voice triggering, and the like, so that the anchor user can flexibly adjust the background image of the video data, which improves the convenience and efficiency of human-computer interaction.
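Purely as an illustration of the dispatch logic in this step (the configuration fields and recognizer functions below are hypothetical names, not part of the patent), the toggle states could be modelled as follows, with each enabled trigger mapping to one of steps 504 to 506:

```python
from dataclasses import dataclass

# Placeholder recognizers standing in for steps 504 to 506; real implementations
# would call the image, gesture, and voice recognition models described below.
def recognize_first_frame(frame):   return {"key_info_blocked": False}
def recognize_gesture(frame):       return {"gesture": None}
def recognize_voice_command(audio): return {"command": None}

@dataclass
class BackgroundAdjustConfig:
    background_adjust_enabled: bool = False  # master switch
    auto_adjust_enabled: bool = False        # AI trigger (step 504)
    gesture_adjust_enabled: bool = False     # gesture trigger (step 505)
    voice_adjust_enabled: bool = False       # voice trigger (step 506)

def collect_recognition_results(cfg, first_frame, second_frame, voice_chunk):
    """Run only the recognizers whose triggers are switched on."""
    results = {}
    if not cfg.background_adjust_enabled:
        return results
    if cfg.auto_adjust_enabled:
        results["auto"] = recognize_first_frame(first_frame)
    if cfg.gesture_adjust_enabled:
        results["gesture"] = recognize_gesture(second_frame)
    if cfg.voice_adjust_enabled:
        results["voice"] = recognize_voice_command(voice_chunk)
    return results
```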
In this embodiment, only the adjustment of the background image of the second video data is taken as an example for description, in some embodiments, the adjustment of the background image of the first video data may also be performed, or the adjustment of both the background images of the first video data and the second video data may also be performed, which is not limited in this embodiment.
504. The first terminal identifies a first video frame in the first video data to obtain a first identification result.
In one possible implementation manner, a target area may be determined in the first video frame of the first video data based on the display positions of the first video data and the second video data in the live broadcast interface. The target area is the area blocked by the display area of the second video data, that is, the target area coincides with the display area of the second video data in the live broadcast interface. In a possible implementation manner, the first terminal may capture a first video frame from the first video data every first reference duration, and perform image recognition on the target image of the target area in the first video frame to obtain a first recognition result, namely the key information matching parameter corresponding to the target image. The first reference duration is set by a developer and is not limited in the embodiments of the present application. The key information matching parameter is used to indicate the degree of matching between the information included in the target image and the key information to be identified. The key information to be identified may be set by a developer, and different first video data include different key information. For example, when the first video data is obtained by recording the running interface of a competitive game, the first video frame may be the running interface of that game; for a competitive game, the key information may include the virtual map of the game, virtual prop information, match information, and the like, and the target area may be the display area of the virtual map, the virtual prop information, the match information, and the like. The position of the target area may be set by a developer, and there may be one or more target areas, which is not limited in the embodiments of the present application; the embodiments of the present application describe only the case of one target area. Fig. 8 is a schematic diagram of a live broadcast interface provided by an embodiment of the present application. Taking a game live broadcast scene as an example, the picture of the first video frame displayed in the live broadcast interface is the running interface of a competitive game, and the target area in the first video frame is the display area of the virtual map, namely area 801.
The process of obtaining and recognizing the target image in the target area is explained below. In one possible implementation manner, the first terminal first crops the target image of the target area from the first video frame image: the first terminal obtains target position information and, based on the target position information, crops the target image of the target area out of the first video frame image. The target position information is used to indicate the position of the target area in the first video frame image, and may include the coordinates of the top-left vertex of the target area and the width and height of the target area. In one possible implementation manner, a trained first recognition model is deployed in the first application run by the first terminal, and the first recognition model is used to recognize whether the target image contains the key information to be identified. Illustratively, the first recognition model is the lightweight model SqueezeNet, a compressed convolutional neural network with relatively few model parameters; optionally, the first recognition model may be further compressed in combination with deep compression techniques, and the compression ratio may reach 461X. In a possible implementation manner, the first terminal inputs the target image into the first recognition model, and the first recognition model compares the image features of the target image with the key features corresponding to the key information to obtain the key information matching parameter. The trained first recognition model contains the reference features corresponding to the key information to be identified; the first recognition model extracts features from the target image through a plurality of operation layers to obtain the image features of the target image, and generates the key information matching parameter, namely the first recognition result, based on the reference features and the image features. It should be noted that the above description of the method for obtaining and recognizing the target image is only an exemplary description, and the embodiments of the present application do not limit which method is used to obtain and recognize the target image.
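The cropping and recognition described above can be sketched as follows. This is an illustrative Python example in which the target position information is assumed to be an (x, y, width, height) tuple and recognition_model.predict is an assumed interface standing in for the deployed lightweight classifier; it is not the patent's actual implementation.

```python
import cv2
import numpy as np

def crop_target_region(first_frame, target_position):
    """Crop the target region (e.g. the virtual-map area) from a first video frame.

    target_position: (x, y, width, height), i.e. the top-left vertex plus size,
    as described for the target position information above.
    """
    x, y, w, h = target_position
    return first_frame[y:y + h, x:x + w]

def key_info_matching_parameter(target_image, recognition_model, input_size=(224, 224)):
    """Feed the cropped target image to a lightweight classifier and return the
    key information matching parameter (probability that key information is visible)."""
    resized = cv2.resize(target_image, input_size).astype(np.float32) / 255.0
    batch = np.expand_dims(resized, axis=0)
    score = recognition_model.predict(batch)  # assumed model interface
    return float(np.squeeze(score))
```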
The training process of the first recognition model is described below. In a possible implementation manner, the first recognition model may be trained through a third terminal, which is a terminal used by a developer; in the embodiments of the present application, the deep learning framework applied to model training may be Caffe (a convolutional neural network framework). In a possible implementation manner, at least two first sample images are obtained, where each first sample image carries position annotation information and matching parameter annotation information: the position annotation information indicates the position of the target area containing the key information in the first sample image, and the matching parameter annotation information indicates whether the first sample image includes the key information to be identified. The first recognition model is trained based on the at least two first sample images to obtain the trained first recognition model. Illustratively, images in the region indicated by the position annotation information may be captured from the first sample images at a rate of one per second using ffmpeg (Fast Forward MPEG), and the captured images are input into the first recognition model as training data. The first recognition model outputs the key information matching parameter corresponding to the training data; the error between this key information matching parameter and the corresponding matching parameter annotation information is determined and back-propagated to the first recognition model, and the parameters of each operation layer in the first recognition model are adjusted until the first recognition model satisfies the model convergence condition, at which point model training stops and the trained first recognition model is obtained. The trained first recognition model is a model that has learned the reference features of the key information. It should be noted that the above description of the training method for the first recognition model is only an exemplary description of one possible implementation manner, and the embodiments of the present application do not limit which manner is specifically adopted to train the first recognition model. In the embodiments of the present application, after the developer finishes training the first recognition model through the third terminal, the first recognition model can be deployed in the first application; when the first terminal used by the anchor user performs live broadcasting through the first application, the first recognition model can be invoked to perform image recognition on video frames in the first video data.
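As a hedged illustration of the training-data preparation described above (the file names, directory layout, and helper function are assumptions for this sketch), frames can be extracted at one per second with the ffmpeg command-line tool and the labelled target region cropped from each frame:

```python
import subprocess
from pathlib import Path

import cv2

def extract_training_crops(sample_video, position_label, out_dir="crops"):
    """Extract one frame per second with ffmpeg, then crop the labelled target
    region from every frame to build training data for the first recognition model."""
    frames_dir = Path(out_dir) / "frames"
    frames_dir.mkdir(parents=True, exist_ok=True)

    # One frame per second, as described above.
    subprocess.run(
        ["ffmpeg", "-i", str(sample_video), "-vf", "fps=1",
         str(frames_dir / "frame_%05d.png")],
        check=True,
    )

    x, y, w, h = position_label  # position annotation information
    crops = []
    for frame_path in sorted(frames_dir.glob("frame_*.png")):
        frame = cv2.imread(str(frame_path))
        crops.append(frame[y:y + h, x:x + w])
    return crops
```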
In the embodiments of the present application, whether the target image contains the key information is determined by recognizing the target image of the target area; that is, it is determined whether the second video data blocks the key information in the first video data, and it is then determined whether to adjust the background image of the second video data, thereby avoiding the influence of the second video data on the display effect of the first video data.
505. The first terminal identifies a second video frame in the second video data to obtain a second identification result.
In a possible implementation manner, the first terminal may capture a second video frame from the second video data every second reference duration, and perform gesture recognition on the second video frame to obtain the gesture included in the second video frame. The second reference duration is set by a developer and is not limited in the embodiments of the present application. The gesture is used to indicate how to adjust the background image in the second video frame; for example, different gestures correspond to different ways of adjusting the background image. In a possible implementation manner, the first terminal may invoke a second recognition model for gesture recognition, where the second recognition model may be deployed in the first application or in the server corresponding to the first application, and the second recognition model includes the image feature corresponding to at least one gesture. The above description of the gesture recognition method is only an exemplary description, and the embodiments of the present application do not limit the gesture recognition method. The training process of the second recognition model is the same as that of the first recognition model in step 504, and is not described here again. In the embodiments of the present application, during the training of the second recognition model, the applied network may be a U-Net fully convolutional neural network, and the trained second recognition model may be deployed in the first application or in the server corresponding to the first application.
In a possible implementation manner, the first terminal acquires first video data and second video data; for example, the first video data may be video stream data obtained by recording a running game picture, and the second video data may be video stream data obtained by a camera shooting the anchor user during the game. A display manner of the first video data and the second video data may be as shown in fig. 2, where the first video data is displayed in a region 201 and the second video data is displayed in a region 202. In the embodiment of the application, during game live broadcasting, the anchor user can make any gesture towards the camera, the camera sends the second video data including the gesture of the anchor user to the first terminal, and the first terminal recognizes the second video frame in the second video data to identify the gesture included therein. Illustratively, the first gesture is extending one finger, and indicates removing the live background of the anchor user, that is, removing the background image in the second video data; the second, third, and fourth gestures are extending two, three, or four fingers respectively, and all indicate replacing the live background of the anchor user with another image, where different gestures indicate different images, that is, the background image in the second video data is replaced with the image indicated by the gesture of the anchor user; the fifth gesture is extending five fingers, and indicates blurring the live background of the anchor user. Optionally, the anchor user can also adjust the degree of blurring; for example, the fifth gesture indicates light blurring of the background image, and a sixth gesture, making a fist, indicates heavy blurring of the background image. It should be noted that the above description of the correspondence between gestures and processing manners of the background image is only an exemplary description of one possible implementation manner, and the embodiment of the present application does not limit this. In one possible implementation manner, if the first terminal does not detect that the anchor user makes a gesture, the background image of the second video data is not processed; or, if the first terminal detects that the gesture made by the anchor user is not standard, the background image of the second video data may not be processed, and prompt information is displayed on the live interface to prompt the anchor user to make the gesture again. A sketch of such a gesture-to-adjustment mapping is given below.
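Purely as an illustration of the correspondence described above, the following Python sketch maps a recognized gesture label to a background adjustment instruction. The gesture labels, mode names, reference image paths, and blur levels are assumed examples, not definitions from this application.

    # Sketch only: map the gesture recognized in a second video frame to a
    # background adjustment instruction. All labels below are assumed examples.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Adjustment:
        mode: str                      # "remove", "replace", or "blur"
        reference_image: Optional[str] = None
        blur_radius: int = 0

    GESTURE_TO_ADJUSTMENT = {
        "one_finger":    Adjustment("remove"),
        "two_fingers":   Adjustment("replace", reference_image="backgrounds/beach.png"),
        "three_fingers": Adjustment("replace", reference_image="backgrounds/studio.png"),
        "four_fingers":  Adjustment("replace", reference_image="backgrounds/space.png"),
        "five_fingers":  Adjustment("blur", blur_radius=5),    # light blurring
        "fist":          Adjustment("blur", blur_radius=15),   # heavy blurring
    }

    def adjustment_for(gesture: Optional[str]) -> Optional[Adjustment]:
        """Return None when no gesture (or an unrecognized one) is detected,
        in which case the background image is left unprocessed."""
        if gesture is None:
            return None
        return GESTURE_TO_ADJUSTMENT.get(gesture)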
In the embodiment of the present application, performing gesture recognition on the second video frame is taken only as an example; in some embodiments, motion recognition, expression recognition, and the like may also be performed on the second video frame, which is not limited in the embodiment of the present application.
In the embodiment of the application, the second video frame is subjected to image recognition, namely, the video frame containing the anchor user is recognized, so that the background image of the video data can be flexibly adjusted based on the gesture of the anchor user, the anchor user does not need to manually select the background image in the first application through the selection control, and the human-computer interaction efficiency in the background image adjusting process is effectively improved.
506. And the first terminal identifies the voice data in the second video data to obtain a third identification result.
In a possible implementation manner, the first terminal may perform voice recognition on the live voice data to obtain a voice instruction, where the voice instruction is used to instruct adjustment of the background image in the second video frame, and different voice instructions correspond to different background image adjustment manners. In a possible implementation manner, the first terminal may invoke a third recognition model to perform the voice recognition step, where the third recognition model may be deployed in the first application or on a server corresponding to the first application, which is not limited in this embodiment of the present application. For example, the user may issue a voice instruction using a reference sentence pattern, where the reference sentence pattern may be "magic mirror, Xth background adjustment mode", and X may be any number indicating the serial number of a background adjustment mode; of course, the reference sentence pattern may also be another sentence pattern, which is not limited in this embodiment of the present application. The third recognition model can recognize the reference sentence pattern in the voice data to obtain the voice instruction, that is, obtain the third recognition result. It should be noted that the process of performing speech recognition by the third recognition model and the training process of the third recognition model are not limited in the embodiment of the present application.
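As an illustration only, a parser for the reference sentence pattern could look like the sketch below. It assumes the speech has already been transcribed to text by the third recognition model; the wake word and sentence pattern are taken from the example above, and the function name is hypothetical.

    # Sketch only: parse a transcribed utterance of the assumed reference sentence
    # pattern "magic mirror, Xth background adjustment mode" into a mode number.
    import re
    from typing import Optional

    # Accepts e.g. "magic mirror, 2nd background adjustment mode".
    PATTERN = re.compile(
        r"magic mirror,?\s*(\d+)(?:st|nd|rd|th)?\s+background adjustment mode",
        re.IGNORECASE)

    def parse_voice_instruction(transcript: str) -> Optional[int]:
        """Return the background adjustment mode number, or None if the utterance
        does not match the reference sentence pattern."""
        match = PATTERN.search(transcript)
        return int(match.group(1)) if match else None

    # Example: parse_voice_instruction("Magic mirror, 3rd background adjustment mode") -> 3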
In the embodiment of the application, by recognizing the voice data in the second video data, that is, recognizing the voice instruction of the anchor user, the background image of the video data is flexibly adjusted, and the anchor user does not need to manually select the background image in the first application through the selection control, so that the human-computer interaction efficiency in the background image adjusting process is effectively improved.
It should be noted that, during the live broadcast process, the first terminal may perform at least one of the above steps 504 to 506; for example, it may identify only the first video frame, identify both the first video frame and the second video frame, or identify all of the first video frame, the second video frame, and the voice data.
507. And the first terminal adjusts the background image in the second video frame based on the adjustment mode corresponding to the identification result.
In one possible implementation, in response to the recognition result corresponding to the first adjustment mode, the background image in the second video frame is removed. For example, if the recognition result obtained by the first terminal is a key information matching parameter and the key information matching parameter is greater than a parameter threshold, this indicates that the target image of the target area in the first video frame includes key information, and when the first video frame is displayed on the live interface, the key information in the first video frame may be blocked by the second video frame. In an exemplary scenario of live broadcasting a competitive battle game, after the anchor enters a battle, a virtual map corresponding to the battle is displayed in the target area of the game running interface, that is, relevant key information is displayed; before entering a battle or after exiting it, the target area no longer displays the virtual map, that is, no key information is displayed. The parameter threshold is set by a developer; for example, the parameter threshold may be set to 0.95. Alternatively, if the gesture or the voice instruction recognized by the first terminal corresponds to the first adjustment mode, the first terminal removes the background image in the second video frame.
In one possible implementation manner, in response to the recognition result corresponding to the second adjustment mode, the first terminal blurs the background image in the second video frame. For example, if the gesture or the voice instruction recognized by the first terminal corresponds to the second adjustment mode, the first terminal performs blurring processing, such as Gaussian blurring, on the background image in the second video frame.
In one possible implementation manner, in response to the recognition result corresponding to the third adjustment mode, the first terminal replaces the background image in the second video frame with a reference image. For example, if the gesture or the voice instruction recognized by the first terminal corresponds to the third adjustment mode, the first terminal replaces the background image in the second video frame. In the embodiment of the application, a plurality of reference images can be set, different gestures or voice instructions can correspond to different reference images, and the reference images can be set by developers or uploaded by users. A dispatch sketch covering the three adjustment modes is given below.
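The following Python sketch, given for illustration only, dispatches the available recognition results (key information matching parameter, gesture, voice instruction) to one of the three adjustment modes. The threshold of 0.95 comes from the example above; the function signature, mode names, and the priority order between gesture and voice are assumptions.

    # Sketch only: choose an adjustment mode from whichever recognition results
    # are available. Returns None when no adjustment should be triggered.
    from typing import Optional

    PARAMETER_THRESHOLD = 0.95  # example threshold from the embodiment

    def choose_adjustment(key_info_score: Optional[float] = None,
                          gesture_mode: Optional[str] = None,
                          voice_mode: Optional[str] = None) -> Optional[str]:
        # First adjustment mode: key information in the first video frame would
        # be blocked, so remove the background of the camera picture.
        if key_info_score is not None and key_info_score > PARAMETER_THRESHOLD:
            return "remove"
        # A recognized gesture or voice instruction directly names a mode
        # ("remove", "blur", or "replace"); gestures are checked first here,
        # but that ordering is an assumption of this sketch.
        if gesture_mode is not None:
            return gesture_mode
        if voice_mode is not None:
            return voice_mode
        return None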
The following describes the methods of background image removal, background image blurring, and background image replacement mentioned above. In a possible implementation manner, an image segmentation model is deployed in the first application run by the first terminal. The first terminal inputs the second video frame into the image segmentation model, and the image segmentation model classifies each pixel in the second video frame, determining whether each pixel belongs to the background region or the foreground region. For example, the image segmentation model may output classification information for each pixel in the second video frame, where the classification information may be expressed as 4-byte Float data; for example, 1.0 represents that a pixel belongs to the foreground region and 0.0 represents that it belongs to the background region, and the amount of classification information output by the image segmentation model corresponds to the width and height of the second video frame, that is, one value per pixel. Fig. 9 is a schematic diagram of image segmentation provided in this embodiment: the second video frame is shown in (a) of fig. 9, and the image segmentation result obtained based on the image segmentation model is shown in (b) of fig. 9. In this embodiment, the first terminal may adjust the background image of the second video frame based on the classification information of each pixel in the second video frame. In a possible implementation manner, the first terminal represents each pixel as an [r, g, b, a] array based on its color in the second video frame, where r represents the red color value, g represents the green color value, b represents the blue color value, and a represents the transparency of the pixel. For the adjustment mode of removing the background image, the first terminal may assign the classification information corresponding to each pixel to a in the corresponding [r, g, b, a] array; that is, for a pixel belonging to the foreground region the corresponding array is [r, g, b, 1], so the pixel is opaque, and for a pixel belonging to the background region the corresponding array is [r, g, b, 0], so the pixel is transparent. In the embodiment of the application, the background image in the second video frame can be flexibly and accurately removed by adjusting the transparency of the pixels. For the adjustment mode of blurring the background image, the first terminal may blur each pixel in the second video frame, with the blurred pixel represented as [r', g', b', a]; if the pixel belongs to the background region, the array corresponding to the pixel is adjusted to [r', g', b', a], and if the pixel belongs to the foreground region, the array is not adjusted, that is, it remains [r, g, b, a].
For the adjustment mode of background image replacement, taking replacing the background image in the second video frame with a reference image as an example, the first terminal obtains the array [r'', g'', b'', a''] corresponding to each pixel in the reference image; if a pixel in the second video frame belongs to the background region, its array is adjusted to the array of the corresponding pixel in the reference image, and if the pixel belongs to the foreground region, its array is not adjusted, that is, it remains [r, g, b, a]. It should be noted that the above description of the method for adjusting the background image of the second video frame is only an exemplary description of one possible implementation manner, and the embodiment of the present application does not limit which method is specifically used to adjust the background image of the second video frame. A per-pixel sketch of the three adjustments is given below.
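For illustration, the per-pixel operations above can be expressed with NumPy as in the following sketch. The segmentation mask is assumed to come from an image segmentation model (one float per pixel, 1.0 for foreground, 0.0 for background), and the Gaussian blur uses OpenCV purely as an example; function and parameter names are assumptions of this sketch.

    # Sketch only: remove, blur, or replace the background of a camera frame,
    # given a per-pixel foreground mask produced by an image segmentation model.
    import cv2
    import numpy as np

    def adjust_background(frame_rgba: np.ndarray, mask: np.ndarray, mode: str,
                          reference_rgba: np.ndarray = None) -> np.ndarray:
        """frame_rgba: HxWx4 uint8; mask: HxW float32 in {0.0, 1.0};
        reference_rgba (for "replace") must have the same HxWx4 shape."""
        out = frame_rgba.copy()
        if mode == "remove":
            # Assign the classification information to the alpha channel:
            # foreground pixels stay opaque, background pixels become transparent.
            out[..., 3] = (mask * 255).astype(np.uint8)
        elif mode == "blur":
            # Blur the whole frame, then keep the original colors for the foreground.
            blurred = cv2.GaussianBlur(frame_rgba, (21, 21), 0)
            fg = mask.astype(bool)
            out[~fg] = blurred[~fg]
        elif mode == "replace":
            # Copy reference-image pixels into the background region.
            fg = mask.astype(bool)
            out[~fg] = reference_rgba[~fg]
        return out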
In the embodiment of the application, multiple background adjustment modes are provided, and the anchor user can adjust the background image of the second video data based on the actual display situation of the video data in the live interface. For example, if the second video data occludes the first video data, the background image of the second video data can be removed directly; if the visual effect of the background image of the second video data is poor, the background image can be blurred or replaced with another image. This effectively improves the overall visual effect of the live picture and the visual experience of audience users watching the live broadcast.
508. And the first terminal generates a live broadcast picture according to the first video frame and the adjusted second video frame, and displays the live broadcast picture in a live broadcast interface.
In a possible implementation manner, the first terminal may combine the first video frame and the adjusted second video frame based on the display positions of the first video frame and the second video frame on the live interface to obtain a live frame, and display the live frame on the live interface. In this embodiment of the application, the first terminal may further push the live frame in the form of a data stream to the server corresponding to the first application, and the server then sends the live frame, also in the form of a data stream, to the terminal of each audience user, so that the live frame is displayed on the terminal of each audience user. Fig. 10 is a schematic diagram of a live frame provided in an embodiment of the present application, where (a) in fig. 10 shows the effect of removing the background image in the second video frame, (b) in fig. 10 shows the effect of blurring the background image in the second video frame (the blurring effect is indicated by oblique lines in the figure), and (c) in fig. 10 shows the effect of replacing the background image in the second video frame. A compositing sketch is given below.
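As an illustration, compositing the adjusted camera frame over the recorded interface frame at its display position could look like the sketch below; the positions, frame sizes, and alpha-blending convention are assumptions of this example rather than requirements of the application.

    # Sketch only: alpha-composite the adjusted second video frame (camera picture)
    # onto the first video frame (recorded interface) at its display position.
    import numpy as np

    def compose_live_frame(game_frame_rgb: np.ndarray, camera_rgba: np.ndarray,
                           top: int, left: int) -> np.ndarray:
        """game_frame_rgb: HxWx3 uint8; camera_rgba: hxwx4 uint8, whose alpha
        channel was produced by the background adjustment step."""
        out = game_frame_rgb.astype(np.float32).copy()
        h, w = camera_rgba.shape[:2]
        region = out[top:top + h, left:left + w]
        alpha = camera_rgba[..., 3:4].astype(np.float32) / 255.0
        region[:] = alpha * camera_rgba[..., :3] + (1.0 - alpha) * region
        return out.astype(np.uint8)

    # Example: live = compose_live_frame(game, adjusted_camera, top=40, left=1500)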
According to the technical scheme provided by the embodiment of the application, a function of adjusting the background image of the second video data is provided in a live broadcast scene containing two paths of video data. The adjustment mode of the background image in the second video data is determined according to the recognition result of any one of the first video frame in the first video data, the second video frame in the second video data, and the voice data, so that the background image of the second video data is adjusted flexibly, the display effect of the second video data and the overall visual effect of the live picture are improved, and the influence of the second video data on the display effect of the first video data can be avoided.
Fig. 11 is a schematic diagram of a live picture processing procedure provided in an embodiment of the present application. In the following, with reference to fig. 11 and taking a game live broadcast scene as an example, the above processing of the live picture is described. In one possible implementation, a trigger 1101 is disposed on the first terminal and is used to perform image recognition and voice recognition, that is, the trigger is used to execute the above steps 504 to 506. The first terminal can input the game picture, the microphone voice data, and the camera picture into the trigger 1101 according to the trigger manners already enabled in the first application, and acquire the recognition result output by the trigger 1101. If the recognition result can trigger adjustment of the background image of the camera picture, the current camera picture is marked to indicate that background adjustment is allowed for that camera picture and subsequent camera pictures. The first terminal may determine an adjustment mode for the background image of the camera picture based on the recognition result, input the camera picture to the image segmentation engine 1102, segment the background image in the camera picture through the image segmentation engine 1102, and input the image segmentation result to the texture synthesizer 1103, which adjusts the background image in the camera picture based on the adjustment mode, including removing the background image, blurring the background image, replacing the background image, and the like. In this embodiment of the application, by providing multiple trigger manners, the anchor user can flexibly adjust the background image in the camera picture, that is, adjust the display effect of the live environment at the current location, optimize the display effect of the camera picture, avoid an overly dull live background, make the live broadcast more interesting, and improve the visual experience of audience users watching the live broadcast. A sketch of this pipeline is given below.
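The following Python sketch, offered purely for illustration, wires the trigger, image segmentation engine, and texture synthesizer of fig. 11 together for one camera frame. Every class and function name here is hypothetical, standing in for components the application does not specify in code, and the stub methods are placeholders for the steps described above.

    # Sketch only: orchestration of one processing pass per camera frame.
    # Trigger, segmentation engine, and texture synthesizer are hypothetical stubs.
    class Trigger:
        """Runs whichever of steps 504-506 are enabled and returns an adjustment
        mode name ("remove" / "blur" / "replace") or None."""
        def recognize(self, game_frame, voice_chunk, camera_frame):
            ...  # image recognition on frames, speech recognition on voice_chunk

    class SegmentationEngine:
        def segment(self, camera_frame):
            ...  # returns a per-pixel foreground mask

    class TextureSynthesizer:
        def apply(self, camera_frame, mask, mode):
            ...  # removes / blurs / replaces the background (see earlier sketch)

    def process_camera_frame(trigger, engine, synthesizer,
                             game_frame, voice_chunk, camera_frame,
                             active_mode=None):
        """active_mode carries the adjustment enabled by an earlier frame, so the
        marking applies to this camera frame and subsequent ones."""
        mode = trigger.recognize(game_frame, voice_chunk, camera_frame)
        if mode is not None:
            active_mode = mode               # mark current and subsequent frames
        if active_mode is None:
            return camera_frame, None        # nothing triggered: leave frame as-is
        mask = engine.segment(camera_frame)
        return synthesizer.apply(camera_frame, mask, active_mode), active_mode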
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 12 is a schematic structural diagram of a live view processing apparatus according to an embodiment of the present application, and referring to fig. 12, the apparatus includes:
an obtaining module 1201, configured to obtain first video data and second video data, where the first video data is obtained by recording a current display interface, and the second video data is obtained by acquiring video data in real time;
an identifying module 1202, configured to identify at least one of a first video frame in the first video data, a second video frame in the second video data, or voice data in the second video data, so as to obtain an identification result;
an adjusting module 1203, configured to adjust a background image in the second video frame based on an adjusting manner corresponding to the recognition result;
a generating module 1204, configured to generate a live view according to the first video frame and the adjusted second video frame, and display the live view in a live view interface.
In one possible implementation, the identification module 1202 includes a first identification submodule for:
and performing image recognition on a target image of a target area in the first video frame to obtain a key information matching parameter corresponding to the target image, wherein the key information matching parameter is used for indicating the matching degree between the information included in the target image and the key information to be recognized.
In one possible implementation, the first identification submodule includes:
an image intercepting unit for intercepting the target image of the target area in the first video frame image;
and the image identification unit is used for inputting the target image into a first identification model, comparing the image characteristics of the target image with the key characteristics corresponding to the key information through the first identification model, and obtaining the key information matching parameters.
In one possible implementation, the image capture unit is configured to:
acquiring target position information, wherein the target position information is used for indicating the position of the target area in the first video frame image;
and intercepting the target image of the target area from the first video frame image based on the target position information.
In one possible implementation, the apparatus further includes a model training module to:
acquiring at least two first sample images, wherein the first sample images carry position marking information and matching parameter marking information, the position marking information is used for indicating the position of a target area containing key information in the first sample images, and the matching parameter marking information is used for indicating whether the target image of the target area comprises the key information to be identified;
and training the first recognition model based on the target image of the target area in the at least two first sample images to obtain the trained first recognition model.
In one possible implementation, the identifying module 1202 includes at least one of:
the second recognition submodule is used for performing gesture recognition on the second video frame to obtain a gesture included in the second video frame, and the gesture is used for indicating to adjust a background image in the second video frame;
and the third recognition submodule is used for performing voice recognition on the voice data to obtain a voice instruction, and the voice instruction is used for indicating to adjust the background image in the second video frame.
In one possible implementation, the adjusting module 1203 is configured to perform any one of the following:
removing the background image in the second video frame in response to the identification result corresponding to the first adjustment mode;
responding to the identification result corresponding to a second adjustment mode, and performing fuzzy processing on the background image in the second video frame;
and replacing the background image in the second video frame with the reference image in response to the identification result corresponding to the third adjustment mode.
In one possible implementation, the apparatus further includes:
and the sending module is used for sending the live broadcast picture to a server, and the server is used for sending the live broadcast picture to terminals of audience users.
In one possible implementation, the identifying module 1202 is configured to:
and in response to the background adjusting function being in an on state, at least one of a first video frame in the first video data, a second video frame in the second video data or voice data in the second video data is identified to obtain an identification result, and the background adjusting function is used for indicating whether to allow the background image of the second video data to be adjusted.
In one possible implementation, the identifying module 1202 is configured to:
responding to the background adjusting function in an opening state and the automatic background adjusting function in an opening state, and identifying a first video frame in the first video data to obtain a first identification result;
responding to the background adjusting function in an opening state and the gesture adjusting background function in the opening state, and identifying the video frame image in the second video data to obtain a second identification result;
and in response to the background adjusting function being in the open state and the voice adjusting background function being in the open state, identifying the live voice data to obtain a third identification result.
The device provided by the embodiment of the application provides a function of adjusting the background image of the second video data in a live broadcast scene containing two paths of video data, and determines the adjustment mode of the background image in the second video data according to the recognition result of any one of the first video frame in the first video data, the second video frame in the second video data, and the voice data, thereby realizing flexible adjustment of the background image of the second video data, so that the display effect of the second video data and the overall visual effect of the live picture are improved, and the influence of the second video data on the display effect of the first video data can be avoided.
It should be noted that: in the live view processing apparatus provided in the foregoing embodiment, only the division of the functional modules is illustrated when processing a live view, and in practical applications, the functions may be distributed by different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the live view processing apparatus and the live view processing method provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments and are not described herein again.
Fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 1300 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion video Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion video Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. Terminal 1300 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, terminal 1300 includes: one or more processors 1301 and one or more memories 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1301 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also referred to as a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing content that the display screen needs to display. In some embodiments, processor 1301 may further include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1302 is used to store at least one computer program for execution by the processor 1301 to implement the live picture processing method provided by the method embodiments herein.
In some embodiments, terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. Processor 1301, memory 1302, and peripheral interface 1303 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 1303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1304, display screen 1305, camera assembly 1306, audio circuitry 1307, positioning assembly 1308, and power supply 1309.
Peripheral interface 1303 may be used to connect at least one peripheral associated with I/O (Input/Output) to processor 1301 and memory 1302. In some embodiments, processor 1301, memory 1302, and peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1304 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1304 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 1304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1304 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1305 is a touch display screen, the display screen 1305 also has the ability to capture touch signals on or over the surface of the display screen 1305. The touch signal may be input to the processor 1301 as a control signal for processing. At this point, the display 1305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1305 may be one, providing the front panel of terminal 1300; in other embodiments, display 1305 may be at least two, either on different surfaces of terminal 1300 or in a folded design; in some embodiments, display 1305 may be a flexible display disposed on a curved surface or on a folded surface of terminal 1300. Even further, the display 1305 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display 1305 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or the like.
The camera assembly 1306 is used to capture images or video. Optionally, camera assembly 1306 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1306 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1307 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1301 for processing, or inputting the electric signals to the radio frequency circuit 1304 for realizing voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of terminal 1300. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1301 or the radio frequency circuitry 1304 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 1307 may also include a headphone jack.
The positioning component 1308 is used for positioning the current geographic position of the terminal 1300 for implementing navigation or LBS (Location Based Service). The Positioning component 1308 can be a Positioning component based on the GPS (Global Positioning System) of the united states, the beidou System of china, the graves System of russia, or the galileo System of the european union.
Power supply 1309 is used to provide power to various components in terminal 1300. The power source 1309 may be alternating current, direct current, disposable or rechargeable. When the power source 1309 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1300 also includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to: acceleration sensor 1311, gyro sensor 1312, pressure sensor 1313, fingerprint sensor 1314, optical sensor 1315, and proximity sensor 1316.
The acceleration sensor 1311 can detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 1300. For example, the acceleration sensor 1311 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1301 may control the display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311. The acceleration sensor 1311 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1312 may detect the body direction and the rotation angle of the terminal 1300, and the gyro sensor 1312 may cooperate with the acceleration sensor 1311 to acquire a 3D motion of the user with respect to the terminal 1300. Processor 1301, based on the data collected by gyroscope sensor 1312, may perform the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1313 may be disposed on a side bezel of terminal 1300 and/or underlying display 1305. When the pressure sensor 1313 is disposed on the side frame of the terminal 1300, a user's holding signal to the terminal 1300 may be detected, and the processor 1301 performs left-right hand recognition or shortcut operation according to the holding signal acquired by the pressure sensor 1313. When the pressure sensor 1313 is disposed at a lower layer of the display screen 1305, the processor 1301 controls an operability control on the UI interface according to a pressure operation of the user on the display screen 1305. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1314 is used for collecting the fingerprint of the user, and the processor 1301 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the identity of the user according to the collected fingerprint. When the identity of the user is identified as a trusted identity, the processor 1301 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 1314 may be disposed on the front, back, or side of the terminal 1300. When a physical button or vendor Logo is provided on the terminal 1300, the fingerprint sensor 1314 may be integrated with the physical button or vendor Logo.
The optical sensor 1315 is used to collect the ambient light intensity. In one embodiment, the processor 1301 may control the display brightness of the display screen 1305 according to the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1305 is increased; when the ambient light intensity is low, the display brightness of the display screen 1305 is reduced. In another embodiment, the processor 1301 can also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.
Proximity sensor 1316, also known as a distance sensor, is typically disposed on a front panel of terminal 1300. Proximity sensor 1316 is used to gather the distance between the user and the front face of terminal 1300. In one embodiment, the processor 1301 controls the display 1305 to switch from the bright screen state to the dark screen state when the proximity sensor 1316 detects that the distance between the user and the front face of the terminal 1300 gradually decreases; the processor 1301 controls the display 1305 to switch from the dark screen state to the bright screen state when the proximity sensor 1316 detects that the distance between the user and the front face of the terminal 1300 gradually increases.
Those skilled in the art will appreciate that the configuration shown in fig. 13 is not intended to be limiting with respect to terminal 1300 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be employed.
Fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1400 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 1401 and one or more memories 1402, where the one or more memories 1402 store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors 1401 to implement the methods provided by the foregoing method embodiments. Certainly, the server 1400 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server 1400 may further include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory including at least one computer program executable by a processor to perform the live view processing method in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is provided that includes at least one computer program stored in a computer readable storage medium. The processor of the computer device reads the at least one computer program from the computer-readable storage medium, and executes the at least one computer program to cause the computer device to perform the operations performed by the live view processing method.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A live broadcast picture processing method, comprising:
acquiring first video data and second video data, wherein the first video data is obtained by recording a current display interface, and the second video data is obtained by collecting the video data in real time;
identifying at least one of a first video frame in the first video data, a second video frame in the second video data or voice data in the second video data to obtain an identification result;
adjusting the background image in the second video frame based on the adjustment mode corresponding to the identification result;
and generating a live broadcast picture according to the first video frame and the adjusted second video frame, and displaying the live broadcast picture in a live broadcast interface.
2. The method of claim 1, wherein the recognizing at least one of a first video frame in the first video data, a second video frame in the second video data, or speech data in the second video data to obtain a recognition result comprises:
and performing image recognition on a target image of a target area in the first video frame to obtain a key information matching parameter corresponding to the target image, wherein the key information matching parameter is used for indicating the matching degree between the information included in the target image and the key information to be recognized.
3. The method according to claim 2, wherein the performing image recognition on the target image of the target region in the first video frame to obtain the key information matching parameter corresponding to the target image comprises:
intercepting the target image of the target area in the first video frame image;
and inputting the target image into a first recognition model, and comparing the image characteristics of the target image with the key characteristics corresponding to the key information through the first recognition model to obtain the key information matching parameters.
4. The method of claim 3, wherein said truncating the target image of the target area in the first video frame image comprises:
acquiring target position information, wherein the target position information is used for indicating the position of the target area in the first video frame image;
intercepting the target image of the target area from the first video frame image based on the target position information.
5. The method of claim 3, wherein prior to inputting the target image into the first recognition model, the method further comprises:
acquiring at least two first sample images, wherein the first sample images carry position marking information and matching parameter marking information, the position marking information is used for indicating the position of a target area containing key information in the first sample images, and the matching parameter marking information is used for indicating whether the target image of the target area contains the key information to be identified;
and training the first recognition model based on the target images of the target areas in the at least two first sample images to obtain the trained first recognition model.
6. The method of claim 1, wherein the recognizing at least one of a first video frame in the first video data, a second video frame in the second video data, or speech data in the second video data, resulting in a recognition result, comprises at least one of:
performing gesture recognition on the second video frame to obtain a gesture included in the second video frame, wherein the gesture is used for indicating to adjust a background image in the second video frame;
and performing voice recognition on the voice data to obtain a voice instruction, wherein the voice instruction is used for indicating to adjust the background image in the second video frame.
7. The method according to claim 1, wherein the adjusting the background image in the second video frame based on the adjustment manner corresponding to the recognition result includes any one of:
removing the background image in the second video frame in response to the recognition result corresponding to a first adjustment mode;
responding to the identification result corresponding to a second adjustment mode, and performing fuzzy processing on the background image in the second video frame;
and replacing the background image in the second video frame with a reference image in response to the identification result corresponding to a third adjustment mode.
8. The method of claim 1, wherein after generating a live view from the first video frame and the adjusted second video frame, the method further comprises:
and sending the live broadcast picture to a server, wherein the server is used for sending the live broadcast picture to terminals of audience users.
9. The method of claim 1, wherein the recognizing at least one of a first video frame in the first video data, a second video frame in the second video data, or speech data in the second video data to obtain a recognition result comprises:
and in response to the background adjusting function being in an on state, identifying at least one of a first video frame in the first video data, a second video frame in the second video data or voice data in the second video data to obtain an identification result, wherein the background adjusting function is used for indicating whether to allow the background image of the second video data to be adjusted.
10. The method of claim 9, wherein the recognizing at least one of a first video frame in the first video data, a second video frame in the second video data, or speech data in the second video data in response to the background adjustment function being in an on state, and obtaining a recognition result comprises:
responding to the background adjusting function in an opening state and the automatic adjusting background function in the opening state, and identifying a first video frame in the first video data to obtain a first identification result;
responding to the background adjusting function in an opening state and the gesture adjusting background function in the opening state, and identifying the video frame image in the second video data to obtain a second identification result;
and in response to the background adjusting function being in an open state and the voice adjusting background function being in an open state, identifying the live voice data to obtain a third identification result.
11. A live view processing apparatus, comprising:
the device comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring first video data and second video data, the first video data is obtained by recording a current display interface, and the second video data is obtained by real-time acquisition;
the identification module is used for identifying at least one of a first video frame in the first video data, a second video frame in the second video data or voice data in the second video data to obtain an identification result;
the adjusting module is used for adjusting the background image in the second video frame based on the adjusting mode corresponding to the identification result;
and the generation module is used for generating a live broadcast picture according to the first video frame and the adjusted second video frame and displaying the live broadcast picture in a live broadcast interface.
12. The apparatus of claim 11, wherein the identification module comprises a first identification submodule configured to:
and performing image recognition on a target image of a target area in the first video frame to obtain a key information matching parameter corresponding to the target image, wherein the key information matching parameter is used for indicating the matching degree between the information included in the target image and the key information to be recognized.
13. The apparatus of claim 12, wherein the first identification submodule comprises:
an image clipping unit configured to clip the target image of the target area in the first video frame image;
and the image identification unit is used for inputting the target image into a first identification model, and comparing the image characteristics of the target image with the key characteristics corresponding to the key information through the first identification model to obtain the key information matching parameters.
14. A computer device comprising one or more processors and one or more memories having stored therein at least one computer program, the at least one computer program being loaded and executed by the one or more processors to perform operations performed by the live view processing method of any one of claims 1 to 10.
15. A computer-readable storage medium, having at least one computer program stored therein, the at least one computer program being loaded into and executed by a processor to perform operations performed by the live view processing method of any one of claims 1 to 10.
CN202110120986.6A 2021-01-28 2021-01-28 Live broadcast picture processing method and device, computer equipment and storage medium Pending CN112770173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110120986.6A CN112770173A (en) 2021-01-28 2021-01-28 Live broadcast picture processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110120986.6A CN112770173A (en) 2021-01-28 2021-01-28 Live broadcast picture processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112770173A true CN112770173A (en) 2021-05-07

Family

ID=75706536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110120986.6A Pending CN112770173A (en) 2021-01-28 2021-01-28 Live broadcast picture processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112770173A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113613062A (en) * 2021-07-08 2021-11-05 广州云智达创科技有限公司 Video data processing method, apparatus, device, storage medium, and program product
CN114449355A (en) * 2022-01-24 2022-05-06 腾讯科技(深圳)有限公司 Live broadcast interaction method, device, equipment and storage medium
CN115022668A (en) * 2022-07-21 2022-09-06 中国平安人寿保险股份有限公司 Video generation method and device based on live broadcast, equipment and medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106791893A (en) * 2016-11-14 2017-05-31 北京小米移动软件有限公司 Net cast method and device
CN107920256A (en) * 2017-11-30 2018-04-17 广州酷狗计算机科技有限公司 Live data playback method, device and storage medium
CN108124109A (en) * 2017-11-22 2018-06-05 上海掌门科技有限公司 A kind of method for processing video frequency, equipment and computer readable storage medium
CN110062252A (en) * 2019-04-30 2019-07-26 广州酷狗计算机科技有限公司 Live broadcasting method, device, terminal and storage medium
CN110719416A (en) * 2019-09-30 2020-01-21 咪咕视讯科技有限公司 Live broadcast method, communication equipment and computer readable storage medium
CN111078011A (en) * 2019-12-11 2020-04-28 网易(杭州)网络有限公司 Gesture control method and device, computer readable storage medium and electronic equipment
CN111432235A (en) * 2020-04-01 2020-07-17 网易(杭州)网络有限公司 Live video generation method and device, computer readable medium and electronic equipment
CN111669612A (en) * 2019-03-08 2020-09-15 腾讯科技(深圳)有限公司 Live broadcast-based information delivery method and device and computer-readable storage medium
CN112040263A (en) * 2020-08-31 2020-12-04 腾讯科技(深圳)有限公司 Video processing method, video playing method, video processing device, video playing device, storage medium and equipment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106791893A (en) * 2016-11-14 2017-05-31 北京小米移动软件有限公司 Net cast method and device
CN108124109A (en) * 2017-11-22 2018-06-05 上海掌门科技有限公司 A kind of method for processing video frequency, equipment and computer readable storage medium
CN107920256A (en) * 2017-11-30 2018-04-17 广州酷狗计算机科技有限公司 Live data playback method, device and storage medium
CN111669612A (en) * 2019-03-08 2020-09-15 腾讯科技(深圳)有限公司 Live broadcast-based information delivery method and device and computer-readable storage medium
CN110062252A (en) * 2019-04-30 2019-07-26 广州酷狗计算机科技有限公司 Live broadcasting method, device, terminal and storage medium
CN110719416A (en) * 2019-09-30 2020-01-21 咪咕视讯科技有限公司 Live broadcast method, communication equipment and computer readable storage medium
CN111078011A (en) * 2019-12-11 2020-04-28 网易(杭州)网络有限公司 Gesture control method and device, computer readable storage medium and electronic equipment
CN111432235A (en) * 2020-04-01 2020-07-17 网易(杭州)网络有限公司 Live video generation method and device, computer readable medium and electronic equipment
CN112040263A (en) * 2020-08-31 2020-12-04 腾讯科技(深圳)有限公司 Video processing method, video playing method, video processing device, video playing device, storage medium and equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113613062A (en) * 2021-07-08 2021-11-05 广州云智达创科技有限公司 Video data processing method, apparatus, device, storage medium, and program product
CN113613062B (en) * 2021-07-08 2024-01-23 广州云智达创科技有限公司 Video data processing method, device, equipment and storage medium
CN114449355A (en) * 2022-01-24 2022-05-06 腾讯科技(深圳)有限公司 Live broadcast interaction method, device, equipment and storage medium
CN114449355B (en) * 2022-01-24 2023-06-20 腾讯科技(深圳)有限公司 Live interaction method, device, equipment and storage medium
CN115022668A (en) * 2022-07-21 2022-09-06 中国平安人寿保险股份有限公司 Video generation method and device based on live broadcast, equipment and medium
CN115022668B (en) * 2022-07-21 2023-08-11 中国平安人寿保险股份有限公司 Live broadcast-based video generation method and device, equipment and medium

Similar Documents

Publication Publication Date Title
CN110602321B (en) Application program switching method and device, electronic device and storage medium
CN109815150B (en) Application testing method and device, electronic equipment and storage medium
CN111726536A (en) Video generation method and device, storage medium and computer equipment
CN110650379B (en) Video abstract generation method and device, electronic equipment and storage medium
CN108449641B (en) Method, device, computer equipment and storage medium for playing media stream
CN112770173A (en) Live broadcast picture processing method and device, computer equipment and storage medium
CN112328091B (en) Barrage display method and device, terminal and storage medium
CN112044065B (en) Virtual resource display method, device, equipment and storage medium
CN111541907A (en) Article display method, apparatus, device and storage medium
CN110533585B (en) Image face changing method, device, system, equipment and storage medium
US11386586B2 (en) Method and electronic device for adding virtual item
CN112257552B (en) Image processing method, device, equipment and storage medium
CN111144365A (en) Living body detection method, living body detection device, computer equipment and storage medium
CN112749613A (en) Video data processing method and device, computer equipment and storage medium
CN112084811A (en) Identity information determining method and device and storage medium
CN111083526B (en) Video transition method and device, computer equipment and storage medium
CN111083513B (en) Live broadcast picture processing method and device, terminal and computer readable storage medium
CN112565806A (en) Virtual gift presenting method, device, computer equipment and medium
CN112822544A (en) Video material file generation method, video synthesis method, device and medium
CN112118353A (en) Information display method, device, terminal and computer readable storage medium
CN112023403A (en) Battle process display method and device based on image-text information
CN108228052B (en) Method and device for triggering operation of interface component, storage medium and terminal
CN110891181B (en) Live broadcast picture display method and device, storage medium and terminal
CN108881715B (en) Starting method and device of shooting mode, terminal and storage medium
CN112489006A (en) Image processing method, image processing device, storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40044517

Country of ref document: HK

RJ01 Rejection of invention patent application after publication

Application publication date: 20210507

RJ01 Rejection of invention patent application after publication