CN112040263A - Video processing method, video playing method, video processing device, video playing device, storage medium and equipment - Google Patents

Video processing method, video playing method, video processing device, video playing device, storage medium and equipment

Info

Publication number
CN112040263A
CN112040263A
Authority
CN
China
Prior art keywords
video
visual
voice data
special effect
visual effect
Prior art date
Legal status
Pending
Application number
CN202010896672.0A
Other languages
Chinese (zh)
Inventor
廖中远
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010896672.0A
Publication of CN112040263A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21: Server components or server architectures
    • H04N 21/218: Source of audio or video content, e.g. local disk arrays
    • H04N 21/2187: Live feed
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/232: Content retrieval operation locally within server, e.g. reading video streams from disk arrays
    • H04N 21/238: Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H04N 21/2387: Stream processing in response to a playback request from an end-user, e.g. for trick-play
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41: Structure of client; Structure of client peripherals
    • H04N 21/426: Internal components of the client; Characteristics thereof
    • H04N 21/42653: Internal components of the client for processing graphics
    • H04N 21/60: Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N 21/65: Transmission of management data between client and server
    • H04N 21/658: Transmission by the client directed to the server
    • H04N 21/6587: Control parameters, e.g. trick play commands, viewpoint selection

Abstract

The embodiments of the present application provide a video processing method, a video playing method, corresponding apparatuses, a storage medium and a device, which belong to the field of computer technology and relate to artificial intelligence, speech processing and computer vision. The video processing method can recognize target key information in the voice data of a video through speech recognition, acquire visual effect information corresponding to the target key information from a visual effect database, and add a special effect to the video according to the visual effect information. Because special effects can be added in real time while the video is playing, manual workload is reduced and efficiency is improved.

Description

Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to a video processing method, a video playing method, corresponding apparatuses, a storage medium and a device.
Background
At present, with the rapid development of internet multimedia technology, watching short videos and live videos online over the network has gradually become commonplace. Through a video playing client, an anchor can upload short videos recorded by himself or herself to a video playing platform and share them with viewers, or broadcast live through the platform; viewers can watch the shared short videos through a video playing client on their terminal devices, or enter the anchor's live room to watch the live video.
Videos provided by video playing platforms usually attract viewers only through the anchor's facial expressions or actions, so the presentation form is monotonous and the display effect of the video suffers. To improve the display effect, a common current approach is for a worker to manually add special effects during post-editing of the recorded video, for example adding variety-show effects to a video after recording. Manually adding special effects to a video involves a complex operation process, increases the workload of workers, is inefficient, and cannot add special effects in real time while the video is playing.
Disclosure of Invention
In order to solve the technical problems in the related art, the embodiments of the present application provide a video processing method, a video playing method, corresponding apparatuses, a storage medium and a device, which can add special effects in real time while a video is being recorded or played.
In order to achieve the above purpose, the technical solution of the embodiment of the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a video processing method, where the method includes:
acquiring a video to be processed;
if target key information is identified from the voice data of the video, generating a visual special effect according to the target key information;
adding the visual special effect into the image frame corresponding to the voice data to obtain a target video; the image frames corresponding to the voice data comprise image frames played corresponding to the voice data or image frames after the voice data is played.
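For illustration only, the following is a minimal, self-contained Python sketch of this first-aspect flow, assuming the voice data has already been transcribed and using simplified stand-ins for frames and effects; none of the names or data structures below are defined by this application.

```python
# Minimal sketch of the first-aspect flow; the transcript is assumed to come
# from speech recognition, and Frame/effect_db are simplified stand-ins rather
# than structures defined by this application.
from dataclasses import dataclass, field

@dataclass
class Frame:
    index: int
    effects: list = field(default_factory=list)

def process_video(frames, transcript, effect_db, spoken_at_frame):
    """Add a visual special effect to the image frames corresponding to the voice data."""
    for keyword, effect in effect_db.items():
        if keyword in transcript:                      # target key information recognized
            start = spoken_at_frame[keyword]           # frame at which the keyword is spoken
            for frame in frames[start:start + 30]:     # that frame and the ones just after it
                frame.effects.append(effect)           # add the visual special effect
    return frames                                      # the target video

frames = [Frame(i) for i in range(120)]
effect_db = {"peach blossom": "falling-petals animation"}
target = process_video(frames, "look at the peach blossom", effect_db, {"peach blossom": 40})
print(target[40].effects)                              # ['falling-petals animation']
```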
In an optional embodiment, the video is a live video, and before generating a visual effect according to target key information if the target key information is identified from the voice data of the video, the method further includes:
receiving a visual effect adding request sent by a target client in a live broadcast process, wherein the visual effect adding request carries a visual effect theme selected by a target user;
acquiring a visual effect database corresponding to the visual effect theme; and
generating a visual special effect according to the target key information specifically comprises: and acquiring the visual special effect corresponding to the target key information from the visual effect database corresponding to the visual effect theme.
In an optional embodiment, after obtaining the target video, the method further includes:
and respectively sending the target video to the target client and the associated client for watching the live broadcast.
In an optional embodiment, after obtaining the visual effect database corresponding to the visual effect theme, the method further includes:
performing word segmentation processing on the text information to obtain word segments contained in the text information;
and comparing each obtained word with the key information stored in the visual effect database respectively, and determining whether the text information comprises target key information matched with the key information stored in the visual effect database.
In an alternative embodiment, the converting the voice data of the live video into text information includes:
extracting acoustic features of the voice data, and inputting the acoustic features into a trained acoustic model to obtain a phoneme data sequence corresponding to the voice data;
respectively searching a text element corresponding to each phoneme in the phoneme data sequence in a pronunciation dictionary to obtain a text element sequence corresponding to the phoneme data sequence;
and inputting the text element sequence into the trained language model to obtain text information corresponding to the voice data.
In an alternative embodiment, determining a display position of a special effect corresponding to the visual effect information in an image frame corresponding to the voice data includes:
determining a position of a face in the image frame;
acquiring a position parameter of the visual effect information; the position parameter is used for indicating the position of the display position of the special effect corresponding to the visual effect information relative to the face in the image frame;
and determining the display position of the special effect corresponding to the visual effect information in the image frame according to the position parameter of the visual effect information and the position of the face in the image frame.
In an alternative embodiment, determining the position of the face in the image frame specifically includes:
carrying out facial feature detection on the image frame, and determining the position of a facial organ in the image frame;
determining the position of the face in the image frame according to the position of the face organ.
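As a rough illustration of the two embodiments above, the sketch below computes a display position from a detected face bounding box and a position parameter expressed as an offset relative to the face; the box format and offset values are assumptions, and face detection itself is taken as already done.

```python
# Sketch: derive the effect display position from the face position and a
# position parameter. The (x, y, w, h) box format and the offsets are examples.
def effect_display_position(face_box, position_param):
    x, y, w, h = face_box                 # face bounding box reported by a detector
    dx, dy = position_param               # offset of the effect relative to the face
    return (x + w // 2 + int(dx), y + int(dy))

face = (320, 180, 120, 160)               # position of the face in the image frame
print(effect_display_position(face, (0, -0.6 * 160)))   # (380, 84): above the head
```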
In a second aspect, an embodiment of the present application provides a video playing method, where the method includes:
playing voice data in the video;
playing a target video obtained by adding a special effect to the video; the special effect is added into the video in the process of recording or playing the video, and is determined according to the target key information in the voice data and added into the image frame corresponding to the voice data; the image frames corresponding to the voice data comprise image frames played corresponding to the voice data or image frames after the voice data is played.
In a third aspect, an embodiment of the present application further provides a video processing apparatus, where the apparatus includes:
the special effect generating unit is used for acquiring a video to be processed, and generating a visual special effect according to target key information if the target key information is identified from the voice data of the video;
the special effect adding unit is used for adding the visual special effect in the image frame corresponding to the voice data to obtain a target video; the image frames corresponding to the voice data comprise image frames played corresponding to the voice data or image frames after the voice data is played.
In an optional embodiment, the special effect adding unit is specifically configured to:
adding the visual special effect to the video to generate a target video; or,
and in the process of playing the video, adding the visual special effect to the video in real time.
In an optional embodiment, the special effect adding unit is specifically configured to:
determining a display position of a special effect corresponding to the visual effect information in an image frame corresponding to the voice data;
adding the visual special effect at the display location in the image frame.
In an optional embodiment, the special effect generating unit is specifically configured to:
converting voice data of the video into text information;
and when the text information comprises target key information, generating a visual special effect according to the target key information.
In an optional embodiment, the special effect generating unit is further configured to:
receiving a visual effect adding request sent by a target client in a live broadcast process, wherein the visual effect adding request carries a visual effect theme selected by a target user;
and acquiring a visual effect database corresponding to the visual effect theme.
In an optional embodiment, the video to be processed is a live video, and the apparatus further includes:
the system comprises a first data acquisition unit, a second data acquisition unit and a display unit, wherein the first data acquisition unit is used for receiving a visual effect addition request sent by a target client in a live broadcast process, and the visual effect addition request carries a visual effect theme selected by a target user; acquiring a visual effect database corresponding to the visual effect theme;
the special effect generating unit is specifically configured to:
and acquiring the visual special effect corresponding to the target key information from the visual effect database corresponding to the visual effect theme.
In an alternative embodiment, the apparatus further comprises:
and the video sending unit is used for respectively sending the target video to the target client and the associated client for watching the live broadcast.
In an optional embodiment, the video to be processed is a live video, and the apparatus further includes:
the second data acquisition unit is used for responding to the operation of starting the visual effect addition and displaying theme selection controls corresponding to visual effect themes of different video scenes in a live video playing interface; responding to the triggering operation of a theme selection control corresponding to any visual effect theme, and acquiring a visual effect database corresponding to the visual effect theme;
the special effect generating unit is specifically configured to:
and acquiring the visual special effect corresponding to the target key information from the visual effect database corresponding to the visual effect theme.
In an optional embodiment, the special effect generating unit is further configured to:
responding to the operation of starting the visual effect adding, and displaying theme selection controls corresponding to visual effect themes of different video scenes in a live video playing interface;
and responding to the triggering operation of the theme selection control corresponding to any visual effect theme to acquire the visual effect database corresponding to the visual effect theme.
In an optional embodiment, the special effect generating unit is specifically configured to:
and acquiring visual effect information corresponding to the target key information from a visual effect database corresponding to the visual effect theme, and generating the visual special effect according to the visual effect information.
In an optional embodiment, the special effect generating unit is further configured to:
if the visual effect database contains a plurality of pieces of visual effect information of the target key information, displaying the identification of each piece of visual effect information in a live video playing interface;
responding to a trigger operation aiming at the identifier of any visual effect information, and selecting the visual effect information corresponding to the trigger operation as the visual effect information corresponding to the target key information.
In an optional embodiment, the special effect adding unit is further configured to:
determining a position of a face in the image frame;
acquiring a position parameter of the visual special effect; the position parameter is used for indicating the position of the display position of the visual special effect relative to the face in the image frame;
and determining the display position of the visual special effect in the image frame according to the position parameter of the visual special effect and the position of the face in the image frame.
In an optional embodiment, the special effect adding unit is further configured to:
determining a special effect category to which the visual special effect belongs, and acquiring a position parameter of the visual special effect according to the special effect category to which the visual special effect belongs.
In an optional embodiment, the special effect adding unit is further configured to:
determining a position of a face in the image frame;
selecting, as the display position, a position where a distance from a position of the face satisfies a set distance value in a region other than the face in the image frame.
In a fourth aspect, an embodiment of the present application provides a video playing apparatus, where the apparatus includes:
the voice playing unit is used for playing voice data in the video;
a video playing unit configured to play a target video obtained by adding a special effect to the video; the special effect is added into the video in the process of recording or playing the video, and is determined according to the target key information in the voice data and added into the image frame corresponding to the voice data; the image frames corresponding to the voice data comprise image frames played corresponding to the voice data or image frames after the voice data is played.
In a fifth aspect, this application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the video processing method of the first aspect is implemented.
In a sixth aspect, the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the video playing method of the second aspect is implemented.
In a seventh aspect, this application embodiment further provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and when the computer program is executed by the processor, the processor is enabled to implement the video processing method of the first aspect.
In an eighth aspect, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and when the computer program is executed by the processor, the processor is enabled to implement the video playing method of the second aspect.
According to the video processing method, the video playing method, the apparatuses, the storage medium and the device provided by the embodiments of the present application, target key information in the user's voice data can be recognized while a video is being recorded or played, and a visual special effect is added to the video according to the target key information. Because the visual special effect is added in real time during video recording or playback, the manual workload is reduced and efficiency is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is an application scene diagram of a video processing method according to an embodiment of the present application;
Fig. 2 is an interaction diagram of a terminal device and a server in a video processing process according to an embodiment of the present application;
Fig. 3 is an interaction diagram of a terminal device and a server in another video processing process according to an embodiment of the present application;
Fig. 4 is a schematic view of an operation interface of video processing according to an embodiment of the present application;
Fig. 5 is a schematic view of another operation interface for video processing according to an embodiment of the present application;
Fig. 6 is a schematic diagram illustrating a process of converting voice data into text information according to an embodiment of the present application;
Fig. 7 is a schematic view of a display interface for video processing according to an embodiment of the present application;
Fig. 8 is a schematic view of another display interface for video processing according to an embodiment of the present application;
Fig. 9 is a schematic view of another display interface for video processing according to an embodiment of the present application;
Fig. 10 is a schematic view of another display interface for video processing according to an embodiment of the present application;
Fig. 11 is a schematic flowchart of a video processing method according to an embodiment of the present application;
Fig. 12 is a schematic view of another display interface for video processing according to an embodiment of the present application;
Fig. 13 is a schematic view of another display interface for video processing according to an embodiment of the present application;
Fig. 14 is a schematic flowchart of a video playing method according to an embodiment of the present application;
Fig. 15 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
Fig. 16 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present application;
Fig. 17 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present application;
Fig. 18 is a schematic structural diagram of a video playing apparatus according to an embodiment of the present application;
Fig. 19 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 20 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that references in the specification of the present application to the terms "comprises" and "comprising," and variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
(1) A client: software installed on a terminal device, for example, an APP installed on a mobile terminal such as a mobile phone, belongs to a software resource in the terminal device. The client in the embodiment of the present application refers to a client having a video playing function. For example, the terminal device may download an installation package of the client via the network, install the client using the installation package, and after the installation is completed, the client may operate on the terminal device.
(2) Live video: video that is interactively broadcast over the network using the internet and streaming media technology, one of the mainstream forms of expression of today's internet media. Live streaming is a new way of social networking: the anchor uses independently controllable audio and video capture equipment to collect audio and video, generates a live video and uploads it to a server over the network, and the server then sends the live video to the client of each user watching the broadcast.
(3) Short video: a form of internet content delivery, referring to video content that is pushed at high frequency, can be played through a client, and is suitable for viewing on the move or during short breaks. Generally, an anchor who live-streams on a new media platform can also send short videos recorded by himself or herself to the platform's server to be shared with other users for viewing.
(4) Special effect: also called a variety-show effect; animation effects, decorative text, sound effects and other elements that match the atmosphere of the video are used to enrich its content, help viewers understand it, improve the watchability of the program, enhance the entertainment value of the video, create topics of discussion around the program, and leave a deep impression on viewers. Variety-show special effects can also make a video more entertaining and dramatic, greatly improving how engaging it is.
The word "exemplary" is used hereinafter to mean "serving as an example, embodiment, or illustration. Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms "first" and "second" are used herein for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature, and in the description of embodiments of the application, unless stated otherwise, "plurality" means two or more.
Embodiments of the present application relate to Artificial Intelligence (AI) and Machine Learning techniques, and are designed based on Computer Vision (CV) techniques, Speech processing techniques (Speech Technology), and Machine Learning (ML) in the AI.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a voice processing technology, machine learning/deep learning and other directions.
With the research and progress of artificial intelligence technology, artificial intelligence is developed and researched in a plurality of fields, such as common smart home, image retrieval, video monitoring, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical treatment and the like.
Computer vision technology is an important application of artificial intelligence, which studies relevant theories and techniques in an attempt to build an artificial intelligence system capable of obtaining information from images, videos or multidimensional data to replace human visual interpretation. Typical computer vision techniques generally include image processing and video analysis. The method for detecting the face position in the image frame belongs to a method for processing images.
Key technologies in speech processing include automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and voice has already become one of its modes. The embodiments of the present application use automatic speech recognition to convert the voice data in a video into text information.
The natural language processing technology is an important direction in the fields of computer science and artificial intelligence. It is a research into various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include speech processing, semantic understanding, text processing, and the like.
Machine learning is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning. In the speech recognition process of the embodiments of the present application, an acoustic model and a language model based on machine learning or deep learning are used to convert voice data into text data by processing the acoustic features in the voice data. In the face detection process, a face detection model based on machine learning or deep learning is used to extract the features of facial organs in an image frame and then determine the position of the face in the image frame.
The present application will be described in further detail with reference to the following drawings and specific embodiments.
The video processing method provided by the embodiment of the application can be used for adding special effects in the recorded video and can also be used for adding special effects in the live video. The following embodiments are described by taking an example of adding a special effect to a live video.
Fig. 1 shows an application scenario of the video processing method provided in the embodiment of the present application, and referring to fig. 1, a server 100 is in communication connection with a terminal device through a network 200, where the network 200 may be, but is not limited to, a local area network, a metropolitan area network, or a wide area network, and the number of terminal devices connected to the server 100 may be multiple, for example, the terminal device may include a terminal device 300 of a main broadcast that is in a live broadcast, and may further include a terminal device 301 of a viewer that is in a live broadcast.
The terminal device 300 and the terminal device 301 can mutually transmit communication data and messages through the network 200 and the server 100. The terminal device 300 and the terminal device 301 may be portable devices (e.g., mobile phones, tablet computers, notebook computers, etc.), or may be computers, smart screens, Personal Computers (PCs), etc. Both the terminal device 300 and the terminal device 301 may be installed with a live application client having a video recording function or a video playing function, where the client installed on the terminal device 300 may be a client of an anchor, that is, an anchor, and the client installed on the terminal device 301 may be a client of a viewer watching an anchor live video, that is, a viewer.
The server 100 may be a server of a certain video playing platform. The server 100 may be any device having a networking function and providing a data processing capability, and for example, the server 100 may be a cloud server, and may be a server set composed of one or more servers.
The anchor may record a live video through a live application client installed on the terminal device 300, and upload the recorded live video to the server 100 in real time through the network 200. The terminal device 300 may also upload the anchor recorded short video to the server 100 in real time via the network 200, and the server 100 shares the anchor recorded short video to the viewer. The audience can enter the live broadcast room of the anchor through the live broadcast application client installed on the terminal device 301 to watch the live broadcast video, and the server 100 sends the live broadcast video of the anchor to each live broadcast application client of the audience watching the live broadcast in a live broadcast video stream mode. Viewers may also view anchor-shared short videos through a live application client installed on terminal device 301. After a viewer clicks a certain short video, the client sends a video playing request for playing the short video to the server 100, and the server 100 sends the short video specified by the video playing request to the live broadcast application client of the viewer in a video streaming manner, so that the live broadcast application client of the viewer plays the short video.
At present, most videos provided by video playing platforms attract viewers only through the anchor's facial expressions or actions, so the presentation form is monotonous and the display effect of the video suffers. To improve the display effect, a common current approach is for a worker to manually add special effects during post-editing of the recorded video, for example adding variety-show effects to a video after recording. Manually adding special effects to a video involves a complex operation process, increases the workload of workers, is inefficient, and cannot add special effects in real time while the video is playing.
In order to solve the above problems, the embodiments of the present application provide a video processing method, a video playing method, apparatuses, a storage medium and a device, which can recognize target key information in the user's voice data through speech recognition and add a visual special effect to the video according to the target key information. Special effects can thus be added in real time while the video is playing; compared with the related art, in which special effects must be added manually, this reduces the manual workload and improves efficiency.
The application scenario in fig. 1 is only an example of an application scenario for implementing the embodiment of the present application, and the embodiment of the present application is not limited to the application scenario in fig. 1.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description is given below with reference to the accompanying drawings and specific embodiments. Although the embodiments of the present application present the method steps shown in the following embodiments or figures, the method may include more or fewer steps based on routine or non-inventive work. For steps between which no necessary logical causal relationship exists, the order of execution is not limited to that given here; when the method is executed in an actual process or device, the steps may be executed sequentially or in parallel.
In an embodiment, the video processing method provided by the embodiment of the present application may be executed by a server, or executed by a client installed on a terminal device in cooperation with the server. Fig. 2 shows an interaction flow diagram of a video processing method, which, as shown in fig. 2, may comprise the steps of:
step S201, the target client sends the recorded video to the server in real time.
The target client can be understood as the anchor side described above, that is, the live application client installed on the anchor's terminal device. During the live broadcast, the anchor side sends the video recorded by the anchor to the server in real time as a video stream, and the server sends the anchor's recorded video in real time to the client of each viewer watching the live broadcast for playback. These viewer clients are hereinafter referred to as associated clients.
In step S202, the target client receives an operation of starting an add visual effect.
In step S203, the target client sends a visual effect addition request to the server.
And step S204, adding a special effect in the received video by the server according to the visual effect adding request to generate a target video.
The server receives the visual effect addition request sent by the target client and treats the video sent by the target client in real time as the live video. It first acquires the voice data of the live video and determines whether the voice data contains target key information; if target key information is recognized from the voice data, a visual special effect is added to the live video in real time according to the target key information to obtain the target video.
In some embodiments, the server converts the acquired voice data into text information, judges whether the acquired text information includes target key information, determines visual effect information according to the target key information when the text information includes the target key information, and adds a special effect to a live video according to the acquired visual effect information to obtain a target video.
In an optional embodiment, a visual effect database is stored in the server and is used to store the correspondence between key information and visual effect information. Each piece of key information may be a character, a word or a phrase, and in the visual effect database one piece of key information may correspond to one piece of visual effect information or to several.
After the server converts the voice data into text information, it performs word segmentation processing on the text information to obtain the segments it contains, compares each segment with the key information stored in the visual effect database, and determines whether the text information includes target key information that matches the stored key information. If it does, the visual effect information corresponding to the target key information is acquired from the visual effect database. If the target key information corresponds to several pieces of visual effect information in the database, one of them can be selected at random as the visual effect information corresponding to the target key information, and a special effect is added to the live video according to that visual effect information to obtain the target video.
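A simplified sketch of this matching step is given below: the recognized text is segmented, each segment is compared with the key information stored in the visual effect database, and one effect is chosen at random when a key maps to several. A greedy longest-match over the stored keys stands in for a real word segmenter, and the database contents are invented examples.

```python
# Hedged sketch of segmentation + matching against a visual effect database;
# greedy longest-match over the stored keys replaces a real segmenter.
import random

def find_effects(text, effect_db, max_key_len=4):
    effects, i = [], 0
    while i < len(text):
        for size in range(min(max_key_len, len(text) - i), 0, -1):
            segment = text[i:i + size]
            if segment in effect_db:                               # target key information found
                effects.append(random.choice(effect_db[segment]))  # several effects: pick one
                i += size
                break
        else:
            i += 1                                                 # no stored key starts here
    return effects

effect_db = {"桃花": ["falling-petals animation", "pink glow"], "下雪": ["snowfall overlay"]}
print(find_effects("今天桃花开了还下雪了", effect_db))
```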
In another alternative embodiment, the server also stores a key information base, and the key information stored in it may be characters, words or phrases. After the server converts the voice data into text information, it performs word segmentation on the text information, compares each segment with the key information stored in the key information base, and determines whether the text information includes target key information that matches the stored key information. If it does, the visual effect information corresponding to the target key information is acquired from the visual effect database according to the target key information; or the visual effect information corresponding to the target key information is generated according to the target key information and a preset special effect display mode, and a special effect is added to the live video according to that visual effect information to obtain the target video. The special effect display mode may include displaying the target key information as artistic text or displaying it dynamically.
Optionally, the server may store a single key information base shared by all anchors, or store one key information base for each anchor. When the server stores a key information base for each anchor, the anchor can set and adjust the key information stored in it according to his or her own needs.
In another optional embodiment, the server may also treat every segment in the text information obtained from the voice data as target key information, generate the visual effect information corresponding to the target key information according to the target key information and a preset special effect display mode, and add a visual special effect to the live video according to that visual effect information to obtain the target video. Optionally, when the target key information is a keyword, the characters of the keyword can be converted into text with a visual special effect and added to the live video when the visual special effect is generated.
In step S205, the server sends the target video to the target client.
Step S206, the server sends the target video to the associated client.
The server can send the target video with the added special effect to the associated clients of the viewers watching the live broadcast in real time as a video stream. At the same time, the server can also send the target video to the target client, so that the anchor can also see the playing effect of the video with the special effect added.
In other embodiments, if the object of video processing is a short video or other video to be distributed on a network platform, a visual effect may be added to the video during the video recording process. For example, the target client may add a visual special effect to the video according to the collected voice data of the anchor in the process of recording the video, generate a target video, and issue the target video to the network platform through the server; or the target client can send the recorded video to the server in real time, and the server adds the visual special effect to the video according to the voice data in the video by referring to the method to generate the target video, feeds the target video back to the target client and releases the target video to the network platform.
In the above embodiment, the video processing method executed by the server includes the following steps: in the process of recording or playing the video, if the target key information is identified from the voice data of the video, generating a visual special effect according to the target key information, and adding the visual special effect into the video.
In some embodiments, the target key information is a keyword, and after the keyword is identified from the voice data of the video, a visual special effect including a text in the keyword may be generated according to a set text display format. Illustratively, one or more text display formats can be set, such as artistic word display or motion picture display. If several character display formats are set, a user can select one character display format, and the visual special effect of the characters containing the keywords is generated according to the character display format selected by the user. For example, if the target key information recognized from the voice data is "peach blossom", and the set text display format is art text display, the art text is generated from the "peach blossom", and the visual special effect including the art text of the "peach blossom" is generated.
In other embodiments, after the keywords are recognized from the voice data of the video, a special effect image corresponding to the pre-stored target key information may be obtained, and the special effect image is used to generate the visual special effect according to the target key information. For example, the special effect image and the keyword may be saved correspondingly, that is, the keyword is used as the name of the special effect image. And identifying a keyword according to the voice data, and searching for a special effect image taking the keyword as a name. For example, the target key information recognized from the voice data is "peach blossom", and the special effect image with the name of the "peach blossom" is searched from the corresponding relationship between the pre-stored special effect image and the keyword, so that the visual special effect of the target key information is obtained.
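A minimal sketch of this lookup, assuming effect images are stored in a directory and named after their keywords; the directory path and file format are illustrative only.

```python
# Sketch: look up a pre-stored special effect image whose file name is the
# recognized keyword. The directory and .png format are assumptions.
from pathlib import Path

EFFECT_IMAGE_DIR = Path("effect_images")      # e.g. effect_images/桃花.png

def effect_image_for(keyword):
    candidate = EFFECT_IMAGE_DIR / f"{keyword}.png"
    return candidate if candidate.exists() else None

print(effect_image_for("桃花") or "no special effect image stored for this keyword")
```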
In the video processing method provided by the above embodiment, when the voice data in the video includes target key information, the visual effect information corresponding to the target key information is acquired and a special effect is added to the video according to that visual effect information to obtain the target video. By adding special effects to the video in real time, the entertainment value and topicality of a live video can be increased, the live content becomes richer, viewers are helped to understand the atmosphere the video conveys, the quality and excitement of the live broadcast improve, the viewing experience improves, and user stickiness increases accordingly.
In order to meet the different requirements of users in different live scenes, in another embodiment visual effect databases may be set up separately for a number of different scenes or themes. Storing different key information in different visual effect databases reduces the number of key information entries that need to be compared during matching, which reduces the amount of calculation and allows the matching result to be determined more quickly. Fig. 3 shows an interaction flowchart of the video processing method in this embodiment; as shown in fig. 3, the method may include the following steps:
step S301, the target client sends the recorded video to the server in real time.
Step S302, the target client receives the operation of starting the adding visual effect, and displays theme selection controls corresponding to visual effect themes of different video scenes in a live video playing interface.
The anchor may actively enable the addition of visual effects through the setting function options of the target client. For example, as shown in fig. 4, when the anchor records a video through the target client, the setting function options may be displayed on one side of the screen, or displayed on one side of the screen in response to a sliding operation by the anchor. The setting function options can include controls such as 'switch camera', 'enable special effects' and 'end live broadcast'.
If the anchor clicks the 'enable special effects' control, the target client receives the operation of enabling the addition of visual effects and displays, in the live video playing interface, the theme selection controls corresponding to the visual effect themes of different video scenes. For example, as shown in fig. 5, the theme selection controls may be displayed on the lower side of the live video playing interface; the selectable controls in fig. 5 include 'Original', 'Shiba Inu', 'Variety Show' and 'Girl'. 'Shiba Inu', 'Variety Show' and 'Girl' are theme selection controls, and each corresponding visual effect theme has its own visual effect database. If the anchor selects 'Original', no special effect is added to the video.
Step S303, the target client receives a trigger operation of the theme selection control corresponding to any one of the visual effect themes.
In step S304, the target client sends a visual effect addition request to the server. The visual effect addition request carries the visual effect theme selected by the anchor.
The anchor can select any visual effect theme according to the desire or the scene requirement, and click a theme selection control corresponding to the visual effect theme in a live video playing interface.
For example, the anchor may select the 'Variety Show' theme selection control in fig. 5. The target client receives the anchor's trigger operation on the 'Variety Show' theme selection control, generates a visual effect addition request containing the 'Variety Show' visual effect theme, and sends the visual effect addition request to the server.
Step S305, the server acquires a visual effect database corresponding to the visual effect theme according to the received visual effect adding request.
The server receives the visual effect addition request sent by the target client and obtains the visual effect theme selected by the anchor that is carried in the request, then acquires the visual effect database corresponding to that visual effect theme. For example, if the anchor selects the 'Variety Show' visual effect theme, the server obtains the visual effect database corresponding to the 'Variety Show' visual effect theme.
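For illustration, the server-side selection of a theme-specific database might look like the sketch below; the theme names and stored effects are invented, not data defined by this application.

```python
# Sketch: one visual effect database per visual effect theme; the theme carried
# in the addition request selects which database is used. All entries are examples.
THEME_EFFECT_DATABASES = {
    "variety_show": {"applause": "clapping-hands animation", "awkward": "crow-flies-past motion picture"},
    "shiba_inu":    {"cute": "dog-ears sticker"},
    "girl":         {"love": "pink-hearts overlay"},
}

def database_for(effect_addition_request):
    theme = effect_addition_request["visual_effect_theme"]   # carried in the request
    return THEME_EFFECT_DATABASES.get(theme, {})

db = database_for({"visual_effect_theme": "variety_show"})
print(db.get("applause"))   # clapping-hands animation
```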
Step S306, the server converts the voice data of the video to be processed into text information.
The server receives a visual effect adding request sent by the target client, takes live video sent by the target client in real time as video to be processed, and converts voice data in the live video into text information.
In some embodiments, as shown in fig. 6, the process of converting the speech data into text information may proceed as follows: the server first extracts acoustic features of the speech data and inputs them into a trained acoustic model to obtain a phoneme data sequence corresponding to the speech data; it then looks up, in a pre-stored pronunciation dictionary, the text element corresponding to each phoneme in the phoneme data sequence to obtain a text element sequence, and inputs the text element sequence into a trained language model to obtain the text information corresponding to the speech data. The acoustic model, the pronunciation dictionary and the language model together form a speech decoding module; inputting the acoustic features of the speech data into the speech decoding module yields the text information it outputs for that speech data.
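Purely to make the decoding chain concrete, the toy sketch below replaces the acoustic model, pronunciation dictionary and language model with trivial lookups; only the order of the stages (features to phonemes, phonemes to text elements, text elements scored by a language model) reflects the description above.

```python
# Toy decoding chain: acoustic features -> phonemes -> text elements -> text.
# All three models are stubbed; only the pipeline structure is illustrative.
PRONUNCIATION_DICT = {("t", "ao2"): "桃", ("h", "ua1"): "花"}   # phoneme -> text element
LM_SCORES = {("桃", "花"): 0.9}                                  # toy language-model statistics

def acoustic_model(acoustic_features):
    # A trained model would infer phonemes from the acoustic features; stubbed here.
    return [("t", "ao2"), ("h", "ua1")]

def decode(acoustic_features):
    phonemes = acoustic_model(acoustic_features)                 # acoustic model
    elements = [PRONUNCIATION_DICT[p] for p in phonemes]         # pronunciation dictionary
    score = LM_SCORES.get(tuple(elements), 0.0)                  # language model
    return "".join(elements) if score > 0.5 else ""

print(decode(acoustic_features=[]))   # 桃花
```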
Specifically, since the voice data is a voice signal collected by a microphone and is an analog signal, it can first be converted into a digital signal by analog-to-digital conversion; noise is removed from the converted voice data, which is then framed and preprocessed, and acoustic features are extracted from the preprocessed voice data. In some embodiments, the server may extract acoustic features from the voice data using the MFCC (Mel-Frequency Cepstral Coefficients) method. The MFCC method first frames the acquired voice data, dividing it into a number of voice frames, and preprocesses those frames, including pre-emphasis and windowing. Each voice frame is then transformed by FFT (Fast Fourier Transform) to obtain its spectrum, the spectrum is passed through a Mel filter bank to obtain a Mel spectrum, and finally the logarithm and a DCT (Discrete Cosine Transform) are applied to the Mel spectrum to obtain the MFCC features, i.e. the acoustic features extracted from the voice data.
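As one possible implementation of this feature-extraction chain (framing, windowing, FFT, Mel filter bank, log and DCT), the sketch below uses librosa; the application itself does not prescribe any library, and the sine wave merely stands in for microphone audio.

```python
# Hedged sketch of MFCC acoustic-feature extraction; librosa is only one way to
# implement the framing/FFT/Mel-filter/DCT chain described above.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
voice = 0.1 * np.sin(2 * np.pi * 220 * t)            # stand-in for digitized microphone audio

mfcc = librosa.feature.mfcc(y=voice.astype(np.float32), sr=sr,
                            n_mfcc=13, n_fft=400, hop_length=160)  # 25 ms frames, 10 ms hop
print(mfcc.shape)                                      # (13, number_of_frames)
```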
After the acoustic features of the voice data are extracted, they can be input into a trained acoustic model to obtain the phoneme data sequence corresponding to the voice data. The phoneme data sequence includes a number of phonemes arranged in the chronological order of pronunciation, each of which may represent the pronunciation of one syllable. The acoustic model is a key component of speech recognition: it can be trained to recognize the phoneme data contained in speech data. The parameters of the acoustic model are trained from the acoustic features of training speech, and during recognition the acoustic features of the speech to be recognized are input into the trained acoustic model to obtain the recognition result.
Illustratively, the acoustic model may be constructed using a hidden Markov model (HMM). The phoneme data sequence output by the HMM is generally obtained by one-way recognition in input order from left to right, where one phoneme corresponds to an HMM of three to five states; an HMM state can be understood as a state feature of the speech extracted by the model. A word is the HMM state sequence formed by concatenating the HMM states of the phonemes that make up the word, and continuous speech recognition combines the HMM state sequences of words. The training speech data set used to train the acoustic model includes training speech and the phoneme labels corresponding to it. The training speech is framed to obtain the acoustic features of the voice frame sequence arranged in time, and the acoustic model is trained with these acoustic features and the corresponding phoneme labels using a Bayesian statistical method, i.e. a maximum a posteriori decision method. During training, the prior probability between the acoustic features of the training speech and the corresponding phoneme labels is estimated, the class-conditional probability density is computed from the prior probability, the maximum a posteriori probability is calculated, and the parameters of the acoustic model are adjusted accordingly, thereby establishing the probability distribution between speech signals and phonemes and obtaining the trained acoustic model.
After the phoneme data sequence corresponding to the speech data is obtained, the text element corresponding to each phoneme in the phoneme data sequence is looked up in the pronunciation dictionary to obtain the text element sequence corresponding to the phoneme data sequence. The pronunciation dictionary stores the pronunciation symbol table of each character or word, that is, the correspondence between characters or words and phonemes, so the pronunciation dictionary establishes the connection between the acoustic model and the language model.
The obtained text element sequence is then input into the trained language model to obtain the text information corresponding to the voice data. The language model is a probability model for determining the probability of occurrence of a sentence or word sequence; it may determine which word sequence is more likely to be contained in the speech, or predict the next upcoming word given the words that have already occurred. The language model defines which words can follow an already recognized word (matching is a sequential process) and can therefore exclude unlikely words during matching. The language model can also be understood as the language rules used for word and path constraints in the matching search, including a syntactic network composed of recognized speech commands or a mathematical model built by statistical methods. Illustratively, the language model may be obtained by training on a large amount of training text using an N-Gram model, that is, a model based on (N-1)-order Markov chain statistics; the probability of the current word, P(w2|w1) or P(w3|w2, w1), is determined by the previous one or two words, so that the probability of occurrence of a sentence or word sequence can be predicted.
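As a rough illustration of the N-Gram statistics mentioned above, the following sketch estimates bigram probabilities P(w2|w1) from a training corpus by counting; a production language model would additionally apply smoothing and is not limited to this form.

from collections import Counter, defaultdict

def train_bigram_model(sentences):
    # sentences: list of word lists obtained from the training texts.
    unigram = Counter()
    bigram = defaultdict(Counter)
    for words in sentences:
        unigram.update(words)
        for w1, w2 in zip(words, words[1:]):
            bigram[w1][w2] += 1
    def prob(w2, w1):
        # Approximate P(w2 | w1) = count(w1, w2) / count(w1).
        return bigram[w1][w2] / unigram[w1] if unigram[w1] else 0.0
    return prob

def sequence_probability(prob, words):
    # Score a candidate word sequence by multiplying its bigram probabilities.
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= prob(w2, w1)
    return p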
Step S307, when the text information includes the target key information, determining the visual effect information according to the target key information.
The server converts the obtained anchor voice data into text information, then performs word segmentation processing on the text information to obtain word segments contained in the text information, compares each obtained word segment with key information stored in a visual effect database corresponding to a visual effect theme selected by the anchor, and determines whether the text information includes target key information matched with the key information stored in the visual effect database. The server can obtain the visual effect information corresponding to the target key information according to the determined target key information.
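The matching step can be sketched as follows, assuming the jieba library for Chinese word segmentation and a visual effect database represented as a simple dictionary; the names and the database structure used here are illustrative only.

import jieba

def find_target_key_info(text, effect_db):
    # effect_db: {key_information: visual_effect_information}, e.g. the database
    # of the visual effect theme currently selected by the anchor (assumed structure).
    segments = jieba.lcut(text)
    for segment in segments:
        if segment in effect_db:
            # The text contains target key information; return its visual effect info.
            return segment, effect_db[segment]
    return None, None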
Step S308, the server adds a special effect to the video to be processed according to the visual effect information to obtain a target video.
In some embodiments, the server may add the special effect to the image frame corresponding to the voice data at a predetermined fixed location or at a randomly selected location. The image frame corresponding to the voice data may be an image frame played corresponding to the voice data, or an image frame after the voice data is played. For example, since it may take a certain time to generate a special effect, the special effect may be added to image frames within a certain time period after the voice data is played, or to the N image frames after the voice data is played. Illustratively, if a certain image frame is played within a set time period after the voice data is played, a special effect may be added to that image frame. In one embodiment, the special effect may also be added to all image frames after the voice data is played until the end of the video. The special effect added to the image frame is determined according to the visual effect information, and the visual effect information may include artistic word elements, graphic elements, picture elements, moving picture elements or animation elements corresponding to the key information.
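One possible way to pick the image frames that correspond to a piece of voice data, under the assumption that frames and voice data carry timestamps, is sketched below; the window length and the variable names are illustrative only.

def frames_for_voice(frames, voice_end_time, window_seconds=2.0):
    # frames: list of (timestamp, image) pairs of the video to be processed.
    # Return the frames played within window_seconds after the voice data ends,
    # i.e. the frames into which the generated special effect may be added.
    return [img for ts, img in frames
            if voice_end_time <= ts <= voice_end_time + window_seconds]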
In other embodiments, to prevent the special effect displayed in an image frame from blocking the main part of the anchor's face and affecting viewing, the position of the face in the image frame may be determined first. The display position of the special effect may then be randomly selected in a region other than the face in the image frame, or a position whose distance from the face position satisfies a set distance value may be selected as the display position, and the special effect is added to the image frame according to the selected display position.
In particular, the server may determine the face position in the image frame by a face detection algorithm or a face detection model. The face detection model may be a trained deep learning network model. For example, in one embodiment, the server may perform facial feature detection on the image frame through a face detection model, determine the position of the facial organ in the image frame, and then determine the position of the face in the image frame according to the position of the facial organ.
In another embodiment, the server may first perform face detection in the image frames using the Viola-Jones face detection algorithm, which detects the position of a frontal face in the image. If the anchor's frontal face is observable in the current image frame, the Viola-Jones algorithm can detect the face position in the image frame. If the current image frame does not contain the anchor's frontal face, for example, the anchor's side face is facing the camera, the Viola-Jones algorithm cannot detect the face position. In this case, the server may extract features of the facial organs in the image frame: candidate regions of the mouth and the nose may be found through contour and edge information, the features of the ears may be determined through a deep learning network model, such as a Fast R-CNN network model, to find candidate regions of the ears, and finally, according to the relative positions of the mouth, the nose and the ears, it is determined whether the detected positions of the nose, mouth and ears are correct.
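A minimal sketch of the frontal face detection step is given below, using the OpenCV implementation of the Viola-Jones (Haar cascade) detector; the part-based fallback detection of mouth, nose and ears described above is not reproduced here, and the code is illustrative rather than the exact implementation of this embodiment.

import cv2

# The Haar cascade file ships with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no frontal face visible; fall back to part-based detection
    # Return the largest detected face as (x, y, w, h).
    return max(faces, key=lambda f: f[2] * f[3])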
If the face position cannot be detected in the image frame by the above methods, candidate regions of the eyes and nostrils in the image frame can be determined by edge detection and morphological image processing algorithms together with color features extracted from the image frame, and the correct positions of the eyes and nostrils can be obtained by calculating the center point positions of the eyes and nostrils, so that the face position in the image frame can be determined.
In the above process, a plurality of candidate regions may be obtained when the candidate regions of the facial organs are determined. For example, a plurality of candidate regions of the ear, with different sizes and positions, may be obtained. In this case, the candidate region with the highest probability may be selected as the final candidate region by a Non-Maximum Suppression algorithm. Taking the determination of the candidate region of the ear as an example, the non-maximum suppression algorithm proceeds as follows: suppose the network model outputs N ear candidate regions in total, together with the probability of each candidate region; the candidate regions are sorted by probability, and the candidate region with the highest probability is taken as candidate region X; the other candidate regions are compared with candidate region X one by one to determine their overlap rate with X; if the overlap rate is less than or equal to a set threshold, the candidate region is kept, otherwise it is discarded. After one round of comparison, the candidate region with the highest probability among the remaining candidate regions is selected and a second round of comparison is performed; this process is repeated until one candidate region is finally left, which is taken as the candidate region of the ear. Determining the candidate regions of the facial organs by the non-maximum suppression algorithm can improve the accuracy of face detection.
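The suppression procedure described above can be written directly from that description. A minimal sketch follows, where candidate regions are (x1, y1, x2, y2) rectangles and the overlap rate is computed as intersection over union; unlike the description, which retains a single region, this standard form may keep several non-overlapping regions, with the highest-probability one first in the returned list.

def non_max_suppression(boxes, scores, overlap_threshold=0.5):
    # boxes: list of (x1, y1, x2, y2); scores: probability of each candidate region.
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    # Sort candidates by probability, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # candidate region X with the highest probability
        keep.append(best)
        # Keep only the candidates whose overlap with X is below the threshold.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= overlap_threshold]
    return [boxes[i] for i in keep]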
In other embodiments, in order to further improve the display effect of the special effects, different display positions may be set for different special effects. In one embodiment, a display position may be set for each special effect, and the display position of each special effect may be stored in the corresponding visual effect information in the form of a position parameter. After the server acquires the visual effect information of the target key information, a position parameter in the visual effect information can be acquired, and the position parameter is used for indicating the position of the display position of the special effect corresponding to the visual effect information relative to the face in the image frame. By the method, the server can determine the position of the face in the image frame, and determine the display position of the special effect corresponding to the visual effect information in the image frame according to the position parameter of the visual effect information and the position of the face in the image frame.
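A possible sketch of how a position parameter stored with the visual effect information could be turned into a concrete display position, given the face rectangle found above, is shown below; the parameter format (a relative direction plus a margin in pixels) and the function name are assumptions made only for illustration.

def effect_display_position(face_box, position_param):
    # face_box: (x, y, w, h) of the anchor's face in the image frame.
    # position_param: e.g. {"anchor_point": "upper_left", "margin": 20} (assumed format).
    x, y, w, h = face_box
    margin = position_param.get("margin", 0)
    positions = {
        "upper_left":  (x - margin, y - margin),
        "upper_right": (x + w + margin, y - margin),
        "lower_left":  (x - margin, y + h + margin),
        "lower_right": (x + w + margin, y + h + margin),
        "below":       (x + w // 2, y + h + margin),  # e.g. neck position
    }
    return positions[position_param["anchor_point"]]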
For example, assume the visual effect information corresponding to the key information "peach blossom" includes a peach blossom graphic, peach blossom artistic words and the position parameter of the corresponding special effect, where the position parameter indicates that the display position of the special effect is at the upper left of the face position in the image frame and may further include the distance between the display position of the special effect and the face edge. After the special effect is added at the corresponding display position in the image frame according to the position parameter, the video picture shown in fig. 7 can be displayed in the live video playing interfaces of the anchor and the audience: the peach blossom graphic and the peach blossom artistic words are displayed at the upper left of the anchor's face.
For another example, assume the visual effect information corresponding to the key information "stubborn skin" includes an outer graphic, artistic words of "stubborn skin" located within the graphic, and the position parameter of the corresponding special effect, where the position parameter indicates that the display position of the special effect is at the lower right of the face position in the image frame, immediately adjacent to the edge of the face. After the special effect is added at the corresponding display position in the image frame according to the position parameter, the video picture shown in fig. 8 can be displayed in the live video playing interfaces of the anchor and the audience: the outer graphic, with the artistic words of "stubborn skin" inside it, is displayed at the lower right of the anchor's face.
In another embodiment, the display position may be set according to the special effect category, and the display positions of the special effects belonging to the same special effect category are the same, that is, the position parameter of the display position of the special effect is stored in correspondence with the special effect category. After the server acquires the visual effect information of the target key information, the server can determine the special effect category to which the visual effect information belongs, and acquire the position parameter corresponding to the special effect category to which the visual effect information belongs as the position parameter of the visual effect information; the position parameter is used to indicate the position of the display position of the special effect relative to the face in the image frame. And after the server determines the position of the face in the image frame, determining the display position of the special effect corresponding to the visual effect information in the image frame according to the position parameter of the visual effect information and the position of the face in the image frame.
For example, the display position corresponding to the special effect category to which special effects such as blowing wind and fallen leaves belong may be at the position of the anchor's neck, and the display position corresponding to the special effect category to which special effects such as arrows belong may be at the position of the anchor's head or shoulders. After determining that the visual effect information corresponding to certain target key information is fallen leaves, the server acquires the position parameter corresponding to the special effect category to which the visual effect information belongs; the acquired position parameter indicates that the display position of the special effect is below the face position in the image frame, and may also include the distance between the display position of the special effect and the face edge. The server adds the special effect at the corresponding display position in the image frame according to the position parameter, and the video picture shown in fig. 9 can be displayed in the live video playing interfaces of the anchor and the audience: the fallen leaves special effect is displayed at the neck, below the anchor's face.
After the server determines that the visual effect information corresponding to certain target key information is an arrow, the server acquires the position parameter corresponding to the special effect category to which the visual effect information belongs; the acquired position parameter indicates that the display position of the special effect is at the upper left of the face position in the image frame. The server adds the special effect at the corresponding display position in the image frame according to the position parameter, and the video picture shown in fig. 10 can be displayed in the live video playing interfaces of the anchor and the audience: the arrow special effect is displayed at the upper left of the anchor's face, at the position of the head.
In step S309, the server sends the target video to the target client.
In step S310, the server sends the target video to the associated client.
After obtaining the target video with special effects, the server can send the target video to the target client of the anchor and the associated end of the audience for playing, so that the anchor and the audience watching the live broadcast can see the video with the special effects added.
In the above embodiment, the video processing method performed by the server may include the steps of: acquiring a video to be processed, namely receiving a video sent by a target client; if the target key information is identified from the voice data of the video, generating a visual special effect according to the target key information; and adding a visual special effect in the image frame corresponding to the voice data to obtain a target video. The image frames corresponding to the voice data comprise image frames played corresponding to the voice data or image frames after the voice data is played.
In some embodiments, the video processing method performed by the server may include the steps of: receiving a visual effect adding request sent by a target client, acquiring a visual effect database corresponding to a visual effect theme according to the received visual effect adding request, and converting voice data of live video into text information. And performing word segmentation processing on the text information to obtain word segments contained in the text information, respectively comparing each obtained word segment with key information stored in a visual effect database corresponding to the visual effect theme selected by the anchor, and determining whether the text information comprises target key information matched with the key information stored in the visual effect database. When the text information comprises target key information, visual effect information corresponding to the target key information is obtained, a special effect is added to the live video according to the visual effect information, a target video is obtained, and the target video is sent to a target client and an associated client.
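Putting these server-side steps together, the flow can be summarized in a pseudocode-like sketch; every function named here (load_effect_database, speech_to_text, segment, add_effect, send_to) is a placeholder for the processing described in the preceding paragraphs, not an existing API.

def process_live_video(request, video, voice_data):
    # 1. Obtain the visual effect database for the theme carried in the request.
    effect_db = load_effect_database(request.visual_effect_theme)
    # 2. Speech recognition: acoustic model + pronunciation dictionary + language model.
    text = speech_to_text(voice_data)
    # 3. Word segmentation and matching against the key information in the database.
    for word in segment(text):
        if word in effect_db:
            effect_info = effect_db[word]
            # 4. Add the special effect to the frames corresponding to the voice data.
            target_video = add_effect(video, voice_data, effect_info)
            # 5. Send the target video to the target client and the associated clients.
            send_to(request.target_client, target_video)
            send_to(request.associated_clients, target_video)
            return target_video
    return video  # no target key information recognized; video unchanged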
In another embodiment, the video processing method provided by the embodiment of the present application may be executed by a terminal device, for example, the terminal device of the anchor, i.e. the target client. Fig. 11 shows a flow chart of a video processing method executed by a terminal device, which specifically includes the following steps:
in step S1101, a video to be processed is acquired.
Step S1102, if the target key information is identified from the voice data of the video, generating a visual special effect according to the target key information.
Step S1103, add the visual special effect to the image frame corresponding to the voice data to obtain the target video.
In some embodiments, if the object of video processing is a short video or other video to be published on a network platform, a visual special effect may be added to the video during the video recording process. In the process of recording the video, the terminal equipment identifies the collected voice data, if the target key information is identified from the voice data, the visual special effect is generated according to the target key information, and the generated visual special effect is added into the video to generate the target video.
In other embodiments, if the object of the video processing is a live video, the viewer's terminal device or the anchor terminal device may add a visual effect to the video during the video playback. In the video playing process, the terminal equipment identifies voice data in the video, if target key information is identified from the voice data, a visual special effect is generated according to the target key information, and the visual special effect is added to the video in real time for playing.
Specifically, in an embodiment, in the process of live broadcasting by the anchor, a special effect can be added to the live video all the time. In other embodiments, the special effect may be added to the live video after the anchor actively starts the special effect adding function, and if the anchor closes the special effect adding function, the addition of special effects to the live video is stopped.
In an embodiment, after receiving an operation of starting visual effect adding in the live broadcast process of the anchor, the terminal device of the anchor can acquire the anchor's voice data in the live video in real time, extract acoustic features from the voice data, and input the extracted acoustic features into the trained acoustic model to obtain the phoneme data sequence corresponding to the voice data. The text element corresponding to each phoneme in the phoneme data sequence is then looked up in the pronunciation dictionary to obtain the text element sequence corresponding to the phoneme data sequence. Finally, the obtained text element sequence is input into the trained language model to obtain the text information corresponding to the voice data, and a special effect is then added to the live video according to the text information to obtain the target video.
In some embodiments, after converting the acquired voice data into text information, the terminal device of the anchor may perform word segmentation processing on the text information obtained by the conversion to obtain words contained in the text information, and then compare each obtained word with key information stored in a pre-stored visual effect database to determine whether the text information includes target key information matched with the key information stored in the visual effect database. And if the text information comprises target key information, acquiring visual effect information corresponding to the target key information from a visual effect database.
In other embodiments, after converting the acquired voice data into text information, the terminal device of the anchor may perform word segmentation processing on the text information obtained by the conversion to obtain words contained in the text information, and then compare each obtained word with key information stored in a pre-stored key information base to determine whether the text information includes target key information matched with the key information stored in the key information base. If the text information comprises the target key information, the visual effect information corresponding to the target key information can be obtained from a pre-stored visual effect database according to the target key information; or generating visual effect information corresponding to the target key information according to the target key information and a set special effect display mode, and adding a special effect to the live video according to the visual effect information to obtain the target video.
Optionally, the anchor may set the key information stored in the key information base according to its own requirements, and adjust the key information stored in the key information base. The anchor can also set or select a special effect display mode according to self requirements, for example, the special effect display mode can include artistic word display or dynamic display and the like on target key information.
In other embodiments, the terminal device of the anchor may further use all the participles in the text information obtained by converting the voice data as target key information, generate visual effect information corresponding to the target key information according to the target key information and a set special effect display mode, and add a special effect to the live video according to the visual effect information to obtain the target video.
In another embodiment, after receiving an operation of starting adding a visual effect in a live broadcasting process of a main broadcast, a terminal device of the main broadcast responds to the operation of starting adding the visual effect, and displays theme selection controls corresponding to visual effect themes of different video scenes in a live video playing interface. After the anchor selects one visual effect theme, clicking the theme selection control corresponding to the selected visual effect theme, and responding to the triggering operation of the theme selection control corresponding to any visual effect theme by the terminal equipment of the anchor to acquire the visual effect database corresponding to the visual effect theme. Then, the terminal device of the anchor can acquire the voice data of the anchor in the live video in real time, convert the voice data into text information, perform word segmentation processing on the text information to obtain word segments contained in the text information, then compare each obtained word segment with key information stored in a visual effect database corresponding to the visual effect theme selected by the anchor, and determine whether the text information includes target key information matched with the key information stored in the visual effect database. And if the text information comprises target key information, acquiring the visual effect information corresponding to the target key information from a visual effect database corresponding to the visual effect theme selected by the anchor.
In some embodiments, if the visual effect database contains a plurality of pieces of visual effect information of the target key information, the anchor may present an identifier of each piece of visual effect information in the live video playing interface. The anchor can select any one of the identifiers of the visual effect information, click the identifier of the visual effect information, respond to the trigger operation aiming at the identifier of any one visual effect information by the terminal equipment of the anchor, and select the visual effect information corresponding to the trigger operation as the visual effect information corresponding to the target key information.
For example, as shown in fig. 12, if the text information includes the target key information "fairy", and the key information "fairy" corresponds to 4 pieces of visual effect information in the visual effect database, the identifiers of the 4 pieces of visual effect information may be displayed at the lower portion of the live video playing interface, for example: "Wa! This is a fairy", "I'm really a little fairy!", "The fairy herself" and "Here comes the fairy". The anchor can select any one of the 4 visual effect information identifiers and click it; after receiving the trigger operation of the anchor on the identifier, the terminal device of the anchor selects the visual effect information corresponding to the trigger operation as the visual effect information corresponding to the target key information.
In some embodiments, the terminal device of the anchor may add a special effect to the image frame corresponding to the voice data at a predetermined fixed position or at a randomly selected position. The image frame corresponding to the voice data comprises an image frame played corresponding to the voice data or an image frame after the voice data is played.
In other embodiments, the terminal device of the anchor may determine the position of the face in the image frame first, randomly select a display position of the special effect in a region other than the face in the image frame, or select a position having a distance from the position of the face satisfying a set distance value as the display position, and add the special effect to the image frame according to the selected display position.
In other embodiments, the terminal device of the anchor may perform feature detection on the anchor face in the image frame of the video by a face detection method, determine the position of the face organ of the anchor in the image frame, and determine the position of the face in the image frame according to the position of the face organ. The terminal device of the anchor may further obtain a position parameter in the visual effect information, the position parameter being used to indicate a position of a display position of a special effect corresponding to the visual effect information relative to the anchor face in the image frame. The anchor terminal can determine the display position of the special effect corresponding to the visual effect information in the image frame synchronously played with the voice data according to the position parameter of the visual effect information and the position of the anchor face in the image frame. And adding special effects at the display positions in the image frames by the terminal equipment of the anchor according to the visual effect information.
In other embodiments, the terminal device of the anchor may determine a special effect category to which the visual effect information belongs, and obtain a position parameter of the visual effect information according to the special effect category to which the visual effect information belongs; the position parameter is used to indicate the position of the display position of the special effect relative to the face in the image frame. For example, a text effect may appear around the anchor's face, a fallen leaf blowing effect may appear at the anchor's neck, an arrow effect may appear at the anchor's head and shoulders, and so on. After the terminal equipment of the anchor determines the position of the face in the image frame, the display position of the special effect corresponding to the visual effect information in the image frame is determined according to the position parameter of the visual effect information and the position of the face in the image frame. And adding special effects at the display positions in the image frames by the terminal equipment of the anchor according to the visual effect information.
For example, if in the interface shown in fig. 12 the anchor selects the visual effect information "Wa! This is a fairy", the terminal device of the anchor adds the special effect to the video frame according to the visual effect information. The video picture to which the special effect is added is shown in fig. 13, where "Wa!" is displayed at the upper left of the face position and "This is a fairy" is displayed at the lower right of the face position.
And adding a special effect into the image frame of the live video by the terminal equipment of the anchor to obtain the target video. The terminal equipment of the anchor broadcast can send the target video to the server, and the server sends the target video to the terminal equipment of the audience watching the live broadcast video of the anchor broadcast for playing.
Based on the same inventive concept as the above method, the embodiment of the present application further provides a video playing method, which may be executed by a terminal device, or executed by a terminal device of a main broadcast or a target client. As shown in fig. 14, the method may include the steps of:
step S1401, the voice data in the video is played.
In step S1402, a target video obtained by adding a special effect to a video is played.
The special effect is added to the video during video recording or playing, is determined according to the target key information in the voice data, and is added to the image frame corresponding to the voice data. The image frame corresponding to the voice data comprises an image frame played corresponding to the voice data or an image frame after the voice data is played.
In some embodiments, visual effects may be added to the video during the video recording process. In the process of recording the video, the terminal equipment identifies the collected voice data, if the target key information is identified from the voice data, the visual special effect is generated according to the target key information, and the generated visual special effect is added into the video to generate the target video.
In another embodiment, in the process of live broadcasting by the anchor, special effects can be added to the live video all the time. Alternatively, special effects may be added to the live video after the anchor actively starts the special effect adding function, and if the anchor closes the special effect adding function, the addition of special effects to the live video is stopped.
In the live broadcasting process, the terminal equipment plays live broadcasting videos collected by the camera and main broadcasting voice data collected by the microphone in real time. Meanwhile, the terminal equipment converts the collected voice data into text information, and if the text information comprises target key information, special effects are added into the live video according to the target key information to obtain the target video.
For example, in an actual application process, users may log in to a live application client installed on the terminal device 300. Each user may have an identity, i.e. a user identifier such as a user account, a user nickname or a contact address, by which the user can be recognized by other users on the live application. The user logs in to the terminal device by inputting his or her user identifier; the terminal device identifies the user by the input user identifier and allows the user to enter the application. After entering the application, the user may choose to start the live broadcast process with the anchor identity. In the live broadcast process, the user can start the visual effect adding function in the settings of the live broadcast page, and the terminal device can display theme selection controls corresponding to visual effect themes of different video scenes in the live video playing interface. For example, the theme selection controls corresponding to the visual effect themes are located below the live video playing interface shown in fig. 5, and the selectable visual effect themes include "original", "Shiba Inu", "variety show" and "girl". The user can select any one of the visual effect themes and click the theme selection control corresponding to that visual effect theme; for example, the user may select the theme selection control corresponding to the visual effect theme "variety show" in fig. 5. In the video live broadcast process, when the user sends certain voice data, the terminal device can display the corresponding special effect identifier around the user's face in the live video page.
When the terminal device receives the voice of the user, the terminal device can display the identifiers of a plurality of pieces of visual effect information in the live video playing interface. For example, after the user selects the theme selection control corresponding to the visual effect theme "girl" in fig. 5 in the live video playing interface and then says "I am a little fairy" during the live video, the identifiers of the visual effect information shown in fig. 12 may be displayed below the live video playing interface. The user can select the identifier of any piece of visual effect information and trigger it; after the terminal device receives the trigger operation of the user on the identifier of any piece of visual effect information, the visual effect information corresponding to the trigger operation can be selected as the visual effect information corresponding to the voice data sent by the user, and the visual effect information is displayed around the user's face in the live video page. For example, the user may select the identifier of the visual effect information "Wa! This is a fairy" in fig. 12, and the terminal device then displays the special effects "Wa!" and "This is a fairy", as shown in fig. 13, around the user's face on the live video page.
In an actual application process, a user may log in to a live application client installed on the terminal device 302. The user logs in to the terminal device by inputting a user identifier, and the terminal device recognizes the user identity by the input user identifier and allows the user to enter the application. After entering the application, the user selects to watch the live broadcast of the anchor with an audience identity. While hearing certain voice sent by the anchor, the user can see the corresponding visual effect information displayed on the live video page. For example, when the anchor has started the visual effect adding function on the anchor's terminal device, the viewer can see the "peach blossom" special effect shown in fig. 7 displayed around the anchor's face position on the live video page of the viewer's terminal device while hearing the anchor say "peach blossom". When the anchor has selected the theme selection control corresponding to the visual effect theme "girl" in fig. 5 on the anchor's terminal device, the viewer can hear the anchor say "This is a fairy" and, at the same time, see the special effects "Wa!" and "This is a fairy" shown in fig. 13 displayed around the anchor's face position on the live video page.
In an alternative embodiment, the target key information may be a target keyword. The terminal device of the anchor also stores a keyword library, and the anchor can adjust the keywords stored in the keyword library, adding new keywords or deleting some original keywords. In the process of the anchor's live broadcast, the terminal device can acquire the anchor's voice data, convert the voice data into text information, perform word segmentation processing on the text information to obtain the word segments contained in the text information, then compare each obtained word segment with the keywords stored in the keyword library, and determine whether the text information includes a target keyword matching a keyword stored in the keyword library. After determining that the text information includes a target keyword matching a keyword stored in the keyword library, the terminal device can display the keyword at a corresponding position in the video picture and add a visual effect to the keyword according to the visual effect theme selected by the anchor. For example, when the text information includes the target keyword "Hi" and the visual effect theme selected by the anchor is "variety show", corresponding visual effect information is generated according to "Hi", the corresponding special effect is added to the video, the word "Hi" is displayed at the corresponding position in the video picture of the obtained target video, and the word "Hi" can jump and slowly move from one side of the video picture to the corresponding position.
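The sliding text effect mentioned above (the word "Hi" moving from one side of the picture to its display position) can be sketched as a per-frame overlay, assuming OpenCV is used to draw on the frames; the linear easing and all parameter values here are chosen only for illustration.

import cv2

def overlay_sliding_text(frames, text, end_pos, duration_frames=30):
    # frames: list of image frames into which the special effect is added.
    # end_pos: final (x, y) display position of the text, e.g. next to the face.
    start_x = -100  # start just outside the left edge of the picture
    for i, frame in enumerate(frames[:duration_frames]):
        t = i / float(duration_frames - 1)
        x = int(start_x + t * (end_pos[0] - start_x))  # slide in horizontally
        cv2.putText(frame, text, (x, end_pos[1]), cv2.FONT_HERSHEY_SIMPLEX,
                    1.2, (0, 215, 255), 3, cv2.LINE_AA)
    for frame in frames[duration_frames:]:
        cv2.putText(frame, text, end_pos, cv2.FONT_HERSHEY_SIMPLEX,
                    1.2, (0, 215, 255), 3, cv2.LINE_AA)
    return frames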
In an optional embodiment, after the anchor starts the visual effect adding operation, the terminal device converts the voice data in the video into text information; the obtained text information is "I am a fairy". The terminal device takes "I am a fairy" as the target key information and generates artistic words corresponding to the target key information "I am a fairy" according to the set special effect display mode. After the special effect is added to the video, the artistic words corresponding to "I am a fairy" are displayed at the corresponding position in the video picture of the target video, and the artistic words corresponding to "I am a fairy" can be displayed in a jumping manner.
Based on the same inventive concept as the video processing method shown in fig. 11, the embodiment of the present application further provides a live video processing apparatus, and the live video processing apparatus may be disposed in a terminal device. Because the apparatus is a corresponding apparatus to the video processing method in the embodiment of the present application, and the principle of the apparatus for solving the problem is similar to that of the method, the implementation of the apparatus may refer to the implementation of the above method, and repeated details are not repeated.
Fig. 15 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application, and as shown in fig. 15, the video processing apparatus includes: a special effect generating unit 151 and a special effect adding unit 152; wherein:
a special effect generating unit 151 configured to acquire a video to be processed; if the target key information is identified from the voice data of the video, generating a visual special effect according to the target key information;
a special effect adding unit 152, configured to add a visual special effect to an image frame corresponding to the voice data to obtain a target video; the image frame corresponding to the voice data comprises an image frame played corresponding to the voice data or an image frame after the voice data is played.
In an optional embodiment, the special effect generating unit 151 is specifically configured to:
and generating a visual special effect containing the characters in the key words according to a set character display format.
In an optional embodiment, the special effect adding unit 152 is specifically configured to:
determining a display position of a special effect corresponding to the visual effect information in an image frame corresponding to the voice data;
visual effects are added at display locations in the image frame.
In an optional embodiment, the special effect generating unit 151 is specifically configured to:
converting voice data of the video into text information;
and when the text information comprises the target key information, generating the visual special effect according to the target key information.
In an optional embodiment, the special effect generating unit 151 is further configured to:
receiving a visual effect adding request sent by a target client in a live broadcast process, wherein the visual effect adding request carries a visual effect theme selected by a target user;
and acquiring a visual effect database corresponding to the visual effect theme.
In an alternative embodiment, as shown in fig. 16, the live video processing apparatus may further include a first data obtaining unit 161 and a video sending unit 162; wherein:
a first data obtaining unit 161, configured to receive a visual effect addition request sent by a target client in a live broadcast process, where the visual effect addition request carries a visual effect theme selected by a target user; acquiring a visual effect database corresponding to a visual effect theme;
and a video sending unit 162, configured to send the target video to the target client and the associated client watching the live broadcast respectively.
In an optional embodiment, the special effect generating unit 151 is specifically configured to:
and acquiring visual effect information corresponding to the target key information from a visual effect database corresponding to the visual effect theme.
In an optional embodiment, as shown in fig. 17, the live video processing apparatus may further include a second data obtaining unit 171, configured to respond to an operation of starting to add a visual effect, and display a theme selection control corresponding to a visual effect theme of different video scenes in a live video playing interface; and responding to the triggering operation of the theme selection control corresponding to any visual effect theme to obtain a visual effect database corresponding to the visual effect theme.
In an optional embodiment, the special effect generating unit 151 is specifically configured to:
and acquiring visual effect information corresponding to the target key information from a visual effect database corresponding to the visual effect theme.
In an optional embodiment, the special effect generating unit 151 is further configured to:
responding to the operation of starting the visual effect adding, and displaying theme selection controls corresponding to visual effect themes of different video scenes in a live video playing interface;
and responding to the triggering operation of the theme selection control corresponding to any visual effect theme to obtain a visual effect database corresponding to the visual effect theme.
In an optional embodiment, the special effect generating unit 151 is specifically configured to:
and obtaining visual effect information corresponding to the target key information from a visual effect database corresponding to the visual effect theme, and generating a visual special effect according to the visual effect information.
In an optional embodiment, the special effect generating unit 151 is further configured to:
if the visual effect database contains a plurality of pieces of visual effect information of the target key information, displaying the identification of each piece of visual effect information in a live video playing interface;
and responding to the triggering operation aiming at the identifier of any visual effect information, and selecting the visual effect information corresponding to the triggering operation as the visual effect information corresponding to the target key information.
In an alternative embodiment, the special effect adding unit 152 is further configured to:
determining a position of a face in an image frame;
acquiring a position parameter of the visual special effect; the position parameter is used for indicating the position of the display position of the visual special effect relative to the face in the image frame;
and determining the display position of the visual special effect in the image frame according to the position parameter of the visual special effect and the position of the face in the image frame.
In an alternative embodiment, the special effect adding unit 152 is further configured to:
determining the special effect category to which the visual special effect belongs, and acquiring the position parameter of the visual special effect according to the special effect category to which the visual special effect belongs.
In an alternative embodiment, the special effect adding unit 152 is further configured to:
determining a position of a face in an image frame;
in a region other than the face in the image frame, a position whose distance from the position of the face satisfies a set distance value is selected as a display position.
The live video processing device provided by the embodiment of the application can convert voice data of a user in a live video into text information through a voice recognition technology, recognize target key information in the voice data of the user, determine visual effect information according to the target key information, and add a special effect in the video according to the visual effect information. The method can add special effects in real time in the video playing process, so that the manual workload can be reduced, and the efficiency can be improved.
Based on the same inventive concept as the video playing method shown in fig. 14, the embodiment of the present application further provides a video playing device, which can be disposed in a terminal device. Because the device is a device corresponding to the video playing method in the embodiment of the present application, and the principle of the device for solving the problem is similar to that of the method, the implementation of the device can refer to the implementation of the above method, and repeated details are not repeated.
Fig. 18 shows a schematic structural diagram of a video playing apparatus provided in an embodiment of the present application, and as shown in fig. 18, the video playing apparatus includes: a voice playing unit 181 and a video playing unit 182; wherein:
a voice playing unit 181, configured to play voice data in a video;
the video playing unit 182 is configured to play a target video obtained by adding a visual special effect to the image frame corresponding to the voice data; the image frame corresponding to the voice data comprises an image frame played corresponding to the voice data or an image frame after the voice data is played.
The embodiment of the application also provides electronic equipment based on the same inventive concept as the embodiment of the method and the embodiment of the device. The electronic device may be any electronic device such as a mobile phone, a tablet computer, a Point of sale (POS), a vehicle-mounted computer, a smart wearable device, and a Personal Computer (PC), and may also be the terminal device 300 shown in fig. 1 by way of example.
Fig. 19 shows a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 19, the electronic apparatus includes: radio Frequency (RF) circuit 310, memory 320, input unit 330, display unit 340, sensor 350, audio circuit 360, wireless fidelity (WiFi) module 370, processor 380, and the like. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 19 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the electronic device in detail with reference to fig. 19:
the RF circuit 310 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, receives downlink information of a base station and then processes the received downlink information to the processor 380; in addition, the data for designing uplink is transmitted to the base station.
The memory 320 may be used to store software programs and modules, such as the program instructions/modules corresponding to the video processing method and apparatus in the embodiment of the present application, and the processor 380 executes various functional applications and data processing of the electronic device, such as the video processing method provided in the embodiment of the present application, by running the software programs and modules stored in the memory 320. The memory 320 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program of at least one application, and the like; the storage data area may store data (such as visual effect information and special effect identifiers) created according to the use of the electronic device, and the like. Further, the memory 320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 330 may be used to receive numeric or character information input by a user and generate key signal inputs related to user settings and function control of the terminal.
Optionally, the input unit 330 may include a touch panel 331 and other input devices 332.
The touch panel 331, also referred to as a touch screen, can collect touch operations of a user on or near the touch panel 331 (for example, operations of the user on the touch panel 331 or near the touch panel 331 using any suitable object or accessory such as a finger, a stylus, etc.), and implement corresponding operations according to a preset program, for example, operations of the user clicking a shortcut identifier of a function module, etc. Alternatively, the touch panel 331 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 380, and can receive and execute commands sent by the processor 380. In addition, the touch panel 331 may be implemented in various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave.
Optionally, other input devices 332 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 340 may be used to display information input by a user or interface information presented to the user, and various menus of the electronic device. The display unit 340 is a display system of the terminal device, and is configured to present an interface, such as a display desktop, an operation interface of an application, or an operation interface of a live application.
The display unit 340 may include a display panel 341. Alternatively, the Display panel 341 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
Further, the touch panel 331 can cover the display panel 341, and when the touch panel 331 detects a touch operation on or near the touch panel 331, the touch panel is transmitted to the processor 380 to determine the type of the touch event, and then the processor 380 provides a corresponding interface output on the display panel 341 according to the type of the touch event.
Although in fig. 19, the touch panel 331 and the display panel 341 are two independent components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 331 and the display panel 341 may be integrated to implement the input and output functions of the terminal.
The electronic device may also include at least one sensor 350, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 341 according to the brightness of ambient light, and a proximity sensor that may turn off the backlight of the display panel 341 when the electronic device is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the electronic device, vibration recognition related functions (such as pedometer, tapping) and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured to the electronic device, detailed descriptions thereof are omitted.
The audio circuit 360, speaker 361 and microphone 362 may provide an audio interface between a user and the electronic device. The audio circuit 360 may transmit the electrical signal converted from the received audio data to the speaker 361, where it is converted into a sound signal and output; on the other hand, the microphone 362 converts the collected sound signal into an electrical signal, which is received by the audio circuit 360 and converted into audio data; the audio data is then processed by the processor 380 and transmitted, for example, to another electronic device via the RF circuit 310, or output to the memory 320 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the electronic device can help the user send and receive e-mail, browse web pages, access streaming media, etc. through the WiFi module 370, and it provides wireless broadband internet access for the user. Although fig. 19 shows the WiFi module 370, it is understood that it does not belong to the essential constitution of the electronic device, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 380 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 320 and calling data stored in the memory 320, thereby performing overall monitoring of the electronic device. Optionally, processor 380 may include one or more processing units; optionally, the processor 380 may integrate an application processor and a modem processor, wherein the application processor mainly processes software programs such as an operating system, applications, and functional modules inside the applications, for example, a video processing method provided in the embodiment of the present application. The modem processor handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 380.
It will be appreciated that the configuration shown in fig. 19 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 19 or have a different configuration than shown in fig. 19. The components shown in fig. 19 may be implemented in hardware, software, or a combination thereof.
The embodiment of the application also provides electronic equipment based on the same inventive concept as the embodiment of the method and the embodiment of the device. The electronic device may be a server, such as server 100 shown in FIG. 1. In this embodiment, the electronic device may be configured as shown in fig. 20, and include a memory 101, a communication module 103, and one or more processors 102.
A memory 101 for storing a computer program for execution by the processor 102. The memory 101 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The processor 102 may include one or more central processing units (CPUs), or may be a digital processing unit or the like. The processor 102 is configured to implement the above-described video processing method when calling the computer program stored in the memory 101.
The communication module 103 is used for communicating with the terminal device to obtain voice data.
The embodiments of the present application do not limit the specific connection medium among the memory 101, the communication module 103, and the processor 102. In fig. 20, the memory 101 and the processor 102 are connected by a bus 104, which is represented by a thick line; the connection manner between the other components is merely illustrative and is not limiting. The bus 104 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 20, but this does not mean that there is only one bus or one type of bus.
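For illustration only, the following Python sketch shows one way the arrangement of fig. 20 could be wired together: a plain TCP socket stands in for the communication module 103, an in-memory dictionary for data held in the memory 101, and a handler function for the program run by the processor 102. All names are hypothetical and are not part of the disclosure.

    import socket

    class VideoEffectServer:
        def __init__(self, host="0.0.0.0", port=9000):
            # "communication module 103": a plain TCP socket used to receive voice data
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            self.sock.bind((host, port))
            self.sock.listen(1)
            self.effect_db = {}  # "memory 101": stored visual-effect data

        def serve_once(self, handler):
            # "processor 102": runs the stored program (handler) on the received voice data
            conn, _ = self.sock.accept()
            with conn:
                voice_data = conn.recv(65536)
                conn.sendall(handler(voice_data, self.effect_db))  # handler returns bytes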
According to an aspect of the present application, a computer program product or a computer program is provided, which comprises computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the video processing method in the embodiments described above.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the scope of protection of the present application.

Claims (15)

1. A video processing method, comprising:
acquiring a video to be processed;
if target key information is identified from the voice data of the video, generating a visual special effect according to the target key information;
adding the visual special effect into the image frame corresponding to the voice data to obtain a target video; wherein the image frame corresponding to the voice data comprises an image frame played while the voice data is played, or an image frame played after the voice data has been played.
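By way of illustration only, the following Python sketch gives one possible reading of the method of claim 1. The helpers transcribe(), build_effect() and overlay() are hypothetical placeholders for speech recognition, effect generation and compositing; this is not asserted to be the claimed implementation.

    def process_video(frames, voice_segments, keywords, transcribe, build_effect, overlay):
        # frames: list of images; voice_segments: list of (start_idx, end_idx, audio) tuples
        target_frames = list(frames)
        for start_idx, end_idx, audio in voice_segments:
            text = transcribe(audio)                      # recognize the voice data
            hits = [kw for kw in keywords if kw in text]  # target key information
            for kw in hits:
                effect = build_effect(kw)                 # generate a visual special effect
                # add the effect to the image frames corresponding to this voice segment
                for i in range(start_idx, min(end_idx + 1, len(target_frames))):
                    target_frames[i] = overlay(target_frames[i], effect)
        return target_frames                              # frames of the target video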
2. The method of claim 1, wherein the target key information is a keyword, and generating a visual effect according to the target key information comprises:
generating, according to a set text display format, a visual special effect containing the text of the keyword.
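As an illustrative sketch of claim 2, the snippet below renders the keyword text in a set display format as a transparent overlay. It assumes the Pillow imaging library; the colour and font chosen here are placeholders, not the disclosed format.

    from PIL import Image, ImageDraw, ImageFont

    def build_text_effect(keyword, width=400, height=120):
        fmt = {"fill": (255, 215, 0, 255), "font": ImageFont.load_default()}  # set text display format
        overlay = Image.new("RGBA", (width, height), (0, 0, 0, 0))  # fully transparent canvas
        draw = ImageDraw.Draw(overlay)
        draw.text((10, 40), keyword, font=fmt["font"], fill=fmt["fill"])
        return overlay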
3. The method according to claim 1, wherein adding the visual special effect in real time to the image frame corresponding to the voice data specifically comprises:
determining the display position of the visual special effect in an image frame corresponding to the voice data;
adding the visual special effect at the display location in the image frame.
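A minimal sketch of claim 3, assuming Pillow RGBA images: the effect overlay is alpha-blended into the frame at the determined display position. It is not the disclosed implementation.

    def add_effect_at(frame, effect, position):
        # frame, effect: PIL RGBA images; position: (x, y) top-left corner of the display position
        out = frame.copy()
        out.alpha_composite(effect, dest=position)  # blend the visual special effect into the frame
        return out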
4. The method of claim 1, wherein generating a visual effect based on target key information if the target key information is identified from the voice data of the video comprises:
converting voice data of the video into text information;
when the text information comprises the target key information, generating a visual special effect according to the target key information.
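Purely as an illustration of claim 4, the sketch below uses the third-party SpeechRecognition package to convert voice data into text and then checks for keywords; the application does not name any particular speech-recognition engine, so this choice is an assumption.

    import speech_recognition as sr

    def detect_keywords(wav_path, keywords):
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)         # read the voice data
        text = recognizer.recognize_google(audio)     # convert voice data into text information
        return [kw for kw in keywords if kw in text]  # target key information contained in the text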
5. The method according to any one of claims 1 to 4, wherein the video is a live video, and before generating a visual special effect according to target key information if the target key information is identified from the voice data of the video, the method further comprises:
receiving a visual effect adding request sent by a target client in a live broadcast process, wherein the visual effect adding request carries a visual effect theme selected by a target user;
acquiring a visual effect database corresponding to the visual effect theme; and
generating a visual special effect according to the target key information specifically comprises: acquiring the visual special effect corresponding to the target key information from the visual effect database corresponding to the visual effect theme.
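The following is a minimal sketch of the theme-keyed lookup described in claims 5 and 6; the theme names and database entries are placeholders, not disclosed data.

    EFFECT_DATABASES = {
        "birthday": {"cake": "cake_overlay.png", "gift": "gift_overlay.png"},
        "sports": {"goal": "fireworks_overlay.png"},
    }

    def get_effect(theme, keyword):
        database = EFFECT_DATABASES.get(theme, {})  # visual effect database for the selected theme
        return database.get(keyword)                # visual effect for the target key information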
6. The method according to any one of claims 1 to 4, wherein the video is a live video, and before generating a visual special effect according to target key information if the target key information is identified from the voice data of the video, the method further comprises:
in response to an operation of enabling visual effect addition, displaying, in a live video playing interface, theme selection controls corresponding to visual effect themes of different video scenes;
in response to a trigger operation on the theme selection control corresponding to any visual effect theme, acquiring the visual effect database corresponding to that visual effect theme; and
generating a visual special effect according to the target key information specifically comprises: acquiring visual effect information corresponding to the target key information from the visual effect database corresponding to the visual effect theme, and generating the visual special effect according to the visual effect information.
7. The method according to claim 6, wherein acquiring the visual effect information corresponding to the target key information from the visual effect database corresponding to the visual effect theme comprises:
if the visual effect database contains a plurality of pieces of visual effect information corresponding to the target key information, displaying an identifier of each piece of visual effect information in the live video playing interface;
in response to a trigger operation on the identifier of any piece of visual effect information, selecting the visual effect information corresponding to the trigger operation as the visual effect information corresponding to the target key information.
8. The method of claim 3, wherein determining a display position of the visual effect in an image frame corresponding to the voice data comprises:
determining a position of a face in the image frame;
acquiring a position parameter of the visual special effect, wherein the position parameter indicates the display position of the visual special effect relative to the face in the image frame;
determining the display position of the visual special effect in the image frame according to the position parameter of the visual special effect and the position of the face in the image frame.
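As an illustrative sketch of claims 8 and 9, the snippet below detects a face, looks up a position parameter by special-effect category, and derives the display position relative to the face. OpenCV Haar-cascade detection and the category/offset values are assumptions, not part of the disclosure.

    import cv2

    # position parameters by special-effect category: offsets relative to the face box
    POSITION_PARAMS = {"headwear": (0.0, -0.6), "caption": (0.0, 1.2)}

    def display_position(frame_bgr, category):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        faces = detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
        if len(faces) == 0:
            return None
        x, y, w, h = faces[0]                               # position of the face in the image frame
        dx, dy = POSITION_PARAMS.get(category, (0.0, 0.0))  # position parameter of the visual effect
        return int(x + dx * w), int(y + dy * h)             # display position in the image frame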
9. The method of claim 8, wherein obtaining the location parameter of the visual effect comprises:
determining a special effect category to which the visual special effect belongs, and acquiring a position parameter of the visual special effect according to the special effect category to which the visual special effect belongs.
10. The method of claim 3, wherein determining a display position of the visual effect in an image frame corresponding to the voice data comprises:
determining a position of a face in the image frame;
selecting, as the display position, a position in a region of the image frame other than the face whose distance from the position of the face satisfies a set distance value.
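A possible reading of claim 10, for illustration only: scan candidate positions outside the face region and keep one whose distance from the face centre meets a set value. The grid step and distance value are placeholders.

    import math

    def position_outside_face(frame_size, face_box, set_distance=200, step=20):
        frame_w, frame_h = frame_size
        fx, fy, fw, fh = face_box
        cx, cy = fx + fw / 2, fy + fh / 2                   # centre of the face
        for y in range(0, frame_h, step):
            for x in range(0, frame_w, step):
                inside_face = fx <= x <= fx + fw and fy <= y <= fy + fh
                if not inside_face and math.dist((x, y), (cx, cy)) >= set_distance:
                    return x, y                             # display position outside the face region
        return None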
11. A video playback method, comprising:
playing voice data in the video;
playing a target video obtained by adding a special effect to the video; wherein the special effect is added to the video while the video is being recorded or played, is determined according to target key information in the voice data, and is added to the image frame corresponding to the voice data; the image frame corresponding to the voice data comprises an image frame played while the voice data is played, or an image frame played after the voice data has been played.
12. A video processing apparatus, comprising:
the special effect generating unit is used for acquiring a video to be processed, and generating a visual special effect according to target key information if the target key information is identified from the voice data of the video;
the special effect adding unit is used for adding the visual special effect to the image frame corresponding to the voice data to obtain a target video; wherein the image frame corresponding to the voice data comprises an image frame played while the voice data is played, or an image frame played after the voice data has been played.
13. A video playback apparatus, comprising:
the voice playing unit is used for playing voice data in the video;
a video playing unit configured to play a target video obtained by adding a special effect to the video; wherein the special effect is added to the video while the video is being recorded or played, is determined according to target key information in the voice data, and is added to the image frame corresponding to the voice data; the image frame corresponding to the voice data comprises an image frame played while the voice data is played, or an image frame played after the voice data has been played.
14. A computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 10 or the method of claim 11.
15. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, the computer program, when executed by the processor, implementing the method of any of claims 1-10 or the method of claim 11.
CN202010896672.0A 2020-08-31 2020-08-31 Video processing method, video playing method, video processing device, video playing device, storage medium and equipment Pending CN112040263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010896672.0A CN112040263A (en) 2020-08-31 2020-08-31 Video processing method, video playing method, video processing device, video playing device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN112040263A true CN112040263A (en) 2020-12-04

Family

ID=73586029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010896672.0A Pending CN112040263A (en) 2020-08-31 2020-08-31 Video processing method, video playing method, video processing device, video playing device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112040263A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004356704A (en) * 2003-05-27 2004-12-16 Nec Fielding Ltd Maintenance supporting system, maintenance supporting server, and maintenance supporting method by moving image distribution, and program
CN104703043A (en) * 2015-03-26 2015-06-10 努比亚技术有限公司 Video special effect adding method and device
US20200007926A1 (en) * 2018-06-29 2020-01-02 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for playing video
CN109215655A (en) * 2018-10-30 2019-01-15 维沃移动通信有限公司 The method and mobile terminal of text are added in video
US20200234478A1 (en) * 2019-01-22 2020-07-23 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and Apparatus for Processing Information
CN111464827A (en) * 2020-04-20 2020-07-28 玉环智寻信息技术有限公司 Data processing method and device, computing equipment and storage medium

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112770173A (en) * 2021-01-28 2021-05-07 腾讯科技(深圳)有限公司 Live broadcast picture processing method and device, computer equipment and storage medium
CN112911324A (en) * 2021-01-29 2021-06-04 北京达佳互联信息技术有限公司 Content display method and device for live broadcast room, server and storage medium
CN112911324B (en) * 2021-01-29 2022-10-28 北京达佳互联信息技术有限公司 Content display method and device for live broadcast room, server and storage medium
CN113068072A (en) * 2021-03-30 2021-07-02 北京达佳互联信息技术有限公司 Video playing method, device and equipment
CN113312516A (en) * 2021-05-21 2021-08-27 北京达佳互联信息技术有限公司 Video processing method and related device
CN113312516B (en) * 2021-05-21 2023-11-21 北京达佳互联信息技术有限公司 Video processing method and related device
CN113382275B (en) * 2021-06-07 2023-03-07 广州博冠信息科技有限公司 Live broadcast data generation method and device, storage medium and electronic equipment
CN113382275A (en) * 2021-06-07 2021-09-10 广州博冠信息科技有限公司 Live broadcast data generation method and device, storage medium and electronic equipment
CN113453030A (en) * 2021-06-11 2021-09-28 广州方硅信息技术有限公司 Audio interaction method and device in live broadcast, computer equipment and storage medium
CN113645483A (en) * 2021-08-20 2021-11-12 珠海九松科技有限公司 Cross-platform automatic video editing method
CN114374572A (en) * 2021-12-30 2022-04-19 广州趣丸网络科技有限公司 Voice information processing method and device
CN114374572B (en) * 2021-12-30 2023-12-01 广州趣丸网络科技有限公司 Voice information processing method and device
CN114697568A (en) * 2022-04-07 2022-07-01 脸萌有限公司 Special effect video determination method and device, electronic equipment and storage medium
CN114697568B (en) * 2022-04-07 2024-02-20 脸萌有限公司 Special effect video determining method and device, electronic equipment and storage medium
CN114760492A (en) * 2022-04-22 2022-07-15 咪咕视讯科技有限公司 Live broadcast special effect generation method, device and system and computer readable storage medium
CN114760492B (en) * 2022-04-22 2023-10-20 咪咕视讯科技有限公司 Live special effect generation method, device and system and computer readable storage medium
CN115022712A (en) * 2022-05-20 2022-09-06 北京百度网讯科技有限公司 Video processing method, device, equipment and storage medium
CN115022712B (en) * 2022-05-20 2023-12-29 北京百度网讯科技有限公司 Video processing method, device, equipment and storage medium
CN115720279B (en) * 2022-11-18 2023-09-15 杭州面朝信息科技有限公司 Method and device for showing arbitrary special effects in live broadcast scene
CN115720279A (en) * 2022-11-18 2023-02-28 杭州面朝信息科技有限公司 Method and device for displaying any special effect in live scene
CN116017093B (en) * 2022-12-15 2023-08-11 广州迅控电子科技有限公司 Video environment simulation method and system
CN116017093A (en) * 2022-12-15 2023-04-25 广州迅控电子科技有限公司 Video environment simulation method and system

Similar Documents

Publication Publication Date Title
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
CN110418208B (en) Subtitle determining method and device based on artificial intelligence
JP7312853B2 (en) AI-BASED VOICE-DRIVEN ANIMATION METHOD AND APPARATUS, DEVICE AND COMPUTER PROGRAM
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
US20210225380A1 (en) Voiceprint recognition method and apparatus
CN107193841B (en) Method and device for accelerating playing, transmitting and storing of media file
CN107155138A (en) Video playback jump method, equipment and computer-readable recording medium
CN107040452B (en) Information processing method and device and computer readable storage medium
CN111383631B (en) Voice interaction method, device and system
CN114401417B (en) Live stream object tracking method, device, equipment and medium thereof
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN111491123A (en) Video background processing method and device and electronic equipment
CN107291704A (en) Treating method and apparatus, the device for processing
CN113392273A (en) Video playing method and device, computer equipment and storage medium
CN111586469B (en) Bullet screen display method and device and electronic equipment
CN111143614A (en) Video display method and electronic equipment
CN110379406A (en) Voice remark conversion method, system, medium and electronic equipment
CN113992972A (en) Subtitle display method and device, electronic equipment and readable storage medium
CN116229311B (en) Video processing method, device and storage medium
US20230030502A1 (en) Information play control method and apparatus, electronic device, computer-readable storage medium and computer program product
CN110990632A (en) Video processing method and device
CN115798459A (en) Audio processing method and device, storage medium and electronic equipment
CN115167733A (en) Method and device for displaying live broadcast resources, electronic equipment and storage medium
CN111858856A (en) Multi-round search type chatting method and display equipment

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40035378)
SE01 Entry into force of request for substantive examination