CN118052912A - Video generation method, device, computer equipment and storage medium - Google Patents

Video generation method, device, computer equipment and storage medium

Info

Publication number
CN118052912A
Authority
CN
China
Prior art keywords
expression
broadcasting
emotion
broadcast
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211399961.5A
Other languages
Chinese (zh)
Inventor
黄晗
吴高
李志锋
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211399961.5A priority Critical patent/CN118052912A/en
Publication of CN118052912A publication Critical patent/CN118052912A/en
Pending legal-status Critical Current

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiment of the application discloses a video generation method, a video generation device, computer equipment and a storage medium, belonging to the field of computer technology. The method comprises the following steps: acquiring a broadcast text to be broadcast by a virtual object; performing emotion analysis on the broadcast text to obtain the emotion type of the broadcast text; determining, based on the emotion type of the broadcast text, expression data matched with the broadcast text, wherein the emotion expressed by the expression indicated by the expression data belongs to the emotion type; driving the virtual object based on the expression data to obtain a broadcast picture including the virtual object, so that the expression of the virtual object in the broadcast picture is the expression indicated by the expression data; and generating a broadcast video based on the broadcast audio corresponding to the broadcast text and the broadcast picture, wherein the broadcast video comprises the broadcast audio and the broadcast picture. The application saves the time spent on manually shooting videos, achieves the effect of the virtual object reading the broadcast text vividly and expressively, and improves the efficiency and realism of broadcast video generation.

Description

Video generation method, device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a video generation method, a video generation device, computer equipment and a storage medium.
Background
With the rapid development of computer technology, video, as a medium for expressing information, is increasingly widely used in people's daily life. In scenes such as advertisement broadcasting, online teaching and news broadcasting, videos can be produced to display and explain related information.
In the related art, a real person reads the broadcast text, and the reading process is recorded to obtain the broadcast video corresponding to the broadcast text. However, producing videos manually requires a great deal of manpower and time, so the efficiency of producing broadcast videos is low.
Disclosure of Invention
The embodiment of the application provides a video generation method, a video generation device, computer equipment and a storage medium, which can improve the efficiency of generating broadcast videos. The technical scheme is as follows:
in one aspect, a video generation method is provided, the method including:
Acquiring a broadcasting text to be broadcasted by a virtual object;
carrying out emotion analysis on the broadcasting text to obtain an emotion type of the broadcasting text;
Determining expression data matched with the broadcasting text based on the emotion type of the broadcasting text, wherein emotion expressed by the expression indicated by the expression data belongs to the emotion type;
Driving the virtual object based on the expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data;
Generating a broadcast video based on the broadcast audio corresponding to the broadcast text and the broadcast picture, wherein the broadcast video comprises the broadcast audio and the broadcast picture.
Optionally, the plurality of emotion word banks include a positive emotion word bank and a negative emotion word bank; the determining, based on the plurality of emotion word banks, emotion types of the plurality of words in the broadcast sentence, and determining, based on the number of words belonging to each emotion type in the broadcast sentence, emotion types of the broadcast sentence includes:
under the condition that the words are inquired in the positive emotion word library, determining that the words belong to a positive emotion type; under the condition that the words are inquired in the negative emotion word library, determining that the words belong to a negative emotion type;
determining the emotion type with the largest number of the words as the emotion type of the broadcasting statement under the condition that the broadcasting statement does not comprise negative words;
and determining the emotion type with the least quantity of the words as the emotion type of the broadcasting statement under the condition that the broadcasting statement comprises negative words.
In another aspect, there is provided a video generating apparatus, the apparatus including:
The text acquisition module is used for acquiring a broadcasting text to be broadcasted by the virtual object;
The emotion analysis module is used for carrying out emotion analysis on the broadcasting text to obtain an emotion type of the broadcasting text;
The expression determining module is used for determining expression data matched with the broadcasting text based on the emotion type of the broadcasting text, and the emotion expressed by the expression indicated by the expression data belongs to the emotion type;
the driving module is used for driving the virtual object based on the expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data;
the video generation module is used for generating a broadcast video based on the broadcast audio corresponding to the broadcast text and the broadcast picture, wherein the broadcast video comprises the broadcast audio and the broadcast picture.
Optionally, the broadcast text includes a plurality of broadcast sentences; the emotion analysis module is used for:
Carrying out emotion analysis on the broadcast statement to obtain an emotion type of the broadcast statement;
under the condition that keywords are detected in the broadcast sentences, determining emotion types to which the keywords belong;
and determining the emotion type of the broadcasting text based on the emotion type of the broadcasting statement and the emotion type of the keyword.
Optionally, the emotion analysis module is configured to implement at least one of:
Determining emotion types of a plurality of words in the broadcasting statement based on a plurality of emotion word banks, and determining emotion types of the broadcasting statement based on the number of words belonging to each emotion type in the broadcasting statement; each emotion word bank corresponds to one emotion type, and emotion expressed by words in the emotion word bank belongs to the emotion type corresponding to the emotion word bank;
Invoking an emotion classification model, performing emotion analysis on the broadcast statement to obtain prediction probabilities of multiple emotion types, and determining the emotion type of the broadcast statement based on the prediction probabilities of the multiple emotion types; the prediction probability of the emotion type represents the probability that the broadcast statement belongs to the emotion type.
Optionally, the plurality of emotion word banks include a positive emotion word bank and a negative emotion word bank; the emotion analysis module is used for:
under the condition that the words are inquired in the positive emotion word library, determining that the words belong to a positive emotion type; under the condition that the words are inquired in the negative emotion word library, determining that the words belong to a negative emotion type;
determining the emotion type with the largest number of the words as the emotion type of the broadcasting statement under the condition that the broadcasting statement does not comprise negative words;
and determining the emotion type with the least quantity of the words as the emotion type of the broadcasting statement under the condition that the broadcasting statement comprises negative words.
Optionally, the apparatus further comprises:
The frequency determining module is used for determining a first appearance frequency and a second appearance frequency corresponding to a plurality of words when the broadcasting statement comprises the words, wherein the first appearance frequency represents the appearance frequency of the words in the broadcasting text, and the second appearance frequency represents the appearance frequency of the text containing the words in a text database;
A weight determining module, configured to determine weights of the plurality of words based on a first occurrence frequency and the second occurrence frequency corresponding to the plurality of words, where the weights are positively related to the first occurrence frequency and negatively related to the second occurrence frequency;
And the keyword determining module is used for determining the word with the largest weight in the broadcasting statement as the keyword in the broadcasting statement.
Optionally, the expression data includes expression sub-data matched with each word in the broadcast text; the expression determining module is used for:
under the condition that the keyword is not detected in the broadcasting statement, determining expression data corresponding to the emotion type of the broadcasting statement as expression sub-data matched with each word in the broadcasting statement;
and under the condition that the keyword is detected in the broadcasting statement, determining expression sub-data corresponding to the emotion type to which the keyword belongs as expression sub-data matched with the keyword, and determining expression sub-data corresponding to the emotion type to which the broadcasting statement belongs as expression sub-data matched with a non-keyword, wherein the non-keyword refers to other words except the keyword in the broadcasting statement.
Optionally, the expression data is composed of a plurality of expression parameters, and the expression parameters are used for controlling the change of the facial key points; the apparatus further comprises:
a parameter adding module, configured to add an eye expression parameter to the expression data, where the eye expression parameter includes a blink parameter or an eyeball motion parameter, the blink parameter indicates an expression of blinking, and the eyeball motion parameter indicates an expression of rotating an eyeball;
the driving module is used for driving the virtual object based on the expression data added with the eye expression parameters.
Optionally, the parameter adding module is configured to implement any one of the following:
Determining a first time period according to a target frequency in the playing time period of the expression data, and adding the blink parameters into the expression data in the first time period;
determining a second time period in the playing time period, and adding a first eye movement parameter into expression data of the second time period, wherein the second time period refers to a time period when the virtual object acts;
And randomly determining a third time period in the playing time period, and adding a second eyeball motion parameter into the expression data of the third time period, wherein the amplitude of eyeball rotation indicated by the first eyeball motion parameter is larger than the amplitude indicated by the second eyeball motion parameter.
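The eye-parameter options above can be illustrated with a minimal Python sketch. It assumes the expression data is a list of per-frame parameter dictionaries; the parameter names "blink" and "eye_rotation", the 25 fps frame rate, the blink frequency and the amplitude values are illustrative assumptions, not values taken from the patent.

```python
import random

FPS = 25  # assumed frame rate of the expression data

def add_eye_parameters(frames, blink_every_s=3.0, action_windows=None):
    """Add blink and eyeball-motion parameters to per-frame expression data.

    frames: list of dicts, one dict of expression parameters per frame.
    blink_every_s: target blink frequency (one blink roughly every N seconds).
    action_windows: list of (start_frame, end_frame) where the virtual
        object performs an action; larger eyeball motion is added there.
    """
    # 1) First time periods: insert a short blink at the target frequency.
    blink_step = int(blink_every_s * FPS)
    for start in range(0, len(frames), blink_step):
        for i in range(start, min(start + 3, len(frames))):  # ~3-frame blink
            frames[i]["blink"] = 1.0

    # 2) Second time periods: larger eyeball motion while the object acts.
    for start, end in (action_windows or []):
        for i in range(start, min(end, len(frames))):
            frames[i]["eye_rotation"] = 0.6   # first (larger) motion amplitude

    # 3) Third time period: a randomly chosen window with smaller eyeball motion.
    if len(frames) > FPS:
        start = random.randrange(0, len(frames) - FPS)
        for i in range(start, start + FPS):
            frames[i].setdefault("eye_rotation", 0.2)  # second (smaller) amplitude
    return frames
```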
Optionally, the expression data includes expression sub-data matched with each word in the broadcast text; the apparatus further comprises:
The data adjustment module is used for acquiring a plurality of frames of expression sub-data based on the duration of the playing time period of the expression sub-data;
the data adjustment module is further configured to adjust a plurality of frames of expression sub-data based on a weight parameter of the plurality of frames of expression sub-data, where the weight parameter indicates a variation amplitude of expression;
the driving module is used for driving the virtual object based on the adjusted expression data.
Optionally, the apparatus further comprises:
A parameter setting module, configured to determine a start time period, a hold time period, and a termination time period in the play time period;
The parameter setting module is further configured to sequentially increment the weight parameters of the expression sub-data of the multiple frames of the initial time period from a first value to a second value according to a time sequence, where the second value is greater than the first value;
The parameter setting module is further configured to set a weight parameter of the expression sub-data for a plurality of frames of the hold period to the second value;
the parameter setting module is further configured to set, according to a time sequence, a weight parameter of the expression sub-data of the multiple frames in the termination period to decrease from the second value to the first value in sequence.
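As a rough illustration of the start/hold/termination weighting described by the parameter setting module, the sketch below ramps a per-frame weight from a first value up to a second value, holds it, and ramps it back down. The function name, the frame counts and the 0.0/1.0 default values are assumptions for illustration only.

```python
def expression_weights(n_start, n_hold, n_end, first_value=0.0, second_value=1.0):
    """Per-frame weight parameters for one piece of expression sub-data.

    The weight rises from first_value to second_value over the start period,
    stays at second_value during the hold period, and falls back to
    first_value over the termination period.
    """
    rise = [first_value + (second_value - first_value) * (i + 1) / n_start
            for i in range(n_start)]
    hold = [second_value] * n_hold
    fall = [second_value - (second_value - first_value) * (i + 1) / n_end
            for i in range(n_end)]
    return rise + hold + fall

# Example: 5 rising frames, 10 held frames, 5 falling frames.
weights = expression_weights(5, 10, 5)
```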
Optionally, the expression data includes expression sub-data matched with each word in the broadcast text, and the broadcast audio includes an audio segment corresponding to each word in the broadcast text; the apparatus further comprises:
The data fusion module is used for determining the playing time period of the expression sub-data matched with a word based on a target duration and the playing time period of the audio segment corresponding to the word, wherein the duration of the playing time period of the expression sub-data is equal to the sum of the target duration and the duration of the playing time period of the audio segment, and the starting playing time point of the playing time period is the starting playing time point of the audio segment;
the data fusion module is further used for determining an overlapping time period in the playing time period of the expression sub-data matched with any two adjacent words;
And the data fusion module is further used for respectively fusing the two expression sub-data of the same frame in the overlapping time period, and determining the expression sub-data of the multiple frames obtained by fusion as the expression sub-data of the overlapping time period.
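A minimal sketch of the overlap handling performed by the data fusion module: each word's expression sub-data covers a playing period that starts with its audio segment and lasts longer than it, so adjacent words overlap, and frames falling in the overlap are blended. The per-frame dictionary layout and the equal-weight average are assumptions.

```python
def fuse_overlapping(frames_a, period_a, frames_b, period_b):
    """Blend the frames of two adjacent words inside their overlapping period.

    period_a / period_b: (start_frame, end_frame) of each word's playing period.
    frames_a / frames_b: per-frame dicts of expression parameters covering
        those periods. Returns the fused frames for the overlapping period.
    """
    overlap_start = max(period_a[0], period_b[0])
    overlap_end = min(period_a[1], period_b[1])
    fused = []
    for f in range(overlap_start, overlap_end):
        pa = frames_a[f - period_a[0]]
        pb = frames_b[f - period_b[0]]
        # Simple average of the two sub-data frames for the same time point.
        keys = set(pa) | set(pb)
        fused.append({k: 0.5 * pa.get(k, 0.0) + 0.5 * pb.get(k, 0.0) for k in keys})
    return fused
```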
Optionally, the driving module is configured to:
Driving the virtual object based on the expression data to obtain a multi-frame broadcasting image comprising the virtual object, wherein the expression of the virtual object in the multi-frame broadcasting image is the expression indicated by the expression data;
And respectively fusing the multi-frame broadcasting images with target images to obtain the broadcasting picture, wherein the target images are used as the background of the broadcasting picture.
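The fusion of the rendered broadcast images with a target background image can be sketched with Pillow compositing; the file paths, the fixed paste position and the assumption that the rendered frames carry an alpha channel are illustrative, not taken from the patent.

```python
from PIL import Image

def composite_frames(frame_paths, background_path, out_dir):
    """Paste each rendered virtual-object frame (RGBA) over the target image."""
    background = Image.open(background_path).convert("RGBA")
    for i, path in enumerate(frame_paths):
        frame = Image.open(path).convert("RGBA")
        canvas = background.copy()
        # Use the frame's own alpha channel as the paste mask.
        canvas.paste(frame, (0, 0), frame)
        canvas.convert("RGB").save(f"{out_dir}/broadcast_{i:04d}.png")
```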
In another aspect, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one computer program, the at least one computer program loaded and executed by the processor to implement operations performed by the video generation method as described in the above aspects.
In another aspect, there is provided a computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement the operations performed by the video generation method as described in the above aspects.
In another aspect, a computer program product is provided, comprising a computer program loaded and executed by a processor to implement the operations performed by the video generation method as described in the above aspects.
According to the scheme provided by the embodiment of the application, in a scene in which a virtual object broadcasts a broadcast text, the expression data matched with the broadcast text is determined according to the emotion type of the broadcast text, so that the expression indicated by the expression data matches the emotion expressed by the broadcast text. The virtual object is then driven based on the expression data, so that the emotion expressed by the expression made by the virtual object is consistent with the emotion expressed by the broadcast text, which ensures that the expression of the virtual object in the generated broadcast video is consistent with the semantics of the broadcast audio. Therefore, on the basis of saving the time spent on manually shooting videos, the effect of the virtual object reading the broadcast text vividly and expressively is achieved, and the efficiency and realism of broadcast video generation are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
fig. 2 is a flowchart of a video generating method according to an embodiment of the present application;
FIG. 3 is a flowchart of another video generation method according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for determining emotion type according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an expression parameter according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a facial expression provided by an embodiment of the present application;
FIG. 7 is a flowchart of yet another video generation method provided by an embodiment of the present application;
FIG. 8 is a flowchart of yet another video generation method according to an embodiment of the present application;
FIG. 9 is a flowchart of yet another video generation method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an expression parameter variation according to an embodiment of the present application;
FIG. 11 is a flowchart of yet another video generation method according to an embodiment of the present application;
FIG. 12 is a flowchart of an advertisement video generation method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of an advertising video provided by an embodiment of the present application;
fig. 14 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application;
Fig. 15 is a schematic structural diagram of another video generating apparatus according to an embodiment of the present application;
Fig. 16 is a schematic structural diagram of a terminal according to an embodiment of the present application;
Fig. 17 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of the embodiments of the present application will be given with reference to the accompanying drawings.
It is to be understood that the terms "first," "second," and the like, as used herein, may be used to describe various concepts, but are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first frequency of occurrence may be referred to as a second frequency of occurrence, and similarly, a second frequency of occurrence may be referred to as a first frequency of occurrence, without departing from the scope of the application.
Herein, "at least one" refers to one or more; for example, at least one word may be any integer number of words greater than or equal to one, such as one word, two words or three words. "A plurality" means two or more; for example, a plurality of words may be any integer number of words greater than or equal to two, such as two words or three words. "Each" refers to each of at least one; for example, each word refers to each word of the plurality of words, and if the plurality of words is 3 words, each word refers to each of the 3 words.
It can be appreciated that the embodiments of the present application involve user-related data such as broadcast text and virtual objects. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling the machines to have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, autonomous driving, intelligent transportation and other directions.
Machine learning (Machine Learning, ML) is a multi-domain interdisciplinary subject involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and a fundamental way to give computers intelligence, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. The research in this field involves natural language, i.e. the language people use daily, so it has a close relationship with linguistic research. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
The video generation method provided by the embodiment of the application is explained below based on an artificial intelligence technology and a natural language processing technology.
The video generation method provided by the embodiment of the application can be used in computer equipment. Optionally, the computer device is a terminal or a server. Optionally, the server is an independent physical server, or a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. Optionally, the terminal is a smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, etc., but is not limited thereto.
In one possible implementation, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site or on multiple computer devices distributed across multiple sites and interconnected by a communication network, where the multiple computer devices distributed across multiple sites and interconnected by the communication network can constitute a blockchain system. In one possible implementation manner, the computer device for generating the broadcast video in the embodiment of the present application is a node in a blockchain system, where the node can store the generated broadcast video in a blockchain, and then the node or other devices in the blockchain acquire the broadcast video in the blockchain.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application, and referring to fig. 1, the implementation environment includes a computer device 101 and a computer device 102, where the computer device 101 and the computer device 102 may be directly or indirectly connected through a wired or wireless communication manner. The computer device 101 is configured to generate a corresponding broadcast video based on the broadcast text and the virtual object, the computer device 102 is configured to provide the broadcast text to the computer device 101, and the computer device 101 may subsequently provide the generated broadcast video to the computer device 102.
Fig. 2 is a flowchart of a video generating method according to an embodiment of the present application, where the embodiment of the present application is executed by a computer device, and referring to fig. 2, the method includes:
201. the computer device obtains a broadcast text to be broadcast by the virtual object.
The virtual object is used for broadcasting the broadcast text, namely reading the broadcast text. The virtual object is a 2D (two-dimensional) or 3D (three-dimensional) virtual object generated by AI technology, and the virtual object may take any form, for example a realistic form or a cartoon form; for example, the virtual object may be a realistic man or a realistic woman, and the like.
The broadcasting text is a text to be broadcasted by the virtual object, and the content of the broadcasting text is different in different fields. For example, in the e-commerce field, the broadcast text may be an introduction text for a product; for example, in the on-line education field, the broadcasting text can be an explanation text of courseware; for example, in the advertising field, the broadcast text may be an advertisement document that promotes a product, and the content of the broadcast text is not limited in the embodiment of the present application. The broadcast text may be a broadcast text obtained by the computer device according to a user operation, or may be a broadcast text sent by other devices to the computer device.
202. And the computer equipment performs emotion analysis on the broadcast text to obtain the emotion type of the broadcast text.
After the computer equipment obtains the broadcasting text, emotion analysis is carried out on the broadcasting text to determine the semantics of the broadcasting text, emotion expressed by the broadcasting text can be determined according to the semantics of the broadcasting text, and then the emotion type of the broadcasting text is determined, wherein the emotion type of the broadcasting text is the emotion type of the emotion expressed by the broadcasting text. For example, the emotion type includes happiness, anger, surprise, sadness, confusion, and the like.
203. The computer equipment determines expression data matched with the broadcasting text based on the emotion type of the broadcasting text, and the emotion expressed by the expression indicated by the expression data belongs to the emotion type.
After determining the emotion type of the broadcast text, the computer equipment determines expression data matched with the emotion type from a plurality of expression data, wherein the emotion type of the emotion expressed by the expression indicated by the expression data is the emotion type, and the expression data is the expression data matched with the broadcast text.
That is, the emotion expressed by the broadcast text is the same as the emotion expressed by the expression indicated by the expression data. For example, if the emotion expressed by the broadcast text is happy and the smiling expression can express happiness, the expression data matched with the broadcast text is expression data corresponding to the smiling expression.
204. The computer equipment drives the virtual object based on the expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data.
After the computer equipment acquires the expression data matched with the broadcasting text, the virtual object is driven based on the expression data, so that the virtual object makes the expression indicated by the expression data, and the computer equipment can obtain a plurality of image frames comprising the virtual object by driving the virtual object, and the plurality of image frames can form a broadcasting picture. Because the expression of the virtual object is matched with the broadcasting text, namely, the emotion expressed by the expression of the virtual object is the same as the emotion expressed by the broadcasting text, when the virtual object reads the broadcasting text, the expression matched with the broadcasting text can be made, so that the virtual object is more vivid in the process of reading the broadcasting text.
205. The computer equipment generates broadcasting video based on broadcasting audio and broadcasting pictures corresponding to the broadcasting text, wherein the broadcasting video comprises broadcasting audio and broadcasting pictures.
The computer device obtains the broadcast audio corresponding to the broadcast text. The broadcast audio is obtained by processing the broadcast text with a speech synthesis technology, and is the audio for broadcasting the broadcast text.
After the broadcast audio and the broadcast picture are obtained, the broadcast audio and the broadcast picture are fused, which is equivalent to dubbing the broadcast picture with the broadcast audio, so as to obtain a broadcast video including the broadcast audio and the broadcast picture. The content of the broadcast video is the virtual object reading the broadcast text, and since the emotion expressed by the expression of the virtual object is consistent with the emotion expressed by the broadcast text, the effect that the virtual object reads the broadcast text vividly and expressively can be achieved.
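As a rough sketch of this final muxing step, the image sequence produced by driving the virtual object can be combined with the synthesized broadcast audio using ffmpeg; the file names, frame rate and codecs below are assumptions, not values specified by the patent.

```python
import subprocess

def mux_broadcast_video(frames_pattern, audio_path, out_path, fps=25):
    """Combine the broadcast picture (image sequence) with the broadcast audio."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frames_pattern,   # e.g. broadcast_%04d.png
        "-i", audio_path,                               # synthesized broadcast audio
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",
        out_path,
    ], check=True)

# mux_broadcast_video("out/broadcast_%04d.png", "out/broadcast.wav", "broadcast.mp4")
```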
According to the method provided by the embodiment of the application, in a scene in which a virtual object broadcasts a broadcast text, the expression data matched with the broadcast text is determined according to the emotion type of the broadcast text, so that the expression indicated by the expression data matches the emotion expressed by the broadcast text. The virtual object is then driven based on the expression data, so that the emotion expressed by the expression made by the virtual object is consistent with the emotion expressed by the broadcast text, which ensures that the expression of the virtual object in the generated broadcast video is consistent with the semantics of the broadcast audio. Therefore, on the basis of saving the time spent on manually shooting videos, the effect of the virtual object reading the broadcast text vividly and expressively is achieved, and the efficiency and realism of broadcast video generation are improved.
On the basis of the embodiment shown in fig. 2, the broadcast text includes a broadcast sentence, the broadcast sentence includes a keyword, the computer device performs emotion analysis on sentence granularity on the broadcast sentence, and performs emotion analysis on word granularity on the keyword in the broadcast sentence, so as to determine an emotion type to which the broadcast text belongs, and a specific process is described in the embodiment shown in fig. 3 below. Fig. 3 is a flowchart of another video generating method according to an embodiment of the present application, which is executed by a computer device, referring to fig. 3, and includes:
301. The method comprises the steps that a computer device obtains a broadcasting text to be broadcasted by a virtual object, and the broadcasting text comprises a plurality of broadcasting sentences.
The process of step 301 is the same as that of step 201, and will not be described again.
302. And the computer equipment performs emotion analysis on the broadcast statement to obtain the emotion type of the broadcast statement.
After the computer device obtains the broadcast text, it splits the broadcast text into sentences to obtain a plurality of broadcast sentences in the broadcast text, and for each broadcast sentence, the computer device performs emotion analysis on the broadcast sentence in the manner described in this step 302, so as to obtain the emotion type to which each broadcast sentence belongs.
In one possible implementation, emotion analysis includes three-category emotion analysis and multi-category emotion analysis. Optionally, the emotion types of the three-category emotion analysis include positive emotion, negative emotion and neutral emotion, and optionally, the emotion types of the multi-category emotion analysis include happiness, anger, surprise, sadness, confusion and the like. According to the embodiment of the application, any emotion analysis method can be adopted to carry out emotion analysis on the broadcast statement.
Taking three-category emotion analysis as an example, in the advertising field, suppose the broadcast text is "multipurpose pot, stepless temperature control, click to purchase". The broadcast text is divided into three broadcast sentences: "multipurpose pot", "stepless temperature control" and "click to purchase", wherein the broadcast sentence "multipurpose pot" is determined to be a positive emotion, the broadcast sentence "stepless temperature control" is determined to be a neutral emotion, and the broadcast sentence "click to purchase" is determined to be a positive emotion.
In one possible implementation manner, the computer device performs emotion analysis on the broadcast statement to obtain an emotion type to which the broadcast statement belongs, including at least one of the following manners.
The first way is: the computer equipment determines emotion types of a plurality of words in the broadcasting statement based on the emotion word libraries, and determines emotion types of the broadcasting statement based on the number of words belonging to each emotion type in the broadcasting statement.
Wherein, each emotion word bank corresponds to one emotion type, and emotion expressed by words in the emotion word bank belongs to the emotion type corresponding to the emotion word bank. For each word in the broadcast sentence, the computer equipment determines an emotion word stock to which the word belongs, and determines the emotion type corresponding to the emotion word stock as the emotion type to which the word belongs. Alternatively, if the term is not queried in any emotion word library, i.e., the term does not belong to any emotion word library, the computer device determines that the term belongs to neutral emotion. After determining the emotion types of the words, the computer equipment counts the number of the words belonging to each emotion type, and determines the emotion type of the broadcasting statement according to the number of the words belonging to each emotion type.
In the embodiment of the application, the emotion type of the broadcasting statement is considered to depend on the emotion type of the words in the broadcasting statement, so that the emotion type of the broadcasting statement is determined based on the number of the words belonging to each emotion type in the broadcasting statement, which is equivalent to emotion analysis of the whole broadcasting statement from word granularity, and is beneficial to improving the accuracy of emotion analysis of the broadcasting statement.
In one possible implementation, the plurality of emotion word banks includes a positive emotion word bank and a negative emotion word bank. Under the condition that the words are inquired in the positive emotion word library, the computer equipment determines that the words belong to the positive emotion type, and under the condition that the words are inquired in the negative emotion word library, the computer equipment determines that the words belong to the negative emotion type. And the computer equipment determines the emotion type with the largest number of the words as the emotion type of the broadcasting statement under the condition that the broadcasting statement does not comprise negative words. And the computer equipment determines the emotion type with the least quantity of the words as the emotion type of the broadcasting statement under the condition that the broadcasting statement comprises negative words.
Optionally, the plurality of emotion word banks includes a positive emotion word bank and a negative emotion word bank. Taking the advertising field as an example, the positive emotion word bank comprises a plurality of sub word banks, such as a brand sub word bank, a product function sub word bank, a benefit point sub word bank and a positive emotion sub word bank for the advertising scene. The brand sub word bank comprises a plurality of brand words, and the product function sub word bank comprises words introducing product functions, such as "concise", "comfortable" and "intelligent". The benefit point sub word bank includes words describing benefit points, such as "offer", "special price" and "welfare". The positive emotion sub word bank for the advertising scene includes words judged to be positive emotion in the advertising scene, such as "recommended" and "place an order". The negative emotion word bank comprises words judged to be negative emotion in the advertising scene, such as "tired" and "hard". A negative word refers to a negating word, such as "no", "not" or "none".
The emotion types comprise positive emotion and negative emotion, and under the condition that no negative word is included in the broadcasting statement, the emotion type of the broadcasting statement is influenced by the emotion type of the word in the broadcasting statement, and considering that the number of words belonging to which emotion type in the broadcasting statement is more, the possibility of the broadcasting statement belonging to which emotion type is higher, so that the emotion type with the largest number of words belongs to is determined as the emotion type of the broadcasting statement. In the case that the broadcast statement includes a negative word, besides the influence of the emotion type of the word in the broadcast statement on the emotion type of the broadcast statement, the negative word in the broadcast statement may also have a reverse influence on the emotion type of the whole broadcast statement, the number of words belonging to which emotion type in the broadcast statement is greater, and the possibility that the broadcast statement belongs to the reverse emotion type of the emotion type is greater, so that the emotion type with the least number of words is determined as the emotion type of the broadcast statement.
For each word in the broadcast statement, the computer equipment inquires the word in the positive emotion word bank and the negative emotion word bank. For example, the broadcast sentence is a "multipurpose pot", the computer device performs word segmentation on the broadcast sentence to obtain three words of "multiple", "use" and "pot", wherein the words of "multiple" and "use" are in the positive emotion word stock, the words of "pot" are neither in the positive emotion word stock nor in the negative emotion word stock, the number of words belonging to positive emotion is greater than the number of words belonging to negative emotion, and no negative words are in the broadcast sentence of "multipurpose pot", so the computer device determines that the broadcast sentence of "multipurpose pot" belongs to positive emotion.
In the embodiment of the application, not only the emotion type of each word in the broadcast statement is considered, but also whether the broadcast statement comprises a negative word is considered, so that the semantic meaning of the broadcast statement is more accordant when the broadcast statement is subjected to emotion analysis on the word granularity, and the accuracy of emotion analysis on the broadcast statement is improved.
Fig. 4 is a flowchart of a method for determining emotion type according to an embodiment of the present application, as shown in fig. 4, the method includes the following steps.
401. And dividing words of the broadcast sentence to obtain a plurality of words.
402. The number of words belonging to various emotion types is determined.
403. It is determined whether a negative word is included. If the broadcasting statement does not include the negative word, determining the emotion type with the largest number of the words as the emotion type of the broadcasting statement according to logic processing without the negative word. If the broadcasting statement comprises a negative word, determining the emotion type with the least quantity of the words as the emotion type of the broadcasting statement according to logic processing comprising the negative word.
404. And outputting emotion analysis results. The emotion analysis result is the emotion type of the broadcast statement.
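Before turning to the second way, the word-library rule just described (and summarized in Fig. 4) can be sketched in Python as follows. The word sets, the neutral tie-breaking rule, and the assumption that the broadcast sentence has already been segmented into words are illustrative assumptions, not part of the patent.

```python
POSITIVE_WORDS = {"multiple", "use", "discount", "recommended"}  # illustrative positive lexicon
NEGATIVE_WORDS = {"tired", "hard"}                               # illustrative negative lexicon
NEGATION_WORDS = {"no", "not", "none"}                           # illustrative negation words

def sentence_emotion(words):
    """words: the segmented broadcast sentence, e.g. ["multiple", "use", "pot"]."""
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    if pos == neg:
        return "neutral"                      # assumed tie-breaking rule
    if not any(w in NEGATION_WORDS for w in words):
        # No negation word: the emotion type with the most words wins.
        return "positive" if pos > neg else "negative"
    # With a negation word, the type with the fewest words wins instead.
    return "positive" if pos < neg else "negative"
```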
The second way is: the computer equipment calls the emotion classification model, performs emotion analysis on the broadcast statement to obtain prediction probabilities of various emotion types, and determines the emotion type of the broadcast statement based on the prediction probabilities of the various emotion types. The prediction probability of the emotion type represents the probability that the broadcast statement belongs to the emotion type.
The emotion classification model is used for carrying out emotion analysis on any sentence to obtain prediction probabilities of various emotion types, and the prediction probabilities of the emotion types represent the probabilities of the sentence belonging to the emotion types. The computer equipment inputs the broadcast statement into the emotion classification model to obtain prediction probabilities of various emotion types. And then determining the emotion type of the broadcast statement based on the prediction probabilities of the plurality of emotion types. For example, the computer device determines the emotion type with the highest prediction probability as the emotion type to which the broadcast statement belongs.
In one possible implementation manner, the emotion classification model adopts a naive Bayes classification method. In the emotion classification scene, the essential idea of the naive Bayes classification method is to solve the maximum posterior probability that a sentence is classified into a certain emotion type, so the emotion classification model is a conditional probability model. The Bayes formula is defined as:
P(B|A) = P(A|B) × P(B) / P(A)
where A represents a sentence, B represents an emotion type, P(B|A) represents the probability that the sentence A belongs to the emotion type B, P(A|B) represents the probability of the sentence A given the emotion type B, P(A|B) is also called the prior probability, and P(B|A) is also called the posterior probability. P(A) represents the probability of the sentence A, and P(B) represents the probability of the emotion type B.
Optionally, when the naive Bayes classification method is applied to the emotion analysis problem, the problem can be modeled as a positive/negative emotion binary classification problem, and an open sentiment analysis data set is used as the training data set of the emotion classification model, so that the emotion classification model learns the features of sentences. The model parameters of the emotion classification model can be solved by maximum likelihood estimation, and the process of training the emotion classification model is also a process of performing maximum likelihood estimation of the model parameters using different sentences. After training is completed, the emotion classification model outputs the probability of classifying a sentence as positive emotion, with a value ranging from 0 to 1, and the probability of classifying the sentence as negative emotion is 1 minus the probability of classifying it as positive emotion.
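A minimal sketch of such a binary naive Bayes classifier using scikit-learn is given below. The toy training pairs stand in for the open sentiment analysis data set mentioned above, and the bag-of-words features and pipeline are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in for an open sentiment analysis data set of (text, label) pairs.
texts = ["great versatile pot", "really love it", "so tiring", "very disappointing"]
labels = [1, 1, 0, 0]  # 1 = positive emotion, 0 = negative emotion

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Probability that a broadcast sentence belongs to the positive emotion type.
p_positive = model.predict_proba(["multipurpose pot"])[0][1]
```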
It should be noted that, the first mode and the second mode respectively introduce a mode of determining an emotion type by using an emotion word stock and a mode of determining an emotion type by using an emotion classification model. In another embodiment, two ways can be combined, for a broadcast sentence, firstly, the number of words belonging to various emotion types in the broadcast sentence is determined based on an emotion word library, then, the probability that the broadcast sentence belongs to various emotion types is determined by adopting an emotion classification model, and finally, the emotion type to which the broadcast sentence belongs is determined based on the number of words belonging to various emotion types in the broadcast sentence and the probability that the broadcast sentence belongs to various emotion types.
For example, the emotion word stock includes a positive emotion word stock and a negative emotion word stock, and the emotion type to which the word belongs includes positive emotion and negative emotion. The emotion classification model is a classification model, and the probability that the broadcast statement belongs to positive emotion and the probability that the broadcast statement belongs to negative emotion can be predicted. The emotion types of the broadcast sentences are classified into strong positive emotion, neutral emotion and negative emotion. The classification is as follows:
(1) If the probability that the broadcast sentence belongs to positive emotion is greater than 0.6, it is determined that the broadcast sentence belongs to positive emotion. After determining that the broadcast sentence belongs to positive emotion, the following manner (2) may be further adopted to determine whether the broadcast sentence belongs to strong positive emotion.
(2) If the number of words belonging to positive emotion in the broadcast sentence is larger than the number of words belonging to negative emotion and the probability that the broadcast sentence belongs to positive emotion is greater than 0.9, it is determined that the broadcast sentence belongs to strong positive emotion.
(3) If the number of words belonging to negative emotion in the broadcast sentence is greater than the number of words belonging to positive emotion and the probability that the broadcast sentence belongs to positive emotion is less than 0.2, it is determined that the broadcast sentence belongs to negative emotion.
(4) If none of (1)-(3) above is satisfied, it is determined that the broadcast sentence belongs to neutral emotion.
In another possible implementation manner, in the case that the last word of the broadcast sentence is a query word, it is determined that the broadcast sentence belongs to neutral emotion. For example, the query words are question particles and words such as "ma" (吗), "ne" (呢) and "what" (什么). In the case that the last word is a query word, the broadcast sentence can be regarded as a question, and considering that a question generally does not carry strong positive or negative emotion, the broadcast sentence is directly determined to belong to neutral emotion in this case, which reduces the possibility of misjudging the broadcast sentence and improves the accuracy of emotion analysis of the broadcast sentence.
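Pulling the lexicon counts, the model probability, and the query-word rule together, a hedged sketch of the combined decision might look like the following. The 0.6/0.9/0.2 thresholds come from the example above; the function name, the query-word list, and the reuse of the illustrative word sets from the earlier lexicon sketch are assumptions.

```python
QUESTION_WORDS = {"吗", "呢", "什么"}  # assumed query-word list

def classify_sentence(words, p_positive):
    """Combine word counts and the naive Bayes probability.

    words: segmented broadcast sentence; p_positive: model probability that
    the sentence belongs to the positive emotion type.
    """
    if words and words[-1] in QUESTION_WORDS:
        return "neutral"                      # question sentences stay neutral
    pos = sum(w in POSITIVE_WORDS for w in words)   # from the earlier sketch
    neg = sum(w in NEGATIVE_WORDS for w in words)
    if p_positive > 0.6:
        if pos > neg and p_positive > 0.9:
            return "strong_positive"
        return "positive"
    if neg > pos and p_positive < 0.2:
        return "negative"
    return "neutral"
```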
303. And under the condition that the computer equipment detects the keyword in the broadcast statement, determining the emotion type of the keyword.
The computer device segments the broadcast sentence into words; if the broadcast sentence includes a plurality of words, a keyword is detected in the broadcast sentence, and if the broadcast sentence includes only one word, it is considered that there is no keyword in the broadcast sentence. After detecting the keyword in the broadcast sentence, the computer device determines the emotion type to which the keyword belongs.
In one possible implementation, the computer device determines, based on the plurality of emotion word libraries, an emotion type to which the keyword in the broadcast sentence belongs. Wherein, each emotion word bank corresponds to one emotion type, and emotion expressed by words in the emotion word bank belongs to the emotion type corresponding to the emotion word bank. The computer equipment determines an emotion word bank to which the keyword belongs, and determines the emotion type corresponding to the emotion word bank as the emotion type to which the keyword belongs. Optionally, if the keyword is not queried in any emotion word library, that is, the keyword does not belong to any emotion word library, the computer device determines that the keyword belongs to neutral emotion.
In one possible implementation, a process for determining keywords by a computer device includes: in the case that the broadcast sentence comprises a plurality of words, determining a first appearance frequency and a second appearance frequency corresponding to the plurality of words, wherein the first appearance frequency represents the appearance frequency of the words in the broadcast text, and the second appearance frequency represents the appearance frequency of the text containing the words in a text database. The computer device determines weights of the plurality of words based on the first frequency of occurrence and the second frequency of occurrence corresponding to the plurality of words, the weights of the words being positively correlated with the first frequency of occurrence and negatively correlated with the second frequency of occurrence. And the computer equipment determines the word with the largest weight in the broadcasting statement as the keyword in the broadcasting statement.
In the embodiment of the application, if the frequency of the word appearing in the current broadcasting statement is high and the frequency of the word appearing in other texts is low, the word can well represent the semantics of the current broadcasting statement, namely the importance of the word is positively correlated with the frequency of the word appearing in the current broadcasting statement and is negatively correlated with the frequency of the word appearing in other texts. Therefore, in the embodiment of the application, the weight of the word is set to be positively correlated with the first occurrence frequency and negatively correlated with the second occurrence frequency, and then the word with the largest weight is the word with the most representativeness in the current broadcasting statement, so that the word with the largest weight is used as the keyword in the current broadcasting statement, and the importance and representativeness of the determined keyword can be ensured by adopting the method, so that the accuracy of emotion analysis on the broadcasting statement is improved.
Optionally, the computer device determines the weights of the terms using the following formula.
X = a / M,  Y = b / N,  Z = X × log(1 / Y)
Wherein X represents a first frequency of occurrence, a represents the number of occurrences of the word Q in the broadcast sentence, M represents the total number of words in the broadcast sentence, Y represents a second frequency of occurrence, b represents the number of texts containing the word Q in the text database, and N represents the total number of texts in the text database. Z represents the weight of the word Q.
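The weighting above is essentially a TF-IDF style score. The sketch below follows that reading; the logarithmic form of the inverse document frequency and the +1 smoothing (to avoid dividing by zero for unseen words) are my assumptions, not stated by the patent.

```python
import math
from collections import Counter

def keyword(sentence_words, text_database):
    """Pick the highest-weight word in one broadcast sentence.

    sentence_words: segmented broadcast sentence (list of words).
    text_database: list of texts, each itself a list of words.
    """
    counts = Counter(sentence_words)
    n_texts = len(text_database)
    best_word, best_weight = None, -1.0
    for word in counts:
        x = counts[word] / len(sentence_words)            # first occurrence frequency
        b = sum(word in text for text in text_database)   # texts containing the word
        y = (b + 1) / (n_texts + 1)                       # second frequency, smoothed
        z = x * math.log(1.0 / y)                         # weight: up with x, down with y
        if z > best_weight:
            best_word, best_weight = word, z
    return best_word
```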
Note that, in the above step 302, the emotion type to which the broadcast sentence belongs is determined at sentence granularity, and in the above step 303, the emotion type to which the keyword in the broadcast sentence belongs is determined at word granularity. The embodiment of the present application only takes executing step 302 first and then executing step 303 as an example. In another embodiment, step 303 may be performed first and then step 302, or step 302 and step 303 may be performed simultaneously.
304. The computer equipment determines the emotion type of the broadcasting text based on the emotion type of the broadcasting statement and the emotion type of the keyword.
After determining the emotion type of the broadcast statement and the emotion type of the keyword, the computer equipment determines the emotion type of the broadcast text by combining the emotion type of the broadcast statement and the emotion type of the keyword.
For example, in the case that the emotion type to which the broadcast sentence belongs is the same as the emotion type to which the keyword belongs, it is determined that the broadcast text belongs to the emotion type. For another example, in the case that the emotion type to which the broadcast sentence belongs is different from the emotion type to which the keyword belongs, it is determined that the broadcast text belongs to both the emotion type to which the broadcast sentence belongs and the emotion type to which the keyword belongs, or it is determined that the broadcast text belongs to a neutral emotion type, and the like.
It should be noted that, in the embodiment of the present application, only after executing step 302 and step 303, the emotion type to which the broadcast text belongs is determined based on the emotion type to which the broadcast sentence belongs and the emotion type to which the keyword belongs. In another embodiment, the computer device may also perform only step 302, that is, determine the emotion type to which the broadcast text belongs based only on the emotion type to which the broadcast statement belongs. Or the computer device may also perform only step 303, that is, determine the emotion type to which the broadcast text belongs based only on the emotion type to which the keyword in the broadcast sentence belongs.
It should be noted that, in the embodiment of the present application, based on the emotion type to which the broadcast statement belongs and the emotion type to which the keyword belongs, the emotion type to which the broadcast text belongs is determined, and then expression data matched with the broadcast text is obtained by adopting the following manner of step 305. In another embodiment, after determining the emotion type to which the broadcast sentence belongs and the emotion type to which the keyword belongs, the computer device does not execute step 304, but directly obtains expression data matching the broadcast sentence and expression data matching the keyword, that is, expression data matching the broadcast text.
305. The computer equipment determines expression data matched with the broadcasting text based on the emotion type of the broadcasting text, and the emotion expressed by the expression indicated by the expression data belongs to the emotion type.
In the embodiment of the application, the mapping from the broadcasting text to the expression data is from the broadcasting text to the emotion type, and then from the emotion type to the expression data, so that the expression data matched with the broadcasting text is obtained.
In one possible implementation manner, one emotion type can correspond to multiple expression data, the computer equipment acquires multiple expression data corresponding to the emotion type to which the broadcast text belongs, and one expression data is randomly selected from the multiple expression data to serve as expression data matched with the broadcast text. By setting various expression data for the same emotion type and randomly selecting one expression data for the broadcast text, the richness of the expression data is improved, and the flexibility of the expression made by the virtual object can be improved when the virtual object is driven based on the expression data, so that the problem of single expression of the virtual object is avoided.
For example, if the emotion type to which the broadcast text belongs is a positive emotion, the expressions indicated by the corresponding expression data are expressions such as smiling and liking; if the emotion type to which the broadcast text belongs is a neutral emotion, the expressions indicated by the corresponding expression data are expressions such as smiling and thinking; and if the emotion type to which the broadcast text belongs is a negative emotion, the expressions indicated by the corresponding expression data are expressions such as sighing and crying.
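As a minimal sketch of this random selection (the emotion types and expression labels below are placeholders chosen for illustration, not values taken from the embodiment):

import random

# Illustrative mapping from emotion type to several candidate expression labels.
EXPRESSIONS_BY_EMOTION = {
    "positive": ["smile", "liking"],
    "neutral": ["smile", "thinking"],
    "negative": ["sigh", "cry"],
}

def pick_expression(emotion_type: str) -> str:
    """Randomly select one expression for the given emotion type, so that the
    same emotion type does not always drive the same expression."""
    return random.choice(EXPRESSIONS_BY_EMOTION[emotion_type])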
In one possible implementation, the expression data includes expression sub-data that matches each word in the broadcast text. Under the condition that the keyword is not detected in the broadcasting statement, the computer equipment determines expression data corresponding to the emotion type of the broadcasting statement as expression sub-data matched with each word in the broadcasting statement. Under the condition that the computer equipment detects the keyword in the broadcasting statement, determining expression sub-data corresponding to the emotion type to which the keyword belongs as expression sub-data matched with the keyword, and determining expression sub-data corresponding to the emotion type to which the broadcasting statement belongs as expression sub-data matched with non-keyword, wherein the non-keyword refers to other words except the keyword in the broadcasting statement.
In the embodiment of the application, the expression data matched with the broadcast text is split into expression sub-data matched with each word in the broadcast text, which is equivalent to dividing the expression data at word granularity, so that in the process of the virtual object reading the broadcast text, the expression can change along with the word currently being read. This avoids keeping a single expression throughout the reading and improves the flexibility of the expressions made by the virtual object. For example, when broadcasting the sentence "home multipurpose pot", the virtual object switches to the expression matched with the keyword "multipurpose" while speaking that word, and then transitions back to the smiling expression matched with the rest of the sentence.
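Building on the pick_expression helper sketched above, the per-word assignment could look roughly as follows; the function name and arguments are assumptions for illustration:

def expression_per_word(sentence_words, sentence_emotion, keyword=None, keyword_emotion=None):
    """Assign one expression label to every word of a broadcast sentence.

    Non-keywords follow the sentence-level emotion type; a detected keyword
    follows its own word-level emotion type, so the expression changes exactly
    on that word."""
    assignments = []
    for word in sentence_words:
        if keyword is not None and word == keyword:
            assignments.append((word, pick_expression(keyword_emotion)))
        else:
            assignments.append((word, pick_expression(sentence_emotion)))
    return assignments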
306. The computer equipment drives the virtual object based on the expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data.
After the computer equipment acquires the expression data matched with the broadcasting text, the virtual object is driven based on the expression data, so that the virtual object makes the expression indicated by the expression data, and the computer equipment can obtain a plurality of image frames comprising the virtual object by driving the virtual object, wherein the image frames form a broadcasting picture.
In one possible implementation, the expression data is composed of a plurality of expression parameters used for controlling changes of facial key points. For example, the virtual object corresponds to 52 expression parameters, which are called expression bases (blendshapes); the 52 expression parameters control facial key points such as the eyebrows, eyes, and mouth, and expressions such as smiling, laughing, and surprise can be generated by controlling these facial key points. Fig. 5 is a schematic diagram of an expression parameter provided by an embodiment of the present application. As shown in fig. 5, EyeBlinkLeft is an expression parameter for controlling the left eye: when EyeBlinkLeft = 0, the left eye of the virtual object is open, and when EyeBlinkLeft = 1, the left eye of the virtual object is closed, so that by adjusting EyeBlinkLeft, the virtual object can be controlled to make the expression of blinking the left eye.
Fig. 6 is a schematic diagram of facial expressions provided in an embodiment of the present application. As shown in fig. 6, by controlling the 52 expression parameters, various expression data may be generated, which are respectively used to indicate facial expressions such as smiling, laughing, winking, shyness, liking, pride, grievance, crying, surprise, pouting, anger, puzzlement, thinking, and the like.
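A single frame of such expression data can be sketched as a dictionary of blendshape coefficients in [0, 1]; the parameter names below follow a common blendshape convention and are assumptions, not the exact 52 parameters of the embodiment:

# A neutral frame: every listed blendshape coefficient starts at 0.
neutral_frame = {name: 0.0 for name in (
    "EyeBlinkLeft", "EyeBlinkRight", "MouthSmileLeft", "MouthSmileRight", "BrowInnerUp")}

def blink_left(frame: dict) -> dict:
    """Close the left eye: EyeBlinkLeft = 0 means open, 1 means closed."""
    new_frame = dict(frame)
    new_frame["EyeBlinkLeft"] = 1.0
    return new_frame

def smile(frame: dict, amount: float = 0.8) -> dict:
    """Raise both mouth corners to produce a smiling expression."""
    new_frame = dict(frame)
    new_frame["MouthSmileLeft"] = amount
    new_frame["MouthSmileRight"] = amount
    return new_frame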
In one possible implementation manner, the computer device executes the steps 301 to 306 in response to the broadcast video generation instruction, and the computer device obtains a broadcast screen including a virtual object, including: the broadcasting video generation instruction carries an object identifier, the computer equipment acquires a virtual object indicated by the object identifier, and drives the virtual object based on expression data to obtain a broadcasting picture comprising the virtual object.
The computer device stores a plurality of broadcast objects, and each broadcast object has a different appearance; for example, the broadcast objects include a male figure in realistic human form, a female figure in realistic human form, or a figure in cartoon form. The computer device obtains the object identifier carried by the broadcast video generation instruction, and obtains the broadcast object indicated by the object identifier from the plurality of broadcast objects to generate the broadcast picture; that is, the virtual object used for broadcasting can be selected.
In the embodiment of the application, through the object identifier carried in the broadcast video generation instruction, the broadcast picture generated by the computer device can include the specified broadcast object, which improves the flexibility and diversity of controlling the broadcast object in the broadcast picture, and further improves the flexibility and diversity of the generated broadcast video.
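A minimal sketch of this selection, with placeholder identifiers and avatar descriptions:

# Stored broadcast objects keyed by object identifier; both keys and values
# here are placeholders for illustration.
BROADCAST_OBJECTS = {
    "real_male": "realistic male avatar",
    "real_female": "realistic female avatar",
    "cartoon": "cartoon avatar",
}

def select_broadcast_object(instruction: dict) -> str:
    """Return the broadcast object indicated by the object identifier carried
    in the broadcast video generation instruction."""
    return BROADCAST_OBJECTS[instruction["object_id"]]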
307. The computer equipment generates broadcasting video based on broadcasting audio and broadcasting pictures corresponding to the broadcasting text, wherein the broadcasting video comprises broadcasting audio and broadcasting pictures.
In one possible implementation, the computer device generates the broadcast audio based on the broadcast text, including at least one of:
(1) The broadcasting video generation instruction carries a model identification, and the computer equipment calls an audio conversion model indicated by the model identification to convert broadcasting text into broadcasting audio.
The computer device stores a plurality of audio conversion models, each of which is used for converting any text into audio. Different audio conversion models have different parameters such as intonation, tone of voice, timbre, and language, so as to generate audio with different characteristics; for example, the audio produced by different audio conversion models can sound like a lively cartoon character, a deep male voice, a gentle female voice, and the like. Optionally, the audio conversion model in the embodiment of the present application is a TTS (Text To Speech) model or another model. The computer device acquires the model identifier carried by the broadcast video generation instruction, determines the audio conversion model indicated by the model identifier among the plurality of audio conversion models, and inputs the broadcast text into that audio conversion model, which converts the broadcast text and outputs the broadcast audio corresponding to the broadcast text.
(2) The broadcast video generation instruction carries a broadcast speed, and the computer equipment converts the broadcast text into broadcast audio with the broadcast speed.
The broadcast speed is used for indicating the speed of broadcasting the audio, for example, the broadcast speed is 200 words/min. Or the broadcasting speed is used for indicating that the speed of broadcasting the audio is fast, medium speed or slow speed, and the computer equipment converts the broadcasting text into the broadcasting audio with the broadcasting speed according to the indication of the broadcasting speed.
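Putting the model identifier and the broadcast speed together, the audio generation could be sketched as follows; the instruction fields and the callable interface of the models are assumptions, not a concrete TTS API:

def synthesize_broadcast_audio(broadcast_text: str, instruction: dict, tts_models: dict):
    """Convert the broadcast text into broadcast audio using the audio
    conversion model indicated by the instruction, at the requested speed.

    tts_models maps a model identifier to a callable (text, speed) -> audio;
    the callable stands in for an actual audio conversion model."""
    model = tts_models[instruction["model_id"]]   # e.g. a gentle female voice model
    speed = instruction.get("speed", "medium")    # e.g. 200 words/min, or fast / medium / slow
    return model(broadcast_text, speed)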
In one possible implementation manner, the computer device drives the virtual object based on the expression data to obtain a multi-frame broadcasting image including the virtual object, the expression of the virtual object in the multi-frame broadcasting image is the expression indicated by the expression data, and the multi-frame broadcasting image is respectively fused with the target image to obtain a broadcasting picture, wherein the target image is used as a background of the broadcasting picture.
The computer equipment takes the broadcasting image as a foreground image and takes the target image as a background image, so that the broadcasting image is overlapped on the target image, a broadcasting picture comprising the target image and the virtual object is obtained, the broadcasting video shows the effect that the virtual object broadcasts under the condition that the target image is the background, and the content of the broadcasting video is enriched.
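Assuming each driven broadcast image comes with an alpha mask separating the virtual object from its background, the fusion with the target image can be sketched as plain alpha compositing (NumPy is used only for illustration):

import numpy as np

def compose_broadcast_frames(avatar_frames, avatar_alphas, target_image):
    """Overlay each rendered broadcast image on the target image.

    avatar_frames: list of HxWx3 uint8 foreground frames of the virtual object.
    avatar_alphas: list of HxW float masks in [0, 1], where 1 marks avatar pixels.
    target_image:  HxWx3 uint8 background image."""
    composed = []
    for frame, alpha in zip(avatar_frames, avatar_alphas):
        a = alpha[..., None]                      # broadcast the mask over the color channels
        mixed = a * frame + (1.0 - a) * target_image
        composed.append(mixed.astype(np.uint8))
    return composed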
According to the method provided by the embodiment of the application, in the scene of broadcasting the broadcast text by a virtual object, the expression data matched with the broadcast text is determined according to the emotion type to which the broadcast text belongs, so that the expression indicated by the expression data matches the emotion expressed by the broadcast text. The virtual object is then driven based on the expression data, so that the emotion conveyed by the expression made by the virtual object is consistent with the emotion expressed by the broadcast text, which ensures that the expression of the virtual object in the generated broadcast video is consistent with the semantics of the broadcast audio. Therefore, on the basis of saving the time spent on manually shooting videos, the effect that the virtual object vividly reads the broadcast text is achieved, and the efficiency and authenticity of generating the broadcast video are improved.
On the basis of the embodiment, after the computer device acquires the expression data matched with the broadcast text, eye expression parameters can be added into the expression data, and the specific process is described in the embodiment shown in fig. 7. Fig. 7 is a flowchart of yet another video generating method according to an embodiment of the present application, which is executed by a computer device, referring to fig. 7, and includes:
701. the computer device obtains a broadcast text to be broadcast by the virtual object.
702. And the computer equipment performs emotion analysis on the broadcast text to obtain the emotion type of the broadcast text.
703. The computer equipment determines expression data matched with the broadcasting text based on the emotion type of the broadcasting text, and the emotion expressed by the expression indicated by the expression data belongs to the emotion type.
The processes of steps 701 to 703 are the same as the processes of steps 201 to 203 and the processes of steps 301 to 305, and will not be described again.
704. The computer device adds eye expression parameters to the expression data, the eye expression parameters including blink parameters or eyeball motion parameters.
The expression data is composed of a plurality of expression parameters, the expression parameters are used for controlling the change of facial key points, the eye expression parameters are used for controlling the change of eye key points, the blink parameters indicate the expression of blinking, and the eyeball motion parameters indicate the expression of rotating eyeballs.
In one possible implementation, the computer device adds ocular expression parameters to the expression data, including any of the following.
The first way is: the computer equipment determines a first time period according to the target frequency in the playing time period of the expression data, and adds blink parameters in the expression data in the first time period. The computer equipment determines a playing time period of the expression data, wherein the playing time period refers to a playing time period of a broadcasting picture obtained by driving the virtual object based on the expression data. The computer device determines at least one first time period according to the target frequency in the playing time period, for example, the frequency of the blink expression is 2 times per second, then the computer device randomly determines a plurality of time periods with the duration of 1 second in the playing time period, determines two discontinuous first time periods in each time period, and adds blink parameters in the two first time periods so that the virtual object can randomly make the expression of blinking twice per second.
The second way is: the computer device determines a second time period in the playing time period, and adds the first eyeball motion parameter to the expression data of the second time period, where the second time period refers to a time period in which the virtual object performs an action. The first eyeball motion parameter indicates a large-amplitude eyeball rotation. In the embodiment of the application, considering that the virtual object needs to rotate its eyeballs if it is to keep looking at the camera while performing an action, the first eyeball motion parameter is added during the period in which the virtual object performs the action, so that the virtual object rotates its eyeballs synchronously while acting. For example, when the virtual object turns, lowers, or raises its head, its eyeballs are controlled to rotate in the opposite direction, so that the eyes of the virtual object look at the camera as naturally as possible, which improves the flexibility and authenticity of the virtual object.
Third mode: the computer device randomly determines a third time period in the playing time period, and adds a second eyeball motion parameter in expression data of the third time period, wherein the first eyeball motion parameter indicates that the amplitude of the rotating eyeball is larger than that of the second eyeball motion parameter. Wherein the second eye movement parameter indicates an expression of small amplitude rotation of the eyeball. In the embodiment of the application, the second eyeball motion parameter is added into the expression data to control the eyeballs of the virtual object to perform random small-amplitude motion, so that the focusing change of the eyes of a real person when the eyes of the real person watch the object is simulated, and the flexibility and the authenticity of the virtual object are improved. For example, the second eye movement parameter indicates that the eye is rotated in one of eight directions of up, down, left, right, left up, left down, right up, right down, wherein adding the second eye movement parameter indicating which direction to rotate to the expression data may be randomly determined by the computer device.
705. The computer equipment drives the virtual object based on the expression data added with the eye expression parameters to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data.
After the computer device adds the eye expression parameters to the expression data, the virtual object is driven based on the expression data after the eye expression parameters are added, and the process of step 705 is the same as the process of step 306, and will not be described again here.
706. The computer equipment generates broadcasting video based on broadcasting audio and broadcasting pictures corresponding to the broadcasting text, wherein the broadcasting video comprises broadcasting audio and broadcasting pictures.
The process of step 706 is the same as the process of step 307, and will not be described again.
In the embodiment of the application, after the expression data matched with the broadcasting text is obtained, the eye expression parameters are added in the expression data, so that the eye action of the virtual object is controlled, the eye of the virtual object is more similar to the eye change of a real person, the authenticity and the vitality of the virtual object are improved, and the virtual object can be broadcasted more naturally.
On the basis of the embodiment, the expression data matched with the broadcasting text comprises expression sub-data matched with each word in the broadcasting text, and the computer equipment can also control the variation amplitude of the expression by setting weight parameters for the expression sub-data, and the specific process is described in the embodiment shown in fig. 8. Fig. 8 is a flowchart of still another video generating method according to an embodiment of the present application, which is executed by a computer device, referring to fig. 8, and includes:
801. The computer device obtains a broadcast text to be broadcast by the virtual object.
802. And the computer equipment performs emotion analysis on the broadcast text to obtain the emotion type of the broadcast text.
803. The computer equipment determines expression data matched with the broadcasting text based on the emotion type of the broadcasting text, and the emotion expressed by the expression indicated by the expression data belongs to the emotion type.
The processes of steps 801 to 803 are the same as the processes of steps 201 to 203 and the processes of steps 301 to 305, and are not described in detail herein.
804. The computer equipment obtains multi-frame expression sub-data based on the duration of the playing time period of the expression sub-data.
For the expression sub-data matched with each word, the playing time period of the expression sub-data refers to the playing time period of the broadcasting picture obtained by driving the virtual object based on the expression sub-data. Each frame of expression sub-data is used for driving the virtual object to obtain one frame of broadcasting image, so that multiple frames of broadcasting images can be obtained based on the multiple frames of expression sub-data, each of the multiple frames of broadcasting images comprises the virtual object, and the multiple frames of broadcasting images can form broadcasting pictures. The computer equipment obtains multi-frame expression sub-data based on the duration of the playing time period of the expression sub-data matched with the words, so that the playing duration of the multi-frame broadcasting image obtained based on the multi-frame expression sub-data is equal to the duration of the playing time period. The multi-frame expression sub-data are all expression sub-data matched with the word.
For example, the computer device plays 24 frames of broadcast pictures every second, and the duration of the play time period is 2 seconds, so that the computer device acquires 48 frames of expression sub-data.
805. The computer device adjusts the multi-frame expression sub-data based on a weight parameter of the multi-frame expression sub-data, the weight parameter indicating a variation amplitude of the expression.
After the computer equipment acquires the multi-frame expression sub-data corresponding to the words, weight parameters are set for the multi-frame expression sub-data, and the expression sub-data is adjusted based on the weight parameters, so that the variation amplitude of the expression indicated by the expression sub-data is controlled. For example, for the expression sub-data of the same expression, the larger the weight parameter is, the larger the variation amplitude of the expression indicated by the obtained expression sub-data is after adjustment by the weight parameter, and the smaller the weight parameter is, the smaller the variation amplitude of the expression indicated by the obtained expression sub-data is after adjustment by the weight parameter.
In one possible implementation, the process of setting the weight parameters by the computer device includes the following steps (1) -step (4).
(1) The computer device determines a start time period, a hold time period, and a termination time period in the playing time period. The start time period, the hold time period, and the termination time period are connected in sequence, and their total duration is equal to the duration of the playing time period. The starting point of the start time period is the same as the starting point of the playing time period, the ending point of the start time period is the same as the starting point of the hold time period, the ending point of the hold time period is the same as the starting point of the termination time period, and the ending point of the termination time period is the same as the ending point of the playing time period. Optionally, the duration of the start time period is one quarter of the duration of the playing time period, the duration of the hold time period is one half of the duration of the playing time period, and the duration of the termination time period is one quarter of the duration of the playing time period.
(2) The computer device sets the weight parameters of the multi-frame expression sub-data of the initial time period to sequentially increment from a first value to a second value according to the time sequence, wherein the second value is larger than the first value.
For example, the first value is 0 and the second value is 1. The weight parameter of the first frame of expression sub-data in the initial time period is 0, the weight parameter of the last frame of expression sub-data is 1, and the expression sub-data between the first frame of expression sub-data and the last frame of expression sub-data are sequentially increased from 0 to 1. That is, in the initial period, the variation range of the expression indicated by the expression sub-data is from small to large, so that the effect that the virtual object gradually makes the expression is achieved.
(3) The computer device sets a weight parameter of the multi-frame expression sub-data of the hold period to a second value.
For example, the second value is 1, and the weight parameters of the multi-frame expression sub-data of the holding period are all 1. That is, in the holding period, the variation amplitude of the expression indicated by the expression sub-data reaches the maximum, so that the effect that the virtual object continues to hold after making the expression is achieved.
(4) The computer device sets the weight parameters of the multi-frame expression sub-data of the termination period to decrease from the second value to the first value in sequence in time order.
For example, the first value is 0 and the second value is 1. The weight parameter of the first frame of expression sub-data in the termination time period is 1, the weight parameter of the last frame of expression sub-data is 0, and the expression sub-data between the first frame of expression sub-data and the last frame of expression sub-data is gradually decreased from 1 to 0. That is, in the termination period, the variation range of the expression indicated by the expression sub-data is from large to small, so that the effect that the virtual object gradually packs up the expression is achieved.
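The weight envelope of steps (1) to (4), a ramp up over the start time period, a hold at the maximum, and a ramp down over the termination time period, together with its application to the per-frame expression sub-data, can be sketched as follows, assuming the quarter / half / quarter split described above:

def expression_weights(num_frames: int) -> list:
    """Weight parameter per frame for one word's expression sub-data:
    ramp from 0 to 1 over the first quarter, hold 1 for the middle half,
    and ramp back to 0 over the last quarter of the playing time period."""
    start = num_frames // 4
    end = num_frames // 4
    hold = num_frames - start - end
    weights = [i / max(start - 1, 1) for i in range(start)]        # 0 -> 1
    weights += [1.0] * hold                                        # hold at 1
    weights += [1.0 - i / max(end - 1, 1) for i in range(end)]     # 1 -> 0
    return weights

def apply_weights(frames, weights):
    """Scale every expression parameter by its frame weight, so the expression
    amplitude rises, holds, and then falls."""
    return [{name: value * w for name, value in frame.items()}
            for frame, w in zip(frames, weights)]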
806. And the computer equipment drives the virtual object based on the regulated expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data.
After the computer device adjusts the expression data based on the weight parameters, the virtual object is driven based on the adjusted expression data, and the process of step 806 is the same as the process of step 306, which is not described herein.
807. The computer equipment generates broadcasting video based on broadcasting audio and broadcasting pictures corresponding to the broadcasting text, wherein the broadcasting video comprises broadcasting audio and broadcasting pictures.
The procedure of this step 807 is the same as that of the above step 307, and will not be described here again.
In the embodiment of the application, for the same word, the multi-frame expression sub-data corresponding to the word is adjusted based on the weight parameters, so that the expression of the virtual object is fluctuant when the word is read, and the effect that the virtual object reads the word in a natural and flexible way is achieved.
Based on the above embodiment, the computer device may further fuse two expression sub-data in the overlapping time period of two adjacent words, and the specific process is described in the embodiment shown in fig. 9 below. Fig. 9 is a flowchart of still another video generating method according to an embodiment of the present application, which is executed by a computer device, referring to fig. 9, and the method includes:
901. The computer device obtains a broadcast text to be broadcast by the virtual object.
902. And the computer equipment performs emotion analysis on the broadcast text to obtain the emotion type of the broadcast text.
903. The computer equipment determines expression data matched with the broadcasting text based on the emotion type of the broadcasting text, and the emotion expressed by the expression indicated by the expression data belongs to the emotion type.
The processes of step 901 to step 903 are the same as the processes of step 201 to step 203 and the processes of step 301 to step 305, and are not described in detail herein.
904. The computer device determines the playing time period of the expression sub-data matched with a word based on a target duration and the playing duration of the audio segment corresponding to the word.

In the embodiment of the application, the expression data matched with the broadcast text includes expression sub-data matched with each word in the broadcast text, and the broadcast audio includes an audio segment corresponding to each word in the broadcast text. The duration of the playing time period of the expression sub-data is equal to the sum of the target duration and the playing duration of the audio segment, and the starting playing time point of the playing time period is the starting playing time point of the audio segment.

For each word, the computer device determines the playing duration of the audio segment corresponding to the word, and determines the sum of this playing duration and the target duration as the duration of the playing time period of the expression sub-data matched with the word. That is, for each word, the duration of the playing time period of the expression sub-data corresponding to the word is longer than the playing duration of the audio segment corresponding to the word.
905. For any two adjacent words, the computer equipment determines an overlapped time period in a playing time period of expression sub-data matched with the two words, respectively fuses the two expression sub-data of the same frame in the overlapped time period, and determines multi-frame expression sub-data obtained by fusion into expression sub-data of the overlapped time period.
Because the duration of the playing time period of the expression sub-data corresponding to each word is longer than the playing time period of the audio segment corresponding to the word, and the initial playing time point of the playing time period of the expression sub-data corresponding to the word is the initial playing time point of the audio segment corresponding to the word. Therefore, for any two adjacent words, there is an overlapping period of time in the playing time periods of the expression sub-data corresponding to the two words, and then in the overlapping period of time, there is expression sub-data corresponding to the two words, that is, two expression sub-data are included in the overlapping period of time.
In the embodiment of the application, for the overlapping time period, the two expression sub-data of the same frame are respectively fused to obtain the fused multi-frame expression sub-data, and the fused multi-frame expression sub-data actually comprises the two expression sub-data, so that the two expressions can be naturally transited in the overlapping time period, and the effect that the virtual object naturally and flexibly broadcasts the broadcast text is achieved.
Fig. 10 is a schematic diagram of the variation of expression parameters provided in the embodiment of the present application. As shown in fig. 10, the maximum weight parameter is 1 and the minimum weight parameter is 0, each trapezoid in fig. 10 corresponds to one word, and an overlapping time period exists between two adjacent words. The weight parameter of the expression sub-data corresponding to each word transitions through a start time period, a hold time period, and a termination time period. For example, if the duration of the playing time period corresponding to the first word is T, the first T/4 of the playing time period is the start time period of the expression, in which the expression transitions from absent to present and the weight parameter increases linearly from 0 to 1; the expression is held during the time period from T/4 to 3T/4 of the playing time period, in which the weight parameter is kept at 1 so that the expression reaches its maximum amplitude; and the final T/4 of the playing time period is the termination time period, in which the expression transitions from present to absent and the weight parameter decreases linearly from 1 to 0.
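A sketch of the frame-by-frame fusion in the overlapping time period is given below. Summing the two weighted frames is one possible fusion choice (the embodiment does not prescribe a specific operation), and the frame dictionaries are assumed to already carry the trapezoidal weights of fig. 10:

def fuse_overlap(frames_prev, frames_next, overlap: int):
    """Fuse the expression sub-data of two adjacent words in their overlapping
    time period: the last `overlap` frames of the previous word and the first
    `overlap` frames of the next word fall in the same period, and adding the
    two weighted frames keeps both expressions while one fades into the other."""
    if overlap <= 0:
        return frames_prev + frames_next
    fused_tail = []
    for a, b in zip(frames_prev[-overlap:], frames_next[:overlap]):
        names = set(a) | set(b)
        fused_tail.append({name: a.get(name, 0.0) + b.get(name, 0.0) for name in names})
    # Frames outside the overlapping time period stay unchanged.
    return frames_prev[:-overlap] + fused_tail + frames_next[overlap:]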
906. And the computer equipment drives the virtual object based on the fused expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data.
After the computer device obtains the fused expression data, the virtual object is driven based on the fused expression data, and the process of step 906 is the same as the process of step 306, which is not described herein.
907. The computer equipment generates broadcasting video based on broadcasting audio and broadcasting pictures corresponding to the broadcasting text, wherein the broadcasting video comprises broadcasting audio and broadcasting pictures.
The process of step 907 is the same as that of step 307, and will not be described again.
According to the method provided by the embodiment of the application, the playing time period of the expression sub-data corresponding to each word is longer than the playing time period of the audio frequency corresponding to the word, so that the overlapping time period exists in the playing time periods of the expression sub-data corresponding to the adjacent two words, and the two expression sub-data of the same frame are respectively fused in the overlapping time periods, so that the two expressions can be naturally transited in the overlapping time period, and the effect of naturally and flexibly broadcasting the broadcasting text by the virtual object is achieved.
Fig. 11 is a flowchart of still another video generating method according to an embodiment of the present application, as shown in fig. 11, the method includes the following steps.
1101. And the computer equipment processes the broadcasting text by adopting a voice synthesis technology to obtain broadcasting audio.
1102. The computer device preprocesses the broadcast text. The preprocessing includes sentence-granularity emotion analysis and word-granularity emotion analysis of the broadcast text. Sentence-granularity emotion analysis means splitting the broadcast text into sentences and performing emotion analysis on each obtained broadcast sentence. Word-granularity emotion analysis means extracting keywords from the broadcast text and performing emotion analysis on the extracted keywords. Expression mapping is then performed using the results of the sentence-granularity and word-granularity emotion analysis to obtain initial expression data, eye expression parameters are added to the initial expression data, and smooth transition and frame-interpolation fusion are performed to obtain the final expression data. Smooth transition means adjusting the expression data with weight parameters, so that an expression transitions smoothly from absent to present and from present to absent; frame-interpolation fusion means fusing expression sub-data in the overlapping time period of two adjacent words, so that two expressions transition naturally.
1103. The computer device drives the virtual object based on the expression data to obtain a broadcasting picture.
1104. The computer device generates a broadcast video based on the broadcast audio and the broadcast screen.
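The overall flow of fig. 11 can be summarized in the following sketch, in which the individual stages are supplied by the caller as callables; the stage names are placeholders rather than concrete APIs:

def generate_broadcast_video(broadcast_text, stages):
    """Wire the stages of fig. 11 together. `stages` is a dict of callables
    provided by the caller for speech synthesis, emotion analysis plus
    expression mapping, virtual-object driving, and frame/audio muxing."""
    audio = stages["tts"](broadcast_text)                    # step 1101: speech synthesis
    expression_data = stages["expressions"](broadcast_text)  # step 1102: preprocessing and expression mapping
    frames = stages["drive"](expression_data)                # step 1103: drive the virtual object
    return stages["mux"](frames, audio)                      # step 1104: combine into the broadcast video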
The video generation method provided by the embodiment of the application can be applied to any scene for generating video.
For example, in the advertising field, the broadcast text is an advertising document, and the broadcast video is an advertising video. Fig. 12 is a flowchart of an advertisement video generating method provided in an embodiment of the present application, as shown in fig. 12, a computer device obtains an advertisement document 1201, a virtual object 1202 and an advertisement image 1203, performs face driving on the virtual object 1202 based on the content of the advertisement document 1201 to obtain a broadcast image including the virtual object 1202, then fuses the broadcast image with the advertisement image 1203 to obtain an advertisement picture including the virtual object 1202, converts the advertisement document 1201 into an advertisement audio, and generates an advertisement video including the virtual object 1202 based on the advertisement picture and the advertisement audio, thereby replacing a real actor to shoot an advertisement, and improving the interest of the advertisement video.
Fig. 13 is a schematic diagram of an advertisement video provided in an embodiment of the present application, as shown in fig. 13, the advertisement video includes a virtual object 1301 that makes an expression, and an advertisement image 1302, where the advertisement image 1302 further includes an advertisement document, so that an effect that the virtual object 1301 draws a color to read the advertisement document is achieved under the condition that the advertisement document is displayed.
In addition, the video generating method provided by the embodiment of the application can be applied to generating news broadcasting video, product introduction video, scenic spot introduction video and the like, and the application scene of the video generating method is not limited.
Fig. 14 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application. Referring to fig. 14, the apparatus includes:
A text acquisition module 1401, configured to acquire a broadcast text to be broadcast by a virtual object;
The emotion analysis module 1402 is configured to perform emotion analysis on the broadcast text to obtain an emotion type to which the broadcast text belongs;
an expression determining module 1403, configured to determine, based on an emotion type to which the broadcast text belongs, expression data that matches the broadcast text, where an emotion expressed by an expression indicated by the expression data belongs to the emotion type;
A driving module 1404, configured to drive the virtual object based on the expression data, to obtain a broadcast frame including the virtual object, so that an expression of the virtual object in the broadcast frame is an expression indicated by the expression data;
the video generating module 1405 is configured to generate a broadcast video based on the broadcast audio corresponding to the broadcast text and the broadcast picture, where the broadcast video includes the broadcast audio and the broadcast picture.
According to the video generation device provided by the embodiment of the application, in the scene of broadcasting the broadcast text by a virtual object, the expression data matched with the broadcast text is determined according to the emotion type to which the broadcast text belongs, so that the expression indicated by the expression data matches the emotion expressed by the broadcast text. The virtual object is then driven based on the expression data, so that the emotion conveyed by the expression made by the virtual object is consistent with the emotion expressed by the broadcast text, which ensures that the expression of the virtual object in the generated broadcast video is consistent with the semantics of the broadcast audio. Therefore, on the basis of saving the time spent on manually shooting videos, the effect that the virtual object vividly reads the broadcast text is achieved, and the efficiency and authenticity of generating the broadcast video are improved.
Alternatively, referring to fig. 15, the broadcast text includes a plurality of broadcast sentences; the emotion analysis module 1402 is configured to:
Carrying out emotion analysis on the broadcast statement to obtain an emotion type of the broadcast statement;
under the condition that a keyword is detected in the broadcast statement, determining the emotion type of the keyword;
and determining the emotion type of the broadcasting text based on the emotion type of the broadcasting statement and the emotion type of the keyword.
Optionally, referring to fig. 15, the emotion analysis module 1402 is configured to implement at least one of:
Determining emotion types of a plurality of words in the broadcasting statement based on a plurality of emotion word libraries, and determining emotion types of the broadcasting statement based on the number of words belonging to each emotion type in the broadcasting statement; wherein, each emotion word bank corresponds to one emotion type, and emotion expressed by words in the emotion word bank belongs to the emotion type corresponding to the emotion word bank;
Invoking an emotion classification model, performing emotion analysis on the broadcast statement to obtain prediction probabilities of multiple emotion types, and determining the emotion type of the broadcast statement based on the prediction probabilities of the multiple emotion types; the prediction probability of the emotion type indicates the probability that the broadcast statement belongs to the emotion type.
Optionally, referring to fig. 15, the plurality of emotion word banks includes a positive emotion word bank and a negative emotion word bank; the emotion analysis module 1402 is configured to:
Under the condition that the word is found in the positive emotion word bank, determining that the word belongs to a positive emotion type; under the condition that the word is found in the negative emotion word bank, determining that the word belongs to a negative emotion type;
determining the emotion type with the largest number of the words as the emotion type of the broadcast statement under the condition that the broadcast statement does not comprise negative words;
And when the broadcasting statement comprises negative words, determining the emotion type with the least quantity of the words as the emotion type to which the broadcasting statement belongs.
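A minimal sketch of this word-bank classification with negation handling; the word banks shown are tiny placeholders for illustration only:

# Illustrative word banks; the actual banks would be far larger.
POSITIVE_WORDS = {"good", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "hate", "broken"}
NEGATION_WORDS = {"not", "no", "never"}

def sentence_emotion(words):
    """Classify one broadcast sentence from word-bank hits, reversing the
    majority decision when a negation word is present."""
    counts = {
        "positive": sum(w in POSITIVE_WORDS for w in words),
        "negative": sum(w in NEGATIVE_WORDS for w in words),
    }
    if any(w in NEGATION_WORDS for w in words):
        return min(counts, key=counts.get)   # negation present: take the minority emotion type
    return max(counts, key=counts.get)       # otherwise: take the majority emotion type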
Optionally, referring to fig. 15, the apparatus further includes:
a frequency determining module 1406, configured to determine, in a case where the broadcast sentence includes a plurality of words, a first frequency of occurrence and a second frequency of occurrence corresponding to the plurality of words, where the first frequency of occurrence represents a frequency of occurrence of the word in the broadcast text, and the second frequency of occurrence represents a frequency of occurrence of text including the word in a text database;
A weight determining module 1407, configured to determine weights of the plurality of words based on the first occurrence frequency and the second occurrence frequency corresponding to the plurality of words, where the weights are positively related to the first occurrence frequency and negatively related to the second occurrence frequency;
the keyword determining module 1408 is configured to determine the word with the greatest weight in the broadcast sentence as the keyword in the broadcast sentence.
Optionally, referring to fig. 15, the expression data includes expression sub-data matching each word in the broadcast text; the emotion analysis module 1402 is configured to:
under the condition that the keyword is not detected in the broadcasting statement, determining expression data corresponding to the emotion type of the broadcasting statement as expression sub-data matched with each word in the broadcasting statement;
Under the condition that the keyword is detected in the broadcasting statement, determining expression sub-data corresponding to the emotion type to which the keyword belongs as expression sub-data matched with the keyword, and determining expression sub-data corresponding to the emotion type to which the broadcasting statement belongs as expression sub-data matched with a non-keyword, wherein the non-keyword refers to other words except the keyword in the broadcasting statement.
Alternatively, referring to fig. 15, the expression data is composed of a plurality of expression parameters for controlling the change of facial key points; the apparatus further comprises:
A parameter adding module 1409, configured to add an eye expression parameter in the expression data, where the eye expression parameter includes a blink parameter or an eye movement parameter, the blink parameter indicates an expression of blinking, and the eye movement parameter indicates an expression of rotating an eyeball;
the driving module 1404 is configured to drive the virtual object based on the expression data after the eye expression parameter is added.
Alternatively, referring to fig. 15, the parameter adding module 1409 is configured to implement any one of the following:
Determining a first time period according to a target frequency in the playing time period of the expression data, and adding the blink parameter into the expression data in the first time period;
Determining a second time period in the playing time period, and adding a first eyeball motion parameter into expression data of the second time period, wherein the second time period refers to a time period when the virtual object acts;
And randomly determining a third time period in the playing time period, and adding a second eyeball motion parameter into expression data of the third time period, wherein the first eyeball motion parameter indicates that the amplitude of the rotating eyeball is larger than that of the second eyeball motion parameter.
Optionally, referring to fig. 15, the expression data includes expression sub-data matching each word in the broadcast text; the apparatus further comprises:
The data adjustment module 1410 is configured to obtain a plurality of frames of the expression sub-data based on a duration of a playing time period of the expression sub-data;
The data adjustment module 1410 is further configured to adjust the expression sub-data for a plurality of frames based on a weight parameter of the expression sub-data for the plurality of frames, where the weight parameter indicates a variation range of the expression;
the driving module 1404 is configured to drive the virtual object based on the adjusted expression data.
Optionally, referring to fig. 15, the apparatus further includes:
A parameter setting module 1411 for determining a start time period, a hold time period, and a stop time period in the play time period;
the parameter setting module 1411 is further configured to sequentially increment, in time sequence, a weight parameter of the expression sub-data for a plurality of frames of the start time period from a first value to a second value, where the second value is greater than the first value;
the parameter setting module 1411 is further configured to set a weight parameter of the expression sub-data for a plurality of frames of the hold period to the second value;
The parameter setting module 1411 is further configured to set weight parameters of the plurality of frames of the expression sub-data of the termination period to sequentially decrease from the second value to the first value in a time sequence.
Optionally, referring to fig. 15, the expression data includes expression sub-data matched with each word in the broadcast text, and the broadcast audio includes an audio segment corresponding to each word in the broadcast text; the apparatus further comprises:
A data fusion module 1412, configured to determine a playing time period of the expression sub-data matched with a word based on a target duration and the playing duration of the audio segment corresponding to the word, where the duration of the playing time period is equal to the sum of the target duration and the playing duration of the audio segment, and the start playing time point of the playing time period is the start playing time point of the audio segment;
The data fusion module 1412 is further configured to determine, for any two adjacent words, an overlapping time period in a playing time period of the expression sub-data that matches the two words;
The data fusion module 1412 is further configured to fuse two expression sub-data of the same frame in the overlapping time period, and determine the expression sub-data of the multiple frames obtained by fusion as expression sub-data of the overlapping time period.
Optionally, referring to fig. 15, the driving module 1404 is configured to:
Driving the virtual object based on the expression data to obtain a multi-frame broadcasting image comprising the virtual object, wherein the expression of the virtual object in the multi-frame broadcasting image is the expression indicated by the expression data;
And respectively fusing the multi-frame broadcasting images with target images to obtain the broadcasting picture, wherein the target images are used as the background of the broadcasting picture.
It should be noted that: the video generating apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the video generating apparatus provided in the above embodiment and the video generating method embodiment belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.
The embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor to realize the operations executed in the video generating method of the embodiment.
Optionally, the computer device is provided as a terminal. Fig. 16 shows a schematic structural diagram of a terminal 1600 according to an exemplary embodiment of the present application.
Terminal 1600 includes: a processor 1601, and a memory 1602.
Processor 1601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1601 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1601 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1601 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1602 may include one or more computer-readable storage media, which may be non-transitory. Memory 1602 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 1602 is used to store at least one computer program to be executed by processor 1601 to implement the video generation method provided by the method embodiments of the present application.
In some embodiments, terminal 1600 may also optionally include: a peripheral interface 1603, and at least one peripheral. The processor 1601, memory 1602, and peripheral interface 1603 may be connected by bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1603 by buses, signal lines, or circuit boards. Optionally, the peripheral device comprises: at least one of radio frequency circuitry 1604, a display screen 1605, a camera assembly 1606, audio circuitry 1607, and a power supply 1608.
Peripheral interface 1603 may be used to connect I/O (Input/Output) related at least one peripheral to processor 1601 and memory 1602. In some embodiments, the processor 1601, memory 1602, and peripheral interface 1603 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 1601, memory 1602, and peripheral interface 1603 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1604 is used for receiving and transmitting RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1604 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1604 converts electrical signals to electromagnetic signals for transmission, or converts received electromagnetic signals to electrical signals. Optionally, the radio frequency circuit 1604 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 1604 may communicate with other devices via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (WIRELESS FIDELITY ) networks. In some embodiments, the radio frequency circuit 1604 may further include NFC (NEAR FIELD Communication) related circuits, which are not limited by the present application.
The display screen 1605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1605 is a touch display, the display 1605 also has the ability to collect touch signals on or above its surface. The touch signal may be input to the processor 1601 as a control signal for processing. At this point, the display 1605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 1605, disposed on the front panel of the terminal 1600; in other embodiments, there may be at least two displays 1605, each disposed on a different surface of the terminal 1600 or in a folded design; in other embodiments, the display 1605 may be a flexible display disposed on a curved surface or a folded surface of the terminal 1600. The display 1605 may even be arranged in an irregular, non-rectangular shape, i.e., an irregularly shaped screen. The display screen 1605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 1606 is used to capture images or video. Optionally, camera assembly 1606 includes a front camera and a rear camera. The front camera is disposed on the front panel of the terminal 1600, and the rear camera is disposed on the rear surface of the terminal 1600. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and virtual reality (VR) shooting functions or other fused shooting functions. In some embodiments, camera assembly 1606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
Audio circuitry 1607 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1601 for processing, or inputting the electric signals to the radio frequency circuit 1604 for voice communication. The microphone may be provided in a plurality of different locations of the terminal 1600 for stereo acquisition or noise reduction purposes. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1601 or the radio frequency circuit 1604 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuitry 1607 may also include a headphone jack.
Those skilled in the art will appreciate that the structure shown in Fig. 16 is not limiting, and that more or fewer components than shown may be included, certain components may be combined, or a different arrangement of components may be employed.
Optionally, the computer device is provided as a server. Fig. 17 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1700 may vary greatly depending on configuration or performance, and may include one or more processors (Central Processing Unit, CPU) 1701 and one or more memories 1702, where at least one computer program is stored in the memory 1702, and the at least one computer program is loaded and executed by the processor 1701 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
The embodiment of the application also provides a computer readable storage medium, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement the operations performed by the video generation method of the above embodiment.
The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program is loaded and executed by a processor to realize the operations performed by the video generating method of the embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is merely illustrative of the embodiments of the present application; various modifications, equivalent substitutions, improvements, and the like may be made without departing from the spirit and principles of the embodiments of the present application.

Claims (15)

1. A method of video generation, the method comprising:
acquiring a broadcasting text to be broadcasted by a virtual object;
carrying out emotion analysis on the broadcasting text to obtain an emotion type to which the broadcasting text belongs;
determining expression data matched with the broadcasting text based on the emotion type to which the broadcasting text belongs, wherein emotion expressed by the expression indicated by the expression data belongs to the emotion type;
driving the virtual object based on the expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data;
and generating a broadcast video based on broadcast audio corresponding to the broadcasting text and the broadcasting picture, wherein the broadcast video comprises the broadcast audio and the broadcasting picture.
2. The method of claim 1, wherein the broadcasting text comprises a plurality of broadcasting statements; and the carrying out emotion analysis on the broadcasting text to obtain the emotion type to which the broadcasting text belongs comprises:
carrying out emotion analysis on the broadcasting statement to obtain an emotion type to which the broadcasting statement belongs;
in the case that a keyword is detected in the broadcasting statement, determining an emotion type to which the keyword belongs;
and determining the emotion type to which the broadcasting text belongs based on the emotion type to which the broadcasting statement belongs and the emotion type to which the keyword belongs.
3. The method according to claim 2, wherein the carrying out emotion analysis on the broadcasting statement to obtain the emotion type to which the broadcasting statement belongs comprises at least one of the following:
determining emotion types to which a plurality of words in the broadcasting statement belong based on a plurality of emotion word banks, and determining the emotion type to which the broadcasting statement belongs based on the number of words belonging to each emotion type in the broadcasting statement, wherein each emotion word bank corresponds to one emotion type, and emotion expressed by words in the emotion word bank belongs to the emotion type corresponding to the emotion word bank;
invoking an emotion classification model to perform emotion analysis on the broadcasting statement to obtain prediction probabilities of a plurality of emotion types, and determining the emotion type to which the broadcasting statement belongs based on the prediction probabilities of the plurality of emotion types, wherein the prediction probability of an emotion type represents the probability that the broadcasting statement belongs to the emotion type.
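For illustration only, the lexicon-based branch of claim 3 can be sketched in a few lines of Python: each emotion type owns a word bank, the words of the broadcasting statement are counted against every bank, and the emotion type with the most hits is selected. The word banks and the whitespace tokenizer below are assumptions made for the example, not data taken from the application; the model-based branch is not shown.

```python
# Illustrative emotion word banks; a real system would load curated lexicons.
EMOTION_LEXICONS = {
    "happy": {"great", "wonderful", "celebrate", "delighted"},
    "sad":   {"loss", "unfortunately", "mourn", "regret"},
}

def classify_statement(statement: str) -> str:
    """Return the emotion type whose word bank matches the most words in the statement."""
    words = statement.lower().split()  # naive whitespace tokenizer (assumption)
    counts = {
        emotion: sum(word in lexicon for word in words)
        for emotion, lexicon in EMOTION_LEXICONS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "neutral"  # fall back when no bank matches

print(classify_statement("It is wonderful to celebrate this launch"))  # -> happy
```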
4. The method according to claim 2, wherein the method further comprises:
In the case that the broadcasting statement comprises a plurality of words, determining a first occurrence frequency and a second occurrence frequency corresponding to the words, wherein the first occurrence frequency represents the occurrence frequency of the words in the broadcasting text, and the second occurrence frequency represents the occurrence frequency of the text containing the words in a text database;
determining weights of the plurality of words based on the first occurrence frequency and the second occurrence frequency corresponding to the plurality of words, wherein the weights are positively correlated with the first occurrence frequency and negatively correlated with the second occurrence frequency;
and determining the word with the largest weight in the broadcasting statement as the keyword in the broadcasting statement.
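Claim 4 describes a TF-IDF-style weighting: the first occurrence frequency (how often a word appears in the broadcasting text) raises the weight, and the second occurrence frequency (how many texts in a text database contain the word) lowers it. A minimal sketch under that reading, with a toy text database assumed purely for illustration:

```python
import math

def keyword_of(statement_words, broadcast_text_words, text_database):
    """Return the word of a broadcasting statement with the largest TF-IDF-style weight."""
    def weight(word):
        tf = broadcast_text_words.count(word)                 # first occurrence frequency
        docs_with_word = sum(word in doc for doc in text_database)
        idf = math.log((1 + len(text_database)) / (1 + docs_with_word))
        return tf * idf                                       # frequent here, rare elsewhere -> large weight

    return max(statement_words, key=weight)

# Toy data, assumed for illustration: "launch" is frequent here but absent from the database.
database = [{"the", "weather", "is", "fine"}, {"stock", "prices", "fell"}]
text_words = ["the", "launch", "is", "here", "the", "launch", "succeeded"]
print(keyword_of(["the", "launch", "succeeded"], text_words, database))  # -> launch
```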
5. The method of claim 2, wherein the expression data includes expression sub-data that matches each word in the broadcasting text; and the determining expression data matched with the broadcasting text based on the emotion type to which the broadcasting text belongs comprises:
under the condition that the keyword is not detected in the broadcasting statement, determining expression data corresponding to the emotion type of the broadcasting statement as expression sub-data matched with each word in the broadcasting statement;
and under the condition that the keyword is detected in the broadcasting statement, determining expression sub-data corresponding to the emotion type to which the keyword belongs as expression sub-data matched with the keyword, and determining expression sub-data corresponding to the emotion type to which the broadcasting statement belongs as expression sub-data matched with a non-keyword, wherein the non-keyword refers to other words except the keyword in the broadcasting statement.
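A possible reading of claim 5 in code: if a keyword was detected, it receives the expression sub-data of its own emotion type, and every non-keyword falls back to the statement-level emotion. The expression lookup table below is a hypothetical placeholder.

```python
# Hypothetical mapping from emotion type to expression sub-data (e.g. a blendshape preset).
EXPRESSION_BY_EMOTION = {"happy": "smile_params", "sad": "frown_params", "neutral": "rest_params"}

def expression_subdata_for(words, statement_emotion, keyword=None, keyword_emotion=None):
    """Match every word of a broadcasting statement with expression sub-data."""
    result = {}
    for word in words:
        if keyword is not None and word == keyword:
            # the detected keyword uses the emotion type it belongs to
            result[word] = EXPRESSION_BY_EMOTION[keyword_emotion]
        else:
            # non-keywords use the emotion type of the whole statement
            result[word] = EXPRESSION_BY_EMOTION[statement_emotion]
    return result

print(expression_subdata_for(["we", "celebrate", "today"], "neutral",
                             keyword="celebrate", keyword_emotion="happy"))
```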
6. The method according to any one of claims 1 to 5, wherein the expression data is composed of a plurality of expression parameters for controlling a change in facial key points; after determining the expression data matched with the broadcasting text based on the emotion type to which the broadcasting text belongs, the method further comprises:
Adding an eye expression parameter into the expression data, wherein the eye expression parameter comprises a blink parameter or an eyeball motion parameter, the blink parameter indicates the expression of blinking, and the eyeball motion parameter indicates the expression of rotating eyeballs;
the driving the virtual object based on the expression data includes:
and driving the virtual object based on the expression data added with the eye expression parameters.
7. The method of claim 6, wherein the adding an eye expression parameter into the expression data comprises any one of the following:
determining, in a playing time period of the expression data, a first time period according to a target frequency, and adding the blink parameter into the expression data of the first time period;
determining a second time period in the playing time period, and adding a first eye movement parameter into expression data of the second time period, wherein the second time period refers to a time period when the virtual object acts;
And randomly determining a third time period in the playing time period, and adding a second eyeball motion parameter into expression data of the third time period, wherein the first eyeball motion parameter indicates that the amplitude of the rotating eyeball is larger than that of the second eyeball motion parameter.
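The first branch of claim 7, blink insertion at a target frequency across the playing time period, can be sketched as follows; the per-frame dictionary representation, the frame rate, and the 'blink' parameter name are assumptions made for the example.

```python
def add_blinks(frames, fps=25, blink_every_s=4.0, blink_len_frames=3):
    """Insert a blink into the expression data once every blink_every_s seconds.

    frames is assumed to hold one dict of expression parameters per video frame;
    'blink' is a hypothetical parameter name for closed eyelids.
    """
    interval = int(blink_every_s * fps)                  # frames between blink onsets (target frequency)
    for start in range(0, len(frames), interval):
        for i in range(start, min(start + blink_len_frames, len(frames))):
            frames[i]["blink"] = 1.0                     # eyelids fully closed during the blink
    return frames

frames = [{} for _ in range(250)]                        # 10 s of expression data at 25 fps
add_blinks(frames)
print(sum("blink" in f for f in frames))                 # 9: three blinks of three frames each
```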
8. The method of any of claims 1-5, wherein the expression data includes expression sub-data that matches each word in the broadcasting text; and after determining the expression data matched with the broadcasting text based on the emotion type to which the broadcasting text belongs, the method further comprises:
Acquiring a plurality of frames of expression sub-data based on the duration of the playing time period of the expression sub-data;
Adjusting the expression sub-data of a plurality of frames based on the weight parameters of the expression sub-data of the plurality of frames, wherein the weight parameters indicate the variation amplitude of the expression;
the driving the virtual object based on the expression data includes:
and driving the virtual object based on the adjusted expression data.
9. The method of claim 8, wherein the method further comprises:
determining a start time period, a hold time period and a stop time period in the playing time period;
according to a time sequence, setting the weight parameters of the multiple frames of expression sub-data of the start time period to increase sequentially from a first value to a second value, wherein the second value is larger than the first value;
setting the weight parameters of the multiple frames of expression sub-data of the hold time period to the second value;
and setting the weight parameters of the multiple frames of expression sub-data of the stop time period to decrease sequentially from the second value to the first value according to the time sequence.
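Claim 9 amounts to an attack-hold-release envelope on the weight parameter: ramp from a first value up to a second value during the start time period, hold the second value, then ramp back down during the stop time period. A linear-ramp sketch, with frame counts chosen arbitrarily for illustration:

```python
def expression_weights(n_start, n_hold, n_stop, low=0.0, high=1.0):
    """Per-frame weight parameters: ramp up, hold, then ramp down."""
    up = [low + (high - low) * i / max(n_start - 1, 1) for i in range(n_start)]
    hold = [high] * n_hold
    down = [high - (high - low) * i / max(n_stop - 1, 1) for i in range(n_stop)]
    return up + hold + down

weights = expression_weights(n_start=5, n_hold=10, n_stop=5)
print(weights[0], weights[4], weights[10], weights[-1])  # 0.0 1.0 1.0 0.0
```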
10. The method of any of claims 1-5, wherein the expression data comprises expression sub-data that matches each word in the broadcasting text, and the broadcast audio comprises an audio segment corresponding to each word in the broadcasting text;
after determining the expression data matched with the broadcasting text based on the emotion type to which the broadcasting text belongs, the method further comprises:
determining a playing time period of the expression sub-data matched with the word based on a target duration and a playing duration of the audio segment corresponding to the word, wherein the duration of the playing time period is equal to the sum of the target duration and the playing duration, and the starting playing time point of the playing time period is the starting playing time point of the audio segment;
for any two adjacent words, determining an overlapping time period between the playing time periods of the expression sub-data matched with the two words;
and in the overlapping time period, respectively fusing the two expression sub-data of the same frame, and determining the expression sub-data of the multiple frames obtained by fusion as the expression sub-data of the overlapping time period.
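In claim 10 the expression sub-data of each word starts with its audio segment but lasts a target duration longer, so adjacent words overlap and their frames in the overlap are fused. The sketch below fuses by simple averaging, which is only one possible fusion rule; the per-frame dictionary representation is likewise an assumption.

```python
def play_period(audio_start, audio_duration, target_duration):
    """Playing period of a word's expression sub-data: same start, longer duration."""
    return audio_start, audio_start + audio_duration + target_duration

def fuse_adjacent(track_a, track_b, overlap_frames):
    """Average the overlapping frames of two adjacent words' expression sub-data."""
    fused = [
        {k: (fa.get(k, 0.0) + fb.get(k, 0.0)) / 2 for k in set(fa) | set(fb)}
        for fa, fb in zip(track_a[-overlap_frames:], track_b[:overlap_frames])
    ]
    return track_a[:-overlap_frames] + fused + track_b[overlap_frames:]

word_a = [{"smile": 1.0}] * 4            # 4 frames of expression sub-data for the first word
word_b = [{"smile": 0.0, "brow": 0.5}] * 4
print(play_period(audio_start=2.0, audio_duration=0.4, target_duration=0.2))  # (2.0, 2.6)
print(fuse_adjacent(word_a, word_b, overlap_frames=2))  # two averaged frames in the middle
```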
11. The method according to any one of claims 1-5, wherein the driving the virtual object based on the expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data, comprises:
Driving the virtual object based on the expression data to obtain a multi-frame broadcasting image comprising the virtual object, wherein the expression of the virtual object in the multi-frame broadcasting image is the expression indicated by the expression data;
And respectively fusing the multi-frame broadcasting images with target images to obtain the broadcasting picture, wherein the target images are used as the background of the broadcasting picture.
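Claim 11 fuses each frame of the driven virtual object with a target image that serves as the background. A minimal alpha-compositing sketch with NumPy, assuming the rendered avatar frames carry an alpha channel that is zero outside the avatar:

```python
import numpy as np

def composite(avatar_rgba: np.ndarray, background_rgb: np.ndarray) -> np.ndarray:
    """Blend one rendered avatar frame (RGBA) over the target background image (RGB)."""
    alpha = avatar_rgba[..., 3:4] / 255.0                    # per-pixel avatar opacity
    blended = alpha * avatar_rgba[..., :3] + (1.0 - alpha) * background_rgb
    return blended.astype(np.uint8)

# Toy 2x2 frame: an opaque red avatar pixel at the top-left, transparent elsewhere.
avatar = np.zeros((2, 2, 4), dtype=np.uint8)
avatar[0, 0] = [255, 0, 0, 255]
background = np.full((2, 2, 3), 128, dtype=np.uint8)         # grey target image
out = composite(avatar, background)
print(out[0, 0], out[1, 1])                                  # [255 0 0] [128 128 128]
```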
12. A video generating apparatus, the apparatus comprising:
a text acquisition module, used for acquiring a broadcasting text to be broadcasted by a virtual object;
an emotion analysis module, used for carrying out emotion analysis on the broadcasting text to obtain an emotion type to which the broadcasting text belongs;
an expression determining module, used for determining expression data matched with the broadcasting text based on the emotion type to which the broadcasting text belongs, wherein emotion expressed by the expression indicated by the expression data belongs to the emotion type;
a driving module, used for driving the virtual object based on the expression data to obtain a broadcasting picture comprising the virtual object, so that the expression of the virtual object in the broadcasting picture is the expression indicated by the expression data;
and a video generation module, used for generating a broadcast video based on the broadcast audio corresponding to the broadcasting text and the broadcasting picture, wherein the broadcast video comprises the broadcast audio and the broadcasting picture.
13. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one computer program that is loaded and executed by the processor to implement the operations performed by the video generation method of any of claims 1 to 11.
14. A computer readable storage medium having stored therein at least one computer program that is loaded and executed by a processor to implement operations performed by a video generation method as claimed in any one of claims 1 to 11.
15. A computer program product comprising a computer program, wherein the computer program is loaded and executed by a processor to implement the operations performed by the video generation method of any one of claims 1 to 11.
CN202211399961.5A 2022-11-09 2022-11-09 Video generation method, device, computer equipment and storage medium Pending CN118052912A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211399961.5A CN118052912A (en) 2022-11-09 2022-11-09 Video generation method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118052912A true CN118052912A (en) 2024-05-17

Family

ID=91048848

Country Status (1)

Country Link
CN (1) CN118052912A (en)

Legal Events

Date Code Title Description
PB01 Publication