CN117219047A - Speech synthesis method, device, equipment, medium and product for virtual person - Google Patents

Speech synthesis method, device, equipment, medium and product for virtual person

Info

Publication number
CN117219047A
Authority
CN
China
Prior art keywords
voice
person
virtual
content
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310663996.3A
Other languages
Chinese (zh)
Inventor
吴杉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310663996.3A
Publication of CN117219047A
Legal status: Pending


Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a speech synthesis method, device, equipment, medium and product for a virtual person, belonging to the field of artificial intelligence. The method comprises the following steps: acquiring first text content, wherein the first text content comprises text information for interacting with the virtual person; inputting the first text content into a large language model to obtain second text content, wherein the large language model is used for performing natural language reply processing on the first text content; converting the second text content into first voice content based on a text-to-voice model, wherein the first voice content is the voice information corresponding to the second text content; and performing inference on the first voice content based on a virtual person voice model to obtain second voice content, wherein the second voice content carries the voice features of a target person. According to the method, first text content for interacting with the virtual person is obtained, and second voice content having the voice features of the target person is finally obtained through inference based on the large language model, the text-to-voice model and the virtual person voice model.

Description

Speech synthesis method, device, equipment, medium and product for virtual person
Technical Field
The application relates to the field of artificial intelligence, and in particular to a speech synthesis method, device, equipment, medium and product for a virtual person.
Background
With the development of artificial intelligence technology, converting text content into speech content with a general-purpose text-to-speech model has become common in many application scenarios. However, a general-purpose text-to-speech model can only generate speech content with a fixed timbre determined by the characteristics of the model itself.
In the related art, in order to provide personalized voice content in virtual person scenes, a text-to-speech model is trained with sample speech segments of a target person, so that the trained text-to-speech model can generate speech content having the voice features of the target person.
However, in practical applications, a large number of sample speech segments of the target person often cannot be obtained to train the text-to-speech model, so the speech content generated by a text-to-speech model trained only on the target person's sample speech segments may be inaccurate. For example, if the word "banking" does not appear in the target person's sample speech segments, the text-to-speech model may mispronounce "banking" in the generated speech content.
Disclosure of Invention
The application provides a speech synthesis method, device, equipment, medium and product for a virtual person. The technical solution is as follows:
according to an aspect of the present application, there is provided a method of speech synthesis for a virtual person, the method comprising:
acquiring first text content, wherein the first text content comprises text information for interacting with the virtual person;
inputting the first text content into a large language model to obtain second text content, wherein the large language model is used for performing natural language reply processing on the first text content;
converting the second text content into first voice content based on a text-to-voice model, wherein the first voice content is voice information corresponding to the second text content;
performing inference on the first voice content based on a virtual person voice model to obtain second voice content, wherein the second voice content comprises voice features of a target person;
wherein the virtual person voice model is used for converting, through inference, voice content having generic voice features into voice content having the voice features of the target person, and the number of sample words used by the text-to-voice model in the training process is greater than the number of sample words used by the virtual person voice model in the training process.
According to another aspect of the present application, there is provided a method of speech synthesis for a virtual person, the method comprising:
displaying an interactive interface comprising the virtual person;
responding to an interaction operation for interacting with the virtual person, and acquiring first text content, wherein the first text content comprises text information for interacting with the virtual person;
playing second voice content spoken by the virtual person, wherein the second voice content is obtained by performing inference on first voice content based on a virtual person voice model, the first voice content is obtained by converting second text content based on a text-to-voice model, and the second text content is obtained by performing natural language reply processing on the first text content based on a large language model;
wherein the virtual person voice model is used for converting, through inference, voice content having generic voice features into voice content having the voice features of a target person, and the number of sample words used by the text-to-voice model in the training process is greater than the number of sample words used by the virtual person voice model in the training process.
According to another aspect of the present application, there is provided a speech synthesis apparatus for a virtual person, the apparatus comprising:
The acquisition module is used for acquiring first text content, wherein the first text content comprises text information for interacting with the virtual person;
the processing module is used for inputting the first text content into a large language model to obtain second text content, and the large language model is used for performing natural language reply processing on the first text content;
the processing module is further used for converting the second text content into first voice content based on a text-to-voice model, wherein the first voice content is voice information corresponding to the second text content;
the processing module is further used for performing inference on the first voice content based on the virtual person voice model to obtain second voice content, wherein the second voice content comprises voice features of a target person;
wherein the virtual person voice model is used for converting, through inference, voice content having generic voice features into voice content having the voice features of the target person, and the number of sample words used by the text-to-voice model in the training process is greater than the number of sample words used by the virtual person voice model in the training process.
According to another aspect of the present application, there is provided a speech synthesis apparatus for a virtual person, the apparatus comprising:
The display module is used for displaying an interactive interface comprising the virtual person;
the interaction module is used for responding to the interaction operation for interacting with the virtual person and obtaining first text content, wherein the first text content comprises text information for interacting with the virtual person;
the playing module is used for playing second voice content spoken by the virtual person, wherein the second voice content is obtained by performing inference on the first voice content based on a virtual person voice model, the first voice content is obtained by converting second text content based on a text-to-voice model, and the second text content is obtained by performing natural language reply processing on the first text content based on a large language model;
wherein the virtual person voice model is used for converting, through inference, voice content having generic voice features into voice content having the voice features of a target person, and the number of sample words used by the text-to-voice model in the training process is greater than the number of sample words used by the virtual person voice model in the training process.
According to another aspect of the present application, there is provided a computer device, including: a processor and a memory, wherein at least one program is stored in the memory, and the at least one program is loaded and executed by the processor to implement the speech synthesis method for a virtual person as described in the above aspects.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one program loaded and executed by a processor to implement the speech synthesis method for a virtual person as described in the above aspect.
According to another aspect of the present application, there is provided a computer program product comprising at least one program stored in a computer readable storage medium; the processor of the computer device reads the at least one program from the computer-readable storage medium, and the processor executes the at least one program to cause the computer device to perform the speech synthesis method for a virtual person as described in the above aspect.
The technical solution provided by the application has at least the following beneficial effects:
the method comprises the steps of obtaining first text content interacted with a virtual person, obtaining second text content from the first text content based on a large language model, converting the second text content into first voice content based on a text-to-voice model, and finally reasoning the first voice content based on the virtual person voice model to obtain second voice content with voice characteristics of a target person. The number of the sample words used in the training process of the text-to-speech model is larger than that of the sample words used in the training process of the virtual human speech model, namely training samples of the text-to-speech model are massive, and the training samples of the virtual human speech model are small. Therefore, by converting the second text content into the first speech content by using the text-to-speech model trained from a large number of samples as the intermediate layer, the phenomenon of inaccurate generated speech content can be solved in the case that the training samples of the virtual human speech model are fewer. The text-to-speech model obtained through mass sample training is firstly utilized to carry out buffer treatment, and then the virtual human speech model obtained through a small amount of sample training is utilized, so that the generated second speech content can be more natural and smooth to a certain extent, and the accuracy and the robustness of speech recognition are effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a computer system provided in accordance with an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 5 is a flow chart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a time length threshold provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 8 is a flow chart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 10 is a flowchart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 11 is an interface diagram of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 12 is an interface diagram of a method of speech synthesis for a virtual person provided in accordance with an exemplary embodiment of the present application;
FIG. 13 is a flowchart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 14 is a flowchart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 15 is a flowchart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 16 is an interactive flow chart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 17 is a flowchart of a method of speech synthesis for a virtual person provided by an exemplary embodiment of the present application;
FIG. 18 is a block diagram of a speech synthesis apparatus for a virtual person provided in accordance with an exemplary embodiment of the present application;
FIG. 19 is a block diagram of a speech synthesis apparatus for a virtual person provided in accordance with an exemplary embodiment of the present application;
FIG. 20 is a block diagram of a computer device provided by an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, the terms involved in the embodiments of the present application are briefly described:
Virtual person: a virtual character used to simulate the language expression, body movements and thinking mode of a real person; it refers to an artificial intelligence entity created by computer programs and computer technology. Application fields of virtual persons include at least one of intelligent customer service, live-streaming interaction, games and education.
Tone migration: refers to a Voice Conversion (VC) technique capable of converting the voice of speaker A into the voice of speaker B. The sound characteristics of the source speech (speaker A's voice) and the target speech (speaker B's voice) are analyzed, and the source speech is then processed by an algorithm so that it has sound characteristics similar to those of the target speech.
Text-to-speech: an artificial intelligence technology used to convert text content into speech content; it enables a computer to simulate the human voice, thereby realizing functions such as speech synthesis and speech recognition.
Large language model (Large Language Model, LLM): refers to a model for natural language reply processing, trained on an amount of text data greater than a threshold, whose structure typically contains billions of parameters. Such a model automatically learns the rules and semantic information of text sequences so as to generate human-like natural language text; it is widely applied in fields such as natural language processing, machine translation and dialogue systems, and has strong semantic understanding and language generation capabilities.
Artificial intelligence (Artificial Intelligence, AI): is a technical science that studies, develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. Artificial intelligence can simulate a human thinking model to accomplish a task. In a virtual environment, artificial intelligence can simulate the manner in which a user controls a virtual character to control the virtual character, e.g., control the virtual character to walk in the virtual environment, attack other virtual characters. Artificial intelligence may refer to programs, algorithms, software that simulate the thinking patterns of a human, and the execution subject may be a computer system, a server, or a terminal.
FIG. 1 illustrates a block diagram of a computer system provided in accordance with an exemplary embodiment of the present application. The computer system 100 includes: a terminal 120 and a server 140.
The terminal (or client) 120 installs and runs an application program supporting a virtual person. The application may be any one of an entertainment application, a live broadcast application, and a game application. The terminal 120 is a terminal used by a user, who interacts with a dummy using the terminal 120.
The terminal 120 is connected to the server 140 through a wireless network or a wired network.
Server 140 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. Illustratively, the server 140 includes a processor 144 and a memory 142, the memory 142 in turn including a display module 1421, an input/output module 1422, and a detection module 1423. The server 140 is used to provide background services for applications supporting virtual humans. Optionally, the server 140 takes on primary computing work and the terminal 120 takes on secondary computing work; alternatively, the server 140 takes on secondary computing work and the terminal 120 takes on primary computing work; alternatively, a distributed computing architecture is employed between the server 140 and the terminal 120 for collaborative computing.
Those skilled in the art will recognize that the number of terminals may be greater or smaller. For example, there may be only one terminal, or there may be tens or hundreds of terminals, or more. The embodiments of the application do not limit the number of terminals or the device type.
It should be noted that, the information (including but not limited to user equipment information, user account information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.), and signals related to the present application are all authorized by the user or fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant country and region. For example, the information involved in the present application is obtained under the condition of sufficient authorization, and the client and the server only buffer the information during the running of the program, and do not solidify the relevant data for storing and secondarily utilizing the information.
Fig. 2 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method is performed by the server 140 in the computer system 100 shown in fig. 1, and includes:
Step 220: acquiring first text content;
the first text content includes text information that interacts with the dummy. In some embodiments, the first text content is obtained based on input content entered by a user. Optionally, the first text content includes text content entered by a user. Illustratively, the text content input by the user is "hello", and "hello" is the first text content. Optionally, the first text content includes a text content obtained based on voice content input by a user. Illustratively, the user inputs the voice content, and based on the voice recognition model, the voice content input by the user is recognized to obtain the first text content.
Step 240: inputting the first text content into a large language model to obtain second text content;
the first text content is input into the large language model, and the large language model performs natural language reply processing on the first text content to obtain the second text content. Illustratively, in the case where the first text content is "hello", "hello" is input into the large language model, and the resulting second text content is "I am fine". Optionally, the large language model is a ChatGPT model. Alternatively, the large language model is a GPT-4 model.
Step 260: converting the second text content into first speech content based on the text-to-speech model;
based on the text-to-speech model capable of converting text content into speech content, the second text content is input into the text-to-speech model, and the converted first speech content is output. In some embodiments, the first voice content is the voice information corresponding to the second text content. Illustratively, if the second text content is "I am fine", the first speech content is the speech corresponding to "I am fine".
In some embodiments, the text-to-speech model is a generic model trained based on a large number of samples, the number of sample words used by the text-to-speech model in the training process being massive.
Step 280: performing inference on the first voice content based on the virtual person voice model to obtain the second voice content.
The first voice content is input into the virtual person voice model for inference to obtain the second voice content. The virtual person voice model is used for converting, through inference, voice content having generic voice features into voice content having the voice features of a target person. That is, the second voice content carries the voice features of the target person. In some embodiments, the voice features of the target person include at least one of timbre features, emotion features and prosodic features.
In some embodiments, the virtual person voice model is implemented by a deep neural network. In the embodiments of the present application, the virtual person voice model is implemented by a convolutional neural network, but those skilled in the art will appreciate that the model type and topology of the deep neural network are not limited to convolutional neural networks. In some embodiments, the convolutional neural network implementing the virtual person voice model includes at least one of an input layer, an inference layer and an output layer. The input layer is used for encoding the input data of the virtual person voice model into feature vectors; the inference layer is used for imparting the target timbre to the output of the input layer; the output layer is used for outputting the result of the inference layer.
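By way of illustration only, the three-part structure described above can be sketched as follows; this is a minimal sketch under assumed layer sizes, and the class name VirtualPersonVoiceModel is a placeholder rather than the network actually used in the application.

import torch
import torch.nn as nn

class VirtualPersonVoiceModel(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, in_dim: int = 80, hidden: int = 256):
        super().__init__()
        # Input layer: encodes the input speech features (e.g. mel-spectrogram frames) into feature vectors.
        self.input_layer = nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2)
        # Inference layer: convolutional blocks that impart the target person's timbre.
        self.inference_layer = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Output layer: projects back to speech features carrying the target timbre.
        self.output_layer = nn.Conv1d(hidden, in_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim, frames), generic-timbre speech features
        h = torch.relu(self.input_layer(x))
        h = self.inference_layer(h)
        return self.output_layer(h)  # speech features with the target person's timbre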
In some embodiments, the virtual person voice model is trained based on open-source, free AI voice conversion software: a user builds an AI voice library of the target person from training data and converts a piece of speech or singing into the desired timbre, which can be understood as voice-to-voice conversion (Voice to Voice). The AI voice conversion software uses an end-to-end architecture and can handle the voice conversion task.
In some embodiments, the virtual person voice model is a timbre imparting model trained on a limited number of samples, and the number of sample words used by the virtual person voice model in the training process is small. In some embodiments, the number of sample words used in training the text-to-speech model is greater than the number of sample words used in training the virtual person voice model. It should be understood that, as long as the number of sample words used in training the text-to-speech model is greater than the number of sample words used in training the virtual person voice model, the problem of inaccurate second voice content caused by the small number of training samples of the virtual person voice model can be alleviated. When the number of sample words used in training the text-to-speech model is far greater than the number used in training the virtual person voice model, the speech synthesis method for a virtual person provided by the embodiments of the present application achieves a better effect.
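To make the relationship between the models concrete, the following is a minimal sketch of the flow of steps 240 to 280, given the first text content acquired in step 220. The function name and the reply/synthesize/convert interfaces are placeholders assumed for illustration, not APIs defined by the application.

def synthesize_for_virtual_person(first_text, large_language_model,
                                  text_to_speech_model, virtual_person_voice_model):
    # Step 240: natural language reply processing on the first text content.
    second_text = large_language_model.reply(first_text)
    # Step 260: convert the reply into generic-timbre speech (the first voice content).
    first_voice = text_to_speech_model.synthesize(second_text)
    # Step 280: infer the second voice content carrying the target person's voice features.
    second_voice = virtual_person_voice_model.convert(first_voice)
    return second_voice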
In summary, in the method provided by the embodiments of the present application, first text content for interacting with the virtual person is acquired, second text content is obtained from the first text content based on a large language model, the second text content is converted into first voice content based on a text-to-voice model, and second voice content having the voice features of the target person is finally obtained by performing inference on the first voice content based on the virtual person voice model. The number of sample words used in training the text-to-voice model is greater than the number of sample words used in training the virtual person voice model; that is, the training samples of the text-to-voice model are massive while the training samples of the virtual person voice model are few. Therefore, by using the text-to-voice model trained on massive samples as an intermediate layer to convert the second text content into the first voice content, the problem of inaccurate generated speech content can be alleviated even when the virtual person voice model has few training samples. By first performing this buffering step with the text-to-voice model trained on massive samples, and then applying the virtual person voice model trained on a small number of samples, the generated second voice content is, to a certain extent, more natural and fluent, which effectively improves the accuracy and robustness of the generated speech.
In some embodiments, the virtual human voice model is trained. Fig. 3 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method further comprises the following steps:
step 320: acquiring a sample voice fragment of a target person;
a sample speech segment for training the virtual person voice model is obtained, the sample speech segment being a speech segment having the voice features of the target person. In some embodiments, the sample speech segment is cut from an existing speech segment. Illustratively, interview speech of a public figure is obtained by cutting interview clips from news media, the public figure being the target person. In some embodiments, the sample speech segment is recorded audio recorded by the target person. In some embodiments, the number of words in the sample speech segment is less than the number of sample words used to train the text-to-speech model.
In some embodiments, where it is desired to obtain a sample speech segment of a target person, authorization to use the sample speech segment is first obtained. In some embodiments, the authorized use of the sample speech segment of the target person is prompted through an authorization pop-up window, and the sample speech segment of the target person is obtained if the authorization is confirmed.
In some embodiments, the authorization is only used to provide the sample voice segment of the target person, and the system automatically deletes the relevant information after the sample voice segment of the target person is obtained.
Step 340: based on the sample voice fragments, training the virtual human voice model to obtain a trained virtual human voice model.
In some embodiments, the virtual person voice model is trained based on sample speech segments having the voice features of the target person, so that the trained virtual person voice model is capable of converting, through inference, voice content having generic voice features into voice content having the voice features of the target person.
In some embodiments, as shown in fig. 4, the step 340 further includes the following sub-steps:
step 341: extracting the voice characteristics of a target person in the sample voice fragment;
in some embodiments, based on the sample speech segments, speech features in the sample speech segments are extracted as speech features of the target person. In some embodiments, the sample speech segment includes speech features of the target person. Optionally, the voice features of the target person in the sample voice segment include at least one of timbre features, emotion features and prosody features of the target person.
In some embodiments, the timbre features of the target person in the sample speech segment are extracted. In some embodiments, the timbre features of the target person include at least one of a deep timbre, a bright timbre and a youthful timbre. In some embodiments, the timbre features of the target person in the sample speech segment are extracted by a pre-trained first feature extraction network. The first feature extraction network is trained on a first sample speech training set comprising at least one piece of sample speech, each piece of sample speech having a corresponding sample timbre feature. The at least one piece of sample speech is input into the first feature extraction network, and a predicted timbre feature is output. The predicted timbre feature is compared with the sample timbre feature serving as the label, and a first error loss is calculated. The parameters of the first feature extraction network are trained based on the first error loss through an error back-propagation algorithm, and after multiple rounds of training, for example after training on ten thousand samples or when the model converges, a trained first feature extraction network capable of extracting timbre features is obtained.
In some embodiments, the emotion features of the target person in the sample speech segment are extracted. In some embodiments, the emotion features of the target person include at least one of a serious emotion, a happy emotion and a confused emotion. In some embodiments, the emotion features of the target person in the sample speech segment are extracted by a pre-trained second feature extraction network. The second feature extraction network is trained on a second sample speech training set comprising at least one piece of sample speech, each piece of sample speech having a corresponding sample emotion feature. The at least one piece of sample speech is input into the second feature extraction network, and a predicted emotion feature is output. The predicted emotion feature is compared with the sample emotion feature serving as the label, and a second error loss is calculated. The parameters of the second feature extraction network are trained based on the second error loss through an error back-propagation algorithm, and after multiple rounds of training, for example after training on ten thousand samples or when the model converges, a trained second feature extraction network capable of extracting emotion features is obtained.
In some embodiments, the prosodic features of the target person in the sample speech segment are extracted. In some embodiments, the prosodic features of the target person include at least one of long pauses, short pauses and pitch changes. In some embodiments, the prosodic features of the target person in the sample speech segment are extracted by a pre-trained third feature extraction network. The third feature extraction network is trained on a third sample speech training set comprising at least one piece of sample speech, each piece of sample speech having a corresponding sample prosodic feature. The at least one piece of sample speech is input into the third feature extraction network, and a predicted prosodic feature is output. The predicted prosodic feature is compared with the sample prosodic feature serving as the label, and a third error loss is calculated. The parameters of the third feature extraction network are trained based on the third error loss through an error back-propagation algorithm, and after multiple rounds of training, for example after training on ten thousand samples or when the model converges, a trained third feature extraction network capable of extracting prosodic features is obtained.
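The training procedure described above for the first, second and third feature extraction networks follows the same supervised pattern; a rough sketch is given below, in which the network, dataset and hyperparameters are illustrative assumptions rather than details taken from the application.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_feature_extractor(network: nn.Module, sample_loader: DataLoader,
                            epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    criterion = nn.CrossEntropyLoss()                  # error loss between prediction and label
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)
    for _ in range(epochs):                            # repeated rounds of training
        for sample_speech, sample_label in sample_loader:
            predicted = network(sample_speech)         # predicted timbre/emotion/prosodic feature
            loss = criterion(predicted, sample_label)  # compare with the sample feature used as label
            optimizer.zero_grad()
            loss.backward()                            # error back-propagation
            optimizer.step()                           # update the network parameters
    return network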
Step 342: based on the voice characteristics of the target person, training the virtual person voice model to obtain a trained virtual person voice model.
In some embodiments, the speech characteristics of the target person include at least one of a timbre characteristic, an emotion characteristic, and a prosody characteristic of the target person.
In some embodiments, where only one feature is included in the target person's speech features, the virtual person speech model is trained based on the one feature, resulting in a trained virtual person speech model. In some embodiments, the virtual human voice model is trained based on timbre characteristics of the target person to obtain a trained virtual human voice model. In some embodiments, the virtual human voice model is trained based on emotional characteristics of the target person to obtain a trained virtual human voice model. In some embodiments, the virtual human voice model is trained based on prosodic features of the target person, resulting in a trained virtual human voice model.
In some embodiments, in the case where at least two features are included in the speech features of the target person, the virtual person speech model is trained based on the at least two features, resulting in a trained virtual person speech model. In some embodiments, the virtual human voice model is trained based on timbre features and emotion features of the target person to obtain a trained virtual human voice model. In some embodiments, the virtual human voice model is trained based on timbre characteristics and prosody characteristics of the target person to obtain a trained virtual human voice model. In some embodiments, the virtual human voice model is trained based on the emotion features and prosody features of the target person to obtain a trained virtual human voice model. In some embodiments, the virtual human voice model is trained based on timbre characteristics, emotion characteristics and prosody characteristics of the target person to obtain a trained virtual human voice model.
In some embodiments, in the case that at least two features are included in the voice features of the target person, the at least two features are spliced to obtain a spliced feature. Training the virtual human voice model based on the splicing characteristics to obtain a trained virtual human voice model.
In some embodiments, when the voice feature of the target person includes at least two features, weights corresponding to the at least two features are set, and the at least two features are spliced based on the weights corresponding to the at least two features, so as to obtain a spliced feature. In some embodiments, the weights for each of the at least two features are fixed values that are preset. In some embodiments, the weights for each of the at least two features are automatically adjusted. In some embodiments, the weights of each of the at least two features are set to be manually adjustable by a user.
In some embodiments, the speech styles of the target person are identified based on a speech style recognition model. The target person's speech styles include at least one of a first speech style, a second speech style, and a third speech style. In some embodiments, the speech style of the target person is related to a weight of a speech feature of the target person, the speech feature of the target person including at least one of a timbre feature, an emotion feature, a prosodic feature of the target person.
In some embodiments, where the target person's speech style is the first speech style, the target person's timbre features are weighted more than the emotion features and the timbre features are weighted more than the prosodic features. For example, in the case where the speech style of the target person is the first speech style, the weight of the timbre feature is 50%, the weight of the emotion feature is 25%, and the weight of the prosody feature is 25%.
In some embodiments, where the target person's speech style is the second speech style, the weight of the emotion feature of the target person is greater than the weight of the timbre feature and the weight of the emotion feature is greater than the weight of the prosody feature. Illustratively, in the case where the speech style of the target person is the second speech style, the weight of the timbre feature is 25%, the weight of the emotion feature is 50%, and the weight of the prosody feature is 25%.
In some embodiments, where the target person's speech style is a third speech style, the target person's prosodic features are weighted more than the timbre features and the prosodic features are weighted more than the emotion features. For example, in the case where the speech style of the target person is the third speech style, the weight of the timbre feature is 25%, the weight of the emotion feature is 25%, and the weight of the prosody feature is 50%.
In some embodiments, the weights of the speech features of the target person are determined based on the speech styles of the target person. Based on the weight of the voice characteristics of the target person, splicing the voice characteristics of the target person to obtain splicing characteristics; or fusing the voice characteristics of the target person based on the weight of the voice characteristics of the target person to obtain fused characteristics. Training the virtual human voice model based on the splicing features or the fusion features to obtain a trained virtual human voice model.
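As an illustration of the weighted splicing and fusion described above, the following sketch simply mirrors the 50%/25%/25% example weights; the style labels, tensor types and function name are assumptions made for illustration.

import torch

STYLE_WEIGHTS = {  # mirrors the example weights given above; the style labels are assumed
    "first":  {"timbre": 0.50, "emotion": 0.25, "prosody": 0.25},
    "second": {"timbre": 0.25, "emotion": 0.50, "prosody": 0.25},
    "third":  {"timbre": 0.25, "emotion": 0.25, "prosody": 0.50},
}

def combine_features(style: str, timbre: torch.Tensor, emotion: torch.Tensor,
                     prosody: torch.Tensor, mode: str = "splice") -> torch.Tensor:
    w = STYLE_WEIGHTS[style]
    weighted = [w["timbre"] * timbre, w["emotion"] * emotion, w["prosody"] * prosody]
    if mode == "splice":
        return torch.cat(weighted, dim=-1)   # splicing: concatenate the weighted features
    return sum(weighted)                     # fusion: weighted sum (assumes equal feature dimensions)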
In some embodiments, the speech style recognition model is trained from a sample speech training set that includes sample speech segments corresponding to sample speech styles. And inputting the sample voice fragments into a voice style recognition model, and outputting a predicted voice style. And comparing the predicted voice style with the sample voice style serving as the label, and calculating to obtain the style error loss. Model parameters in the speech style recognition model are trained based on style error loss through an error back propagation algorithm, and a trained speech style recognition model capable of recognizing a speech style is obtained after multiple training, for example, under the condition of ten thousand sample training or model convergence.
In summary, in the method provided by the embodiments of the present application, the virtual person voice model is obtained through training, so that voice content having generic voice features can be converted, through inference by the trained virtual person voice model, into voice content having the voice features of the target person, thereby avoiding generating generic-sounding voice content; meanwhile, the cost of training the virtual person voice model is low, which alleviates, to a certain extent, the problem of the high cost of voice customization.
According to the method provided by the embodiments of the present application, the speech style of the target person is identified by the speech style recognition model, so that even when only a small number of training samples are available, the weights of the voice features of the target person can be determined by identifying the speech style from those samples, thereby enabling the training of the virtual person voice model.
In some embodiments, the sample speech segment is obtained after speech processing. Fig. 5 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method further comprises the following steps:
step 420: acquiring sample voice of a target person;
and acquiring sample voice of the target person. In some embodiments, the sample speech is intercepted from an existing speech segment. Illustratively, interview voice of a public character is obtained by intercepting interview fragments of a news medium, wherein the public character is a target person. In some embodiments, the sample speech is recorded audio recorded by the targeted person.
In some embodiments, the sample voice of the target person does not include background sound information. In some embodiments, the sample voice of the target person includes background sound information in addition to the target person's speech, such as background music. In order to improve accuracy when subsequently training the virtual person voice model, when the sample voice of the target person includes background sound information, the background sound information is removed based on speech processing technology. The embodiments of the present application are described taking the case where the sample voice of the target person does not include background sound information as an example.
Step 440: performing mute or cut processing on target sound information in the sample voice to obtain a cut sample;
in some embodiments, the target sound information is included in the sample speech. The target sound information is information that affects the recognition degree of sound in the sample voice. Optionally, the target sound information is information that has an influence on the continuation of sound in the sample voice. Alternatively, the target sound information is information that has an influence on the fluency of sound in the sample voice.
In some embodiments, the target sound information includes at least one of breathing sounds, pauses and mechanical noise in the sample voice.
In some embodiments, the cut sample is obtained by identifying target sound information in the sample speech, and muting or cutting the identified target sound information. In some embodiments, target voice information in the sample voice is identified by a pre-trained voice information identification model. The voice information recognition model is obtained through training a sample voice training set, the sample voice training set comprises at least one section of sample voice, and sample voice information corresponds to each section of sample voice. At least one section of sample voice is input into the voice information recognition model, and predicted voice information is output. And comparing the predicted sound information with the sample sound information serving as the label, and calculating to obtain the information error loss. Model parameters in the voice information recognition model are trained based on information error loss through an error back propagation algorithm, and the trained voice information recognition model capable of recognizing voice information is obtained after multiple training, for example, under the condition of ten thousands of sample training or model convergence.
Step 460: and according to the time length threshold, carrying out segmentation processing on the cut samples to obtain sample voice fragments.
In some embodiments, the sheared sample has a corresponding shear duration. If the sheared sample is sheared audio of two minutes, the shearing time length corresponding to the sheared sample is two minutes.
In some embodiments, the cut samples are partitioned according to a time length threshold to obtain sample speech segments. The time length threshold is used to indicate the time length of the sample speech segment. For example, as shown in fig. 6, the shearing duration corresponding to the sheared sample is n b1 seconds, and the value of n is a positive integer greater than 0. b1 seconds is a time length threshold for performing a segmentation process on the cut samples. Then, according to the time length threshold b1 seconds, the cutting duration of the cut samples is subjected to cutting processing, n sample voice fragments are obtained, and the sample voice duration corresponding to each sample voice fragment is b1 seconds.
In some embodiments, the length of time threshold is determined based on the processing power of the system. In some embodiments, the time length threshold is a fixed value that is set in advance. In some embodiments, the time length threshold is automatically adjusted based on the processing power of the system in real time. In some embodiments, the length of time threshold is set to be manually adjustable by a user.
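A simplified sketch of the processing in steps 440 and 460 is given below; it assumes the target sound spans have already been identified (for example, by the sound information recognition model), operates on a raw sample array, and takes b1 as the time length threshold. The function and parameter names are illustrative.

import numpy as np

def preprocess_sample_voice(samples: np.ndarray, sample_rate: int,
                            target_spans: list[tuple[float, float]],
                            b1_seconds: float) -> list[np.ndarray]:
    samples = samples.copy()
    # Step 440: mute the target sound information (the spans could instead be deleted).
    for start_s, end_s in target_spans:
        samples[int(start_s * sample_rate):int(end_s * sample_rate)] = 0.0
    # Step 460: segment the cut sample by the time length threshold b1.
    seg_len = int(b1_seconds * sample_rate)
    n = len(samples) // seg_len              # n segments of exactly b1 seconds each
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n)]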
In summary, in the method provided by the embodiments of the present application, the sample voice of the target person is acquired and processed to obtain sample speech segments, so that in the process of training the virtual person voice model, sample speech segments of reasonable duration and higher sample quality can be used, and a more accurate inference result can be obtained when the virtual person voice model infers voice content having the voice features of the target person.
In some embodiments, the sample speech segments of the target person are obtained from a sample video segment containing the target person. Fig. 7 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method further comprises the following steps:
step 520: extracting a target limb action corresponding to at least one word from the sample video segment;
in some embodiments, at least one target limb action is extracted from a sample video clip containing a target person. The target limb movement is a limb movement of the target person. In some embodiments, the target limb motion is determined based on a skeletal point of the target person, the skeletal point being used to indicate a limb motion change. A sequence of skeletal point changes of the target person is determined based on the changes in at least one skeletal point of the target person. Based on the sequence of skeletal point changes, limb movements of the target person are determined.
In some embodiments, the target limb action corresponds to at least one mood word. The target limb action is the limb action of the target person corresponding to that mood word in the sample video clip. In some embodiments, the target limb actions are in one-to-one correspondence with the mood words; for example, the mood word "呢" (ne) corresponds to a first target limb action and the mood word "嘛" (ma) corresponds to a second target limb action. In some embodiments, the target limb actions are not in one-to-one correspondence with the mood words; for example, the mood words "呢" (ne) and "吗" (ma) both correspond to a third target limb action.
Step 540: constructing a limb action library of a target person based on the target limb action;
in some embodiments, in the case that m mood words and target limb actions corresponding to the m mood words are included in the sample video clip, a limb action library of the target person is constructed. The limb action library of the target person comprises at least one target limb action and a mood word corresponding to the at least one target limb action.
Step 560: based on the limb action library, limb actions of the virtual person corresponding to the mood words in the second voice content are generated.
A limb motion of the virtual person corresponding to the mood word in the second speech content is generated based on a limb motion library comprising at least one target limb motion and the mood word corresponding to the at least one target limb motion.
In some embodiments, the limb actions of the virtual person are determined based on skeletal points of the virtual person. Optionally, the skeletal points of the virtual person are consistent with the skeletal points of the target person; for example, the virtual person includes t skeletal points, the target person also includes t skeletal points, and the position proportions of the t skeletal points of the virtual person are consistent with the position proportions of the t skeletal points of the target person. Optionally, the skeletal points of the virtual person are inconsistent with the skeletal points of the target person, including the case where the number of skeletal points of the virtual person is inconsistent with the number of skeletal points of the target person, and/or the case where the position proportions of the skeletal points of the virtual person are inconsistent with the position proportions of the skeletal points of the target person.
In some embodiments, generating limb movements of the virtual person is achieved by mapping skeletal point changes of the target person into skeletal point changes of the virtual person. In some embodiments, the target limb actions corresponding to the mood words in the second voice content are found by searching or retrieving the target limb actions in the limb action library, and the skeletal point change sequence of the target person corresponding to the target limb actions is mapped to the skeletal point change sequence of the virtual person to generate the limb actions of the virtual person corresponding to the mood words in the second voice content.
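As an illustration of the limb action library and the skeletal-point mapping described above, the following sketch stores, for each mood word, the target person's skeletal-point change sequence and maps it onto the virtual person by simple proportional scaling; the data layout and the scaling rule are assumptions for illustration only.

from typing import Dict, List, Tuple

# mood word -> the target person's skeletal-point change sequence for the corresponding action
# (each frame is a list of (x, y) skeletal point coordinates)
LimbActionLibrary = Dict[str, List[List[Tuple[float, float]]]]

def generate_virtual_limb_action(mood_word: str, library: LimbActionLibrary,
                                 scale: float = 1.0) -> List[List[Tuple[float, float]]]:
    # Look up the target limb action corresponding to the mood word in the second voice content.
    target_sequence = library.get(mood_word)
    if target_sequence is None:
        return []                            # no corresponding target limb action in the library
    # Map the target person's skeletal-point change sequence onto the virtual person's
    # skeletal points; here this is a simple proportional scaling.
    return [[(x * scale, y * scale) for (x, y) in frame] for frame in target_sequence]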
In summary, in the method provided by the embodiment of the present application, the target limb motion is extracted from the sample video segment, and the limb motion library is constructed based on the target limb motion, so that the limb motion of the virtual person corresponding to the mood word in the second voice content can be generated under the condition of generating the second voice content corresponding to the virtual person.
In some embodiments, the sample speech segments of the target person are obtained from a sample video segment containing the target person. Fig. 8 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method further comprises the following steps:
step 620: extracting a target expression corresponding to at least one word from the sample video segment;
in some embodiments, at least one target expression is extracted from a sample video clip containing a target person. The target expression is an expression of the target person. In some embodiments, the expression of the target person is determined based on facial key points of the target person, the facial key points being used to indicate facial expression changes. A sequence of changes in the facial keypoints of the target person is determined based on at least one facial keypoint of the target person. Based on the sequence of changes in the facial key points, a facial expression of the target person is determined.
In some embodiments, the target expression corresponds to at least one mood word. The target expression is the expression of the target person corresponding to that mood word in the sample video clip. In some embodiments, the target expressions are in one-to-one correspondence with the mood words; for example, the mood word "呢" (ne) corresponds to a first target expression and the mood word "嘛" (ma) corresponds to a second target expression. In some embodiments, the target expressions are not in one-to-one correspondence with the mood words; for example, the mood words "呢" (ne) and "吗" (ma) both correspond to a third target expression.
Step 640: constructing an expression library of a target person based on the target expression;
in some embodiments, in the case that m mood words and target expressions corresponding to the m mood words are included in the sample video clip, an expression library of the target person is constructed. The expression library of the target person comprises at least one target expression and a mood word corresponding to the at least one target expression.
Step 660: based on the expression library, generating expressions of the virtual persons corresponding to the mood words in the second voice content.
Based on the expression library comprising at least one target expression and the mood word corresponding to the at least one target expression, the expression of the virtual person corresponding to the mood word in the second speech content is generated.
In some embodiments, the expression of the virtual person is determined based on facial key points of the virtual person. Optionally, the face key points of the virtual person are consistent with the face key points of the target person, for example, the virtual person includes y face key points, the target person also includes y face key points, and the position proportion of the y face key points of the virtual person is consistent with the position proportion of the y face key points of the target person. Optionally, the face key points of the virtual person are inconsistent with the face key points of the target person, including that the number of face key points of the virtual person is inconsistent with the number of face key points of the target person, and/or that the position proportion of the face key points of the virtual person is inconsistent with the position proportion of the face key points of the target person.
In some embodiments, generating the expression of the virtual person is accomplished by mapping the facial keypoint variation of the target person into the facial keypoint variation of the virtual person. In some embodiments, the target expression corresponding to the mood word in the second voice content is found by searching or retrieving the target expression in the expression library, and the facial key point change sequence of the target person corresponding to the target expression is mapped to the facial key point change sequence of the virtual person, so as to generate the expression of the virtual person corresponding to the mood word in the second voice content.
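Analogously to the limb action library, the expression library can be sketched as a mapping from mood words to facial key point change sequences; the sketch below assumes an external detector has already extracted the key point sequences and, for simplicity, that the virtual person shares the target person's key point layout. All names are illustrative.

from typing import Dict, List, Tuple

KeypointFrame = List[Tuple[float, float]]    # facial key points in one video frame

def build_expression_library(
        samples: List[Tuple[str, List[KeypointFrame]]]) -> Dict[str, List[KeypointFrame]]:
    # samples: (mood word, facial key point change sequence of the target person)
    return {mood_word: sequence for mood_word, sequence in samples}

def generate_virtual_expression(mood_word: str,
                                library: Dict[str, List[KeypointFrame]]) -> List[KeypointFrame]:
    # Map the target person's facial key point change sequence onto the virtual person;
    # since the same key point layout is assumed here, the sequence is reused directly.
    return library.get(mood_word, [])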
In summary, in the method provided by the embodiments of the present application, the target expression is extracted from the sample video segment, and the expression library is constructed based on the target expression, so that the expression of the virtual person corresponding to the mood word in the second voice content can be generated when generating the second voice content corresponding to the virtual person.
Fig. 9 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. In some embodiments, the above method further comprises the steps of:
step 720: acquiring a virtual person creation request sent by a user account in a social application program;
in some embodiments, a virtual person creation request sent by a user account in a social application is obtained, the virtual person creation request being used to create a virtual person corresponding to the user account.
In some embodiments, the social application refers to an application that supports the creation of virtual persons. Optionally, the social application includes at least one of a live class application, an entertainment class application, and a gaming class application.
In some embodiments, the user account is an account that the user logs in to the social application, the user account being bound to the created virtual person. That is, each user account corresponds to its own virtual person. In some embodiments, the virtual person created by the user includes at least one of a virtual person character, a virtual person figure, a virtual person name.
Step 740: acquiring real person information and a sample voice fragment corresponding to a user account from a self-timer video published by the user account;
in some embodiments, self-timer videos published by a user account are collected or acquired, and the number of self-timer videos is not limited. In some embodiments, the social application in which the user account publishes the self-timer video is the same social application that supports creating virtual persons. In some embodiments, the social application in which the user account publishes the self-timer video is a different social application from the social application that supports creating a virtual person. Optionally, the social application program in which the user account publishes the self-timer video is social application program A, the social application program supporting creation of the virtual person is social application program B, and the user account logged in on social application program A and the user account logged in on social application program B are the same user account.
In some embodiments, real person information corresponding to the user account is obtained from a self-timer video published by the user account. The real person information includes at least one of an appearance characteristic, a posture characteristic, and a stature proportion of the real person.
In some embodiments, a sample speech segment is obtained from a self-timer video published by a user account, the sample speech segment being used to train a virtual human speech model. In some embodiments, the sample speech segment is captured from a self-timer video published by the user account. In some embodiments, the sample speech segments are spliced based on self-timer video published by the user account.
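Purely for illustration, and not the actual implementation, splicing the audio tracks of several published self-timer videos into one sample speech segment might look like the sketch below; the use of pydub/ffmpeg and all file names are assumptions:

```python
from pydub import AudioSegment  # assumes ffmpeg is installed for video decoding

def build_sample_speech_segment(video_paths: list[str], out_path: str) -> None:
    """Splice the audio tracks of several self-timer videos into one sample
    speech segment used to train the virtual person voice model."""
    combined = AudioSegment.empty()
    for path in video_paths:
        combined += AudioSegment.from_file(path)  # decode the video's audio track
    combined.export(out_path, format="wav")

# Hypothetical usage with two self-timer videos published by the user account.
build_sample_speech_segment(["selfie_1.mp4", "selfie_2.mp4"], "sample_speech.wav")
```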
Step 760: generating a virtual person corresponding to the user account based on the real person information corresponding to the user account;
in some embodiments, the real person information corresponding to the user account includes at least one of an appearance characteristic, a posture characteristic, and a stature proportion of the real person. In some embodiments, the virtual person corresponding to the user account includes virtual person information. The virtual person information is consistent with the real person information corresponding to the user account; for example, if the real person information corresponding to the user account includes z information parameters, the virtual person information also includes z information parameters, and the z information parameters in the real person information correspond one-to-one to the z information parameters in the virtual person information. The virtual person corresponding to the user account is generated through the z information parameters in one-to-one correspondence, so that the virtual person has virtual person information consistent with the real person information corresponding to the user account.
Step 780: and binding the virtual human voice model obtained based on the sample voice fragment training with the virtual human.
In some embodiments, the virtual human voice model is trained based on sample voice segments obtained from self-timer videos published by the user account. The virtual person is generated based on real person information obtained from a self-timer video published by the user account. That is, the virtual person voice model and the virtual person are both corresponding to the user account, the virtual person voice model and the virtual person are bound, and the second voice content obtained based on the reasoning of the virtual person voice model is bound with the virtual person.
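A minimal sketch of this binding relationship, for illustration only, assuming the server keeps a simple registry keyed by user account; all class and field names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualPerson:
    user_account: str
    virtual_person_info: dict = field(default_factory=dict)  # from real person info
    voice_model_path: str = ""  # bound virtual person voice model

registry: dict[str, VirtualPerson] = {}

def bind_voice_model(user_account: str, voice_model_path: str) -> None:
    """Bind the voice model trained on the user's sample voice segment to the
    virtual person created for the same user account."""
    registry[user_account].voice_model_path = voice_model_path
```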
In summary, according to the method provided by the embodiment of the application, the virtual person corresponding to the user account is generated by acquiring the virtual person creation request sent by the user account in the social application program, and the corresponding virtual person voice model is obtained by training, so that the user can generate, according to the user's own needs and ideas, a virtual person with personal characteristics and the second voice content spoken by the virtual person.
Fig. 10 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method is performed by a terminal 120 in the computer system 100 shown in fig. 1, and includes:
step 820: displaying an interactive interface comprising a virtual person;
in some embodiments, an interactive interface including a virtual person is displayed on an interface of an application. The interactive interface is used for interaction between the virtual person and the user. Illustratively, as shown in FIG. 11, an interactive interface 12 including a virtual person 11 is displayed on the interface of the entertainment-type application. Illustratively, as shown in FIG. 12, an interactive interface 12 including a virtual person 11 is displayed on the interface of the live-type application.
In some embodiments, the interactive interface includes at least one text box for displaying text, and text content is displayed in the at least one text box, the text content including a first text content and a second text content. Illustratively, as shown in FIG. 11, a first text box 14, a second text box 15, and a third text box 16 are included in the interactive interface 12. Wherein the first text box 14 is for displaying a first text content, the second text box 15 is for displaying a second text content, and the third text box 16 is for providing a user with input of an interactive content for interaction with the virtual person 11.
In some embodiments, the first text content includes text information that interacts with the virtual person. In some embodiments, the first text content is obtained based on input content entered by a user. Optionally, the first text content includes text content entered by a user. Illustratively, the text content input by the user is "hello", and "hello" is the first text content. Optionally, the first text content includes a text content obtained based on voice content input by a user. Illustratively, the user inputs the voice content, and based on the voice recognition model, the voice content input by the user is recognized to obtain the first text content.
In some embodiments, the second text content is obtained by performing natural language reply processing on the first text content based on a large language model. The first text content is input into a large language model, and the large language model is used for carrying out natural language reply processing on the first text content to obtain the second text content subjected to the natural language reply processing. Illustratively, in the case where the first text content is "hello", "hello" is input into the large language model, resulting in the second text content being "I'm fine". Optionally, the large language model is a ChatGPT model. Alternatively, the large language model is a GPT-4 model.
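As a hedged illustration of this step, since the embodiment does not prescribe any particular API, obtaining the second text content from an OpenAI-style chat completion endpoint might look like this; the client library, model name and prompt handling are assumptions:

```python
from openai import OpenAI  # assumption: an OpenAI-style SDK is available

client = OpenAI()

def natural_language_reply(first_text_content: str) -> str:
    """Send the first text content to a large language model and return the
    second text content (the natural language reply)."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[{"role": "user", "content": first_text_content}],
    )
    return response.choices[0].message.content

# e.g. natural_language_reply("hello") might return "I'm fine".
```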
Step 840: responding to an interaction operation for interacting with the virtual person, and acquiring first text content;
in some embodiments, the first text content is obtained in response to an interactive operation by the user for interacting with the virtual person. The interactive operation includes at least one of single click, double click, left and right sliding, up and down sliding, long press, hover, face recognition, and voice recognition. It should be noted that the interactive operation includes, but is not limited to, the above mentioned several operations, and those skilled in the art should be aware that any operation capable of implementing the above mentioned functions falls within the scope of the embodiments of the present application.
Step 860: and playing the second voice content taught by the virtual person.
In some embodiments, the second voice content spoken by the virtual person is played. In some embodiments, the second voice content spoken by the virtual person is played in response to a trigger operation for the play control. The triggering operation includes at least one of single click, double click, left and right sliding, up and down sliding, long press, hover, face recognition, and voice recognition. It should be noted that the triggering operation includes, but is not limited to, the above mentioned several operations, and those skilled in the art should be aware that any operation capable of implementing the above mentioned functions falls within the scope of the embodiments of the present application. Illustratively, as shown in fig. 11, a play control 13 is included on the interactive interface 12, where the play control 13 is configured to control playing of the second voice content, and in response to a triggering operation for the play control 13, the second voice content spoken by the virtual person 11 is played.
In some embodiments, the second voice content spoken by the virtual person and having the voice characteristics of the target person, including at least one of a timbre characteristic, an emotion characteristic, and a prosody characteristic, is played. The second voice content is obtained by reasoning the first voice content based on a virtual person voice model, and the virtual person voice model is used for reasoning voice content with general voice characteristics into voice content with the voice characteristics of the target person. That is, the second voice content includes the voice characteristics of the target person. In some embodiments, the voice characteristics of the target person include at least one of a timbre characteristic, an emotion characteristic, and a prosody characteristic.
In some embodiments, the first voice content is converted from the second text content based on the text-to-speech model. In some embodiments, the first voice content is voice information corresponding to the second text content. Illustratively, the second text content is "I'm fine", and the first voice content is the voice content corresponding to "I'm fine".
In summary, by displaying the interactive interface including the virtual person and playing the second voice content spoken by the virtual person, the method provided by the embodiment of the application realizes playing voice content with the voice characteristics of the target person and avoids playing voice content with a generic timbre.
Fig. 13 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method further comprises the following sub-steps:
step 920: and playing the interactive video comprising the limb actions of the virtual person.
In some embodiments, an interactive video including limb movements of a virtual person is played, the limb movements of the virtual person corresponding to the mood words in the second speech content, the limb movements of the virtual person being obtained from a limb movement library constructed based on sample video clips containing the target person.
In some embodiments, the interactive video comprising limb movements of the virtual person is a video stream, the video stream being pre-generated. In some embodiments, the interactive video comprising limb movements of the virtual person is a video frame, based on which the interactive video comprising limb movements of the virtual person is generated.
In some embodiments, at least one target limb action is extracted from a sample video clip containing a target person. The target limb movement is a limb movement of the target person. In some embodiments, the target limb motion is determined based on a skeletal point of the target person, the skeletal point being used to indicate a limb motion change. A sequence of skeletal point changes of the target person is determined based on the changes in at least one skeletal point of the target person. Based on the sequence of skeletal point changes, limb movements of the target person are determined.
In some embodiments, the target limb action corresponds to at least one mood word. The target limb action is a limb action of the target person corresponding to the mood word in the sample video clip. In some embodiments, the target limb actions are in one-to-one correspondence with the mood words; for example, a first mood word corresponds to a first target limb action and a second mood word corresponds to a second target limb action. In some embodiments, the target limb actions are not in one-to-one correspondence with the mood words; for example, two different mood words may both correspond to a third target limb action.
In some embodiments, in the case that m mood words and target limb actions corresponding to the m mood words are included in the sample video clip, a limb action library of the target person is constructed. The limb action library of the target person comprises at least one target limb action and a mood word corresponding to the at least one target limb action. In some embodiments, the limb movements of the virtual person are determined based on skeletal points of the virtual person. Optionally, the skeletal points of the virtual person are consistent with the skeletal points of the target person, e.g., the virtual person includes t skeletal points, the target person also includes t skeletal points, and the ratio of the positions of the t skeletal points of the virtual person to the ratio of the positions of the t skeletal points of the target person is consistent. Optionally, the skeletal points of the virtual person are inconsistent with the skeletal points of the target person, including the number of skeletal points of the virtual person and the number of skeletal points of the target person being inconsistent, and/or the ratio of the locations of the skeletal points of the virtual person and the ratio of the locations of the skeletal points of the target person being inconsistent.
In some embodiments, generating limb movements of the virtual person is achieved by mapping skeletal point changes of the target person into skeletal point changes of the virtual person. In some embodiments, the target limb actions corresponding to the mood words in the second voice content are found by searching or retrieving the target limb actions in the limb action library, and the skeletal point change sequence of the target person corresponding to the target limb actions is mapped to the skeletal point change sequence of the virtual person to generate the limb actions of the virtual person corresponding to the mood words in the second voice content.
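For illustration only, the limb action library described above can be thought of as a mapping from each mood word to the skeletal point change sequence observed around that word in the sample video clip; the aligned word timings and the pose frames are assumed inputs, not part of the original disclosure:

```python
import numpy as np

def build_limb_action_library(mood_word_times: dict[str, tuple[float, float]],
                              pose_frames: np.ndarray,
                              fps: float) -> dict[str, np.ndarray]:
    """Build the limb action library of the target person.

    mood_word_times: mood word -> (start_s, end_s) in the sample video clip,
                     assumed to come from an aligned transcript.
    pose_frames:     array of shape (num_frames, num_skeletal_points, 2),
                     assumed to come from a pose-estimation model.
    """
    library: dict[str, np.ndarray] = {}
    for word, (start_s, end_s) in mood_word_times.items():
        start_f, end_f = int(start_s * fps), int(end_s * fps)
        # Skeletal point change sequence of the target person for this mood word.
        library[word] = pose_frames[start_f:end_f]
    return library
```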
In summary, the method provided by the embodiment of the application plays the interactive video including the limb actions of the virtual person, so that the virtual person is more interesting in the process of interaction.
Fig. 14 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method further comprises the following sub-steps:
step 940: and playing the interactive video comprising the expression of the virtual person.
In some embodiments, an interactive video including expressions of a virtual person is played, the expressions of the virtual person correspond to the mood words in the second voice content, the expressions of the virtual person are obtained from an expression library, and the expression library is constructed based on a sample video segment containing a target person.
In some embodiments, the interactive video comprising the expression of the virtual person is a video stream, the video stream being pre-generated. In some embodiments, the interactive video comprising the expression of the virtual person is a video frame, based on which the interactive video comprising the expression of the virtual person is generated.
In some embodiments, at least one target expression is extracted from a sample video clip containing a target person. The target expression is an expression of the target person. In some embodiments, the expression of the target person is determined based on facial key points of the target person, the facial key points being used to indicate facial expression changes. A sequence of changes in the facial keypoints of the target person is determined based on at least one facial keypoint of the target person. Based on the sequence of changes in the facial key points, a facial expression of the target person is determined.
In some embodiments, the target expression corresponds to at least one mood word. The target expression is an expression of the target person corresponding to the mood word in the sample video clip. In some embodiments, the target expressions are in one-to-one correspondence with the mood words; for example, a first mood word corresponds to a first target expression and a second mood word corresponds to a second target expression. In some embodiments, the target expressions are not in one-to-one correspondence with the mood words; for example, two different mood words may both correspond to a third target expression.
In some embodiments, in the case that m mood words and target expressions corresponding to the m mood words are included in the sample video clip, an expression library of the target person is constructed. The expression library of the target person comprises at least one target expression and a mood word corresponding to the at least one target expression.
In some embodiments, the expression of the virtual person is determined based on facial key points of the virtual person. Optionally, the face key points of the virtual person are consistent with the face key points of the target person, for example, the virtual person includes y face key points, the target person also includes y face key points, and the position proportion of the y face key points of the virtual person is consistent with the position proportion of the y face key points of the target person. Optionally, the face key points of the virtual person are inconsistent with the face key points of the target person, including that the number of face key points of the virtual person is inconsistent with the number of face key points of the target person, and/or that the position proportion of the face key points of the virtual person is inconsistent with the position proportion of the face key points of the target person.
In some embodiments, generating the expression of the virtual person is accomplished by mapping the facial keypoint variation of the target person into the facial keypoint variation of the virtual person. In some embodiments, the target expression corresponding to the mood word in the second voice content is found by searching or retrieving the target expression in the expression library, and the facial key point change sequence of the target person corresponding to the target expression is mapped to the facial key point change sequence of the virtual person, so as to generate the expression of the virtual person corresponding to the mood word in the second voice content.
In summary, the method provided by the embodiment of the application plays the interactive video including the expression of the virtual person, so that the virtual person is more interesting in interaction.
Fig. 15 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method further comprises the following sub-steps:
step 1000: and responding to the triggering operation for creating the virtual person, and displaying the virtual person corresponding to the user account in the social application program.
In some embodiments, in response to a trigger operation for creating a virtual person, a virtual person corresponding to a user account in a social application is displayed. The triggering operation includes at least one of single click, double click, left and right sliding, up and down sliding, long press, hover, face recognition, and voice recognition. It should be noted that the triggering operation includes, but is not limited to, the above mentioned several operations, and those skilled in the art should be aware that any operation capable of implementing the above mentioned functions falls within the scope of the embodiments of the present application.
In some embodiments, the virtual person is created based on real person information corresponding to the user account, the real person information including at least one of an appearance characteristic, a posture characteristic, and a stature proportion of the real person. In some embodiments, the virtual person corresponding to the user account includes virtual person information. The virtual person information is consistent with the real person information corresponding to the user account; for example, if the real person information corresponding to the user account includes z information parameters, the virtual person information also includes z information parameters, and the z information parameters in the real person information correspond one-to-one to the z information parameters in the virtual person information. The virtual person corresponding to the user account is generated through the z information parameters in one-to-one correspondence, so that the virtual person has virtual person information consistent with the real person information corresponding to the user account.
In some embodiments, the virtual person is bound to a virtual person voice model that is trained based on sample voice segments. In some embodiments, the real person information and the sample voice clip corresponding to the user account are obtained from a self-timer video published by the user account. In some embodiments, the sample speech segment is captured from a self-timer video published by the user account. In some embodiments, the sample speech segments are spliced based on self-timer video published by the user account.
In summary, according to the method provided by the embodiment of the application, the virtual person corresponding to the user account in the social application program is displayed in response to the triggering operation for creating the virtual person, so that the user can generate, according to the user's own needs and ideas, a virtual person with personal characteristics and the second voice content spoken by the virtual person, which can improve the enthusiasm of the user for interaction with the virtual person to a certain extent.
Fig. 16 is a flowchart of a method of speech synthesis for a virtual person according to an exemplary embodiment of the present application. The method is cooperatively performed by the terminal 120 and the server 140 in the computer system 100 shown in fig. 1, and includes:
Step 1: the terminal receives triggering operation for creating a virtual person;
in some embodiments, the terminal receives a trigger operation for creating a virtual person. The triggering operation includes at least one of single click, double click, left and right sliding, up and down sliding, long press, hover, face recognition, and voice recognition. It should be noted that the triggering operation includes, but is not limited to, the above mentioned several operations, and those skilled in the art should be aware that any operation capable of implementing the above mentioned functions falls within the scope of the embodiments of the present application.
Step 2: the terminal responds to the triggering operation and sends a virtual person creating request to the server;
in some embodiments, the terminal sends a virtual person creation request to the server in response to a trigger operation for creating the virtual person. The creation request includes a request to create a virtual person based on real person information corresponding to the user account. In some embodiments, the real person information includes at least one of a look feature, a gesture feature, and a stature scale of the real person.
Step 3: the server receives a virtual person creation request sent by the terminal;
in some embodiments, the server receives a virtual person creation request sent by the terminal, including receiving a request for creating a virtual person based on real person information corresponding to a user account.
Step 4: the server acquires real person information corresponding to a user account;
in some embodiments, self-timer videos published by a user account are collected or acquired, and the number of self-timer videos is not limited. In some embodiments, the social application in which the user account publishes the self-timer video is the same social application that supports creating virtual persons. In some embodiments, the social application in which the user account publishes the self-timer video is a different social application from the social application that supports creating a virtual person. Optionally, the social application program in which the user account publishes the self-timer video is social application program A, the social application program supporting creation of the virtual person is social application program B, and the user account logged in on social application program A and the user account logged in on social application program B are the same user account.
In some embodiments, real person information corresponding to the user account is obtained from a self-timer video published by the user account. The real person information includes at least one of an appearance characteristic, a posture characteristic, and a stature proportion of the real person.
Step 5: the server generates a virtual person;
in some embodiments, the server generates a virtual person corresponding to the user account based on the real person information corresponding to the user account. In some embodiments, the real person information corresponding to the user account includes at least one of a shape feature, a gesture feature, and a stature proportion of the real person.
Step 6: the terminal displays an interactive interface comprising a virtual person;
in some embodiments, an interactive interface including a virtual person is displayed on an interface of an application of the terminal. The interactive interface is used for interaction between the virtual person and the user. Illustratively, as shown in FIG. 11, an interactive interface 12 including a virtual person 11 is displayed on the interface of the entertainment-type application. Illustratively, as shown in FIG. 12, an interactive interface 12 including a virtual person 11 is displayed on the interface of the live-type application.
In some embodiments, the interactive interface includes at least one text box for displaying text, and text content is displayed in the at least one text box, the text content including a first text content and a second text content. Illustratively, as shown in FIG. 11, a first text box 14, a second text box 15, and a third text box 16 are included in the interactive interface 12. Wherein the first text box 14 is for displaying a first text content, the second text box 15 is for displaying a second text content, and the third text box 16 is for providing a user with input of an interactive content for interaction with the virtual person 11.
Step 7: the server acquires a sample voice fragment of a target person;
in some embodiments, a sample speech segment is obtained from a self-timer video published by a user account, the sample speech segment being used to train a virtual human speech model. In some embodiments, the sample speech segment is captured from a self-timer video published by the user account. In some embodiments, the sample speech segments are spliced based on self-timer video published by the user account.
In some embodiments, the sample speech segment is cut from an existing speech segment. Illustratively, interview speech of a public figure is obtained by intercepting interview clips from a news medium, where the public figure is the target person. In some embodiments, the sample speech segment is audio recorded by the target person. In some embodiments, the number of sample words in the sample speech segment is smaller than the number of sample words used to train the text-to-speech model.
Step 8: the server trains the virtual human voice model based on the sample voice fragments to obtain a trained virtual human voice model;
in some embodiments, the virtual person speech model is trained based on sample speech segments having the speech characteristics of the target person, such that the trained virtual person speech model is capable of reasoning speech content having generic speech characteristics into speech content having the speech characteristics of the target person.
Step 9: the terminal receives an interaction operation for interaction with the virtual person;
in some embodiments, a terminal receives an interactive operation for interacting with a virtual person. The interactive operation includes at least one of single click, double click, left and right sliding, up and down sliding, long press, hover, face recognition, and voice recognition. It should be noted that the interactive operation includes, but is not limited to, the above mentioned several operations, and those skilled in the art should be aware that any operation capable of implementing the above mentioned functions falls within the scope of the embodiments of the present application.
Step 10: the terminal responds to the interaction operation and sends an interaction request to the server;
in some embodiments, the terminal sends an interaction request to the server in response to an interaction operation by the user for interaction with the dummy. The interactive request includes at least one of a request to obtain first text content, a request to obtain second text content, a request to obtain first voice content, a request to obtain second voice content, a request to obtain interactive video including limb movements of a virtual person, and a request to obtain interactive video including expressions of a virtual person.
Step 11: the server receives an interaction request sent by the terminal;
in some embodiments, the server receives an interactive request sent by the terminal, including at least one of a request to obtain first text content, a request to obtain second text content, a request to obtain first voice content, a request to obtain second voice content, a request to obtain interactive video including limb movements of the virtual person, and a request to obtain interactive video including expressions of the virtual person.
Step 12: the server acquires first text content;
the first text content includes text information that interacts with the dummy. In some embodiments, the first text content is obtained based on input content entered by a user. Optionally, the first text content includes text content entered by a user. Illustratively, the text content input by the user is "hello", and "hello" is the first text content. Optionally, the first text content includes a text content obtained based on voice content input by a user. Illustratively, the user inputs the voice content, and based on the voice recognition model, the voice content input by the user is recognized to obtain the first text content.
Step 13: the server inputs the first text content into a large language model to obtain second text content;
the server inputs the first text content into a large language model, and the large language model is used for carrying out natural language reply processing on the first text content to obtain the second text content subjected to the natural language reply processing. Illustratively, in the case where the first text content is "hello", "hello" is input into the large language model, resulting in the second text content being "I'm fine". Optionally, the large language model is a ChatGPT model. Alternatively, the large language model is a GPT-4 model.
Step 14: the server converts the second text content into first voice content based on the text-to-voice model;
the server inputs the second text content into the text-to-speech model, which is capable of converting text content into speech content, and outputs the converted first voice content. In some embodiments, the first voice content is voice information corresponding to the second text content. Illustratively, the second text content is "I'm fine", and the first voice content is the voice content corresponding to "I'm fine".
Step 15: the server infers the first voice content based on the virtual voice model to obtain second voice content;
the server inputs the first voice content into the virtual person voice model for reasoning to obtain the second voice content. The virtual person voice model is used for reasoning voice content with general voice characteristics into voice content with the voice characteristics of the target person. That is, the second voice content includes the voice characteristics of the target person. In some embodiments, the voice characteristics of the target person include at least one of a timbre characteristic, an emotion characteristic, and a prosody characteristic.
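The chain of steps 12 to 15 can be summarized by the following illustrative sketch, in which the three callables stand in for the large language model, the text-to-speech model and the trained virtual person voice model; none of the names come from the original disclosure:

```python
from typing import Callable

def synthesize_reply(first_text_content: str,
                     large_language_model: Callable[[str], str],
                     text_to_speech: Callable[[str], bytes],
                     virtual_person_voice_model: Callable[[bytes], bytes]) -> bytes:
    """Server-side pipeline: LLM reply, generic text-to-speech, then timbre
    migration with the virtual person voice model."""
    second_text_content = large_language_model(first_text_content)          # step 13
    first_voice_content = text_to_speech(second_text_content)               # step 14
    second_voice_content = virtual_person_voice_model(first_voice_content)  # step 15
    return second_voice_content
```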
Step 16: the terminal plays the second voice content taught by the virtual person;
in some embodiments, the terminal plays the second voice content spoken by the virtual person. In some embodiments, the second voice content spoken by the virtual person is played in response to a trigger operation for the play control. The triggering operation includes at least one of single click, double click, left and right sliding, up and down sliding, long press, hover, face recognition, and voice recognition. It should be noted that the triggering operation includes, but is not limited to, the above mentioned several operations, and those skilled in the art should be aware that any operation capable of implementing the above mentioned functions falls within the scope of the embodiments of the present application. Illustratively, as shown in fig. 11, a play control 13 is included on the interactive interface 12, where the play control 13 is configured to control playing of the second voice content, and in response to a triggering operation for the play control 13, the second voice content spoken by the virtual person 11 is played. In some embodiments, the second voice content spoken by the virtual person and having the voice characteristics of the target person is played.
Step 17: the server builds a limb action library of the target person;
in some embodiments, the sample speech segments of the target person are obtained from a sample video segment containing the target person. The server extracts at least one target limb action from a sample video clip containing a target person. The target limb movement is a limb movement of the target person. In some embodiments, the target limb motion is determined based on a skeletal point of the target person, the skeletal point being used to indicate a limb motion change. A sequence of skeletal point changes of the target person is determined based on the changes in at least one skeletal point of the target person. Based on the sequence of skeletal point changes, limb movements of the target person are determined.
In some embodiments, the target limb action corresponds to at least one mood word. The target limb action is a limb action of the target person corresponding to the mood word in the sample video clip. In some embodiments, the target limb actions are in one-to-one correspondence with the mood words; for example, a first mood word corresponds to a first target limb action and a second mood word corresponds to a second target limb action. In some embodiments, the target limb actions are not in one-to-one correspondence with the mood words; for example, two different mood words may both correspond to a third target limb action.
And under the condition that the sample video segment comprises m mood words and target limb actions corresponding to the m mood words, constructing a limb action library of the target person. The limb action library of the target person comprises at least one target limb action and a mood word corresponding to the at least one target limb action.
Step 18: the server generates limb actions of the virtual person corresponding to the mood words in the second voice content based on the limb action library;
the server generates limb actions of the virtual person corresponding to the mood words in the second voice content based on the limb action library including at least one target limb action and the mood words corresponding to the at least one target limb action.
In some embodiments, the limb movements of the virtual person are determined based on skeletal points of the virtual person. Optionally, the skeletal points of the virtual person are consistent with the skeletal points of the target person, e.g., the virtual person includes t skeletal points, the target person also includes t skeletal points, and the ratio of the positions of the t skeletal points of the virtual person to the ratio of the positions of the t skeletal points of the target person is consistent. Optionally, the skeletal points of the virtual person are inconsistent with the skeletal points of the target person, including the number of skeletal points of the virtual person and the number of skeletal points of the target person being inconsistent, and/or the ratio of the locations of the skeletal points of the virtual person and the ratio of the locations of the skeletal points of the target person being inconsistent.
In some embodiments, generating limb movements of the virtual person is achieved by mapping skeletal point changes of the target person into skeletal point changes of the virtual person. In some embodiments, the target limb actions corresponding to the mood words in the second voice content are found by searching or retrieving the target limb actions in the limb action library, and the skeletal point change sequence of the target person corresponding to the target limb actions is mapped to the skeletal point change sequence of the virtual person to generate the limb actions of the virtual person corresponding to the mood words in the second voice content.
Step 19: the terminal plays an interactive video comprising limb actions of the virtual person;
in some embodiments, an interactive video including limb movements of a virtual person is played, the limb movements of the virtual person corresponding to the mood words in the second speech content, the limb movements of the virtual person being obtained from a limb movement library constructed based on sample video clips containing the target person. In some embodiments, the interactive video comprising limb movements of the virtual person is a video stream, the video stream being pre-generated. In some embodiments, the interactive video comprising limb movements of the virtual person is a video frame, based on which the interactive video comprising limb movements of the virtual person is generated.
Step 20: the server builds an expression library of the target person;
in some embodiments, the sample speech segments of the target person are obtained from a sample video segment containing the target person. The server extracts at least one target expression from the sample video segment containing the target person. The target expression is an expression of the target person. In some embodiments, the expression of the target person is determined based on facial key points of the target person, the facial key points being used to indicate facial expression changes. A sequence of changes in the facial keypoints of the target person is determined based on at least one facial keypoint of the target person. Based on the sequence of changes in the facial key points, a facial expression of the target person is determined.
In some embodiments, the target expression corresponds to at least one mood word. The target expression is an expression of the target person corresponding to the mood word in the sample video clip. In some embodiments, the target expressions are in one-to-one correspondence with the mood words; for example, a first mood word corresponds to a first target expression and a second mood word corresponds to a second target expression. In some embodiments, the target expressions are not in one-to-one correspondence with the mood words; for example, two different mood words may both correspond to a third target expression.
And under the condition that the sample video fragment comprises m mood words and target expressions corresponding to the m mood words, constructing an expression library of the target person. The expression library of the target person comprises at least one target expression and a mood word corresponding to the at least one target expression.
Step 21: the server generates the expression of the virtual person corresponding to the mood word in the second voice content based on the expression library;
the server generates an expression of the virtual person corresponding to the mood word in the second speech content based on the expression library including the at least one target expression and the mood word corresponding to the at least one target expression.
In some embodiments, the expression of the virtual person is determined based on facial key points of the virtual person. Optionally, the face key points of the virtual person are consistent with the face key points of the target person, for example, the virtual person includes y face key points, the target person also includes y face key points, and the position proportion of the y face key points of the virtual person is consistent with the position proportion of the y face key points of the target person. Optionally, the face key points of the virtual person are inconsistent with the face key points of the target person, including that the number of face key points of the virtual person is inconsistent with the number of face key points of the target person, and/or that the position proportion of the face key points of the virtual person is inconsistent with the position proportion of the face key points of the target person.
In some embodiments, generating the expression of the virtual person is accomplished by mapping the facial keypoint variation of the target person into the facial keypoint variation of the virtual person. In some embodiments, the target expression corresponding to the mood word in the second voice content is found by searching or retrieving the target expression in the expression library, and the facial key point change sequence of the target person corresponding to the target expression is mapped to the facial key point change sequence of the virtual person, so as to generate the expression of the virtual person corresponding to the mood word in the second voice content.
Step 22: the terminal plays the interactive video comprising the expression of the virtual person.
In some embodiments, an interactive video including expressions of a virtual person is played, the expressions of the virtual person correspond to the mood words in the second voice content, the expressions of the virtual person are obtained from an expression library, and the expression library is constructed based on a sample video segment containing a target person. In some embodiments, the interactive video comprising the expression of the virtual person is a video stream, the video stream being pre-generated. In some embodiments, the interactive video comprising the expression of the virtual person is a video frame, based on which the interactive video comprising the expression of the virtual person is generated.
In summary, the method provided in this embodiment can generate the second voice content with the voice characteristics of the target person; by avoiding generating generic voice content, the voice of the virtual person is more distinctive. In addition, by generating limb actions and/or expressions of the virtual person corresponding to the second voice content, the method can help improve the enthusiasm of the user for interaction with the virtual person to a certain extent.
In some embodiments, the voice synthesis method for the virtual person provided by the embodiment of the application generates the answer by using the large language model, and gives the virtual person a specific timbre through the intermediate-layer text-to-voice model followed by the independently trained virtual person voice model. This solves problems in the related art such as voice replies mostly sounding like generic marketing-account voices, custom voice lines being expensive and taking a long time to customize, or self-trained text-to-voice models sounding stiff, having a small word library and being unable to handle polyphonic words. The voice synthesis method for the virtual person provided by the embodiment of the application has low cost, and training and reasoning are very fast.
According to the voice synthesis method for the virtual person provided by the embodiment of the application, through the sample voice segment of the target person and the virtual person voice model, the voice of the virtual person can be customized to better match the persona of the virtual person, avoiding a generic marketing-account voice line or a voice collision with other virtual persons.
According to the voice synthesis method for the virtual person provided by the embodiment of the application, the text-to-voice model is used as a middle layer for handling emotion, mood and breath pauses, and the virtual person voice model then performs timbre migration, realizing a secondary assignment of timbre. This makes the voice more natural and smooth, and using the text-to-voice model as the middle layer effectively solves problems such as a small word library and the handling of polyphonic words, names and rare words.
The cost of the voice synthesis method for the virtual person provided by the embodiment of the application is controllable, a free text-to-voice model can be used as the middle layer, and the high simulation degree can be obtained by training the voice model of the virtual person with fewer sample voice fragments, so that the cost, the voice simulation degree and the like can be well balanced.
The voice synthesis method for the virtual person, provided by the embodiment of the application, can be suitable for interaction scenes with any virtual person, such as chat box dialogue, live broadcast of the virtual person and the like. The user inputs questions or demands, and the system performs speech recognition and natural language processing to send text to the large language model. The system generates corresponding text replies based on the large language model, generates corresponding voice replies by taking the text replies as parameters, and the voice replies are consistent with the text replies and have the mood, emotion and intonation of the target person. The system displays the voice answers, the text replies and the expression actions of the virtual person on a user interface, and the user can further understand and meet the requirements of himself by watching, listening and reading the answer contents.
In some embodiments, as shown in fig. 17, the steps of the voice synthesis method for a virtual person provided in the embodiment of the present application include:
step 1110: training a virtual human voice model;
1) A microphone and a recording device are used to have the target person record speech of a certain length, with a duration of tens of minutes to one or two hours. No effect processing is applied to the recorded sound. If background music or other reverberation effects are included, the background music or other reverberation effects are removed.
2) A second refinement pass is performed on the recorded voice of the target person, and breath sounds and the like are muted or cut out. This step improves the naturalness of the trained voice model, prevents excessive pollution of the training data set, and reduces the mechanical feel.
3) The recorded voice of the target person is cut according to duration and sentence breaks, keeping the length of each small segment between roughly eight seconds and thirty seconds. This prevents training failures caused by overflowing the GPU memory required by the device when a voice clip is too long.
4) All dataset samples are converted to 44100 Hz using a resampling script (see the illustrative sketch after step 6) below).
5) The preprocess_flist_config.py script and preprocess_hubert_f0.py script are used to automatically divide the dataset into a training set, a validation set and a test set and automatically generate corresponding configuration files for training and reasoning.
6) Training is performed using the train script to obtain the virtual person voice model.
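As an illustration of steps 3) and 4) only, since the original flow relies on its own scripts, segmenting the refined recording into 8-30 second clips and resampling to 44100 Hz could be sketched with librosa/soundfile as follows; the silence threshold and file naming are assumptions:

```python
import numpy as np
import librosa
import soundfile as sf

SR = 44100  # target sample rate used by the training pipeline

def prepare_training_clips(in_path: str, out_prefix: str,
                           min_s: float = 8.0, max_s: float = 30.0) -> None:
    """Cut the refined recording into clips of roughly 8-30 seconds and resample
    to 44100 Hz, so no single clip overflows GPU memory during training."""
    audio, _ = librosa.load(in_path, sr=SR)              # load and resample
    intervals = librosa.effects.split(audio, top_db=40)  # cut on silences
    buffer, index = [], 0
    for start, end in intervals:
        buffer.append(audio[start:end])
        if sum(len(b) for b in buffer) >= min_s * SR:
            clip = np.concatenate(buffer)[: int(max_s * SR)]  # cap at 30 seconds
            sf.write(f"{out_prefix}_{index:04d}.wav", clip, SR)
            buffer, index = [], index + 1
```

A hedged driver for steps 5) and 6) might simply run the named scripts in order; the exact command-line arguments, and the name train.py for the train script, are assumptions that depend on the specific toolchain:

```python
import subprocess

# Run dataset preprocessing, configuration generation and training in sequence.
for script in ("preprocess_flist_config.py", "preprocess_hubert_f0.py", "train.py"):
    subprocess.run(["python", script], check=True)
```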
Step 1120: acquiring an input text of a user;
in some embodiments, the user's input text is obtained through a text box entered by the user. In some embodiments, the user's input text is captured by capturing a user's barrage in a live scene.
Step 1130: generating a text reply based on the large language model;
in some embodiments, the user's input text is passed as a parameter to a large language model, which automatically generates a corresponding reply from the user's input text.
Step 1140: generating a preliminary voice reply based on the text-to-voice model;
in some embodiments, the text reply of the virtual person generated by the large language model is passed to the text-to-speech model as a parameter; the text-to-speech model automatically analyzes the emotion, mood, sentence breaks and pauses of the reply and generates a preliminary speech reply.
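Purely as an illustration of this middle layer, since the embodiment does not name a specific engine, generating the preliminary speech reply from the text reply could be sketched with an off-the-shelf text-to-speech library such as pyttsx3; the engine choice and output path are assumptions:

```python
import pyttsx3  # assumption: a local text-to-speech engine is available

def preliminary_speech_reply(text_reply: str, out_path: str = "preliminary.wav") -> str:
    """Convert the text reply of the virtual person into a preliminary speech
    reply with a generic timbre; the virtual person voice model later performs
    the secondary timbre assignment."""
    engine = pyttsx3.init()
    engine.save_to_file(text_reply, out_path)
    engine.runAndWait()
    return out_path
```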
Step 1150: generating a final voice reply based on the virtual human voice model;
in some embodiments, the preliminary voice response generated by the text-to-voice model is used as a voice input, and the final voice response is generated by performing a second inference through the already trained virtual human voice model.
Step 1160: and feeding back the final voice reply to the user.
In some embodiments, the final voice response is fed back to the user, the corresponding text response and voice response are displayed to the audience, and the interaction between the user and the virtual person is completed by matching with the corresponding expression and action.
According to the voice synthesis method for the virtual person, provided by the embodiment of the application, the voice model of the virtual person is trained by using the voice material of the target person, so that personalized customization of the voice of the virtual person can be realized, the defect of using preset voice is avoided, the voice of the virtual person is more in line with the expected setting of the person, and the experience and the recognition of the user are improved.
According to the voice synthesis method for the virtual person provided by the embodiment of the application, the text-to-voice model is used as a middle layer for handling emotion, mood and breath pauses, and after this buffering step the virtual person voice model performs a secondary assignment of timbre, so that the voice is more natural and smooth. The text-to-voice model can also effectively solve problems such as a small word library and the handling of polyphonic words, names and rare words, improving the accuracy and robustness of speech recognition.
The voice synthesis method for the virtual person, provided by the embodiment of the application, has relatively low cost, achieves better balance in the aspects of realizing cost control, voice simulation degree and the like, has higher training speed, and improves efficiency and expandability.
Fig. 18 shows a block diagram of a speech synthesis apparatus for a virtual person according to an exemplary embodiment of the present application. The device comprises:
the obtaining module 1810 is configured to obtain first text content, where the first text content includes text information that interacts with a virtual person.
The processing module 1820 is configured to input the first text content into a large language model, and obtain the second text content, where the large language model is configured to perform natural language reply processing on the first text content.
The processing module 1820 is further configured to convert the second text content into first speech content based on the text-to-speech model, where the first speech content is speech information corresponding to the second text content.
The processing module 1820 is further configured to infer a first voice content based on the virtual person voice model, so as to obtain a second voice content, where the second voice content includes voice features of the target person.
The virtual human voice model is used for reasoning the voice content with the general voice characteristics into the voice content with the voice characteristics of the target person, and the number of sample words used by the text-to-voice model in the training process is larger than that of the sample words used by the virtual human voice model in the training process.
The obtaining module 1810 is further configured to obtain a sample speech segment of the target person, where the number of sample words in the sample speech segment is smaller than the number of sample words used to train the text-to-speech model.
The training module 1830 is configured to train the virtual human voice model based on the sample voice segment, and obtain a trained virtual human voice model.
The obtaining module 1810 is further configured to extract a voice feature of the target person in the sample voice segment.
The training module 1830 is further configured to train the virtual human voice model based on the voice characteristics of the target person, to obtain a trained virtual human voice model.
The obtaining module 1810 is further configured to extract a voice feature in the sample voice segment as a voice feature of the target person, where the voice feature includes at least one of a timbre feature, an emotion feature, and a prosody feature.
The obtaining module 1810 is further configured to obtain a sample voice of the target person.
The processing module 1820 is further configured to mute or cut target sound information in the sample speech to obtain a cut sample, where the target sound information includes at least one of ventilation sound, pause, and mechanical sound in the sample speech.
The processing module 1820 is further configured to segment the cut samples according to a time length threshold, where the time length threshold is used to indicate a time length of the sample speech segment.
In some embodiments, the sample speech segments of the target person are obtained from a sample video segment containing the target person.
The obtaining module 1810 is further configured to extract, from the sample video segment, a target limb action corresponding to at least one word of the mood, where the target limb action is a limb action of the target person corresponding to the word of the mood in the sample video segment.
The processing module 1820 is further configured to construct a limb motion library of the target person based on the target limb motion.
The processing module 1820 is further configured to generate, based on the limb-action library, a limb action of the virtual person corresponding to the mood word in the second speech content.
In some embodiments, the sample speech segments of the target person are obtained from a sample video segment containing the target person.
The obtaining module 1810 is further configured to extract, from the sample video segment, a target expression corresponding to at least one modal particle, where the target expression is the expression of the target person for the modal particle in the sample video segment.
The processing module 1820 is further configured to construct an expression library of the target person based on the target expression.
The processing module 1820 is further configured to generate, based on the expression library, the expression of the virtual person corresponding to the modal particle in the second voice content.
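Illustratively, both the limb-action library and the expression library described above can be organized as a mapping from modal particles to assets extracted from the sample video segment, as in the sketch below. The class, the particle strings, and the clip names are illustrative assumptions rather than structures disclosed in this application.

```python
from collections import defaultdict

class ModalParticleAssetLibrary:
    """Sketch: map each modal particle to limb-action or expression assets
    extracted from the sample video segment of the target person."""

    def __init__(self):
        self._assets = defaultdict(list)

    def add(self, modal_particle: str, asset_reference: str):
        # asset_reference could be a motion-capture clip, a blendshape sequence,
        # or a timestamp range inside the sample video segment.
        self._assets[modal_particle].append(asset_reference)

    def generate_for(self, second_voice_text: str) -> dict:
        # Return one stored asset for every modal particle found in the reply text.
        return {
            particle: clips[0]
            for particle, clips in self._assets.items()
            if particle in second_voice_text and clips
        }

# Illustrative usage with hypothetical particle and clip names.
limb_actions = ModalParticleAssetLibrary()
limb_actions.add("wow", "target_person_hands_up.bvh")
selected = limb_actions.generate_for("wow, that is great news")
```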
The obtaining module 1810 is further configured to obtain a virtual person creation request sent by a user account in a social application, where the virtual person creation request is used to create a virtual person corresponding to the user account.
The obtaining module 1810 is further configured to obtain real person information corresponding to the user account and the sample speech segment from a selfie video published by the user account.
The processing module 1820 is further configured to generate the virtual person corresponding to the user account based on the real person information corresponding to the user account.
The processing module 1820 is further configured to bind the virtual person voice model trained on the sample speech segment to the virtual person.
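Illustratively, the creation flow triggered from the social application could be organized as in the sketch below. The four injected interfaces (request, video_store, avatar_factory, trainer) and their method names are hypothetical placeholders introduced only to show the order of the steps.

```python
def handle_virtual_person_creation(request, video_store, avatar_factory, trainer):
    """Sketch of the server-side creation flow; every interface here is hypothetical."""
    user_account = request.user_account

    # Pull appearance information and speech samples from a selfie video
    # published by the user account.
    selfie_video = video_store.latest_selfie_video(user_account)
    real_person_info = selfie_video.extract_person_info()
    sample_speech_segments = selfie_video.extract_audio_segments()

    # Generate the virtual person from the real person information.
    virtual_person = avatar_factory.create(real_person_info)

    # Train the small virtual person voice model on the samples and bind it.
    voice_model = trainer.train(sample_speech_segments)
    virtual_person.bind_voice_model(voice_model)
    return virtual_person
```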
Fig. 19 shows a block diagram of a speech synthesis apparatus for a virtual person according to an exemplary embodiment of the present application. The device comprises:
the display module 1910 is configured to display an interactive interface including a virtual person.
The interaction module 1920 is configured to obtain, in response to an interaction operation for interacting with the virtual person, first text content, where the first text content includes text information for interacting with the virtual person.
The play module 1930 is configured to play second voice content spoken by the virtual person, where the second voice content is obtained by performing inference on the first voice content based on the virtual person voice model, the first voice content is obtained by converting the second text content based on the text-to-voice model, and the second text content is obtained by performing natural language reply processing on the first text content based on the large language model.
The virtual person voice model is used to convert, through inference, voice content having general voice characteristics into voice content having the voice characteristics of the target person, and the number of sample words used by the text-to-voice model in the training process is larger than the number of sample words used by the virtual person voice model in the training process.
The play module 1930 is further configured to play the second voice content, which is spoken by the virtual person and carries the voice features of the target person, where the voice features include at least one of a timbre feature, an emotion feature, and a prosody feature.
The play module 1930 is further configured to play an interactive video including a limb action of the virtual person, where the limb action of the virtual person corresponds to a modal particle in the second voice content, the limb action of the virtual person is obtained from the limb-action library, and the limb-action library is constructed based on a sample video segment containing the target person.
The play module 1930 is further configured to play an interactive video including an expression of the virtual person, where the expression of the virtual person corresponds to a modal particle in the second voice content, the expression of the virtual person is obtained from the expression library, and the expression library is constructed based on a sample video segment containing the target person.
The interaction module 1920 is further configured to display, in response to a trigger operation for creating the virtual person, the virtual person corresponding to the user account in the social application.
The virtual person is created based on the real person information corresponding to the user account, the virtual person is bound to the virtual person voice model, the virtual person voice model is obtained by training based on the sample speech segment, and the real person information corresponding to the user account and the sample speech segment are obtained from a selfie video published by the user account.
In some embodiments, the interactive interface includes at least one text box for displaying text.
The display module 1910 is further configured to display text content in the at least one text box, where the text content includes the first text content and the second text content.
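Illustratively, the terminal-side behavior covered by the display, interaction, and play modules could be arranged as the loop below. The ui, server, and player interfaces and the attribute names on the reply object are assumptions for illustration only.

```python
def interaction_loop(ui, server, player):
    """Sketch: display the interface, forward the user's text, play the reply."""
    ui.show_virtual_person_interface()
    while True:
        first_text_content = ui.wait_for_user_text()   # interaction operation
        if first_text_content is None:                 # user closed the interface
            break
        reply = server.synthesize(first_text_content)  # text, audio, and video of the reply
        ui.show_text(first_text_content, reply.second_text_content)  # text boxes
        player.play_audio(reply.second_voice_content)  # voice of the target person
        player.play_video(reply.interactive_video)     # limb actions / expressions
```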
Fig. 20 is a schematic diagram showing the structure of a computer device according to an exemplary embodiment of the present application. Illustratively, the computer apparatus 2000 includes a central processing unit (Central Processing Unit, CPU) 2001, a system Memory 2004 including a random access Memory (Random Access Memory, RAM) 2002 and a Read-Only Memory (ROM) 2003, and a system bus 2005 connecting the system Memory 2004 and the central processing unit 2001. The computer device 2000 also includes a basic input/output system 2006 to facilitate the transfer of information between various devices within the computer, and a mass storage device 2007 that stores an operating system 2013, application programs 2014, and other program modules 2015.
The basic input/output system 2006 includes a display 2008 for displaying information and an input device 2009 such as a mouse, keyboard, etc. for a user to input information. Wherein both the display 2008 and the input device 2009 are connected to the central processing unit 2001 through an input/output controller 2010 connected to a system bus 2005. The basic input/output system 2006 may also include an input/output controller 2010 for receiving and processing input from a keyboard, mouse, or electronic stylus among a plurality of other devices. Similarly, the input/output controller 2010 also provides output to a display screen, printer or other type of output device.
The mass storage device 2007 is connected to the central processing unit 2001 through a mass storage controller (not shown) connected to the system bus 2005. The mass storage device 2007 and its associated computer-readable media provide non-volatile storage for the computer device 2000. That is, the mass storage device 2007 may include a computer-readable medium (not shown) such as a hard disk or a compact disk-Only (CD-ROM) drive.
The computer readable medium may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-Only Memory (EPROM), electrically erasable programmable read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), flash Memory or other solid state Memory technology, CD-ROM, digital versatile disks (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 2004 and mass storage device 2007 described above may be collectively referred to as memory.
According to various embodiments of the application, the computer device 2000 may also be connected, through a network such as the Internet, to a remote computer on the network for operation. That is, the computer device 2000 may be connected to the network 2012 through the network interface unit 2011 coupled to the system bus 2005, or the network interface unit 2011 may be used to connect to other types of networks or remote computer systems (not shown).
An exemplary embodiment of the present application also provides a computer readable storage medium storing at least one program, where the at least one program is loaded and executed by a processor to implement the method for synthesizing speech for a virtual person provided in the above method embodiments.
An exemplary embodiment of the present application also provides a computer program product including at least one program, where the at least one program is stored in a readable storage medium; a processor of a communication device reads the at least one program from the readable storage medium, and the processor executes the at least one program to cause the communication device to perform the speech synthesis method for a virtual person provided in the above method embodiments.
It should be understood that references herein to "a plurality" are to two or more. Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise forms disclosed; any modifications, equivalent replacements, and improvements made within the spirit and principles of the application are intended to be included within the scope of the application.

Claims (19)

1. A method of speech synthesis for a virtual person, the method being performed by a server, the method comprising:
acquiring first text content, wherein the first text content comprises text information for interacting with the virtual person;
inputting the first text content into a large language model to obtain second text content, wherein the large language model is used for carrying out natural language reply processing on the first text content;
converting the second text content into first voice content based on a text-to-voice model, wherein the first voice content is voice information corresponding to the second text content;
performing inference on the first voice content based on a virtual person voice model to obtain second voice content, wherein the second voice content comprises voice characteristics of a target person;
wherein the virtual person voice model is used to convert, through inference, voice content having general voice characteristics into voice content having the voice characteristics of the target person, and the number of sample words used by the text-to-voice model in the training process is larger than the number of sample words used by the virtual person voice model in the training process.
2. The method according to claim 1, wherein the method further comprises:
acquiring a sample voice fragment of the target person;
and training the virtual person voice model based on the sample voice fragment to obtain the trained virtual person voice model.
3. The method of claim 2, wherein training the virtual person voice model based on the sample voice fragment to obtain the trained virtual person voice model comprises:
extracting the voice characteristics of the target person in the sample voice fragment;
and training the virtual person voice model based on the voice characteristics of the target person to obtain the trained virtual person voice model.
4. The method according to claim 3, wherein said extracting the voice characteristics of the target person in the sample voice fragment comprises:
extracting voice characteristics from the sample voice fragment as the voice characteristics of the target person, wherein the voice characteristics comprise at least one of a timbre characteristic, an emotion characteristic, and a prosody characteristic.
5. A method according to claim 3, characterized in that the method further comprises:
acquiring sample voice of the target person;
performing mute or cut processing on target sound information in the sample voice to obtain a cut sample, wherein the target sound information comprises at least one of a breathing sound, a pause, and a mechanical sound in the sample voice;
and dividing the cut sample according to a time length threshold value to obtain the sample voice fragment, wherein the time length threshold value is used for indicating the time length of the sample voice fragment.
6. The method of any one of claims 1 to 5, wherein the sample speech segment of the target person is obtained from a sample video segment containing the target person, the method further comprising:
extracting, from the sample video segment, a target limb action corresponding to at least one modal particle, wherein the target limb action is a limb action of the target person corresponding to the modal particle in the sample video segment;
constructing a limb action library of the target person based on the target limb action;
and generating, based on the limb action library, a limb action of the virtual person corresponding to the modal particle in the second voice content.
7. The method of any one of claims 1 to 5, wherein the sample speech segment of the target person is obtained from a sample video segment containing the target person, the method further comprising:
extracting, from the sample video segment, a target expression corresponding to at least one modal particle, wherein the target expression is the expression of the target person corresponding to the modal particle in the sample video segment;
constructing an expression library of the target person based on the target expression;
and generating, based on the expression library, the expression of the virtual person corresponding to the modal particle in the second voice content.
8. The method according to any one of claims 1 to 5, further comprising:
acquiring a virtual person creation request sent by a user account in a social application program, wherein the virtual person creation request is used for creating a virtual person corresponding to the user account;
acquiring real person information corresponding to the user account and the sample voice fragment from a selfie video published by the user account;
generating a virtual person corresponding to the user account based on the real person information corresponding to the user account;
and binding the virtual person voice model obtained by training based on the sample voice fragment to the virtual person.
9. A method of speech synthesis for a virtual person, the method being performed by a terminal, the method comprising:
displaying an interactive interface comprising the virtual person;
responding to an interaction operation for interacting with the virtual person, and acquiring first text content, wherein the first text content comprises text information for interacting with the virtual person;
playing second voice content spoken by the virtual person, wherein the second voice content is obtained by performing inference on first voice content based on a virtual person voice model, the first voice content is obtained by converting second text content based on a text-to-voice model, and the second text content is obtained by performing natural language reply processing on the first text content based on a large language model;
wherein the virtual person voice model is used to convert, through inference, voice content having general voice characteristics into voice content having the voice characteristics of the target person, and the number of sample words used by the text-to-voice model in the training process is larger than the number of sample words used by the virtual person voice model in the training process.
10. The method of claim 9, wherein playing the second voice content spoken by the virtual person comprises:
playing the second voice content, which is spoken by the virtual person and has the voice characteristics of the target person, wherein the voice characteristics comprise at least one of a timbre characteristic, an emotion characteristic, and a prosody characteristic.
11. The method according to claim 9, wherein the method further comprises:
playing an interactive video comprising a limb action of the virtual person, wherein the limb action of the virtual person corresponds to a modal particle in the second voice content, the limb action of the virtual person is obtained from a limb action library, and the limb action library is constructed based on a sample video segment containing a target person.
12. The method according to claim 9, wherein the method further comprises:
playing an interactive video comprising an expression of the virtual person, wherein the expression of the virtual person corresponds to a modal particle in the second voice content, the expression of the virtual person is obtained from an expression library, and the expression library is constructed based on a sample video segment containing the target person.
13. The method according to claim 9, wherein the method further comprises:
displaying, in response to a trigger operation for creating the virtual person, the virtual person corresponding to a user account in a social application program;
wherein the virtual person is created based on real person information corresponding to the user account, the virtual person is bound to the virtual person voice model, the virtual person voice model is obtained by training based on a sample voice segment, and the real person information corresponding to the user account and the sample voice segment are obtained from a selfie video published by the user account.
14. The method of claim 9, wherein the interactive interface includes at least one text box for displaying text;
the method further comprises the steps of:
displaying text content in the at least one text box, the text content including the first text content and the second text content.
15. A speech synthesis apparatus for a virtual person, the apparatus comprising:
the acquisition module is used for acquiring first text content, wherein the first text content comprises text information for interacting with the virtual person;
the processing module is used for inputting the first text content into a large language model to obtain second text content, and the large language model is used for carrying out natural language reply processing on the first text content;
the processing module is further used for converting the second text content into first voice content based on a text-to-voice model, wherein the first voice content is voice information corresponding to the second text content;
the processing module is further used for performing inference on the first voice content based on a virtual person voice model to obtain second voice content, wherein the second voice content comprises voice characteristics of a target person;
wherein the virtual person voice model is used to convert, through inference, voice content having general voice characteristics into voice content having the voice characteristics of the target person, and the number of sample words used by the text-to-voice model in the training process is larger than the number of sample words used by the virtual person voice model in the training process.
16. A speech synthesis apparatus for a virtual person, the apparatus comprising:
the display module is used for displaying an interactive interface comprising the virtual person;
the interaction module is used for responding to the interaction operation for interacting with the virtual person and obtaining first text content, wherein the first text content comprises text information for interacting with the virtual person;
the playing module is used for playing second voice content spoken by the virtual person, wherein the second voice content is obtained by performing inference on the first voice content based on a virtual person voice model, the first voice content is obtained by converting second text content based on a text-to-voice model, and the second text content is obtained by performing natural language reply processing on the first text content based on a large language model;
wherein the virtual person voice model is used to convert, through inference, voice content having general voice characteristics into voice content having the voice characteristics of the target person, and the number of sample words used by the text-to-voice model in the training process is larger than the number of sample words used by the virtual person voice model in the training process.
17. A computer device, the computer device comprising: a processor and a memory, in which at least one program is stored, which is loaded and executed by the processor to implement the speech synthesis method for a virtual person according to any one of claims 1 to 14.
18. A computer readable storage medium, characterized in that at least one program is stored in the readable storage medium, which is loaded and executed by a processor to implement the speech synthesis method for a virtual person according to any of claims 1 to 14.
19. A computer program product, characterized in that the computer program product comprises at least one program, the at least one program being stored in a computer readable storage medium; a processor of a communication device reads the at least one program from the computer-readable storage medium, the processor executing the at least one program to cause the communication device to perform the speech synthesis method for a virtual person according to any one of claims 1 to 14.

Priority Applications (1)

CN202310663996.3A (priority date and filing date: 2023-06-06): Speech synthesis method, device, equipment, medium and product for virtual person


Publications (1)

CN117219047A, published 2023-12-12

Family

ID=89044975

Family Applications (1)

CN202310663996.3A (CN117219047A): Speech synthesis method, device, equipment, medium and product for virtual person

Country Status (1)

Country Link
CN (1) CN117219047A (en)


Legal Events

PB01: Publication