CN117334179A - Method, device and storage medium for real-time simulation of designated character tone by digital person

Method, device and storage medium for real-time simulation of designated character tone by digital person

Info

Publication number
CN117334179A
Authority
CN
China
Prior art keywords
encoder
audio
tone
model
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311317546.5A
Other languages
Chinese (zh)
Inventor
周旺华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shuheng Information Technology Co ltd
Original Assignee
Shanghai Shuheng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shuheng Information Technology Co ltd filed Critical Shanghai Shuheng Information Technology Co ltd
Priority to CN202311317546.5A
Publication of CN117334179A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/027 Concept to speech synthesisers; generation of natural phrases from machine-based concepts (under G10L 13/00 Speech synthesis; text-to-speech systems, and G10L 13/02 Methods for producing synthetic speech; speech synthesisers)
    • G10L 21/0272 Voice signal separating (under G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal in order to modify its quality or intelligibility, and G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L 25/18 Speech or voice analysis characterised by the extracted parameters being spectral information of each sub-band (under G10L 25/00 Speech or voice analysis techniques, and G10L 25/03 characterised by the type of extracted parameters)
    • G10L 25/30 Speech or voice analysis characterised by the analysis technique using neural networks (under G10L 25/00 Speech or voice analysis techniques, and G10L 25/27 characterised by the analysis technique)
    • G10L 2021/02087 Noise filtering where the noise is separate speech, e.g. cocktail party (under G10L 21/02 Speech enhancement, and G10L 21/0208 Noise filtering)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to a method, a device and a storage medium by which a digital person simulates the timbre (tone color) of a designated character in real time. The method comprises the following steps: 1) collecting the sound that is to be converted into the target character's timbre, and cutting it into segments; 2) preprocessing the audio cut in step 1) and extracting sound features; 3) learning the features extracted in step 2) with a generative adversarial network, so that the model better learns the target character's timbre; 4) feeding the audio to be inferred into the model, and passing the output of the generative adversarial network of step 3) through a vocoder to obtain the timbre-converted audio. Compared with the prior art, the invention can, after a short period of training, reproduce a timbre almost identical to the target character's, and can perform the timbre conversion in a very short time at the inference stage, meeting real-time requirements.

Description

Method, device and storage medium for real-time simulation of designated character tone by digital person
[ Technical Field ]
The invention relates to the technical field of natural language processing, and in particular to a method, a device and a storage medium by which a digital person simulates the timbre of a designated character in real time.
[ Background Art ]
Traditional digital-human timbre simulation first collects speech data from the target person, then performs preprocessing and feature extraction, builds a conversion model to carry out voice conversion, and finally synthesizes a timbre similar to the target person's.
The typical algorithm comprises the following steps:
(1) Collect speech data of the target person, including voice recording and audio sampling, for example recording with common devices such as mobile phones or voice recorders.
(2) Preprocess the collected speech data, including noise removal and framing of the speech signal. Noise in the raw audio is first handled with common methods such as waveform-intensity processing, filtering, spectral subtraction, adaptive noise suppression or deep learning; the audio is then framed by fixed-length or variable-length framing. The purpose of framing is to let signal-processing techniques better handle non-stationary signals such as speech.
(3) Extract sound features by analyzing the target person's speech signal; common features include the fundamental frequency, vocal-tract parameters and Mel-frequency cepstral coefficients (MFCCs). Human hearing is more sensitive to small volume increments when the perceived volume is low than when it is high, and MFCCs exploit this property to reflect the mapping between different sounds and the model's output data. These features are also the input used when the conversion model is built in step (4).
(4) Build a conversion model from the target person's features to the cloned character's features using a machine-learning or deep-learning algorithm, for example a Gaussian mixture model (GMM), a support vector machine (SVM) or a neural network (NN). Given the source and target sound features extracted in step (3), the model establishes a correspondence between the two and converts the source features into the target's feature representation through its learned parameters.
(5) Voice conversion: feed the target person's speech features into the conversion model, convert them into features similar to the cloned character's, and synthesize the cloned character's speech signal. The target person's voice is collected and processed by the methods of steps (1) to (3), after which the conversion model generates the sound signal.
(6) Speech synthesis: reconstruct the converted speech signal to synthesize a sound similar to the cloned character's.
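The traditional pipeline above can be made concrete with a short sketch. The following is a minimal, illustrative example of steps (2) and (3), preprocessing, framing and MFCC extraction, using librosa; the file name, sampling rate and frame parameters are assumptions chosen for illustration, not values taken from this document:

```python
import librosa

# Step (2): load and preprocess. librosa resamples to the requested rate;
# trimming leading/trailing silence stands in here for a full denoising
# chain (filtering, spectral subtraction, adaptive suppression, etc.).
y, sr = librosa.load("target_speaker.wav", sr=16000)  # hypothetical file
y, _ = librosa.effects.trim(y, top_db=30)

# Fixed-length framing: 25 ms windows with a 10 ms hop.
frame = int(0.025 * sr)  # 400 samples
hop = int(0.010 * sr)    # 160 samples

# Step (3): extract the fundamental frequency and Mel-frequency cepstral
# coefficients, two of the features named above.
f0, voiced_flag, _ = librosa.pyin(y,
                                  fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C7"),
                                  sr=sr, frame_length=4 * frame,
                                  hop_length=hop)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=2 * frame, hop_length=hop)
print(mfcc.shape)  # (13, n_frames): one 13-dimensional vector per frame
```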
This process requires a large amount of data and specialized technical support, and the result is affected by many factors, such as the voice characteristics of the target person and the quality of the conversion model. In practical applications, extensive testing and verification are therefore needed to ensure the effectiveness and reliability of the system; the process is time-consuming and cannot meet real-time requirements.
[ Summary of the Invention ]
The invention aims to overcome these defects by providing a method by which a digital person simulates a designated character's timbre in real time: after a short period of training, the method reproduces a timbre almost identical to the target character's, and at the inference stage it performs the timbre conversion in a very short time, meeting real-time requirements.
In one aspect of the invention, a method for a digital person to simulate a designated character's timbre in real time comprises an offline preprocessing stage and a real-time processing stage. In the offline preprocessing stage, features are first extracted from the target character's audio, and a model of the designated character's timbre is then obtained through training. In the real-time processing stage, while the audio content is kept unchanged, the model infers suitable acoustic features, which are then converted by a vocoder into audio with the designated timbre.
As one embodiment, the method for a digital person to simulate the designated character's timbre in real time comprises the following steps:
1) collecting the sound that is to be converted into the target character's timbre, and cutting it into segments;
2) preprocessing the audio cut in step 1) and extracting sound features;
3) learning the features extracted in step 2) with a generative adversarial network, so that the model better learns the target character's timbre;
4) feeding the audio to be inferred into the model, and passing the output of the generative adversarial network of step 3) through a vocoder to obtain the timbre-converted audio.
As an example, in step 1) the sound is cut into segments of 8 s duration and stored under a designated folder.
In step 2), the cut audio is first preprocessed, the human voice is separated from noise to eliminate the influence of the noise, and the Mel spectrum of the voice is then extracted.
As an embodiment, in step 3) the model as a whole is a generative adversarial network; the generator is based on a VAE model, with a normalizing flow added inside the VAE.
As an embodiment, in step 3) the model as a whole comprises a posterior encoder, a prior encoder, a decoder, a discriminator and a stochastic duration predictor; the posterior encoder and the discriminator are used only for training, not for inference. During model training, the processed input audio is first fed into the posterior encoder and the prior encoder respectively; the prior encoder outputs a mean and a variance, and the posterior encoder outputs z together with a mean and a variance. The z output by the posterior encoder is sliced and then fed into the decoder and into the flow; the output of the flow is aligned with the output of the prior encoder by monotonic alignment search, and the search result d, together with the prior encoder's output h_text, is input to the stochastic duration predictor.
In another aspect of the invention, a device for a digital person to simulate a designated character's timbre in real time comprises an audio cutting unit, a sound feature extraction unit, a generative-adversarial-network training unit and an audio conversion unit, wherein:
the audio cutting unit is used to collect the sound to be converted into the target character's timbre, cut it into segments and store them under a designated folder;
the sound feature extraction unit is used to preprocess the cut audio and extract sound features: the cut audio is first preprocessed to separate the human voice from noise, after which features are extracted from the Mel spectrum of the voice;
the generative-adversarial-network training unit is used to learn the extracted features through a generative adversarial network, so that the model better learns the target character's timbre;
the audio conversion unit is used to feed the audio to be inferred into the model and pass the output of the generative adversarial network through the vocoder to obtain the timbre-converted audio.
As an embodiment, in the generative-adversarial-network training unit the model as a whole comprises a posterior encoder, a prior encoder, a decoder, a discriminator and a stochastic duration predictor; the posterior encoder and the discriminator are used only for training, not for inference. During model training, the processed input audio is first fed into the posterior encoder and the prior encoder respectively; the prior encoder outputs a mean and a variance, and the posterior encoder outputs z together with a mean and a variance. The z output by the posterior encoder is sliced and then fed into the decoder and into the flow; the output of the flow is aligned with the output of the prior encoder by monotonic alignment search, and the search result d, together with the prior encoder's output h_text, is input to the stochastic duration predictor.
In a third aspect of the invention, a computer-readable storage medium is provided, the computer-readable storage medium comprising a stored program which performs the above method.
In a fourth aspect, the invention provides a computer device comprising a processor, a memory and a bus; the processor is connected with the memory through the bus; the memory is used to store a program, and the processor is used to run the program, the program performing the above method when run.
Compared with the prior art, the method is divided into two processes, offline preprocessing and real-time processing. In the offline preprocessing stage it extracts features from the target character's audio and obtains a model of the designated character's timbre through training; in the real-time stage only the content audio needs to be provided, and while the audio content is kept unchanged, the model infers suitable acoustic features which a vocoder then converts into audio with the designated timbre. Compared with the traditional approach, the target-timbre model is completed offline in advance and stored on the server, so that at real-time response one minute of inference audio can be timbre-converted within 3 seconds; the time consumed is short and real-time requirements are met. In summary, using about ten minutes of the target character's voice, the invention can, after a short period of training, reproduce a timbre almost identical to the target character's, and at the inference stage one minute of test audio is converted within 3 seconds, meeting real-time requirements and being worthy of popularization and application.
[ Description of the Drawings ]
FIG. 1 is a schematic diagram of the steps of the method of the present invention for a digital person to simulate a designated character's timbre in real time;
FIG. 2 is a schematic flow chart of the processing method of the present invention;
FIG. 3 is a flow chart of a conventional processing method.
[ Detailed Description of the Preferred Embodiments ]
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described below with reference to the accompanying drawings and specific embodiments:
referring to fig. 1, an embodiment of the present invention is a method for real-time simulating the tone of a designated person by a digital person, which includes the following steps: 1) Collecting the sound which needs to be converted into the tone of the target character, and cutting the sound; 2) Preprocessing the audio frequency cut in the step 1) and extracting sound characteristics; 3) Learning the features extracted in the step 2) by generating an countermeasure network, so that the model learns the tone of the target person better; 4) And 3) sending the audio to be inferred into a model, and obtaining the audio after tone conversion by the vocoder according to the output result of the countermeasure network generated in the step 3).
In this embodiment, as a further example, in step 1) the sound is cut into segments of 8 s duration and stored under a designated folder.
In this embodiment, as a further example, in step 2) the cut audio is first preprocessed, the human voice is separated from noise to eliminate the influence of the noise, and the Mel spectrum of the voice is then extracted.
In this embodiment, as a further example, in step 3) the model as a whole is a generative adversarial network; the generator is based on a VAE model, with a normalizing flow added inside the VAE. Specifically, the model comprises a posterior encoder, a prior encoder, a decoder, a discriminator and a stochastic duration predictor; the posterior encoder and the discriminator are used only for training, not for inference. During model training, the processed input audio is first fed into the posterior encoder and the prior encoder respectively; the prior encoder outputs a mean and a variance, and the posterior encoder outputs z together with a mean and a variance. The z output by the posterior encoder is sliced and then fed into the decoder and into the flow; the output of the flow is aligned with the output of the prior encoder by monotonic alignment search, and the search result d, together with the prior encoder's output h_text, is input to the stochastic duration predictor.
In addition, the invention also provides a computer-readable storage medium comprising a stored program which, when run, performs the above method for a digital person to simulate a designated character's timbre in real time.
Further, the invention also provides a computer device comprising a processor, a memory and a bus; the processor is connected with the memory through the bus, the memory is used to store a program, and the processor is used to run the program, the method for a digital person to simulate a designated character's timbre in real time being performed when the program runs.
In a second aspect, an embodiment of the invention provides a device for a digital person to simulate a designated character's timbre in real time, comprising an audio cutting unit, a sound feature extraction unit, a generative-adversarial-network training unit and an audio conversion unit, wherein:
the audio cutting unit is used to collect the sound to be converted into the target character's timbre, cut it into segments and store them under a designated folder;
the sound feature extraction unit is used to preprocess the cut audio and extract sound features: the cut audio is first preprocessed to separate the human voice from noise, after which features are extracted from the Mel spectrum of the voice;
the generative-adversarial-network training unit is used to learn the extracted features through a generative adversarial network, so that the model better learns the target character's timbre;
the audio conversion unit is used to feed the audio to be inferred into the model and pass the output of the generative adversarial network through the vocoder to obtain the timbre-converted audio.
In this embodiment, as a further example, in the generative-adversarial-network training unit the model as a whole comprises a posterior encoder, a prior encoder, a decoder, a discriminator and a stochastic duration predictor; the posterior encoder and the discriminator are used only for training, not for inference. During model training, the processed input audio is first fed into the posterior encoder and the prior encoder respectively; the prior encoder outputs a mean and a variance, and the posterior encoder outputs z together with a mean and a variance. The z output by the posterior encoder is sliced and then fed into the decoder and into the flow; the output of the flow is aligned with the output of the prior encoder by monotonic alignment search, and the search result d, together with the prior encoder's output h_text, is input to the stochastic duration predictor.
As a more specific embodiment, the invention divides the process by which a digital person simulates a designated character's timbre in real time into four steps:
(1) Collect the sound to be converted into the target character's timbre, cut it into segments of 8 s duration, and store them under the designated folder. Because computing resources are limited, long audio clips tend to require a large amount of GPU memory; cutting leaves the total duration of the training audio unchanged while greatly reducing the computing resources needed.
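A minimal sketch of this cutting step, under the assumption of a single long source recording; the paths and naming scheme are hypothetical, and the trailing partial segment is simply dropped:

```python
import os

import librosa
import soundfile as sf

SEGMENT_SECONDS = 8              # the 8 s segment length named above
OUT_DIR = "data/target_speaker"  # hypothetical "designated folder"

# Load the collected recording at its native sample rate.
y, sr = librosa.load("raw_recording.wav", sr=None)  # hypothetical file
os.makedirs(OUT_DIR, exist_ok=True)

# Cut into consecutive 8 s clips and save each one to the folder.
step = SEGMENT_SECONDS * sr
for i, start in enumerate(range(0, len(y) - step + 1, step)):
    clip = y[start:start + step]
    sf.write(os.path.join(OUT_DIR, f"clip_{i:04d}.wav"), clip, sr)
```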
(2) Preprocess the audio cut in step (1) and extract sound features. The audio must first be processed to separate the human voice from noise and eliminate the influence of the noise; feature extraction is then performed on the Mel spectrum of the voice.
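The document does not name a specific separation tool, so in the sketch below librosa's harmonic-percussive separation stands in for the voice/noise split (a production system might use a dedicated source-separation model instead), and an 80-band Mel spectrogram is taken as the acoustic feature; all parameter values are illustrative:

```python
import librosa
import numpy as np

# One of the 8 s clips produced in step (1) (hypothetical path).
y, sr = librosa.load("data/target_speaker/clip_0000.wav", sr=22050)

# Crude stand-in for voice/noise separation: keep the harmonic component,
# which carries most of the voiced speech energy.
y_voice = librosa.effects.harmonic(y)

# Mel spectrogram of the separated voice, the feature learned in step (3).
mel = librosa.feature.melspectrogram(y=y_voice, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (80, n_frames)
```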
(3) Learn the extracted features through a generative adversarial network, so that the model better learns the target character's timbre. The model as a whole is a generative adversarial network whose generator is a VAE in which a normalizing flow is used. The model mainly comprises five parts: a posterior encoder, a prior encoder, a decoder, a discriminator and a stochastic duration predictor. The posterior encoder and the discriminator are used only for training, not for inference. During model training, the processed input audio is first fed into the posterior encoder and the prior encoder respectively; the prior encoder outputs a mean and a variance, and the posterior encoder outputs z together with a mean and a variance. The z output by the posterior encoder is sliced and then fed into the decoder and into the flow; the output of the flow is aligned with the output of the prior encoder by monotonic alignment search, and the search result d, together with the prior encoder's output h_text, is input to the stochastic duration predictor.
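This structure matches the VITS family of models. The PyTorch skeleton below sketches only the training-time data flow described in this step; every module body is a placeholder (a single convolution or an identity map), since the document does not specify the internal architectures, and the monotonic alignment search, duration predictor and loss terms are indicated only in comments:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Placeholder encoder that projects features to a mean and log-variance."""
    def __init__(self, in_ch, z_ch=192):
        super().__init__()
        self.proj = nn.Conv1d(in_ch, 2 * z_ch, kernel_size=5, padding=2)

    def forward(self, x):
        m, logs = self.proj(x).chunk(2, dim=1)
        return m, logs

posterior_enc = GaussianEncoder(in_ch=80)  # fed the processed audio (spectrogram)
prior_enc = GaussianEncoder(in_ch=192)     # fed the content representation h_text
flow = nn.Identity()     # placeholder for the normalizing flow inside the VAE
decoder = nn.Identity()  # placeholder waveform decoder (the GAN generator)

def training_forward(spec, h_text, slice_len=32):
    # The prior encoder outputs a mean and variance ...
    m_p, logs_p = prior_enc(h_text)
    # ... and the posterior encoder outputs z with its mean and variance
    # (reparameterization trick).
    m_q, logs_q = posterior_enc(spec)
    z = m_q + torch.randn_like(m_q) * torch.exp(logs_q)
    # z is sliced and sent to the decoder; the decoded slice is what the
    # discriminator judges during adversarial training (the slice position
    # is random in practice, fixed here for brevity).
    wav_hat = decoder(z[..., :slice_len])
    # z also passes through the flow; its output would be aligned with the
    # prior by monotonic alignment search, and the durations d plus h_text
    # would train the stochastic duration predictor (both omitted here).
    z_p = flow(z)
    return wav_hat, (z_p, m_p, logs_p, m_q, logs_q)

# Shape check with dummy tensors: batch of 2, 200 frames.
wav_hat, stats = training_forward(torch.randn(2, 80, 200), torch.randn(2, 192, 200))
```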
(4) At inference time, the output of the generative adversarial network of step (3) is passed through the vocoder to obtain the timbre-converted audio. The audio to be inferred is fed into the model, which carries the learned timbre characteristics, and the timbre conversion is completed in a very short time.
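Since the posterior encoder and the discriminator are training-only, inference exercises just the prior path and the vocoder. A hedged sketch, in which `model` and `vocoder` are hypothetical handles to the trained networks (the `infer` method name is an assumption, not an API defined by this document):

```python
import torch

@torch.no_grad()
def convert_timbre(model, vocoder, content_features):
    """Convert input audio features into the target character's timbre.

    The model infers acoustic features that keep the spoken content
    unchanged but carry the learned timbre; the vocoder then renders
    those features as a waveform.
    """
    acoustic = model.infer(content_features)  # hypothetical method name
    return vocoder(acoustic)
```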
As shown in FIG. 2, the overall flow of the method for a digital person to simulate a designated character's timbre in real time is divided into an offline preprocessing stage and a real-time processing stage. In the offline preprocessing stage, features are extracted from the target character's audio, and a model of the designated character's timbre is obtained through training. In the real-time processing stage, only the content audio needs to be provided; while the audio content is kept unchanged, the model infers suitable acoustic features, which the vocoder then converts into audio with the designated timbre.
The invention is further described below with reference to a 1-minute audio clip that needs to be converted:
On the one hand, in the traditional processing flow shown in FIG. 3, the audio to be inferred is fed into the sound conversion model and synthesized into the target character's speech signal, which is then reconstructed to obtain the specific sound. This synthesis takes 10 seconds, and the final synthesized audio is 985 KB.
On the other hand, with the invention the target-timbre model is completed offline in advance and stored on the server, so that at real-time response one minute of inference audio is timbre-converted within 3 seconds; the audio size is 978 KB.
The invention can therefore complete the timbre conversion in an extremely short time and meet real-time requirements, overcoming the long processing time of the traditional approach and its inability to operate in real time.
The functions of the methods of the embodiments of the present invention, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the part of the present invention that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a mobile computing device, a network device or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention; the storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random-access memory, a magnetic disk or an optical disk.
The above embodiments are intended only to illustrate the technical solution of the present invention and are not limiting; the technical features of the above embodiments, or of different embodiments, may also be combined within the concept of the invention, the steps may be carried out in any order, and many other variations of the different aspects of the invention exist as described above. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications and substitutions do not cause the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
The present invention is not limited to the above embodiments; any other changes, modifications, substitutions, combinations and simplifications that do not depart from the spirit and principles of the invention are equivalent replacements and fall within the scope of the invention.

Claims (10)

1. A method for a digital person to simulate the timbre of a designated character in real time, characterized in that the method comprises an offline preprocessing stage and a real-time processing stage; in the offline preprocessing stage, features are first extracted from the target character's audio, and a model of the designated character's timbre is then obtained through training; in the real-time processing stage, while the audio content is kept unchanged, suitable acoustic features are inferred by the model and then converted by a vocoder into audio with the designated timbre.
2. The method of claim 1, comprising the following steps:
1) collecting the sound that is to be converted into the target character's timbre, and cutting it into segments;
2) preprocessing the audio cut in step 1) and extracting sound features;
3) learning the features extracted in step 2) with a generative adversarial network, so that the model better learns the target character's timbre;
4) feeding the audio to be inferred into the model, and passing the output of the generative adversarial network of step 3) through a vocoder to obtain the timbre-converted audio.
3. The method of claim 2, wherein in step 1) the sound is cut into segments of 8 s duration and stored under a designated folder.
4. The method of claim 2, wherein in step 2) the cut audio is first preprocessed, the human voice is separated from noise to eliminate the influence of the noise, and features are then extracted from the Mel spectrum of the voice.
5. The method of claim 2, wherein in step 3) the model as a whole is a generative adversarial network, the generator is based on a VAE model, and a normalizing flow is added inside the VAE.
6. The method of claim 5, wherein in step 3) the model as a whole comprises a posterior encoder, a prior encoder, a decoder, a discriminator and a stochastic duration predictor; the posterior encoder and the discriminator are used only for training, not for inference; during model training, the processed input audio is first fed into the posterior encoder and the prior encoder respectively; the prior encoder outputs a mean and a variance, and the posterior encoder outputs z together with a mean and a variance; the z output by the posterior encoder is sliced and then fed into the decoder and into the flow, the output of the flow is aligned with the output of the prior encoder by monotonic alignment search, and the search result d, together with the prior encoder's output h_text, is input to the stochastic duration predictor.
7. A device for a digital person to simulate the timbre of a designated character in real time, characterized by comprising:
an audio cutting unit, used to collect the sound to be converted into the target character's timbre, cut it into segments and store them under a designated folder;
a sound feature extraction unit, used to preprocess the cut audio and extract sound features: the cut audio is first preprocessed to separate the human voice from noise, after which features are extracted from the Mel spectrum of the voice;
a generative-adversarial-network training unit, used to learn the extracted features through a generative adversarial network, so that the model better learns the target character's timbre; and
an audio conversion unit, used to feed the audio to be inferred into the model and pass the output of the generative adversarial network through the vocoder to obtain the timbre-converted audio.
8. The apparatus of claim 7, wherein, in the generative-adversarial-network training unit, the model as a whole comprises a posterior encoder, a prior encoder, a decoder, a discriminator and a stochastic duration predictor; the posterior encoder and the discriminator are used only for training, not for inference; during model training, the processed input audio is first fed into the posterior encoder and the prior encoder respectively; the prior encoder outputs a mean and a variance, and the posterior encoder outputs z together with a mean and a variance; the z output by the posterior encoder is sliced and then fed into the decoder and into the flow, the output of the flow is aligned with the output of the prior encoder by monotonic alignment search, and the search result d, together with the prior encoder's output h_text, is input to the stochastic duration predictor.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program which performs the method of any one of claims 1 to 6.
10. A computer device, characterized by comprising a processor, a memory and a bus; the processor is connected with the memory through the bus; the memory is used to store a program, and the processor is used to run the program, the program performing the method of any one of claims 1 to 6 when run.
CN202311317546.5A 2023-10-12 2023-10-12 Method, device and storage medium for real-time simulation of designated character tone by digital person Pending CN117334179A (en)

Priority Applications (1)

Application Number: CN202311317546.5A; Priority/Filing Date: 2023-10-12; Title: Method, device and storage medium for real-time simulation of designated character tone by digital person


Publications (1)

Publication Number: CN117334179A; Publication Date: 2024-01-02

Family

ID=89276880

Family Applications (1)

Application Number: CN202311317546.5A; Status: Pending; Title: Method, device and storage medium for real-time simulation of designated character tone by digital person

Country Status (1)

Country Link
CN (1) CN117334179A (en)


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination