WO2023206928A1 - Speech processing method and apparatus, computer device, and computer-readable storage medium - Google Patents

Speech processing method and apparatus, computer device, and computer-readable storage medium

Info

Publication number
WO2023206928A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
timbre
model
user
Prior art date
Application number
PCT/CN2022/119157
Other languages
English (en)
French (fr)
Inventor
张旸
詹皓粤
林悦
Original Assignee
网易(杭州)网络有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网易(杭州)网络有限公司
Publication of WO2023206928A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present disclosure relates to the field of information processing technology, and specifically to a speech processing method, device, computer equipment and computer-readable storage medium.
  • Voice cloning refers to a technology in which the machine extracts timbre information from the voice provided by the user and uses the user's timbre to synthesize speech.
  • Voice cloning is an extension of speech synthesis technology.
  • Traditional speech synthesis achieves text-to-speech conversion on a fixed speaker, while voice cloning further specifies the speaker's timbre.
  • voice cloning already has many practical scenarios, such as voice navigation and audiobook applications: users can customize their own voice package by uploading their voice, and navigate or have novels read aloud in their own voice, making the application more engaging.
  • Embodiments of the present disclosure provide a speech processing method, device, computer equipment and computer-readable storage medium, which can address the problems that it is difficult to obtain a user-provided recording whose audio is consistent with the read content, that voice recording places high demands on the user, and that the user experience is affected as a result.
  • an embodiment of the present disclosure provides a speech processing method, including:
  • Speech conversion processing is performed based on the target user's user voice and designated timbre information to obtain a designated converted voice with the designated timbre, wherein the designated timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the designated converted voice is the user's voice with the designated timbre;
  • a speech conversion model is trained according to the user voice and the designated converted voice to obtain a target speech conversion model; the target text of the speech to be synthesized and the designated timbre information are input into a speech synthesis model to generate intermediate speech with the designated timbre; and the target speech conversion model performs speech conversion processing on the intermediate speech to generate a target synthesized speech that matches the timbre of the target user.
  • an embodiment of the present disclosure also provides a voice processing device, including:
  • a first processing unit configured to perform voice conversion processing based on the target user's user voice and specified timbre information to obtain a specified converted voice with the specified timbre, wherein the specified timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the specified converted voice is the user's voice with the specified timbre;
  • a training unit configured to train a speech conversion model based on the user's voice and the specified converted voice to obtain a target speech conversion model;
  • a generation unit configured to input the target text of the speech to be synthesized and the specified timbre information into the speech synthesis model, and generate an intermediate speech of the specified timbre;
  • the second processing unit is configured to perform speech conversion processing on the intermediate speech through the target speech conversion model, and generate a target synthesized speech that matches the timbre of the target user.
  • the device further includes:
  • the first acquisition subunit is used to acquire language content features and prosodic features from the user voice of the target user;
  • the first processing subunit is configured to perform voice conversion processing based on the language content features, the prosodic features and designated timbre information to obtain the designated converted voice of the designated timbre.
  • the device further includes:
  • the second acquisition subunit is used to acquire a sample voice, the text of the sample voice, and sample timbre information;
  • a first adjustment unit configured to adjust the model parameters of a preset speech model based on the sample voice, the text of the sample voice, and the sample timbre information to obtain an adjusted preset speech model;
  • the second processing subunit is used to continue to obtain the next sample voice, the text of the next sample voice, and the sample timbre information in the training sample voice set, and to execute the step of adjusting the model parameters of the preset speech model based on the sample voice, the text of the sample voice, and the sample timbre information, until the training of the adjusted speech model meets the model training end condition, so as to obtain a trained preset speech model as the speech synthesis model.
  • the device further includes:
  • the second adjustment unit is configured to adjust the model parameters of a parallel speech conversion model based on the user's voice and the specified converted voice until the model training end condition of the parallel speech conversion model is met, so as to obtain a trained parallel speech conversion model as the target speech conversion model.
  • the device further includes:
  • the third acquisition subunit is used to acquire a training voice pair and the preset timbre information corresponding to the training voice, wherein the training voice pair includes an original voice and an output voice, the original voice and the output voice are the same voice, and all voices in the training voice pair are voices in the training sample voice set;
  • the third adjustment unit is used to adjust the model parameters of a non-parallel speech conversion model based on the original speech, the output speech and the preset timbre information until the model training end condition of the non-parallel speech conversion model is met, so as to obtain a trained non-parallel speech conversion model as the target non-parallel speech conversion model.
  • the device further includes:
  • the third processing subunit is configured to perform language content extraction processing on the original speech through the language feature processor of the non-parallel speech conversion model to obtain the language content features of the original speech;
  • the fourth processing subunit is used to perform prosody extraction processing on the original speech through the prosodic feature processor of the non-parallel speech conversion model to obtain the prosodic features of the original speech;
  • a fourth adjustment unit configured to adjust model parameters of the non-parallel speech conversion model based on the language content characteristics of the original speech, the prosodic characteristics of the original speech, the preset timbre information and the output speech.
  • the device further includes:
  • the first generation subunit is used to perform language information screening processing on the original speech, determine the language information corresponding to the original speech, generate a first specified-length vector based on the language information, and use the first specified-length vector as the language content feature.
  • the device further includes:
  • the second generation subunit is used to perform prosodic information screening processing on the original speech, determine the prosodic information corresponding to the original speech, generate a second specified-length vector based on the prosodic information, and use the second specified-length vector as the prosodic feature.
  • the device further includes:
  • the fifth processing subunit is used to perform language content extraction processing on the user's voice through the language feature processor of the target non-parallel voice conversion model to obtain the language content features of the user's voice;
  • the sixth processing subunit is configured to perform prosody extraction processing on the user's voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain the prosodic features of the user's voice.
  • the device further includes:
  • the input subunit is used to input the language content characteristics of the user's voice, the prosodic characteristics of the user's voice, and the designated timbre information into the target non-parallel speech conversion model, and generate the designated converted speech of the designated timbre.
  • embodiments of the present disclosure also provide a computer device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor.
  • when the computer program is executed by the processor, the steps of any one of the speech processing methods are implemented.
  • embodiments of the present disclosure also provide a computer-readable storage medium.
  • a computer program is stored on the computer-readable storage medium.
  • when the computer program is executed by a processor, the steps of any one of the speech processing methods are implemented.
  • Embodiments of the present disclosure provide a speech processing method, device, computer equipment and computer-readable storage medium.
  • the target text is synthesized into an intermediate speech of a specified timbre through the speech synthesis model.
  • the specified timbre of the intermediate voice is directly converted into the timbre of the user's voice through the parallel voice conversion model to obtain the target synthesized voice, which enables quick voice cloning and keeps the user's cloning operation simple.
  • the embodiments of the present disclosure can thus effectively improve the operating efficiency of voice cloning; moreover, a corresponding parallel conversion model can be generated for each user's voice while multiple users share one speech synthesis model and one non-parallel speech conversion model, which simplifies the structure of the speech conversion model and keeps it lightweight, thereby reducing the storage the speech conversion model consumes on the computer device.
  • Figure 1 is a schematic scene diagram of a speech processing system provided by an embodiment of the present disclosure
  • Figure 2 is a schematic flow chart of a speech processing method provided by an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of training of the speech synthesis model provided by an embodiment of the present disclosure
  • Figure 4 is a schematic diagram of training of a non-parallel speech conversion model provided by an embodiment of the present disclosure
  • Figure 5 is a schematic diagram of the application of the non-parallel speech conversion model provided by an embodiment of the present disclosure
  • Figure 6 is a schematic diagram of training of the parallel speech conversion model provided by an embodiment of the present disclosure.
  • Figure 7 is a schematic diagram of the application of the speech synthesis model provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of the application of the parallel speech conversion model provided by an embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a voice processing device provided by an embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
  • Embodiments of the present disclosure provide a speech processing method, device, computer equipment, and computer-readable storage medium.
  • the speech processing method of the embodiment of the present disclosure can be executed by a computer device, where the computer device can be a terminal.
  • the terminal can be a smartphone, a tablet computer, a notebook computer, a touch screen, a game console, a personal computer (PC), a personal digital assistant (PDA), or another terminal device.
  • the terminal may also include a client.
  • the client may be a video application client, a music application client, a game application client, a browser client carrying a game program, an instant messaging client, or the like.
  • FIG. 1 is a schematic diagram of a speech processing system provided by an embodiment of the present disclosure, including computer equipment.
  • the system may include at least one terminal, at least one server, and a network.
  • the terminal held by the user can connect to the servers of different games through the network.
  • a terminal is any device with computing hardware capable of supporting and executing a software product corresponding to a game.
  • the terminal has one or more multi-touch-sensitive screens for sensing and obtaining user input through touch or sliding operations performed at multiple points of the one or more touch-sensitive display screens.
  • when the system includes multiple terminals, multiple servers, and multiple networks, different terminals can be connected to each other through different networks and different servers.
  • the network may be a wireless network or a wired network.
  • the wireless network may be a wireless local area network (WLAN), a local area network (LAN), a cellular network, a 2G network, a 3G network, a 4G network, a 5G network, etc.
  • different terminals can also use their own Bluetooth network or hotspot network to connect to other terminals or connect to servers, etc.
  • the computer device can obtain language content features and prosodic features from the user voice of the target user; perform voice conversion processing based on the language content features, the prosodic features and designated timbre information to obtain a designated converted voice with the designated timbre; train a voice conversion model according to the user voice and the designated converted voice to obtain a target voice conversion model; input the target text of the voice to be synthesized and the designated timbre information into the speech synthesis model to generate intermediate speech with the designated timbre; and perform voice conversion processing on the intermediate speech through the target voice conversion model to generate a target synthesized voice that matches the timbre of the target user.
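  • To make these four operations concrete, the following minimal sketch chains them as one pipeline. Every name in it (`extract_features`, `convert`, `make_parallel_vc`, `synthesize`) is a hypothetical placeholder for the models described in the embodiments below, not an API defined by this disclosure.

```python
def clone_and_speak(user_wav, target_text, timbre_id,
                    nonparallel_vc, tts, make_parallel_vc):
    """Hypothetical end-to-end sketch of the disclosed pipeline."""
    # 1. Extract timbre-independent content and prosody from the user's voice.
    content, prosody = nonparallel_vc.extract_features(user_wav)

    # 2. Re-render the same utterance in a preset (specified) timbre, which
    #    yields a parallel pair: (specified-timbre speech, user speech).
    specified_wav = nonparallel_vc.convert(content, prosody, timbre_id)

    # 3. Train a small per-user parallel conversion model on that pair.
    parallel_vc = make_parallel_vc(source=specified_wav, target=user_wav)

    # 4. Synthesize new text in the specified timbre, then map the result
    #    to the user's timbre with the per-user model.
    intermediate = tts.synthesize(target_text, timbre_id)
    return parallel_vc.convert(intermediate)
```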
  • Embodiments of the present disclosure provide a voice processing method, device, computer equipment, and computer-readable storage medium.
  • the voice processing method can be used with a terminal, such as a smart phone, a tablet computer, a notebook computer, or a personal computer.
  • the speech processing method, device, terminal and storage medium are described in detail below. It should be noted that the order of description of the following embodiments does not limit the preferred order of the embodiments.
  • Figure 2 is a schematic flow chart of a speech processing method provided by an embodiment of the present disclosure; the specific process may include the following steps 101 to 104.
  • the method includes:
  • the voice conversion processing based on the target user's user voice and specified timbre information includes:
  • Speech conversion processing is performed based on the language content features, the prosodic features and designated timbre information to obtain a designated converted voice of the designated timbre.
  • the method may include:
  • the training voice pair includes an original voice and an output voice, the original voice and the output voice are the same voice, and all voices in the training voice pair are voices in the training sample voice set;
  • the model parameters of the non-parallel speech conversion model are adjusted based on the original speech, the output speech and the preset timbre information until the model training end condition of the non-parallel speech conversion model is met, and the trained non-parallel speech conversion model is obtained as a target non-parallel speech conversion model.
  • the method may include:
  • the prosodic feature processor of the non-parallel speech conversion model performs prosody extraction processing on the original speech to obtain the prosodic features of the original speech;
  • the model parameters of the non-parallel speech conversion model are adjusted based on the language content characteristics of the original speech, the prosodic characteristics of the original speech, the preset timbre information and the output speech.
  • the step "the language feature processor of the non-parallel speech conversion model performs language content extraction processing on the original speech to obtain the language content features of the original speech" the method may include:
  • Perform language information screening processing on the original speech determine the language information corresponding to the original speech, generate a first specified length vector based on the language information, and use the first specified length vector as a language content feature.
  • the step "the prosodic feature processor of the non-parallel speech conversion model performs prosody extraction processing on the original speech to obtain the prosodic features of the original speech" the method may include:
  • the step "obtaining language content features and prosodic features from the target user's user voice” may include:
  • the prosodic feature processor of the target non-parallel speech conversion model performs prosody extraction processing on the user's voice to obtain the prosodic features of the user's voice.
  • the method may include:
  • the language content characteristics of the user's voice, the prosodic characteristics of the user's voice, and the specified timbre information are input into the target non-parallel speech conversion model to generate a specified converted speech of the specified timbre.
  • the step "training a speech conversion model based on the user's voice and the specified converted voice to obtain a target speech conversion model” may include:
  • the model parameters of the parallel speech conversion model are adjusted based on the user's voice and the specified converted voice until the model training end conditions of the parallel voice conversion model are met, and a trained parallel voice conversion model is obtained as a target voice conversion model.
  • the method may include:
  • the trained preset speech model is obtained as a speech synthesis model.
  • the embodiment of the present disclosure is provided with a pre-training stage.
  • the speech synthesis model and the non-parallel speech conversion model can be model trained.
  • Figure 3 is a schematic diagram of the training of the speech synthesis model.
  • the existing multi-speaker voices in the database, the text data corresponding to those voices, and the preset timbres can be used to train the speech synthesis model.
  • in the pre-training stage of the speech synthesis model, a large amount of text, speech and timbre-mark data is input into a neural network model for training. The model is generally an end-to-end deep neural network, and there are many specific structures to choose from, including but not limited to popular architectures such as Tacotron and FastSpeech.
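  • As a hedged illustration of this training objective, the sketch below fits a toy timbre-conditioned text-to-mel model in PyTorch. The architecture is a deliberately minimal stand-in for Tacotron/FastSpeech-scale networks; all names and dimensions are illustrative assumptions rather than values from the disclosure, and a real system would also predict durations instead of assuming text and mel frames align.

```python
import torch
import torch.nn as nn

class ToyTTS(nn.Module):
    """Toy timbre-conditioned text-to-mel model (Tacotron/FastSpeech stand-in)."""
    def __init__(self, n_phones=100, n_timbres=32, d=128, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(n_phones, d)
        self.timbre_emb = nn.Embedding(n_timbres, d)  # the preset "timbre mark"
        self.encoder = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.to_mel = nn.Linear(2 * d, n_mels)

    def forward(self, phone_ids, timbre_id):
        # Condition every text position on the chosen preset timbre.
        x = self.text_emb(phone_ids) + self.timbre_emb(timbre_id).unsqueeze(1)
        h, _ = self.encoder(x)
        return self.to_mel(h)  # (batch, time, n_mels)

model = ToyTTS()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on dummy data: phone ids, timbre mark, target mel frames.
phones = torch.randint(0, 100, (4, 50))
timbre = torch.randint(0, 32, (4,))
target_mel = torch.randn(4, 50, 80)

loss = nn.functional.l1_loss(model(phones, timbre), target_mel)
opt.zero_grad(); loss.backward(); opt.step()
```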
  • Figure 4 is a training diagram of the non-parallel speech conversion model.
  • the language feature extraction module extracts the language-related feature representation of the original speech, the prosodic feature extraction module extracts the prosody-related feature representation, and these two representations, together with the timbre mark and the output speech, are input into the non-parallel speech conversion model for training.
  • the purpose of the language feature extraction module is to obtain a language feature representation that is independent of the timbre based on the input speech.
  • the language feature extraction module can remove the information irrelevant to the language content from the speech, extract only the language information, and convert it into a fixed-length vector representation; the extracted language information should accurately reflect the spoken content of the original speech, without errors or omissions.
  • this language feature extraction module is implemented with a neural network model, and there are many specific implementations. One is to train a speech recognition model on a large amount of speech and text and select the output of a specific hidden layer of the model as the language feature representation; another is to use an unsupervised training method, for example a VQ-VAE model.
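  • The first route, tapping a hidden layer of a trained recognizer, can be sketched as follows. The toy CTC recognizer, its layer sizes, and the choice of which layer to tap are illustrative assumptions; in practice the module could equally be a VQ-VAE encoder, as the text notes.

```python
import torch
import torch.nn as nn

class TinyASR(nn.Module):
    """Toy CTC recognizer whose second hidden layer doubles as a content encoder."""
    def __init__(self, n_mels=80, d=256, vocab=50):
        super().__init__()
        self.rnn1 = nn.GRU(n_mels, d, batch_first=True)
        self.rnn2 = nn.GRU(d, d, batch_first=True)  # layer tapped for content features
        self.head = nn.Linear(d, vocab + 1)         # +1 for the CTC blank symbol

    def forward(self, mels):
        h1, _ = self.rnn1(mels)
        h2, _ = self.rnn2(h1)
        return self.head(h2).log_softmax(-1), h2

asr = TinyASR()
mels = torch.randn(2, 120, 80)   # (batch, frames, mel bins)
log_probs, content = asr(mels)
# After CTC training on paired speech and text, `content` (2, 120, 256) serves
# as the timbre-independent language feature representation; `log_probs` is
# what nn.CTCLoss consumes during that training.
```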
  • the prosodic feature extraction module aims to obtain a prosodic feature representation based on the input speech and convert it into a vector representation.
  • the purpose of the prosodic feature extraction module is to ensure that the converted speech is consistent with the original speech in prosodic style, so that the data before and after conversion are completely parallel except for timbre, which facilitates modeling by the parallel conversion model.
  • prosodic features can be extracted with signal processing tools and algorithms, for example using commonly used speech features such as fundamental frequency and energy; features related to speech emotion classification can also be used.
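  • This signal-processing route can be reproduced with standard tools. A minimal sketch with librosa, assuming a 16 kHz recording whose file name is a placeholder, extracts fundamental frequency (pYIN) and RMS energy on a shared hop grid:

```python
import numpy as np
import librosa

y, sr = librosa.load("user_utterance.wav", sr=16000)  # placeholder file name

# Fundamental frequency via pYIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, frame_length=1024, hop_length=256)
f0 = np.nan_to_num(f0)  # zero out unvoiced frames

# Frame-level RMS energy on the same hop grid.
energy = librosa.feature.rms(y=y, frame_length=1024, hop_length=256)[0]

# Stack into a (frames, 2) prosody track aligned with the content features.
n = min(len(f0), len(energy))
prosody = np.stack([f0[:n], energy[:n]], axis=1)
```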
  • the purpose of the non-parallel speech conversion model is to generate converted speech with the corresponding timbre and semantic content based on the linguistic feature representation, the prosodic feature representation and the specified timbre mark extracted from the user's voice, and thereby to construct parallel speech data for training the parallel conversion model.
  • the non-parallel speech conversion model requires that the timbre of the converted speech is similar to the timbre of the target user's user voice, and that the semantic content, prosody, etc. are completely consistent with the original speech.
  • the non-parallel speech conversion model takes as training input the language feature representation extracted by the language feature extraction module, the prosodic feature representation obtained by the prosodic feature module, the timbre mark, and the corresponding output speech.
  • many specific model structures can be used; for example, convolutional networks, recurrent neural networks, Transformers, or any combination thereof can be used to build the non-parallel speech conversion model.
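  • To make the fusion of the three inputs concrete, here is a minimal non-parallel conversion decoder in the spirit described: content features plus frame-level prosody plus a timbre embedding, mapped to mel frames. Layer choices and sizes are assumptions, and any of the architectures named above could replace the GRU stack.

```python
import torch
import torch.nn as nn

class NonParallelVC(nn.Module):
    """Toy non-parallel conversion model: (content, prosody, timbre) -> mels."""
    def __init__(self, d_content=256, d_prosody=2, n_timbres=32, d=256, n_mels=80):
        super().__init__()
        self.timbre_emb = nn.Embedding(n_timbres, d)
        self.proj = nn.Linear(d_content + d_prosody + d, d)
        self.decoder = nn.GRU(d, d, num_layers=2, batch_first=True)
        self.to_mel = nn.Linear(d, n_mels)

    def forward(self, content, prosody, timbre_id):
        t = content.size(1)
        # Broadcast the timbre embedding across all frames, then fuse.
        timbre = self.timbre_emb(timbre_id).unsqueeze(1).expand(-1, t, -1)
        x = self.proj(torch.cat([content, prosody, timbre], dim=-1))
        h, _ = self.decoder(x)
        return self.to_mel(h)

# Training uses the SAME utterance as input and target, so the model learns to
# re-render the given content and prosody in the timbre named by `timbre_id`.
vc = NonParallelVC()
content, prosody = torch.randn(2, 120, 256), torch.randn(2, 120, 2)
timbre_id, target_mel = torch.randint(0, 32, (2,)), torch.randn(2, 120, 80)
loss = nn.functional.l1_loss(vc(content, prosody, timbre_id), target_mel)
```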
  • the embodiment of the present disclosure is provided with a parallel speech conversion model training stage, which can perform model training on the parallel speech conversion model.
  • Figure 5 is a schematic diagram of the application of the non-parallel voice conversion model. After determining the user voice of the target user who needs to perform voice cloning, the non-parallel voice conversion model trained in the pre-training stage can be used to convert the user voice.
  • the text content and prosodic information of the specified-timbre voice remain unchanged, that is, they are the same as the text content and prosodic information of the original user voice, so that parallel speech data can be constructed.
  • Figure 6 is a schematic diagram of training of the parallel speech conversion model. After the specified-timbre voice is obtained, a voice pair can be formed from the specified-timbre voice and the user's voice, and the pair is input into the parallel speech conversion model for model training.
  • the parallel speech conversion model can use a simple neural network model, such as a single-layer recurrent neural network, or another model structure that can meet the above conditions.
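  • Consistent with the remark that one recurrent layer can suffice, the per-user model below is a single GRU over mel frames. Because the constructed pair is time-aligned by design (same content and prosody, different timbre), plain frame-wise regression applies; shapes and the training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParallelVC(nn.Module):
    """Lightweight per-user model: specified-timbre mels -> user-timbre mels."""
    def __init__(self, n_mels=80, d=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d, batch_first=True)  # a single recurrent layer
        self.out = nn.Linear(d, n_mels)

    def forward(self, mel):
        h, _ = self.rnn(mel)
        return self.out(h)

model = ParallelVC()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
specified_mel = torch.randn(1, 200, 80)  # non-parallel VC output for the utterance
user_mel = torch.randn(1, 200, 80)       # mels of the user's own recording
for _ in range(100):                     # small model, few steps per user
    loss = nn.functional.l1_loss(model(specified_mel), user_mel)
    opt.zero_grad(); loss.backward(); opt.step()
```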
  • the embodiment of the present disclosure is provided with a model application stage of a speech synthesis model and a parallel speech conversion model.
  • the specific model applications of the speech synthesis model and the parallel speech conversion model are as follows.
  • Figure 7 is a schematic diagram of the application of the speech synthesis model.
  • the speech synthesis model can take any text selected by the user as the target text and convert the text into intermediate speech with the specified timbre.
  • Figure 8 is a schematic diagram of the application of the parallel speech conversion model.
  • the parallel speech conversion model can convert the intermediate speech of a specified timbre into the timbre corresponding to the user's voice, thereby obtaining the target synthesized speech.
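  • At inference time the two models are simply chained; mel-to-waveform synthesis is outside what this description specifies, so the vocoder below is a placeholder. A hedged sketch reusing the toy classes from the earlier sketches:

```python
import torch

def speak_as_user(tts, parallel_vc, phone_ids, timbre_id, vocoder=None):
    """Target text -> specified-timbre mels -> user-timbre mels (-> waveform)."""
    with torch.no_grad():
        intermediate_mel = tts(phone_ids, timbre_id)  # shared synthesis model
        user_mel = parallel_vc(intermediate_mel)      # per-user conversion model
    # `vocoder` is hypothetical; any neural vocoder (e.g. HiFi-GAN) could be used.
    return vocoder(user_mel) if vocoder is not None else user_mel
```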
  • embodiments of the present disclosure provide a speech processing method.
  • the target text is synthesized into intermediate speech with a specified timbre through a speech synthesis model shared by multiple users; after the user voice of the target user is acquired, the specified timbre of the intermediate speech is directly converted into the timbre of the user's voice through the parallel voice conversion model to obtain the target synthesized speech. Voice cloning can therefore be performed quickly and with a simple user operation, which improves the efficiency of voice cloning.
  • Figure 9 is a schematic structural diagram of a speech processing device provided by an embodiment of the present disclosure.
  • the device includes:
  • the first processing unit 201 is configured to perform voice conversion processing based on the target user's user voice and designated timbre information to obtain a designated converted voice with the designated timbre, wherein the designated timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the designated converted voice is the user's voice with the designated timbre;
  • the training unit 202 is configured to train a speech conversion model according to the user's voice and the specified converted voice to obtain a target speech conversion model;
  • the generation unit 203 is configured to input the target text of the speech to be synthesized and the specified timbre information into the speech synthesis model, and generate an intermediate speech with the specified timbre;
  • the second processing unit 204 is configured to perform speech conversion processing on the intermediate speech through the target speech conversion model, and generate a target synthesized speech that matches the timbre of the target user.
  • the device further includes:
  • the first acquisition subunit is used to acquire language content features and prosodic features from the user voice of the target user;
  • the first processing subunit is configured to perform voice conversion processing based on the language content features, the prosodic features and designated timbre information to obtain the designated converted voice of the designated timbre.
  • the device further includes:
  • the second acquisition subunit is used to acquire a sample voice, the text of the sample voice, and sample timbre information;
  • a first adjustment unit configured to adjust the model parameters of a preset speech model based on the sample voice, the text of the sample voice, and the sample timbre information to obtain an adjusted preset speech model;
  • the second processing subunit is used to continue to obtain the next sample voice, the text of the next sample voice, and the sample timbre information in the training sample voice set, and to execute the step of adjusting the model parameters of the preset speech model based on the sample voice, the text of the sample voice, and the sample timbre information, until the training of the adjusted speech model meets the model training end condition, so as to obtain a trained preset speech model as the speech synthesis model.
  • the device further includes:
  • the second adjustment unit is configured to adjust the model parameters of a parallel speech conversion model based on the user's voice and the specified converted voice until the model training end condition of the parallel speech conversion model is met, so as to obtain a trained parallel speech conversion model as the target speech conversion model.
  • the device further includes:
  • the third acquisition subunit is used to acquire a training voice pair and the preset timbre information corresponding to the training voice, wherein the training voice pair includes an original voice and an output voice, the original voice and the output voice are the same voice, and all voices in the training voice pair are voices in the training sample voice set;
  • the third adjustment unit is used to adjust the model parameters of a non-parallel speech conversion model based on the original speech, the output speech and the preset timbre information until the model training end condition of the non-parallel speech conversion model is met, so as to obtain a trained non-parallel speech conversion model as the target non-parallel speech conversion model.
  • the device further includes:
  • the third processing subunit is configured to perform language content extraction processing on the original speech through the language feature processor of the non-parallel speech conversion model to obtain the language content features of the original speech;
  • the fourth processing subunit is used to perform prosody extraction processing on the original speech through the prosodic feature processor of the non-parallel speech conversion model to obtain the prosodic features of the original speech;
  • a fourth adjustment unit configured to adjust model parameters of the non-parallel speech conversion model based on the language content characteristics of the original speech, the prosodic characteristics of the original speech, the preset timbre information and the output speech.
  • the device further includes:
  • the first generation subunit is used to perform language information screening processing on the original speech, determine the language information corresponding to the original speech, generate a first specified-length vector based on the language information, and use the first specified-length vector as the language content feature.
  • the device further includes:
  • the second generation subunit is used to perform prosodic information screening processing on the original speech, determine the prosodic information corresponding to the original speech, generate a second specified-length vector based on the prosodic information, and use the second specified-length vector as the prosodic feature.
  • the device further includes:
  • the fifth processing subunit is used to perform language content extraction processing on the user's voice through the language feature processor of the target non-parallel voice conversion model to obtain the language content features of the user's voice;
  • the sixth processing subunit is configured to perform prosody extraction processing on the user's voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain the prosodic features of the user's voice.
  • the device further includes:
  • the input subunit is used to input the language content characteristics of the user's voice, the prosodic characteristics of the user's voice, and the designated timbre information into the target non-parallel speech conversion model, and generate the designated converted speech of the designated timbre.
  • Embodiments of the present disclosure provide a voice processing device.
  • the first processing unit 201 performs voice conversion processing based on the target user's user voice and specified timbre information to obtain a specified converted voice with the specified timbre, wherein the specified timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the specified converted voice is the user's voice with the specified timbre;
  • the training unit 202 trains the voice conversion model according to the user voice and the specified converted voice to obtain the target voice conversion model;
  • the generation unit 203 inputs the target text of the speech to be synthesized and the specified timbre information into the speech synthesis model to generate intermediate speech with the specified timbre;
  • the second processing unit 204 performs speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
  • Embodiments of the present disclosure construct a speech synthesis model, a non-parallel speech conversion model and a parallel speech conversion model, and synthesize the target text into intermediate speech with a specified timbre through the speech synthesis model. After the user voice of the target user is obtained, the parallel speech conversion model directly converts the specified timbre of the intermediate speech into the timbre of the user's voice to obtain the target synthesized speech, which enables quick voice cloning, keeps the user's cloning operation simple, and effectively improves the operating efficiency of voice cloning. Moreover, the disclosed embodiments can generate a corresponding parallel conversion model for each user's voice while multiple users share one non-parallel speech conversion model, which simplifies the structure of the speech conversion model and keeps it lightweight, thereby reducing the storage the speech conversion model consumes on computer equipment.
  • Embodiments of the present disclosure also provide a computer device.
  • the computer device may be a terminal or a server.
  • the terminal may be a smartphone, a tablet computer, a notebook computer, a touch screen, a game console, a personal computer (PC), a personal digital assistant (PDA), or another terminal device.
  • Figure 10 is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
  • the computer device 300 includes a processor 301 with one or more processing cores, a memory 302 with one or more computer-readable storage media, and a computer program stored on the memory 302 and executable on the processor.
  • the processor 301 is electrically connected to the memory 302.
  • the structure of the computer equipment shown in the figures does not constitute a limitation on the computer equipment, and may include more or fewer components than shown in the figures, or combine certain components, or arrange different components.
  • the processor 301 is the control center of the computer device 300; it uses various interfaces and lines to connect all parts of the computer device 300, and, by running or loading software programs and/or modules stored in the memory 302 and calling data stored in the memory 302, performs the various functions of the computer device 300 and processes data, thereby monitoring the computer device 300 as a whole.
  • the processor 301 in the computer device 300 loads instructions corresponding to the processes of one or more application programs into the memory 302 according to the following steps, and runs the application programs stored in the memory 302 to implement various functions:
  • Speech conversion processing is performed based on the target user's user voice and designated timbre information to obtain a designated converted voice with the designated timbre, wherein the designated timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the designated converted voice is the user's voice with the designated timbre;
  • a speech conversion model is trained according to the user voice and the designated converted voice to obtain a target speech conversion model; the target text of the speech to be synthesized and the designated timbre information are input into the speech synthesis model to generate intermediate speech with the designated timbre; and the target speech conversion model performs speech conversion processing on the intermediate speech to generate a target synthesized speech that matches the timbre of the target user.
  • before performing voice conversion processing based on the target user's user voice and specified timbre information, the method further includes: obtaining language content features and prosodic features from the user voice of the target user.
  • the voice conversion processing based on the target user's user voice and specified timbre information includes:
  • Speech conversion processing is performed based on the language content features, the prosodic features and designated timbre information to obtain a designated converted voice of the designated timbre.
  • before inputting the target text of the speech to be synthesized and the specified timbre information into the speech synthesis model to generate the intermediate speech with the specified timbre, the method further includes: acquiring a sample voice, the text of the sample voice and sample timbre information, and adjusting the model parameters of a preset speech model based on the sample voice, the text of the sample voice and the sample timbre information until the model training end condition is met;
  • the trained preset speech model is obtained as a speech synthesis model.
  • training a speech conversion model based on the user's voice and the specified converted voice to obtain a target speech conversion model includes:
  • the model parameters of the parallel speech conversion model are adjusted based on the user's voice and the specified converted voice until the model training end conditions of the parallel voice conversion model are met, and a trained parallel voice conversion model is obtained as a target voice conversion model.
  • before performing speech conversion processing based on the language content features, the prosodic features and the specified timbre information to obtain the specified converted voice with the specified timbre, the method further includes: acquiring a training voice pair and the preset timbre information corresponding to the training voice;
  • the training voice pair includes an original voice and an output voice, and the original voice and the output voice are the same voice;
  • the model parameters of the non-parallel speech conversion model are adjusted based on the original speech, the output speech and the preset timbre information until the model training end condition of the non-parallel speech conversion model is met, and the trained non-parallel speech conversion model is obtained as a target non-parallel speech conversion model.
  • adjusting the model parameters of the non-parallel speech conversion model based on the original speech, the preset timbre information and the output speech includes:
  • the prosodic feature processor of the non-parallel speech conversion model performs prosody extraction processing on the original speech to obtain the prosodic features of the original speech;
  • the model parameters of the non-parallel speech conversion model are adjusted based on the language content characteristics of the original speech, the prosodic characteristics of the original speech, the preset timbre information and the output speech.
  • performing language content extraction processing on the original speech through the language feature processor of the non-parallel speech conversion model to obtain the language content features of the original speech includes: performing language information screening processing on the original speech to determine the language information corresponding to the original speech; a first specified-length vector is then generated based on the language information, and the first specified-length vector is used as the language content feature.
  • performing prosody extraction processing on the original speech through the prosodic feature processor of the non-parallel speech conversion model to obtain the prosodic features of the original speech includes: performing prosodic information screening processing on the original speech, determining the prosodic information corresponding to the original speech, generating a second specified-length vector based on the prosodic information, and using the second specified-length vector as the prosodic feature.
  • obtaining language content features and prosodic features from the target user's user voice includes: performing language content extraction processing on the user's voice through the language feature processor of the target non-parallel speech conversion model to obtain the language content features of the user's voice; and performing prosody extraction processing on the user's voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain the prosodic features of the user's voice.
  • the voice conversion processing based on the language content features, the prosodic features and designated timbre information to obtain the designated converted voice of the designated timbre includes:
  • the language content characteristics of the user's voice, the prosodic characteristics of the user's voice, and the specified timbre information are input into the target non-parallel speech conversion model to generate a specified converted speech of the specified timbre.
  • the computer device 300 also includes: a touch display screen 303, a radio frequency circuit 304, an audio circuit 305, an input unit 306 and a power supply 307.
  • the processor 301 is electrically connected to the touch display screen 303, the radio frequency circuit 304, the audio circuit 305, the input unit 306 and the power supply 307 respectively.
  • the structure of the computer equipment shown in Figure 10 does not constitute a limitation on the computer equipment, and may include more or fewer components than shown, or combine certain components, or arrange different components.
  • the touch display screen 303 can be used to display a graphical user interface and receive operation instructions generated by the user acting on the graphical user interface.
  • the touch display screen 303 may include a display panel and a touch panel.
  • the display panel can be used to display information input by the user or information provided to the user as well as various graphical user interfaces of the computer device. These graphical user interfaces can be composed of graphics, text, icons, videos, and any combination thereof.
  • the display panel can be configured in the form of a liquid crystal display (LCD, Liquid Crystal Display), organic light-emitting diode (OLED, Organic Light-Emitting Diode), etc.
  • the touch panel can be used to collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel with a finger, a stylus, or any suitable object or accessory), generate corresponding operation instructions, and execute the corresponding program according to the operation instructions.
  • the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends the coordinates to the processor 301, and can receive and execute commands sent by the processor 301.
  • the touch panel can cover the display panel.
  • when the touch panel detects a touch operation on or near it, the operation is passed to the processor 301 to determine the type of the touch event, and the processor 301 then provides corresponding visual output on the display panel according to the type of the touch event.
  • the touch panel and the display panel can be integrated into the touch display 303 to realize input and output functions.
  • the touch panel and the display panel can also be used as two independent components to implement the input and output functions respectively; that is, the touch display screen 303 can also serve as part of the input unit 306 to implement the input function.
  • the processor 301 executes an application program to generate a graphical interface on the touch display screen 303 .
  • the touch display screen 303 is used to present a graphical interface and receive operation instructions generated by the user acting on the graphical interface.
  • the radio frequency circuit 304 can be used to send and receive radio frequency signals to establish wireless communication with network equipment or other computer equipment through wireless communication, and to send and receive signals with network equipment or other computer equipment.
  • the audio circuit 305 may be used to provide an audio interface between the user and the computer device through speakers and microphones.
  • the audio circuit 305 can transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; conversely, the microphone converts collected sound signals into electrical signals, which the audio circuit 305 receives and converts into audio data; the audio data is output to the processor 301 for processing and then sent, for example, to another computer device via the radio frequency circuit 304, or output to the memory 302 for further processing.
  • Audio circuitry 305 may also include an earphone jack to provide communication of peripheral headphones to the computer device.
  • the input unit 306 can be used to receive input numbers, character information or user characteristic information (such as fingerprint, iris or facial information), and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • Power supply 307 is used to power various components of computer device 300 .
  • the power supply 307 can be logically connected to the processor 301 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system.
  • Power supply 307 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and other arbitrary components.
  • the computer device 300 may also include a camera, a sensor, a wireless fidelity module, a Bluetooth module, etc., which will not be described again here.
  • the computer device obtains a specified converted voice with a specified timbre by performing voice conversion processing based on the user voice of the target user and the specified timbre information, wherein the specified timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the specified converted voice is the user's voice with the specified timbre; trains the voice conversion model according to the user voice and the specified converted voice to obtain the target voice conversion model; inputs the target text of the speech to be synthesized and the specified timbre information into the speech synthesis model to generate intermediate speech with the specified timbre; and performs speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
  • Embodiments of the present disclosure construct a speech synthesis model, a non-parallel speech conversion model and a parallel speech conversion model, and synthesize the target text into intermediate speech with a specified timbre through the speech synthesis model. The parallel speech conversion model then directly converts the specified timbre of the intermediate speech into the timbre of the user's voice to obtain the target synthesized speech, which enables quick voice cloning, keeps the user's cloning operation simple, and effectively improves the operating efficiency of voice cloning. Moreover, the disclosed embodiments can generate a corresponding parallel conversion model for each user's voice while multiple users share one non-parallel speech conversion model, which simplifies the structure of the speech conversion model and keeps it lightweight, thereby reducing the storage the speech conversion model consumes on computer equipment.
  • embodiments of the present disclosure provide a computer-readable storage medium in which multiple computer programs are stored.
  • the computer programs can be loaded by the processor to execute any of the speech processing methods provided by the embodiments of the disclosure.
  • the computer program can perform the following steps:
  • Speech conversion processing is performed based on the target user's user voice and designated timbre information to obtain a designated converted voice with the designated timbre, wherein the designated timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the designated converted voice is the user's voice with the designated timbre;
  • a speech conversion model is trained according to the user voice and the designated converted voice to obtain a target speech conversion model; the target text of the speech to be synthesized and the designated timbre information are input into the speech synthesis model to generate intermediate speech with the designated timbre; and the target speech conversion model performs speech conversion processing on the intermediate speech to generate a target synthesized speech that matches the timbre of the target user.
  • before performing voice conversion processing based on the target user's user voice and specified timbre information, the method further includes: obtaining language content features and prosodic features from the user voice of the target user.
  • the voice conversion processing based on the target user's user voice and specified timbre information includes:
  • Speech conversion processing is performed based on the language content features, the prosodic features and designated timbre information to obtain a designated converted voice of the designated timbre.
  • before inputting the target text of the speech to be synthesized and the specified timbre information into the speech synthesis model to generate the intermediate speech with the specified timbre, the method further includes: acquiring a sample voice, the text of the sample voice and sample timbre information, and adjusting the model parameters of a preset speech model based on the sample voice, the text of the sample voice and the sample timbre information until the model training end condition is met;
  • the trained preset speech model is obtained as a speech synthesis model.
  • training a speech conversion model based on the user's voice and the specified converted voice to obtain a target speech conversion model includes:
  • the model parameters of the parallel speech conversion model are adjusted based on the user's voice and the specified converted voice until the model training end conditions of the parallel voice conversion model are met, and a trained parallel voice conversion model is obtained as a target voice conversion model.
  • before performing speech conversion processing based on the language content features, the prosodic features and the specified timbre information to obtain the specified converted voice with the specified timbre, the method further includes: acquiring a training voice pair and the preset timbre information corresponding to the training voice;
  • the training voice pair includes an original voice and an output voice, and the original voice and the output voice are the same voice;
  • the model parameters of the non-parallel speech conversion model are adjusted based on the original speech, the output speech and the preset timbre information until the model training end condition of the non-parallel speech conversion model is met, and the trained non-parallel speech conversion model is obtained as a target non-parallel speech conversion model.
  • adjusting the model parameters of the non-parallel speech conversion model based on the original speech, the preset timbre information and the output speech includes:
  • the prosodic feature processor of the non-parallel speech conversion model performs prosody extraction processing on the original speech to obtain the prosodic features of the original speech;
  • the model parameters of the non-parallel speech conversion model are adjusted based on the language content characteristics of the original speech, the prosodic characteristics of the original speech, the preset timbre information and the output speech.
  • performing language content extraction processing on the original speech through the language feature processor of the non-parallel speech conversion model to obtain the language content features of the original speech includes: performing language information screening processing on the original speech to determine the language information corresponding to the original speech; a first specified-length vector is then generated based on the language information, and the first specified-length vector is used as the language content feature.
  • performing prosody extraction processing on the original speech through the prosodic feature processor of the non-parallel speech conversion model to obtain the prosodic features of the original speech includes: performing prosodic information screening processing on the original speech, determining the prosodic information corresponding to the original speech, generating a second specified-length vector based on the prosodic information, and using the second specified-length vector as the prosodic feature.
  • obtaining language content features and prosodic features from the target user's user voice includes: performing language content extraction processing on the user's voice through the language feature processor of the target non-parallel speech conversion model to obtain the language content features of the user's voice; and performing prosody extraction processing on the user's voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain the prosodic features of the user's voice.
  • the voice conversion processing based on the language content features, the prosodic features and designated timbre information to obtain the designated converted voice of the designated timbre includes:
  • the language content characteristics of the user's voice, the prosodic characteristics of the user's voice, and the specified timbre information are input into the target non-parallel speech conversion model to generate a specified converted speech of the specified timbre.
  • the storage medium may include: read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.
  • the embodiments of the present disclosure perform voice conversion processing based on the user voice of the target user and designated timbre information to obtain a designated converted voice with a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the designated converted voice is the user's voice with the designated timbre; train a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model; input the target text of the speech to be synthesized and the designated timbre information into the speech synthesis model to generate intermediate speech with the designated timbre; and perform speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
  • Embodiments of the present disclosure construct a speech synthesis model, a non-parallel speech conversion model and a parallel speech conversion model, and synthesize the target text into intermediate speech with a specified timbre through the speech synthesis model. The parallel speech conversion model then directly converts the specified timbre of the intermediate speech into the timbre of the user's voice to obtain the target synthesized speech, which enables quick voice cloning, keeps the user's cloning operation simple, and effectively improves the operating efficiency of voice cloning. Moreover, the disclosed embodiments can generate a corresponding parallel conversion model for each user's voice while multiple users share one non-parallel speech conversion model, which simplifies the structure of the speech conversion model and keeps it lightweight, thereby reducing the storage the speech conversion model consumes on computer equipment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

A speech processing method, apparatus, device, and computer-readable storage medium. The method includes: pre-constructing a speech synthesis model and a parallel speech conversion model; synthesizing target text into intermediate speech with a specified timbre through the speech synthesis model; and directly converting the specified timbre of the intermediate speech into the timbre of a user's voice through the parallel speech conversion model to obtain target synthesized speech. The method keeps the voice cloning operation simple and improves the operating efficiency of voice cloning.

Description

Speech processing method and apparatus, computer device, and computer-readable storage medium
The present disclosure claims priority to the Chinese patent application No. 202210455923.0, filed with the China Patent Office on April 27, 2022 and entitled "Speech processing method and apparatus, computer device, and computer-readable storage medium", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the field of information processing technology, and specifically to a speech processing method, apparatus, computer device, and computer-readable storage medium.
Background
With the continuous development of information technology and the widespread adoption of computer devices such as smartphones, tablet computers and laptop computers, computer devices have developed in a diversified and personalized direction. Computer devices can now synthesize speech comparable to that of a real person, enriching the human-computer interaction experience. For example, common speech processing technologies currently include speech synthesis, voice conversion and voice cloning. Voice cloning refers to a technology in which a machine extracts timbre information from speech provided by a user and synthesizes speech using the user's timbre. Voice cloning is an extension of speech synthesis technology: traditional speech synthesis achieves text-to-speech conversion for a fixed speaker, while voice cloning further specifies the speaker's timbre. At present, voice cloning has many practical scenarios; for example, in applications such as voice navigation and audio novels, users can customize their own voice package by uploading their voice and use their own voice for navigation or reading novels, enhancing the fun of using the application.
In the prior art, when a user uses voice cloning technology for personalized customization, the user usually needs to provide a segment of his or her own speech together with the corresponding text before voice cloning can be achieved. However, in voice cloning usage scenarios, the recorded speech provided by the user may be inconsistent with the content to be read aloud, which means that cleaning and correction operations are required before training the voice model.
Technical Problem
Embodiments of the present disclosure provide a speech processing method, apparatus, computer device and computer-readable storage medium, which can solve the problems that recorded speech consistent with the read content is difficult to obtain from the user, that the requirements on the user during voice recording are high, and that the user experience is affected.
Technical Solution
In a first aspect, an embodiment of the present disclosure provides a speech processing method, including:
performing speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
training a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
inputting target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
performing speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
In a second aspect, an embodiment of the present disclosure further provides a speech processing apparatus, including:
a first processing unit, configured to perform speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
a training unit, configured to train a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
a generation unit, configured to input target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
a second processing unit, configured to perform speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
In some embodiments, the apparatus further includes:
a first acquisition subunit, configured to acquire language content features and prosodic features from the user voice of the target user;
a first processing subunit, configured to perform speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain a designated converted voice of the designated timbre.
In some embodiments, the apparatus further includes:
a second acquisition subunit, configured to acquire a sample speech, the text of the sample speech and sample timbre information;
a first adjustment unit, configured to adjust model parameters of a preset speech model based on the sample speech, the text of the sample speech and the sample timbre information to obtain an adjusted preset speech model;
a second processing subunit, configured to continue to acquire the next sample speech in a training sample speech set, the text of the next sample speech and the sample timbre information, and perform the step of adjusting the model parameters of the preset speech synthesis model based on the sample speech, the text of the sample speech and the sample timbre information, until the training status of the adjusted speech model satisfies a model training end condition, to obtain a trained preset speech model as the speech synthesis model.
In some embodiments, the apparatus further includes:
a second adjustment unit, configured to adjust model parameters of a parallel speech conversion model based on the user voice and the designated converted voice until a model training end condition of the parallel speech conversion model is satisfied, to obtain a trained parallel speech conversion model as the target speech conversion model.
In some embodiments, the apparatus further includes:
a third acquisition subunit, configured to acquire a training speech pair and preset timbre information corresponding to the training speech, wherein the training speech pair includes an original speech and an output speech, the original speech and the output speech are the same speech, and all speeches in the training speech pair are speeches in the training sample speech set;
a third adjustment unit, configured to adjust model parameters of a non-parallel speech conversion model based on the original speech, the output speech and the preset timbre information until a model training end condition of the non-parallel speech conversion model is satisfied, to obtain a trained non-parallel speech conversion model as a target non-parallel speech conversion model.
In some embodiments, the apparatus further includes:
a third processing subunit, configured to perform language content extraction processing on the original speech through a language feature processor of the non-parallel speech conversion model to obtain language content features of the original speech;
a fourth processing subunit, configured to perform prosody extraction processing on the original speech through a prosodic feature processor of the non-parallel speech conversion model to obtain prosodic features of the original speech;
a fourth adjustment unit, configured to adjust the model parameters of the non-parallel speech conversion model based on the language content features of the original speech, the prosodic features of the original speech, the preset timbre information and the output speech.
In some embodiments, the apparatus further includes:
a first generation subunit, configured to perform language information screening processing on the original speech, determine language information corresponding to the original speech, generate a first specified-length vector based on the language information, and use the first specified-length vector as the language content features.
In some embodiments, the apparatus further includes:
a second generation subunit, configured to perform prosody information screening processing on the original speech, determine prosody information corresponding to the original speech, generate a second specified-length vector based on the prosody information, and use the second specified-length vector as the prosodic features.
In some embodiments, the apparatus further includes:
a fifth processing subunit, configured to perform language content extraction processing on the user voice through the language feature processor of the target non-parallel speech conversion model to obtain language content features of the user voice;
a sixth processing subunit, configured to perform prosody extraction processing on the user voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain prosodic features of the user voice.
In some embodiments, the apparatus further includes:
an input subunit, configured to input the language content features of the user voice, the prosodic features of the user voice and the designated timbre information into the target non-parallel speech conversion model to generate a designated converted voice of the designated timbre.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of any one of the speech processing methods.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of any one of the speech processing methods.
Beneficial Effects
Embodiments of the present disclosure provide a speech processing method, apparatus, computer device and computer-readable storage medium. By constructing a speech synthesis model and a non-parallel speech conversion model, target text is synthesized into an intermediate speech of a designated timbre through the speech synthesis model; after the user voice of the target user is acquired, the designated timbre of the intermediate speech is directly converted into the timbre of the user voice through the parallel speech conversion model to obtain the target synthesized speech. Voice cloning can thus be performed quickly, the user's operations during voice cloning are simple, and the operating efficiency of voice cloning is effectively improved. Moreover, embodiments of the present disclosure can generate a corresponding parallel conversion model for each user voice, and multiple users can share one speech synthesis model and one non-parallel speech conversion model, which simplifies the structure of the speech conversion model and makes it lightweight, thereby reducing the storage consumption of the speech conversion model on the computer device.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present disclosure more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present disclosure; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a scenario of a speech processing system provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a speech processing method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of training a speech synthesis model provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of training a non-parallel speech conversion model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the application of a non-parallel speech conversion model provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of training a parallel speech conversion model provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the application of a speech synthesis model provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of the application of a parallel speech conversion model provided by an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a speech processing apparatus provided by an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
Embodiments of the present disclosure provide a speech processing method, apparatus, computer device and computer-readable storage medium. Specifically, the speech processing method of the embodiments of the present disclosure may be executed by a computer device, where the computer device may be a terminal. The terminal may be a terminal device such as a smartphone, a tablet computer, a laptop computer, a touch screen, a game console, a personal computer (PC) or a personal digital assistant (PDA), and the terminal may further include a client, which may be a video application client, a music application client, a game application client, a browser client carrying a game program, an instant messaging client, or the like.
Please refer to FIG. 1, which is a schematic diagram of a scenario of the speech processing system provided by an embodiment of the present disclosure, including a computer device. The system may include at least one terminal, at least one server, and a network. The terminal held by the user can connect to servers of different games through the network. A terminal is any device having computing hardware capable of supporting and executing a software product corresponding to a game. In addition, the terminal has one or more multi-touch-sensitive screens for sensing and obtaining input from touch or slide operations performed by the user at multiple points on one or more touch display screens. In addition, when the system includes multiple terminals, multiple servers and multiple networks, different terminals may be connected to each other through different networks and different servers. The network may be a wireless network or a wired network; for example, the wireless network may be a wireless local area network (WLAN), a local area network (LAN), a cellular network, a 2G network, a 3G network, a 4G network, a 5G network, or the like. In addition, different terminals may also use their own Bluetooth network or hotspot network to connect to other terminals or to a server.
The computer device can acquire language content features and prosodic features from the user voice of the target user; perform speech conversion processing based on the language content features, the prosodic features and designated timbre information to obtain a designated converted voice of a designated timbre; train a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model; input target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre; and perform speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
It should be noted that the schematic diagram of the speech processing system scenario shown in FIG. 1 is merely an example. The speech processing system and scenario described in the embodiments of the present disclosure are intended to explain the technical solutions of the embodiments of the present disclosure more clearly and do not constitute a limitation on the technical solutions provided by the embodiments of the present disclosure. Those of ordinary skill in the art will appreciate that, with the evolution of speech processing systems and the emergence of new business scenarios, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.
Embodiments of the present invention provide a speech processing method, apparatus, computer device and computer-readable storage medium. The speech processing method can be used with a terminal such as a smartphone, a tablet computer, a laptop computer or a personal computer. The speech processing method, apparatus, terminal and storage medium are described in detail below. It should be noted that the order of description of the following embodiments is not intended to limit the preferred order of the embodiments.
Please refer to FIG. 2, which is a schematic flowchart of the speech processing method provided by an embodiment of the present disclosure. The specific flow may include the following steps 101 to 104:
101. Perform speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre.
Before the step of "performing speech conversion processing based on the user voice of the target user and the designated timbre information", the method includes:
acquiring language content features and prosodic features from the user voice of the target user;
the performing speech conversion processing based on the user voice of the target user and the designated timbre information includes:
performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain a designated converted voice of the designated timbre.
In an embodiment, before the step of "performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain a designated converted voice of the designated timbre", the method may include:
acquiring a training speech pair and preset timbre information corresponding to the training speech, wherein the training speech pair includes an original speech and an output speech, the original speech and the output speech are the same speech, and all speeches in the training speech pair are speeches in the training sample speech set;
adjusting model parameters of a non-parallel speech conversion model based on the original speech, the output speech and the preset timbre information until a model training end condition of the non-parallel speech conversion model is satisfied, to obtain a trained non-parallel speech conversion model as a target non-parallel speech conversion model.
Optionally, in the step of "adjusting the model parameters of the non-parallel speech conversion model based on the original speech, the preset timbre information and the output speech", the method may include:
performing language content extraction processing on the original speech through a language feature processor of the non-parallel speech conversion model to obtain language content features of the original speech;
performing prosody extraction processing on the original speech through a prosodic feature processor of the non-parallel speech conversion model to obtain prosodic features of the original speech;
adjusting the model parameters of the non-parallel speech conversion model based on the language content features of the original speech, the prosodic features of the original speech, the preset timbre information and the output speech.
Specifically, in the step of "performing language content extraction processing on the original speech through the language feature processor of the non-parallel speech conversion model to obtain the language content features of the original speech", the method may include:
performing language information screening processing on the original speech, determining language information corresponding to the original speech, generating a first specified-length vector based on the language information, and using the first specified-length vector as the language content features.
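By way of a non-limiting illustration of such a language feature processor (one implementation discussed later is selecting a hidden-layer output of a speech recognition model), the following Python sketch pools frame-level embeddings from a pretrained wav2vec 2.0 model into a fixed-length content vector. The function name, the choice of layer and the mean-pooling are editorial assumptions, not part of the disclosure:

    import torch
    import torchaudio

    # A minimal sketch, assuming a pretrained wav2vec 2.0 ASR bundle is an
    # acceptable stand-in for the "language feature processor" described above.
    bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
    model = bundle.get_model().eval()

    def language_content_vector(wav_path: str) -> torch.Tensor:
        waveform, sr = torchaudio.load(wav_path)
        waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
        if sr != bundle.sample_rate:
            waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
        with torch.inference_mode():
            # extract_features returns hidden states from each transformer layer;
            # a middle layer tends to keep content while shedding timbre.
            features, _ = model.extract_features(waveform)
        hidden = features[len(features) // 2]          # (batch, frames, dim)
        return hidden.mean(dim=1).squeeze(0)           # fixed-length vector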
In another specific embodiment, in the step of "performing prosody extraction processing on the original speech through the prosodic feature processor of the non-parallel speech conversion model to obtain the prosodic features of the original speech", the method may include:
performing prosody information screening processing on the original speech, determining prosody information corresponding to the original speech, generating a second specified-length vector based on the prosody information, and using the second specified-length vector as the prosodic features.
In an embodiment of the present disclosure, in the step of "acquiring language content features and prosodic features from the user voice of the target user", the method may include:
performing language content extraction processing on the user voice through the language feature processor of the target non-parallel speech conversion model to obtain language content features of the user voice;
performing prosody extraction processing on the user voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain prosodic features of the user voice.
In order to obtain the designated converted voice of the designated timbre, in the step of "performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain the designated converted voice of the designated timbre", the method may include:
inputting the language content features of the user voice, the prosodic features of the user voice and the designated timbre information into the target non-parallel speech conversion model to generate a designated converted voice of the designated timbre.
102. Train a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model.
Specifically, in the step of "training the speech conversion model according to the user voice and the designated converted voice to obtain the target speech conversion model", the method may include:
adjusting model parameters of a parallel speech conversion model based on the user voice and the designated converted voice until a model training end condition of the parallel speech conversion model is satisfied, to obtain a trained parallel speech conversion model as the target speech conversion model.
103. Input target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre.
In order to obtain the speech synthesis model, before the step of "inputting the target text of the speech to be synthesized and the designated timbre information into the speech synthesis model to generate the intermediate speech of the designated timbre", the method may include:
acquiring a sample speech, the text of the sample speech and sample timbre information;
adjusting model parameters of a preset speech model based on the sample speech, the text of the sample speech and the sample timbre information to obtain an adjusted preset speech model;
continuing to acquire the next sample speech in the training sample speech set, the text of the next sample speech and the sample timbre information, and performing the step of adjusting the model parameters of the preset speech synthesis model based on the sample speech, the text of the sample speech and the sample timbre information, until the training status of the adjusted speech model satisfies a model training end condition, to obtain a trained preset speech model as the speech synthesis model.
104. Perform speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
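By way of a non-limiting illustration only, the following Python sketch strings steps 101 to 104 together at the level of hypothetical callables; every name in it (non_parallel_model, train_parallel_model, synthesis_model and so on) is an assumed placeholder rather than an interface defined by this disclosure:

    # A minimal end-to-end sketch of steps 101-104, under the assumption that
    # the three models described in this disclosure are available as callables.
    def clone_voice(user_wav, target_text, timbre_id,
                    non_parallel_model, synthesis_model):
        # Step 101: convert the user's voice to the designated timbre,
        # producing parallel data (same content and prosody, different timbre).
        designated_wav = non_parallel_model.convert(user_wav, timbre_id)

        # Step 102: train a lightweight per-user parallel conversion model on
        # the pair (designated timbre -> user timbre).
        parallel_model = train_parallel_model(source=designated_wav,
                                              target=user_wav)

        # Step 103: synthesize the target text as intermediate speech in the
        # designated timbre shared by all users.
        intermediate_wav = synthesis_model.synthesize(target_text, timbre_id)

        # Step 104: map the intermediate speech onto the user's timbre.
        return parallel_model.convert(intermediate_wav)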
In order to further explain the speech processing method provided by the embodiments of the present disclosure, the application of the speech processing method in a specific implementation scenario is described below as an example. The specific application scenario is as follows:
(1) An embodiment of the present disclosure is provided with a pre-training stage, in which the speech synthesis model and the non-parallel speech conversion model can be trained.
Please refer to FIG. 3, which is a schematic diagram of training the speech synthesis model. When training the speech synthesis model, the multi-speaker speech already in a database, the text data corresponding to the speech, and preset timbres can be used to train the speech synthesis model; after the trained speech synthesis model is obtained, the model is saved for use in the model application stage. Specifically, in the pre-training stage of the speech synthesis model, a large amount of text, speech and timbre label data is input into a neural network model for training, generally based on an end-to-end deep neural network model. Many specific model structures are available, including but not limited to the popular Tacotron, FastSpeech, and the like.
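A non-limiting PyTorch sketch of one such pre-training step for a timbre-conditioned acoustic model follows; the module structure, tensor shapes and the assumption that text and mel frames are pre-aligned are illustrative simplifications, not the structure fixed by this disclosure:

    import torch
    import torch.nn as nn

    class TimbreConditionedTTS(nn.Module):
        """Toy stand-in for an end-to-end acoustic model (Tacotron/FastSpeech-like)."""
        def __init__(self, vocab_size=256, n_timbres=100, dim=256, n_mels=80):
            super().__init__()
            self.text_emb = nn.Embedding(vocab_size, dim)
            self.timbre_emb = nn.Embedding(n_timbres, dim)   # preset timbre label
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.mel_head = nn.Linear(dim, n_mels)

        def forward(self, text_ids, timbre_id):
            x = self.text_emb(text_ids) + self.timbre_emb(timbre_id).unsqueeze(1)
            h, _ = self.encoder(x)
            return self.mel_head(h)                           # (batch, T, n_mels)

    model = TimbreConditionedTTS()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def training_step(text_ids, timbre_id, target_mel):
        # Assumes text tokens and mel frames are pre-aligned for simplicity;
        # real Tacotron/FastSpeech models add attention or duration alignment.
        pred = model(text_ids, timbre_id)
        loss = nn.functional.l1_loss(pred, target_mel)        # common for mel targets
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()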
Please refer to FIG. 4, which is a schematic diagram of training the non-parallel speech conversion model. When training the non-parallel speech conversion model, a language feature extraction module that has already been trained can be used to extract a language-related feature representation of the original speech, a prosodic feature module can be used to extract a prosody-related feature representation of the original speech, and the language-related feature representation and the prosody-related feature representation, together with the timbre label and the output speech, are input into the non-parallel speech conversion model for training.
The purpose of the language feature extraction module is to obtain a timbre-independent language feature representation from the input speech. The language feature extraction module can remove information in the speech that is unrelated to the language content, extract only the language information and convert it into a fixed-length vector representation; the extracted language information should accurately reflect the spoken content of the original speech, without errors or omissions. It should be noted that this language feature extraction module is implemented with a neural network model. There are several possible implementations: one is to train a speech recognition model with a large amount of speech and text and select the output of a specific hidden layer of the model as the language feature representation; another is an unsupervised training approach, such as using a VQ-VAE model to compress and quantize the speech into representations of a number of speech units and then restore these speech units to the original speech. In this self-reconstruction training process, the quantization units gradually learn to become timbre-independent speech units, and these speech units serve as the language feature representation. Other implementations may also be adopted, not limited to the above two. In an embodiment, the purpose of the prosodic feature extraction module is to obtain a prosodic feature representation from the input speech and convert it into a vector representation. The prosodic feature extraction module is intended to ensure that the converted speech remains consistent with the original speech in prosodic style, so that the data before and after conversion are completely parallel except for the timbre, facilitating the modeling of the parallel conversion model. Technically there are several possible implementations, mainly extraction through signal processing tools and algorithms, such as using common speech features like fundamental frequency and energy; features related to speech emotion classification can also be used.
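For the fundamental-frequency and energy features just mentioned, a minimal Python sketch using the librosa signal-processing library might look as follows; pooling the frame-level contours into fixed-length statistics is one assumed way of producing the specified-length prosodic vector described above, not the one mandated by the disclosure:

    import numpy as np
    import librosa

    def prosody_vector(wav_path: str) -> np.ndarray:
        # A minimal sketch, assuming F0 and energy statistics are an
        # acceptable prosodic representation.
        y, sr = librosa.load(wav_path, sr=16000)
        f0, voiced_flag, _ = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
        )
        f0 = f0[voiced_flag]                       # keep voiced frames only
        energy = librosa.feature.rms(y=y)[0]       # frame-level energy contour
        return np.array([
            np.nanmean(f0), np.nanstd(f0),         # pitch level and variation
            energy.mean(), energy.std(),           # loudness level and variation
        ], dtype=np.float32)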
In the embodiments of the present disclosure, the purpose of the non-parallel speech conversion model is to generate, from the language feature representation and prosodic feature representation extracted from the user voice together with a designated timbre label, a converted speech with the corresponding timbre and semantic content, so as to construct parallel speech data for training the parallel conversion model. The non-parallel speech conversion model requires that the timbre of the converted speech be similar to the timbre of the target user's voice, while the semantic content, prosody and so on remain exactly consistent with the original speech. In the pre-training stage, the non-parallel speech conversion model takes the language feature representation extracted by the language feature extraction model, the prosodic feature representation obtained by the prosodic feature module, the timbre label and the corresponding output speech as input to a neural network model for training. A deep neural network model is generally adopted, and the specific model structure can be built in many ways, such as convolutional networks, recurrent neural networks, Transformers, or any combination thereof.
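A non-limiting PyTorch sketch of one way such a conversion model could combine the three inputs follows; the layer sizes and the GRU decoder are illustrative choices, since the disclosure deliberately leaves the concrete structure open:

    import torch
    import torch.nn as nn

    class NonParallelConverter(nn.Module):
        """Illustrative content + prosody + timbre -> mel decoder (a sketch)."""
        def __init__(self, content_dim=256, prosody_dim=4, n_timbres=100,
                     dim=256, n_mels=80):
            super().__init__()
            self.timbre_emb = nn.Embedding(n_timbres, dim)
            self.proj = nn.Linear(content_dim + prosody_dim + dim, dim)
            self.decoder = nn.GRU(dim, dim, num_layers=2, batch_first=True)
            self.mel_head = nn.Linear(dim, n_mels)

        def forward(self, content, prosody, timbre_id):
            # content: (B, T, content_dim) frame-level language features
            # prosody: (B, prosody_dim) utterance-level prosodic vector
            T = content.size(1)
            timbre = self.timbre_emb(timbre_id).unsqueeze(1).expand(-1, T, -1)
            pros = prosody.unsqueeze(1).expand(-1, T, -1)
            x = self.proj(torch.cat([content, pros, timbre], dim=-1))
            h, _ = self.decoder(x)
            return self.mel_head(h)    # mel frames of the converted speech

During pre-training, the predicted mel frames would be compared against the mel frames of the output speech with a reconstruction loss (for example L1), which is the self-reconstruction setup the previous paragraph describes.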
(2) An embodiment of the present disclosure is provided with a parallel speech conversion model training stage, in which the parallel speech conversion model can be trained.
Please refer to FIG. 5, which is a schematic diagram of the application of the non-parallel speech conversion model. After the user voice of the target user requiring voice cloning is determined, the non-parallel speech conversion model trained in the pre-training stage can be used to convert the user voice into speech of the designated timbre, while the text content and prosody information of the user voice remain unchanged; that is, after conversion, the text content and prosody information of the designated-timbre speech are the same as those of the user voice, thereby constructing parallel speech data.
Please refer to FIG. 6, which is a schematic diagram of training the parallel speech conversion model. After the designated-timbre speech is obtained, a speech pair can be formed from the designated-timbre speech and the user voice, and the designated-timbre speech and the user voice are input into the parallel speech conversion model to train it. The parallel speech conversion model can use a simple neural network model, such as a single-layer recurrent neural network, or another model structure that satisfies the above conditions.
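Because the two utterances in each pair share content and prosody by construction, the per-user model can remain very small. A sketch of the single-layer recurrent network mentioned above, with assumed feature dimensions:

    import torch
    import torch.nn as nn

    class ParallelConverter(nn.Module):
        """Sketch of the lightweight per-user model: mel frames in the
        designated timbre -> mel frames in the user's timbre."""
        def __init__(self, n_mels=80, dim=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, dim, num_layers=1, batch_first=True)
            self.out = nn.Linear(dim, n_mels)

        def forward(self, mel_in):
            h, _ = self.rnn(mel_in)
            return self.out(h)

    model = ParallelConverter()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def train_pair(mel_designated, mel_user):
        # Assumes matching frame counts, which the non-parallel conversion is
        # designed to preserve; the pair shares content and prosody, so a
        # plain frame-wise reconstruction loss suffices.
        loss = nn.functional.l1_loss(model(mel_designated), mel_user)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()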
(3) An embodiment of the present disclosure is provided with a model application stage for the speech synthesis model and the parallel speech conversion model. The specific application of the speech synthesis model and the parallel speech conversion model is as follows.
Please refer to FIG. 7, which is a schematic diagram of the application of the speech synthesis model. When it is detected that voice cloning needs to be performed on target text based on the timbre of the user voice, the speech synthesis model can determine any text selected by the user as the target text and convert the target text into an intermediate speech of the designated timbre.
Please refer to FIG. 8, which is a schematic diagram of the application of the parallel speech conversion model. The parallel speech conversion model can convert the intermediate speech of the designated timbre into the timbre corresponding to the user voice, thereby obtaining the target synthesized speech.
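At inference time the two application stages chain directly. A short sketch, reusing the illustrative models from the sketches above and assuming a separately available vocoder for waveform reconstruction (none of these names come from the disclosure itself):

    import torch

    def synthesize_in_user_timbre(target_text_ids, timbre_id, tts,
                                  parallel_model, vocoder):
        with torch.inference_mode():
            intermediate_mel = tts(target_text_ids, timbre_id)   # FIG. 7
            user_mel = parallel_model(intermediate_mel)          # FIG. 8
        return vocoder(user_mel)                                 # mel -> waveform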
In summary, the embodiments of the present disclosure provide a speech processing method. By constructing a speech synthesis model, a non-parallel speech conversion model and a parallel speech conversion model, target text is synthesized into an intermediate speech of a designated timbre through the speech synthesis model shared by multiple users; after the user voice of the target user is acquired, the designated timbre of the intermediate speech is directly converted into the timbre of the user voice through the parallel speech conversion model to obtain the target synthesized speech. Voice cloning can thus be performed quickly, the user's operations during voice cloning are simple, and the operating efficiency of voice cloning is improved.
Please refer to FIG. 9, which is a schematic structural diagram of a speech processing apparatus provided by an embodiment of the present disclosure. The apparatus includes:
a first processing unit 201, configured to perform speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
a training unit 202, configured to train a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
a generation unit 203, configured to input target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
a second processing unit 204, configured to perform speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
In some embodiments, the apparatus further includes:
a first acquisition subunit, configured to acquire language content features and prosodic features from the user voice of the target user;
a first processing subunit, configured to perform speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain a designated converted voice of the designated timbre.
In some embodiments, the apparatus further includes:
a second acquisition subunit, configured to acquire a sample speech, the text of the sample speech and sample timbre information;
a first adjustment unit, configured to adjust model parameters of a preset speech model based on the sample speech, the text of the sample speech and the sample timbre information to obtain an adjusted preset speech model;
a second processing subunit, configured to continue to acquire the next sample speech in the training sample speech set, the text of the next sample speech and the sample timbre information, and perform the step of adjusting the model parameters of the preset speech synthesis model based on the sample speech, the text of the sample speech and the sample timbre information, until the training status of the adjusted speech model satisfies a model training end condition, to obtain a trained preset speech model as the speech synthesis model.
In some embodiments, the apparatus further includes:
a second adjustment unit, configured to adjust model parameters of a parallel speech conversion model based on the user voice and the designated converted voice until a model training end condition of the parallel speech conversion model is satisfied, to obtain a trained parallel speech conversion model as the target speech conversion model.
In some embodiments, the apparatus further includes:
a third acquisition subunit, configured to acquire a training speech pair and preset timbre information corresponding to the training speech, wherein the training speech pair includes an original speech and an output speech, the original speech and the output speech are the same speech, and all speeches in the training speech pair are speeches in the training sample speech set;
a third adjustment unit, configured to adjust model parameters of a non-parallel speech conversion model based on the original speech, the output speech and the preset timbre information until a model training end condition of the non-parallel speech conversion model is satisfied, to obtain a trained non-parallel speech conversion model as a target non-parallel speech conversion model.
In some embodiments, the apparatus further includes:
a third processing subunit, configured to perform language content extraction processing on the original speech through a language feature processor of the non-parallel speech conversion model to obtain language content features of the original speech;
a fourth processing subunit, configured to perform prosody extraction processing on the original speech through a prosodic feature processor of the non-parallel speech conversion model to obtain prosodic features of the original speech;
a fourth adjustment unit, configured to adjust the model parameters of the non-parallel speech conversion model based on the language content features of the original speech, the prosodic features of the original speech, the preset timbre information and the output speech.
In some embodiments, the apparatus further includes:
a first generation subunit, configured to perform language information screening processing on the original speech, determine language information corresponding to the original speech, generate a first specified-length vector based on the language information, and use the first specified-length vector as the language content features.
In some embodiments, the apparatus further includes:
a second generation subunit, configured to perform prosody information screening processing on the original speech, determine prosody information corresponding to the original speech, generate a second specified-length vector based on the prosody information, and use the second specified-length vector as the prosodic features.
In some embodiments, the apparatus further includes:
a fifth processing subunit, configured to perform language content extraction processing on the user voice through the language feature processor of the target non-parallel speech conversion model to obtain language content features of the user voice;
a sixth processing subunit, configured to perform prosody extraction processing on the user voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain prosodic features of the user voice.
In some embodiments, the apparatus further includes:
an input subunit, configured to input the language content features of the user voice, the prosodic features of the user voice and the designated timbre information into the target non-parallel speech conversion model to generate a designated converted voice of the designated timbre.
An embodiment of the present disclosure provides a speech processing apparatus. The first processing unit 201 performs speech conversion processing based on the user voice of the target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre; the training unit 202 trains a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model; the generation unit 203 inputs target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre; and the second processing unit 204 performs speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user. By constructing a speech synthesis model, a non-parallel speech conversion model and a parallel speech conversion model, the embodiments of the present disclosure synthesize target text into an intermediate speech of a designated timbre through the speech synthesis model; after the user voice of the target user is acquired, the designated timbre of the intermediate speech is directly converted into the timbre of the user voice through the parallel speech conversion model to obtain the target synthesized speech. Voice cloning can thus be performed quickly, the user's operations during voice cloning are simple, and the operating efficiency of voice cloning is effectively improved. Moreover, the embodiments of the present disclosure can generate a corresponding parallel conversion model for each user voice, and multiple users can share one non-parallel speech conversion model, which simplifies the structure of the speech conversion model and makes it lightweight, thereby reducing the storage consumption of the speech conversion model on the computer device.
Correspondingly, an embodiment of the present disclosure further provides a computer device. The computer device may be a terminal or a server, and the terminal may be a terminal device such as a smartphone, a tablet computer, a laptop computer, a touch screen, a game console, a personal computer (PC) or a personal digital assistant (PDA). As shown in FIG. 10, FIG. 10 is a schematic structural diagram of the computer device provided by an embodiment of the present disclosure. The computer device 300 includes a processor 301 with one or more processing cores, a memory 302 with one or more computer-readable storage media, and a computer program stored on the memory 302 and executable on the processor. The processor 301 is electrically connected to the memory 302. Those skilled in the art will appreciate that the computer device structure shown in the figure does not constitute a limitation on the computer device, and may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
The processor 301 is the control center of the computer device 300. It uses various interfaces and lines to connect the various parts of the entire computer device 300, and executes the various functions of the computer device 300 and processes data by running or loading software programs and/or modules stored in the memory 302 and invoking data stored in the memory 302, thereby monitoring the computer device 300 as a whole.
In the embodiments of the present disclosure, the processor 301 in the computer device 300 loads instructions corresponding to the processes of one or more application programs into the memory 302 according to the following steps, and the processor 301 runs the application programs stored in the memory 302, thereby implementing various functions:
performing speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
training a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
inputting target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
performing speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
In an embodiment, before performing speech conversion processing based on the user voice of the target user and the designated timbre information, the method further includes:
acquiring language content features and prosodic features from the user voice of the target user;
the performing speech conversion processing based on the user voice of the target user and the designated timbre information includes:
performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain a designated converted voice of the designated timbre.
In an embodiment, before inputting the target text of the speech to be synthesized and the designated timbre information into the speech synthesis model to generate the intermediate speech of the designated timbre, the method further includes:
acquiring a sample speech, the text of the sample speech and sample timbre information;
adjusting model parameters of a preset speech model based on the sample speech, the text of the sample speech and the sample timbre information to obtain an adjusted preset speech model;
continuing to acquire the next sample speech in the training sample speech set, the text of the next sample speech and the sample timbre information, and performing the step of adjusting the model parameters of the preset speech synthesis model based on the sample speech, the text of the sample speech and the sample timbre information, until the training status of the adjusted speech model satisfies a model training end condition, to obtain a trained preset speech model as the speech synthesis model.
In an embodiment, the training the speech conversion model according to the user voice and the designated converted voice to obtain the target speech conversion model includes:
adjusting model parameters of a parallel speech conversion model based on the user voice and the designated converted voice until a model training end condition of the parallel speech conversion model is satisfied, to obtain a trained parallel speech conversion model as the target speech conversion model.
In an embodiment, before performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain the designated converted voice of the designated timbre, the method further includes:
acquiring a training speech pair and preset timbre information, wherein the training speech pair includes an original speech and an output speech, and the original speech and the output speech are the same speech;
adjusting model parameters of a non-parallel speech conversion model based on the original speech, the output speech and the preset timbre information until a model training end condition of the non-parallel speech conversion model is satisfied, to obtain a trained non-parallel speech conversion model as a target non-parallel speech conversion model.
In an embodiment, the adjusting the model parameters of the non-parallel speech conversion model based on the original speech, the preset timbre information and the output speech includes:
performing language content extraction processing on the original speech through a language feature processor of the non-parallel speech conversion model to obtain language content features of the original speech;
performing prosody extraction processing on the original speech through a prosodic feature processor of the non-parallel speech conversion model to obtain prosodic features of the original speech;
adjusting the model parameters of the non-parallel speech conversion model based on the language content features of the original speech, the prosodic features of the original speech, the preset timbre information and the output speech.
In an embodiment, the performing language content extraction processing on the original speech through the language feature processor of the non-parallel speech conversion model to obtain the language content features of the original speech includes:
performing language information screening processing on the original speech, and determining language information corresponding to the original speech;
generating a first specified-length vector based on the language information, and using the first specified-length vector as the language content features.
In an embodiment, the performing prosody extraction processing on the original speech through the prosodic feature processor of the non-parallel speech conversion model to obtain the prosodic features of the original speech includes:
performing prosody information screening processing on the original speech, determining prosody information corresponding to the original speech, generating a second specified-length vector based on the prosody information, and using the second specified-length vector as the prosodic features.
In an embodiment, the acquiring language content features and prosodic features from the user voice of the target user includes:
performing language content extraction processing on the user voice through the language feature processor of the target non-parallel speech conversion model to obtain language content features of the user voice;
performing prosody extraction processing on the user voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain prosodic features of the user voice.
In an embodiment, the performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain the designated converted voice of the designated timbre includes:
inputting the language content features of the user voice, the prosodic features of the user voice and the designated timbre information into the target non-parallel speech conversion model to generate a designated converted voice of the designated timbre.
For the specific implementation of each of the above operations, reference may be made to the foregoing embodiments, which will not be repeated here.
Optionally, as shown in FIG. 10, the computer device 300 further includes: a touch display screen 303, a radio frequency circuit 304, an audio circuit 305, an input unit 306 and a power supply 307. The processor 301 is electrically connected to the touch display screen 303, the radio frequency circuit 304, the audio circuit 305, the input unit 306 and the power supply 307, respectively. Those skilled in the art will appreciate that the computer device structure shown in FIG. 10 does not constitute a limitation on the computer device and may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
The touch display screen 303 can be used to display a graphical user interface and receive operation instructions generated by the user acting on the graphical user interface. The touch display screen 303 may include a display panel and a touch panel. The display panel may be used to display information input by the user or information provided to the user, as well as various graphical user interfaces of the computer device, which may be composed of graphics, text, icons, video, and any combination thereof. Optionally, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. The touch panel may be used to collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel using a finger, a stylus, or any other suitable object or accessory) and generate corresponding operation instructions, and the operation instructions execute the corresponding program. Optionally, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends it to the processor 301, and can receive and execute commands sent by the processor 301. The touch panel may cover the display panel; when the touch panel detects a touch operation on or near it, it transmits the operation to the processor 301 to determine the type of touch event, and the processor 301 then provides a corresponding visual output on the display panel according to the type of touch event. In the embodiments of the present disclosure, the touch panel and the display panel may be integrated into the touch display screen 303 to implement input and output functions. However, in some embodiments, the touch panel and the display panel may be implemented as two separate components to implement input and output functions. That is, the touch display screen 303 may also serve as part of the input unit 306 to implement an input function.
In the embodiments of the present disclosure, the processor 301 executes an application program to generate a graphical interface on the touch display screen 303. The touch display screen 303 is used to present the graphical interface and receive operation instructions generated by the user acting on the graphical interface.
The radio frequency circuit 304 can be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or other computer devices and to transmit and receive signals with the network device or other computer devices.
The audio circuit 305 can be used to provide an audio interface between the user and the computer device through a speaker and a microphone. The audio circuit 305 can transmit the electrical signal converted from the received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 305 and converted into audio data. After the audio data is output to the processor 301 for processing, it is sent via the radio frequency circuit 304 to, for example, another computer device, or the audio data is output to the memory 302 for further processing. The audio circuit 305 may also include an earphone jack to provide communication between a peripheral headset and the computer device.
The input unit 306 can be used to receive input numbers, character information or user characteristic information (such as fingerprint, iris or facial information), and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
The power supply 307 is used to supply power to the various components of the computer device 300. Optionally, the power supply 307 may be logically connected to the processor 301 through a power management system, so that functions such as charging, discharging and power consumption management are implemented through the power management system. The power supply 307 may also include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other components.
Although not shown in FIG. 10, the computer device 300 may further include a camera, a sensor, a wireless fidelity module, a Bluetooth module, and the like, which will not be described in detail here.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
As can be seen from the above, the computer device provided by this embodiment performs speech conversion processing based on the user voice of the target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre; trains a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model; inputs target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre; and performs speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user. By constructing a speech synthesis model, a non-parallel speech conversion model and a parallel speech conversion model, the embodiments of the present disclosure synthesize target text into an intermediate speech of a designated timbre through the speech synthesis model; after the user voice of the target user is acquired, the designated timbre of the intermediate speech is directly converted into the timbre of the user voice through the parallel speech conversion model to obtain the target synthesized speech. Voice cloning can thus be performed quickly, the user's operations during voice cloning are simple, and the operating efficiency of voice cloning is effectively improved. Moreover, the embodiments of the present disclosure can generate a corresponding parallel conversion model for each user voice, and multiple users can share one non-parallel speech conversion model, which simplifies the structure of the speech conversion model and makes it lightweight, thereby reducing the storage consumption of the speech conversion model on the computer device.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by instructions, or by instructions controlling related hardware; the instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present disclosure provides a computer-readable storage medium in which a plurality of computer programs are stored; the computer program can be loaded by a processor to perform the steps in any of the speech processing methods provided by the embodiments of the present disclosure. For example, the computer program may perform the following steps:
performing speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
training a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
inputting target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
performing speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
In an embodiment, before performing speech conversion processing based on the user voice of the target user and the designated timbre information, the method further includes:
acquiring language content features and prosodic features from the user voice of the target user;
the performing speech conversion processing based on the user voice of the target user and the designated timbre information includes:
performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain a designated converted voice of the designated timbre.
In an embodiment, before inputting the target text of the speech to be synthesized and the designated timbre information into the speech synthesis model to generate the intermediate speech of the designated timbre, the method further includes:
acquiring a sample speech, the text of the sample speech and sample timbre information;
adjusting model parameters of a preset speech model based on the sample speech, the text of the sample speech and the sample timbre information to obtain an adjusted preset speech model;
continuing to acquire the next sample speech in the training sample speech set, the text of the next sample speech and the sample timbre information, and performing the step of adjusting the model parameters of the preset speech synthesis model based on the sample speech, the text of the sample speech and the sample timbre information, until the training status of the adjusted speech model satisfies a model training end condition, to obtain a trained preset speech model as the speech synthesis model.
In an embodiment, the training the speech conversion model according to the user voice and the designated converted voice to obtain the target speech conversion model includes:
adjusting model parameters of a parallel speech conversion model based on the user voice and the designated converted voice until a model training end condition of the parallel speech conversion model is satisfied, to obtain a trained parallel speech conversion model as the target speech conversion model.
In an embodiment, before performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain the designated converted voice of the designated timbre, the method further includes:
acquiring a training speech pair and preset timbre information, wherein the training speech pair includes an original speech and an output speech, and the original speech and the output speech are the same speech;
adjusting model parameters of a non-parallel speech conversion model based on the original speech, the output speech and the preset timbre information until a model training end condition of the non-parallel speech conversion model is satisfied, to obtain a trained non-parallel speech conversion model as a target non-parallel speech conversion model.
In an embodiment, the adjusting the model parameters of the non-parallel speech conversion model based on the original speech, the preset timbre information and the output speech includes:
performing language content extraction processing on the original speech through a language feature processor of the non-parallel speech conversion model to obtain language content features of the original speech;
performing prosody extraction processing on the original speech through a prosodic feature processor of the non-parallel speech conversion model to obtain prosodic features of the original speech;
adjusting the model parameters of the non-parallel speech conversion model based on the language content features of the original speech, the prosodic features of the original speech, the preset timbre information and the output speech.
In an embodiment, the performing language content extraction processing on the original speech through the language feature processor of the non-parallel speech conversion model to obtain the language content features of the original speech includes:
performing language information screening processing on the original speech, and determining language information corresponding to the original speech;
generating a first specified-length vector based on the language information, and using the first specified-length vector as the language content features.
In an embodiment, the performing prosody extraction processing on the original speech through the prosodic feature processor of the non-parallel speech conversion model to obtain the prosodic features of the original speech includes:
performing prosody information screening processing on the original speech, determining prosody information corresponding to the original speech, generating a second specified-length vector based on the prosody information, and using the second specified-length vector as the prosodic features.
In an embodiment, the acquiring language content features and prosodic features from the user voice of the target user includes:
performing language content extraction processing on the user voice through the language feature processor of the target non-parallel speech conversion model to obtain language content features of the user voice;
performing prosody extraction processing on the user voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain prosodic features of the user voice.
In an embodiment, the performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain the designated converted voice of the designated timbre includes:
inputting the language content features of the user voice, the prosodic features of the user voice and the designated timbre information into the target non-parallel speech conversion model to generate a designated converted voice of the designated timbre.
For the specific implementation of each of the above operations, reference may be made to the foregoing embodiments, which will not be repeated here.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
Since the computer program stored in the storage medium can perform the steps in any of the speech processing methods provided by the embodiments of the present disclosure, the embodiments of the present disclosure perform speech conversion processing based on the user voice of the target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre; train a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model; input target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre; and perform speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user. By constructing a speech synthesis model, a non-parallel speech conversion model and a parallel speech conversion model, the embodiments of the present disclosure synthesize target text into an intermediate speech of a designated timbre through the speech synthesis model; after the user voice of the target user is acquired, the designated timbre of the intermediate speech is directly converted into the timbre of the user voice through the parallel speech conversion model to obtain the target synthesized speech. Voice cloning can thus be performed quickly, the user's operations during voice cloning are simple, and the operating efficiency of voice cloning is effectively improved. Moreover, the embodiments of the present disclosure can generate a corresponding parallel conversion model for each user voice, and multiple users can share one non-parallel speech conversion model, which simplifies the structure of the speech conversion model and makes it lightweight, thereby reducing the storage consumption of the speech conversion model on the computer device.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
The speech processing method, apparatus, computer device and computer-readable storage medium provided by the embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the descriptions of the above embodiments are only intended to help understand the technical solutions and core ideas of the present disclosure. Those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features therein; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims (13)

  1. A speech processing method, comprising:
    performing speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
    training a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
    inputting target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
    performing speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
  2. The speech processing method according to claim 1, wherein before the performing speech conversion processing based on the user voice of the target user and the designated timbre information, the method further comprises:
    acquiring language content features and prosodic features from the user voice of the target user;
    the performing speech conversion processing based on the user voice of the target user and the designated timbre information comprises:
    performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain a designated converted voice of the designated timbre.
  3. The speech processing method according to claim 1, wherein before the inputting the target text of the speech to be synthesized and the designated timbre information into the speech synthesis model to generate the intermediate speech of the designated timbre, the method further comprises:
    acquiring a sample speech, the text of the sample speech and sample timbre information;
    adjusting model parameters of a preset speech model based on the sample speech, the text of the sample speech and the sample timbre information to obtain an adjusted preset speech model;
    continuing to acquire the next sample speech in a training sample speech set, the text of the next sample speech and the sample timbre information, and performing the step of adjusting the model parameters of the preset speech synthesis model based on the sample speech, the text of the sample speech and the sample timbre information, until the training status of the adjusted speech model satisfies a model training end condition, to obtain a trained preset speech model as the speech synthesis model.
  4. The speech processing method according to claim 1, wherein the training the speech conversion model according to the user voice and the designated converted voice to obtain the target speech conversion model comprises:
    adjusting model parameters of a parallel speech conversion model based on the user voice and the designated converted voice until a model training end condition of the parallel speech conversion model is satisfied, to obtain a trained parallel speech conversion model as the target speech conversion model.
  5. The speech processing method according to claim 2, wherein before the performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain the designated converted voice of the designated timbre, the method further comprises:
    acquiring a training speech pair and preset timbre information corresponding to the training speech, wherein the training speech pair comprises an original speech and an output speech, the original speech and the output speech are the same speech, and all speeches in the training speech pair are speeches in a training sample speech set;
    adjusting model parameters of a non-parallel speech conversion model based on the original speech, the output speech and the preset timbre information until a model training end condition of the non-parallel speech conversion model is satisfied, to obtain a trained non-parallel speech conversion model as a target non-parallel speech conversion model.
  6. The speech processing method according to claim 5, wherein the adjusting the model parameters of the non-parallel speech conversion model based on the original speech, the preset timbre information and the output speech comprises:
    performing language content extraction processing on the original speech through a language feature processor of the non-parallel speech conversion model to obtain language content features of the original speech;
    performing prosody extraction processing on the original speech through a prosodic feature processor of the non-parallel speech conversion model to obtain prosodic features of the original speech;
    adjusting the model parameters of the non-parallel speech conversion model based on the language content features of the original speech, the prosodic features of the original speech, the preset timbre information and the output speech.
  7. The speech processing method according to claim 5, wherein the performing language content extraction processing on the original speech through the language feature processor of the non-parallel speech conversion model to obtain the language content features of the original speech comprises:
    performing language information screening processing on the original speech, determining language information corresponding to the original speech, generating a first specified-length vector based on the language information, and using the first specified-length vector as the language content features.
  8. The speech processing method according to claim 5, wherein the performing prosody extraction processing on the original speech through the prosodic feature processor of the non-parallel speech conversion model to obtain the prosodic features of the original speech comprises:
    performing prosody information screening processing on the original speech, determining prosody information corresponding to the original speech, generating a second specified-length vector based on the prosody information, and using the second specified-length vector as the prosodic features.
  9. The speech processing method according to claim 5, wherein the acquiring language content features and prosodic features from the user voice of the target user comprises:
    performing language content extraction processing on the user voice through the language feature processor of the target non-parallel speech conversion model to obtain language content features of the user voice;
    performing prosody extraction processing on the user voice through the prosodic feature processor of the target non-parallel speech conversion model to obtain prosodic features of the user voice.
  10. The speech processing method according to claim 9, wherein the performing speech conversion processing based on the language content features, the prosodic features and the designated timbre information to obtain the designated converted voice of the designated timbre comprises:
    inputting the language content features of the user voice, the prosodic features of the user voice and the designated timbre information into the target non-parallel speech conversion model to generate a designated converted voice of the designated timbre.
  11. A speech processing apparatus, comprising:
    a first processing unit, configured to perform speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
    a training unit, configured to train a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
    a generation unit, configured to input target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
    a second processing unit, configured to perform speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
  12. A computer device, wherein the computer device comprises a memory and a processor, a computer program is stored in the memory, and the processor performs the following by invoking the computer program stored in the memory:
    performing speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
    training a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
    inputting target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
    performing speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
  13. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is adapted to be loaded by a processor to perform the following:
    performing speech conversion processing based on a user voice of a target user and designated timbre information to obtain a designated converted voice of a designated timbre, wherein the designated timbre information is timbre information determined from a plurality of preset timbre information, and the designated converted voice is a user voice with the designated timbre;
    training a speech conversion model according to the user voice and the designated converted voice to obtain a target speech conversion model;
    inputting target text of speech to be synthesized and the designated timbre information into a speech synthesis model to generate an intermediate speech of the designated timbre;
    performing speech conversion processing on the intermediate speech through the target speech conversion model to generate a target synthesized speech that matches the timbre of the target user.
PCT/CN2022/119157 2022-04-27 2022-09-15 Speech processing method and apparatus, computer device, and computer-readable storage medium WO2023206928A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210455923.0 2022-04-27
CN202210455923.0A CN114708849A (zh) 2022-04-27 2022-04-27 Speech processing method and apparatus, computer device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023206928A1 true WO2023206928A1 (zh) 2023-11-02

Family

ID=82176836

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/119157 WO2023206928A1 (zh) 2022-04-27 2022-09-15 Speech processing method and apparatus, computer device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114708849A (zh)
WO (1) WO2023206928A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708849A (zh) * 2022-04-27 2022-07-05 NetEase (Hangzhou) Network Co., Ltd. Speech processing method and apparatus, computer device, and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136691A (zh) * 2019-05-28 2019-08-16 Guangzhou Duoyi Network Co., Ltd. Speech synthesis model training method and apparatus, electronic device, and storage medium
CN111968617A (zh) * 2020-08-25 2020-11-20 Unisound Intelligent Technology Co., Ltd. Voice conversion method and system for non-parallel data
CN112309366A (zh) * 2020-11-03 2021-02-02 Beijing Youzhuju Network Technology Co., Ltd. Speech synthesis method and apparatus, storage medium, and electronic device
CN112820268A (zh) * 2020-12-29 2021-05-18 Shenzhen UBtech Technology Co., Ltd. Personalized voice conversion training method and apparatus, computer device, and storage medium
WO2022035586A1 (en) * 2020-08-13 2022-02-17 Google Llc Two-level speech prosody transfer
CN114708849A (zh) * 2022-04-27 2022-07-05 NetEase (Hangzhou) Network Co., Ltd. Speech processing method and apparatus, computer device, and computer-readable storage medium


Also Published As

Publication number Publication date
CN114708849A (zh) 2022-07-05

Similar Documents

Publication Publication Date Title
WO2020182153A1 (zh) 基于自适应语种进行语音识别的方法及相关装置
WO2022052481A1 (zh) 基于人工智能的vr互动方法、装置、计算机设备及介质
US20220044463A1 (en) Speech-driven animation method and apparatus based on artificial intelligence
CN112863547B (zh) 虚拟资源转移处理方法、装置、存储介质及计算机设备
WO2020073944A1 (zh) 语音合成方法及设备
WO2020177190A1 (zh) 一种处理方法、装置及设备
JP2021103328A (ja) 音声変換方法、装置及び電子機器
CN108520743A (zh) 智能设备的语音控制方法、智能设备及计算机可读介质
WO2019242414A1 (zh) 语音处理方法、装置、存储介质及电子设备
CN110265011B (zh) 一种电子设备的交互方法及其电子设备
CN112840396A (zh) 用于处理用户话语的电子装置及其控制方法
EP4336490A1 (en) Voice processing method and related device
WO2020057624A1 (zh) 语音识别的方法和装置
CN107564532A (zh) 电子设备的唤醒方法、装置、设备及计算机可读存储介质
WO2023246163A1 (zh) 一种虚拟数字人驱动方法、装置、设备和介质
WO2023206928A1 (zh) 语音处理方法、装置、计算机设备及计算机可读存储介质
WO2022227507A1 (zh) 唤醒程度识别模型训练方法及语音唤醒程度获取方法
CN112149599B (zh) 表情追踪方法、装置、存储介质和电子设备
US20230223006A1 (en) Voice conversion method and related device
US11150923B2 (en) Electronic apparatus and method for providing manual thereof
WO2020154916A1 (zh) 视频字幕合成方法、装置、存储介质及电子设备
WO2020102979A1 (zh) 语音信息的处理方法、装置、存储介质及电子设备
CN116092466A (zh) 语音模型的处理方法、装置、计算机设备及存储介质
CN116645955A (zh) 语音合成方法、装置、电子设备及计算机可读存储介质
US20240071363A1 (en) Electronic device and method of controlling text-to-speech (tts) rate

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939752

Country of ref document: EP

Kind code of ref document: A1