CN114242093A - Voice tone conversion method and device, computer equipment and storage medium


Info

Publication number
CN114242093A
CN114242093A (application CN202111544655.1A)
Authority
CN
China
Prior art keywords
voice
source
tone
speech
timbre
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111544655.1A
Other languages
Chinese (zh)
Inventor
崔洋洋 (Cui Yangyang)
余俊澎 (Yu Junpeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Youmi Technology (Shenzhen) Co., Ltd.
Original Assignee
Youmi Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youmi Technology (Shenzhen) Co., Ltd.
Priority to CN202111544655.1A
Publication of CN114242093A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a voice tone conversion method, a voice tone conversion apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring source speech of an original role and role speech of a target role; determining the source speech content and source speech timbre corresponding to the source speech, and determining the target speech timbre corresponding to the role speech; filtering speech characteristic information out of the source speech timbre to obtain a source basic timbre; performing first splicing processing on the source speech content and the source basic timbre to obtain first splicing information, and encoding the first splicing information to obtain a corresponding coding vector sequence; and performing second splicing processing on the coding vector sequence and the target speech timbre to obtain second splicing information, and decoding the second splicing information to convert the source speech output by the original role into target speech output by the target role. The method can output target speech with high similarity to the target timbre, thereby enhancing the voice tone conversion effect.

Description

Voice tone conversion method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method and an apparatus for converting speech timbre, a computer device, and a storage medium.
Background
With the development of voice processing technology, personalized voice conversion has become important for human-machine voice interaction. Using a voice conversion device, a user can convert the timbre of an original role into the timbre of a personalized film or animation character without changing the spoken content of the original role.
At present, when the timbre of an original role is converted in a personalized manner, the output speech still suffers from poor conversion quality and low similarity to the target timbre. How to perform voice timbre conversion efficiently and improve the conversion effect is therefore a problem to be solved by the present disclosure.
Disclosure of Invention
In view of the above, it is necessary to provide a voice tone conversion method, apparatus, computer device, computer readable storage medium and computer program product for improving the voice tone conversion effect.
In a first aspect, the present application provides a method for voice timbre conversion. The method comprises the following steps:
acquiring source speech of an original role and role speech of a target role;
determining source speech content and source speech timbre corresponding to the source speech, and determining target speech timbre corresponding to the role speech;
filtering the voice characteristic information in the source voice timbre to obtain a source basic timbre;
performing first splicing processing on the source speech content and the source basic tone to obtain first splicing information, and encoding the first splicing information to obtain a corresponding coding vector sequence;
and performing second splicing processing on the coding vector sequence and the target voice timbre to obtain second splicing information, and decoding the second splicing information to convert the source voice output by the original role into the target voice output by the target role.
In one embodiment, the determining source speech content and source speech timbre corresponding to the source speech includes: performing feature extraction processing on the source speech to obtain acoustic features and the source speech timbre; determining the phoneme categories included in the acoustic features; determining a phoneme posterior probability corresponding to each phoneme; and obtaining the source speech content corresponding to the source speech according to the phoneme categories and the phoneme posterior probabilities.
In one embodiment, determining the target voice timbre corresponding to the role voice comprises: performing feature extraction processing on the role voice to obtain the role voice timbre; and performing logarithmic transformation on the feature matrix of the role voice timbre to obtain the target voice timbre corresponding to the role voice.
In one embodiment, the filtering the speech characteristic information in the source speech timbre to obtain a source basic timbre includes: determining speech feature points in the acoustic features of the source speech, and determining the speech characteristic information corresponding to the speech feature points in the source speech timbre, where the speech feature points represent accent features of the source speech; and performing normalization processing on the source speech timbre according to the speech characteristic information to obtain the source basic timbre corresponding to the source speech.
In one embodiment, the voice tone conversion method is performed by a voice tone conversion model that includes a coding network, and the encoding the first splicing information to obtain a corresponding coding vector sequence includes: encoding the source speech content in the first splicing information through the coding network in the voice tone conversion model to obtain a first source coding vector; encoding the source basic timbre in the first splicing information through the coding network to obtain a second source coding vector; and obtaining the coding vector sequence corresponding to the source speech according to the first source coding vector, the second source coding vector, and the position information of the source speech content and the source basic timbre in the first splicing information.
In one embodiment, the performing a second splicing process on the coded vector sequence and the target speech timbre to obtain second splicing information includes: coding the target voice timbre through a coding network in the voice timbre conversion model to obtain a second target coding vector corresponding to the target voice timbre; and replacing a second source coding vector in the coding vector sequence through a second target coding vector corresponding to the target voice tone to obtain second splicing information corresponding to the role voice.
In one embodiment, the number of role voices of the target role is less than or equal to a preset number threshold.
In one embodiment, the voice tone conversion method is performed by a voice tone conversion model, and the training step of the voice tone conversion model includes: acquiring a first sample voice set and a second sample voice set, where the first sample voice set comprises a plurality of first sample voices, and the second sample voice set comprises at least one second sample voice corresponding to each sample role and a sample label corresponding to each second sample voice; determining a phoneme posterior probability corresponding to each phoneme in the first sample voices through a content extraction structure in the model, and training the content extraction structure through the phoneme posterior probabilities to obtain a trained content extraction structure; determining a sample voice timbre of each second sample voice; determining the phoneme posterior probability of each phoneme in the second sample voices through the trained content extraction structure; encoding and decoding the phoneme posterior probabilities and sample voice timbres of the second sample voices through a timbre extraction structure in the model to obtain a predicted voice timbre; and training the timbre extraction structure through the predicted voice timbre and the sample labels to obtain a trained timbre extraction structure, and synthesizing the trained content extraction structure and the trained timbre extraction structure to obtain the voice tone conversion model.
In a second aspect, the present application further provides a voice tone conversion apparatus. The device comprises:
the voice acquisition module is used for acquiring source voice of an original role and role voice of a target role;
the voice processing module is used for determining source voice content and source voice tone corresponding to the source voice and determining target voice tone corresponding to the role voice; filtering the voice characteristic information in the source voice timbre to obtain a source basic timbre;
the first splicing module is used for performing first splicing processing on the source speech content and the source basic tone to obtain first splicing information, and encoding the first splicing information to obtain a corresponding coding vector sequence;
and the second splicing module is used for carrying out second splicing processing on the coding vector sequence and the target voice timbre to obtain second splicing information, and decoding the second splicing information to convert the source voice output by the original role into the target voice output by the target role.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring source speech of an original role and role speech of a target role;
determining source speech content and source speech timbre corresponding to the source speech, and determining target speech timbre corresponding to the role speech;
filtering the voice characteristic information in the source voice timbre to obtain a source basic timbre;
performing first splicing processing on the source speech content and the source basic tone to obtain first splicing information, and encoding the first splicing information to obtain a corresponding coding vector sequence;
and performing second splicing processing on the coding vector sequence and the target voice timbre to obtain second splicing information, and decoding the second splicing information to convert the source voice output by the original role into the target voice output by the target role.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring source speech of an original role and role speech of a target role;
determining source speech content and source speech timbre corresponding to the source speech, and determining target speech timbre corresponding to the role speech;
filtering the voice characteristic information in the source voice timbre to obtain a source basic timbre;
performing first splicing processing on the source speech content and the source basic tone to obtain first splicing information, and encoding the first splicing information to obtain a corresponding coding vector sequence;
and performing second splicing processing on the coding vector sequence and the target voice timbre to obtain second splicing information, and decoding the second splicing information to convert the source voice output by the original role into the target voice output by the target role.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring source speech of an original role and role speech of a target role;
determining source speech content and source speech timbre corresponding to the source speech, and determining target speech timbre corresponding to the role speech;
filtering the voice characteristic information in the source voice timbre to obtain a source basic timbre;
performing first splicing processing on the source speech content and the source basic tone to obtain first splicing information, and encoding the first splicing information to obtain a corresponding coding vector sequence;
and performing second splicing processing on the coding vector sequence and the target voice timbre to obtain second splicing information, and decoding the second splicing information to convert the source voice output by the original role into the target voice output by the target role.
According to the voice tone conversion method and apparatus, the computer device, the storage medium, and the computer program product, by acquiring the source voice of the original role and the role voice of the target role, the source voice content and source voice timbre corresponding to the source voice can be determined, as can the target voice timbre corresponding to the role voice. The source basic timbre is then obtained by filtering the speech characteristic information out of the source voice timbre, so that first splicing processing can be performed on the source voice content and the source basic timbre to obtain first splicing information. After the first splicing information is obtained, it is encoded to obtain a corresponding coding vector sequence; second splicing processing is performed on the coding vector sequence and the target voice timbre to obtain second splicing information; and finally the second splicing information is decoded, so that the source voice output by the original role is converted into the target voice output by the target role. Because the conversion integrates the source voice content, the source basic timbre, and the target voice timbre, target voice with high similarity to the target timbre can be output, enhancing the voice tone conversion effect. In addition, a user can use this technology to have a favorite target role repeat the user's own speech, improving the user experience.
In the voice tone conversion model that performs the voice tone conversion method, the content extraction structure is trained on the first sample voice set to obtain a trained content extraction structure; on that basis, the timbre extraction structure is trained on the second sample voice set to obtain a trained timbre extraction structure, and the two trained structures are integrated into the voice tone conversion model. Because the content extraction structure and the timbre extraction structure are trained separately, the model need only comprise these two structures, which makes the deployed model more compact and improves its controllability and extensibility. The timbre extraction structure can be trained with as little as one second sample voice per sample role, which reduces the amount of voice data that must be collected for the target role and simplifies the model structure. In addition, training the timbre extraction structure on the second sample voice set on top of the trained content extraction structure makes the parameters of the timbre extraction structure more accurate, so that the target voice obtained after converting the source voice is more accurate and the voice tone conversion effect is improved.
Drawings
FIG. 1 is a diagram illustrating an exemplary embodiment of a method for voice timbre conversion;
FIG. 2 is a flow chart illustrating a method for voice timbre conversion in one embodiment;
FIG. 3 is a flow diagram illustrating voice timbre conversion according to one embodiment;
FIG. 4 is a flow chart illustrating voice timbre conversion in another embodiment;
FIG. 5 is a block diagram of a sound synthesizer in one embodiment;
FIG. 6 is a diagram illustrating an exemplary structure of a voice tone conversion model;
FIG. 7 is a schematic flow chart illustrating training of the voice tone conversion model according to one embodiment;
FIG. 8 is a block diagram showing the structure of a voice tone conversion apparatus according to an embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice tone conversion method provided by the embodiments of the application can be applied in the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 or the server 104 may execute the voice tone conversion method alone, or the two may cooperate to execute it. Taking cooperation as an example, the terminal 102 sends the acquired source voice of the original role and role voice of the target role to the server 104; the server 104 processes and converts the received source voice and role voice and returns the resulting voice tone conversion result to the terminal 102, which outputs it. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, internet-of-things device, or portable wearable device; internet-of-things devices include smart speakers, smart televisions, smart air conditioners, smart in-car devices, and the like, and portable wearable devices include smart watches, smart bracelets, head-mounted devices, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, a voice tone conversion method is provided, which is described by taking the method as an example applied to the computer device in fig. 1, where the computer device may be the terminal or the server in fig. 1, and the voice tone conversion method includes the following steps:
Step 202, obtaining source speech of the original role and role speech of the target role.
The source speech is the speech of the original role, that is, the speech whose timbre the user needs to convert; it may be sound uttered by the user or sound of another original role acquired by the user. The target role is a role the user selects from a designated target role set, such as an animated character or a film star, and the role voice of the target role can be collected in real time through a microphone of the computer device or extracted from a voice library in which role voices are stored in advance.
Specifically, according to the acquired source speech of the original role and role speech of the target role, the computer device determines a tone conversion instruction corresponding to the target role. The tone conversion instruction is used to convert the timbre of the original role into the timbre of the target role. For example, if the timbre of original role A is a male voice and the timbre of target role B is a female voice, the tone conversion instruction converts the male timbre into the female timbre.
In one embodiment, the computer device determines at least one target role corresponding to a voice library in which role voices are stored in advance, displays the at least one target role to a user through an interface, and determines role voices associated with the target roles in response to selection operation of the user on the interface. The role voice associated with the target role is any one of at least one role voice corresponding to the target role in the voice library.
Step 204, determining the source speech content and source speech timbre corresponding to the source speech, and determining the target voice timbre corresponding to the role speech.
Speech is composed of speech content and speech timbre. The speech content refers to the textual content expressed by the speech. The speech timbre generally consists of a basic timbre and an accent timbre: the basic timbre is the timbre stripped of any speech characteristic information, while the accent timbre refers to speech characteristic information that can represent a role's identity, such as accent, speaking speed, and loudness.
Specifically, the computer device performs feature extraction processing on the source speech to obtain acoustic features and the source speech timbre, and determines the phonemes included in the acoustic features and their phoneme categories. For each of the phonemes in the acoustic features, the computer device determines the phoneme category to which the current phoneme belongs and the time frame of the current phoneme in the source speech, denoted the current time frame. Further, the computer device determines the posterior probability of the phoneme category to which the current phoneme belongs in the current time frame to obtain a phoneme posterior probability, and determines the source speech content of the source speech through the phoneme posterior probabilities, for example by using the phoneme posterior probability corresponding to each phoneme as the source speech content. The source speech content is thus composed of a phoneme sequence comprising a plurality of phonemes. Similarly, the computer device performs feature extraction processing on the role voice to obtain the target voice timbre.
The acoustic features may include, but are not limited to, any one or more of the following: MFCCs (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction) parameters. MFCCs are cepstral parameters extracted on the Mel-scale frequency axis, where the Mel scale describes the nonlinear frequency response of the human ear. The PLP parameters are characteristic parameters based on an auditory model: a set of coefficients of an all-pole model prediction polynomial, equivalent to Linear Prediction Coefficient (LPC) features. The acoustic features may be extracted using existing techniques, which are not described in detail herein.
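As a concrete illustration of this feature extraction step, the following is a minimal sketch assuming the librosa library; the sampling rate, frame parameters, coefficient count, and the synthetic test tone are illustrative assumptions rather than values taken from this application.

```python
# Minimal sketch of acoustic feature (MFCC) and fundamental frequency (F0)
# extraction, assuming librosa; all parameters below are illustrative.
import librosa

sr = 16000
wav = librosa.tone(220, sr=sr, duration=1.0)  # stand-in for real source speech

# 13 Mel-frequency cepstral coefficients per 10 ms frame -> shape (13, T)
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# F0 track via pYIN; unvoiced frames are returned as NaN
f0, voiced_flag, voiced_prob = librosa.pyin(
    wav, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, frame_length=1024, hop_length=160)
```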
Phonemes are the smallest phonetic units divided according to the natural attributes of speech, defined by the articulatory actions within syllables; one articulatory action constitutes one phoneme. For example, in Chinese, "ā" has one phoneme, "ài" has two phonemes, and "dāi" has three phonemes. Phonemes can be divided into different categories depending on the language; Chinese speech, for example, is generally divided into the two major categories of vowel phonemes and consonant phonemes. The speech content of the source speech can therefore be summarized into a phoneme-category dictionary through phoneme extraction and then used to recognize the speech content.
The phonetic posteriorgrams (PPGs) used to characterize the speech content of the source speech form a time-by-phoneme-class matrix that gives, for each time frame t of a piece of audio, the posterior probability of each phoneme class y.
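At shape level, the PPG matrix can be pictured as in the sketch below; the phoneme-dictionary size and the linear layer standing in for a trained content extraction structure are assumptions, not the application's actual network.

```python
# Sketch of the PPG matrix: one posterior distribution over phoneme classes
# per time frame. `content_encoder` is a stand-in for a trained model.
import torch
import torch.nn.functional as F

NUM_PHONE_CLASSES = 218                  # illustrative dictionary size
content_encoder = torch.nn.Linear(13, NUM_PHONE_CLASSES)

mfcc_frames = torch.randn(500, 13)       # (T time frames, 13 MFCCs)
ppgs = F.softmax(content_encoder(mfcc_frames), dim=-1)  # (T, 218): row t is
                                                        # p(class y | frame t)
phone_sequence = ppgs.argmax(dim=-1)     # most probable phoneme per frame
```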
Step 206, filtering the speech characteristic information in the source speech timbre to obtain a source basic timbre.
Specifically, from the acoustic features and the source speech timbre obtained by feature extraction on the source speech, the computer device determines the speech feature points in the acoustic features and the speech characteristic information corresponding to those feature points. The computer device then normalizes the source speech timbre according to the speech characteristic information, that is, it filters the speech characteristic information (the accent timbre) out of the source speech timbre, thereby obtaining the source basic timbre corresponding to the source speech. The speech feature points represent features of the accent timbre in the source speech, such as the dialect style of the original role.
Step 208, performing first splicing processing on the source speech content and the source basic timbre to obtain first splicing information, and encoding the first splicing information to obtain a corresponding coding vector sequence.
Specifically, as shown in fig. 3, which is a schematic flow chart of voice timbre conversion, the computer device determines a content feature matrix of the source speech content and a source timbre feature matrix of the source basic timbre through the feature extraction structure in the voice tone conversion model, and combines the content feature matrix and the source timbre feature matrix to obtain the first splicing information corresponding to the source speech. At the same time, the computer device records the position information of the content feature matrix and the source timbre feature matrix within the first splicing information.
The computer device encodes the content feature matrix in the first splicing information through the coding network in the voice tone conversion model to obtain a first source coding vector, and encodes the source timbre feature matrix in the first splicing information to obtain a second source coding vector. The coding vector sequence corresponding to the source speech is then obtained from the first source coding vector, the second source coding vector, and the position information of the content feature matrix and the source timbre feature matrix in the first splicing information.
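A minimal sketch of this splicing-and-encoding step follows, assuming frame-aligned content and timbre matrices; the linear encoders stand in for the model's coding network and every dimension is an illustrative assumption.

```python
# Minimal sketch of the first splicing and encoding; all sizes illustrative.
import torch
import torch.nn as nn

T, DC, D_ENC = 500, 218, 256
content = torch.randn(T, DC)       # content feature matrix (e.g. PPGs)
base_timbre = torch.randn(T, 1)    # source basic timbre (filtered log F0)

# First splicing: concatenate along the feature axis and record where each
# part lives so the timbre slice can be swapped out later.
first_splice = torch.cat([content, base_timbre], dim=-1)         # (T, DC + 1)
positions = {"content": slice(0, DC), "timbre": slice(DC, DC + 1)}

content_encoder = nn.Linear(DC, D_ENC)   # stand-in coding network branches
timbre_encoder = nn.Linear(1, D_ENC)

first_source_code = content_encoder(first_splice[:, positions["content"]])
second_source_code = timbre_encoder(first_splice[:, positions["timbre"]])

# Coding vector sequence: both code streams kept in their spliced order,
# slot 0 = content, slot 1 = timbre.
code_sequence = torch.stack([first_source_code, second_source_code], dim=1)
```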
Step 210, performing second splicing processing on the coding vector sequence and the target voice timbre to obtain second splicing information, and decoding the second splicing information to convert the source voice output by the original role into the target voice output by the target role.
Specifically, referring to fig. 3, the computer device determines a target tone feature matrix of the character voice through the feature extraction structure, and codes the target tone feature matrix through a coding network in the voice tone conversion model to obtain a target coding vector corresponding to the target voice tone. And the computer equipment replaces a second source coding vector corresponding to the source tone characteristic matrix in the coding vector sequence through the target coding vector to obtain second splicing information corresponding to the role voice. And the computer equipment decodes the second splicing information through a decoding network in the voice tone conversion model to obtain target voice corresponding to the role voice of the target role. It is readily understood that the phonetic content of the target speech is identical to the phonetic content of the source speech.
In one embodiment, the number of role voices of the target role is less than or equal to the preset number threshold.
The preset number threshold may be 1; that is, even when only one role voice is available, voice timbre conversion can still be performed on the source voice. In this embodiment, when the number of role voices of the target role is less than or equal to the preset number threshold, the source voice of the original role can still be converted into the target voice output by the target role, which reduces the cost of collecting role voices and improves the efficiency of voice timbre conversion.
In one embodiment, the voice timbre conversion method is performed by a voice timbre conversion model. The voice tone conversion model comprises a feature extraction structure, a content extraction structure and a tone extraction structure. The feature extraction structure is used for acquiring source voice of an original role and role voice of a target role, and filtering voice characteristic information in the source voice tone to obtain a source basic tone; the content extraction structure is used for determining source speech content and source speech timbre corresponding to the source speech and determining target speech timbre corresponding to the role speech; the tone extraction structure is used for performing first splicing processing on source audio content and the source basic tone to obtain first splicing information, and coding the first splicing information to obtain a corresponding coding vector sequence; and the tone extraction structure is also used for carrying out second splicing processing on the coding vector sequence and the tone of the target voice to obtain second splicing information, and decoding the second splicing information to convert the source voice output by the original role into the target voice output by the target role.
In the voice tone conversion method, by obtaining the source voice of the original role and the role voice of the target role, the source voice content and source voice timbre corresponding to the source voice can be determined, as can the target voice timbre corresponding to the role voice. The source basic timbre is then obtained by filtering the speech characteristic information out of the source voice timbre, so that first splicing processing can be performed on the source voice content and the source basic timbre to obtain first splicing information. After the first splicing information is obtained, it is encoded to obtain a corresponding coding vector sequence; second splicing processing is performed on the coding vector sequence and the target voice timbre to obtain second splicing information; and finally the second splicing information is decoded, so that the source voice output by the original role is converted into the target voice output by the target role. Because the conversion integrates the source voice content, the source basic timbre, and the target voice timbre, target voice with high similarity to the target timbre can be output, enhancing the voice tone conversion effect. In addition, a user can use this technology to have a favorite target role repeat the user's own speech, improving the user experience.
In one embodiment, determining the source speech content and source speech timbre corresponding to the source speech comprises: performing feature extraction processing on the source speech to obtain acoustic features and the source speech timbre; determining the phoneme categories included in the acoustic features; determining a phoneme posterior probability corresponding to each phoneme; and obtaining the source speech content corresponding to the source speech according to the phoneme categories and the phoneme posterior probabilities.
Specifically, for clarity in what follows, the feature extraction processing performed on the source speech is called the first feature extraction processing, the feature extraction processing performed on the role voice is called the second feature extraction processing, the acoustic features extracted from the source speech are called the first acoustic features, and the acoustic features extracted from the role voice are called the second acoustic features. The computer device performs the first feature extraction processing on the source speech, extracting the first acoustic features and a first fundamental frequency, and takes the first fundamental frequency as the source speech timbre; the first fundamental frequency is the source speech timbre carrying the speech characteristic information of the original role. Based on the first acoustic features, the computer device determines the phonemes therein and the phoneme category of each phoneme, and then looks up the phoneme posterior probability of the corresponding phoneme category for each phoneme from the phoneme dictionary. After the phoneme posterior probability corresponding to each phoneme in the first acoustic features is determined, the computer device determines, from the phoneme posterior probabilities and phoneme categories, the phoneme sequence formed by the phonemes in the first acoustic features, and obtains the source speech content corresponding to the source speech from that phoneme sequence.
In this embodiment, the source speech content corresponding to the source speech is obtained by extracting the first acoustic feature of the source speech and determining the posterior probability of the phoneme corresponding to each phoneme according to the phoneme category included in the first acoustic feature. Because the phoneme posterior probability is directly used as the intermediate characteristic of the voice conversion, the voice content of the source voice does not need to be mapped into an acoustic model, and the flexibility of the subsequent voice tone conversion is improved.
In one embodiment, the role voice timbre is obtained by performing second feature extraction processing on the role voice, and logarithmic transformation is performed on the feature matrix of the role voice timbre to obtain the target voice timbre corresponding to the role voice timbre.
Specifically, the computer device performs the second feature extraction processing on the role voice, extracts a second fundamental frequency from it, and takes the second fundamental frequency as the role voice timbre; the second fundamental frequency is the role voice timbre carrying the speech characteristic information of the target role. The computer device performs logarithmic transformation on the feature matrix of the second fundamental frequency to obtain a voice timbre representation closer to how pitch is perceived by the human ear, that is, the target voice timbre corresponding to the role voice timbre.
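A small sketch of this logarithmic transformation follows, assuming an F0 track in which zeros mark unvoiced frames; the numeric values are illustrative.

```python
# Sketch of the logarithmic transformation of the target fundamental
# frequency; zeros mark unvoiced frames and are kept out of the log.
import numpy as np

f0 = np.array([0.0, 210.4, 215.2, 0.0, 198.7])  # Hz, illustrative values
voiced = f0 > 0
log_f0 = np.zeros_like(f0)
log_f0[voiced] = np.log(f0[voiced])  # closer to how the ear perceives pitch
```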
In this embodiment, the computer device extracts only the voice timbre in the character voice, instead of all voices including the voice timbre and the voice content, thereby reducing the extraction of unnecessary voice features and saving resources of the computer device.
In one embodiment, as shown in fig. 4, fig. 4 is a flow chart illustrating voice timbre conversion in another embodiment. The computer equipment performs feature extraction on the role voice through the tone extraction structure to obtain tone features of the role voice, performs content feature extraction on the source voice through the content extraction structure to obtain content features of the source voice, and integrates the content features and the tone features through the synthesis structure to obtain a Mel spectrogram of a target voice output by a target role. The sound decoding structure decodes the Mel spectrogram of the target voice, thereby obtaining the target voice containing the voice characteristics of the target role.
In this embodiment, the computer device applies different feature extraction structures to the role speech and the source speech respectively, obtains the Mel spectrogram of the target voice output by the target role through the synthesis structure, and finally decodes the target voice of the target role through the sound decoding structure. Because each voice feature point in the Mel spectrogram of the target voice can be identified directly, the target voice of the target role is obtained efficiently, improving the voice timbre conversion efficiency.
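The mel-spectrogram-to-waveform step can be sketched as follows with librosa, using Griffin-Lim purely as a stand-in for the learned sound decoding structure; the application's decoder is a neural model, so this only illustrates the interface, and all parameters are assumptions.

```python
# Sketch of Mel spectrogram synthesis and decoding, assuming librosa;
# Griffin-Lim is a classical stand-in for the neural sound decoder.
import librosa

sr = 16000
wav = librosa.tone(220, sr=sr, duration=1.0)  # stand-in for converted speech

# 80-band Mel spectrogram, a typical synthesis-side representation
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80,
                                     n_fft=400, hop_length=160)

# Decode: approximate the linear spectrogram, then recover a waveform.
linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=400)
target_wav = librosa.griffinlim(linear, hop_length=160)
```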
In one embodiment, as shown in FIG. 3, filtering speech characteristic information in a source speech timbre to obtain a source base timbre comprises: determining a voice characteristic point in the acoustic characteristic of source voice, and determining voice characteristic information corresponding to the voice characteristic point in the voice color of the source voice; the voice characteristic points represent the accent characteristics of the source voice; and carrying out normalization processing on the tone of the source speech according to the speech characteristic information to obtain the source basic tone corresponding to the source speech.
Specifically, the acoustic features of the source speech are the first acoustic features obtained during the first feature extraction processing; they are composed of feature points covering both the source speech timbre and the source speech content, and the speech feature points are those feature points in the first acoustic features that embody the accent features of the original role. The computer device determines the speech feature points in the first acoustic features and thereby the speech characteristic information corresponding to those feature points in the source speech timbre. According to this speech characteristic information, the computer device normalizes the source speech timbre, that is, it filters the speech characteristic information out of the source speech timbre to obtain a source basic timbre that contains no speech characteristic information. The main purpose of normalization is to reduce random differences between the timbres of natural speakers, for example differences in accent, so as to retain linguistically meaningful information. The computer device may normalize the source speech timbre with a preset normalization algorithm, for example a logarithmic z-score (lz) algorithm, a standard deviation algorithm, or a frequency-domain definition algorithm.
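A minimal sketch of the normalization follows, assuming the logarithmic z-score (lz) variant named above: per-utterance log-F0 statistics of the speaker are removed, which strips the speaker-specific information. The F0 values are illustrative.

```python
# Minimal sketch of lz (logarithmic z-score) normalization of the source
# timbre; zeros mark unvoiced frames.
import numpy as np

def lz_normalize(f0: np.ndarray) -> np.ndarray:
    """Remove per-utterance speaker statistics from the log-F0 track."""
    voiced = f0 > 0
    log_f0 = np.log(f0[voiced])
    out = np.zeros_like(f0)
    out[voiced] = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)
    return out

source_base_timbre = lz_normalize(np.array([0.0, 120.3, 118.9, 0.0, 131.2]))
```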
In the embodiment, the source basic tone is obtained by filtering the voice characteristic information in the source voice tone, so that unnecessary influence of the source voice tone on the voice tone conversion process is avoided, and the subsequent voice tone conversion can be realized more efficiently.
In one embodiment, the voice tone conversion method is performed by a voice tone conversion model that includes a coding network, and encoding the first splicing information to obtain a corresponding coding vector sequence includes: encoding the source speech content in the first splicing information through the coding network in the voice tone conversion model to obtain a first source coding vector; encoding the source basic timbre in the first splicing information through the coding network to obtain a second source coding vector; and obtaining the coding vector sequence corresponding to the source speech according to the first source coding vector, the second source coding vector, and the position information of the source speech content and the source basic timbre in the first splicing information.
Specifically, referring to fig. 3, the computer device obtains a first source code vector corresponding to the source audio content by determining first splicing information corresponding to the source audio and encoding the source audio content in the first splicing information by using an encoding network of a speech audio conversion model, and obtains a second source code vector corresponding to the source base audio by encoding the source base audio through the encoding network. And combining the first source coding vector and the second source coding vector according to the position information of the source audio content and the source basic tone in the first splicing information to obtain a coding vector sequence corresponding to the source audio.
In one embodiment, the tone extraction structure includes a coding network.
In this embodiment, the computer device obtains the coding vector sequence corresponding to the source speech by encoding the source speech in the first splicing information, so that the coding vector sequence can subsequently serve as the object to be updated with the role voice, improving the voice conversion efficiency.

In one embodiment, as shown in fig. 5, fig. 5 is a block diagram of a sound synthesizer 500. The computer device splices the acquired phoneme posterior probabilities with the logarithmic fundamental-frequency features and uses the spliced speech feature signal as the input signal of a condition network 502 in the sound synthesizer 500. The condition network encodes the phoneme posterior probabilities and the logarithmic fundamental-frequency features in the input signal; an upsampling layer 504 in the sound synthesizer then upsamples the encoded input signal so that the time resolution of the conditioning features matches that of the speech waveform to be synthesized; finally, the output signal is sent to a synthesizer 506 in the sound synthesizer to recover the speech waveform signal. The sound synthesizer may be a conditional WaveNet synthesizer.
In one embodiment, the condition network 502 includes a bidirectional long short-term memory (BiLSTM) network, which is used to model the relationships between the phoneme posterior probabilities of different phonemes within a stretch of the phoneme sequence.
In one embodiment, after the second splicing information is encoded by the conditional network 502, the encoded splicing information is upsampled by a deconvolution network.
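A sketch of such a condition network and upsampling layer follows, assuming PyTorch; the input width (218 phoneme classes plus one log-F0 channel), hidden size, and upsampling factor are illustrative assumptions rather than the application's actual configuration.

```python
# Sketch of the condition network (BiLSTM) plus deconvolution upsampling.
import torch
import torch.nn as nn

class ConditionNetwork(nn.Module):
    def __init__(self, in_dim=219, hidden=128, upsample_factor=160):
        super().__init__()
        # BiLSTM relates phoneme posteriors across neighbouring frames.
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True,
                             bidirectional=True)
        # Deconvolution raises frame-rate features to waveform resolution.
        self.upsample = nn.ConvTranspose1d(2 * hidden, 2 * hidden,
                                           kernel_size=upsample_factor,
                                           stride=upsample_factor)

    def forward(self, spliced):            # (batch, T, in_dim): PPGs + log F0
        encoded, _ = self.blstm(spliced)   # (batch, T, 2 * hidden)
        encoded = encoded.transpose(1, 2)  # (batch, 2 * hidden, T)
        return self.upsample(encoded)      # (batch, 2 * hidden, T * factor)

conditioning = ConditionNetwork()(torch.randn(2, 50, 219))
```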
In an embodiment, the performing a second splicing process on the coded vector sequence and the target speech timbre to obtain second splicing information includes: coding the target voice timbre through a coding network in the voice timbre conversion model to obtain a second target coding vector corresponding to the target voice timbre; and replacing a second source coding vector in the coding vector sequence through a second target coding vector corresponding to the target voice tone to obtain second splicing information corresponding to the role voice.
Specifically, referring to fig. 3, the computer device determines the target voice timbre in the role voice and encodes it using the coding network of the voice tone conversion model to obtain the second target coding vector corresponding to the target voice timbre. The computer device replaces the second source coding vector in the coding vector sequence with the second target coding vector while leaving the first source coding vector unchanged, obtaining the replaced coding vector sequence, which is taken as the second splicing information corresponding to the role speech.
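Continuing the slot layout from the earlier coding-sequence sketch, the replacement amounts to overwriting the timbre slot while the content code stays untouched; the encoder and all sizes are again stand-in assumptions.

```python
# Minimal sketch of the second splicing (timbre replacement).
import torch
import torch.nn as nn

T, D_ENC = 500, 256
code_sequence = torch.randn(T, 2, D_ENC)  # slot 0: content, slot 1: source timbre
timbre_encoder = nn.Linear(1, D_ENC)      # stand-in for the coding network

target_timbre = torch.randn(T, 1)         # target role's log-F0 features
second_target_code = timbre_encoder(target_timbre)

second_splice = code_sequence.clone()
second_splice[:, 1, :] = second_target_code  # overwrite only the timbre slot;
                                             # the content code is untouched
```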
In one embodiment, a decoding network is included in the tone extraction structure. The second splicing information corresponding to the character voice can be converted into the target voice output by the target character through the decoding network.
In this embodiment, by replacing the second source coding vector in the coding vector sequence, the computer device updates the coding vector sequence into second splicing information carrying the speech characteristic information of the role voice, so that the second splicing information can subsequently be decoded and the source voice output by the original role can be converted more accurately into the target voice output by the target role.
In one embodiment, as shown in fig. 6, fig. 6 is a schematic structural diagram of the voice tone conversion model, whose training structure comprises a first training phase, a second training phase, and a conversion phase.
In the first training phase, the computer device performs feature extraction on the first sample voices in the first sample voice set through the feature extraction structure 621 to obtain their Mel-frequency cepstral coefficients (MFCCs). The computer device extracts the speech content from the MFCCs through the content extraction structure 622, that is, it extracts the phonetic posteriorgrams (PPGs), determines the phoneme posterior probability corresponding to each phoneme included in the MFCCs, and trains the content extraction structure 622 with the phoneme posterior probabilities, thereby obtaining the trained content extraction structure 622.
In the second training phase, the computer device performs feature extraction on the second sample voices in the second sample voice set through the feature extraction structure 641 to obtain the MFCCs and fundamental frequency (F0) of each second sample voice. The trained content extraction structure 622 extracts the phoneme posterior probabilities of the second sample voice from its MFCCs, and the timbre extraction structure 642 is then trained with the phoneme posterior probabilities and the fundamental frequency of the second sample voice, yielding the trained timbre extraction structure 642.
In the conversion stage, the computer device performs feature extraction on the source speech through the feature extraction structure 661 to obtain the MFCCs and fundamental frequency of the source speech, and filters the fundamental frequency of the source speech to obtain a source speech timbre containing no speech characteristic information, that is, the source basic timbre. Through the trained content extraction structure 622, the phoneme posterior probabilities of the source speech are extracted from its MFCCs; the phoneme posterior probabilities and the filtered fundamental frequency are input into the timbre extraction structure 642, which, combined with the fundamental frequency of the second sample voice previously input into it, encodes and decodes the input information, thereby converting the source speech into the speech waveform signal of the corresponding target speech.
In one embodiment, as shown in fig. 7, the voice tone conversion method is performed by a voice tone conversion model, and the training process of the model proceeds as follows:
step 702, obtaining a first sample voice set and a second sample voice set; the first sample voice set comprises a plurality of first sample voices, and the second sample voice set comprises at least one second sample voice corresponding to each sample role and a sample label corresponding to each second sample voice.
The first sample voice set is mainly used to train a model that can accurately recognize speech content; role voices of the target role collected later do not need to be added to the first sample voice set. The second sample voice set is mainly used to train a model that can accurately recognize voice timbre, and newly collected role voices of the target role do need to be added to the second sample voice set, so that the parameters of the trained timbre extraction structure become more accurate. It should be understood that the number of second sample voices corresponding to each sample role in the second sample voice set may be as small as one.
The computer equipment acquires a first sample voice set and a second sample voice set from a preset voice library, wherein the sample voices in the first sample voice set and the second sample voice set are both composed of voice timbre and voice content corresponding to the voice timbre.
Because the voice content and the voice timbre are respectively trained, when the voice timbre conversion model is trained, the first sample voice set is not required to be updated, and the voice timbre conversion model can be trained only by acquiring the role voices of a small number of sample roles.
Step 704, determining a phoneme posterior probability corresponding to each phoneme in the first sample voice through a content extraction structure in the voice tone model, and training the content extraction structure through the phoneme posterior probability to obtain a trained content extraction structure.
Specifically, for clarity, the feature extraction performed on the first sample voices is called first sample feature extraction, and the feature extraction performed on the second sample voices is called second sample feature extraction. When the computer device performs first sample feature extraction on a first sample voice, it obtains the first sample acoustic features of that voice, determines, through the content extraction structure in the voice tone conversion model, the phoneme categories in the first sample acoustic features and the phoneme posterior probability corresponding to each phoneme, and trains the content extraction structure with the phoneme posterior probabilities to obtain the trained content extraction structure.
If the type of the first sample voice is English, setting the phoneme type according to the phoneme type of English pronunciation; if the type of the first sample speech is chinese or other dialects, the phoneme type may be set according to different pronunciation rules, which is not limited in the embodiment of the present application.
Step 706, determining a sample voice tone of the second sample voice.
Specifically, the computer device performs second sample feature extraction on a second sample voice in the second sample voice set through the feature extraction structure to obtain the second sample fundamental frequency of the second sample voice, where the second sample fundamental frequency serves as the sample voice tone carrying the voice characteristic information of the sample role voice.
Step 708, determining the phoneme posterior probability of each phoneme in the second sample voice through the trained content extraction structure; and coding and decoding the phoneme posterior probabilities of the second sample voice and the second sample voice tone through a tone extraction structure in the voice tone model to obtain a predicted voice tone.
Specifically, the computer device extracts the second sample acoustic feature of the second sample voice through the feature extraction structure, and determines the phoneme posterior probability corresponding to each phoneme in the second sample acoustic feature through the trained content extraction structure; these phoneme posterior probabilities constitute the voice content of the sample role voice. Through the tone extraction structure in the voice tone model, the computer device then encodes and decodes the phoneme posterior probabilities of the second sample voice together with the second sample voice tone to obtain the predicted voice tone. For the specific process of determining the phoneme posterior probability of each phoneme in the second sample voice, reference may be made to the process of determining the phoneme posterior probability of the source speech; for the process of encoding and decoding the second sample voice tone, reference may be made to the process of encoding and decoding the target voice tone. Details are not repeated here.
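An encoder-decoder of the kind described in step 708 might be sketched as follows; the GRU layers, the dimensions, and the choice of mel-spectrogram-like frames as the predicted voice tone are illustrative assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TimbreExtractor(nn.Module):
    def __init__(self, n_phonemes=70, tone_dim=1, hidden=256, out_dim=80):
        super().__init__()
        self.encoder = nn.GRU(n_phonemes + tone_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)  # predicted voice tone frames

    def forward(self, ppg, tone):
        # ppg: (batch, frames, n_phonemes); tone: (batch, frames, tone_dim)
        x = torch.cat([ppg, tone], dim=-1)  # splice content with tone
        enc, _ = self.encoder(x)
        dec, _ = self.decoder(enc)
        return self.proj(dec)
```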
Step 710, training the tone extraction structure through the predicted voice tone and the sample label to obtain the trained tone extraction structure.
The sample label is the voice characteristic information corresponding to the voice feature points in the second sample acoustic feature; that is, the voice characteristic information in the predicted voice tone must not deviate from the voice characteristic information in the second sample voice, so that the trained tone extraction structure can accurately produce target voice carrying the voice tone characteristics of the target role.
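Continuing the TimbreExtractor sketch above, step 710 can be illustrated as a reconstruction-style training step in which the predicted voice tone is pulled toward the sample label; the L1 loss and learning rate are assumptions of the sketch, not the training objective of this application.

```python
import torch

model = TimbreExtractor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def timbre_train_step(ppg, tone, sample_label):
    """ppg: (batch, frames, n_phonemes); tone: (batch, frames, 1);
    sample_label: (batch, frames, out_dim) target voice-tone frames."""
    predicted = model(ppg, tone)
    loss = torch.nn.functional.l1_loss(predicted, sample_label)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```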
Step 712, synthesizing the trained content extraction structure and the trained tone extraction structure to obtain the voice tone model.
Specifically, the computer device splices the trained content extraction structure and the trained tone extraction structure to obtain the trained voice tone model, so that source voice output by an original role can be converted into target voice output by a target role through the trained voice tone model.
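One plausible way to splice the two trained structures into a single model, under the same assumptions as the sketches above, is a plain wrapper module:

```python
import torch
import torch.nn as nn

class VoiceTimbreModel(nn.Module):
    def __init__(self, content_extractor, timbre_extractor):
        super().__init__()
        self.content = content_extractor  # trained first, then held fixed
        self.timbre = timbre_extractor

    def forward(self, source_mfcc, target_tone):
        # source_mfcc: (batch, frames, n_mfcc); target_tone: (batch, frames, 1)
        ppg = torch.softmax(self.content(source_mfcc), dim=-1)
        return self.timbre(ppg, target_tone)
```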
It is easy to understand that, in an application scenario using the voice tone conversion model, role A is the original role and role B is the target role, and role A wants to convert its own voice into the voice of role B, that is, to have its own voice content repeated in the voice of role B. If role A utters "Hello!", the voice of role A is input into the voice tone conversion model, and the voice output by the model sounds as if role B had uttered "Hello!".
In this embodiment, the content extraction structure is trained through the first sample voice set to obtain a trained content extraction structure; the tone extraction structure is then trained through the second sample voice set on the basis of the trained content extraction structure to obtain a trained tone extraction structure; and the two trained structures are integrated to obtain the voice tone model. Because the content extraction structure and the tone extraction structure are trained separately, the voice tone conversion model need only comprise these two structures, which makes the deployed model more compact and improves the controllability and extensibility of the model. The tone extraction structure can be trained with as little as one second sample voice per sample role, which reduces the amount of voice data that must be collected for the target role and simplifies the structure of the tone conversion model. In addition, training the tone extraction structure on the basis of the trained content extraction structure makes the parameters of the tone extraction structure more accurate, so the target voice obtained by converting the source voice through the tone conversion model is more accurate and the voice tone conversion effect is improved.
In one embodiment, the number of second sample voices corresponding to each sample role is less than or equal to a preset number threshold; for example, there may be only one second sample voice per sample role.
In this embodiment, even when the number of second sample voices corresponding to each sample role is less than or equal to the preset number threshold, the source voice of the original role can still be converted into target voice output by the target role. Compared with traditional methods, which require a large number of voice data samples of the target role to build a voice conversion model, the present application can therefore reduce the cost of collecting sample role voices.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise, the execution of these steps is not strictly ordered, and they may be executed in other sequences. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily executed at the same moment but may be executed at different moments, and the execution order of these sub-steps or stages is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a voice tone conversion apparatus for implementing the above voice tone conversion method. The solution provided by the apparatus is similar to that described for the method, so for the specific limitations in the one or more apparatus embodiments below, reference may be made to the limitations of the voice tone conversion method above; details are not repeated here.
In one embodiment, as shown in fig. 8, there is provided a voice tone conversion apparatus 800, including: a voice acquisition module 802, a voice processing module 804, a first concatenation module 806, and a second concatenation module 808, wherein:
a voice obtaining module 802, configured to obtain source voice of an original role and role voice of a target role.
The voice processing module 804 is used for determining source voice content and source voice tone corresponding to the source voice, and determining target voice tone corresponding to the role voice; and filtering the voice characteristic information in the source voice timbre to obtain a source basic timbre.
The first splicing module 806 is configured to perform first splicing processing on the source voice content and the source basic tone to obtain first splicing information, and to encode the first splicing information to obtain a corresponding coding vector sequence.
The second splicing module 808 is configured to perform second splicing processing on the coding vector sequence and the target voice tone to obtain second splicing information, and to decode the second splicing information so as to convert source voice output by the original role into target voice output by the target role.
In one embodiment, the voice processing module 804 further includes a feature extraction module 8041, configured to perform feature extraction processing on the source voice to obtain an acoustic feature and the source voice tone; determine the phoneme categories included in the acoustic feature; determine the phoneme posterior probability corresponding to each phoneme; and obtain the source voice content corresponding to the source voice according to the phoneme categories and the phoneme posterior probabilities.
In an embodiment, the feature extraction module 8041 is further configured to perform feature extraction processing on the role voice to obtain the tone of the role voice, and to perform logarithmic transformation on the feature matrix corresponding to the tone of the role voice to obtain the target voice tone corresponding to the role voice tone.
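A minimal numpy sketch of such a logarithmic transformation is shown below; the epsilon floor that keeps the logarithm well defined is an assumption of the sketch.

```python
import numpy as np

def log_transform_timbre(feature_matrix, eps=1e-8):
    """feature_matrix: non-negative tone features of the role voice,
    e.g. an F0/spectral feature matrix."""
    return np.log(np.maximum(feature_matrix, eps))
```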
In one embodiment, the voice processing module 804 further includes a tone filtering module 8042, configured to determine the voice feature points in the acoustic feature of the source voice and the voice characteristic information corresponding to those feature points in the source voice tone, where the voice feature points represent the accent features of the source voice; and to normalize the source voice tone according to the voice characteristic information to obtain the source basic tone corresponding to the source voice.
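The normalization performed by the tone filtering module 8042 might be sketched as follows, treating the voice feature points as a boolean frame mask over an F0-based tone track; this representation is an illustrative assumption.

```python
import numpy as np

def normalize_source_timbre(source_tone, feature_point_mask):
    """source_tone: (frames,) F0-based tone track;
    feature_point_mask: (frames,) bool, True at the voice feature points."""
    char_info = source_tone[feature_point_mask]  # the characteristic information
    mu = char_info.mean()
    sigma = char_info.std() + 1e-8
    return (source_tone - mu) / sigma  # the source basic tone
```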
In one embodiment, the voice tone conversion method is performed by a voice tone conversion model that includes a coding network, and the first splicing module 806 is further configured to: encode the source voice content in the first splicing information through the coding network to obtain a first source coding vector; encode the source basic tone in the first splicing information through the coding network to obtain a second source coding vector; and obtain the coding vector sequence corresponding to the source voice according to the first source coding vector, the second source coding vector, and the position information of the source voice content and the source basic tone in the first splicing information.
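As an illustration of how the first splicing module 806 could assemble the coding vector sequence while preserving position information, the sketch below simply orders the first source coding vector block before the second source coding vector block; the linear encoders stand in for the coding network, and their dimensions are assumed.

```python
import torch
import torch.nn as nn

DIM = 256
content_encoder = nn.Linear(70, DIM)  # 70 = assumed phoneme-posterior size
tone_encoder = nn.Linear(1, DIM)      # 1 = assumed basic-tone feature size

def build_code_sequence(content_feats, tone_feats):
    """content_feats: (frames, 70); tone_feats: (frames, 1)."""
    first_source = content_encoder(content_feats)  # content block, first position
    second_source = tone_encoder(tone_feats)       # tone block, second position
    # Position information is kept by the ordering of the two blocks.
    return torch.cat([first_source, second_source], dim=0)
```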
In an embodiment, the second splicing module 808 is further configured to encode the target voice tone through the coding network in the voice tone conversion model to obtain a second target coding vector corresponding to the target voice tone, and to replace the second source coding vector in the coding vector sequence with the second target coding vector to obtain the second splicing information corresponding to the role voice.
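The substitution performed by the second splicing module 808 then reduces, under the same assumptions as the sketch above, to swapping the tone block of the sequence:

```python
import torch

def splice_target_timbre(code_sequence, target_tone_code, n_content_rows):
    # Keep the first source coding vector block (content) and substitute the
    # target tone's coding vector block for the second source coding vector.
    return torch.cat([code_sequence[:n_content_rows], target_tone_code], dim=0)

# Usage sketch:
# seq = build_code_sequence(content_feats, tone_feats)
# second_splice = splice_target_timbre(seq, tone_encoder(target_tone_feats),
#                                      content_feats.shape[0])
```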
In one embodiment, in the voice tone conversion apparatus 800, the number of role voices of the target role is less than or equal to a preset number threshold.
All or part of the modules in the above voice tone conversion apparatus can be implemented by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided; the computer device may be a terminal, and its internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal, where the wireless communication can be realized through WiFi, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a voice tone conversion method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase-change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and although their descriptions are specific and detailed, they should not therefore be construed as limiting the scope of the present application. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (11)

1. A method for voice timbre conversion, the method comprising:
acquiring source speech of an original role and role speech of a target role;
determining source speech content and source speech timbre corresponding to the source speech, and determining target speech timbre corresponding to the role speech;
filtering the voice characteristic information in the source speech timbre to obtain a source basic timbre;
performing first splicing processing on the source speech content and the source basic timbre to obtain first splicing information, and encoding the first splicing information to obtain a corresponding coding vector sequence;
and performing second splicing processing on the coding vector sequence and the target speech timbre to obtain second splicing information, and decoding the second splicing information to convert the source speech output by the original role into target speech output by the target role.
2. The method of claim 1, wherein determining the source speech content and the source speech timbre corresponding to the source speech comprises:
performing feature extraction processing on the source speech to obtain an acoustic feature and the source speech timbre;
determining a phoneme category included in the acoustic features;
determining a phoneme posterior probability corresponding to each phoneme;
and obtaining the source speech content corresponding to the source speech according to the phoneme category and the phoneme posterior probability.
3. The method of claim 1, wherein filtering the speech characteristic information in the source audio timbre to obtain a source base timbre comprises:
determining voice feature points in the acoustic feature of the source speech, and determining voice characteristic information corresponding to the voice feature points in the source speech timbre, wherein the voice feature points represent accent features of the source speech;
and performing normalization processing on the source speech timbre according to the voice characteristic information to obtain the source basic timbre corresponding to the source speech.
4. The method of claim 1, wherein the voice timbre conversion method is performed by a voice timbre conversion model, the voice timbre conversion model comprises a coding network, and the encoding the first splicing information to obtain a corresponding coding vector sequence comprises:
encoding the source speech content in the first splicing information through the coding network in the voice timbre conversion model to obtain a first source coding vector;
coding the source basic tone in the first splicing information through a coding network in the voice tone conversion model to obtain a second source coding vector;
and obtaining a coding vector sequence corresponding to the source speech according to the first source coding vector, the second source coding vector and the position information of the source speech content and the source basic tone in the first splicing information.
5. The method according to claim 4, wherein the performing second splicing processing on the coding vector sequence and the target speech timbre to obtain second splicing information comprises:
encoding the target speech timbre through the coding network in the voice timbre conversion model to obtain a second target coding vector corresponding to the target speech timbre;
and replacing the second source coding vector in the coding vector sequence with the second target coding vector corresponding to the target speech timbre to obtain the second splicing information corresponding to the role speech.
6. The method of claim 1, wherein the number of role speeches of the target role is less than or equal to a preset number threshold.
7. The method of claim 1, wherein the voice timbre conversion method is performed by a voice timbre conversion model, and the training step of the voice timbre conversion model comprises:
acquiring a first sample voice set and a second sample voice set; the first sample voice set comprises a plurality of first sample voices, and the second sample voice set comprises at least one second sample voice corresponding to each sample role and a sample label corresponding to each second sample voice;
determining a phoneme posterior probability corresponding to each phoneme in the first sample voice through a content extraction structure in the voice timbre conversion model, and training the content extraction structure through the phoneme posterior probabilities to obtain a trained content extraction structure;
determining a sample voice timbre of the second sample voice;
determining the phoneme posterior probability of each phoneme in the second sample voice through the trained content extraction structure;
coding and decoding the phoneme posterior probabilities of the second sample voice and the sample voice timbre of the second sample voice through a timbre extraction structure in the voice timbre conversion model to obtain a predicted voice timbre;
training the timbre extraction structure through the predicted voice timbre and the sample label to obtain a trained timbre extraction structure; and
synthesizing the trained content extraction structure and the trained timbre extraction structure to obtain the voice timbre conversion model.
8. A voice tone conversion apparatus, characterized in that the apparatus comprises:
the voice acquisition module is used for acquiring source voice of an original role and role voice of a target role;
the voice processing module is used for determining source voice content and source voice tone corresponding to the source voice and determining target voice tone corresponding to the role voice; filtering the voice characteristic information in the source voice timbre to obtain a source basic timbre;
the first splicing module is used for carrying out first splicing processing on the source voice content and the source basic timbre to obtain first splicing information, and coding the first splicing information to obtain a corresponding coding vector sequence;
and the second splicing module is used for carrying out second splicing processing on the coding vector sequence and the target voice timbre to obtain second splicing information, and decoding the second splicing information to convert the source voice output by the original role into the target voice output by the target role.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
11. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by a processor.
CN202111544655.1A 2021-12-16 2021-12-16 Voice tone conversion method and device, computer equipment and storage medium Pending CN114242093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111544655.1A CN114242093A (en) 2021-12-16 2021-12-16 Voice tone conversion method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111544655.1A CN114242093A (en) 2021-12-16 2021-12-16 Voice tone conversion method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114242093A true CN114242093A (en) 2022-03-25

Family

ID=80757470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111544655.1A Pending CN114242093A (en) 2021-12-16 2021-12-16 Voice tone conversion method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114242093A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410976A1 (en) * 2018-02-16 2020-12-31 Dolby Laboratories Licensing Corporation Speech style transfer
US20200058288A1 (en) * 2018-08-16 2020-02-20 National Taiwan University Of Science And Technology Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium
CN110223705A (en) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 Phonetics transfer method, device, equipment and readable storage medium storing program for executing
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method
CN113744713A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Speech synthesis method and training method of speech synthesis model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG Xiaojian: "Multimedia Technology" (《多媒体技术》), vol. 978, 28 February 2010, Beijing: Beijing University of Posts and Telecommunications Press, pages 230-232 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116458A (en) * 2022-06-10 2022-09-27 腾讯科技(深圳)有限公司 Voice data conversion method and device, computer equipment and storage medium
CN115116458B (en) * 2022-06-10 2024-03-08 腾讯科技(深圳)有限公司 Voice data conversion method, device, computer equipment and storage medium
CN115602182A (en) * 2022-12-13 2023-01-13 广州感音科技有限公司(Cn) Sound conversion method, system, computer device and storage medium

Similar Documents

Publication Publication Date Title
CN111489734B (en) Model training method and device based on multiple speakers
WO2021218324A1 (en) Song synthesis method, device, readable medium, and electronic apparatus
CN111292720A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112786007B (en) Speech synthesis method and device, readable medium and electronic equipment
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN110570876A (en) Singing voice synthesis method and device, computer equipment and storage medium
WO2022252904A1 (en) Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
CN112786008A (en) Speech synthesis method, device, readable medium and electronic equipment
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN114255740A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
JP2022133392A (en) Speech synthesis method and device, electronic apparatus, and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN114566140A (en) Speech synthesis model training method, speech synthesis method, equipment and product
CN114464163A (en) Method, device, equipment, storage medium and product for training speech synthesis model
CN114783409A (en) Training method of speech synthesis model, speech synthesis method and device
CN114495896A (en) Voice playing method and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination