CN112802462B - Training method of sound conversion model, electronic equipment and storage medium - Google Patents
Training method of sound conversion model, electronic equipment and storage medium
- Publication number
- CN112802462B CN112802462B CN202011627564.XA CN202011627564A CN112802462B CN 112802462 B CN112802462 B CN 112802462B CN 202011627564 A CN202011627564 A CN 202011627564A CN 112802462 B CN112802462 B CN 112802462B
- Authority
- CN
- China
- Prior art keywords
- feature
- acoustic
- conversion model
- voice data
- tone
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Electrophonic Musical Instruments (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The application discloses a training method of a sound conversion model, an electronic device and a storage medium. The method comprises the following steps: acquiring first training voice data from a first voice data set, wherein the first voice data set comprises a plurality of pieces of voice data of a target speaker, and the first training voice data corresponds to a first acoustic feature; acquiring a posterior probability feature corresponding to the first acoustic feature; inputting the posterior probability feature corresponding to the first acoustic feature and a first auxiliary tone feature into the sound conversion model to obtain a first parallel feature; acquiring a posterior probability feature corresponding to the first parallel feature; inputting the posterior probability feature corresponding to the first parallel feature and a target tone feature into the sound conversion model to obtain a second acoustic feature; and adjusting parameters of the sound conversion model based on the difference between the second acoustic feature and the first acoustic feature. In this way, the conversion effect of the sound conversion model can be improved.
Description
Technical Field
The present application relates to the field of speech synthesis, and in particular, to a training method for a sound conversion model, an electronic device, and a storage medium.
Background
Voice conversion refers to converting the voice data of a source speaker with a voice conversion model so that it takes on the tone (timbre) of a target speaker while its semantic content remains unchanged. The technology has broad application prospects and practical value. For example, it can be used to enrich the effects of synthesized speech: by combining it with a speech synthesis system, voice data with different tones can be generated conveniently and quickly. Compared with a traditional speech synthesis system, which requires several hours of recorded corpus, a voice conversion system generally needs only tens of training sentences, which can greatly reduce the construction cost of the synthesis system. In addition, speaker voice conversion can also be used for character dubbing in videos or games in the entertainment field, identity hiding in the security field, assisted vocalization in the medical field, and so on.
However, the conversion effect of the sound conversion model obtained by the existing training method is poor.
Disclosure of Invention
The application provides a training method of a sound conversion model, electronic equipment and a storage medium, which can solve the problem that the conversion effect of the sound conversion model obtained by the existing training method is poor.
In order to solve the technical problems, the application adopts a technical scheme that: a training method of a sound conversion model is provided. The method comprises the following steps: acquiring first training voice data from a first voice data set, wherein the first voice data set comprises a plurality of pieces of voice data of a target speaker, and the first training voice data corresponds to a first acoustic feature; acquiring posterior probability characteristics corresponding to the first acoustic characteristics; inputting the posterior probability characteristic corresponding to the first acoustic characteristic and the first auxiliary tone characteristic into a sound conversion model to obtain a first parallel characteristic, wherein the first auxiliary tone characteristic does not belong to a target speaker; acquiring posterior probability characteristics corresponding to the first parallel characteristics; inputting posterior probability characteristics corresponding to the first parallel characteristics and target tone characteristics into a sound conversion model to obtain second acoustic characteristics, wherein the target tone characteristics belong to a target speaker; parameters of the acoustic conversion model are adjusted based on differences between the second acoustic feature and the first acoustic feature.
In order to solve the technical problems, the application adopts another technical scheme that: providing an electronic device comprising a processor, a memory connected to the processor, wherein the memory stores program instructions; the processor is configured to execute the program instructions stored in the memory to implement the method described above.
In order to solve the technical problems, the application adopts another technical scheme that: there is provided a storage medium storing program instructions which, when executed, enable the above-described method to be implemented.
In this way, the posterior probability feature corresponding to the first acoustic feature and the first auxiliary tone feature are converted by the sound conversion model to obtain the first parallel feature, and the posterior probability feature corresponding to the first parallel feature and the target tone feature are then converted/reconstructed by the sound conversion model to obtain the second acoustic feature. Because the posterior probability feature corresponding to the first parallel feature and the target tone feature are not extracted from the same acoustic feature, the sound conversion model is forced to ignore the speaker information carried by the posterior probability feature of the first parallel feature when converting it into the second acoustic feature, which prevents that speaker information from interfering with the conversion. As a result, in the application stage, the trained sound conversion model likewise ignores the source speaker information carried by the posterior probability feature corresponding to the source voice data, so the conversion effect is improved.
Drawings
FIG. 1 is a flowchart of a training method of a voice conversion model according to an embodiment of the present application;
FIG. 2 is a flowchart of a training method of the voice conversion model according to a second embodiment of the present application;
FIG. 3 is a flowchart of a training method of the voice conversion model according to a third embodiment of the present application;
FIG. 4 is a flowchart of a training method of the voice conversion model according to the fourth embodiment of the present application;
FIG. 5 is a flowchart of an embodiment of a sound conversion method according to the present application;
FIG. 6 is a schematic diagram of an embodiment of an electronic device of the present application;
FIG. 7 is a schematic diagram of an embodiment of a storage medium of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," and the like in this disclosure are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", and "a third" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
The prior training method for the sound conversion model is described:
The posterior probability feature and the tone feature corresponding to the acoustic feature of the target voice data are input into a voice conversion model, so that the voice conversion model converts/reconstructs them into a reconstructed acoustic feature. Parameters of the voice conversion model are then adjusted according to the difference between the reconstructed acoustic feature and the acoustic feature of the target voice data.
It will be appreciated that, ideally, the posterior probability feature characterizes only text information. In practice, however, it may also carry information about the target speaker, and this speaker information can be exploited by the voice conversion model.
In the training stage, the speaker information carried by the posterior probability feature of the target voice data can have a positive influence, that is, the target speaker information helps the reconstruction of the target voice data. In the subsequent application stage, however, the source speaker information carried by the posterior probability feature of the source voice data can have a negative influence, that is, it interferes with converting the tone information of the source voice data into the tone information of the target speaker, so the conversion effect is poor.
In order to improve the conversion effect, the training method for the sound conversion model provided by the application comprises the following steps:
fig. 1 is a flowchart of a training method of a voice conversion model according to an embodiment of the present application. It should be noted that, if there are substantially the same results, the present embodiment is not limited to the flow sequence shown in fig. 1. As shown in fig. 1, the present embodiment may include:
s11: first training speech data is obtained from a first speech data set.
Wherein the first speech data set includes a plurality of pieces of speech data of the target speaker, from which a portion can be randomly selected as first training speech data; the first training speech data corresponds to a first acoustic feature.
It can be understood that the present embodiment uses the voice data of the target speaker as training data to train the voice conversion model, so that the trained voice conversion model can convert the tone information of other speakers in the source voice data into the tone information of the target speaker.
The acoustic features to which the present application relates may be mel-cepstral features (MCEP), mel-spectrogram features, fundamental frequency (F0) and aperiodic harmonic component features, and the like. An acoustic feature may include text information and speaker information, where the speaker information may include tone (timbre) information, prosodic information, style information, and so on. The tone feature (speaker code) mentioned in this embodiment may take the form of a speaker embedding.
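As a concrete illustration (not prescribed by the application), the following sketch extracts an 80-band log-mel spectrogram as the acoustic feature; the sample rate, frame hop and number of mel bands are assumptions chosen for the example:

```python
import librosa
import numpy as np

def extract_acoustic_features(wav_path, sr=16000, n_mels=80):
    """Extract a log-mel spectrogram (num_frames x n_mels) as the acoustic feature."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 80-band mel spectrogram; a hop of 256 samples is 16 ms per frame at 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.T  # shape: (num_frames, n_mels)
```

Any of the other feature types listed above (MCEP, F0, aperiodicity) could be used instead, as long as the same feature type is used consistently for training and conversion.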
S12: and acquiring posterior probability characteristics corresponding to the first acoustic characteristics.
The posterior probability features may also be referred to as contextual posterior probability features. The posterior probability feature may consist of probabilities that the first acoustic feature belongs to the respective phoneme category, and thus it may reflect a probability distribution that the first acoustic feature belongs to the respective phoneme category.
It will be appreciated that the posterior probability features ideally characterize only textual information. In practice, however, the posterior probability features may characterize speaker information (timbre information, prosodic information, style information) in addition to text information. That is, posterior probability features may additionally characterize speaker information.
The first training speech data/first acoustic feature may be input into a speech data recognition model to obtain the corresponding posterior probability feature. The hidden layers of the speech data recognition model can be one of, or a combination of, a fully connected network, a recurrent neural network and a convolutional network.
In addition, in order to improve its recognition effect, the speech data recognition model can be trained before it is applied in the present application. In one embodiment, the speech data may be segmented into multiple frames, each frame carrying a phoneme label that represents the true phoneme class to which it belongs, and the speech data recognition model is trained using the speech data with these phoneme labels. In another embodiment, the speech data recognition model may be trained using CTC (connectionist temporal classification) and attention.
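As a hedged illustration of one possible form of such a recognition model (the network type, feature dimension, hidden size and phoneme count below are assumptions, not values given by the application), the following PyTorch sketch builds a frame-level phoneme classifier whose softmax outputs serve as the posterior probability features, trained with per-frame phoneme labels as described above:

```python
import torch
import torch.nn as nn

class FramePhonemeRecognizer(nn.Module):
    """Frame-level phoneme classifier; its softmax output over phoneme classes
    is used as the posterior probability feature of each acoustic-feature frame."""
    def __init__(self, feat_dim=80, hidden=256, num_phonemes=72):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_phonemes)

    def forward(self, acoustic_feats):           # (batch, frames, feat_dim)
        h, _ = self.rnn(acoustic_feats)
        logits = self.proj(h)                    # (batch, frames, num_phonemes)
        return torch.softmax(logits, dim=-1)     # per-frame phoneme posteriors

# Training sketch with frame-level phoneme labels (dummy data shown).
model = FramePhonemeRecognizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
feats = torch.randn(4, 200, 80)                  # batch of 4 utterances, 200 frames each
labels = torch.randint(0, 72, (4, 200))          # frame-level phoneme labels
posteriors = model(feats)
loss = nn.NLLLoss()(torch.log(posteriors + 1e-8).transpose(1, 2), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```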
S13: and inputting the posterior probability characteristic corresponding to the first acoustic characteristic and the first auxiliary tone characteristic into a sound conversion model to obtain a first parallel characteristic.
Wherein the first auxiliary timbre feature does not belong to the targeted speaker.
The tone color features mentioned in the present application are used to characterize tone color information included in the corresponding acoustic features. The first auxiliary timbre feature does not belong to the target speaker, and it is understood that the first auxiliary timbre feature is not extracted from the voice data of the target speaker. Alternatively, the first auxiliary tone characteristic may be extracted from voice data of other speakers included in the second voice data set, which may characterize tone information of the other speakers. For a description of the second speech data set, reference is made to the following embodiments. In this case, the voice data in the second voice data set may be referred to as first auxiliary voice data.
Wherein different first auxiliary timbre features may be acquired for conversion to obtain a plurality of different first parallel features of the first acoustic features. For simplicity of description, the application will be described hereinafter with reference to only one first parallel feature transformed to a first acoustic feature.
The first auxiliary tone feature may be constructed using tone features corresponding to at least one piece of first auxiliary voice data prior to the step being performed.
In a specific embodiment, the tone features corresponding to pieces of first auxiliary voice data belonging to the same speaker in the second voice data set may be weighted to obtain the first auxiliary tone feature. The weight of each piece of first auxiliary voice data can be the same or different, and can be determined according to parameters such as the quality and acquisition time of that piece of first auxiliary voice data. In the case where the weight of each piece of first auxiliary voice data is the same, the process of constructing the first auxiliary tone feature may be referred to as interpolation processing. A weighted average is essentially a linear interpolation, and with equal weights the interpolation amounts to averaging the tone features corresponding to the at least one piece of first auxiliary voice data.
In another embodiment, the tone color features corresponding to the first auxiliary voice data belonging to different speakers in the second voice data set may be interpolated to generate new tone color features as the first auxiliary tone color features. The first auxiliary tone characteristic obtained through interpolation is a tone characteristic which does not exist in the second voice data set, namely new tone information can be constructed through interpolation. This approach can therefore be used as a data enhancement means and applied to augment existing speech data sets, increasing the amount of speech data available for training the voice conversion model.
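A minimal sketch of the weighting/interpolation described above, assuming the tone features are fixed-dimensional speaker embeddings (the 256-dimensional size is an assumption for the example):

```python
import numpy as np

def build_auxiliary_timbre(timbre_feats, weights=None):
    """Interpolate several tone features (speaker embeddings) into one auxiliary tone feature.

    timbre_feats: (N, D) array, one tone feature per auxiliary utterance
    weights:      optional (N,) array; equal weights reduce to plain averaging
    """
    timbre_feats = np.asarray(timbre_feats, dtype=np.float32)
    if weights is None:
        weights = np.full(len(timbre_feats), 1.0 / len(timbre_feats), dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    weights = weights / weights.sum()            # normalize so the result stays on scale
    return (weights[:, None] * timbre_feats).sum(axis=0)

# Cross-speaker interpolation as data enhancement: a tone feature that exists in neither speaker.
e_a = np.random.randn(256).astype(np.float32)    # hypothetical embeddings of two different speakers
e_b = np.random.randn(256).astype(np.float32)
e_new = build_auxiliary_timbre([e_a, e_b], weights=[0.3, 0.7])
```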
S14: and acquiring posterior probability characteristics corresponding to the first parallel characteristics.
And carrying out voice data recognition on the first parallel features to obtain corresponding posterior probability features.
S15: and inputting the posterior probability characteristic and the target tone characteristic corresponding to the first parallel characteristic into a sound conversion model to obtain a second acoustic characteristic.
Wherein the target tone characteristic belongs to the target speaker. In other words, the target tone color feature is extracted from the voice data of the target speaker.
Before this step is performed, weighting processing may be performed on the tone features corresponding to at least one piece of voice data in the first voice data set, where the weight of each piece of voice data may be the same or different; the case of equal weights is taken as an example below. The tone feature corresponding to at least one piece of voice data can be acquired from the first voice data set, and interpolation processing is performed on the acquired tone features to obtain the target tone feature.
S16: parameters of the acoustic conversion model are adjusted based on differences between the second acoustic feature and the first acoustic feature.
The loss of the acoustic conversion model may be calculated based on a difference (e.g., a mean square error) between the second acoustic feature and the first acoustic feature, and parameters of the acoustic conversion model are adjusted based on the loss until a preset condition is satisfied. The preset condition may be the number of training times reaching a threshold, loss function convergence, etc.
Through implementation of this embodiment, the posterior probability feature corresponding to the first acoustic feature and the first auxiliary tone feature are converted by the sound conversion model to obtain the first parallel feature of the first acoustic feature, and the posterior probability feature corresponding to the first parallel feature and the target tone feature are then converted/reconstructed by the sound conversion model to obtain the second acoustic feature. Because the posterior probability feature corresponding to the first parallel feature and the target tone feature are not extracted from the same acoustic feature, the sound conversion model is forced to ignore the speaker information carried by the posterior probability feature of the first parallel feature when converting it into the second acoustic feature, which prevents that speaker information from interfering with the conversion. As a result, in the application stage, the trained sound conversion model likewise ignores the source speaker information carried by the posterior probability feature corresponding to the source voice data, so the conversion effect is improved.
The sound conversion models (parameters) in S13 and S15 may be the same or different.
The following describes the implementation procedure of the first embodiment in detail, taking the kth voice data of the target speaker as an example, where the parameters are the same.
If the parameters are the same, the sound conversion model is denoted Model. Let the first acoustic feature corresponding to the kth voice data be $X_k^{trg}$. Speech data recognition (ASR) is performed on $X_k^{trg}$ to obtain the corresponding posterior probability feature: $P_k^{trg} = \mathrm{ASR}(X_k^{trg})$.
$P_k^{trg}$ and the first auxiliary tone feature $e_j$ are input into the sound conversion model to obtain the first parallel feature: $\hat{X}_{k,j} = \mathrm{Model}(P_k^{trg}, e_j)$.
Voice data recognition is performed on the first parallel feature $\hat{X}_{k,j}$ to obtain the corresponding posterior probability feature: $\hat{P}_{k,j} = \mathrm{ASR}(\hat{X}_{k,j})$.
$\hat{P}_{k,j}$ and the target tone feature $e_{trg}$ are input into the sound conversion model to obtain the second acoustic feature: $\tilde{X}_k^{trg} = \mathrm{Model}(\hat{P}_{k,j}, e_{trg})$.
The parameters of Model are adjusted based on the difference between $\tilde{X}_k^{trg}$ and $X_k^{trg}$.
In the case where the acoustic conversion models in S13 and S15 are the same, the first embodiment is essentially to construct the first parallel feature of the first acoustic feature by using the acoustic conversion model, and then convert the first parallel feature into the second acoustic feature by using the acoustic conversion model. In this case, the acoustic conversion model learns the following construction and conversion processes at the same time.
The construction process can be expressed as: $\hat{X}_{k,j} = \mathrm{Model}(\mathrm{ASR}(X_k^{trg}), e_j)$.
The conversion process can be expressed as: $\tilde{X}_k^{trg} = \mathrm{Model}(\mathrm{ASR}(\hat{X}_{k,j}), e_{trg})$.
In order to enhance training stability, the construction process may be learned for a certain number of iterations in the course of learning the conversion process; for details, refer to the description of the initial training process.
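Putting the above steps together, a hedged sketch of one embodiment-one training step is given below. The `(posterior feature, tone feature) -> acoustic feature` interface of the conversion model is an assumption, and letting gradients flow back through the recognizer's outputs (while keeping its parameters frozen) is one possible way for the single model to learn the construction and conversion processes at the same time; the application does not spell out this detail.

```python
import torch
import torch.nn.functional as F

def training_step(conversion_model, asr_model, x_target, e_aux, e_target, optimizer):
    """One embodiment-one step: build a parallel feature with an auxiliary tone
    feature, re-recognize it, then reconstruct toward the target tone feature."""
    for p in asr_model.parameters():
        p.requires_grad_(False)                   # recognizer stays fixed, but gradients may pass through it
    ppg = asr_model(x_target)                     # posterior feature of the first acoustic feature
    x_parallel = conversion_model(ppg, e_aux)     # first parallel feature (same text, auxiliary tone)
    ppg_parallel = asr_model(x_parallel)          # posterior feature of the first parallel feature
    x_rec = conversion_model(ppg_parallel, e_target)   # second acoustic feature
    loss = F.mse_loss(x_rec, x_target)            # difference between second and first acoustic features
    optimizer.zero_grad()
    loss.backward()                               # both the construction and conversion branches receive gradients
    optimizer.step()
    return loss.item()
```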
Fig. 2 is a flowchart of a training method of the acoustic conversion model according to a second embodiment of the present application. It should be noted that, if there are substantially the same results, the embodiment is not limited to the flow sequence shown in fig. 2. This embodiment is a further extension of the first embodiment, and the parameters of the sound conversion model in S13 and S15 are not identical. In other words, the first sound conversion model and the second sound conversion model in the present embodiment are two sound conversion models having different parameters. As shown in fig. 2, the present embodiment may include:
s21: first training speech data is obtained from a first speech data set.
S22: and acquiring posterior probability characteristics corresponding to the first acoustic characteristics.
S23: and inputting the posterior probability characteristic corresponding to the first acoustic characteristic and the first auxiliary tone characteristic into a first sound conversion model to obtain a first parallel characteristic.
S24: and acquiring posterior probability characteristics corresponding to the first parallel characteristics.
S25: and inputting the posterior probability characteristics and the target tone characteristics corresponding to the first parallel characteristics into a second sound conversion model to obtain second sound characteristics.
S26: parameters of the second acoustic conversion model are adjusted based on differences between the second acoustic feature and the first acoustic feature.
The detailed description of the steps in this embodiment is referred to the description of the first embodiment, and will not be repeated here.
The process of this embodiment may be understood as a training of the second sound conversion model. And, the second sound conversion model is put into practical use later.
The implementation of this embodiment (in the case of different parameters) will be described in detail below taking the kth voice data of the target speaker as an example.
When the parameters are different, the first sound conversion model is denoted Model_X and the second sound conversion model is denoted Model_Y. Speech data recognition (ASR) is performed on the first acoustic feature $X_k^{trg}$ to obtain the corresponding posterior probability feature: $P_k^{trg} = \mathrm{ASR}(X_k^{trg})$.
$P_k^{trg}$ and the first auxiliary tone feature $e_j$ are input into the first sound conversion model to obtain the first parallel feature: $\hat{X}_{k,j} = \mathrm{Model\_X}(P_k^{trg}, e_j)$.
Voice data recognition is performed on the first parallel feature $\hat{X}_{k,j}$ to obtain the corresponding posterior probability feature: $\hat{P}_{k,j} = \mathrm{ASR}(\hat{X}_{k,j})$.
$\hat{P}_{k,j}$ and the target tone feature $e_{trg}$ are input into the second sound conversion model to obtain the second acoustic feature: $\tilde{X}_k^{trg} = \mathrm{Model\_Y}(\hat{P}_{k,j}, e_{trg})$.
The parameters of Model_Y are adjusted based on the difference between $\tilde{X}_k^{trg}$ and $X_k^{trg}$.
In other embodiments, to enhance training stability, model_X may also be initially trained prior to training model_Y using the method of the present embodiment. For initial training, see description of the examples below.
It can be appreciated that the essence of this embodiment is to use the first sound conversion model to construct a first parallel feature that has the same text information as the first acoustic feature but different tone information, and then use the first parallel feature to train the second sound conversion model. The process of constructing the first parallel feature may be regarded as a data enhancement process.
Through implementation of the embodiment, first parallel features of the first acoustic features are constructed by the first acoustic conversion model, the posterior probability features and the target tone features corresponding to the first parallel features are converted by the second acoustic conversion model to obtain second acoustic features, and the conversion effect of the second acoustic conversion model is measured through differences between the converted second acoustic features and the first acoustic features. Therefore, the auxiliary training of the second sound conversion model by using the first sound conversion model can be realized, and only the parameters of the second sound conversion model are required to be adjusted in the training stage, so that the training process of the second sound conversion model is simplified.
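For comparison, a sketch of the two-model variant follows; the model interfaces are again assumptions, and the point illustrated is that only Model_Y's parameters receive gradient updates:

```python
import torch
import torch.nn.functional as F

def training_step_two_models(model_x, model_y, asr_model,
                             x_target, e_aux, e_target, optimizer_y):
    """Embodiment-two step: Model_X only constructs the parallel feature;
    only Model_Y is trained."""
    with torch.no_grad():                          # Model_X and the recognizer act as fixed feature builders
        ppg = asr_model(x_target)
        x_parallel = model_x(ppg, e_aux)           # construction / data-enhancement step
        ppg_parallel = asr_model(x_parallel)
    x_rec = model_y(ppg_parallel, e_target)        # conversion toward the target tone feature
    loss = F.mse_loss(x_rec, x_target)
    optimizer_y.zero_grad()
    loss.backward()                                # gradients reach Model_Y only
    optimizer_y.step()
    return loss.item()
```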
Since the number of target voice data in the first voice data set is relatively small, the training effect of the voice conversion model may be affected. The sound conversion model may also be pre-trained with a second speech data set comprising speech data of a plurality of other speakers before training the sound conversion model with the first speech data set, to further enhance the effect of the sound conversion model. The sound conversion model architecture of the pre-training process is the same as the model architecture of the training process. The method can be concretely as follows:
Fig. 3 is a flowchart of a training method of the acoustic conversion model according to a third embodiment of the present application. It should be noted that, if there are substantially the same results, the embodiment is not limited to the flow sequence shown in fig. 3. As shown in fig. 3, the present embodiment may include:
s31: second training speech data is obtained from the second speech data set.
Wherein the second set of speech data includes speech data of a plurality of other speakers, the second training speech data corresponding to a third acoustic feature.
The other speaker may be a person other than the targeted speaker.
S32: and acquiring posterior probability characteristics corresponding to the third acoustic characteristics.
S33: and inputting the posterior probability characteristic corresponding to the third acoustic characteristic and the second auxiliary tone characteristic into a sound conversion model to obtain a second parallel characteristic.
Wherein the speaker to which the second auxiliary tone characteristic belongs is different from the speaker to which the third acoustic characteristic belongs.
In this embodiment, the tone features (speaker codes) may take the form of one-hot vectors, speaker embeddings, and so on. A one-hot speaker code may be a 1×S vector, where S is the number of speakers; for the s-th speaker, the s-th dimension of its speaker code vector is 1 and all other dimensions are 0.
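A minimal sketch of the one-hot speaker code described above:

```python
import numpy as np

def one_hot_speaker_code(speaker_index, num_speakers):
    """1 x S one-hot speaker code: the given speaker's dimension is 1, all others are 0."""
    code = np.zeros((1, num_speakers), dtype=np.float32)
    code[0, speaker_index] = 1.0
    return code

e_3 = one_hot_speaker_code(speaker_index=2, num_speakers=10)  # code for the 3rd of 10 speakers
```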
Before this step is executed, at least one piece of second auxiliary voice data and the corresponding tone features can be acquired from the second voice data set, and the second auxiliary tone feature can be constructed from the tone features corresponding to the at least one piece of second auxiliary voice data. In this case, the voice data in the second voice data set other than that of the speaker to which the second training voice data/third acoustic feature belongs may be referred to as second auxiliary voice data.
S34: and acquiring posterior probability characteristics corresponding to the second parallel characteristics.
S35: and inputting the posterior probability characteristic corresponding to the second parallel characteristic and the tone characteristic of the speaker to which the third acoustic characteristic belongs into a sound conversion model to obtain a fourth acoustic characteristic.
Before this step is executed, at least one piece of voice data of the speaker to which the third acoustic feature belongs and the corresponding tone features can be acquired from the second voice data set, and interpolation processing is performed on these tone features to obtain the tone feature of the speaker to which the third acoustic feature belongs.
S36: parameters of the sound conversion model are adjusted based on differences between the fourth acoustic feature and the third acoustic feature.
It should be noted that the voice conversion model architecture of the pre-training stage and the training stage are correspondingly the same.
The second speech data set includes different speakers s1, s2, …, sn, etc., as exemplified by the kth speech data of speaker si. In the case of the same parameters:
Let the third acoustic feature corresponding to the kth voice data of speaker $s_i$ be $X_k^{s_i}$. Speech data recognition (ASR) is performed on $X_k^{s_i}$ to obtain the corresponding posterior probability feature: $P_k^{s_i} = \mathrm{ASR}(X_k^{s_i})$.
$P_k^{s_i}$ and the second auxiliary tone feature $e_j$ are input into the sound conversion model to obtain the second parallel feature: $\hat{X}_{k,j} = \mathrm{Model}(P_k^{s_i}, e_j)$.
Voice data recognition is performed on $\hat{X}_{k,j}$ to obtain the posterior probability feature corresponding to the second parallel feature: $\hat{P}_{k,j} = \mathrm{ASR}(\hat{X}_{k,j})$.
$\hat{P}_{k,j}$ and the tone feature $e_i$ of the speaker to which the third acoustic feature belongs are input into the sound conversion model to obtain the fourth acoustic feature: $\tilde{X}_k^{s_i} = \mathrm{Model}(\hat{P}_{k,j}, e_i)$.
The parameters of the sound conversion model are adjusted based on the difference between $\tilde{X}_k^{s_i}$ and $X_k^{s_i}$.
For the case where the parameters are not the same, please refer to the description of the previous embodiment two, and will not be repeated here.
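For illustration only, the following sketch shows one way pre-training samples could be drawn so that the auxiliary tone feature always comes from a speaker other than the one the third acoustic feature belongs to; the `features_by_speaker` dictionary (speaker id -> list of acoustic features) is a hypothetical structure, not something defined by the application:

```python
import random

def sample_pretraining_pair(features_by_speaker):
    """Pick one utterance of speaker s_i and an auxiliary speaker s_j different from s_i."""
    speakers = list(features_by_speaker.keys())
    s_i = random.choice(speakers)
    x_k = random.choice(features_by_speaker[s_i])              # third acoustic feature
    s_j = random.choice([s for s in speakers if s != s_i])     # auxiliary speaker differs from s_i
    return s_i, x_k, s_j
```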
Model_X may be initially trained prior to the training/pre-training described above. The method can be concretely as follows:
Fig. 4 is a flowchart of a training method of the acoustic conversion model according to the fourth embodiment of the present application. It should be noted that, if there are substantially the same results, the embodiment is not limited to the flow sequence shown in fig. 4. As shown in fig. 4, the present embodiment may include:
S41: third training speech data is obtained from the second speech data set.
Wherein the third training speech data corresponds to a fifth acoustic feature.
S42: and acquiring posterior probability characteristics and tone characteristics corresponding to the fifth acoustic characteristics.
S43: and inputting the posterior probability characteristic and the tone characteristic corresponding to the second acoustic characteristic into the first sound conversion model to obtain a sixth acoustic characteristic.
S44: parameters of the first acoustic conversion model are adjusted based on differences between the sixth acoustic feature and the fifth acoustic feature.
The detailed description of the steps in this embodiment is referred to in the previous embodiment and will not be repeated here.
Through implementation of the embodiment, the posterior probability feature and the tone feature corresponding to the fifth acoustic feature are obtained, the posterior probability feature and the tone feature corresponding to the fifth acoustic feature are reconstructed by using the first sound conversion model, the sixth acoustic feature (reconstructed fifth acoustic feature) is obtained, and the parameters of the first sound conversion model are adjusted based on the difference between the sixth acoustic feature and the fifth acoustic feature, so that the reconstruction capability of the first sound conversion model can be improved.
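A hedged sketch of one such initial-training (reconstruction) step, under the same assumed model interfaces as the earlier sketches:

```python
import torch
import torch.nn.functional as F

def initial_training_step(model_x, asr_model, x, e_x, optimizer_x):
    """Initial training of the first sound conversion model: reconstruct an acoustic
    feature from its own posterior probability feature and its own tone feature."""
    with torch.no_grad():
        ppg = asr_model(x)             # posterior feature of the fifth acoustic feature
    x_rec = model_x(ppg, e_x)          # sixth acoustic feature (reconstruction)
    loss = F.mse_loss(x_rec, x)
    optimizer_x.zero_grad()
    loss.backward()
    optimizer_x.step()
    return loss.item()
```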
Fig. 5 is a flow chart of an embodiment of the sound conversion method of the present application. It should be noted that, if there are substantially the same results, the embodiment is not limited to the flow sequence shown in fig. 5. As shown in fig. 5, the present embodiment may include:
S51: first source voice data and target voice data are acquired.
The first source speech data corresponds to a first source acoustic feature and the target speech data corresponds to a target acoustic feature.
The first source voice data may be voice data of a source speaker (voice data to be converted), and the target voice data may be voice data of a target speaker.
S52: and acquiring posterior probability characteristics corresponding to the first source acoustic characteristics and tone characteristics corresponding to the target acoustic characteristics.
S53: and inputting the posterior probability characteristic corresponding to the first source acoustic characteristic and the tone characteristic corresponding to the target acoustic characteristic into a sound conversion model to obtain a second source acoustic characteristic.
The sound conversion model may be trained by the method provided in the above embodiment.
The second source acoustic feature may carry the tone information of the target acoustic feature and the text information of the first source acoustic feature.
Furthermore, in order to facilitate subsequent use, in other embodiments, it may further include: the second source acoustic features are converted into second source speech data. For example, a vocoder may be utilized to convert the second source acoustic features into second source voice data. The second source voice data may be voice data having tone color information of the target voice data and text information of the source voice data.
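A hedged sketch of the overall conversion stage is given below. It assumes the acoustic feature is the log-mel spectrogram from the earlier extraction sketch and uses librosa's Griffin-Lim mel inversion merely as a stand-in for the vocoder mentioned above; a trained neural vocoder would normally be used instead, and the absolute signal gain is not recovered here.

```python
import torch
import librosa

def convert_voice(conversion_model, asr_model, source_feats, target_timbre, sr=16000):
    """Recognize the source acoustic feature, convert it with the target tone
    feature, then invert the (assumed) log-mel feature back to a waveform."""
    conversion_model.eval()
    asr_model.eval()
    with torch.no_grad():
        ppg = asr_model(source_feats)                        # posterior feature of the first source acoustic feature
        converted = conversion_model(ppg, target_timbre)     # second source acoustic feature
    # Griffin-Lim mel inversion as a simple vocoder stand-in.
    mel_power = librosa.db_to_power(converted.squeeze(0).cpu().numpy().T)
    wav = librosa.feature.inverse.mel_to_audio(mel_power, sr=sr,
                                               n_fft=1024, hop_length=256)
    return wav
```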
Through implementation of this embodiment, the application can convert the tone information corresponding to the source acoustic feature into the tone information of the target acoustic feature by using the sound conversion model. Because the sound conversion model is obtained with the above training method, it can ignore the source speaker information carried by the posterior probability feature corresponding to the source acoustic feature during conversion, so the conversion effect is improved.
Fig. 6 is a schematic structural diagram of an embodiment of the electronic device of the present application. As shown in fig. 6, the electronic device includes a processor 61, a memory 62 coupled to the processor 61.
Wherein the memory 62 stores program instructions for implementing the methods of any of the embodiments described above; the processor 61 is arranged to execute program instructions stored in the memory 62 for carrying out the steps of the method embodiments described above. The processor 61 may also be referred to as a CPU (Central Processing Unit ). The processor 61 may be an integrated circuit chip with signal processing capabilities. Processor 61 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. The general purpose processor may be a microprocessor or the processor 61 may be any conventional processor or the like.
FIG. 7 is a schematic diagram of an embodiment of a storage medium of the present application. As shown in fig. 7, a computer-readable storage medium 70 of an embodiment of the present application stores program instructions 71, which when executed, implement the method provided by the above-described embodiment of the present application. Wherein the program instructions 71 may form a program file stored in the above-mentioned computer readable storage medium 70 in the form of a software product for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to perform all or part of the steps of the methods of the various embodiments of the application. And the aforementioned computer-readable storage medium 70 includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, an optical disk, or other various media capable of storing program codes, or a terminal device such as a computer, a server, a mobile phone, a tablet, or the like.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units. The foregoing is only the embodiments of the present application, and therefore, the patent scope of the application is not limited thereto, and all equivalent structures or equivalent processes using the descriptions of the present application and the accompanying drawings, or direct or indirect application in other related technical fields, are included in the scope of the application.
Claims (14)
1. A method for training a sound conversion model, comprising:
Acquiring first training voice data from a first voice data set, wherein the first voice data set comprises a plurality of pieces of voice data of a target speaker, and the first training voice data corresponds to a first acoustic feature;
acquiring posterior probability characteristics corresponding to the first acoustic characteristics;
Inputting a posterior probability feature and a first auxiliary tone feature corresponding to the first acoustic feature into a sound conversion model to obtain a first parallel feature, wherein the first auxiliary tone feature does not belong to the target speaker, and the first parallel feature is an acoustic feature which is the same as the first acoustic feature text information and different in tone information;
acquiring posterior probability characteristics corresponding to the first parallel characteristics;
inputting posterior probability characteristics and target tone characteristics corresponding to the first parallel characteristics into the sound conversion model to obtain second acoustic characteristics, wherein the target tone characteristics belong to the target speaker;
Parameters of the sound conversion model are adjusted based on differences between the second acoustic feature and the first acoustic feature.
2. The method of claim 1, comprising, prior to said inputting the posterior probability feature and the first auxiliary timbre feature corresponding to the first acoustic feature into the acoustic conversion model:
acquiring tone characteristics corresponding to at least one piece of first auxiliary voice data;
and performing interpolation processing on tone characteristics corresponding to the at least one piece of first auxiliary voice data to obtain the first auxiliary tone characteristics.
3. The method of claim 1, comprising, prior to said inputting the posterior probability feature and the target timbre feature corresponding to the first parallel acoustic feature into the acoustic conversion model:
acquiring at least one tone characteristic corresponding to the voice data from the first voice data set;
And constructing the target tone characteristic by utilizing the tone characteristic corresponding to the at least one piece of voice data.
4. The method of claim 1, comprising, prior to said inputting the posterior probability feature and the first auxiliary timbre feature corresponding to the first acoustic feature into the acoustic conversion model, obtaining a first parallel feature:
and pre-training the sound conversion model.
5. The method of claim 4, wherein pre-training the sound conversion model comprises:
Acquiring second training voice data from a second voice data set, wherein the second voice data set comprises voice data of a plurality of other speakers, and the second training voice data corresponds to a third acoustic feature;
Acquiring posterior probability characteristics corresponding to the third acoustic characteristics;
Inputting a posterior probability feature corresponding to the third acoustic feature and a second auxiliary tone feature into a sound conversion model to obtain a second parallel feature, wherein the speaker to which the second auxiliary tone feature belongs is different from the speaker to which the third acoustic feature belongs;
acquiring posterior probability characteristics corresponding to the second parallel characteristics;
inputting posterior probability characteristics corresponding to the second parallel characteristics and tone characteristics of the speaker to which the third acoustic characteristics belong into a sound conversion model to obtain fourth acoustic characteristics;
parameters of the sound conversion model are adjusted based on a difference between the fourth acoustic feature and the third acoustic feature.
6. The method of claim 5, comprising, prior to said inputting the posterior probability feature and the second auxiliary timbre feature corresponding to the third acoustic feature into the acoustic conversion model, deriving a second parallel feature:
Acquiring at least one piece of second auxiliary voice data and corresponding tone characteristics from the second voice data set;
and constructing the second auxiliary tone color feature by utilizing the tone color feature corresponding to the at least one piece of second auxiliary voice data.
7. The method of claim 5, wherein before inputting the posterior probability feature corresponding to the second parallel feature and the timbre feature of the speaker to which the third acoustic feature belongs into a sound conversion model to obtain a fourth acoustic feature, comprising:
acquiring at least one piece of voice data of the speaker to which the third acoustic feature belongs and corresponding tone features from the second voice data set;
And carrying out interpolation processing on the tone characteristic corresponding to the at least one piece of voice data of the speaker to which the third acoustic characteristic belongs to obtain the tone characteristic of the speaker to which the third acoustic characteristic belongs.
8. The method of claim 1, wherein the sound conversion model includes a first sound conversion model and a second sound conversion model, parameters of the first sound conversion model and the second sound conversion model are different, and the inputting the posterior probability feature and the first auxiliary timbre feature corresponding to the first acoustic feature into the sound conversion model to obtain a first parallel feature includes:
inputting the posterior probability characteristic corresponding to the first acoustic characteristic and the first auxiliary tone characteristic into the first sound conversion model to obtain the first parallel characteristic;
inputting the posterior probability feature corresponding to the first parallel feature and the target tone feature into a sound conversion model to obtain a second acoustic feature, including:
And inputting the posterior probability characteristic corresponding to the first parallel characteristic and the target tone characteristic into the second sound conversion model to obtain the second acoustic characteristic.
9. The method of claim 8, wherein the adjusting parameters of the acoustic conversion model based on the difference between the second acoustic feature and the first acoustic feature comprises:
Parameters of the second acoustic conversion model are adjusted based on differences between the second acoustic feature and the first acoustic feature.
10. The method of claim 8, comprising, prior to said inputting the posterior probability feature and the first auxiliary timbre feature corresponding to the first acoustic feature into the first sound conversion model, deriving the first parallel feature:
and performing initial training on the first sound conversion model.
11. A sound conversion method, comprising:
acquiring first source voice data and target voice data, wherein the first source voice data corresponds to a first source acoustic feature, and the target voice data corresponds to a target acoustic feature;
Acquiring posterior probability characteristics corresponding to the first source acoustic characteristics and tone characteristics corresponding to the target acoustic characteristics;
inputting posterior probability characteristics corresponding to the first source acoustic characteristics and tone characteristics corresponding to the target acoustic characteristics into a sound conversion model to obtain second source acoustic characteristics;
Wherein the sound conversion model is trained by the method of any one of claims 1-7, or the sound conversion model is a second sound conversion model trained by the method of any one of claims 8-10.
12. The method of claim 11, wherein after the inputting the posterior probability feature corresponding to the first source acoustic feature and the timbre feature corresponding to the target acoustic feature into the acoustic conversion model, obtaining a second source acoustic feature comprises:
The second source acoustic features are converted into second source speech data.
13. An electronic device comprising a processor, a memory coupled to the processor, wherein,
The memory stores program instructions;
the processor is configured to execute the program instructions stored by the memory to implement the method of any one of claims 1-12.
14. A storage medium storing program instructions which, when executed, implement the method of any one of claims 1-12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011627564.XA CN112802462B (en) | 2020-12-31 | 2020-12-31 | Training method of sound conversion model, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011627564.XA CN112802462B (en) | 2020-12-31 | 2020-12-31 | Training method of sound conversion model, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112802462A CN112802462A (en) | 2021-05-14 |
CN112802462B true CN112802462B (en) | 2024-05-31 |
Family
ID=75807909
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011627564.XA Active CN112802462B (en) | 2020-12-31 | 2020-12-31 | Training method of sound conversion model, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112802462B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101665882B1 (en) * | 2015-08-20 | 2016-10-13 | 한국과학기술원 | Apparatus and method for speech synthesis using voice color conversion and speech dna codes |
CN110930981A (en) * | 2018-09-20 | 2020-03-27 | 深圳市声希科技有限公司 | Many-to-one voice conversion system |
CN112037766A (en) * | 2020-09-09 | 2020-12-04 | 广州华多网络科技有限公司 | Voice tone conversion method and related equipment |
CN112037754A (en) * | 2020-09-09 | 2020-12-04 | 广州华多网络科技有限公司 | Method for generating speech synthesis training data and related equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101359473A (en) * | 2007-07-30 | 2009-02-04 | 国际商业机器公司 | Auto speech conversion method and apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN112802462A (en) | 2021-05-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |