CN117935770A - Synthetic voice adjusting method, training method and related device

Synthetic voice adjusting method, training method and related device

Info

Publication number
CN117935770A
Authority
CN
China
Prior art keywords
attribute
voice
speaker
feature
sample
Prior art date
Legal status
Pending
Application number
CN202410029165.5A
Other languages
Chinese (zh)
Inventor
刘利娟
潘嘉
高建清
刘聪
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202410029165.5A priority Critical patent/CN117935770A/en
Publication of CN117935770A publication Critical patent/CN117935770A/en
Priority to CN202410882124.0A priority patent/CN118411979A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a synthesized voice adjustment method, a training method and a related device. The method comprises: acquiring an attribute adjustment text for an initial synthesized voice and acquiring original attribute features of a target speaker, wherein the attribute adjustment text characterizes the attribute difference by which the voice attributes of the initial synthesized voice are to be adjusted, and the initial synthesized voice is obtained by performing voice synthesis using the original attribute features and acoustic features; performing attribute prediction using the attribute adjustment text and the original attribute features to obtain new attribute features; and performing voice synthesis based on the new attribute features and the acoustic features to obtain the adjusted synthesized voice. With this scheme, voice attributes can be adjusted, meeting users' personalized requirements for synthesized voice.

Description

Synthetic voice adjusting method, training method and related device
Technical Field
The present application relates to the field of speech synthesis technology, and in particular, to a method for adjusting synthesized speech, a training method, and a related device.
Background
Speech synthesis is an intelligent speech technology that converts text into synthesized speech and is one of the key technologies for achieving human-computer interaction. With the development of artificial intelligence technology, speech synthesis is widely used in various fields, such as intelligent mobile terminals, smart home, and vehicle-mounted equipment.
At present, existing speech synthesis methods can only synthesize voice data with a single, fixed set of attributes. However, different users usually prefer different attributes for synthesized voice, so such methods can no longer meet users' needs.
Disclosure of Invention
The application mainly solves the technical problem of providing a synthesized voice adjustment method, a training method and a related device, which can adjust voice attributes and meet users' personalized requirements for synthesized voice.
In order to solve the above problem, a first aspect of the present application provides a method for adjusting synthesized voice, the method comprising: acquiring an attribute adjustment text for an initial synthesized voice and acquiring original attribute features of a target speaker, wherein the attribute adjustment text characterizes the attribute difference by which the voice attributes of the initial synthesized voice are to be adjusted, and the initial synthesized voice is obtained by performing voice synthesis using the original attribute features and acoustic features; performing attribute prediction using the attribute adjustment text and the original attribute features to obtain new attribute features; and performing voice synthesis based on the new attribute features and the acoustic features to obtain the adjusted synthesized voice.
In order to solve the above problem, a second aspect of the present application provides a training method for a speech synthesis system, the method comprising: acquiring attribute difference samples between the voice data samples of multiple pairs of speakers; obtaining an attribute feature sample of each speaker using the speech synthesis system; performing attribute prediction on the attribute difference samples and the speakers' attribute feature samples using the speech synthesis system to obtain predicted attribute features; and adjusting network parameters of the speech synthesis system based on the predicted attribute features to obtain the trained speech synthesis system.
In order to solve the above problems, a third aspect of the present application provides a synthesized voice adjustment apparatus, comprising an attribute adjustment unit, an attribute acquisition unit, an attribute prediction unit and a voice synthesis unit. The attribute adjustment unit is used for acquiring an attribute adjustment text for an initial synthesized voice; the attribute acquisition unit is used for acquiring original attribute features of a target speaker, wherein the attribute adjustment text characterizes the attribute difference by which the voice attributes of the initial synthesized voice are to be adjusted, and the initial synthesized voice is obtained by performing voice synthesis using the original attribute features and acoustic features; the attribute prediction unit is used for performing attribute prediction using the attribute adjustment text and the original attribute features to obtain new attribute features; and the voice synthesis unit is used for performing voice synthesis based on the new attribute features and the acoustic features to obtain the adjusted synthesized voice.
In order to solve the above problems, a fourth aspect of the present application provides a training apparatus for a speech synthesis system, the apparatus comprising a difference acquisition unit, an attribute acquisition unit, a voice prediction unit and a parameter adjustment unit. The difference acquisition unit is used for acquiring attribute difference samples between the voice data samples of multiple pairs of speakers; the attribute acquisition unit is used for acquiring an attribute feature sample of each speaker; the voice prediction unit is used for performing attribute prediction on the attribute difference samples and the speakers' attribute feature samples to obtain predicted attribute features; and the parameter adjustment unit is used for adjusting network parameters of the speech synthesis system based on the predicted attribute features to obtain the trained speech synthesis system.
In order to solve the above-mentioned problems, a fifth aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, the memory storing program data, and the processor executing the program data to implement any step of the above-mentioned method for adjusting synthesized speech and/or the training method of the speech synthesis system.
In order to solve the above-described problems, a sixth aspect of the present application provides a computer-readable storage medium storing program data executable by a processor for implementing any one of the steps of the above-described synthetic speech adjustment method and/or training method of a speech synthesis system.
According to the above scheme, the original attribute features of the target speaker are acquired, and voice synthesis is performed using the original attribute features and acoustic features, so that the initial synthesized voice is obtained. The attribute adjustment text of the initial synthesized voice is then acquired; since the attribute adjustment text characterizes the attribute difference by which the voice attributes of the initial synthesized voice are to be adjusted, it captures the difference in voice attributes between the synthesized voice the user wants and the initial synthesized voice. Attribute prediction is then performed using the attribute adjustment text and the original attribute features to obtain new attribute features, so that the new attribute features are closer to the attribute features corresponding to the attribute adjustment text, i.e., those required by the user. Finally, voice synthesis is performed based on the new attribute features and the acoustic features, yielding a synthesized voice that tends toward the attribute adjustment text, thereby achieving adjustment of the voice attributes of the synthesized voice toward the attribute adjustment text.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings required in the description of the embodiments will be briefly described below, it being obvious that the drawings described below are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart of an embodiment of a method for adjusting synthesized speech according to the present application;
FIG. 2 is a flow chart of an embodiment of the step S11 of the present application;
FIG. 3 is a schematic diagram of a speech synthesis model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an embodiment of an attribute adjustment model according to the present application;
FIG. 5 is a flow chart of a first embodiment of a training method of the speech synthesis system of the present application;
FIG. 6 is a flow chart of a second embodiment of a training method of the speech synthesis system of the present application;
FIG. 7 is a schematic diagram of an embodiment of a device for adjusting synthesized speech according to the present application;
FIG. 8 is a schematic diagram of an embodiment of a training device of the speech synthesis system of the present application;
FIG. 9 is a schematic diagram of an embodiment of an electronic device of the present application;
FIG. 10 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first" and "second" in the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
The present application provides the following examples, and each example is specifically described below.
It should be understood that the method for adjusting the synthesized speech and the method for training the speech synthesis system in the present application may be performed by an electronic device, which may be any device having processing capability, for example, a mobile phone, a computer, a server, etc., which is not limited in this aspect of the present application.
Referring to fig. 1, fig. 1 is a flow chart illustrating an embodiment of a method for adjusting synthesized speech according to the present application. The method may comprise the steps of:
S11: and acquiring an attribute adjustment text of the initial synthesized voice and acquiring original attribute characteristics of the target speaker, wherein the attribute adjustment text is used for representing attribute differences for performing voice attribute adjustment on the initial synthesized voice, and the initial synthesized voice is obtained by performing voice synthesis by utilizing the original attribute characteristics and the acoustic characteristics.
The method for adjusting the synthesized voice can be implemented with a speech synthesis system. The speech synthesis system may include a speech synthesis model and an attribute adjustment model: the speech synthesis model performs voice synthesis using the speaker's attribute features and acoustic features to obtain synthesized voice, and the attribute adjustment model adjusts the attribute features using the attribute adjustment text. Attribute features are, for example, timbre, emotion, etc. Illustratively, timbre descriptions may be gender-related, such as "sweet female voice" or "deep male voice", or gender-independent, such as "soft" or "mellow"; the present application is not limited thereto. Illustratively, emotions include "happy", "sad", "excited", and the like, to which the present application is not limited. It can be appreciated that the attribute features may also include features of other attributes, and the application is not limited in this regard.
In some embodiments, the original attribute characteristics of the target speaker may be obtained by encoding the voice data of the target speaker using an attribute encoding module. For example, the attribute coding module, such as a timbre coding module, may code the voice data of the target speaker to obtain a vector of timbre characteristics.
In some embodiments, the original attribute features and acoustic features of the target speaker may be speech synthesized using a speech synthesis model to obtain an initial synthesized speech. Then, according to the initial synthesized voice, an attribute adjustment text of the initial synthesized voice is obtained, wherein the attribute adjustment text is used for representing attribute differences for adjusting voice attributes of the initial synthesized voice, such as tone differences, emotion differences and the like.
For example, compared with the initial synthesized voice, a user may want "a brighter sound", "a softer sound", "a more healing sound", etc.; the attribute adjustment text may then be expressed as "brighter", "softer", "more healing", etc., to characterize the attribute difference of the voice attribute adjustment relative to the initial synthesized voice.
In some embodiments, the attribute adjustment text of one voice attribute adjustment may cover a single attribute or a combination of multiple attributes, such as adjusting timbre alone, adjusting emotion alone, or adjusting both timbre and emotion, which is not limited in this application.
In some embodiments, before the attribute adjustment text of the initial synthesized speech is acquired in step S11, the initial synthesized speech may be acquired, and the original attribute features of the target speaker may be acquired.
In some embodiments, referring to fig. 2, step S11 of the above embodiments may be further extended. The method for obtaining the initial synthesized voice and the original attribute features of the target speaker may include the following steps:
S111: and coding the text features to be synthesized to obtain text coding features.
The initial synthesized voice is obtained by voice synthesis of original attribute features and acoustic features through a voice synthesis model.
Referring to fig. 3, the speech synthesis model includes a text encoding module (Text encoder), a duration prediction module (Duration predictor), a duration adjustment module (Length adjuster), an attribute encoding module (Attribute encoder), and a decoding module (Decoder). In this embodiment, timbre encoding is taken as an example, so the attribute encoding module may be a timbre encoding module (Timbre encoder); it may also be an emotion encoding module, and other speech attribute encoding modules may be obtained in a similar manner, which is not limited in this application. In an application scenario where timbre and emotion are adjusted simultaneously, the attribute encoding module may include both a timbre encoding module (Timbre encoder) and an emotion encoding module. The following embodiments can be applied to this scenario in a similar way, and the application is not limited thereto.
The text feature x of the voice to be synthesized is acquired; the text feature x to be synthesized contains information such as phonemes, tones and prosody levels. The text encoding module encodes the text feature x to be synthesized to obtain the text encoding feature h, which provides a higher-level representation of the pronunciation content. The text encoding feature h is then input into the duration prediction module and the duration adjustment module, respectively.
S112: and predicting the time length of the text coding feature by using the characterization feature of the target speaker to obtain the time length feature.
A characterization feature s of the target speaker is acquired; the characterization feature s may be expressed as a vector and is used to model the text features x (e.g., prosody) of different speakers distinctively. The characterization vector s may take various forms. For example, it may be obtained by one-hot encoding, giving a characterization vector of the target speaker among all speakers. In one-hot encoding, a new binary feature is created for each class of each discrete attribute; for each sample, exactly one binary feature is 1, indicating that the sample belongs to the corresponding class, and the other features are 0. Alternatively, a trained speaker recognition model may be used to extract a voiceprint representation of the target speaker, and this voiceprint representation may be used as the characterization vector of the target speaker. The target speaker characterization vector contains prosody-related information of the target speaker.
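As an illustration of the one-hot characterization described above, the following sketch (not part of the patent; the variable names and the size of the speaker set are assumptions) builds a one-hot characterization vector s for the target speaker; in practice a voiceprint vector extracted by a trained speaker recognition model could be used instead.

    import torch
    import torch.nn.functional as F

    num_speakers = 8          # assumed size of the speaker set
    target_speaker_id = 3     # index of the target speaker

    # One-hot characterization vector s: a single 1 at the target speaker's
    # index, 0 elsewhere (one new binary feature per speaker).
    s = F.one_hot(torch.tensor(target_speaker_id), num_classes=num_speakers).float()

    # Alternatively, a voiceprint embedding from a trained speaker recognition
    # model could serve as s (shape: [embedding_dim]).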
The duration prediction module predicts the duration of the text encoding feature using the characterization feature of the target speaker: based on the text encoding vector h and the input characterization vector s of the target speaker, it predicts the duration information of the text feature to obtain the duration feature. This can also be understood as predicting, for each phoneme of the text feature, its duration in the initial synthesized voice, so that each phoneme can later be expanded according to its duration.
The duration feature is then input into the duration adjustment module.
S113: and carrying out acoustic expansion on the text coding features according to the duration features to obtain acoustic features.
The time length adjusting module acoustically expands the text coding feature according to the time length feature, and in the process, the hidden layer representation of the phoneme-level text coding feature can be expanded into a feature sequence with equal length to the acoustic feature to serve as an acoustic feature p. In some application scenarios, the acoustic feature p may be a target acoustic characterization vector that is related to the duration of the speaker prosody and independent of the speaker timbre.
The text encoding feature may be expanded according to the phoneme durations. Taking the vector h_j corresponding to each phoneme in the text encoding feature h as an example, if the duration of the j-th phoneme is n frames, the vector h_j corresponding to that phoneme is copied n times; the expanded acoustic characterization vector finally obtained is the acoustic feature p. It will be appreciated that the manner of expanding the text encoding feature according to the phoneme durations is not limited to the above copying, and other forms of expansion may be adopted, which are not described in detail herein.
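A minimal sketch of the copy-based expansion described above (variable names and shapes are assumptions for illustration): each phoneme-level vector h_j is repeated for the number of frames predicted for that phoneme, giving the frame-level acoustic feature p.

    import torch

    # h: phoneme-level text encoding features, shape [num_phonemes, hidden_dim]
    # durations: predicted number of frames n for each phoneme, shape [num_phonemes]
    h = torch.randn(4, 256)
    durations = torch.tensor([3, 5, 2, 4])

    # Copy the vector h_j of the j-th phoneme n_j times and concatenate,
    # yielding a sequence with the same length as the acoustic features.
    p = torch.repeat_interleave(h, durations, dim=0)   # shape [sum(durations), hidden_dim]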
The acoustic feature p is input to a decoding module.
S114: and encoding the voice data of the target speaker to obtain the original attribute characteristics.
The voice data y of the target speaker is acquired; the voice data y may be natural-speech spectral features of the target speaker. The voice data of the target speaker is input into the attribute encoding module for encoding to obtain the original attribute feature h_s. Taking the attribute encoding module being a timbre encoding module as an example, the timbre encoding module encodes the voice data of the target speaker to obtain a timbre feature vector, i.e., the original timbre feature, which can be used as the original attribute feature h_s.
In some embodiments, the speech synthesis model further includes a pooling layer and other layers, so that the original attribute feature h_s finally obtained may be a sentence-level feature vector. The original attribute feature h_s may be input as the timbre feature vector (i.e., attribute feature vector) of the target speaker into the decoding module to implement attribute control, such as timbre control.
In some embodiments, the original attribute feature h_s may also be input into the attribute adjustment model to enable attribute prediction.
S115: and decoding by utilizing the original attribute characteristics and the acoustic characteristics to obtain initial synthesized voice.
The decoding module decodes according to the original attribute feature h_s and the acoustic feature p, predicting the spectral features and thereby obtaining the initial synthesized voice.
In the above process, the attribute coding model (such as a timbre coding model) may be a model that is initialized randomly, or a trained speaker classification model, etc., which is not limited by the present application.
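The flow of steps S111-S115 can be summarized by the following sketch. It is only an illustration under assumed module interfaces (TextEncoder, DurationPredictor, LengthAdjuster, AttributeEncoder and Decoder are placeholders passed in as callables, not the patent's actual implementation).

    def synthesize(text_feature_x, speaker_vector_s, speaker_speech_y,
                   text_encoder, duration_predictor, length_adjuster,
                   attribute_encoder, decoder):
        # S111: encode the text feature into a higher-level pronunciation representation h
        h = text_encoder(text_feature_x)
        # S112: predict per-phoneme durations from h and the speaker characterization s
        durations = duration_predictor(h, speaker_vector_s)
        # S113: expand h frame by frame according to the durations -> acoustic feature p
        p = length_adjuster(h, durations)
        # S114: encode the target speaker's voice data into the original attribute feature h_s
        h_s = attribute_encoder(speaker_speech_y)
        # S115: decode h_s and p into predicted spectral features (the initial synthesized voice)
        spectrum = decoder(p, h_s)
        return spectrum, h_s, p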
According to the scheme, the voice attribute of the initial synthesized voice can be controlled through the original attribute feature, so that the synthesized voice wanted by the user can be obtained.
In some embodiments, in the step of obtaining the attribute adjustment text of the initial synthesized voice, the initial synthesized voice may be played and the user may listen to it. Based on factors such as the listening impression, the user inputs a description of the attribute difference of the desired voice attributes compared with the initial synthesized voice, from which the attribute difference description text is obtained. This can also be understood as an input based on the attribute difference between the voice attributes of the initial synthesized voice and those of the target synthesized voice, the target synthesized voice being the synthesized voice the user intends to obtain. For example, the target synthesized voice intended by the user is "bright" or "sweet". After listening to the initial synthesized voice, the user finds that it is not yet what is desired, or wishes to adjust the attributes further; based on the voice attributes of the initial synthesized voice, for example wanting the synthesized timbre to be "brighter" or "sweeter" than the current sound, the attribute difference description text can be determined as "brighter", "sweeter", etc.
The attribute difference description text input by the user for the initial synthesized voice is received to obtain the attribute adjustment text of the initial synthesized voice. The attribute difference description text may be a timbre difference text, an emotion difference text, etc., to which the present application is not limited. The input may be received as voice input, text input, etc. from the user, which is also not limited in the present application.
According to the above scheme, after listening to the initial synthesized voice, the user inputs the attribute difference description text based on the difference between the desired voice attributes and those of the initial synthesized voice, and this text serves as the attribute adjustment text of the initial synthesized voice. This meets the user's personalized use requirements: in subsequent steps, the voice attributes of the synthesized voice can be adjusted according to the attribute difference, so that the synthesized voice comes closer to what the user wants.
With continued reference to fig. 1, after the step S11, the method may further include the following steps:
S12: and predicting the attribute by utilizing the attribute adjustment text and the original attribute characteristics to obtain new attribute characteristics.
The attribute adjustment model can be used to perform attribute prediction on the attribute adjustment text and the original attribute features; in this process, the original attribute features are adjusted toward the attribute adjustment text to obtain the new attribute features.
In some embodiments, referring to FIG. 4, the attribute adjustment model may include a language module (Language model) and an attribute prediction module (Attribute predictor). The attribute prediction module may be a prediction module related to a specific speech attribute, such as a timbre prediction module (Timbre predictor) or an emotion prediction module; other speech attribute prediction modules may be obtained similarly, and the application is not limited in this regard.
The language module may include a trained language model, such as a BERT (Bidirectional Encoder Representations from Transformers) model, a large language model, and the like; the application does not limit the specific structure and type of the language module. The model structure of the attribute prediction module (e.g., the timbre prediction module) may be a convolutional neural network, a recurrent neural network, or a combination of other network modules, etc., and the present application is not limited thereto.
The attribute adjustment text L can be input into the language module, which semantically encodes the attribute adjustment text; the semantic representation of the attribute adjustment text L yields the attribute difference feature h_L. The semantic representation may be obtained by averaging the language-module encoding vectors of all characters in the attribute adjustment text L at sentence level, or by taking the output of the first character as the representation; it can be understood that other semantic representation methods may also be adopted, which is not limited in the present application. Taking the attribute adjustment text being a timbre difference text as an example, the language module semantically encodes the timbre difference text to obtain the timbre difference feature.
The attribute difference feature h_L and the original attribute feature h_s are input into the attribute prediction module, which performs attribute prediction on them to obtain the new attribute feature h_s'. Taking the attribute prediction module being a timbre prediction module as an example, it performs timbre prediction on the timbre difference feature h_L and the original timbre feature h_s to obtain the new timbre feature h_s'. It will be appreciated that new features for other attributes may be obtained in the same manner, and the application is not limited in this respect.
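A possible sketch of this attribute prediction step is given below, assuming a sentence-level mean of the language-model token vectors as the semantic representation and a small feed-forward network as the attribute (e.g., timbre) prediction module; the specific layers, dimensions and names are assumptions, not the patent's concrete design.

    import torch
    import torch.nn as nn

    class AttributePredictor(nn.Module):
        """Predicts a new attribute feature h_s' from (h_s, h_L)."""
        def __init__(self, attr_dim=256, text_dim=768):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(attr_dim + text_dim, 512), nn.ReLU(),
                nn.Linear(512, attr_dim),
            )

        def forward(self, h_s, h_L):
            # Combine the original attribute feature and the attribute difference feature.
            return self.net(torch.cat([h_s, h_L], dim=-1))

    # h_L: attribute difference feature from the language module (e.g., the mean of
    #      BERT token embeddings of the adjustment text L), shape [text_dim]
    # h_s: original attribute (timbre) feature of the target speaker, shape [attr_dim]
    h_L = torch.randn(768)
    h_s = torch.randn(256)
    h_s_new = AttributePredictor()(h_s, h_L)   # new attribute feature h_s'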
According to the scheme, the attribute difference features and the original attribute features of the initial synthesized voice are subjected to attribute prediction, so that the attribute adjustment model can predict and obtain the attribute feature vector tending to the attribute difference features, and the personalized requirements of users on the synthesized voice are met.
The new attribute features may then be used in place of the original attribute features for subsequent speech synthesis.
S13: and performing voice synthesis based on the new attribute characteristics and the acoustic characteristics to obtain the adjusted synthesized voice.
The new attribute features replace the original attribute features and are input into the speech synthesis model, and voice synthesis is performed based on the new attribute features and the acoustic features to obtain the adjusted synthesized voice. The synthesis process may refer to the implementation process for obtaining the initial synthesized voice, which is not repeated here.
In some embodiments, after the adjusted synthesized voice is obtained, it may be used as the initial synthesized voice, and step S11 and the subsequent steps may be performed again. If the user listens to the adjusted synthesized voice and it is still not the desired synthesized voice, the attribute adjustment text can be acquired again according to the above steps, and the steps repeated until the synthesized voice desired by the user is obtained.
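The interactive adjustment loop described above can be sketched as follows; the helper functions are hypothetical placeholders standing in for steps S11-S13 and for collecting the user's feedback text.

    def interactive_adjustment(h_s, p, synthesize_fn, predict_attribute_fn,
                               get_user_feedback_fn, max_rounds=5):
        """Repeat S11-S13 until the user accepts the synthesized voice."""
        speech = synthesize_fn(h_s, p)                        # initial synthesized voice
        for _ in range(max_rounds):
            adjustment_text = get_user_feedback_fn(speech)    # e.g., "brighter"
            if adjustment_text is None:                       # user is satisfied
                break
            h_s = predict_attribute_fn(adjustment_text, h_s)  # S12: new attribute feature
            speech = synthesize_fn(h_s, p)                    # S13: adjusted synthesized voice
        return speech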
According to the above scheme, the original attribute features of the target speaker are acquired, and voice synthesis is performed using the original attribute features and acoustic features, so that the initial synthesized voice is obtained. The attribute adjustment text of the initial synthesized voice is then acquired; since the attribute adjustment text characterizes the attribute difference by which the voice attributes of the initial synthesized voice are to be adjusted, it captures the difference in voice attributes between the synthesized voice the user wants and the initial synthesized voice. Attribute prediction is then performed using the attribute adjustment text and the original attribute features to obtain new attribute features, so that the new attribute features are closer to the attribute features corresponding to the attribute adjustment text, i.e., those required by the user. Finally, voice synthesis is performed based on the new attribute features and the acoustic features, yielding a synthesized voice that tends toward the attribute adjustment text, thereby achieving adjustment of the voice attributes of the synthesized voice toward the attribute adjustment text.
In some embodiments, the speech synthesis system may be trained prior to use of the speech synthesis system described above, such as at least one of training a speech synthesis model, training an attribute encoding module, training an attribute adjustment model, and the like, to which the present application is not limited.
In some embodiments, the attribute adjustment model is used to perform attribute prediction using the attribute adjustment text and the original attribute features to obtain new attribute features. When the attribute adjustment model is trained, the corresponding attribute difference samples are used for training. The attribute adjustment model is obtained by training with attribute difference samples of multiple pairs of speakers; each pair of speakers comprises a first speaker and a second speaker, and the attribute difference sample comprises the attribute difference of the voice data sample of the second speaker relative to that of the first speaker.
In some embodiments, the speech synthesis model may be trained using text feature samples of multiple speakers and speech data samples, characterization feature samples, etc. of the speakers.
In this regard, the present application also provides a training method for a speech synthesis system, and the following embodiments may be referred to specifically.
Referring to fig. 5, fig. 5 is a flowchart of a training method of the speech synthesis system according to a first embodiment of the present application. The method may comprise the steps of:
s21: acquiring attribute difference samples among voice data samples of a plurality of pairs of speakers; and obtaining attribute characteristic samples of each speaker by using the voice synthesis system.
Voice data samples of multiple speakers and their corresponding pronunciation texts can be obtained, with the pronunciation texts taken as the corresponding text feature samples. The voice data samples of the multiple speakers should cover as wide a variety of voice data as possible, such as different ages, genders, timbres and emotions, and each voice attribute, age, gender, etc. should be as balanced as possible.
Alternatively, multiple voice data samples may be collected for each speaker, e.g., at least 100 sentences of voice data may be collected for each speaker, each sentence having a voice duration of about ten seconds. The speech data samples and text feature samples of the plurality of speakers are used as training data sets, and the number of speakers in the training data sets is denoted as S.
The speakers can be combined pairwise, so that attribute difference samples between the voice data samples of multiple pairs of speakers can be obtained by comparison.
Optionally, a preset number of voice data samples may be selected from each speaker as the comparison voice data samples, so as to obtain attribute difference samples between the comparison voice data samples of each pair of speakers. For example, for each speaker, three sentences may be selected at random from that speaker's voice data samples in the training dataset as its voice samples, i.e., as the comparison voice data samples. For example, for the i-th speaker, the three selected voice data samples may be denoted {x_{i,1}, x_{i,2}, x_{i,3}}.
For any two speakers i and j in the training dataset ({i, j} ∈ [1, S], and i ≠ j) taken as a pair of speakers, the comparison voice data samples of speakers i and j may be obtained, denoted {x_{i,1}, x_{i,2}, x_{i,3}} and {x_{j,1}, x_{j,2}, x_{j,3}}, respectively. Then, the differences in the voice attributes of {x_{i,1}, x_{i,2}, x_{i,3}} and {x_{j,1}, x_{j,2}, x_{j,3}} are compared and labeled to obtain the attribute difference sample between speakers i and j.
Taking the timbre difference as an example, the timbre differences between {x_{i,1}, x_{i,2}, x_{i,3}} and {x_{j,1}, x_{j,2}, x_{j,3}} are labeled by annotators. Specifically, an annotator listens to the voices {x_{i,1}, x_{i,2}, x_{i,3}} and {x_{j,1}, x_{j,2}, x_{j,3}} of speaker i and speaker j respectively, so as to become familiar with the timbre characteristics of the two speakers. Then, according to the annotator's auditory perception, the timbre difference of speaker j relative to speaker i is described in one sentence to obtain the attribute difference sample. For example, the labeled result may be "lower, somewhat thicker". The labeling result is denoted l_{i,j} and used as the attribute difference sample. For S speakers, a set of labeling results, i.e., attribute difference samples l_{i,j}, can be obtained in this way over the speaker pairs.
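The construction of the comparison samples and attribute difference samples could look like the following sketch; the three-sentence sampling and the annotation call are placeholders reflecting the procedure above, not actual tooling from the patent.

    import random

    def build_difference_samples(speaker_utterances, annotate_fn, num_ref=3):
        """speaker_utterances: dict speaker_id -> list of utterances.
        annotate_fn(refs_i, refs_j): returns a one-sentence description of the
        attribute (e.g., timbre) difference of speaker j relative to speaker i."""
        # Comparison voice data samples, e.g., {x_{i,1}, x_{i,2}, x_{i,3}} per speaker.
        refs = {spk: random.sample(utts, num_ref)
                for spk, utts in speaker_utterances.items()}
        samples = {}
        speakers = list(refs)
        for i in speakers:
            for j in speakers:
                if i == j:
                    continue
                samples[(i, j)] = annotate_fn(refs[i], refs[j])   # l_{i,j}
        return samples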
The speech synthesis system of the present embodiment includes a speech synthesis model and an attribute adjustment model. At least one of the speech synthesis model, the attribute encoding module, the attribute adjustment model, and the like may be trained. The speech synthesis model comprises an attribute coding module.
Optionally, the attribute encoding module may be used to encode the voice data samples of each speaker to obtain attribute feature samples, where the encoded voice data samples may be voice data representative of the speaker in the training dataset.
After step S21, the speech synthesis system may be used to perform attribute prediction on the attribute difference samples and the speakers' attribute feature samples to obtain predicted attribute features; and/or the speech synthesis system may be used to perform voice synthesis on the speakers' attribute feature samples and acoustic feature samples to obtain predicted synthesized voice. The network parameters of the speech synthesis system are then adjusted based on the predicted attribute features and/or the predicted synthesized voice to obtain the trained speech synthesis system.
In some embodiments, the speech synthesis model may be trained alone, the attribute adjustment model may be trained alone, or both the speech synthesis model and the attribute adjustment model may be trained, as the application is not limited in this respect.
Optionally, when training the speech synthesis model, the predicted synthesized speech obtained by the speech synthesis model may be used to determine a synthesis loss value, and then the network parameters of the speech synthesis model are adjusted to obtain the trained speech synthesis model.
Optionally, when training the attribute adjustment model, the predicted attribute features obtained by the attribute adjustment model may be used to determine a predicted loss value, and then the network parameters of the attribute adjustment model are adjusted to obtain the trained attribute adjustment model.
Alternatively, when the speech synthesis model and the attribute adjustment model are trained simultaneously, the prediction loss value may be determined based on the predicted attribute features and the synthesis loss value based on the predicted synthesized voice, and the network parameters of the speech synthesis model and the attribute adjustment model may be adjusted separately. Or the prediction loss value and the synthesis loss value may be determined separately and then combined into a total loss value, and the network parameters of the speech synthesis model and the attribute adjustment model may be adjusted based on the total loss value to obtain the trained speech synthesis system.
In some embodiments, the speech synthesis model may be trained first, and then the attribute feature sample may be obtained through the speech synthesis model to train the attribute adjustment model, which is not limited in this regard by the present application.
In some embodiments, this embodiment is described taking training the attribute adjustment model as an example, and after the step S21, the following steps may be included:
S22: and carrying out attribute prediction on the attribute difference sample and the attribute feature sample of the speaker by using a voice synthesis system to obtain predicted attribute features.
In this step, the specific implementation process of predicting the attribute difference sample and the attribute feature sample of the speaker by using the speech synthesis system to obtain the predicted attribute feature may refer to the implementation process of obtaining the new attribute feature in the foregoing embodiment, which is not described herein.
In some embodiments, attribute adjustment models are used to predict attributes of the attribute difference samples and the attribute feature samples of the speaker, so as to obtain predicted attribute features.
The attribute difference sample l_{i,j} contains the attribute difference of the voice data sample of the second speaker j relative to that of the first speaker i. In the above process of obtaining the attribute feature samples, the attribute encoding module of the speech synthesis model may be used to obtain the first attribute feature sample h_{s,i} of the voice data sample of the first speaker and the second attribute feature sample h_{s,j} of the voice data sample of the second speaker.
And then, carrying out attribute prediction on the attribute difference sample and the first attribute feature sample by adopting an attribute adjustment model to obtain predicted attribute features.
Specifically, with continued reference to fig. 4, the attribute difference sample l_{i,j} described above may be input into the language module of the attribute adjustment model, and the first attribute feature sample h_{s,i} may be input into the attribute prediction module of the attribute adjustment model. The language module semantically encodes the attribute difference sample l_{i,j} to obtain an attribute difference feature sample, which is input into the attribute prediction module. The attribute prediction module performs attribute prediction on the attribute difference feature sample and the first attribute feature sample h_{s,i} to obtain the predicted attribute feature h'_{s,j}. The predicted attribute feature h'_{s,j} should be as close as possible to the voice attributes of speaker j, such as its timbre feature.
S23: based on the predicted attribute characteristics, network parameters of the speech synthesis system are adjusted to obtain the trained speech synthesis system.
According to the predicted attribute characteristics determined by the attribute adjustment model, a predicted loss value can be obtained, and network parameters of the attribute adjustment model of the speech synthesis system can be adjusted based on the predicted loss value to obtain the trained speech synthesis system. Or obtaining a total loss value according to the determined predicted attribute characteristics, and adjusting the attribute adjustment model of the voice synthesis system and the network parameters of the voice synthesis model based on the total loss value to obtain the trained voice synthesis system.
In some implementations, the prediction loss value may be determined using the difference between the predicted attribute feature h'_{s,j} and the second attribute feature sample h_{s,j} of the second speaker. For example, the difference between the predicted attribute feature h'_{s,j} and the second attribute feature sample h_{s,j} may be determined by a minimization function to obtain the prediction loss value; the minimization function may be, for example, a minimum mean square error or a minimum absolute difference, and the present application is not limited thereto. The network parameters of the attribute adjustment model can then be adjusted based on the prediction loss value, so that the network parameters of the attribute adjustment model are trained and the trained attribute adjustment model is obtained. In this way, the predicted attribute features output by the trained attribute adjustment model from the first attribute feature sample and the attribute difference sample become closer to the second attribute feature sample, and the predicted attribute feature h'_{s,j} predicted for speaker i is as close as possible to the voice attributes of speaker j.
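Under the assumption that the minimization function is the mean squared error (one of the options mentioned above, not the only possible choice), the prediction loss for one speaker pair could be computed as in this sketch:

    import torch.nn.functional as F

    def prediction_loss(h_pred_s_j, h_s_j):
        """MSE between the predicted attribute feature h'_{s,j} (predicted from
        speaker i's feature plus the difference text l_{i,j}) and the second
        speaker's attribute feature sample h_{s,j}."""
        return F.mse_loss(h_pred_s_j, h_s_j)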
According to the above scheme, attribute difference samples between the voice data samples of multiple pairs of speakers and the attribute feature samples of the speakers' voice data samples are obtained, attribute prediction is performed on the attribute difference samples and the speakers' attribute feature samples to obtain predicted attribute features, and the network parameters of the speech synthesis system are adjusted based on the predicted attribute features to obtain the trained speech synthesis system. The predicted attribute features thus become closer to those of the target speaker of each pair, so that the trained speech synthesis system can adjust the attributes of synthesized voice according to an attribute difference, and the adjusted synthesized voice is closer to the synthesized voice corresponding to that attribute difference.
Referring to fig. 6, fig. 6 is a flowchart of a training method of the speech synthesis system according to a second embodiment of the application. This embodiment is described by taking training a speech synthesis model as an example, and the method may include the steps of:
s31: and performing voice synthesis on the attribute characteristic sample and the acoustic characteristic sample of the speaker by using a voice synthesis system to obtain predicted synthesized voice.
In this step, the specific implementation process of performing speech synthesis on the attribute feature sample and the acoustic feature sample of the speaker by using the speech synthesis system to obtain the predicted synthesized speech may refer to the implementation process of obtaining the initial synthesized speech or the adjusted synthesized speech in the above embodiment, which is not described herein.
In some embodiments, the predicted synthesized speech may be obtained by speech synthesis of the attribute feature sample and the acoustic feature sample of the speaker through a speech synthesis model.
In some embodiments, prior to step S31, text feature samples and voice data samples of a plurality of speakers, characterization feature samples of a speaker, and the like may be acquired. When the training data set is obtained, the pronunciation text corresponding to the voice data sample of the speaker can be used as the corresponding text feature sample.
In some embodiments, attribute encoding modules are employed to obtain attribute feature samples of voice data samples of each speaker separately. And processing the characteristic feature sample and the text feature sample of the speaker by adopting a voice synthesis model to obtain an acoustic feature sample. Performing voice synthesis on the attribute feature sample and the acoustic feature sample to obtain predicted synthesized voice and predicted duration features; the prediction duration feature is obtained by predicting duration of a text coding feature sample by using a characterization feature sample, the text coding feature sample is obtained by coding the text feature sample, and the acoustic feature sample is obtained by acoustic expansion of the text coding feature sample according to the prediction duration feature.
Specifically, with continued reference to fig. 3, the speech synthesis model includes a text encoding module, a duration prediction module, a duration adjustment module, an attribute encoding module, and a decoding module.
And the text coding module codes the text characteristic samples of the speaker to obtain the text coding characteristic samples. The duration prediction module predicts the duration of the text coding feature sample by using the characterization feature sample of the speaker to obtain a predicted duration feature. And the duration adjustment module acoustically expands the text coding feature samples according to the predicted duration features to obtain acoustic feature samples. And the attribute coding module codes the voice data sample of the speaker to obtain an attribute characteristic sample. The decoding module decodes the attribute feature samples and the acoustic feature samples to obtain predicted synthesized speech, so that the attribute of the synthesized speech is controlled.
The attribute coding module may be a model obtained by adopting a random initialization mode, or a trained speaker classification model, etc., which is not limited by the present application.
S32: based on the predicted synthesized voice, the network parameters of the voice synthesis system are adjusted to obtain the trained voice synthesis system.
According to the predicted synthesized voice determined by the voice synthesis model, a synthesis loss value can be obtained, and network parameters of the voice synthesis model of the voice synthesis system can be adjusted based on the synthesis loss value to obtain a trained voice synthesis model, so that the trained voice synthesis system is obtained. Or obtaining a total loss value according to the determined predicted synthesized voice, and adjusting an attribute adjustment model of the voice synthesis system and network parameters of the voice synthesis model based on the total loss value to obtain the trained voice synthesis system, so that the trained voice synthesis system can realize the controllability or the adjustment of voice attributes (such as tone).
In some embodiments, the synthesis loss value may be determined by predicting the synthesized speech, wherein the synthesis loss value is a spectral loss value derived using the predicted synthesized speech and the real speech data.
In some embodiments, since the duration prediction module is used to learn the duration of each input phoneme, one duration prediction module needs to be trained simultaneously when the speech synthesis model is trained. The synthesis loss value can be determined by comprehensively predicting the characteristics of the synthesized voice and the predicted duration.
The composite loss value comprises a spectrum loss value and a duration loss value, wherein the spectrum loss value is obtained using the predicted synthesized voice and the real voice data, and the duration loss value is obtained using the predicted duration feature and the real duration feature. In some application scenarios, the real voice data may be the above voice data sample or other representative real voice data, which is not limited. The real duration feature may be the duration information extracted from the voice data sample or from other real voice data.
The loss function corresponding to the composite loss value includes the spectrum loss value L_mel and the duration loss value L_dur, and the composite loss value L may be expressed as:
L = L_mel + L_dur
The spectrum loss value L_mel is computed over the T frames contained in the current sentence corresponding to the current text feature sample, from the real spectral features y_i (i.e., the real voice data) and the predicted spectral features ŷ_i (i.e., the predicted synthesized voice) of the i-th frame of the current sentence.
The duration loss value L_dur is computed over the P phonemes contained in the current sentence corresponding to the current text feature sample, from the real duration frame count d_j (i.e., the real duration feature) and the predicted duration frame count d̂_j (i.e., the predicted duration feature) of the j-th phoneme of the current sentence.
The network parameters of the overall speech synthesis model are optimized by minimizing the composite loss function L until the loss function converges, yielding the trained speech synthesis model, which makes the voice attributes controllable.
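As an illustration of the composite loss L = L_mel + L_dur, the sketch below assumes an L1 distance for the spectrum loss and a mean squared error for the duration loss; the text above does not fix the exact distance measures, so these choices are assumptions.

    import torch.nn.functional as F

    def composite_loss(mel_pred, mel_true, dur_pred, dur_true):
        """mel_*: spectral features of the T frames of the current sentence.
        dur_*: duration frame counts of the P phonemes of the current sentence (as floats)."""
        l_mel = F.l1_loss(mel_pred, mel_true)    # spectrum loss L_mel (assumed L1)
        l_dur = F.mse_loss(dur_pred, dur_true)   # duration loss L_dur (assumed MSE)
        return l_mel + l_dur                     # composite loss L = L_mel + L_dur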
The specific implementation of this embodiment may refer to the implementation process of the foregoing embodiment, and the disclosure is not repeated herein.
If the technical solution of the present application involves personal information, a product applying this technical solution clearly informs the individual of the personal-information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical solution involves sensitive personal information, a product applying it obtains the individual's separate consent before processing the sensitive personal information and at the same time satisfies the requirement of "explicit consent". For example, a clear and prominent sign is placed at a personal-information collection device such as a camera to inform people that they are entering the personal-information collection range and that personal information will be collected; if an individual voluntarily enters the collection range, this is regarded as consent to the collection of his or her personal information. Alternatively, on a device that processes personal information, where the personal-information processing rules are communicated through clear identification or notice, the individual's authorization is obtained by a pop-up message, by asking the individual to upload personal information, or the like. The personal-information processing rules may include information such as the personal-information processor, the purpose of processing, the processing method, and the types of personal information processed.
It will be appreciated by those skilled in the art that, in the methods of the above specific embodiments, the written order of the steps does not imply a strict order of execution; the actual order of execution should be determined by the functions of the steps and their possible inherent logic.
For the above embodiment, the present application further provides a device for adjusting a synthesized voice, which is used for implementing the above method for adjusting a synthesized voice.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of the device for adjusting synthesized speech according to the present application. The synthesized speech adjusting apparatus 40 includes an attribute adjustment unit 41, an attribute acquisition unit 42, an attribute prediction unit 43, and a speech synthesis unit 44, which are connected to each other.
The attribute adjustment unit 41 is used to acquire an attribute adjustment text of the initial synthesized voice.
The attribute obtaining unit 42 is configured to obtain an original attribute feature of the target speaker, where the attribute adjustment text is used to characterize an attribute difference for performing voice attribute adjustment on an initial synthesized voice, and the initial synthesized voice is obtained by performing voice synthesis using the original attribute feature and the acoustic feature.
The attribute prediction unit 43 is configured to perform attribute prediction by using the attribute adjustment text and the original attribute feature, so as to obtain a new attribute feature.
The speech synthesis unit 44 is configured to perform speech synthesis based on the new attribute features and the acoustic features to obtain an adjusted synthesized speech.
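As an illustration only, the cooperation of the four units at inference time might be sketched as follows; the component names semantic_encoder, attribute_predictor, and decoder are hypothetical placeholders rather than modules defined in this application:

```python
# Illustrative sketch of the adjustment apparatus; component names are hypothetical.
class SynthesizedSpeechAdjuster:
    def __init__(self, semantic_encoder, attribute_predictor, decoder):
        self.semantic_encoder = semantic_encoder        # encodes the attribute adjustment text
        self.attribute_predictor = attribute_predictor  # predicts the new attribute feature
        self.decoder = decoder                          # synthesizes speech from attribute + acoustic features

    def adjust(self, adjustment_text, original_attr, acoustic_features):
        # Attribute adjustment unit / attribute acquisition unit: inputs are the
        # adjustment text and the target speaker's original attribute feature.
        diff_feature = self.semantic_encoder(adjustment_text)
        # Attribute prediction unit: original attribute + described difference -> new attribute.
        new_attr = self.attribute_predictor(original_attr, diff_feature)
        # Speech synthesis unit: decode the adjusted synthesized speech.
        return self.decoder(new_attr, acoustic_features)
```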
The specific implementation of this embodiment may refer to the implementation process of the foregoing embodiment, and the disclosure is not repeated herein.
For the above embodiment, the present application further provides a training device of a speech synthesis system, which is configured to implement the training method of the speech synthesis system.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an embodiment of a training device of a speech synthesis system according to the present application. The training device 50 of the speech synthesis system includes a difference acquisition unit 51, an attribute acquisition unit 52, a speech prediction unit 53, and a parameter adjustment unit 54. Wherein the above-described difference acquisition unit 51, attribute acquisition unit 52, voice prediction unit 53, and parameter adjustment unit 54 are connected to each other.
The difference acquisition unit 51 is configured to acquire attribute difference samples between voice data samples of a plurality of pairs of speakers.
The attribute acquisition unit 52 is configured to acquire attribute feature samples of each speaker.
The voice prediction unit 53 is configured to perform attribute prediction on the attribute difference sample and the attribute feature sample of the speaker to obtain a predicted attribute feature.
The parameter adjustment unit 54 is configured to adjust network parameters of the speech synthesis system based on the predicted attribute feature, so as to obtain a trained speech synthesis system.
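A rough sketch of the training step implied by these four units is given below; the speaker-pair batching, the MSE prediction loss, and the names attr_encoder and attr_adjuster are assumptions made only for illustration:

```python
# Illustrative sketch of one training step over a pair of speakers.
import torch.nn.functional as F

def training_step(attr_encoder, attr_adjuster, optimizer,
                  speech_a, speech_b, attribute_difference):
    """speech_a, speech_b: voice data samples of the first and second speaker;
    attribute_difference: sample of the second speaker's attributes relative to the first."""
    attr_a = attr_encoder(speech_a)                           # first attribute feature sample
    attr_b = attr_encoder(speech_b)                           # second attribute feature sample
    predicted = attr_adjuster(attr_a, attribute_difference)   # predicted attribute feature
    loss = F.mse_loss(predicted, attr_b)                      # prediction loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```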
The specific implementation of this embodiment may refer to the implementation process of the foregoing embodiment, and the disclosure is not repeated herein.
For the above embodiments, the present application further provides an electronic device. Referring to fig. 9, fig. 9 is a schematic structural diagram of an embodiment of the electronic device of the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other; the memory 61 stores program data, and the processor 62 is configured to execute the program data to implement the steps of any embodiment of the above method for adjusting synthesized speech and/or of the above training method of the speech synthesis system.
In this embodiment, the processor 62 may also be referred to as a CPU (Central Processing Unit). The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor 62 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor 62 may be any conventional processor or the like.
For the methods of the above embodiments, which may be implemented in the form of a computer program, the present application further proposes a computer-readable storage medium. Referring to fig. 10, fig. 10 is a schematic structural diagram of an embodiment of the computer-readable storage medium of the present application. The computer-readable storage medium 70 stores program data 71 executable by a processor, and the program data 71, when executed by the processor, implements the steps of any embodiment of the above method for adjusting synthesized speech and/or of the training method of the speech synthesis system.
The computer-readable storage medium 70 of this embodiment may be a medium capable of storing the program data 71, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk; or it may be a server storing the program data 71, where the server may send the stored program data 71 to another device for execution, or may itself run the stored program data 71.
In some embodiments, the functions or modules included in the apparatus provided by the foregoing embodiments of the present application may be used to perform the methods described in the foregoing method embodiments, and specific implementations thereof may refer to the descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing descriptions of the embodiments each tend to emphasize the differences from the other embodiments; for the parts that are the same or similar, the embodiments may be referred to one another, and such parts are not repeated herein for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application.
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Alternatively, they may be implemented in program code executable by computing devices, so that they may be stored in a computer-readable storage medium and executed by computing devices, or they may be separately made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing description is only illustrative of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present application.

Claims (16)

1. A method for adjusting synthesized speech, comprising:
acquiring an attribute adjustment text of initial synthesized voice and acquiring original attribute characteristics of a target speaker, wherein the attribute adjustment text is used for representing attribute differences for performing voice attribute adjustment on the initial synthesized voice, and the initial synthesized voice is obtained by performing voice synthesis by utilizing the original attribute characteristics and acoustic characteristics;
Performing attribute prediction by utilizing the attribute adjustment text and the original attribute characteristics to obtain new attribute characteristics;
And performing voice synthesis based on the new attribute characteristics and the acoustic characteristics to obtain adjusted synthesized voice.
2. The method of claim 1, wherein performing attribute prediction using the attribute-adjusted text and the original attribute feature to obtain a new attribute feature comprises:
Carrying out semantic coding on the attribute adjustment text by adopting a language module to obtain attribute difference characteristics;
and carrying out attribute prediction on the attribute difference feature and the original attribute feature by adopting an attribute prediction module to obtain the new attribute feature.
3. The method of claim 1, wherein,
the new attribute feature is obtained by performing attribute prediction on the attribute adjustment text and the original attribute feature by using an attribute adjustment model, wherein the attribute adjustment model is obtained by training with attribute difference samples of a plurality of pairs of speakers, each pair of speakers comprises a first speaker and a second speaker, and the attribute difference samples comprise attribute differences between voice data samples of the second speaker relative to the first speaker.
4. The method of claim 3, wherein before the performing attribute prediction by using the attribute adjustment text and the original attribute feature to obtain the new attribute feature, the method comprises:
Respectively acquiring a first attribute characteristic sample of a voice data sample of a first speaker and a second attribute characteristic sample of a voice data sample of a second speaker, and acquiring an attribute difference sample of the second speaker relative to the first speaker;
Carrying out attribute prediction on the attribute difference sample and the first attribute feature sample by adopting the attribute adjustment model to obtain predicted attribute features;
determining a predicted loss value by using the predicted attribute characteristics and a second attribute characteristic sample of the second speaker;
and adjusting network parameters of the attribute adjustment model based on the predicted loss value to obtain a trained attribute adjustment model.
5. The method of claim 1, wherein the obtaining the attribute-adjusted text of the initial synthesized speech comprises:
and receiving an attribute difference description text input by a user for the initial synthesized voice to obtain the attribute adjustment text of the initial synthesized voice, wherein the attribute difference description text is obtained based on an attribute difference of voice attributes, relative to the initial synthesized voice, input by the user after listening to the initial synthesized voice.
6. The method of claim 1, wherein the initial synthesized speech is a speech synthesis of the original attribute features and the acoustic features by a speech synthesis model;
Before the text is adjusted according to the attribute of the initial synthesized voice, the method comprises the following steps:
Coding the text features to be synthesized to obtain text coding features;
predicting the duration of the text coding feature by using the characterization feature of the target speaker to obtain a duration feature;
Performing acoustic expansion on the text coding feature according to the duration feature to obtain the acoustic feature;
Encoding the voice data of the target speaker to obtain the original attribute characteristics;
and decoding by utilizing the original attribute characteristics and the acoustic characteristics to obtain the initial synthesized voice.
7. The method of claim 6, wherein the speech synthesis model is trained using attribute feature samples, text feature samples, and characterization samples of speech data samples of a plurality of speakers.
8. A method of training a speech synthesis system, comprising:
acquiring attribute difference samples among voice data samples of a plurality of pairs of speakers; obtaining attribute characteristic samples of each speaker by using a voice synthesis system;
carrying out attribute prediction on the attribute difference sample and the attribute feature sample of the speaker by utilizing the voice synthesis system to obtain predicted attribute features;
and adjusting network parameters of the voice synthesis system based on the predicted attribute characteristics to obtain the trained voice synthesis system.
9. The method of claim 8, wherein the speech synthesis system includes an attribute encoding module and an attribute adjustment model, each pair of speakers including a first speaker and a second speaker, the attribute difference samples including attribute differences between speech data samples of the second speaker relative to the first speaker;
the obtaining the attribute characteristic sample of each speaker by using the voice synthesis system comprises the following steps:
Respectively acquiring a first attribute characteristic sample of a voice data sample of a first speaker and a second attribute characteristic sample of a voice data sample of a second speaker by utilizing the attribute coding module;
the step of predicting the attribute of the attribute difference sample and the attribute feature sample of the speaker by using the voice synthesis system to obtain predicted attribute features comprises the following steps:
carrying out attribute prediction on the attribute difference sample and the first attribute feature sample by adopting the attribute adjustment model to obtain the predicted attribute feature;
The step of adjusting the network parameters of the speech synthesis system based on the predicted attribute features to obtain a trained speech synthesis system comprises the following steps:
determining a predicted loss value by using the predicted attribute characteristics and a second attribute characteristic sample of the second speaker;
and adjusting network parameters of the attribute adjustment model based on the predicted loss value to obtain a trained attribute adjustment model.
10. The method of claim 8, wherein the method further comprises:
Performing voice synthesis on the attribute feature sample and the acoustic feature sample of the speaker by using the voice synthesis system to obtain predicted synthesized voice;
And based on the predicted synthesized voice, adjusting network parameters of the voice synthesis system to obtain the trained voice synthesis system.
11. The method of claim 10, wherein the speech synthesis system comprises a speech synthesis model, the speech synthesis model comprising an attribute encoding module;
the obtaining the attribute characteristic sample of each speaker by using the voice synthesis system comprises the following steps:
respectively acquiring attribute characteristic samples of voice data samples of each speaker by adopting the attribute coding module;
The speech synthesis of the attribute feature sample and the acoustic feature sample of the speaker by using the speech synthesis model to obtain predicted synthesized speech includes:
processing the characterization feature sample and the text feature sample of the speaker by using the speech synthesis model to obtain an acoustic feature sample;
Performing voice synthesis on the attribute feature sample and the acoustic feature sample to obtain predicted synthesized voice and predicted duration features; the prediction duration feature is obtained by predicting duration of a text coding feature sample by using the characterization feature sample, the text coding feature sample is obtained by coding the text feature sample, and the acoustic feature sample is obtained by acoustic expansion of the text coding feature sample according to the prediction duration feature;
the adjusting network parameters of the voice synthesis system based on the predicted synthesized voice to obtain the trained voice synthesis system comprises:
determining a synthesis loss value by utilizing the predicted synthesized voice and the predicted duration characteristics;
And based on the synthesis loss value, adjusting network parameters of the voice synthesis model to obtain a trained voice synthesis model.
12. The method of claim 11, wherein,
the synthesis loss value comprises: a spectrum loss value obtained by using the predicted synthesized voice and real voice data, and a duration loss value obtained by using the predicted duration feature and a real duration feature.
13. An adjusting device for synthesizing speech, comprising:
the attribute adjustment unit is used for acquiring an attribute adjustment text of the initial synthesized voice;
The attribute acquisition unit is used for acquiring original attribute characteristics of a target speaker, wherein the attribute adjustment text is used for representing attribute differences for performing voice attribute adjustment on the initial synthesized voice, and the initial synthesized voice is obtained by performing voice synthesis by utilizing the original attribute characteristics and the acoustic characteristics;
the attribute prediction unit is used for predicting the attribute by utilizing the attribute adjustment text and the original attribute characteristics to obtain new attribute characteristics;
and the voice synthesis unit is used for carrying out voice synthesis based on the new attribute characteristics and the acoustic characteristics to obtain adjusted synthesized voice.
14. A training device for a speech synthesis system, comprising:
a difference acquisition unit for acquiring attribute difference samples between voice data samples of a plurality of pairs of speakers;
The attribute acquisition unit is used for acquiring attribute characteristic samples of each speaker;
The voice prediction unit is used for carrying out attribute prediction on the attribute difference sample and the attribute characteristics of the speaker to obtain predicted attribute characteristics;
and the parameter adjusting unit is used for adjusting the network parameters of the voice synthesis system based on the predicted attribute characteristics to obtain the trained voice synthesis system.
15. An electronic device comprising a memory and a processor coupled to each other, the memory having program data stored therein, the processor for executing the program data to implement the steps of the method of any of claims 1 to 7; and/or the steps of the method of any one of claims 8 to 12.
16. A computer readable storage medium, characterized in that program data executable by a processor are stored, said program data being for implementing the steps of the method according to any one of claims 1 to 7; and/or the steps of the method of any one of claims 8 to 12.
CN202410029165.5A 2024-01-08 2024-01-08 Synthetic voice adjusting method, training method and related device Pending CN117935770A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202410029165.5A CN117935770A (en) 2024-01-08 2024-01-08 Synthetic voice adjusting method, training method and related device
CN202410882124.0A CN118411979A (en) 2024-01-08 2024-07-03 Synthetic voice adjusting method, training method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410029165.5A CN117935770A (en) 2024-01-08 2024-01-08 Synthetic voice adjusting method, training method and related device

Publications (1)

Publication Number Publication Date
CN117935770A true CN117935770A (en) 2024-04-26

Family

ID=90758721

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202410029165.5A Pending CN117935770A (en) 2024-01-08 2024-01-08 Synthetic voice adjusting method, training method and related device
CN202410882124.0A Pending CN118411979A (en) 2024-01-08 2024-07-03 Synthetic voice adjusting method, training method and related device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202410882124.0A Pending CN118411979A (en) 2024-01-08 2024-07-03 Synthetic voice adjusting method, training method and related device

Country Status (1)

Country Link
CN (2) CN117935770A (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019191251A1 (en) * 2018-03-28 2019-10-03 Telepathy Labs, Inc. Text-to-speech synthesis system and method
CN208924405U (en) * 2018-08-29 2019-05-31 北京云知声信息技术有限公司 Audio frequency broadcast system
CN114495956A (en) * 2022-02-08 2022-05-13 北京百度网讯科技有限公司 Voice processing method, device, equipment and storage medium
CN115762467A (en) * 2022-11-08 2023-03-07 科大讯飞股份有限公司 Speaker characteristic vector distribution space creation and voice synthesis method and related equipment
CN116092478A (en) * 2023-02-16 2023-05-09 平安科技(深圳)有限公司 Voice emotion conversion method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN118411979A (en) 2024-07-30

Similar Documents

Publication Publication Date Title
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
JP2022107032A (en) Text-to-speech synthesis method using machine learning, device and computer-readable storage medium
CN111785246B (en) Virtual character voice processing method and device and computer equipment
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
JP2015180966A (en) Speech processing system
CN113920977A (en) Speech synthesis model, model training method and speech synthesis method
CN112837669B (en) Speech synthesis method, device and server
JP2020034883A (en) Voice synthesizer and program
CN113314097A (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN116129852A (en) Training method of speech synthesis model, speech synthesis method and related equipment
CN116564269A (en) Voice data processing method and device, electronic equipment and readable storage medium
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
CN112863476B (en) Personalized speech synthesis model construction, speech synthesis and test methods and devices
CN114708876B (en) Audio processing method, device, electronic equipment and storage medium
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN113299270B (en) Method, device, equipment and storage medium for generating voice synthesis system
CN115132170A (en) Language classification method and device and computer readable storage medium
CN117935770A (en) Synthetic voice adjusting method, training method and related device
CN114464163A (en) Method, device, equipment, storage medium and product for training speech synthesis model
CN117746834B (en) Voice generation method and device based on large model, storage medium and electronic device
CN114360511B (en) Voice recognition and model training method and device
CN115620701A (en) Speech synthesis method, apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20240426