CN116312459A - Speech synthesis method, device, electronic equipment and storage medium


Info

Publication number
CN116312459A
Authority
CN
China
Prior art keywords
style
language
vectorized
local
global
Legal status
Pending
Application number
CN202310141134.4A
Other languages
Chinese (zh)
Inventor
丛亚欢
马泽君
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202310141134.4A
Publication of CN116312459A


Classifications

    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G06F40/253: Grammatical analysis; Style critique
    • G06F40/263: Language identification
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/086: Detection of language


Abstract

The present disclosure relates to the field of speech synthesis technology, and in particular to a speech synthesis method, a device, an electronic apparatus, and a storage medium. The method includes: obtaining text characteristics of a target language and an identification of an original language; performing style prediction of the original language based on the text characteristics of the target language and the identification of the original language to obtain style characteristics, and querying a codebook corresponding to the target language based on the style characteristics to obtain vectorized style characteristics, where the codebooks correspond to languages one to one and are used for vectorizing the style characteristics; and performing encoding and decoding processing based on the vectorized style characteristics to determine the target speech of the target language. Different codebooks are used to vectorize style characteristics for different languages, and the codebook of the target language can still be used during cross-language style migration, so as to alleviate the accent phenomenon.

Description

Speech synthesis method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of speech synthesis, and in particular relates to a speech synthesis method, a device, electronic equipment and a storage medium.
Background
In speech synthesis application scenarios, separate style modeling is often required to improve the expressiveness of the synthesized speech. For a target timbre to exhibit different styles, style migration support is required. Current style migration methods mainly operate within a single language, and corresponding style data in that language must be provided to achieve the desired style expression. This creates excessive reliance on data resources and increases the cost of the speech synthesis task. In addition, some application scenarios require the same style to be expressed in different languages, which is difficult to achieve with traditional style migration tasks.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for speech synthesis, so as to solve the problem of cross-language style migration during speech synthesis.
According to a first aspect, an embodiment of the present disclosure provides a speech synthesis method, including:
acquiring text characteristics of a target language and an identification of an original language;
carrying out style prediction of the original language based on the text characteristics of the target language and the identification of the original language to obtain style characteristics, inquiring a codebook corresponding to the target language based on the style characteristics to obtain vectorized style characteristics, wherein the codebook corresponds to the languages one by one and is used for vectorizing the style characteristics;
and performing encoding and decoding processing based on the vectorized style characteristics to determine the target voice of the target language.
According to a second aspect, an embodiment of the present disclosure provides a speech synthesis apparatus, including:
the acquisition module is used for acquiring text characteristics of the target language and the identification of the original language;
the style vectorization module is used for carrying out style prediction of the original language based on the text characteristics of the target language and the identification of the original language to obtain style characteristics, inquiring a codebook corresponding to the target language based on the style characteristics to obtain vectorized style characteristics, wherein the codebook corresponds to the languages one by one and is used for vectorizing the style characteristics;
and the processing module is used for carrying out encoding and decoding processing based on the vectorized style characteristics so as to determine the target voice of the target language.
According to a third aspect, an embodiment of the present disclosure provides an electronic device, including: a memory and a processor, where the memory and the processor are communicatively connected, the memory stores computer instructions, and the processor executes the computer instructions to perform the speech synthesis method of the first aspect or any implementation manner of the first aspect.
According to a fourth aspect, the disclosed embodiments provide a computer-readable storage medium storing computer instructions for causing a computer to perform the speech synthesis method of the first aspect or any implementation of the first aspect.
According to the speech synthesis method provided by the embodiments of the present disclosure, different codebooks are used to vectorize style characteristics for different languages, and the codebook of the target language can still be used during cross-language style migration, so as to alleviate the accent phenomenon. Meanwhile, style prediction of the original language is performed based on the text characteristics of the target language and the identification of the original language, that is, text characteristics are used as the input data for style prediction. This removes the dependence on audio features, enhances the adaptability of the style to the language, allows style data in one language to be migrated to other languages, and reduces the data requirement.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the prior art, the drawings that are required in the detailed description or the prior art will be briefly described, it will be apparent that the drawings in the following description are some embodiments of the present disclosure, and other drawings may be obtained according to the drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a speech synthesis method according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of speech synthesis according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a manner of determining a style model according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural view of a style model according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a voice synthesizing apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic hardware structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person skilled in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed and authorized of the type, usage range, usage scenario, etc. of the personal information related to the present disclosure in an appropriate manner according to the relevant legal regulations.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly inform the user that the operation it requests to perform will require obtaining and using the user's personal information. The user can thus autonomously decide, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, an application program, a server, or a storage medium, that executes the operations of the technical solution of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
In the related art of speech synthesis, a multi-scale reference encoder is generally used for style modeling, and vector quantization is applied in the style modeling task. However, because the goal of this vector quantization is not cross-language migration of styles, and there is no strategy for achieving style adaptation toward that goal, style modeling in this manner cannot achieve migration of styles across different languages.
Based on this, the speech synthesis method provided by the embodiments of the present disclosure achieves cross-language style migration by quantizing style features based on language-dependent codebooks. To reduce the requirement for style data resources, the scheme provided by the embodiments of the present disclosure is a cross-language style migration scheme: the style in one language can be migrated to other languages, achieving high style similarity while keeping the accent of the target language under control. For example, if the target language is English and the original language is Chinese, cross-language style migration is realized through the speech synthesis method of the embodiments of the present disclosure.
In accordance with the disclosed embodiments, a speech synthesis method embodiment is provided, it being noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
In this embodiment, a speech synthesis method is provided, which may be used in an electronic device, such as a mobile terminal, a server, a computer, etc., and fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the disclosure, as shown in fig. 1, where the flowchart includes the following steps:
s11, obtaining text characteristics of the target language and identification of the original language.
The specific languages represented by the target language and the original language are set according to actual application requirements and are not limited here. In speech synthesis, text in the target language is converted into speech output in the target language.
The text features of the target language are obtained by feature extraction or preprocessing of the text in the target language. For example, the text in the target language may be passed through an encoder, or processed by a pre-trained text processing module, to obtain the text features. The pre-trained text processing module includes, but is not limited to, a BERT module and the like.
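As an illustration of this step, the following is a minimal sketch of extracting token-level text features with a pre-trained BERT model; the Hugging Face transformers library and the bert-base-multilingual-cased checkpoint are assumptions, since the disclosure does not name a specific model or toolkit.

    # Minimal sketch: token-level text features from a pre-trained BERT.
    # The `transformers` library and the multilingual checkpoint are assumptions;
    # the disclosure only requires "a pre-trained text processing module".
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def text_features(text: str) -> torch.Tensor:
        """Return a (seq_len, hidden) matrix of token-level semantic features."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = bert(**inputs)
        return outputs.last_hidden_state.squeeze(0)  # one row per token

    english_text = "The weather is lovely today."   # target-language text
    features = text_features(english_text)          # later used as input for style prediction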
The identification of the original language is a unique identifier of the original language, and identifications correspond to languages one to one. The identification includes, but is not limited to, a number, a character, or another form, and is set according to actual requirements. For example, a correspondence table between languages and identifications is maintained in the electronic device; after the identification of the original language is obtained, the correspondence table is queried to determine the language corresponding to the original language.
S12, carrying out style prediction of the original language based on the text characteristics of the target language and the identification of the original language to obtain style characteristics, and inquiring a codebook corresponding to the target language based on the style characteristics to obtain vectorized style characteristics.
The codebooks are in one-to-one correspondence with languages, and the codebooks are used for vectorizing style characteristics.
Style prediction takes the text characteristics of the target language as input and the identification of the original language as the condition that specifies the original language; that is, style prediction of the original language is performed using the text characteristics of the target language.
For example, style prediction is implemented by a style predictor. Specifically, the text in the target language is first passed through a pre-trained BERT model to obtain a word-level semantic representation sequence, i.e., the text features, which are then used as the input of a multi-scale style predictor. The multi-scale style predictor first models the word-to-word and sentence-to-sentence relationships of the context using a hierarchical context encoder that includes two layers of attention networks, resulting in paragraph-level, sentence-level, and word-level contextual semantic representations. Next, the model predicts the style representation of the original language at the corresponding level from the contextual semantic representations of the different levels, so as to restore the multi-scale speaking style in human speech. Furthermore, considering that higher-level speech styles closer to the global scale influence lower-level speech styles, the speech styles are modeled sequentially from higher level to lower level based on residual connections during prediction. Specifically, a higher-level style is predicted first and then used as a conditional input of the lower-level style predictor. After the processing of the style predictor, the style characteristics of the original language are obtained.
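The disclosure does not give the exact architecture of the multi-scale style predictor; the following is a minimal PyTorch sketch of the coarse-to-fine idea described above (the global style is predicted first and then conditions the word-level local style), with layer types, sizes, and the language-embedding scheme chosen only for illustration.

    # Illustrative sketch of a two-level (global -> local) style predictor.
    # Layer sizes, the GRU context encoder, and the language embedding are assumptions,
    # not the patented architecture.
    import torch
    import torch.nn as nn

    class MultiScaleStylePredictor(nn.Module):
        def __init__(self, text_dim=768, style_dim=128, num_languages=8):
            super().__init__()
            self.lang_emb = nn.Embedding(num_languages, style_dim)  # original-language identification as condition
            self.context = nn.GRU(text_dim, style_dim, batch_first=True, bidirectional=True)
            self.global_head = nn.Linear(2 * style_dim + style_dim, style_dim)
            # the local predictor is conditioned on the predicted global style (coarse to fine)
            self.local_head = nn.Linear(2 * style_dim + 2 * style_dim, style_dim)

        def forward(self, text_feats, lang_id):
            # text_feats: (B, T, text_dim); lang_id: (B,)
            ctx, _ = self.context(text_feats)                 # (B, T, 2*style_dim) contextual semantics
            lang = self.lang_emb(lang_id)                     # (B, style_dim)
            pooled = ctx.mean(dim=1)                          # sentence/paragraph-level summary
            global_style = self.global_head(torch.cat([pooled, lang], dim=-1))   # (B, style_dim)
            cond = torch.cat([lang, global_style], dim=-1)    # global style as condition for the local level
            cond = cond.unsqueeze(1).expand(-1, ctx.size(1), -1)
            local_style = self.local_head(torch.cat([ctx, cond], dim=-1))        # (B, T, style_dim)
            return local_style, global_style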
In the embodiments of the present disclosure, different codebooks are used for different languages; if there are N languages, there are N corresponding codebooks. A codebook is used for vectorizing style characteristics; for example, the codebook is maintained by a VQ-VAE self-encoder, which provides discretized encoding capability and maintains a discretized codebook. When the codebook corresponding to the target language is queried based on the style characteristics, the input style characteristics are encoded as the nearest vector in that codebook through a nearest-neighbor search over the codebook.
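A minimal sketch of such a nearest-neighbor codebook lookup is given below; the Euclidean distance metric, codebook size, and dimensionality are illustrative assumptions rather than values from the disclosure.

    # Sketch of the vector-quantized lookup: encode each style vector as its nearest codebook entry.
    # Distance metric and codebook size are assumptions; the disclosure only requires one codebook per language.
    import torch

    def quantize(style: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
        """style: (..., D); codebook: (K, D). Returns the nearest codebook vectors."""
        flat = style.reshape(-1, style.shape[-1])      # (N, D)
        dists = torch.cdist(flat, codebook)            # (N, K) Euclidean distances
        indices = dists.argmin(dim=-1)                 # index of the nearest entry per vector
        return codebook[indices].reshape(style.shape)

    target_language_codebook = torch.randn(64, 128)    # stand-in for the target language's codebook
    predicted_style = torch.randn(10, 128)             # predicted style features (e.g., word level)
    vectorized_style = quantize(predicted_style, target_language_codebook)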
By dividing codebooks per language, the codebook of the target language can still be used when style characteristics cross languages, which further alleviates the accent phenomenon.
S13, performing encoding and decoding processing based on the vectorized style characteristics to determine target voice of the target language.
After the vectorized style characteristics are obtained, they are added to the encoder, so that the encoder processes the language and the vectorized style characteristics to obtain an encoding result; the encoding result is input to the decoder, which performs decoding in combination with the target speaker, so as to output the target speech of the target language. Because cross-language style migration is realized in the speech synthesis, the target speech of the target language retains the style characteristics quantized with the codebook of the target language, so that accent information carried in the style is alleviated.
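The encoder and decoder themselves are not specified by the disclosure; the sketch below only illustrates the data flow (text features plus style features into an encoder, then a decoder conditioned on the target speaker producing acoustic frames), with every module standing in for an unspecified network.

    # Data-flow sketch only: injecting vectorized style features between a text encoder
    # and an acoustic decoder. All modules are placeholders; the disclosure does not
    # prescribe a specific encoder/decoder architecture or output representation.
    import torch
    import torch.nn as nn

    class StyleConditionedSynthesizer(nn.Module):
        def __init__(self, text_dim=768, style_dim=256, spk_dim=64, mel_dim=80):
            super().__init__()
            self.encoder = nn.Linear(text_dim + style_dim, 256)
            self.decoder = nn.GRU(256 + spk_dim, 256, batch_first=True)
            self.to_mel = nn.Linear(256, mel_dim)

        def forward(self, text_feats, target_style, speaker_emb):
            # text_feats: (B, T, text_dim); target_style: (B, T, style_dim); speaker_emb: (B, spk_dim)
            enc = torch.relu(self.encoder(torch.cat([text_feats, target_style], dim=-1)))
            spk = speaker_emb.unsqueeze(1).expand(-1, enc.size(1), -1)   # target speaker at every step
            dec, _ = self.decoder(torch.cat([enc, spk], dim=-1))
            return self.to_mel(dec)   # acoustic frames for a separate vocoder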
According to the speech synthesis method provided by this embodiment, different codebooks are used to vectorize style characteristics for different languages, and the codebook of the target language can still be used during cross-language style migration, so as to alleviate the accent phenomenon. Meanwhile, style prediction of the original language is performed based on the text characteristics of the target language and the identification of the original language, that is, text characteristics are used as the input data for style prediction. This removes the dependence on audio features, enhances the adaptability of the style to the language, allows style data in one language to be migrated to other languages, and reduces the data requirement.
In this embodiment, a speech synthesis method is provided, which may be used in an electronic device, such as a mobile terminal, a server, a computer, etc., and fig. 2 is a flowchart of a speech synthesis method according to an embodiment of the disclosure, as shown in fig. 2, where the flowchart includes the following steps:
s21, obtaining text characteristics of the target language and the identification of the original language.
Please refer to S11 in the embodiment shown in fig. 1 in detail, which is not described herein.
S22, carrying out style prediction of the original language based on the text characteristics of the target language and the identification of the original language to obtain style characteristics, and inquiring a codebook corresponding to the target language based on the style characteristics to obtain vectorized style characteristics.
The codebooks are in one-to-one correspondence with languages, and the codebooks are used for vectorizing style characteristics.
Specifically, the step S22 includes:
s221, performing multi-scale style prediction of the original language based on the text characteristics of the target language and the identification of the original language, and obtaining style characteristics.
Wherein the style characteristics include local style characteristics and global style characteristics.
During style prediction, a multi-scale style prediction approach is adopted to perform multi-scale style prediction of the original language from the text characteristics of the target language, obtaining local style characteristics and global style characteristics. The multi-scale style prediction may be implemented by a text-to-style prediction module, for example a Text2Style Predictor. If the original language is Chinese and the target language is English, the prediction of the Chinese style on the English text can be realized through the Text2Style Predictor.
S222, determining a local codebook and a global codebook corresponding to the target language by using the target language.
The local codebook and the global codebook correspond to the results of the multi-scale style prediction, that is, to the local style characteristics and the global style characteristics, respectively. Both the local codebook and the global codebook correspond to the target language. The local codebook is used for vectorizing local style characteristics, and the global codebook is used for vectorizing global style characteristics.
S223, based on the local style characteristics and the global style characteristics, respectively inquiring the local codebook and the global codebook to obtain vectorized style characteristics.
Wherein the vectorized style characteristics include vectorized local style characteristics and vectorized global style characteristics.
Different languages use different VQ-VAEs; assuming N languages, there are N codebooks for each of the local style VQ-VAE and the global style VQ-VAE. By dividing VQ codebooks per language, the style codebook of the target language can still be used when the style crosses languages, which further alleviates the accent phenomenon. If the speech synthesis method supports processing of N languages, the local codebook corresponding to language 1 is VQ-vae1-local and the global codebook is VQ-vae1-global; the local codebook corresponding to language 2 is VQ-vae2-local and the global codebook is VQ-vae2-global; ...; the local codebook corresponding to language N is VQ-vaeN-local and the global codebook is VQ-vaeN-global.
Continuing with the above example, if the original language is Chinese and the target language is English, the input is English text features. If English is the N-th language, the corresponding local codebook is VQ-vaeN-local and the global codebook is VQ-vaeN-global. The predicted local style and global style are used to search VQ-vaeN-local and VQ-vaeN-global, respectively, and the closest entries are obtained, namely the vectorized local style (VQ local style) and the vectorized global style (VQ global style).
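The following sketch keeps one local and one global codebook per supported language and selects the pair by the target language, mirroring the VQ-vaeN-local / VQ-vaeN-global naming in the example above; the dictionary layout, sizes, and the nearest-neighbor helper are assumptions.

    # Sketch: one (local, global) codebook pair per supported language, selected by the
    # target language at synthesis time. Sizes and the dict layout are assumptions.
    import torch

    NUM_LANGUAGES, K_LOCAL, K_GLOBAL, DIM = 4, 64, 32, 128

    codebooks = {
        lang_id: {
            "local": torch.randn(K_LOCAL, DIM),     # plays the role of VQ-vae{lang}-local
            "global": torch.randn(K_GLOBAL, DIM),   # plays the role of VQ-vae{lang}-global
        }
        for lang_id in range(NUM_LANGUAGES)
    }

    def nearest(style, codebook):
        # same nearest-neighbor lookup as the quantize() sketch above
        return codebook[torch.cdist(style, codebook).argmin(dim=-1)]

    def vectorize_style(local_style, global_style, target_lang_id):
        books = codebooks[target_lang_id]            # always the *target* language's codebooks
        return nearest(local_style, books["local"]), nearest(global_style, books["global"])

    vq_local, vq_global = vectorize_style(torch.randn(10, DIM), torch.randn(1, DIM), target_lang_id=3)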
S23, performing encoding and decoding processing based on the vectorized style characteristics to determine target voice of the target language.
Specifically, the step S23 includes:
s231, splicing the vectorized local style features and vectorized global style features to obtain target style features.
The vectorized local style features and the vectorized global style features together correspond to the style features, so before the subsequent processing they need to be spliced to obtain the target style features.
In some embodiments, the step S231 includes:
(1) A target dimension of the vectorized local style feature is obtained.
(2) The number of copies of the vectorized global style feature is determined based on the target dimension.
(3) And copying and splicing the vectorized global style features based on the copy number to obtain copy features.
(4) And splicing the copy characteristic with the vectorized local style characteristic to obtain the target style characteristic.
If the target dimension of the vectorized local style features is T and the dimension of the vectorized global style features is 1, then in order to splice the vectorized local style features with the vectorized global style features, the vectorized global style features need to be copied and spliced with the vectorized local style features in the form of repeated features.
Specifically, the vectorized global style features are copied T times, and the T copies are spliced to obtain a copy feature of dimension T. The copy feature of dimension T and the vectorized local style features of dimension T are then spliced to obtain target style features of dimension T, whose dimension is consistent with the dimension of the text features.
Splicing the vectorized global style features with the vectorized local style features in the form of repeated features yields the copy feature, and splicing the copy feature with the vectorized local style features on this basis ensures the alignment of the target style features with the text features.
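A small tensor sketch of this replicate-and-splice step follows; it assumes that splicing means concatenation along the feature axis at each of the T positions, which the disclosure does not fix explicitly.

    # Sketch: copy the single global style vector T times and splice it with the T local
    # style vectors. Concatenation along the feature axis is an assumption; the disclosure
    # only requires the result to stay aligned with the T text positions.
    import torch

    def build_target_style(vq_local: torch.Tensor, vq_global: torch.Tensor) -> torch.Tensor:
        """vq_local: (T, D_local); vq_global: (D_global,) -> (T, D_local + D_global)."""
        T = vq_local.size(0)                                  # target dimension of the local features
        copy_feature = vq_global.unsqueeze(0).expand(T, -1)   # the global vector copied T times
        return torch.cat([vq_local, copy_feature], dim=-1)    # splice copy feature with local features

    target_style = build_target_style(torch.randn(10, 128), torch.randn(128))
    print(target_style.shape)   # torch.Size([10, 256]) -> aligned with the text feature length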
S232, performing encoding and decoding processing based on the target style characteristics to determine target voice of the target language.
For details of the codec process, please refer to S13 in the embodiment shown in fig. 1, and the description is omitted here.
In some embodiments, as shown in fig. 3, in the speech synthesis processing structure, the text features of the target language are taken as input and the identification of the original language as a condition, so that the style corresponding to the original language can be predicted on the target language; the nearest entries are then searched in the codebooks corresponding to the target language to obtain the vectorized local style feature VQ local style and the vectorized global style feature VQ global style, thereby removing accent information from the style.
The vectorized global style feature VQ global style is spliced with the vectorized local style feature VQ local style in the form of repeated features to form the target style features, whose length dimension is consistent with the length dimension of the text features of the input target language. The target style features are added to the encoder, so that the encoder processes the language and the target style features, so as to output the target speech of the target language.
According to the above speech synthesis method, multi-scale style prediction is performed on the text features of the target language and the multi-scale styles are quantized; on this basis, combining the multi-scale styles with text-feature-based style prediction realizes style adaptation and solves the problem of adapting multi-scale styles to different languages. Meanwhile, the vectorized local style features and the vectorized global style features are spliced to fuse the multi-scale style features, so that encoding and decoding are performed on the basis of the multi-scale style features; in this way, cross-language style migration with high similarity can be achieved when speech is synthesized, while the accent of the target language is still preserved.
In some embodiments, the processing of S22 described above is implemented based on a style model. Based on this, S22 includes:
(1) And inputting the text characteristics of the target language and the identification of the original language into a text prediction style module of the style model to perform style prediction of the original language, so as to obtain style characteristics.
(2) And determining a vectorization codebook module corresponding to the target language in the style model based on the target language, wherein the vectorization codebook module corresponds to the languages one by one.
(3) And inputting the style characteristics into a vectorization codebook module to obtain vectorization style characteristics.
The style model includes the text prediction style module and the vectorization codebook module. The text prediction style module is used for performing style prediction of the original language to obtain the style characteristics. The vectorization codebook module corresponds to the codebook described above: it corresponds to languages one to one and is used for vectorizing the style characteristics to obtain the vectorized style characteristics.
The text prediction style module may be the above Text2Style Predictor, the multi-scale style predictor, or the like; this is not limited here and may be set according to actual requirements.
The style model thus realizes cross-language style prediction and vectorization of style characteristics. During processing by the style model, vectorization codebook modules corresponding to languages one to one are used to vectorize the style characteristics, so that the vectorized style characteristics are tied to the language; the style of the target language can be preserved, and cross-language style migration is realized on the basis of preserving the accent of the target language.
The foregoing describes that the style characteristics are derived based on a style model. On this basis, the manner of determining the style model is described in detail below. As shown in fig. 4, the style model determining method includes:
s31, obtaining the reference audio characteristics of the reference audio of the first language.
The reference audio features of the reference audio are obtained by feature extraction on the reference audio, for example mel-spectrogram features and the like.
S32, inputting the audio features into a multi-scale encoder of a preset style model for multi-scale feature encoding to obtain reference local features and reference global features.
The multi-scale encoder (multi-scale reference encoder) is used for extracting local and global features of the input audio features to obtain the reference local features and the reference global features.
S33, the reference local feature and the reference global feature are respectively input into a local vectorization codebook module and a global vectorization codebook module corresponding to the first language, so that vectorized reference local feature and vectorized reference global feature are obtained.
The vectorization codebook module corresponds to the language, and because the reference local feature and the reference global feature are obtained in the step S32, the vectorization codebook module corresponding to the first language includes a local vectorization codebook module and a global vectorization codebook module. Utilizing the reference local feature to perform nearest neighbor query in the local vectorization codebook module to obtain vectorization reference local feature; and carrying out nearest neighbor query in the global vectorization codebook module by utilizing the reference global feature to obtain the vectorized reference global feature.
S34, performing loss calculation based on the reference local feature and the reference global feature, and the vectorized reference local feature and the vectorized reference global feature to obtain vectorized loss.
Because the vectorization codebook module performs a nearest-neighbor search, vectorization-related losses are used to train it, so that the vectorized reference local feature (reference VQ local style) becomes closer to the reference local feature. For example, to make the vectorized reference local feature approach the reference local feature (reference local style), the local vectorization loss vq_loss_local is calculated using the following formula:
vq_loss_local = MSE(reference VQ local style, sg(reference local style)) + β₁ * MSE(sg(reference VQ local style), reference local style)
where sg(·) denotes the stop-gradient operation and β₁ is a constant.
Accordingly, to make the vectorized reference global feature (reference VQ global style) approach the reference global feature (reference global style), the global vectorization loss vq_loss_global is calculated using the following formula:
vq_loss_global = MSE(reference VQ global style, sg(reference global style)) + β₂ * MSE(sg(reference VQ global style), reference global style)
where β₂ is a constant.
After the local vectorization loss and the global vectorization loss are obtained, the two are fused, for example by a weighted average, to obtain the vectorization loss. Of course, the local vectorization loss and the global vectorization loss may also be fused in other manners; the implementation is not limited here and may be set according to actual requirements.
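A sketch of the two vectorization losses follows, reading sg(·) as the stop-gradient operator implemented with detach(), as in standard VQ-VAE training; the β values and the equal-weight fusion are illustrative choices, not values from the disclosure.

    # Sketch of the vectorization losses above, assuming sg(.) is stop-gradient (detach()).
    # Beta values and the equal-weight fusion are illustrative, not values from the disclosure.
    import torch
    import torch.nn.functional as F

    def vq_loss(reference_vq: torch.Tensor, reference: torch.Tensor, beta: float) -> torch.Tensor:
        codebook_term = F.mse_loss(reference_vq, reference.detach())     # pull codebook entries toward the encoder output
        commitment_term = F.mse_loss(reference_vq.detach(), reference)   # pull the encoder output toward the codebook
        return codebook_term + beta * commitment_term

    # stand-ins for the encoder outputs and their quantized versions
    reference_local_style = torch.randn(10, 128, requires_grad=True)
    reference_vq_local_style = torch.randn(10, 128, requires_grad=True)
    reference_global_style = torch.randn(1, 128, requires_grad=True)
    reference_vq_global_style = torch.randn(1, 128, requires_grad=True)

    vq_loss_local = vq_loss(reference_vq_local_style, reference_local_style, beta=0.25)
    vq_loss_global = vq_loss(reference_vq_global_style, reference_global_style, beta=0.25)
    vectorization_loss = 0.5 * vq_loss_local + 0.5 * vq_loss_global      # simple weighted fusion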
And S35, updating parameters of the preset style model based on the vectorization loss to determine the style model.
Based on the vectorization loss, the parameters of the preset style model are updated; after multiple iterations, the parameters of the preset style model are fixed to determine the style model. The stopping condition of the iteration may be that the number of iterations reaches a preset number, that the vectorization loss is smaller than a preset loss value, or the like.
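A schematic training loop for this step is sketched below, using only the vectorization loss (the style prediction loss introduced later would simply be added to it); the optimizer, learning rate, thresholds, and the vectorization_loss helper on the model are assumptions.

    # Schematic training loop: update the preset style model with the vectorization loss
    # until an iteration budget or a loss threshold is reached. Optimizer, learning rate,
    # thresholds, and the model.vectorization_loss helper are assumptions.
    import torch

    def train_style_model(model, data_loader, max_iters=10_000, loss_threshold=1e-3):
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        for step, reference_audio_features in enumerate(data_loader):
            loss = model.vectorization_loss(reference_audio_features)   # hypothetical helper on the model
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # stopping conditions described above: iteration count or loss value
            if step + 1 >= max_iters or loss.item() < loss_threshold:
                break
        return model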
When the preset style model is trained, the vectorized reference features (i.e., the vectorized reference local features and the vectorized reference global features) are made closer to the reference features (the reference local features and the reference global features); updating the preset style model with the vectorization loss ensures the reliability of the style model obtained by training.
In some embodiments, fig. 5 shows a schematic diagram of a style model, based on which the above S32 includes:
(1) Inputting the reference audio features into a multi-scale encoder of a preset style model for multi-scale feature coding to obtain local features and reference global features.
(2) The local features are input into an attention module for aligning the reference local features with the reference text features, resulting in reference local features.
In the training stage of the style model, the audio features are first passed through the multi-scale encoder (multi-scale reference encoder) to obtain the local features and the reference global features. On this basis, the local features are input into an attention module to obtain the reference local features, so that the reference local features are aligned with the reference text features. Specifically, the attention module works as follows: the reference text features are taken as the query, the local features are taken as the key and the value, and the output is the reference local features, whose length is consistent with that of the reference text features.
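A sketch of this alignment step using standard multi-head cross-attention (text features as query, audio-derived local features as key and value) follows; the head count and dimensions are arbitrary, and the disclosure does not mandate this particular attention implementation.

    # Sketch of the alignment attention: text features as query, audio-derived local features
    # as key/value, so the output has one style vector per text position. Dimensions and the
    # head count are illustrative assumptions.
    import torch
    import torch.nn as nn

    dim, heads = 128, 4
    attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

    reference_text_features = torch.randn(1, 12, dim)   # length 12 (text positions)
    local_features = torch.randn(1, 50, dim)            # length 50 (audio frames / segments)

    reference_local_features, _ = attn(
        query=reference_text_features,   # output is aligned to the text length
        key=local_features,
        value=local_features,
    )
    print(reference_local_features.shape)   # torch.Size([1, 12, 128]) -> matches the text length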
In some embodiments, the Text2Style Predictor takes the reference text features as input data and the identification of the language as a condition, and is trained to predict the local style features and the global style features; in this way, the dependence on reference audio features is removed at the time of use, and the adaptability of the style to the language is enhanced. The targets of these two predicted features correspond to the reference local features and the reference global features, respectively, and on this basis a style prediction loss is introduced into the loss calculation. Fig. 5 shows a schematic diagram of the style model; based on fig. 5, the above S35 includes:
(1) And acquiring the reference text characteristics of the second language and the identification of the first language.
(2) And inputting the reference text characteristics and the identification of the first language into a text prediction style module to obtain the prediction local style characteristics and the prediction global style characteristics of the first language.
(3) And carrying out loss calculation based on the predicted local style characteristics and the predicted global style characteristics of the first language and the reference local characteristics and the reference global characteristics to obtain style prediction loss.
(4) Based on the vectorization loss and the style prediction loss, updating parameters of a preset style model to determine the style model.
As shown in fig. 5, the reference text features and the identification of the first language are input into the text prediction style module (Text2Style Predictor), which outputs the predicted local style features (local style) and the predicted global style features (global style) of the first language. As described above, the targets of the predicted local style features and the predicted global style features of the first language correspond to the reference local features (reference local style) and the reference global features (reference global style), respectively, and the style prediction loss is calculated on this basis.
For example, style prediction loss style_predictor_loss is expressed by the following formula:
style_predictor_loss = MSE(local style, reference local style) + MSE(global style, reference global style)
Of course, the above formulas are only examples of how the vectorization loss and the style prediction loss may be calculated and do not limit the protection scope of the present disclosure; the specific loss functions are set according to actual requirements.
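A sketch of the style prediction loss as plain MSE terms follows; the tensor shapes are illustrative, and summing it with the vectorization loss without additional weights is an assumption.

    # Sketch: style prediction loss as MSE between predicted and reference styles.
    # Shapes are illustrative; the unweighted combination with the vectorization loss is an assumption.
    import torch
    import torch.nn.functional as F

    local_style = torch.randn(12, 128, requires_grad=True)     # predicted by the Text2Style Predictor
    global_style = torch.randn(1, 128, requires_grad=True)
    reference_local_style = torch.randn(12, 128)                # from the multi-scale reference encoder
    reference_global_style = torch.randn(1, 128)

    style_predictor_loss = (
        F.mse_loss(local_style, reference_local_style)
        + F.mse_loss(global_style, reference_global_style)
    )
    # During training this is combined with the vectorization loss from S34, e.g.:
    # total_loss = vectorization_loss + style_predictor_loss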
The text prediction style module, which takes the reference text features as input data and the identification of the first language as a condition, produces the predicted local style features and the predicted global style features; the targets of these two features correspond to the reference local features and the reference global features, respectively, so the style prediction loss is introduced. In this way, the dependence on audio features can be removed when the text prediction style module is used, and the adaptability of styles to languages is enhanced.
As a specific application example of speech synthesis in the embodiments of the present disclosure, there is a need in AI video translation to migrate a user's style in one language to another, non-native language. Video translation generally refers to translating the speech in a video from an original language into a target language while ensuring consistency between the translated speech and the picture. Typically, video translation is composed of a cascade of multiple systems, including speech recognition, machine translation, and speech synthesis. To ensure that the translated speech corresponds to the original video, the text length is usually controlled in the machine translation stage, and the length of the synthesized speech is then adjusted in the speech synthesis stage. For example, if Chinese speech needs to be translated into English speech, speech recognition is first performed to convert the Chinese speech into Chinese text, and machine translation is then applied to obtain the English text. The subsequent processing of the English text is the speech synthesis processing described in the embodiments of the present disclosure. Feature processing is performed on the English text to obtain English text features, and the English text features and the Chinese identification are input into the style model, so that the text prediction style module in the style model predicts the Chinese style on the English text. If the N-th language in the style model corresponds to English, the predicted local style features and global style features are used to search VQ-vaeN-local and VQ-vaeN-global, respectively, to obtain the closest entries, namely the vectorized local style features VQ local style and the vectorized global style features VQ global style. The vectorized global style features are spliced with the vectorized local style features in the form of repeated features to obtain the target style features. The target style features are added to the encoder for encoding, and decoding is then performed on the basis of the encoding result, so as to output the English target speech.
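A high-level sketch of this cascaded video-translation pipeline is given below; every component and identifier is a placeholder, since the disclosure does not tie the method to specific speech recognition or machine translation systems.

    # High-level sketch of the cascaded pipeline (speech recognition -> machine translation ->
    # style-aware speech synthesis). All components and identifiers are placeholders; only the
    # ordering follows the example above.
    CHINESE_ID, ENGLISH_ID = 0, 1   # illustrative language identifications

    def translate_video_speech(chinese_audio, asr, translator, style_model, synthesizer):
        chinese_text = asr.transcribe(chinese_audio)                    # speech recognition
        english_text = translator.translate(chinese_text,
                                            length_control=True)        # machine translation with length control
        text_feats = synthesizer.extract_text_features(english_text)    # English text features
        local_style, global_style = style_model.predict(                # Chinese style predicted on English text
            text_feats, original_language_id=CHINESE_ID)
        vq_local, vq_global = style_model.vectorize(                    # lookup in the English (target) codebooks
            local_style, global_style, target_language_id=ENGLISH_ID)
        return synthesizer.synthesize(text_feats, vq_local, vq_global)  # English speech with the Chinese style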
In this embodiment, a speech synthesis apparatus is further provided, and the speech synthesis apparatus is used to implement the foregoing embodiments and preferred embodiments, and will not be described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a speech synthesis apparatus, as shown in fig. 6, including:
an obtaining module 41, configured to obtain text features of a target language and an identifier of an original language;
the style vectorization module 42 is configured to predict a style of the original language based on the text feature of the target language and the identifier of the original language, obtain a style feature, and query a codebook corresponding to the target language based on the style feature, obtain vectorized style features, where the codebook corresponds to the languages one by one, and the codebook is used for vectorizing the style features;
and the processing module 43 is configured to perform a codec process based on the vectorized style feature to determine a target voice of the target language.
In some implementations, the style vectorization module 42 includes:
the prediction unit is used for carrying out multi-scale style prediction of the original language based on the text characteristics of the target language and the identification of the original language to obtain the style characteristics, wherein the style characteristics comprise local style characteristics and global style characteristics;
the first determining unit is used for determining a local codebook and a global codebook corresponding to the target language by utilizing the target language;
and the query unit is used for respectively querying the local codebook and the global codebook based on the local style characteristic and the global style characteristic to obtain the vectorized style characteristic, wherein the vectorized style characteristic comprises vectorized local style characteristic and vectorized global style characteristic.
In some embodiments, the processing module 43 includes:
the splicing unit is used for splicing the vectorized local style characteristics and vectorized global style characteristics to obtain target style characteristics;
and the processing unit is used for carrying out encoding and decoding processing based on the target style characteristics so as to determine target voice of the target language.
In some embodiments, the splice unit comprises:
the acquisition subunit is used for acquiring the target dimension of the vectorized local style characteristic;
a determining subunit, configured to determine a copy number of the vectorized global style feature based on the target dimension;
the first splicing subunit is used for copying and splicing the vectorized global style characteristics based on the copy number to obtain copy characteristics;
and the second splicing subunit is used for splicing the replication characteristic and the vectorized local style characteristic to obtain the target style characteristic.
In some implementations, the style vectorization module 42 includes:
the input unit is used for inputting the text characteristics of the target language and the identification of the original language into the text prediction style module of the style model to perform style prediction of the original language so as to obtain the style characteristics;
the second determining unit is used for determining a vectorization codebook module corresponding to the target language in the style model based on the target language, wherein the vectorization codebook module corresponds to the languages one by one;
the first vectorization unit is used for inputting the style characteristics into the vectorization codebook module to obtain the vectorized style characteristics.
In some embodiments, the style model determination module includes:
the acquisition unit is used for acquiring the reference audio characteristics of the reference audio of the first language;
the coding unit is used for inputting the audio features into a multi-scale coder of a preset style model to carry out multi-scale feature coding to obtain reference local features and reference global features;
the second vectorization unit is used for inputting the reference local feature and the reference global feature into the local vectorization codebook module and the global vectorization codebook module corresponding to the first language respectively to obtain vectorized reference local feature and vectorized reference global feature;
the loss unit is used for carrying out loss calculation to obtain vectorization loss based on the reference local feature, the reference global feature, the vectorized reference local feature and the vectorized reference global feature;
and the updating unit is used for updating the parameters of the preset style model based on the vectorization loss so as to determine the style model.
In some embodiments, the updating unit comprises:
the obtaining subunit is used for obtaining the reference text characteristics of the second language and the identification of the first language;
the prediction subunit is used for inputting the reference text characteristics and the identification of the first language into the text prediction style module to obtain the predicted local style characteristics and the predicted global style characteristics of the first language;
the loss subunit is used for carrying out loss calculation to obtain style prediction loss based on the predicted local style characteristic and the predicted global style characteristic of the first language and the reference local characteristic and the reference global characteristic;
and the updating subunit is used for updating the parameters of the preset style model based on the vectorization loss and the style prediction loss so as to determine the style model.
In some embodiments, the encoding unit includes:
the coding subunit is used for inputting the reference audio features into a multi-scale coder of a preset style model to perform multi-scale feature coding to obtain local features and reference global features;
and the attention subunit is used for inputting the local feature into an attention module to obtain the reference local feature, and the attention module is used for aligning the reference local feature with the reference text feature.
The speech synthesis apparatus in this embodiment is presented in the form of functional units, where the units refer to ASIC circuits, processors and memories executing one or more software or firmware programs, and/or other devices that can provide the functionality described above.
Further functional descriptions of the above respective modules are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the disclosure also provides an electronic device, which is provided with the voice synthesis device shown in fig. 6.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an alternative embodiment of the disclosure. As shown in fig. 7, the electronic device may include: at least one processor 51, such as a CPU (Central Processing Unit), at least one communication interface 53, a memory 54, and at least one communication bus 52, where the communication bus 52 is used to enable connected communication between these components. The communication interface 53 may include a display screen (Display) and a keyboard (Keyboard); optionally, the communication interface 53 may further include a standard wired interface and a wireless interface. The memory 54 may be a high-speed RAM memory (volatile random access memory) or a non-volatile memory, such as at least one disk memory. The memory 54 may alternatively be at least one storage device located remotely from the aforementioned processor 51. The processor 51 may cooperate with the apparatus described in fig. 6; the memory 54 stores an application program, and the processor 51 invokes the program code stored in the memory 54 to perform any of the method steps described above.
The communication bus 52 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication bus 52 may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or only one type of bus.
The memory 54 may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 54 may also include a combination of the above types of memory.
The processor 51 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 51 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Optionally, the memory 54 is also used for storing program instructions. The processor 51 may invoke program instructions to implement the speech synthesis method as shown in any of the embodiments of the present application.
The disclosed embodiments also provide a non-transitory computer storage medium storing computer-executable instructions that can perform the speech synthesis method of any of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the storage medium may also include a combination of the above types of memory.
Although embodiments of the present disclosure have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such modifications and variations are within the scope defined by the appended claims.

Claims (11)

1. A method of speech synthesis, comprising:
acquiring text characteristics of a target language and an identification of an original language;
carrying out style prediction of the original language based on the text characteristics of the target language and the identification of the original language to obtain style characteristics, inquiring a codebook corresponding to the target language based on the style characteristics to obtain vectorized style characteristics, wherein the codebook corresponds to the languages one by one and is used for vectorizing the style characteristics;
and performing encoding and decoding processing based on the vectorized style characteristics to determine the target voice of the target language.
2. The method according to claim 1, wherein the performing style prediction of the original language based on the text feature of the target language and the identification of the original language to obtain style features, and querying a codebook corresponding to the target language based on the style features to obtain vectorized style features includes:
performing multi-scale style prediction of the original language based on the text features of the target language and the identification of the original language to obtain the style features, wherein the style features comprise local style features and global style features;
determining a local codebook and a global codebook corresponding to the target language by utilizing the target language;
and respectively inquiring the local codebook and the global codebook based on the local style characteristic and the global style characteristic to obtain the vectorized style characteristic, wherein the vectorized style characteristic comprises vectorized local style characteristic and vectorized global style characteristic.
3. The method of claim 2, wherein the performing a codec process based on the vectorized style characteristics to determine the target speech in the target language comprises:
splicing the vectorized local style characteristics and vectorized global style characteristics to obtain target style characteristics;
and performing encoding and decoding processing based on the target style characteristics to determine target voice of the target language.
4. The method of claim 3, wherein the stitching the vectorized local style feature with the vectorized global style feature to obtain a target style feature comprises:
obtaining a target dimension of the vectorized local style feature;
determining the copy number of the vectorized global style feature based on the target dimension;
copying and splicing the vectorized global style features based on the copy number to obtain copy features;
and splicing the copy characteristic with the vectorized local style characteristic to obtain the target style characteristic.
5. The method according to claim 1, wherein the performing style prediction of the original language based on the text features of the target language and the identification of the original language to obtain style features, and querying a codebook corresponding to the target language based on the style features to obtain vectorized style features comprises:
inputting the text features of the target language and the identification of the original language into a text prediction style module of a style model to perform style prediction of the original language, so as to obtain the style features;
determining, based on the target language, a vectorization codebook module corresponding to the target language in the style model, wherein vectorization codebook modules are in one-to-one correspondence with languages;
and inputting the style features into the vectorization codebook module to obtain the vectorized style features.
6. The method of claim 5, wherein determining the style model comprises:
acquiring reference audio features of reference audio in a first language;
inputting the reference audio features into a multi-scale encoder of a preset style model for multi-scale feature encoding to obtain reference local features and reference global features;
respectively inputting the reference local features and the reference global features into a local vectorization codebook module and a global vectorization codebook module corresponding to the first language to obtain vectorized reference local features and vectorized reference global features;
performing loss calculation based on the reference local features and the reference global features, and the vectorized reference local features and the vectorized reference global features, to obtain a vectorization loss;
and updating parameters of the preset style model based on the vectorization loss to determine the style model.
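The exact form of the vectorization loss is not spelled out in claim 6; a common choice for codebook-based models is the VQ-VAE-style combination of a codebook term and a commitment term, which the sketch below uses purely as an assumption (the weight beta is likewise illustrative).

```python
import torch
import torch.nn.functional as F

def vectorization_loss(ref: torch.Tensor, vq_ref: torch.Tensor, beta: float = 0.25) -> torch.Tensor:
    # codebook term: pull the chosen codebook entries toward the encoder outputs;
    # commitment term: keep the encoder outputs close to their chosen entries
    codebook_term = F.mse_loss(vq_ref, ref.detach())
    commitment_term = F.mse_loss(ref, vq_ref.detach())
    return codebook_term + beta * commitment_term

ref_local, vq_local = torch.randn(2, 50, 16), torch.randn(2, 50, 16)
ref_global, vq_global = torch.randn(2, 1, 16), torch.randn(2, 1, 16)

vq_loss = vectorization_loss(ref_local, vq_local) + vectorization_loss(ref_global, vq_global)
```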
7. The method of claim 6, wherein the updating parameters of the preset style model based on the vectorization loss to determine the style model comprises:
acquiring reference text features of a second language and an identification of the first language;
inputting the reference text features and the identification of the first language into the text prediction style module to obtain predicted local style features and predicted global style features of the first language;
performing loss calculation based on the predicted local style features and the predicted global style features of the first language, and the reference local features and the reference global features, to obtain a style prediction loss;
and updating parameters of the preset style model based on the vectorization loss and the style prediction loss to determine the style model.
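A minimal sketch of the joint update in claim 7, assuming the two losses are simply summed and optimized with Adam; the stand-in text_to_style module, the mean pooling used for the global prediction, and the equal loss weighting are all assumptions rather than the patent's actual design.

```python
import torch
import torch.nn.functional as F

# stand-in for the text prediction style branch of the preset style model
text_to_style = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(text_to_style.parameters(), lr=1e-4)

ref_text = torch.randn(2, 50, 16)     # reference text features of the second language
ref_local = torch.randn(2, 50, 16)    # reference local features from the multi-scale encoder
ref_global = torch.randn(2, 1, 16)    # reference global feature from the multi-scale encoder
vq_loss = torch.tensor(0.1)           # vectorization loss carried over from the claim-6 step

pred_local = text_to_style(ref_text)                # predicted local style features
pred_global = pred_local.mean(dim=1, keepdim=True)  # predicted global style feature

style_pred_loss = F.mse_loss(pred_local, ref_local) + F.mse_loss(pred_global, ref_global)
total_loss = vq_loss + style_pred_loss              # joint objective

optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```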
8. The method of claim 7, wherein the inputting the reference audio features into the multi-scale encoder of the preset style model for multi-scale feature encoding to obtain reference local features and reference global features comprises:
inputting the reference audio features into the multi-scale encoder of the preset style model for multi-scale feature encoding to obtain local features and the reference global features;
and inputting the local features into an attention module to obtain the reference local features, wherein the attention module is used for aligning the reference local features with the reference text features.
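The attention module of claim 8 aligns the frame-level local features with the reference text features; one plausible realization is cross-attention with the text features as queries, sketched below (the single head and the dimensions are assumptions).

```python
import torch

# single-head cross-attention; text features act as queries, local features as keys/values
attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=1, batch_first=True)

text_feat = torch.randn(2, 30, 16)   # reference text features
local_feat = torch.randn(2, 50, 16)  # frame-level local features from the multi-scale encoder

ref_local, _ = attn(query=text_feat, key=local_feat, value=local_feat)
# ref_local has shape (2, 30, 16): local style information aligned to the text sequence
```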
9. A speech synthesis apparatus, comprising:
an acquisition module, configured to acquire text features of a target language and an identification of an original language;
a style vectorization module, configured to perform style prediction of the original language based on the text features of the target language and the identification of the original language to obtain style features, and query a codebook corresponding to the target language based on the style features to obtain vectorized style features, wherein codebooks are in one-to-one correspondence with languages and are used for vectorizing style features;
and a processing module, configured to perform encoding and decoding processing based on the vectorized style features to determine target speech of the target language.
10. An electronic device, comprising:
a memory and a processor, the memory and the processor being communicatively coupled to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the speech synthesis method of any one of claims 1 to 8.
11. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions for causing a computer to perform the speech synthesis method of any one of claims 1 to 8.
CN202310141134.4A 2023-02-16 2023-02-16 Speech synthesis method, device, electronic equipment and storage medium Pending CN116312459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310141134.4A CN116312459A (en) 2023-02-16 2023-02-16 Speech synthesis method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310141134.4A CN116312459A (en) 2023-02-16 2023-02-16 Speech synthesis method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116312459A true CN116312459A (en) 2023-06-23

Family

ID=86788008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310141134.4A Pending CN116312459A (en) 2023-02-16 2023-02-16 Speech synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116312459A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727288A (en) * 2024-02-07 2024-03-19 翌东寰球(深圳)数字科技有限公司 Speech synthesis method, device, equipment and storage medium
CN117727288B (en) * 2024-02-07 2024-04-30 翌东寰球(深圳)数字科技有限公司 Speech synthesis method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
KR102565274B1 (en) Automatic interpretation method and apparatus, and machine translation method and apparatus
US20230025317A1 (en) Text classification model training method, text classification method, apparatus, device, storage medium and computer program product
JP5901001B1 (en) Method and device for acoustic language model training
CN110797026A (en) Voice recognition method, device and storage medium
JP2020004382A (en) Method and device for voice interaction
CN116312459A (en) Speech synthesis method, device, electronic equipment and storage medium
CN112017643A (en) Speech recognition model training method, speech recognition method and related device
CN112395888A (en) Machine translation apparatus and method
CN112464655A (en) Word vector representation method, device and medium combining Chinese characters and pinyin
CN110287498B (en) Hierarchical translation method, device and storage medium
CN110913229A (en) RNN-based decoder hidden state determination method, device and storage medium
CN111160036A (en) Method and device for updating machine translation model based on neural network
CN110852063B (en) Word vector generation method and device based on bidirectional LSTM neural network
CN111178097B (en) Method and device for generating Zhongtai bilingual corpus based on multistage translation model
CN111985251B (en) Translation quality evaluation method and device
KR102183284B1 (en) System and method for tracking dialog state in a cross-language
CN116825084A (en) Cross-language speech synthesis method and device, electronic equipment and storage medium
JP7194759B2 (en) Translation data generation system
KR20190031840A (en) Method and apparatus for generating oos(out-of-service) sentence
US11580968B1 (en) Contextual natural language understanding for conversational agents
CN112836526A (en) Multi-language neural machine translation method and device based on gating mechanism
CN112926334A (en) Method and device for determining word expression vector and electronic equipment
CN111666774A (en) Machine translation method and device based on document context
CN110866404A (en) Word vector generation method and device based on LSTM neural network
CN113343716B (en) Multilingual translation method, device, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination