CN116189652A - Speech synthesis method and device, readable medium and electronic equipment

Publication number: CN116189652A
Authority: CN (China)
Prior art keywords: information, annotation, target, acoustic, target text
Legal status: Pending
Application number: CN202310186648.1A
Other languages: Chinese (zh)
Inventors: 何爽爽, 马泽君
Assignee: Beijing Youzhuju Network Technology Co Ltd (original assignee)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to a speech synthesis method and device, a readable medium and an electronic device. Paralinguistic annotation is performed on a target text to be processed, where the paralanguage includes multiple types and different types of paralanguage correspond to different annotation information; a first acoustic annotation corresponding to the target text is acquired, where the first acoustic annotation includes first annotation information for the paralanguage in the target text and second annotation information for the remaining, non-paralinguistic content of the target text; and target audio corresponding to the target text is generated according to the first annotation information and the second annotation information.

Description

Speech synthesis method and device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a method and apparatus for synthesizing speech, a readable medium, and an electronic device.
Background
Speech synthesis technology brings great convenience to people's lives; TTS (Text To Speech) technology, for example, converts text into a speech stream. Deep-learning-based speech synthesis, represented by Tacotron, already analyzes and models the linguistic part of speech well. In spoken dialogue, however, a real speaker produces hesitations, slips of the tongue, mouth clicks and other paralinguistic phenomena, which current TTS technology cannot synthesize well, so the naturalness and human-likeness of the synthesized audio still leave room for improvement.
Disclosure of Invention
This section is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This section is not intended to represent a critical or essential feature of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method of speech synthesis, the method comprising:
performing paralinguistic annotation on a target text to be processed, wherein the paralanguage includes multiple types and different types of paralanguage correspond to different annotation information;
acquiring a first acoustic annotation corresponding to the target text, wherein the first acoustic annotation includes first annotation information for the paralanguage in the target text and second annotation information for the non-paralinguistic content of the target text;
and generating target audio corresponding to the target text according to the first annotation information and the second annotation information.
In a second aspect, the present disclosure provides a speech synthesis apparatus, the apparatus comprising:
an annotation module, configured to perform paralinguistic annotation on a target text to be processed, wherein the paralanguage includes multiple types and different types of paralanguage correspond to different annotation information;
an acquisition module, configured to acquire a first acoustic annotation corresponding to the target text, wherein the first acoustic annotation includes first annotation information for the paralanguage in the target text and second annotation information for the non-paralinguistic content of the target text;
and a speech synthesis module, configured to generate target audio corresponding to the target text according to the first annotation information and the second annotation information.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having at least one computer program stored thereon;
at least one processing means for executing the at least one computer program in the storage means to carry out the steps of the method of the first aspect of the present disclosure.
With the above technical solution, the target text is annotated with paralanguage, so that the target audio corresponding to the target text can be generated from both the first annotation information of the paralanguage in the target text and the second annotation information of the remaining content of the target text. Paralinguistic phenomena are therefore synthesized well in the target audio, the poor synthesis quality of existing speech synthesis methods is avoided, and the naturalness and human-likeness of the synthesized speech are improved.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
fig. 1 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment.
Fig. 2 is a flow chart of a speech synthesis method according to the embodiment shown in fig. 1.
FIG. 3 is a schematic diagram of a paralinguistic annotation according to an exemplary embodiment.
Fig. 4a and 4b are schematic diagrams of further paralinguistic annotations according to an exemplary embodiment.
Fig. 5 is a flow chart of a speech synthesis method according to the embodiment shown in fig. 1.
Fig. 6 is a flow chart illustrating a method of speech synthesis according to the embodiment shown in fig. 5.
Fig. 7 is a block diagram illustrating a speech synthesis apparatus according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating a configuration of an electronic device according to an exemplary embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "a", "an" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
All actions in this disclosure to obtain signals, information or data are performed in compliance with the corresponding data protection legislation policies of the country of location and to obtain authorization granted by the owner of the corresponding device.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with the relevant laws and regulations, of the type, scope of use and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly inform the user that the requested operation will require the acquisition and use of the user's personal information. The user can thus autonomously decide, according to the prompt, whether to provide personal information to the software or hardware (such as an electronic device, application, server or storage medium) that executes the operations of the technical solution of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, in a popup window, in which the prompt information may be presented as text. The popup window may also carry a selection control allowing the user to choose to 'agree' or 'disagree' to provide personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
Meanwhile, it can be understood that the data (including but not limited to the data itself, the acquisition or the use of the data) related to the technical scheme should conform to the requirements of the corresponding laws and regulations and related regulations.
The present disclosure is mainly applied to speech synthesis scenarios in which text is converted into a speech stream. Taking spoken dialogue as an example, a real speaker produces many paralinguistic phenomena during a conversation. Paralanguage refers to the vocal modification components produced in normal speech interaction; it includes, for example, prosodic features such as stress, pauses and lengthening, and vocal phenomena such as laughing, crying and inhaling, and people convey emotion through paralanguage.
In existing speech synthesis, analysis and modeling of the linguistic part can be achieved well with TTS technology, but paralinguistic phenomena are synthesized poorly, so the synthesized audio sounds mechanical and its naturalness and human-likeness still leave room for improvement.
To solve the above problems, the present disclosure provides a speech synthesis method and apparatus, a readable medium and an electronic device. Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment, as shown in fig. 1, the method comprising the steps of:
In step S101, paralinguistic annotation is performed on a target text to be processed, where the paralanguage includes multiple types and different types of paralanguage correspond to different annotation information.
The paralanguage may include, for example, breath sounds, disfluent pauses caused by slips of the tongue, inhalation sounds, mouth clicks, phoneme errors caused by slips of the tongue, blurred pronunciation, incomplete syllables caused by slips of the tongue, laughter, and the like.
For example, a disfluent pause is an abnormal pause caused by speaker hesitation; it may occur inside a prosodic word or at a prosodic word boundary, and when it occurs at a prosodic word boundary the end of the preceding word sounds abrupt. An inhalation sound is the hissing-like sound produced when an inward airflow passes through the gaps between the tooth tips and the gums. A mouth click is the sound made when the tongue tip releases from the upper teeth and gums, like the tsk-like sound in Chinese. A phoneme error arises when an incompletely uttered word affects the beginning or end of a syllable, producing a mispronounced phoneme in that syllable. Blurred pronunciation means that, because of a slip of the tongue, fast speech and the like, the speaker's pronunciation lies between the correct pronunciation and some other sound and cannot be represented exactly by the pinyin system. An incomplete syllable caused by a slip of the tongue means that an extra, incompletely pronounced syllable is inserted, possibly consisting of only a consonant or only a vowel part. Laughter may include, for example, various laugh syllables such as hum-, ha-, hey-, hiccup- and o-like sounds.
When annotating the target text with paralanguage, different symbols can be preset to represent the different paralanguage types. For example, breath sounds may be labeled uvd, disfluent pauses dp, inhalation sounds igr and mouth clicks dc; for a phoneme error a preset marker symbol may be appended to the erroneous phoneme, for blurred pronunciation the marker symbol $ may be appended to the blurred phoneme, and an incompletely pronounced syllable may be labeled with the initial or final closest to its pronunciation. The labels for the various paralinguistic phenomena given here are merely illustrative and do not limit this disclosure.
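Purely as an illustration of the symbol scheme above, the preset labels could be kept in a small lookup table. The sketch below uses assumed Python names; in particular, the asterisk used for the phoneme-error marker is an assumption, since the concrete symbol is not shown in the text above.

```python
# Illustrative lookup table for the paralanguage label symbols named above.
# PHONEME_ERROR_MARK is an assumed placeholder; the disclosure only fixes
# that some preset symbol is appended to the erroneous phoneme.
PARALANGUAGE_LABELS = {
    "breath_sound": "uvd",        # breath/ventilation sound
    "disfluent_pause": "dp",      # pause caused by a slip of the tongue
    "inhalation_sound": "igr",    # hissing inward airflow
    "mouth_click": "dc",          # tsk-like mouth click
}
PHONEME_ERROR_MARK = "*"          # assumption, appended to an erroneous phoneme
BLURRED_PRONUNCIATION_MARK = "$"  # appended to a blurred phoneme
```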
In step S102, a first acoustic annotation corresponding to the target text is acquired, where the first acoustic annotation includes first annotation information for the paralanguage in the target text and second annotation information for the non-paralinguistic content of the target text.
Take the target text "{haha}, [I know] <3+> <uvd>, <igr>, it certainly has its reasons" as an example, where {haha} denotes laughter, [I know] denotes speech uttered in a laughing voice, <3+> denotes lengthening after an intonation phrase (a prosodic phenomenon: different prosodic boundaries can be represented by <num>, and the marker + denotes lengthening at the prosodic boundary; prosodic boundaries may include sentence boundaries, intonation phrase boundaries, prosodic phrase boundaries and prosodic word boundaries, which correspond to different pause durations, for example <1> denotes a prosodic word, <2> a prosodic phrase, <3> an intonation phrase and <4> a sentence), <uvd> denotes a breath sound and <igr> denotes an inhalation sound. The first annotation information is thus the acoustic annotation corresponding to the paralinguistic phenomena in the target text (the laughter, laughing voice, lengthening, breath sound and inhalation sound), and the second annotation information is the acoustic annotation corresponding to the remaining, non-paralinguistic content of the target text.
In addition, the first annotation information and the second annotation information each include first phoneme annotation information, first tone annotation information and first prosody annotation information. The first phoneme annotation information is annotation information related to the phonemes of each word and of the paralanguage in the target text. The tone annotation information reflects tone-related content: the first tone annotation information of the target text may be the tone type of each word in the target text, and the tone types may include, but are not limited to, the first tone (yinping, high level), the second tone (yangping, rising), the third tone (shangsheng), the fourth tone (qusheng, falling), the neutral tone, and a third tone that requires tone sandhi. The first prosody annotation information reflects prosody-related content, which may include, but is not limited to, prosodic boundary information. A prosodic boundary (BRK), also called a break index, describes how information is organized and segmented into clauses in the speech stream. Prosodic boundary information may include, but is not limited to, sentence boundaries, intonation phrase boundaries, prosodic phrase boundaries and prosodic word boundaries.
The tone annotations in the first tone annotation information correspond one-to-one to the phoneme annotations in the first phoneme annotation information, and the prosody annotations in the first prosody annotation information likewise correspond one-to-one to the phoneme annotations. Moreover, the phonemes of one syllable share the same tone annotation and prosody annotation; that is, the tone annotation of each phoneme that makes up a syllable equals the tone annotation of that syllable, and the prosody annotation of each such phoneme equals the prosody annotation of the syllable.
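The one-to-one alignment and the syllable-level sharing can be pictured with a small record per phoneme. The sketch below is only illustrative; the concrete syllable, tone code and boundary values are hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class PhonemeLabel:
    phoneme: str  # phoneme symbol, or a paralanguage symbol such as "uvd"
    tone: str     # tone label, shared by all phonemes of the same syllable
    prosody: str  # prosodic-boundary label, shared by the same syllable

# Hypothetical third-tone syllable "hao" at an intonation-phrase boundary (<3>),
# followed by a breath sound that copies the boundary of the preceding word.
labels = [
    PhonemeLabel(phoneme="h",   tone="3",    prosody="3"),
    PhonemeLabel(phoneme="ao",  tone="3",    prosody="3"),
    PhonemeLabel(phoneme="uvd", tone="none", prosody="3"),
]
```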
In step S103, target audio corresponding to the target text is generated according to the first annotation information and the second annotation information.
In this step, acoustic feature information corresponding to the target text may be determined from the first phoneme annotation information, the first tone annotation information and the first prosody annotation information by a pre-trained target speech synthesis model, and the target audio may then be generated from the acoustic feature information. The acoustic feature information may be, for example, a Mel spectrogram or a linear spectrogram, so that the target audio can be generated by a preset vocoder from the Mel or linear spectrogram corresponding to the target text.
With this method, the target text is annotated with paralanguage, and the target audio corresponding to the target text can be generated from both the first annotation information of the paralanguage in the target text and the second annotation information of the remaining content of the target text. Paralinguistic phenomena are therefore synthesized well in the target audio, the poor synthesis quality of existing speech synthesis methods is avoided, and the naturalness and human-likeness of the synthesized speech are improved.
Fig. 2 is a flowchart of a speech synthesis method according to the embodiment shown in fig. 1, and as shown in fig. 2, step S101 includes the following sub-steps:
In step S1011, a paralinguistic annotation indication message corresponding to the target text is determined, where the message indicates at least one target position in the target text at which paralanguage needs to be annotated and the type of paralanguage to be annotated at each target position.
As described above, the paralanguage type may be any of a breath sound, a disfluent pause caused by a slip of the tongue, an inhalation sound, a mouth click, a phoneme error caused by a slip of the tongue, blurred pronunciation, an incomplete syllable caused by a slip of the tongue, and laughter.
In one possible implementation of the present disclosure, the paralinguistic annotation may be specified manually: an annotator directly marks in the target text the target positions and the types of paralanguage to be added there, writing into the target text the paralinguistic phenomena expected to be heard in the synthesized audio. The terminal then receives this paralinguistic annotation operation on the target text and determines from it the target positions and the type of paralanguage to be annotated at each target position.
In another possible implementation of the present disclosure, the at least one target position in the target text that needs paralinguistic annotation, and the type of paralanguage needed at each target position, may be predicted by an annotation prediction model from the text information of the target text.
The annotation prediction model is obtained by training on a second training text. The second training text may include an input training text and a training sample label: the input training text may be text transcribed from real speech (such as spoken dialogue audio) with the paralinguistic phenomena filtered out, and the training sample label may be the paralanguage that actually occurs in that speech together with the position of each paralinguistic event in the input training text. The input training text is used as the input of a neural network model, the training sample label is used as the training target, and the neural network model is trained. The trained annotation prediction model can then predict, for a given text, at least one target position that requires paralanguage and the type of paralanguage required at each target position, so that once the target text is input into the annotation prediction model, these positions and types are obtained automatically and subsequent annotation can be performed without manual labeling, which helps improve the efficiency of paralinguistic annotation of the target text.
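As a rough illustration of how such training pairs might be derived from a paralinguistically annotated transcript, the sketch below removes the <symbol> markers from an annotated string and records their positions and types. The marker format, the helper name and the symbol-to-type table are assumptions made for this example, not a format defined by the disclosure.

```python
import re

# Assumed inverse of the label table: marker symbol -> paralanguage type.
SYMBOL_TO_TYPE = {"uvd": "breath_sound", "dp": "disfluent_pause",
                  "igr": "inhalation_sound", "dc": "mouth_click"}

def make_training_pair(annotated_text: str):
    """Return (filtered_text, sample_label) for the annotation prediction model.

    The filtered text (markers removed) is the model input; the sample label
    lists each paralinguistic event's character position and type.
    """
    filtered, events, pos = [], [], 0
    for token in re.split(r"(<[a-z]+>)", annotated_text):
        m = re.fullmatch(r"<([a-z]+)>", token)
        if m and m.group(1) in SYMBOL_TO_TYPE:
            events.append((pos, SYMBOL_TO_TYPE[m.group(1)]))
        else:
            filtered.append(token)
            pos += len(token)
    return "".join(filtered), events

# Hypothetical usage:
text, label = make_training_pair("I know <uvd> it certainly has its reasons")
# label -> [(7, 'breath_sound')]; text keeps the surrounding spaces.
```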
In step S1012, for each target position, the annotation rule corresponding to the paralanguage is determined according to the type of paralanguage to be annotated at that target position.
In this disclosure, different paralanguage types correspond to different annotation rules.
In step S1013, the paralanguage is acoustically annotated according to the annotation rule.
In this step, the paralanguage may be acoustically annotated at a target acoustic attribute level according to the annotation rule, where the target acoustic attribute level includes at least one of a word layer, a syllable layer, a phoneme layer, a prosody layer and a part-of-speech layer.
For example, if the paralanguage type is a breath sound, the corresponding annotation rule may be:
In the annotation file, the target position of the breath sound is labeled with the symbol uvd in the word layer, the pinyin layer and the phoneme layer of the target text; when the prosody layer of the target text is annotated, if the breath sound does not belong to laughter, its prosodic boundary is kept consistent with that of the preceding word; and in the part-of-speech layer of the target text the part of speech of the breath sound is labeled None.
For the annotation of the prosody layer, in one possible implementation of the disclosure, prosodic word boundaries, prosodic phrase boundaries, intonation phrase boundaries and sentence boundaries, i.e., several different prosodic levels, may be used to mark prosodic boundaries. Different prosodic boundaries correspond to different pause durations and/or intonation information and may be represented by <num>, for example <1> for a prosodic word, <2> for a prosodic phrase, <3> for an intonation phrase and <4> for a sentence; in this disclosure, the larger the number, the longer the corresponding pause. When the target position of a breath sound is annotated in the prosody layer and the breath sound does not belong to laughter, the breath is usually taken because the speaker has been talking for a while and is running out of breath; the speech stream therefore pauses at the breath, and the preceding speech segment ends at a prosodic level of at least a prosodic phrase. Consequently, the prosodic boundary of the word preceding the breath sound is generally marked as 2 or higher, and the prosodic boundary of the breath sound that follows may be the same as that of the preceding word. The specific prosodic boundary labels given here are merely illustrative and do not limit the disclosure.
For example, fig. 3 is a schematic diagram of a paralinguistic annotation. As shown in fig. 3, when a breath sound is annotated, the annotation may follow the breath-sound rule described above: the breath sound is labeled with the symbol uvd at its target position (the position between the two neighboring words shown in the figure) in the word layer, the pinyin layer and the phoneme layer, its part of speech is labeled None in the part-of-speech layer, and its prosodic boundary is labeled 3 (the boundary corresponding to the intonation phrase level). This is merely an example and does not limit the present disclosure.
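As a sketch of what the multi-layer record for such an inserted breath sound could look like (the layer names and the dictionary layout are assumptions for illustration, not a file format defined by the disclosure):

```python
# Hypothetical multi-layer annotation entry for a breath sound inserted after
# a word whose prosodic boundary is at the intonation-phrase level (<3>).
breath_sound_entry = {
    "word": "uvd",             # word layer
    "pinyin": "uvd",           # pinyin (syllable) layer
    "phoneme": "uvd",          # phoneme layer
    "prosody": "3",            # same boundary as the preceding word
    "part_of_speech": "None",  # part-of-speech layer
}
```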
If the paralanguage type is a disfluent pause caused by a slip of the tongue, the corresponding annotation rule may be:
In the annotation file, the target position of the disfluent pause is labeled with the symbol dp in the word layer, the pinyin layer and the phoneme layer of the target text; when the prosody layer of the target text is annotated, the prosodic boundary at the target position of the disfluent pause is kept consistent with that of the preceding word; and in the part-of-speech layer of the target text the part of speech of the disfluent pause is labeled None.
If the paralanguage type is an inhalation sound or a mouth click, the corresponding annotation rule may be:
An inhalation sound generally sounds like a hissing word; when it is annotated, its target position in the word layer may be labeled with that hissing word, and its target positions in the pinyin layer and the phoneme layer are labeled igr. A mouth click generally sounds like a tsk word; when it is annotated, its target position in the word layer may be labeled with that tsk word, and its target positions in the pinyin layer and the phoneme layer are labeled dc. When the prosody layer of an inhalation sound or a mouth click is annotated, its prosodic boundary may be kept consistent with that of the preceding word, and in the part-of-speech layer the inhalation sound or mouth click is labeled None.
If the paralanguage type is a phoneme error or blurred pronunciation, the corresponding annotation rule may be: in the phoneme layer, a marker symbol is appended to the phoneme at the target position where the phoneme error or blurred pronunciation occurs; for example, a preset marker symbol may be appended after an erroneous phoneme, and the marker symbol $ may be appended after a blurred phoneme.
For example, fig. 4a and 4b are schematic diagrams of paralinguistic annotations according to an exemplary embodiment. As shown in fig. 4a, when a phoneme error caused by a slip of the tongue is annotated, the phoneme-error marker symbol may be appended to the phoneme n at the target position in the phoneme layer, yielding the marked phoneme. As shown in fig. 4b, when blurred pronunciation is annotated, the marker symbol $ may be appended to the phoneme r at the target position in the phoneme layer, yielding the phoneme r$. The speech synthesis model can later recognize from the phoneme annotation information that one target position contains a phoneme error and synthesize the corresponding paralinguistic audio from the marked phoneme n, and recognize that the other target position contains blurred pronunciation and synthesize the corresponding paralinguistic audio from the annotation r$, thereby improving the human-likeness of the synthesized audio. The above examples are merely illustrative and do not limit the present disclosure.
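A tiny helper along these lines could attach the markers to a phoneme; as before, the asterisk standing in for the phoneme-error marker is an assumption, since only the $ symbol is shown explicitly in the text.

```python
def mark_phoneme(phoneme: str, kind: str) -> str:
    """Append the paralanguage marker for a phoneme error or blurred pronunciation."""
    if kind == "phoneme_error":
        return phoneme + "*"   # e.g. "n" -> "n*" (marker symbol assumed)
    if kind == "blurred":
        return phoneme + "$"   # e.g. "r" -> "r$"
    return phoneme
```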
If the paralanguage type is an incomplete syllable caused by a slip of the tongue, the corresponding annotation rule may be:
The incompletely pronounced syllable is labeled with the initial or final whose pronunciation is closest to it, i.e., that initial or final is added in the phoneme layer; and in the prosody layer, depending on how tightly the incomplete syllable attaches to its neighbors, the prosodic boundary before or after it is set to 0 (this can be understood as the syllable carrying no stress and no pause: it is not enough to form a complete foot on its own and must attach to the preceding or following syllable). The word layer, the pinyin layer and the part-of-speech layer are all labeled None.
If the paralanguage type is laughter: the constituent elements of laughter may include, for example, 8 main syllables and 2 variants (Chinese characters whose pronunciation approximates laughter, rendered roughly as hum, ha, hip, hey, hi, he, hiccup and o), 2 onsets and 2 offsets. According to the characteristics of laughter, a laugh at the beginning or in the middle of a sentence generally ends with an inhalation segment (vocal cords vibrating, vd, or not vibrating, uvd) so that the speaker can continue speaking after taking a breath; this segment is the offset. The beginning of a laugh may contain a vowel start segment, i.e., the onset. The annotation rules for the onset and for plain laugh syllables may be:
The laughter position in the word layer is labeled None; the pinyin and phoneme layers are labeled with the preset phoneme symbol of the corresponding laugh syllable (for example, hg for the hum-like syllable, ha for the ha-like syllable and hi for the hi-like syllable). For the prosody layer, one laugh segment (a single laugh syllable or several laugh syllables) forms one intonation phrase unit, the prosodic boundary at the end of the laugh may be marked as an intonation phrase boundary <3> or a sentence boundary <4>, and the prosodic boundary of a syllable inside a laugh segment may be marked as 0 (i.e., no pause in the speech stream). When the part-of-speech layer of the laugh is annotated, a laugh segment may be treated as one part-of-speech unit and uniformly labeled None.
For offset-type laughter, i.e., the inhalation segment at the end of a laugh, which includes a breath with the vocal cords vibrating and a breath with the vocal cords not vibrating, the annotation rules may be:
In the word layer, the pinyin layer and the phoneme layer, a breath with the vocal cords vibrating is labeled vd and a breath with the vocal cords not vibrating is labeled uvd. For the prosody layer, one laugh segment (a single laugh syllable or several laugh syllables) forms one intonation phrase unit, the prosodic boundary at the end of the laugh may be marked as an intonation phrase boundary <3> or a sentence boundary <4>, and the prosodic boundary of each syllable inside a laugh segment may be marked as 0 (i.e., no pause in the speech stream). When the part-of-speech layer of the laugh is annotated, a laugh segment may be treated as one part-of-speech unit and uniformly labeled None.
In this way, for each target position in the target text, the annotation rule corresponding to the paralanguage is determined from the type of paralanguage to be annotated there, and the acoustic annotation of the paralanguage is then performed according to that rule, yielding the first annotation information of the paralanguage. Because different paralanguage types use different marker symbols, when speech is later synthesized for the annotated target text, the speech synthesis model can identify, from the first annotation information, the paralanguage at each target position and synthesize the corresponding paralinguistic audio according to its annotation, thereby improving the human-likeness of the synthesized audio.
Fig. 5 is a flowchart of a speech synthesis method according to the embodiment shown in fig. 1, and as shown in fig. 5, step S103 includes the following sub-steps:
In step S1031, acoustic feature information corresponding to the target text is determined from the first phoneme annotation information, the first tone annotation information and the first prosody annotation information by a pre-trained target speech synthesis model.
The target speech synthesis model may be, for example, a Tacotron model and may include a coding sub-model, an attention sub-model and a decoding sub-model. The coding sub-model generates a text representation vector (TE) from the concatenated vector of the first phoneme annotation information, the first tone annotation information and the first prosody annotation information; the attention sub-model generates a semantic representation vector of the target text from the text representation vector; and the decoding sub-model outputs the acoustic feature information (such as a Mel spectrogram) corresponding to the target text from the semantic representation vector.
Fig. 6 is a flowchart of a speech synthesis method according to the embodiment shown in fig. 5, and as shown in fig. 6, step S1031 includes the following sub-steps:
In step S10311, vector concatenation is performed on the first phoneme annotation information, the first tone annotation information and the first prosody annotation information to obtain an acoustic annotation representation vector.
In this step, the first phoneme annotation information may be vectorized to obtain a first vector, the first tone annotation information vectorized to obtain a second vector, and the first prosody annotation information vectorized to obtain a third vector; the first, second and third vectors are then concatenated to obtain the acoustic annotation representation vector, which serves as the input to the coding sub-model. The coding sub-model then outputs a text representation sequence (TE) of the target text. This text representation sequence passes through the attention sub-model, which generates a context vector C as the semantic representation of the target text. The semantic representation generated by the attention sub-model enters the decoding sub-model, which outputs the acoustic feature information corresponding to the target text.
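A minimal PyTorch-style sketch of this concatenation step is given below, assuming integer-encoded phoneme, tone and prosody label sequences and hypothetical vocabulary sizes; it only mirrors the embed-and-concatenate operation that produces the acoustic annotation representation vector and is not the model of the disclosure.

```python
import torch
import torch.nn as nn

class AcousticLabelEncoder(nn.Module):
    """Sketch: embed phoneme/tone/prosody label sequences and concatenate them."""
    def __init__(self, n_phonemes=100, n_tones=8, n_prosody=6, dim=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.tone_emb = nn.Embedding(n_tones, dim)
        self.prosody_emb = nn.Embedding(n_prosody, dim)

    def forward(self, phonemes, tones, prosody):
        # Inputs: LongTensors of shape (batch, seq_len); the three embeddings
        # are concatenated along the feature axis to give the acoustic
        # annotation representation vector for each position, which is then
        # fed to the coding sub-model.
        return torch.cat([self.phoneme_emb(phonemes),
                          self.tone_emb(tones),
                          self.prosody_emb(prosody)], dim=-1)
```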
In step S10312, after the acoustic annotation representation vector is input into the coding sub-model, a text representation vector of the target text is output through the coding sub-model.
In step S10313, after the text representation vector is input into the attention sub-model, the semantic representation vector of the target text is output through the attention sub-model.
In step S10314, after the semantic representation vector is input into a decoding sub-model, acoustic feature information corresponding to the target text is output through the decoding sub-model.
After the acoustic feature information corresponding to the target text is obtained, the target audio corresponding to the target text may be obtained by performing step S1032.
In step S1032, target audio is generated from the acoustic feature information.
In this step, the target audio may be generated by a preset vocoder from the acoustic feature information (such as the Mel spectrogram or linear spectrogram) corresponding to the target text. The preset vocoder may be, for example, a WaveNet vocoder, a Griffin-Lim vocoder, or the like.
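For the vocoding step, a Griffin-Lim style reconstruction from a predicted Mel spectrogram can be sketched with librosa as below; the sample rate, FFT size and hop length are assumed values, and the disclosure itself does not tie the vocoder to this library.

```python
import numpy as np
import librosa

def mel_to_waveform(mel_spectrogram: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Reconstruct a waveform from a (n_mels, frames) power Mel spectrogram
    using Griffin-Lim; a neural vocoder such as WaveNet could be used instead."""
    return librosa.feature.inverse.mel_to_audio(
        mel_spectrogram, sr=sr, n_fft=1024, hop_length=256)
```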
In addition, the target speech synthesis model described above is trained in advance by:
acquiring a second acoustic annotation and acoustic feature label information corresponding to a first training text, where the second acoustic annotation includes third annotation information for the paralanguage in the first training text and fourth annotation information for the non-paralinguistic content of the first training text, and the third annotation information and the fourth annotation information include second phoneme annotation information, second tone annotation information and second prosody annotation information corresponding to the first training text;
the acoustic feature label information is the target acoustic feature information of the real audio corresponding to the first training text; the real audio contains paralinguistic phenomena, and the target acoustic feature information therefore also includes the acoustic features of those paralinguistic phenomena;
obtaining acoustic feature model output information from the second phoneme annotation information, the second tone annotation information and the second prosody annotation information through a preset speech synthesis model;
and performing model training on the preset speech synthesis model according to the acoustic feature label information and the acoustic feature model output information to obtain the target speech synthesis model.
Training the speech synthesis model aims to make the audio synthesized from the model output as close as possible to the real audio of the first training text, that is, to make the acoustic feature information output by the model (the acoustic feature model output information) as close as possible to the acoustic feature label information. Therefore, a loss value can be computed from the acoustic feature label information and the acoustic feature model output information, and the internal parameters of the current model can be adjusted using this loss value. The adjusted model is then used for the next training step, and these steps are repeated until the stopping condition is met, yielding the trained target speech synthesis model.
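The training loop described above can be summarized schematically as follows, under the assumption that the model maps the label sequences to Mel spectrograms and that an L1 loss is used; the disclosure does not specify a particular loss function or optimizer.

```python
import torch.nn as nn

def train_step(model, optimizer, batch):
    """One schematic training step of the preset speech synthesis model."""
    phonemes, tones, prosody, target_mel = batch     # target_mel: features of the real audio
    optimizer.zero_grad()
    predicted_mel = model(phonemes, tones, prosody)  # acoustic feature model output information
    loss = nn.functional.l1_loss(predicted_mel, target_mel)
    loss.backward()                                  # adjust internal parameters using the loss
    optimizer.step()
    return loss.item()
```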
The target speech synthesis model obtained by this training step can be used in speech synthesis scenarios containing paralinguistic phenomena, improving the naturalness and human-likeness of the synthesized speech.
Fig. 7 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment, as shown in fig. 7, the apparatus comprising:
an annotation module 701, configured to perform paralinguistic annotation on a target text to be processed, where the paralanguage includes multiple types and different types of paralanguage correspond to different annotation information;
an acquisition module 702, configured to acquire a first acoustic annotation corresponding to the target text, where the first acoustic annotation includes first annotation information for the paralanguage in the target text and second annotation information for the non-paralinguistic content of the target text;
and a speech synthesis module 703, configured to generate target audio corresponding to the target text according to the first annotation information and the second annotation information.
Optionally, the annotation module 701 is configured to determine a paralinguistic annotation indication message corresponding to the target text, where the message indicates at least one target position in the target text at which paralanguage needs to be annotated and the type of paralanguage to be annotated at each target position;
determine, for each target position, the annotation rule corresponding to the paralanguage according to the type of paralanguage to be annotated at that target position;
and acoustically annotate the paralanguage according to the annotation rule.
Optionally, the annotation module 701 is configured to acoustically annotate the paralanguage at a target acoustic attribute level according to the annotation rule, where the target acoustic attribute level includes at least one of a word layer, a syllable layer, a phoneme layer, a prosody layer and a part-of-speech layer.
Optionally, the first annotation information and the second annotation information each include first phoneme annotation information, first tone annotation information and first prosody annotation information; each tone annotation in the first tone annotation information corresponds one-to-one to a phoneme annotation in the first phoneme annotation information, and each prosody annotation in the first prosody annotation information corresponds one-to-one to a phoneme annotation in the first phoneme annotation information. The speech synthesis module 703 is configured to determine acoustic feature information corresponding to the target text from the first phoneme annotation information, the first tone annotation information and the first prosody annotation information through a pre-trained target speech synthesis model, and to generate the target audio according to the acoustic feature information.
Optionally, the target speech synthesis model comprises a coding sub-model, an attention sub-model and a decoding sub-model,
and the speech synthesis module 703 is configured to concatenate the first phoneme annotation information, the first tone annotation information and the first prosody annotation information into an acoustic annotation representation vector; input the acoustic annotation representation vector into the coding sub-model and output a text representation vector of the target text through the coding sub-model; input the text representation vector into the attention sub-model and output the semantic representation vector of the target text through the attention sub-model; and input the semantic representation vector into the decoding sub-model and output the acoustic feature information corresponding to the target text through the decoding sub-model.
Optionally, the target speech synthesis model is pre-trained by:
acquiring a second acoustic annotation and acoustic feature label information corresponding to a first training text, where the second acoustic annotation includes third annotation information for the paralanguage in the first training text and fourth annotation information for the non-paralinguistic content of the first training text, and the third annotation information and the fourth annotation information include second phoneme annotation information, second tone annotation information and second prosody annotation information corresponding to the first training text;
obtaining acoustic feature model output information from the second phoneme annotation information, the second tone annotation information and the second prosody annotation information through a preset speech synthesis model;
and performing model training on the preset speech synthesis model according to the acoustic feature label information and the acoustic feature model output information to obtain the target speech synthesis model.
Optionally, the paralanguage includes at least two of: breath sounds, disfluent pauses caused by slips of the tongue, inhalation sounds, mouth clicks, phoneme errors caused by slips of the tongue, blurred pronunciation, incomplete syllables caused by slips of the tongue, and laughter.
The specific manner in which the modules of the apparatus in the above embodiments perform their operations has been described in detail in the method embodiments and is not repeated here.
Referring now to fig. 8, a schematic diagram of an electronic device 800 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, the electronic device 800 may include a processing means (e.g., a central processor, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, etc.; storage 808 including, for example, magnetic tape, hard disk, etc.; communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 shows an electronic device 800 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: perform paralinguistic annotation on a target text to be processed, where the paralanguage includes multiple types and different types of paralanguage correspond to different annotation information; acquire a first acoustic annotation corresponding to the target text, where the first acoustic annotation includes first annotation information for the paralanguage in the target text and second annotation information for the non-paralinguistic content of the target text; and generate target audio corresponding to the target text according to the first annotation information and the second annotation information.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The name of a module does not, in some cases, constitute a limitation of the module itself; for example, the acquisition module may also be described as "a module that acquires acoustic annotations".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a speech synthesis method, comprising: performing auxiliary language labeling on the target text to be processed, wherein the auxiliary language comprises a plurality of types, and different types of auxiliary languages correspond to different annotation information; acquiring a first acoustic annotation corresponding to the target text, wherein the first acoustic annotation comprises first annotation information of the multiple auxiliary languages corresponding to the target text and second annotation information of the other languages, except the auxiliary languages, in the target text; and generating target audio corresponding to the target text according to the first annotation information and the second annotation information.
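For illustration only, the three steps of example 1 can be sketched in Python as below. The function and class names (`label_auxiliary_language`, `get_acoustic_annotation`, `synthesize`, `AcousticAnnotation`) and the data layout are assumptions made for this sketch and are not part of the disclosed method.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AcousticAnnotation:
    first_annotation_info: List[dict]   # annotations of the auxiliary languages in the text
    second_annotation_info: List[dict]  # annotations of the remaining (non-auxiliary) content

def label_auxiliary_language(target_text: str, labels: List[Tuple[int, str]]) -> str:
    """Insert an auxiliary-language tag (e.g. '<laughter>') at each given character position."""
    for position, aux_type in sorted(labels, reverse=True):
        target_text = target_text[:position] + f"<{aux_type}>" + target_text[position:]
    return target_text

def get_acoustic_annotation(labeled_text: str) -> AcousticAnnotation:
    """Stub: split the labeled text into first and second annotation information."""
    return AcousticAnnotation(first_annotation_info=[], second_annotation_info=[])

def synthesize(annotation: AcousticAnnotation) -> bytes:
    """Stub: generate the target audio from the two kinds of annotation information."""
    return b""

labeled = label_auxiliary_language("that is so funny", [(16, "laughter")])
audio = synthesize(get_acoustic_annotation(labeled))
```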
In accordance with one or more embodiments of the present disclosure, example 2 provides the method of example 1, the auxiliary language labeling of the target text to be processed comprising:
determining an auxiliary language labeling indication message corresponding to the target text, wherein the auxiliary language labeling indication message is used for indicating at least one target position in the target text at which an auxiliary language needs to be labeled, and the type of the auxiliary language to be labeled at each target position;
determining a labeling rule corresponding to the auxiliary language according to the type of the auxiliary language to be labeled at each target position;
and carrying out acoustic labeling on the auxiliary language according to the labeling rule.
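As a purely illustrative sketch of example 2, the snippet below maps each auxiliary-language type to a labeling rule and applies the rules at the positions given by an indication message. The rule table, tag strings and function name are assumptions for this sketch, not rules defined in the disclosure.

```python
# Hypothetical labeling rules keyed by auxiliary-language type.
LABELING_RULES = {
    "laughter":     {"tag": "LAU", "level": "syllable"},
    "breath":       {"tag": "BRE", "level": "phoneme"},
    "filled_pause": {"tag": "FIL", "level": "word"},
}

def label_from_indication(text, indication_message):
    """indication_message: list of (position, aux_type) pairs for the target text."""
    pieces, last = [], 0
    for position, aux_type in sorted(indication_message):
        rule = LABELING_RULES[aux_type]   # rule chosen by the auxiliary-language type
        pieces.append(text[last:position])
        pieces.append(f"<{rule['tag']}@{rule['level']}>")
        last = position
    pieces.append(text[last:])
    return "".join(pieces)

print(label_from_indication("I will be right back", [(6, "filled_pause")]))
```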
According to one or more embodiments of the present disclosure, example 3 provides the method of example 2, the acoustically labeling the auxiliary language according to the labeling rule comprising:
acoustically labeling the auxiliary language at a target acoustic attribute level according to the labeling rule, wherein the target acoustic attribute level comprises at least one of a word layer, a syllable layer, a phoneme layer, a prosody layer and a part-of-speech layer.
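A single acoustic label attached at one of these attribute levels might be represented as below; the field names and the example values are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class LevelAnnotation:
    aux_type: str          # e.g. "breath"
    level: str             # one of "word", "syllable", "phoneme", "prosody", "part_of_speech"
    span: Tuple[int, int]  # (start, end) character offsets in the target text

breath_label = LevelAnnotation(aux_type="breath", level="phoneme", span=(12, 12))
```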
According to one or more embodiments of the present disclosure, example 4 provides the method of example 1, wherein the first annotation information and the second annotation information each comprise first phoneme annotation information, first tone annotation information and first prosody annotation information, each tone annotation in the first tone annotation information corresponds one-to-one with a phoneme annotation in the first phoneme annotation information, and each prosody annotation in the first prosody annotation information corresponds one-to-one with a phoneme annotation in the first phoneme annotation information; the generating the target audio corresponding to the target text according to the first annotation information and the second annotation information comprises:
determining acoustic feature information corresponding to the target text through a target speech synthesis model obtained by pre-training, according to the first phoneme annotation information, the first tone annotation information and the first prosody annotation information;
and generating the target audio according to the acoustic feature information.
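The one-to-one correspondence in example 4 amounts to three parallel sequences of equal length, one entry per phoneme, which could look like the following; the vocabularies, index values and the `acoustic_model`/`vocoder` names are assumptions for illustration.

```python
import torch

# One tone id and one prosody id per phoneme id (indices into hypothetical vocabularies).
phoneme_ids = torch.tensor([[5, 17, 3, 42]])   # first phoneme annotation information
tone_ids    = torch.tensor([[1, 4, 0, 2]])     # first tone annotation information
prosody_ids = torch.tensor([[0, 0, 1, 3]])     # first prosody annotation information
assert phoneme_ids.shape == tone_ids.shape == prosody_ids.shape  # one-to-one correspondence

# A pre-trained model maps the three sequences to acoustic feature information, from
# which the target audio is generated, e.g. (names assumed for this sketch):
# mel_frames = acoustic_model(phoneme_ids, tone_ids, prosody_ids)
# target_audio = vocoder(mel_frames)
```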
According to one or more embodiments of the present disclosure, example 5 provides the method of example 4, the target speech synthesis model comprises an encoding sub-model, an attention sub-model and a decoding sub-model, and the determining acoustic feature information corresponding to the target text through the target speech synthesis model obtained by pre-training, according to the first phoneme annotation information, the first tone annotation information and the first prosody annotation information, comprises:
performing vector concatenation on the first phoneme annotation information, the first tone annotation information and the first prosody annotation information to obtain an acoustic annotation representation vector;
after the acoustic annotation representation vector is input into the encoding sub-model, outputting a text representation vector of the target text through the encoding sub-model;
after the text representation vector is input into the attention sub-model, outputting a semantic representation vector of the target text through the attention sub-model;
and after the semantic representation vector is input into the decoding sub-model, outputting the acoustic feature information corresponding to the target text through the decoding sub-model.
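A minimal PyTorch sketch of such an encoding/attention/decoding stack is shown below. The layer types, dimensions and class name are assumptions chosen only to make the data flow of example 5 concrete; the disclosure does not specify these implementation details.

```python
import torch
import torch.nn as nn

class SketchSynthesisModel(nn.Module):
    """Illustrative encoder-attention-decoder stack; sizes and layer choices are assumptions."""
    def __init__(self, n_phonemes=100, n_tones=8, n_prosody=6, dim=128, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.tone_emb = nn.Embedding(n_tones, dim)
        self.prosody_emb = nn.Embedding(n_prosody, dim)
        self.encoder = nn.GRU(3 * dim, dim, batch_first=True)                            # encoding sub-model
        self.attention = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)       # attention sub-model
        self.decoder = nn.GRU(dim, dim, batch_first=True)                                # decoding sub-model
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phoneme_ids, tone_ids, prosody_ids):
        # Vector concatenation of the three annotation embeddings -> acoustic annotation representation vector
        x = torch.cat([self.phoneme_emb(phoneme_ids),
                       self.tone_emb(tone_ids),
                       self.prosody_emb(prosody_ids)], dim=-1)
        text_repr, _ = self.encoder(x)                                        # text representation vector
        semantic_repr, _ = self.attention(text_repr, text_repr, text_repr)    # semantic representation vector
        decoded, _ = self.decoder(semantic_repr)
        return self.to_mel(decoded)                                           # acoustic feature information

model = SketchSynthesisModel()
mel = model(torch.tensor([[5, 17, 3, 42]]),
            torch.tensor([[1, 4, 0, 2]]),
            torch.tensor([[0, 0, 1, 3]]))
print(mel.shape)  # torch.Size([1, 4, 80])
```

Here the concatenated embedding stands in for the acoustic annotation representation vector, and the decoder output stands in for the acoustic feature information (e.g. mel-spectrogram frames); any comparable encoder, attention and decoder choices would fit the same data flow.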
In accordance with one or more embodiments of the present disclosure, example 6 provides the method of example 4, the target speech synthesis model being pre-trained by:
acquiring a second acoustic annotation and acoustic feature label information corresponding to a first training text, wherein the second acoustic annotation comprises third annotation information of the auxiliary languages corresponding to the first training text and fourth annotation information of the other languages, except the auxiliary languages, in the first training text; the third annotation information and the fourth annotation information comprise second phoneme annotation information, second tone annotation information and second prosody annotation information corresponding to the first training text;
obtaining acoustic feature model output information through a preset speech synthesis model according to the second phoneme annotation information, the second tone annotation information and the second prosody annotation information;
and carrying out model training on the preset speech synthesis model according to the acoustic feature label information and the acoustic feature model output information to obtain the target speech synthesis model.
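One conceivable training step for example 6, reusing the `SketchSynthesisModel` class from the sketch above as the preset speech synthesis model, is given below; the optimizer, loss function and tensor shapes are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

model = SketchSynthesisModel()                  # preset speech synthesis model (from the previous sketch)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.L1Loss()

# Second acoustic annotation of the first training text (phoneme/tone/prosody ids).
phoneme_ids = torch.tensor([[5, 17, 3, 42]])
tone_ids    = torch.tensor([[1, 4, 0, 2]])
prosody_ids = torch.tensor([[0, 0, 1, 3]])
feature_labels = torch.randn(1, 4, 80)          # acoustic feature label information (ground truth)

model_output = model(phoneme_ids, tone_ids, prosody_ids)  # acoustic feature model output information
loss = criterion(model_output, feature_labels)            # compare the output with the labels
optimizer.zero_grad()
loss.backward()
optimizer.step()
```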
According to one or more embodiments of the present disclosure, example 7 provides the method of any one of examples 1-6, the auxiliary language comprising at least two of: breath sounds, disfluent pauses caused by slips of the tongue, inhalation sounds, mouth sounds, phoneme errors caused by slips of the tongue, slurred pronunciation, incomplete syllables caused by slips of the tongue, and laughter.
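These categories can be enumerated directly, as in the illustrative Python enumeration below; the member names are assumptions, while the categories themselves come from example 7.

```python
from enum import Enum

class AuxiliaryLanguage(Enum):
    BREATH = "breath sound"
    DISFLUENT_PAUSE = "disfluent pause caused by a slip of the tongue"
    INHALATION = "inhalation sound"
    MOUTH_SOUND = "mouth sound"
    PHONEME_ERROR = "phoneme error caused by a slip of the tongue"
    SLURRED = "slurred pronunciation"
    INCOMPLETE_SYLLABLE = "incomplete syllable caused by a slip of the tongue"
    LAUGHTER = "laughter"
```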
According to one or more embodiments of the present disclosure, example 8 provides a speech synthesis apparatus, the apparatus comprising:
the labeling module is used for performing auxiliary language labeling on the target text to be processed, wherein the auxiliary language comprises a plurality of types, and different types of auxiliary languages correspond to different annotation information;
the acquisition module is used for acquiring a first acoustic annotation corresponding to the target text, wherein the first acoustic annotation comprises first annotation information of multiple auxiliary languages corresponding to the target text and second annotation information of other languages except the auxiliary languages in the target text;
and the speech synthesis module is used for generating target audio corresponding to the target text according to the first annotation information and the second annotation information.
According to one or more embodiments of the present disclosure, example 9 provides the apparatus of example 8, wherein the labeling module is configured to determine an auxiliary language labeling indication message corresponding to the target text, where the auxiliary language labeling indication message is used to indicate at least one target position in the target text at which an auxiliary language needs to be labeled, and the type of the auxiliary language to be labeled at each target position; determine a labeling rule corresponding to the auxiliary language according to the type of the auxiliary language to be labeled at each target position; and acoustically label the auxiliary language according to the labeling rule.
According to one or more embodiments of the present disclosure, example 10 provides the apparatus of example 9, wherein the labeling module is configured to acoustically label the auxiliary language at a target acoustic attribute level according to the labeling rule, where the target acoustic attribute level includes at least one of a word layer, a syllable layer, a phoneme layer, a prosody layer and a part-of-speech layer.
According to one or more embodiments of the present disclosure, example 11 provides the apparatus of example 8, the first annotation information and the second annotation information each include first phoneme annotation information, first tone annotation information and first prosody annotation information, each tone annotation in the first tone annotation information corresponds one-to-one with a phoneme annotation in the first phoneme annotation information, and each prosody annotation in the first prosody annotation information corresponds one-to-one with a phoneme annotation in the first phoneme annotation information; the speech synthesis module is configured to determine acoustic feature information corresponding to the target text through a target speech synthesis model obtained by pre-training, according to the first phoneme annotation information, the first tone annotation information and the first prosody annotation information; and generate the target audio according to the acoustic feature information.
According to one or more embodiments of the present disclosure, example 12 provides the apparatus of example 11, the target speech synthesis model includes an encoding sub-model, an attention sub-model and a decoding sub-model, and the speech synthesis module is configured to perform vector concatenation on the first phoneme annotation information, the first tone annotation information and the first prosody annotation information to obtain an acoustic annotation representation vector; after the acoustic annotation representation vector is input into the encoding sub-model, output a text representation vector of the target text through the encoding sub-model; after the text representation vector is input into the attention sub-model, output a semantic representation vector of the target text through the attention sub-model; and after the semantic representation vector is input into the decoding sub-model, output the acoustic feature information corresponding to the target text through the decoding sub-model.
In accordance with one or more embodiments of the present disclosure, example 13 provides the apparatus of example 11, the target speech synthesis model is pre-trained by:
acquiring a second acoustic annotation and acoustic feature label information corresponding to a first training text, wherein the second acoustic annotation comprises third annotation information of the auxiliary languages corresponding to the first training text and fourth annotation information of the other languages, except the auxiliary languages, in the first training text; the third annotation information and the fourth annotation information comprise second phoneme annotation information, second tone annotation information and second prosody annotation information corresponding to the first training text;
obtaining acoustic feature model output information through a preset speech synthesis model according to the second phoneme annotation information, the second tone annotation information and the second prosody annotation information;
and carrying out model training on the preset speech synthesis model according to the acoustic feature label information and the acoustic feature model output information to obtain the target speech synthesis model.
According to one or more embodiments of the present disclosure, example 14 provides the apparatus of any one of examples 8-13, the auxiliary language comprising at least two of: breath sounds, disfluent pauses caused by slips of the tongue, inhalation sounds, mouth sounds, phoneme errors caused by slips of the tongue, slurred pronunciation, incomplete syllables caused by slips of the tongue, and laughter.
According to one or more embodiments of the present disclosure, example 15 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any of examples 1-7.
Example 16 provides an electronic device according to one or more embodiments of the present disclosure, comprising:
a storage device having at least one computer program stored thereon;
at least one processing means for executing the at least one computer program in the storage means to implement the steps of the method of any one of examples 1-7.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in this disclosure is not limited to the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by substituting the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be described in detail here.

Claims (10)

1. A method of speech synthesis, the method comprising:
performing auxiliary language labeling on the target text to be processed, wherein the auxiliary language comprises a plurality of types, and different types of auxiliary languages correspond to different annotation information;
acquiring a first acoustic annotation corresponding to the target text, wherein the first acoustic annotation comprises first annotation information of multiple auxiliary languages corresponding to the target text and second annotation information of other languages except the auxiliary languages in the target text;
and generating target audio corresponding to the target text according to the first annotation information and the second annotation information.
2. The method of claim 1, wherein the auxiliary language labeling of the target text to be processed comprises:
determining an auxiliary language labeling indication message corresponding to the target text, wherein the auxiliary language labeling indication message is used for indicating at least one target position in the target text at which an auxiliary language needs to be labeled, and the type of the auxiliary language to be labeled at each target position;
determining a labeling rule corresponding to the auxiliary language according to the type of the auxiliary language to be labeled at each target position;
and carrying out acoustic labeling on the auxiliary language according to the labeling rule.
3. The method of claim 2, wherein acoustically labeling the auxiliary language according to the labeling rule comprises:
acoustically labeling the auxiliary language at a target acoustic attribute level according to the labeling rule, wherein the target acoustic attribute level comprises at least one of a word layer, a syllable layer, a phoneme layer, a prosody layer and a part-of-speech layer.
4. The method of claim 1, wherein the first annotation information and the second annotation information each comprise first phoneme annotation information, first tone annotation information, and first prosody annotation information, each tone annotation in the first tone annotation information is in one-to-one correspondence with each phoneme annotation in the first phoneme annotation information, and each prosody annotation in the first prosody annotation information is in one-to-one correspondence with each phoneme annotation in the first phoneme annotation information; the generating the target audio corresponding to the target text according to the first annotation information and the second annotation information comprises:
determining acoustic feature information corresponding to the target text through a target speech synthesis model obtained by pre-training, according to the first phoneme annotation information, the first tone annotation information and the first prosody annotation information;
and generating the target audio according to the acoustic feature information.
5. The method of claim 4, wherein the target speech synthesis model comprises an encoding sub-model, an attention sub-model and a decoding sub-model, and the determining acoustic feature information corresponding to the target text through the target speech synthesis model obtained by pre-training, according to the first phoneme annotation information, the first tone annotation information and the first prosody annotation information, comprises:
performing vector concatenation on the first phoneme annotation information, the first tone annotation information and the first prosody annotation information to obtain an acoustic annotation representation vector;
after the acoustic annotation representation vector is input into the encoding sub-model, outputting a text representation vector of the target text through the encoding sub-model;
after the text representation vector is input into the attention sub-model, outputting a semantic representation vector of the target text through the attention sub-model;
and after the semantic representation vector is input into the decoding sub-model, outputting the acoustic feature information corresponding to the target text through the decoding sub-model.
6. The method of claim 4, wherein the target speech synthesis model is pre-trained by:
acquiring a second acoustic annotation and acoustic feature label information corresponding to a first training text, wherein the second acoustic annotation comprises third annotation information of the auxiliary languages corresponding to the first training text and fourth annotation information of the other languages, except the auxiliary languages, in the first training text; the third annotation information and the fourth annotation information comprise second phoneme annotation information, second tone annotation information and second prosody annotation information corresponding to the first training text;
obtaining acoustic feature model output information through a preset speech synthesis model according to the second phoneme annotation information, the second tone annotation information and the second prosody annotation information;
and carrying out model training on the preset speech synthesis model according to the acoustic feature label information and the acoustic feature model output information to obtain the target speech synthesis model.
7. The method of any one of claims 1-6, wherein the auxiliary language comprises at least one of: breath sounds, disfluent pauses caused by slips of the tongue, inhalation sounds, mouth sounds, phoneme errors caused by slips of the tongue, slurred pronunciation, incomplete syllables caused by slips of the tongue, and laughter.
8. A speech synthesis apparatus, the apparatus comprising:
the labeling module is used for performing auxiliary language labeling on the target text to be processed, wherein the auxiliary language comprises a plurality of types, and different types of auxiliary languages correspond to different annotation information;
the acquisition module is used for acquiring a first acoustic annotation corresponding to the target text, wherein the first acoustic annotation comprises first annotation information of multiple auxiliary languages corresponding to the target text and second annotation information of other languages except the auxiliary languages in the target text;
and the speech synthesis module is used for generating target audio corresponding to the target text according to the first annotation information and the second annotation information.
9. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-7.
10. An electronic device, comprising:
a storage device having at least one computer program stored thereon;
at least one processing means for executing said at least one computer program in said storage means to carry out the steps of the method according to any one of claims 1-7.
CN202310186648.1A 2023-02-20 2023-02-20 Speech synthesis method and device, readable medium and electronic equipment Pending CN116189652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310186648.1A CN116189652A (en) 2023-02-20 2023-02-20 Speech synthesis method and device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310186648.1A CN116189652A (en) 2023-02-20 2023-02-20 Speech synthesis method and device, readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116189652A true CN116189652A (en) 2023-05-30

Family

ID=86448432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310186648.1A Pending CN116189652A (en) 2023-02-20 2023-02-20 Speech synthesis method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116189652A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117894294A (en) * 2024-03-14 2024-04-16 暗物智能科技(广州)有限公司 Personification auxiliary language voice synthesis method and system


Similar Documents

Publication Publication Date Title
CN111899719B (en) Method, apparatus, device and medium for generating audio
CN111292720B (en) Speech synthesis method, device, computer readable medium and electronic equipment
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN107516511B (en) Text-to-speech learning system for intent recognition and emotion
WO2022095743A1 (en) Speech synthesis method and apparatus, storage medium, and electronic device
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN111899720B (en) Method, apparatus, device and medium for generating audio
US8027837B2 (en) Using non-speech sounds during text-to-speech synthesis
US8447610B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
CN111312231B (en) Audio detection method and device, electronic equipment and readable storage medium
CN110600002B (en) Voice synthesis method and device and electronic equipment
CN112331176B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
WO2023160553A1 (en) Speech synthesis method and apparatus, and computer-readable medium and electronic device
US20230122824A1 (en) Method and system for user-interface adaptation of text-to-speech synthesis
CN111489735B (en) Voice recognition model training method and device
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN113421550A (en) Speech synthesis method, device, readable medium and electronic equipment
US20230099732A1 (en) Computing system for domain expressive text to speech
CN116189652A (en) Speech synthesis method and device, readable medium and electronic equipment
CN112035699A (en) Music synthesis method, device, equipment and computer readable medium
CN116312471A (en) Voice migration and voice interaction method and device, electronic equipment and storage medium
CN114255738A (en) Speech synthesis method, apparatus, medium, and electronic device
KR20080049813A (en) Speech dialog method and device
CN116129859A (en) Prosody labeling method, acoustic model training method, voice synthesis method and voice synthesis device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination