CN114093342B - Speech generation device, equipment and storage medium for fine-grained prosody modeling - Google Patents


Info

Publication number
CN114093342B
Authority
CN
China
Prior art keywords
information, prosody, text, voice, prosodic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210078586.8A
Other languages
Chinese (zh)
Other versions
CN114093342A (en)
Inventor
陶建华 (Tao Jianhua)
王诗明 (Wang Shiming)
傅睿博 (Fu Ruibo)
易江燕 (Yi Jiangyan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN202210078586.8A
Publication of CN114093342A
Application granted
Publication of CN114093342B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers

Abstract

The invention provides a speech generation device, equipment and storage medium for fine-grained prosody modeling. In the device, text information is input into a text encoder to obtain encoding features of the text; spectrum information is input into a prosody encoding module to obtain phoneme-level prosodic features of the speech; the prosodic features are input into a decoupling module, which separates the text content information from the prosody information contained in them and keeps only the prosody information; finally, the encoding features of the text and the prosody information of the prosodic features are input into a decoder to generate synthesized speech. The scheme divides the audio in the time domain using duration information, so the prosody encoder can model the local prosody information of the speech and describe the trend of prosody variation, while the decoupling module ensures that the prosody encoder learns only the prosody information of the speech and carries no text information.

Description

Speech generation device, equipment and storage medium for fine-grained prosody modeling
Technical Field
The invention belongs to the field of speech generation, and in particular relates to a speech generation device for fine-grained prosody modeling.
Background
With the increasingly rich and diverse applications of speech synthesis, users' expectations for the intelligibility, stability, naturalness and expressiveness of synthesized speech keep rising. Application scenarios such as audiobooks, voice assistants and conversational interaction expect synthesized speech to approach the naturalness of real human speech. Robustness, real-time factor, and the influence of intonation, stress, emotion, style and semantics on naturalness therefore all need to be considered. With the rapid development of speech generation technology in recent years, non-autoregressive model frameworks for speech generation have become the mainstream research trend. Compared with earlier autoregressive networks, non-autoregressive networks offer short training time, fast generation, strong robustness of the synthesized speech and strong controllability; however, because they generate in parallel, the output depends entirely on the input text features and cannot be modeled with historical speech information. Since speech synthesis is a strongly up-sampling process, the mapping between text-speech data pairs is one-to-many, and without modeling of historical information the generation process loses much information that is absent from the text, namely prosody information. Representation learning can be flexibly introduced into sequence-to-sequence speech synthesis models to extract more accurate prosody representations, enabling more effective and controllable acoustic modeling and improving the naturalness of generated speech.
Disadvantages of the prior art
(1) Since speech synthesis is a strongly up-sampling process, the mapping between text-speech data pairs is one-to-many, and without modeling of historical information the generation process loses much information that is absent from the text, namely prosody information. As a result, speech generated by current synthesis systems suffers from defects such as unclear energy in the high-frequency bands and a flat, monotonous style.
(2) Most current speech prosody modeling captures only global information; that is, an entire utterance is encoded into a single prosody hidden vector. However, prosody is a time-varying process: global prosody information can model the overall emotion of an utterance well, but it is ineffective for local variation and expression.
Disclosure of Invention
To solve the above technical problems, the invention provides a speech generation device with fine-grained prosody modeling.
A first aspect of the invention discloses a speech generation device for fine-grained prosody modeling; the device comprises:
a text encoder, a prosodic encoder, and a decoder;
the prosody encoder includes: a rhythm coding module and a decoupling module;
inputting text information into the text encoder to obtain the encoding features of the text;
inputting spectrum information into the prosody encoding module to obtain the phoneme-level prosodic features of the speech;
inputting the prosodic features of the speech into the decoupling module, which decouples the text content information and the prosody information contained in the prosodic features and keeps only the prosody information;
and inputting the encoding features of the text and the prosody information of the prosodic features into the decoder to generate synthesized speech.
According to the technical solution of the first aspect of the invention, the text encoder includes:
a word embedding layer, a pre-coding layer and a recurrent neural network;
and the text information is input sequentially into the word embedding layer, the pre-coding layer and the recurrent neural network to obtain the encoding features of the text.
According to the technical solution of the first aspect of the invention, the specific method of sequentially inputting the text information into the word embedding layer, the pre-coding layer and the recurrent neural network to obtain the encoding features of the text comprises:
inputting the text information into the word embedding layer to obtain a representation in a high-dimensional continuous space; inputting that representation into the pre-coding layer, which compresses its dimensions and information to obtain a representation in a compressed continuous space; and inputting the compressed representation into the recurrent neural network to obtain the encoding features of the text.
According to the technical solution of the first aspect of the invention, the prosody encoding module includes:
a spectrum preprocessing network, a multi-head self-attention module and a fully connected layer;
and the spectrum information is input sequentially into the spectrum preprocessing network, the multi-head self-attention module and the fully connected layer to obtain the phoneme-level prosodic features of the speech.
According to the technical solution of the first aspect of the invention, the specific method of sequentially inputting the spectrum information into the spectrum preprocessing network, the multi-head self-attention module and the fully connected layer to obtain the phoneme-level prosodic features comprises:
the spectrum preprocessing network expands the spectrum information into high-dimensional spectral features;
the high-dimensional spectral features are input into the multi-head self-attention module, which computes their weighted sum over the time dimension through a self-attention mechanism to obtain attention features;
and the fully connected layer computes the weighted sum of the attention features over the feature dimensions; finally, the output of the fully connected layer is averaged over the time span corresponding to each phoneme to obtain the final phoneme-level prosodic features of the speech.
According to the technical solution of the first aspect of the invention, the decoupling module uses a generative adversarial network to decouple the text content information and the prosody information contained in the prosodic features of the speech.
According to the technical solution of the first aspect of the invention, during network optimization the decoupling module is iteratively optimized in alternation with the text encoder and the decoder.
According to the technical solution of the first aspect of the invention, the decoder includes: a multi-layer multi-head self-attention module and a post-processing network;
the encoding features of the text and the prosody information of the prosodic features of the speech are input into the multi-head self-attention module for feature transformation to obtain a fused feature spectrum; the fused feature spectrum is input into the post-processing network, which fine-tunes it using local spectrum information to generate the synthesized speech, where the local spectrum information is the prosody information of the prosodic features of the speech.
A second aspect of the invention provides an electronic device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the fine-grained prosody modeling speech generation device described in the first aspect of the invention.
The scheme provided by the invention divides the audio in the time domain using duration information; the prosody encoder models the local prosody information of the speech to describe the trend of prosody variation, while the prosody decoupling module ensures that the prosody encoder learns only the prosody information of the speech and carries no text information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a block diagram of a fine-grained prosody modeling speech generation device according to an embodiment of the present invention;
FIG. 2 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the invention; rather, they are merely examples of apparatus and methods consistent with some aspects of the invention, as recited in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information without departing from the scope of the invention. Depending on the context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to determining".
Example 1:
a first aspect of the present invention discloses a fine-grained prosody modeling speech generation device; FIG. 1 is a structural diagram of a fine-grained prosody modeling speech generation device according to an embodiment of the present invention, and as shown in FIG. 1, the device includes:
a text encoder, a prosodic encoder, and a decoder;
the prosody encoder includes: a prosody encoding module and a decoupling module;
inputting text information into the text encoder to obtain the encoding features of the text;
in some embodiments, the text encoder comprises:
a word embedding layer, a pre-coding layer and a recurrent neural network;
the text information is input sequentially into the word embedding layer, the pre-coding layer and the recurrent neural network to obtain the encoding features of the text;
in some embodiments, the specific method of sequentially inputting the text information into the word embedding layer, the pre-coding layer and the recurrent neural network to obtain the encoding features of the text includes:
inputting the text information into the word embedding layer to obtain a representation in a 256-dimensional continuous space; inputting that representation into the pre-coding layer, which compresses its dimensions and information to obtain a representation in a 128-dimensional continuous space; and inputting the compressed representation into the recurrent neural network to obtain the encoding features of the text;
in some embodiments, the pre-coding layer consists of two fully connected layers; removing minor information and dimensions from the 256-dimensional representation through the pre-coding layer helps to improve the stability and generalization of the device and reduces the training difficulty;
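To make the dataflow concrete, a minimal sketch of such a text encoder follows, written in PyTorch. The 256-dimensional embedding and the two-layer pre-coding network compressing to 128 dimensions follow this embodiment; the vocabulary size, the choice of a bidirectional GRU and its hidden width are illustrative assumptions rather than details taken from the patent:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch: word embedding -> two-layer pre-coding -> recurrent network.
    The 256 -> 128 compression follows the embodiment; vocabulary size,
    the bidirectional GRU and its width are illustrative assumptions."""
    def __init__(self, vocab_size=100, emb_dim=256, pre_dim=128, hidden=128):
        super().__init__()
        # representation in a 256-dimensional continuous space
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # two fully connected layers compress dimensions and information (256 -> 128)
        self.precoding = nn.Sequential(
            nn.Linear(emb_dim, pre_dim), nn.ReLU(),
            nn.Linear(pre_dim, pre_dim), nn.ReLU(),
        )
        # bidirectional GRU -> 2 * hidden = 256-dimensional encoding features
        self.rnn = nn.GRU(pre_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):            # (batch, text_len) integer ids
        x = self.embedding(phoneme_ids)        # (batch, text_len, 256)
        x = self.precoding(x)                  # (batch, text_len, 128)
        encoding, _ = self.rnn(x)              # (batch, text_len, 256)
        return encoding
```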
inputting the spectrum information into the prosody encoding module to obtain the phoneme-level prosodic features of the speech;
in some embodiments, the prosody encoding module includes:
a spectrum preprocessing network, a multi-head self-attention module and a fully connected layer;
the spectrum information is input sequentially into the spectrum preprocessing network, the multi-head self-attention module and the fully connected layer to obtain the phoneme-level prosodic features of the speech;
in some embodiments, the specific method of sequentially inputting the spectrum information into the spectrum preprocessing network, the multi-head self-attention module and the fully connected layer to obtain the phoneme-level prosodic features includes:
the spectrum information refers to an implicit vector characterization of the spectrum; in some embodiments it is specifically an 80-dimensional mel spectrum feature, and the phoneme-level prosodic features are obtained through an average pooling operation over each phoneme interval;
the spectrum preprocessing network expands the 80-dimensional mel spectrum features into high-dimensional spectral features;
the high-dimensional spectral features are input into the multi-head self-attention module, which computes their weighted sum over the time dimension through a self-attention mechanism to obtain attention features;
the fully connected layer computes the weighted sum of the attention features over the feature dimensions; finally, the output of the fully connected layer is averaged over the time span corresponding to each phoneme to obtain the final phoneme-level prosodic features of the speech;
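Continuing the sketch above under the same assumptions, the prosody encoding module can be illustrated as follows. The 80-dimensional mel input and the per-phoneme average pooling follow this embodiment; the model width, the number of attention heads and the duration format (frame counts from an external aligner) are assumptions:

```python
class ProsodyEncoder(nn.Module):
    """Sketch: spectrum pre-net -> multi-head self-attention -> fully
    connected layer -> per-phoneme average pooling. The 80-dim mel input
    follows the embodiment; d_model and n_heads are assumptions."""
    def __init__(self, n_mels=80, d_model=256, n_heads=4):
        super().__init__()
        self.prenet = nn.Linear(n_mels, d_model)   # expand mel to high-dim features
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fc = nn.Linear(d_model, d_model)      # weighted sum over feature dims

    def forward(self, mel, durations):
        # mel: (batch, n_frames, 80); durations: (batch, n_phones),
        # frame counts per phoneme (assumed given by an external aligner)
        x = self.prenet(mel)
        x, _ = self.attn(x, x, x)                  # weighted sum over the time axis
        x = self.fc(x)
        # average the frames inside each phoneme's time span
        pooled = []
        for b in range(x.size(0)):
            start, feats = 0, []
            for d in durations[b].tolist():
                feats.append(x[b, start:start + d].mean(dim=0))
                start += d
            pooled.append(torch.stack(feats))
        return torch.stack(pooled)                 # (batch, n_phones, d_model)
```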
because the prosodic features of the speech contain both prosody information and text content information, if the two kinds of information are not decoupled, the output of the prosody encoder contains text information that conflicts with the output of the text encoder, causing serious problems such as mispronunciation in the generated speech and low intelligibility;
inputting the prosodic features of the speech into the decoupling module, which decouples the text content information and the prosody information contained in the prosodic features and keeps only the prosody information;
in some embodiments, the decoupling module uses a generative adversarial network to decouple the text content information and the prosody information contained in the prosodic features of the speech;
in some embodiments, during network optimization the decoupling module is iteratively optimized in alternation with the text encoder and the decoder;
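One way to realize this adversarial decoupling, sketched below, is a phoneme classifier acting as the discriminator: it is trained to recover phoneme identities from the prosodic features, while on alternate steps the rest of the network is trained with a term that penalizes the classifier's success, so the prosodic features shed their text content. The classifier architecture, loss form and update schedule are illustrative assumptions, not the patent's exact specification:

```python
class TextContentDiscriminator(nn.Module):
    """Sketch of the decoupling module's adversary: predicts phoneme identity
    from phoneme-level prosodic features. Architecture is an assumption."""
    def __init__(self, d_model=256, n_phonemes=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_phonemes),
        )

    def forward(self, prosody):                     # (batch, n_phones, d_model)
        return self.net(prosody)                    # (batch, n_phones, n_phonemes)

def alternating_step(disc, disc_opt, prosody, phoneme_ids):
    """One alternation: (1) train the discriminator on detached prosody
    features; (2) return an adversarial term for the encoder/decoder update,
    so the prosodic features stop carrying text content."""
    ce = nn.CrossEntropyLoss()
    disc_opt.zero_grad()
    d_loss = ce(disc(prosody.detach()).transpose(1, 2), phoneme_ids)
    d_loss.backward()
    disc_opt.step()
    # generator-side term: make the classifier fail on the prosody features
    adv_loss = -ce(disc(prosody).transpose(1, 2), phoneme_ids)
    return adv_loss
```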
inputting the encoding features of the text and the prosody information of the prosodic features of the speech into the decoder to generate synthesized speech;
in some embodiments, the decoder comprises: a 6-layer multi-head self-attention module and a post-processing network;
the encoding features of the text and the prosody information of the prosodic features of the speech are input into the 6-layer multi-head self-attention module for feature transformation to obtain a fused feature spectrum; the fused feature spectrum is input into the post-processing network, which fine-tunes it using local spectrum information to generate the synthesized speech, where the local spectrum information is the prosody information of the prosodic features of the speech.
The post-processing network uses a local self-attention mechanism with a small window (a window length of 100 ms), using the information of adjacent frames to model the residual between the spectrum generated by the decoder and the real spectrum.
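A sketch of such a decoder, continuing the assumptions above: six self-attention layers fuse the text encoding with the prosody information, a linear projection produces the spectrum, and the post-processing network applies self-attention restricted to a small window around each frame, adding the result as a residual. The 6 layers and the roughly 100 ms window follow the embodiment; the frame hop (12.5 ms is assumed, giving an 8-frame window), the additive fusion and all widths are assumptions:

```python
class Decoder(nn.Module):
    """Sketch: 6-layer multi-head self-attention + post-net with windowed
    local attention modeling the residual spectrum. Layer count and the
    ~100 ms window follow the embodiment; other sizes are assumptions."""
    def __init__(self, d_model=256, n_heads=4, n_layers=6, n_mels=80, win=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.to_mel = nn.Linear(d_model, n_mels)
        self.post_attn = nn.MultiheadAttention(n_mels, 1, batch_first=True)
        self.win = win  # ~100 ms at an assumed 12.5 ms frame hop

    def forward(self, text_encoding, prosody):
        # both inputs are length-aligned sequences (upsampling by duration
        # is assumed to happen upstream); additive fusion is an assumption
        fused = self.fusion(text_encoding + prosody)
        mel = self.to_mel(fused)
        # block attention to frames farther than `win` steps away
        n = mel.size(1)
        idx = torch.arange(n, device=mel.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.win  # True = masked
        residual, _ = self.post_attn(mel, mel, mel, attn_mask=mask)
        return mel + residual   # post-net refines via the local residual
```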
Speech generation process: the input to the text encoder is the text of the forged speech to be generated; the input to the prosody encoder is a guide speech carrying the prosodic characteristics the forged speech is expected to have; the decoder then outputs forged speech with the timbre of the guide speech. Because the forged speech is similar to the guide speech in timbre and prosody, it avoids the flat, expressionless prosody typical of synthesized speech, thereby enabling attacks on voice identification devices.
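For illustration, the generation flow with the hypothetical sketch classes above, assuming a fixed duration of 8 frames per phoneme and duration-based upsampling before decoding:

```python
text_encoder = TextEncoder()
prosody_encoder = ProsodyEncoder()
decoder = Decoder()

phoneme_ids = torch.randint(0, 100, (1, 12))            # text of the speech to generate
guide_mel = torch.randn(1, 96, 80)                      # spectrum of the guide speech
durations = torch.full((1, 12), 8, dtype=torch.long)    # 12 phones x 8 frames = 96

text_encoding = text_encoder(phoneme_ids)               # (1, 12, 256) content features
prosody = prosody_encoder(guide_mel, durations)         # (1, 12, 256) guide prosody

# upsample phoneme-level features to frame level with the durations
frame_text = text_encoding.repeat_interleave(8, dim=1)  # (1, 96, 256)
frame_prosody = prosody.repeat_interleave(8, dim=1)     # (1, 96, 256)
mel_out = decoder(frame_text, frame_prosody)            # (1, 96, 80) synthesized spectrum
```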
In summary, compared with the prior art, the technical solutions of the aspects of the invention have the following advantages: the audio is divided in the time domain using duration information; the prosody encoder models the local prosody information of the speech to describe the trend of prosody variation; and the prosody decoupling module ensures that the prosody encoder learns only the prosody information of the speech and carries no text information.
Example 2:
a second aspect of the present invention discloses an electronic device, which includes a memory and a processor, the memory storing a computer program; when the processor executes the computer program, the fine-grained prosody modeling speech generation device of any embodiment in Example 1 of the invention is implemented.
Fig. 2 is a block diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 2, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device, which are connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, Near Field Communication (NFC) or other technologies. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
It will be understood by those skilled in the art that the structure shown in fig. 2 is only a partial block diagram related to the technical solution of the present disclosure, and does not constitute a limitation of the electronic device to which the solution of the present application is applied, and a specific electronic device may include more or less components than those shown in the drawings, or combine some components, or have a different arrangement of components.
It should be noted that the technical features of the above embodiments can be combined arbitrarily; for brevity, not all possible combinations are described, but any combination of these features that contains no contradiction should be considered within the scope of this description. The above examples express only several embodiments of the application; their description is specific and detailed, but they are not to be construed as limiting the scope of the invention. A person skilled in the art can make several variations and improvements without departing from the concept of the application, and these fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A fine-grained prosody modeling speech generation device, the device comprising:
a text encoder, a prosodic encoder, and a decoder;
the prosody encoder includes: a prosody encoding module and a decoupling module;
inputting text information into the text encoder to obtain the encoding features of the text;
inputting spectrum information into the prosody encoding module to obtain the phoneme-level prosodic features of the speech;
inputting the prosodic features of the speech into the decoupling module, which decouples the text content information and the prosody information contained in the prosodic features and keeps only the prosody information;
inputting the encoding features of the text and the prosody information of the prosodic features into the decoder to generate synthesized speech;
the prosody encoding module includes:
a spectrum preprocessing network, a multi-head self-attention module and a fully connected layer;
the spectrum information is input sequentially into the spectrum preprocessing network, the multi-head self-attention module and the fully connected layer to obtain the phoneme-level prosodic features of the speech;
the specific method of sequentially inputting the spectrum information into the spectrum preprocessing network, the multi-head self-attention module and the fully connected layer to obtain the phoneme-level prosodic features comprises:
the spectrum preprocessing network expands the spectrum information into high-dimensional spectral features;
the high-dimensional spectral features are input into the multi-head self-attention module, which computes their weighted sum over the time dimension through a self-attention mechanism to obtain attention features;
and the fully connected layer computes the weighted sum of the attention features over the feature dimensions; finally, the output of the fully connected layer is averaged over the time span corresponding to each phoneme to obtain the final phoneme-level prosodic features of the speech.
2. The fine-grained prosody modeling speech generation device of claim 1, wherein the text encoder comprises:
a word embedding layer, a pre-coding layer and a recurrent neural network;
and the text information is input sequentially into the word embedding layer, the pre-coding layer and the recurrent neural network to obtain the encoding features of the text.
3. The fine-grained prosody modeling speech generation device of claim 2, wherein the specific method of sequentially inputting the text information into the word embedding layer, the pre-coding layer and the recurrent neural network to obtain the encoding features of the text comprises:
inputting the text information into the word embedding layer to obtain a representation in a high-dimensional continuous space; inputting that representation into the pre-coding layer, which compresses its dimensions and information to obtain a representation in a compressed continuous space; and inputting the compressed representation into the recurrent neural network to obtain the encoding features of the text.
4. The fine-grained prosody modeling speech generation device of claim 1, wherein the decoupling module uses a generative adversarial network to decouple the text content information and the prosody information contained in the prosodic features of the speech.
5. The fine-grained prosody modeling speech generation device of claim 4, wherein during network optimization the decoupling module is iteratively optimized in alternation with the text encoder and the decoder.
6. The fine-grained prosody modeling speech generation device of claim 1, wherein the decoder comprises: a multi-layer multi-head self-attention module and a post-processing network;
the encoding features of the text and the prosody information of the prosodic features of the speech are input into the multi-head self-attention module for feature transformation to obtain a fused feature spectrum; the fused feature spectrum is input into the post-processing network, which fine-tunes it using local spectrum information to generate the synthesized speech; the local spectrum information is the prosody information of the prosodic features of the speech.
7. An electronic device comprising a fine-grained prosody modeling speech generation device according to any one of claims 1 to 6.
CN202210078586.8A 2022-01-24 2022-01-24 Speech generation device, equipment and storage medium for fine-grained prosody modeling Active CN114093342B (en)

Priority Applications (1)

Application Number: CN202210078586.8A (granted as CN114093342B)
Priority Date / Filing Date: 2022-01-24 / 2022-01-24
Title: Speech generation device, equipment and storage medium for fine-grained prosody modeling

Applications Claiming Priority (1)

Application Number: CN202210078586.8A (granted as CN114093342B)
Priority Date / Filing Date: 2022-01-24 / 2022-01-24
Title: Speech generation device, equipment and storage medium for fine-grained prosody modeling

Publications (2)

Publication Number: CN114093342A (en), published 2022-02-25
Publication Number: CN114093342B, published 2022-05-03

Family

ID=80309211

Family Applications (1)

Application Number: CN202210078586.8A (Active; granted as CN114093342B)
Title: Speech generation device, equipment and storage medium for fine-grained prosody modeling

Country Status (1)

Country Link
CN (1) CN114093342B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132204B (en) * 2022-06-10 2024-03-22 腾讯科技(深圳)有限公司 Voice processing method, equipment, storage medium and computer program product
CN116092479B (en) * 2023-04-07 2023-07-07 杭州东上智能科技有限公司 Text prosody generation method and system based on comparison text-audio pair
CN116364055B (en) * 2023-05-31 2023-09-01 中国科学院自动化研究所 Speech generation method, device, equipment and medium based on pre-training language model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365882A (en) * 2020-11-30 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, model training method, device, equipment and storage medium
CN113284485A (en) * 2021-07-09 2021-08-20 中国科学院自动化研究所 End-to-end framework for unified Chinese and English mixed text generation and speech recognition
CN113327627A (en) * 2021-05-24 2021-08-31 清华大学深圳国际研究生院 Multi-factor controllable voice conversion method and system based on feature decoupling
CN113889070A (en) * 2021-09-30 2022-01-04 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11295721B2 (en) * 2019-11-15 2022-04-05 Electronic Arts Inc. Generating expressive speech audio from text data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365882A (en) * 2020-11-30 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, model training method, device, equipment and storage medium
CN113327627A (en) * 2021-05-24 2021-08-31 清华大学深圳国际研究生院 Multi-factor controllable voice conversion method and system based on feature decoupling
CN113284485A (en) * 2021-07-09 2021-08-20 中国科学院自动化研究所 End-to-end framework for unified Chinese and English mixed text generation and speech recognition
CN113889070A (en) * 2021-09-30 2022-01-04 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
High-quality prosody generation in a speech synthesis system; Guo Qing et al.; Journal of Chinese Information Processing (《中文信息学报》); 2008-03-15 (No. 02); full text *
Research and implementation technology of a speech synthesis system for aerospace testing; Dang Jiancheng; China Masters' Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 2008-06-15; full text *

Also Published As

CN114093342A (en), published 2022-02-25

Similar Documents

Publication Publication Date Title
CN114093342B (en) Speech generation device, equipment and storage medium for fine-grained prosody modeling
Sun et al. Generating diverse and natural text-to-speech samples using a quantized fine-grained vae and autoregressive prosody prior
Tjandra et al. VQVAE unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019
CN107680597B (en) Audio recognition method, device, equipment and computer readable storage medium
US20220208170A1 (en) Generating Expressive Speech Audio From Text Data
JP7204989B2 (en) Expressivity Control in End-to-End Speech Synthesis Systems
Tokuda et al. Speech synthesis based on hidden Markov models
US11514888B2 (en) Two-level speech prosody transfer
Zen et al. Statistical parametric speech synthesis using deep neural networks
CN111161702B (en) Personalized speech synthesis method and device, electronic equipment and storage medium
US20210074308A1 (en) Artificial intelligence based audio coding
CN109243465A (en) Voiceprint authentication method, device, computer equipment and storage medium
CN108885870A (en) For by combining speech to TEXT system with speech to intention system the system and method to realize voice user interface
CN111179905A (en) Rapid dubbing generation method and device
JP2024510679A (en) Unsupervised parallel tacotron non-autoregressive and controllable text reading
CN112837669B (en) Speech synthesis method, device and server
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
KR102198598B1 (en) Method for generating synthesized speech signal, neural vocoder, and training method thereof
CN111599339A (en) Speech splicing synthesis method, system, device and medium with high naturalness
Vasquez et al. Hierarchical neural network structures for phoneme recognition
Reddy et al. Inverse filter based excitation model for HMM‐based speech synthesis system
CN113345454B (en) Training and application methods, devices, equipment and storage medium of voice conversion model
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
Schmidt-Barbo et al. Using semantic embeddings for initiating and planning articulatory speech synthesis
Gao Audio deepfake detection based on differences in human and machine generated speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant