WO2022121158A1

WO2022121158A1 - Speech synthesis method and apparatus, and electronic device and storage medium

Info

Publication number: WO2022121158A1
Application number: PCT/CN2021/083186
Authority: WO
Inventors: 孙奥兰; 王健宗; 程宁
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-11
Filing date: 2021-03-26
Publication date: 2022-06-16
Also published as: CN112509554A

Abstract

A speech synthesis method and a speech synthesis apparatus (100), and an electronic device (1) and a storage medium. The method comprises: obtaining a character vector, and performing attention calculation on the character vector by using a multi-head attention network to obtain an attention vector (S4); performing a residual connection on the attention vector and the character vector to obtain a character attention vector (S5); performing feature extraction on the character attention vector by using a character feature extraction network to obtain a character feature sequence (S6); and inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence (S7); performing a residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and performing speech synthesis on the speech sequence by using a pre-built vocoder to obtain a synthesized speech of a character text (S8). The problem that the synthesized speech is not smooth and natural enough can be solved.

Description

Speech synthesis method, device, electronic device and storage medium

This application claims priority to the Chinese patent application filed on December 11, 2020, with application number CN202011452787.7 and entitled "Speech Synthesis Method, Device, Electronic Device and Storage Medium", the entire content of which is incorporated into this application by reference.

Technical field

The present application relates to the field of speech synthesis, and in particular, to a speech synthesis method, apparatus, electronic device, and computer-readable storage medium.

Background

With the rapid development of deep learning, speech synthesis methods based on deep learning networks have sprung up. The inventor realized that commonly used speech synthesis methods include the LSTM synthesis method, the BERT synthesis method, and so on. Although these methods can realize speech synthesis, they do little to improve the naturalness and fluency of the speech, so the synthesized speech is not smooth and natural enough.

SUMMARY OF THE INVENTION

A speech synthesis method, comprising:

receiving character text, performing pinyin replacement on the character text to obtain character pinyin, and using a pre-built alphabet to calculate the character position of the character pinyin in the alphabet;

performing an encoding operation on the character position and the character pinyin to obtain a character vector;

inputting the character vector into a pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network;

using the multi-head attention network to perform attention calculation on the character vector to obtain an attention vector;

performing residual connection on the attention vector and the character vector to obtain a character attention vector;

using the character feature extraction network to perform feature extraction on the character attention vector to obtain a character feature sequence;

inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence;

performing residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and using a pre-built vocoder to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.

A speech synthesis device, the device comprising:

a character vector construction module, used for receiving character text, performing pinyin replacement on the character text to obtain character pinyin, using a pre-built alphabet to calculate the character position of the character pinyin in the alphabet, and performing an encoding operation on the character position and the character pinyin to obtain a character vector;

a character feature sequence extraction module, used for inputting the character vector into a pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network, using the multi-head attention network to perform attention calculation on the character vector to obtain an attention vector, performing residual connection on the attention vector and the character vector to obtain a character attention vector, and using the character feature extraction network to perform feature extraction on the character attention vector to obtain a character feature sequence;

a pronunciation pause sequence extraction module, used for inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence;

a speech synthesis module, used for performing residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and using a pre-built vocoder to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.

An electronic device comprising:

a memory that stores at least one instruction; and

A processor that executes the instructions stored in the memory to achieve the following steps:

A computer-readable storage medium, comprising a storage data area and a storage program area, wherein the storage data area stores created data and the storage program area stores a computer program; wherein the computer program, when executed by a processor, implements the following steps:

The present application can solve the problem that the synthesized speech is not smooth and natural enough.

Description of drawings

FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application;

FIG. 2 is a detailed schematic flowchart of S6 in a speech synthesis method provided by an embodiment of the present application;

FIG. 3 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an internal structure of an electronic device for implementing a speech synthesis method provided by an embodiment of the present application;

The realization of the objectives, the functional characteristics and the advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description

It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application.

The embodiments of the present application provide a speech synthesis method, and the execution subject of the speech synthesis method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server and a terminal. In other words, the speech synthesis method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.

Referring to FIG. 1 , a schematic flowchart of a speech synthesis method provided by an embodiment of the present application is shown. In this embodiment, the speech synthesis method includes:

S1. Receive character text, perform pinyin substitution on the character text to obtain the character pinyin, and use a pre-built alphabet to calculate the character position of the character pinyin in the alphabet.

In the preferred embodiment of the present application, character text input by the user may be received. For example, the user inputs character text A: "Hello, today's trip is accompanied by heavy rain and strong wind, please pay attention to safety". Pinyin replacement is then performed on character text A to obtain character pinyin B: "nihao, jintianchuxingbanyoubaoyukuangfeng, qingzhuyianquan". In the embodiment of the present application, performing pinyin replacement on the character text to obtain the character pinyin comprises: building a pinyin replacement program using pinyin4j in the Java language; and using the pinyin replacement program to perform pinyin replacement on the character text to obtain the character pinyin.

pinyin4j is located in the net.sourceforge.pinyin4j package in the Java language, so an import of net.sourceforge.pinyin4j is used to bring in pinyin4j and obtain the pinyin replacement program.

In the embodiment of the present application, the alphabet is constructed from pinyin letters. For example, in the alphabet, a corresponds to 1, b corresponds to 2, and c corresponds to 3. Each letter of the above character pinyin B: "nihao, jintianchuxingbanyoubaoyukuangfeng, qingzhuyianquan" is then looked up in the alphabet to obtain character positions expressed as numbers.
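The position lookup can be sketched in Python as follows. This is only an illustration under stated assumptions: the source gives the mapping for a, b and c, and the sketch assumes the same numbering continues through z; the name `char_positions` and the choice to skip punctuation and spaces are hypothetical.

```python
import string

# Assumed alphabet: a -> 1, b -> 2, ..., z -> 26 (the source only lists a, b and c).
ALPHABET = {letter: index for index, letter in enumerate(string.ascii_lowercase, start=1)}


def char_positions(pinyin: str) -> list[int]:
    """Return the alphabet position of each pinyin letter, skipping punctuation and spaces."""
    return [ALPHABET[ch] for ch in pinyin.lower() if ch in ALPHABET]


positions = char_positions("nihao, jintianchuxingbanyoubaoyukuangfeng, qingzhuyianquan")
print(positions[:5])  # positions of n, i, h, a, o -> [14, 9, 8, 1, 15]
```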

S2. Perform an encoding operation on the character position and the character pinyin to obtain a character vector.

In detail, the embodiment of the present application adopts a one-hot encoding method to perform encoding operations on the character position and the character pinyin to obtain a character vector.
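A minimal sketch of one-hot encoding is given below. The source does not say how the character position and the character pinyin are combined, so this sketch assumes the simplest reading: each pinyin letter becomes one row with a single 1 at the index given by its alphabet position. The 26-letter alphabet and the name `one_hot_encode` are assumptions.

```python
import string

ALPHABET = string.ascii_lowercase  # assumed 26-letter alphabet


def one_hot_encode(pinyin: str) -> list[list[int]]:
    """One row per pinyin letter, with a single 1 at the letter's zero-based alphabet index."""
    rows = []
    for ch in pinyin.lower():
        if ch not in ALPHABET:
            continue  # skip punctuation and spaces
        row = [0] * len(ALPHABET)
        row[ALPHABET.index(ch)] = 1
        rows.append(row)
    return rows


character_vector = one_hot_encode("nihao")
print(len(character_vector), len(character_vector[0]))  # 5 26
```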

S3. Input the character vector into the pre-trained attention feature model, where the attention feature model includes a multi-head attention network and a character feature extraction network.

In this embodiment of the present application, before performing the S3, the attention feature model needs to be trained. In detail, the training of the attention feature model includes:

Step A: constructing an attention feature model to be trained including the multi-head attention network and the character feature extraction network.

In detail, the step A includes: constructing the multi-head attention network according to a multi-head attention mechanism; constructing the character feature extraction network according to a convolutional neural network; and combining the multi-head attention network and the character feature extraction network to obtain the attention feature model to be trained.

Wherein, constructing the multi-head attention network according to the multi-head attention mechanism includes: receiving a trained Transformer model, extracting an encoder from the Transformer model, and using the multi-head attention mechanism in the encoder to construct the multi-head attention network.

In the embodiment of the present application, the user can train the Transformer model in advance. The Transformer model is a deep learning model that can realize classification or fitting, and includes an encoder and a decoder, wherein the encoder includes a multi-head attention mechanism. In the embodiment of the present application, the network layer where the multi-head attention mechanism is located is extracted to construct the multi-head attention network.

Further, in the embodiment of the present application, the attention feature model to be trained is obtained by placing the multi-head attention network first and the character feature extraction network after it.

Step B: Receive a training text set and a training label set, input the training text set into the attention feature model to be trained for feature extraction, and obtain a feature sequence training set.

In the embodiment of the present application, the training text set is a text set collected and sorted out by a user in advance, and the training label set is a voice set corresponding to the training text set. For example, if the training text set contains a training text X₁: "bad environment, not suitable for outing", then the training label set contains the corresponding speech Y₁ = (y₁, y₂, ..., yₙ), where y₁ to yₙ form the speech sequence of speech Y₁.

Further, after the training text set is obtained, the attention feature model to be trained is used for feature extraction. In detail, inputting the training text set into the attention feature model to be trained for feature extraction to obtain a feature sequence training set includes: performing pinyin replacement on the training text set to obtain a pinyin training set; calculating the character positions of the pinyin training set in the alphabet to obtain a position training set; performing an encoding operation on the pinyin training set and the position training set to obtain a vector training set; using the multi-head attention network to perform attention calculation on the vector training set to obtain an attention vector set; performing residual connection on the attention vector set and the vector training set to obtain an attention vector training set; and using the character feature extraction network to perform feature extraction on the attention vector training set to obtain the feature sequence training set.

In detail, the process of obtaining the vector training set by performing pinyin replacement, character position calculation and encoding operations on the training text set is similar to the above S1 and S2, and will not be repeated here.

In the embodiment of the present application, attention calculation is performed on the vector training set according to the multi-head attention mechanism of the encoder in the Transformer model, to obtain the attention vector set.

Further, the present application uses the following formula to perform residual connection on the attention vector set and the vector training set:

result_attention = s + p

Wherein, result_attention represents the attention vector training set, s represents the attention vector set, and p represents the vector training set.

In the embodiment of the present application, the convolution operation in the character feature extraction network is used to perform feature extraction on each attention vector in the attention vector training set in turn, which yields the feature sequence training set. The convolution operation is a convolution calculation based on a convolution kernel, and the size of the convolution kernel is set to 3*3 in this application.

Step C: Build multiple linear activation layers.

After the attention feature model to be trained is obtained and used to perform feature extraction to obtain the feature sequence training set, the present application constructs linear activation layers to assist the training of the attention feature model to be trained, wherein each linear activation layer includes normalization and an activation function, and the activation function can be a Gaussian distribution function.

Step D: Use the multiple linear activation layers to perform an activation operation on the feature sequence training set to obtain a prediction sequence set.

In detail, using the multiple linear activation layers to perform an activation operation on the feature sequence training set to obtain a prediction sequence set includes: performing normalization on the feature sequence training set to obtain a normalized set of feature sequences, using the Gaussian distribution function to calculate the Gaussian distribution of the normalized set of feature sequences, and obtaining the prediction sequence set according to the Gaussian distribution.

Specifically, the normalization is an operation of mapping the values in the feature sequence training set to a specified range, for example, mapping the values in the feature sequence training set to the range [0, 1]. This scales down the values and reduces the computational load.

Further, calculating the Gaussian distribution of the normalized set of feature sequences by using the Gaussian distribution function includes: calculating the mean and variance of the normalized set of feature sequences, and obtaining the Gaussian distribution of the normalized set of feature sequences from the mean and variance.

Since the Gaussian distribution represents the probability distribution of data within a specified range, in the embodiment of the present application, the maximum probability distribution of the training set of feature sequences is found from the Gaussian distribution, that is, the set of prediction sequences is obtained.
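The normalization and Gaussian steps can be sketched as follows. This is one possible reading, not the source's implementation: the source does not specify how the "maximum probability" is selected, so the sketch simply evaluates the Gaussian density at each normalized value and picks the value with the highest density. The sample values and all function names are made up for illustration.

```python
import math


def min_max_normalize(values):
    """Map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def gaussian_pdf(x, mean, variance):
    """Density of a Gaussian distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


features = [2.0, 4.0, 6.0, 10.0]  # made-up feature values
normalized = min_max_normalize(features)
mean = sum(normalized) / len(normalized)
variance = sum((v - mean) ** 2 for v in normalized) / len(normalized)

# Assumed selection rule: keep the normalized value with the highest density.
densities = [gaussian_pdf(v, mean, variance) for v in normalized]
most_likely = normalized[densities.index(max(densities))]
print(normalized, mean, most_likely)
```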

Step E: Calculate the error value between the predicted sequence set and the training label set, and determine the magnitude relationship between the error value and a preset error threshold.

In the embodiment of the present application, the squared difference formula is used to calculate the error value between the predicted sequence set and the training label set.

Step F: If the error value is greater than the error threshold, adjust the internal parameters of the attention feature model to be trained, and return to Step B.

Step G: If the error value is less than or equal to the error threshold, obtain the attention feature model comprising the multi-head attention network and the character feature extraction network.

Specifically, when the error value is less than or equal to the error threshold, it indicates that the attention feature model to be trained has strong character feature extraction capability, and the training is completed to obtain the attention feature model.
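The control flow of steps E to G can be sketched with a toy model. Everything specific here is an assumption: the source does not say how the internal parameters are adjusted, so the sketch uses plain gradient descent on a single scalar weight, and the threshold, learning rate and data are invented.

```python
def squared_error(predicted, target):
    """Sum of squared differences between predictions and labels (Step E)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def train(inputs, targets, error_threshold=1e-4, learning_rate=0.05, max_rounds=10_000):
    weight = 0.0  # stand-in for the model's internal parameters
    error = float("inf")
    for _ in range(max_rounds):
        predicted = [weight * x for x in inputs]       # stand-in for Steps B to D
        error = squared_error(predicted, targets)      # Step E
        if error <= error_threshold:                   # Step G: training is complete
            break
        # Step F: adjust the parameter (assumed rule: gradient descent) and repeat.
        gradient = sum(2 * (p - t) * x for p, t, x in zip(predicted, targets, inputs))
        weight -= learning_rate * gradient
    return weight, error


weight, error = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(weight, error)
```

The `max_rounds` cap is a design choice added for safety, so that the loop terminates even if the threshold is never reached.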

In the embodiment of the present application, once steps A to G have been performed to obtain the trained attention feature model, the character vector can be input into the pre-trained attention feature model.

S4. Use the multi-head attention network to perform attention calculation on the character vector to obtain an attention vector.

In the embodiment of the present application, S4 is similar to the training stage in S3: both use the multi-head attention mechanism of the encoder in the Transformer model to perform the attention calculation and obtain the attention vector.
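For illustration, a multi-head attention calculation in the usual scaled dot-product form can be sketched in plain Python. This is a sketch under assumptions: the source takes its weights from a trained encoder, whereas here randomly initialized matrices stand in for them, and the input size, number of heads and function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    # Placeholder for trained weights, which the source obtains from a trained encoder.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    """x is seq_len x d_model; the result has the same shape."""
    d_model = len(x[0])
    if d_model % num_heads:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        w_q, w_k, w_v = (random_matrix(d_model, d_head, rng) for _ in range(3))
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        k_t = [list(col) for col in zip(*k)]
        scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
        weights = [softmax(row) for row in scores]  # each row sums to 1
        head_outputs.append(matmul(weights, v))
    # Concatenate the heads position by position, then apply the output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, random_matrix(d_model, d_model, rng))


rng = random.Random(0)
character_vector = [[rng.random() for _ in range(8)] for _ in range(4)]  # made-up input
attention_vector = multi_head_attention(character_vector, num_heads=2, rng=rng)
print(len(attention_vector), len(attention_vector[0]))  # 4 8
```

Because the output keeps the shape of the input, it can be added to the character vector directly in the residual connection of S5.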

S5. Perform residual connection on the attention vector and the character vector to obtain a character attention vector.

In the embodiment of the present application, the following formula is used to perform residual connection on the attention vector and the character vector to obtain a character attention vector:

character_attention = m + u

Wherein, character_attention represents the character attention vector, m represents the attention vector, and u represents the character vector.
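The formula amounts to an element-wise sum of two arrays of the same shape, which can be sketched as follows. The function name, the shape check and the sample values are illustrative assumptions.

```python
def residual_connection(m, u):
    """Element-wise sum of the attention vector m and the character vector u."""
    if len(m) != len(u) or any(len(a) != len(b) for a, b in zip(m, u)):
        raise ValueError("m and u must have the same shape")
    return [[a + b for a, b in zip(row_m, row_u)] for row_m, row_u in zip(m, u)]


m = [[0.5, -1.0], [2.0, 0.0]]  # made-up attention vector
u = [[1.0, 0.0], [0.0, 1.0]]   # made-up character vector
character_attention = residual_connection(m, u)
print(character_attention)  # [[1.5, -1.0], [2.0, 1.0]]
```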

S6. Using the character feature extraction network, perform feature extraction on the character attention vector to obtain a character feature sequence.

In the embodiment of the present application, referring to FIG. 2 , the S6 includes:

S61. Perform normalization on the character attention vector to obtain a character normalization vector;

S62. Perform a convolution operation on the character normalization vector to obtain a character convolution vector;

S63. Perform residual connection on the character convolution vector and the character attention vector to obtain the character feature sequence.

The normalization is, as described above, the operation of mapping the values in the character attention vector to a specified range. In this embodiment of the present application, the values in the character attention vector are mapped to the range [0, 1].

In detail, performing a convolution operation on the character normalization vector to obtain a character convolution vector includes: constructing a convolution kernel according to a preset convolution kernel dimension; and using the convolution kernel to perform a convolution operation on the character normalization vector to obtain the character convolution vector.

Further, the residual connection is the same as described above: the character convolution vector and the character attention vector are added element by element to obtain the character feature sequence.
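Steps S61 to S63 can be sketched together as follows. The sketch makes several assumptions: a 3*3 kernel (the size stated for training), zero padding so that the output keeps the input shape, the sliding-window form of convolution used in convolutional networks (no kernel flip), and an averaging kernel chosen only for the example.

```python
def normalize(matrix):
    """S61: map all values of the matrix onto [0, 1]."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in matrix]


def conv2d_same(matrix, kernel):
    """S62: slide the kernel over the matrix, treating cells outside it as zero."""
    rows, cols = len(matrix), len(matrix[0])
    k = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += matrix[rr][cc] * kernel[dr + k][dc + k]
            out[r][c] = total
    return out


def extract_character_features(character_attention, kernel):
    normalized = normalize(character_attention)   # S61
    convolved = conv2d_same(normalized, kernel)   # S62
    # S63: residual connection with the original character attention vector.
    return [[a + b for a, b in zip(rc, ra)] for rc, ra in zip(convolved, character_attention)]


mean_kernel = [[1 / 9] * 3 for _ in range(3)]  # illustrative kernel, not from the source
features = extract_character_features([[0.0, 2.0], [4.0, 8.0]], mean_kernel)
print(features)
```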

S7. Input the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence.

In detail, the pronunciation pause prediction model is formed based on a plurality of fast Fourier transform modules. In the embodiment of the present application, 10 fast Fourier transform modules are used to form the pronunciation pause prediction model.

In detail, the S7 includes: transforming the character pinyin into a word vector to obtain a pinyin vector; inputting the pinyin vector and the character vector into the pronunciation pause prediction model, and using the pronunciation pause prediction model to perform Fourier transform on the pinyin vector and the character vector to obtain a Fourier transform sequence; and performing pronunciation pause prediction on the Fourier transform sequence to obtain the pronunciation pause sequence.

The fast Fourier transform is a fast algorithm for the discrete Fourier transform (DFT), which can predict the Fourier transform sequence corresponding to the character vector and the pinyin vector, wherein the Fourier transform sequence includes speech frequency, amplitude and phase, and the pronunciation pause sequence can be obtained from the Fourier transform sequence.
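As an illustration of the transform itself, a radix-2 fast Fourier transform that returns amplitude and phase per frequency bin can be sketched as below. The source does not specify how the vectors are arranged as a signal or how pauses are derived from the spectrum, so neither is shown; the function names and the sample signal are assumptions.

```python
import cmath
import math


def fft(signal):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(signal)
    if n == 1:
        return [complex(signal[0])]
    if n == 0 or n % 2:
        raise ValueError("signal length must be a power of two")
    even = fft(signal[0::2])
    odd = fft(signal[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def amplitude_and_phase(signal):
    """Return one (amplitude, phase) pair per frequency bin."""
    return [(abs(c), cmath.phase(c)) for c in fft(signal)]


spectrum = amplitude_and_phase([1.0, 1.0, 1.0, 1.0])  # made-up constant signal
print(spectrum[0])  # all energy is in bin 0 for a constant signal
```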

S8. Perform residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and use a pre-built vocoder to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.

In the embodiment of the present application, the vocoder is a decoder that can realize speech synthesis; vocoder types include the channel vocoder, formant vocoder, pattern vocoder, linear prediction vocoder, orthogonal function vocoder, etc. In the embodiment of the present application, the synthesized speech of the character text can be obtained by inputting the speech sequence into the vocoder.

In this embodiment of the present application, speech synthesis is performed in two parts. First, a pre-trained attention feature model is used to perform feature extraction on the character text to obtain the character feature sequence. Second, a pronunciation pause prediction model is used to predict the pronunciation pause sequence of the character text. Finally, residual connection is performed on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and a pre-built vocoder is used to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text. Compared with simply using models such as LSTM and BERT for synthesis as in the background art, the present application not only predicts the character feature sequence but also adds the prediction of the pronunciation pause sequence, so the synthesized speech is closer to a natural human voice in frequency, amplitude and the like. The speech synthesis method, apparatus and computer-readable storage medium proposed in this application can therefore solve the problem that the synthesized speech is not smooth and natural enough.

FIG. 3 is a block diagram of the speech synthesis apparatus of the present application.

The speech synthesis apparatus 100 described in this application can be installed in an electronic device. According to the functions realized, the speech synthesis apparatus may include a character vector construction module 101, a character feature sequence extraction module 102, a pronunciation pause sequence extraction module 103 and a speech synthesis module 104. The modules described in the present application can also be called units; a module refers to a series of computer program segments that can be executed by the processor of the electronic device, can perform fixed functions, and are stored in the memory of the electronic device.

In this embodiment, the functions of each module/unit are as follows:

The character vector construction module 101 is used for receiving character text, performing pinyin replacement on the character text to obtain the character pinyin, using a pre-built alphabet to calculate the character position of the character pinyin in the alphabet, and performing an encoding operation on the character position and the character pinyin to obtain a character vector;

The character feature sequence extraction module 102 is used for inputting the character vector into a pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network, using the multi-head attention network to perform attention calculation on the character vector to obtain an attention vector, performing residual connection on the attention vector and the character vector to obtain a character attention vector, and using the character feature extraction network to perform feature extraction on the character attention vector to obtain a character feature sequence;

The pronunciation pause sequence extraction module 103 is used for inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence;

The speech synthesis module 104 is used for performing residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and using a pre-built vocoder to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
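The way the four modules hand data to one another can be sketched as a small pipeline. This is only a structural illustration: the class name, the field names and the trivial stand-in functions are hypothetical, and no real model or vocoder is involved.

```python
from dataclasses import dataclass
from typing import Callable, List

Matrix = List[List[float]]


@dataclass
class SpeechSynthesisApparatus:
    build_character_vector: Callable[[str], Matrix]         # module 101
    extract_character_features: Callable[[Matrix], Matrix]  # module 102
    extract_pause_sequence: Callable[[Matrix], Matrix]      # module 103
    vocoder: Callable[[Matrix], List[float]]                # used by module 104

    def synthesize(self, text: str) -> List[float]:
        character_vector = self.build_character_vector(text)
        features = self.extract_character_features(character_vector)
        pauses = self.extract_pause_sequence(character_vector)
        # Module 104: residual connection of the two sequences, then the vocoder.
        speech_sequence = [[f + p for f, p in zip(rf, rp)] for rf, rp in zip(features, pauses)]
        return self.vocoder(speech_sequence)


# Trivial stand-ins so the sketch runs; real modules would wrap trained models.
apparatus = SpeechSynthesisApparatus(
    build_character_vector=lambda text: [[float(ord(ch))] for ch in text],
    extract_character_features=lambda vec: vec,
    extract_pause_sequence=lambda vec: [[0.0 for _ in row] for row in vec],
    vocoder=lambda seq: [row[0] for row in seq],
)
samples = apparatus.synthesize("ni")
print(samples)  # [110.0, 105.0]
```

Note that modules 102 and 103 both read the same character vector, which matches the description: the pause sequence is predicted from the character vector, not from the character feature sequence.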

Each module in the speech synthesis apparatus 100 provided by the embodiment of the present application can use the same means as the above-mentioned speech synthesis method, and the specific implementation steps will not be repeated here. The technical effect is the same as that of the above-mentioned speech synthesis method, that is, the problem that the synthesized speech is not smooth and natural is solved.

FIG. 4 is a schematic structural diagram of an electronic device implementing the speech synthesis method of the present application.

The electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a speech synthesis program 12.

Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, optical disc, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc., equipped on the electronic device 1. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 can be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the speech synthesis program 12, but also to temporarily store data that has been output or will be output.

In some embodiments, the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same function or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, etc. The processor 10 is the control core (Control Unit) of the electronic device. It uses various interfaces and lines to connect the components of the entire electronic device, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, executing the speech synthesis program, etc.) and calling data stored in the memory 11.

The bus may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (Extended industry standard architecture, EISA for short) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. The bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.

FIG. 4 only shows an electronic device with some components. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the electronic device 1, which may include fewer or more components than those shown in the drawing, a combination of certain components, or a different arrangement of components.

For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components. The electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.

Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.

Optionally, the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like. The display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.

It should be understood that the embodiments are only used for illustration, and the scope of the patent application is not limited by this structure.

The speech synthesis program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions, and when running in the processor 10, it can realize:

Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory) .

Further, the computer-usable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function, and the like, and the stored data area may store created data and the like.

The present application also provides a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. The readable storage medium stores a computer program which, when executed by the processor of the electronic device, can implement the steps of the speech synthesis method described above.

In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus and method may be implemented in other manners. The apparatus embodiments described above are only illustrative. For example, the division of the modules is only a logical function division, and there may be other division manners in actual implementation.

The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.

It will be apparent to those skilled in the art that the present application is not limited to the details of the above-described exemplary embodiments, but that the present application can be implemented in other specific forms without departing from the spirit or essential characteristics of the present application.

Accordingly, the embodiments are to be regarded in all respects as illustrative and not restrictive. The scope of the application is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be included in this application. Any reference signs in the claims should not be construed as limiting the claims involved.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
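
The linking of data blocks by cryptographic methods can be illustrated with a minimal sketch. It assumes SHA-256 hashing over a JSON payload; the helper names `make_block` and `is_valid` are hypothetical and are not part of the application, which does not specify a block format.

```python
import hashlib
import json


def make_block(data, prev_hash):
    """Build a block whose hash covers its data and the previous block's hash."""
    payload = json.dumps({"data": data, "prev_hash": prev_hash}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"data": data, "prev_hash": prev_hash, "hash": digest}


def is_valid(chain):
    """Check every block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        expected = make_block(block["data"], block["prev_hash"])["hash"]
        if block["hash"] != expected:
            return False  # block contents were altered after hashing
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False  # link to the previous block is broken
    return True


genesis = make_block("first batch of transactions", prev_hash="")
second = make_block("second batch of transactions", prev_hash=genesis["hash"])
chain = [genesis, second]
```

Changing the data of any block without recomputing its hash makes `is_valid(chain)` return `False`, which is the anti-counterfeiting property described above.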

Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Several units or means recited in the system claims can also be realized by one unit or means through software or hardware. Terms such as "first" and "second" are used to denote names and do not denote any particular order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present application without departing from the spirit and scope of those technical solutions.

Claims

1. A speech synthesis method, wherein the method comprises:

receiving a character text, performing pinyin replacement on the character text to obtain character pinyin, and calculating, by using a pre-built alphabet, the character position of the character pinyin in the alphabet;

performing an encoding operation on the character position and the character pinyin to obtain a character vector;

inputting the character vector into a pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network;

performing attention calculation on the character vector by using the multi-head attention network to obtain an attention vector;

performing a residual connection on the attention vector and the character vector to obtain a character attention vector;

performing feature extraction on the character attention vector by using the character feature extraction network to obtain a character feature sequence;

inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence; and

performing a residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and performing speech synthesis on the speech sequence by using a pre-built vocoder to obtain synthesized speech of the character text.
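
The first steps of the method (alphabet positions, encoding, multi-head attention, and the residual connection) can be sketched as follows. This is a toy illustration under stated assumptions, not the claimed implementation: the `ALPHABET` string, the sinusoidal `encode` function, and the use of unprojected vectors as queries, keys and values (no learned weights) are all hypothetical simplifications.

```python
import math

# Hypothetical alphabet: letters, tone digits 1-5, and a space separator.
ALPHABET = "abcdefghijklmnopqrstuvwxyz12345 "


def char_positions(pinyin):
    """Index of each pinyin character in the pre-built alphabet."""
    return [ALPHABET.index(ch) for ch in pinyin]


def encode(pinyin, dim=8):
    """Toy encoding that mixes the alphabet index and the sequence position."""
    vectors = []
    for seq_pos, alpha_pos in enumerate(char_positions(pinyin)):
        vectors.append([
            math.sin((alpha_pos + 1) * (i + 1) / dim) + math.cos(seq_pos * (i + 1) / dim)
            for i in range(dim)
        ])
    return vectors


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(seq):
    """Scaled dot-product self-attention with Q = K = V = seq (assumption)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def multi_head_attention(seq, heads=2):
    """Split each vector into `heads` slices, attend per slice, concatenate."""
    size = len(seq[0]) // heads
    head_outputs = [
        attention_head([v[h * size:(h + 1) * size] for v in seq]) for h in range(heads)
    ]
    return [sum((head[t] for head in head_outputs), []) for t in range(len(seq))]


def residual(a, b):
    """Element-wise sum of two equally shaped vector sequences."""
    return [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]


pinyin = "ni3 hao3"
char_vectors = encode(pinyin)
attention_vectors = multi_head_attention(char_vectors)
char_attention_vectors = residual(attention_vectors, char_vectors)
```

The residual connection keeps the original character vector in the result, so the attention output refines the encoding rather than replacing it. A trained model would add learned projection matrices per head, which the sketch omits.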

2. The speech synthesis method according to claim 1, wherein the performing feature extraction on the character attention vector by using the character feature extraction network to obtain a character feature sequence comprises:

performing normalization on the character attention vector to obtain a character normalization vector;

performing a convolution operation on the character normalization vector to obtain a character convolution vector; and

performing a residual connection on the character convolution vector and the character attention vector to obtain the character feature sequence.

3. The speech synthesis method according to claim 2, wherein the performing a convolution operation on the character normalization vector to obtain a character convolution vector comprises:

constructing a convolution kernel according to a preset convolution kernel dimension; and

performing the convolution operation on the character normalization vector by using the convolution kernel to obtain the character convolution vector.
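
The normalization, convolution with a kernel of preset dimension, and residual connection recited in the two preceding claims might look like the sketch below. It assumes per-vector layer normalization, a one-dimensional convolution along the sequence axis with zero padding, and a fixed averaging kernel; `build_kernel` and the other names are hypothetical, and a trained network would learn its kernel weights.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def build_kernel(size):
    """Hypothetical averaging kernel of the preset (odd) size."""
    return [1.0 / size] * size


def conv1d_same(seq, kernel):
    """Slide the kernel along the sequence axis; zero padding keeps the length.

    As in most deep-learning libraries, the kernel is not flipped.
    """
    pad = len(kernel) // 2
    dim = len(seq[0])
    zero = [0.0] * dim
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for t in range(len(seq)):
        window = padded[t:t + len(kernel)]
        out.append([sum(w * row[j] for w, row in zip(kernel, window)) for j in range(dim)])
    return out


def feature_extraction(char_attention, kernel_size=3):
    """Normalize, convolve, then add the input back (residual connection)."""
    normalized = [layer_norm(v) for v in char_attention]
    convolved = conv1d_same(normalized, build_kernel(kernel_size))
    return [[c + a for c, a in zip(cv, av)] for cv, av in zip(convolved, char_attention)]


features = feature_extraction([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

Because the convolution preserves both the sequence length and the vector width, the residual sum with the character attention vector is well defined.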

4. The speech synthesis method according to claim 1, wherein the inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence comprises:

converting the character pinyin into a word vector to obtain a pinyin vector;

inputting the pinyin vector and the character vector into the pronunciation pause prediction model, and performing a Fourier transform on the pinyin vector and the character vector by using the pronunciation pause prediction model to obtain a Fourier transform sequence; and

performing pronunciation pause prediction on the Fourier transform sequence to obtain the pronunciation pause sequence.
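
A rough sketch of the Fourier-transform step and a subsequent pause prediction is given below. The claim does not state along which axis the transform is taken or how the prediction is made, so the sketch assumes a discrete Fourier transform over each position's concatenated vector and a hypothetical linear scorer followed by a sigmoid; `fourier_sequence` and `predict_pauses` are illustrative names only.

```python
import cmath
import math


def dft(signal):
    """Plain discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def fourier_sequence(pinyin_vecs, char_vecs):
    """Concatenate both vectors per position and keep the DFT magnitudes."""
    return [[abs(z) for z in dft(p + c)] for p, c in zip(pinyin_vecs, char_vecs)]


def predict_pauses(fourier_seq, weights, bias=0.0):
    """Hypothetical scorer: one pause probability per position."""
    probs = []
    for row in fourier_seq:
        score = sum(w * x for w, x in zip(weights, row)) + bias
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs


seq = fourier_sequence([[1.0, 0.0]], [[0.0, 0.0]])
pause_probs = predict_pauses(seq, weights=[0.0, 0.0, 0.0, 0.0])
```

With all-zero weights every position gets a probability of 0.5; in a trained model the weights would be fitted so that positions followed by a pause score higher.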

5. The speech synthesis method according to claim 1, wherein the pre-trained attention feature model is obtained through the following steps:

Step A: constructing an attention feature model to be trained, which includes the multi-head attention network and the character feature extraction network;

Step B: receiving a training text set and a training label set, and inputting the training text set into the attention feature model to be trained for feature extraction to obtain a feature sequence training set;

Step C: building a multi-layer linear activation layer;

Step D: performing an activation operation on the feature sequence training set by using the multi-layer linear activation layer to obtain a prediction sequence set;

Step E: calculating an error value between the prediction sequence set and the training label set, and comparing the error value with a preset error threshold;

Step F: if the error value is greater than the error threshold, adjusting the internal parameters of the attention feature model to be trained, and returning to Step B;

Step G: if the error value is less than or equal to the error threshold, obtaining the attention feature model.
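
The control flow of Steps B to G, which repeats until the error falls to the threshold, can be sketched with a deliberately tiny stand-in model. The single `weight` parameter, the mean-squared error, the gradient-descent update, and the `max_rounds` safeguard are all assumptions made for illustration; the claim specifies neither the error measure nor the parameter update rule.

```python
def train(inputs, labels, error_threshold=1e-4, lr=0.1, max_rounds=10_000):
    """Repeat predict / compare / adjust until the error reaches the threshold."""
    weight = 0.0  # stand-in for the model's internal parameters
    for _ in range(max_rounds):
        # Steps B and D: run the model to obtain the prediction sequence set.
        predictions = [weight * x for x in inputs]
        # Step E: error between predictions and training labels (MSE assumed).
        error = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)
        # Step G: stop once the error is at or below the threshold.
        if error <= error_threshold:
            return weight, error
        # Step F: adjust the parameters (gradient descent assumed) and loop.
        grad = sum(2 * (p - y) * x for p, y, x in zip(predictions, labels, inputs)) / len(labels)
        weight -= lr * grad
    raise RuntimeError("error threshold not reached within max_rounds")


trained_weight, final_error = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

On this toy data the labels are exactly twice the inputs, so the loop ends with `trained_weight` close to 2.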

6. The speech synthesis method according to claim 5, wherein the inputting the training text set into the attention feature model to be trained for feature extraction to obtain a feature sequence training set comprises:

performing pinyin replacement on the training text set to obtain a pinyin training set;

calculating the character positions of the pinyin training set in the alphabet to obtain a position training set;

performing an encoding operation on the pinyin training set and the position training set to obtain a vector training set;

performing attention calculation on the vector training set by using the multi-head attention network to obtain an attention vector set;

performing a residual connection on the attention vector set and the vector training set to obtain an attention vector training set; and

performing feature extraction on the attention vector training set by using the character feature extraction network to obtain the feature sequence training set.

7. The speech synthesis method according to any one of claims 1 to 6, wherein the performing an activation operation on the feature sequence training set by using the multi-layer linear activation layer to obtain a prediction sequence set comprises:

performing normalization on the feature sequence training set to obtain a feature sequence normalization set; and

calculating a Gaussian distribution of the feature sequence normalization set, and calculating the prediction sequence set according to the Gaussian distribution.
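
The claim does not say how the Gaussian distribution yields the predictions. One possible reading, assumed here purely for illustration, is a Gaussian-weighted activation in the style of GELU, in which each normalized value is scaled by the standard normal cumulative distribution function. The function names are hypothetical.

```python
import math
import statistics


def normalize(values):
    """Z-score normalization; a constant input maps to all zeros."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def gaussian_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_activation(values):
    """Assumed reading: weight each normalized value by its Gaussian CDF."""
    return [z * gaussian_cdf(z) for z in normalize(values)]


activated = gaussian_activation([1.0, 2.0, 3.0])
```

Under this reading, values far below the mean are pushed toward zero while values above the mean pass through almost unchanged.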

8. A speech synthesis device, wherein the device comprises:

a character vector building module, configured to receive a character text, perform pinyin replacement on the character text to obtain character pinyin, calculate, by using a pre-built alphabet, the character position of the character pinyin in the alphabet, and perform an encoding operation on the character position and the character pinyin to obtain a character vector;

a character feature sequence extraction module, configured to input the character vector into a pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network, perform attention calculation on the character vector by using the multi-head attention network to obtain an attention vector, perform a residual connection on the attention vector and the character vector to obtain a character attention vector, and perform feature extraction on the character attention vector by using the character feature extraction network to obtain a character feature sequence;

a pronunciation pause sequence extraction module, configured to input the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence; and

a speech synthesis module, configured to perform a residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and perform speech synthesis on the speech sequence by using a pre-built vocoder to obtain synthesized speech of the character text.

9. An electronic device, wherein the electronic device comprises:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of:

receiving a character text, performing pinyin replacement on the character text to obtain character pinyin, and calculating, by using a pre-built alphabet, the character position of the character pinyin in the alphabet;

performing an encoding operation on the character position and the character pinyin to obtain a character vector;

inputting the character vector into a pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network;

performing attention calculation on the character vector by using the multi-head attention network to obtain an attention vector;

performing a residual connection on the attention vector and the character vector to obtain a character attention vector;

performing feature extraction on the character attention vector by using the character feature extraction network to obtain a character feature sequence;

inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence; and

performing a residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and performing speech synthesis on the speech sequence by using a pre-built vocoder to obtain synthesized speech of the character text.

10. The electronic device according to claim 9, wherein the performing feature extraction on the character attention vector by using the character feature extraction network to obtain a character feature sequence comprises:

performing normalization on the character attention vector to obtain a character normalization vector;

performing a convolution operation on the character normalization vector to obtain a character convolution vector; and

performing a residual connection on the character convolution vector and the character attention vector to obtain the character feature sequence.

11. The electronic device according to claim 10, wherein the performing a convolution operation on the character normalization vector to obtain a character convolution vector comprises:

constructing a convolution kernel according to a preset convolution kernel dimension; and

performing the convolution operation on the character normalization vector by using the convolution kernel to obtain the character convolution vector.

12. The electronic device according to claim 9, wherein the inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence comprises:

converting the character pinyin into a word vector to obtain a pinyin vector;

inputting the pinyin vector and the character vector into the pronunciation pause prediction model, and performing a Fourier transform on the pinyin vector and the character vector by using the pronunciation pause prediction model to obtain a Fourier transform sequence; and

performing pronunciation pause prediction on the Fourier transform sequence to obtain the pronunciation pause sequence.

13. The electronic device according to claim 9, wherein the pre-trained attention feature model is obtained through the following steps:

Step A: constructing an attention feature model to be trained, which includes the multi-head attention network and the character feature extraction network;

Step B: receiving a training text set and a training label set, and inputting the training text set into the attention feature model to be trained for feature extraction to obtain a feature sequence training set;

Step C: building a multi-layer linear activation layer;

Step D: performing an activation operation on the feature sequence training set by using the multi-layer linear activation layer to obtain a prediction sequence set;

Step E: calculating an error value between the prediction sequence set and the training label set, and comparing the error value with a preset error threshold;

Step F: if the error value is greater than the error threshold, adjusting the internal parameters of the attention feature model to be trained, and returning to Step B;

Step G: if the error value is less than or equal to the error threshold, obtaining the attention feature model.

14. The electronic device according to claim 13, wherein the inputting the training text set into the attention feature model to be trained for feature extraction to obtain a feature sequence training set comprises:

performing pinyin replacement on the training text set to obtain a pinyin training set;

calculating the character positions of the pinyin training set in the alphabet to obtain a position training set;

performing an encoding operation on the pinyin training set and the position training set to obtain a vector training set;

performing attention calculation on the vector training set by using the multi-head attention network to obtain an attention vector set;

performing a residual connection on the attention vector set and the vector training set to obtain an attention vector training set; and

performing feature extraction on the attention vector training set by using the character feature extraction network to obtain the feature sequence training set.

15. The electronic device according to any one of claims 9 to 14, wherein the performing an activation operation on the feature sequence training set by using the multi-layer linear activation layer to obtain a prediction sequence set comprises:

performing normalization on the feature sequence training set to obtain a feature sequence normalization set; and

calculating a Gaussian distribution of the feature sequence normalization set, and calculating the prediction sequence set according to the Gaussian distribution.

16. A computer-readable storage medium, comprising a storage data area and a storage program area, wherein the storage data area stores created data, and the storage program area stores a computer program; wherein the computer program, when executed by a processor, implements the following steps:

receiving a character text, performing pinyin replacement on the character text to obtain character pinyin, and calculating, by using a pre-built alphabet, the character position of the character pinyin in the alphabet;

performing an encoding operation on the character position and the character pinyin to obtain a character vector;

inputting the character vector into a pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network;

performing attention calculation on the character vector by using the multi-head attention network to obtain an attention vector;

performing a residual connection on the attention vector and the character vector to obtain a character attention vector;

performing feature extraction on the character attention vector by using the character feature extraction network to obtain a character feature sequence;

inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence; and

performing a residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and performing speech synthesis on the speech sequence by using a pre-built vocoder to obtain synthesized speech of the character text.

17. The computer-readable storage medium according to claim 16, wherein the performing feature extraction on the character attention vector by using the character feature extraction network to obtain a character feature sequence comprises:

performing normalization on the character attention vector to obtain a character normalization vector;

performing a convolution operation on the character normalization vector to obtain a character convolution vector; and

performing a residual connection on the character convolution vector and the character attention vector to obtain the character feature sequence.

18. The computer-readable storage medium according to claim 17, wherein the performing a convolution operation on the character normalization vector to obtain a character convolution vector comprises:

constructing a convolution kernel according to a preset convolution kernel dimension; and

performing the convolution operation on the character normalization vector by using the convolution kernel to obtain the character convolution vector.

19. The computer-readable storage medium according to claim 16, wherein the inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence comprises:

converting the character pinyin into a word vector to obtain a pinyin vector;

inputting the pinyin vector and the character vector into the pronunciation pause prediction model, and performing a Fourier transform on the pinyin vector and the character vector by using the pronunciation pause prediction model to obtain a Fourier transform sequence; and

performing pronunciation pause prediction on the Fourier transform sequence to obtain the pronunciation pause sequence.

20. The computer-readable storage medium according to claim 16, wherein the pre-trained attention feature model is obtained through the following steps:

Step A: constructing an attention feature model to be trained, which includes the multi-head attention network and the character feature extraction network;

Step B: receiving a training text set and a training label set, and inputting the training text set into the attention feature model to be trained for feature extraction to obtain a feature sequence training set;

Step C: building a multi-layer linear activation layer;

Step D: performing an activation operation on the feature sequence training set by using the multi-layer linear activation layer to obtain a prediction sequence set;

Step E: calculating an error value between the prediction sequence set and the training label set, and comparing the error value with a preset error threshold;

Step F: if the error value is greater than the error threshold, adjusting the internal parameters of the attention feature model to be trained, and returning to Step B;

Step G: if the error value is less than or equal to the error threshold, obtaining the attention feature model.