WO2020248393A1

WO2020248393A1 - Speech synthesis method and system, terminal device, and readable storage medium

Info

Publication number: WO2020248393A1
Application number: PCT/CN2019/103582
Authority: WO
Inventors: 彭话易; 王健宗
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-06-14
Filing date: 2019-08-30
Publication date: 2020-12-17
Also published as: CN110335587B; CN110335587A

Abstract

A speech synthesis method, comprising: acquiring text data, and generating a text vector on the basis of the text data (S102); acquiring a speech recording of a real person, and modeling the prosody of the speech recording and generating a prosody vector (S104); combining the text vector with the prosody vector and generating a Mel spectrogram (S106); and generating target speech audio on the basis of the Mel spectrogram (S108). In the method, the prosody of the speech recording of a real person is used to implement modeling, and the technique created on the basis of global conditional probability is employed to make the prosody of the synthesized speech more similar to that of the input speech recording of a real person, such that the synthesized speech has high fidelity and high naturalness.

Description

Speech synthesis method, system, terminal equipment and readable storage medium

This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on June 14, 2019, the application number is 201910515578.3, and the invention title is "speech synthesis method, system, terminal equipment and readable storage medium", the entire content of which is approved The reference is incorporated in this application.

Technical field

This application relates to the field of artificial intelligence technology, in particular to the field of speech semantics, and specifically to a speech synthesis method, system, terminal device and readable storage medium.

Background technique

With the development of technology, machines can already speak through speech synthesis technology. The so-called speech synthesis technology is also known as Text to Speech (TTS). Its goal is to allow machines to recognize and understand to turn text information into artificial speech output. It is an important branch of modern artificial intelligence development. Speech synthesis can play a great role in quality inspection, machine question and answer, and disability assistance, which is convenient for people's lives.

However, the existing machine can synthesize speech often has a fixed pattern, the generated speech is more rigid in terms of prosody, and has obvious differences with real people. Therefore, in certain scenarios where the artificiality of synthesized speech is relatively high (such as : Smart outbound calls), end users often cannot accept such a rigid rhythm. Therefore, there is an urgent need for a speech synthesis method based on deep learning.

Summary of the invention

In order to solve at least one of the above technical problems, this application proposes a speech synthesis method, system, terminal device and readable storage medium, which can transfer the prosody in the recording of a real person to the synthesized speech, and realize the fidelity of the synthesized speech. The promotion.

In order to achieve the foregoing objective, the first aspect of the present application provides a speech synthesis method, including:

Acquiring text data, and generating a text vector according to the text data;

Obtain real-person recordings, and model the prosody of the real-person recordings to generate a prosody vector;

Combining the text vector and the prosody vector to generate a Mel language spectrogram;

The target voice is generated according to the Mel language spectrogram.

The second aspect of the present application also provides a speech synthesis system, including:

The text embedding module is used to obtain text data and generate a text vector according to the text data;

The prosody extraction module is used to obtain real-person recordings, and to model the prosody of the real-person recordings to generate a prosody vector;

A mel language spectrum generating module, configured to combine the text vector and the prosody vector to generate a mel language spectrum map;

The voice generation module is used to generate the target voice according to the Mel language spectrogram.

The third aspect of the present application also provides a terminal device, including a memory, a processor, and a computer program stored in the memory and running on the processor. When the processor executes the computer program, The steps of the above speech synthesis method.

The fourth aspect of the present application also provides a computer non-volatile readable storage medium. The computer non-volatile readable storage medium includes a computer program. When the computer program is executed by a processor, Steps of synthetic method.

This application obtains text data and real-person recordings, and generates text vectors from the text data, and then models the prosody of the real-person recordings to generate the prosody vector; then combines the text vector and the prosody vector to generate mel Spectrogram; and then generate the target speech according to the Mel spectrogram, so as to realize the transfer of the prosody in the real-person recording to the synthesized speech. At the same time, this application also models the prosody in real-person recordings, and uses a method based on global conditional probability generation to make the synthesized speech more similar to the input real-person recordings, and further make the synthesized speech have high fidelity and high fidelity. The effect of naturalness.

The additional aspects and advantages of this application will be given in the following description, and some will become obvious from the following description, or be understood through the practice of this application.

Description of the drawings

Fig. 1 shows a flowchart of a speech synthesis method of the present application.

Fig. 2 shows a flowchart of a method for generating a text vector according to an embodiment of the present application.

Fig. 3 shows a flow chart of a method for generating a prosody vector according to an embodiment of the present application.

Fig. 4 shows a block diagram of a speech synthesis system of the present application.

Fig. 5 shows a block diagram of a text embedding module according to an embodiment of the present application.

Fig. 6 shows a block diagram of a prosody extraction module according to an embodiment of the present application.

Fig. 7 shows a schematic diagram of the operation of a speech synthesis system of the present application.

Figure 8 shows a schematic diagram of the operation of a text embedding module of the present application.

Figure 9 shows a schematic diagram of the operation of a prosody extraction module of the present application.

Fig. 10 shows a schematic diagram of a terminal device of the present application.

Detailed ways

In order to be able to understand the above objectives, features and advantages of the application more clearly, the application will be further described in detail below in conjunction with the accompanying drawings and specific implementations. It should be noted that the embodiments of the application and the features in the embodiments can be combined with each other if there is no conflict.

In the following description, many specific details are set forth in order to fully understand this application. However, this application can also be implemented in other ways different from those described here. Therefore, the scope of protection of this application is not covered by the specific details disclosed below. Limitations of the embodiment.

There are three mainstream technical solutions for speech synthesis: parameter synthesis, waveform splicing, and end-to-end. In comparison, the end-to-end technical solutions can make the generated speech have extremely excellent quality. The speech synthesis method, system and terminal equipment proposed in this application are also based on an end-to-end technical solution.

Figure 1 is a flowchart of a speech synthesis method according to this application.

As shown in Figure 1, the first aspect of this application provides a speech synthesis method, including:

S102: Acquire text data, and generate a text vector according to the text data;

S104: Acquire a real person recording, and model the rhythm of the real person recording to generate a prosody vector;

S106: Combine the text vector and the prosody vector to generate a Mel language spectrogram;

S108: Generate a target voice according to the Mel language spectrogram.

It should be noted that the above step S106, combining the text vector and the prosody vector to generate a Mel language spectrogram, specifically includes:

The text vector is used as a local condition, and the prosody vector is used as a global condition, and the sequence-to-sequence model mapping is used to generate the Mel language spectrogram (also called the Mel spectrogram).

Further, the text vector and the prosody vector will be input into a sequence to the sequence model (seq2seq). The sequence-to-sequence model is a neural network model generated based on conditional probability. The input text vector will serve as the local condition, and the input prosody vector will serve as the global condition. Finally, after mapping the pre-trained sequence to the sequence model, the Mel language spectrogram can be obtained.

It should be noted that, after acquiring the real person recording, it also includes performing pre-enhancement processing on the real person recording; the pre-enhancement is performed in units of frames, with the purpose of enhancing high frequencies and increasing the high frequency resolution of the speech. Because the high-frequency end is attenuated by 6dB/oct (octave) above 800Hz, the higher the frequency, the smaller the corresponding component. For this reason, the high-frequency part of the live recording must be improved before the analysis of the live recording, which can also improve the high-frequency signal. Noise ratio.

As shown in Figure 2, acquiring text data and generating a text vector from the text data specifically includes:

S202: Acquire Chinese character data, and perform word segmentation processing on the Chinese character data;

S204: Translate the Chinese character data after word segmentation processing into Chinese pinyin with tones;

S206: Convert the tonal Chinese pinyin obtained by the translation into one-dimensional vector data;

S208: Convert the one-dimensional vector data into two-dimensional vector data according to the time sequence.

It should be noted that the tones include the first tone, second tone, third tone, fourth tone and soft tone of Mandarin. Arabic numerals 1, 2, 3, 4, and 5 are used to represent the first tone and the first tone of Mandarin. The tone codes of the second tone, the third tone, the fourth tone, and the soft tone, but are not limited to these. In other embodiments, other numbers may also be used to represent the four tones and the soft tone of Mandarin.

As shown in Figure 3, acquiring a real person recording, and modeling the prosody of the real person recording to generate a prosody vector, which specifically includes:

S302: Perform short-time Fourier transform on the acquired real person recording to obtain a corresponding spectrogram;

S304, performing mel filtering on the spectrogram to obtain a mel language spectrogram;

S306: Perform time sequence compression and feature representation optimization on the Mel language spectrogram;

S308: Perform cyclic neural network processing on the Mel language spectrogram, and output according to time sequence;

S310: Obtain the output at each moment, and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.

It should be noted that the Short Time Fourier Transform (STFT) is a mathematical transformation related to the Fourier Transform, which is used to determine the frequency and phase of a sine wave in a local area of a time-varying signal. Specifically, the short-time Fourier transform is to cut the original Fourier transform into multiple segments in the time domain to perform Fourier transform separately, each segment is recorded as time t, and the frequency domain characteristics are obtained corresponding to the Fourier transform. The frequency domain characteristics at time t can be roughly estimated (that is, the correspondence between the time domain and the frequency domain is also guided). The tool used for signal truncation is called a window function (the width is equivalent to the length of time). The smaller the window, the more obvious the time domain characteristics, but at this time, the FFT reduces the accuracy due to too few points, resulting in insignificant frequency domain characteristics.

It can be understood that in other embodiments, wavelet transform or Wigner distribution may also be used to obtain the spectrogram, but it is not limited thereto.

Specifically, the real person recording is a one-dimensional signal; the spectrogram is a two-dimensional signal.

According to the embodiment of the present application, before the short-time Fourier transform is performed on the acquired real person recording to obtain the corresponding spectrogram, random noise needs to be added to the acquired real person recording. In data enhancement, some audio is usually synthesized manually, but some artificially synthesized (using software) audio may cause some digital errors, such as underflow, overflow, etc. This application can effectively avoid the above-mentioned digital error problem by adding random noise to the audio.

As shown in Figure 4, the second aspect of the present application also provides a speech synthesis system 4, the speech synthesis system 4 includes:

The text embedding module 41 is used to obtain text data and generate a text vector according to the text data;

The prosody extraction module 42 is used to obtain real-person recordings, and to model the prosody of the real-person recordings to generate a prosody vector;

The Mel language spectrum generating module 43 is configured to combine the text vector and the prosody vector to generate a Mel language spectrum map;

The speech generation module 44 is used to generate a target speech according to the Mel language spectrogram.

In the embodiment of the present application, the Mel spectrum generation module 43 is a sequence-to-sequence model (seq2seq), and the sequence-to-sequence model is a neural network model generated based on conditional probability. Specifically, the text vector and the prosody vector will be input into a sequence to the sequence model, the input text vector will be the local condition, and the input prosody vector will be the global condition. Finally, after mapping the pre-trained sequence to the sequence model, the Mel language spectrogram can be obtained.

It should be noted that the sequence-to-sequence model used in the Mel language spectrum generation module 43 and the prosody extraction module use the same undisclosed speech database for joint training. The voice database contains a male/female speaker (i.e. the source speaker) in a quiet environment, with a total length of about 30 hours of voice files recorded with a special recording device, and the text file corresponding to each voice.

It should be noted that after acquiring the real person recording, it also includes performing pre-enhancement processing on the real person recording; the pre-enhancement is performed in units of frames, with the purpose of enhancing high frequencies and increasing the high frequency resolution of speech. Because the high-frequency end is attenuated by 6dB/oct (octave) above 800Hz, the higher the frequency, the smaller the corresponding component. For this reason, the high-frequency part of the live recording must be improved before the analysis of the live recording, which can also improve the high-frequency signal. Noise ratio.

As shown in FIG. 5, the text embedding module 41 includes:

The word segmentation unit 411 is used to obtain Chinese character data and perform word segmentation processing on the Chinese character data;

The language model unit 412 is used for translating the Chinese character data after word segmentation processing into tonal Chinese pinyin;

The one-hot encoding unit 413 is used to convert the tonal Chinese pinyin obtained by the translation into one-dimensional vector data;

The text vector generating unit 414 is configured to convert one-dimensional vector data into two-dimensional vector data according to a time sequence;

Wherein, the tones include the first tones, the second tones, the third tones, the fourth tones, and the soft tones. The Arabic numerals 1, 2, 3, 4, and 5 are used to represent the first, second, Tone codes for the third, fourth, and soft tone.

In the embodiment of the present application, the one-hot encoding unit performs One-Hot-coding (One-Hot-coding) method is: use N-bit status register to encode N states, each state has its own independent register bit, And at any time, only one of them is valid. For example, to encode six states:

The natural sequence code is 000,001,010,011,100,101;

The one-hot code is 000001,000010,000100,001000,010000,100000.

As shown in FIG. 6, the prosody extraction module 42 includes:

The short-time Fourier transform unit 421 is configured to perform short-time Fourier transform on the acquired real person recordings to obtain a corresponding spectrogram;

A mel filtering unit 422, configured to perform mel filtering on the spectrogram to obtain a mel language spectrogram;

The convolutional neural network unit 423 is configured to perform time sequence compression and feature representation optimization on the Mel language spectrogram;

The GRU unit 424 is configured to perform cyclic neural network processing on the Mel language spectrogram, and output according to time sequence;

The prosody vector generating unit 425 is used to obtain the output at each moment and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.

It can be understood that in other embodiments, the short-time Fourier transform unit 421 may also be replaced by a wavelet transform unit or a Wigner distribution unit, but it is not limited thereto.

According to the embodiment of the present application, before the short-time Fourier transform is performed on the acquired real-person recording to obtain the corresponding spectrogram, random noise needs to be added to the acquired real-person recording. In data enhancement, some audio is usually synthesized manually, but some artificially synthesized (using software) audio may cause some digital errors, such as underflow, overflow, etc. This application can effectively avoid the above-mentioned digital error problem by adding random noise to the audio.

It should be noted that Convolutional Neural Networks (CNN) is a type of feedforward neural network that includes convolution calculations and has a deep structure. The convolutional neural network includes an input layer, a hidden layer, and an output layer.

The input layer of the convolutional neural network can process multi-dimensional data. Among them, the input layer of the one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array. The one-dimensional array is usually time or spectrum sampling; the two-dimensional array may contain multiple channels; The input layer of the two-dimensional convolutional neural network receives a two-dimensional or three-dimensional array; the input layer of the three-dimensional convolutional neural network receives a four-dimensional array. In the embodiment of the present application, the convolutional neural network unit adopts a two-dimensional convolutional neural network to perform a six-layer two-dimensional convolution operation on a real-person recorded Mel language spectrogram.

The hidden layer of a convolutional neural network includes a convolutional layer, a pooling layer, and a fully connected layer. The function of the convolutional layer is to extract features from the input data. It contains multiple convolution kernels, which make up each of the convolution kernels. Each element corresponds to a weight coefficient and a deviation, which is similar to a neuron of a feedforward neural network; after feature extraction in the convolutional layer, the output feature map will be passed to the pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function whose function is to replace the result of a single point in the feature map with the feature map statistics of its neighboring regions. The fully connected layer is built in the last part of the hidden layer of the convolutional neural network and only transmits signals to other fully connected layers.

Recurrent Neural Network (RNN) is a type of recurrent neural network that takes sequence data as input, recursively in the evolution direction of the sequence, and all nodes (recurrent units) are connected in a chain to form a closed loop. Recurrent neural network has memory, parameter sharing and Turing completeness, so it can learn the nonlinear characteristics of the sequence with high efficiency.

As shown in Figure 7, the operating process of the speech synthesis system is:

Input the text content to be synthesized (for example: hello) into the speech synthesis system, and the text embedding module embeds the text content into a text vector.

Input the real person recording to the speech synthesis system, and the prosody extraction module will model the prosody of the real person recording to form a prosody vector.

Input the generated text vector and the prosody vector into the trained Mel language spectrum generation module to generate a Mel language spectrum map.

A trained speech generation module is used to synthesize the Mel language spectrogram into a high-fidelity speech file. Preferably, the speech generation module is a speech vocoder.

As shown in Figure 8, in the text embedding module, the input Chinese characters (Hello) will be segmented first, and then a trained language model will translate the Chinese characters into tonal Chinese pinyin; through one-hot encoding, Convert the translated Pinyin letters and tone codes (numbers 1 to 5) into one-dimensional vector data, and then convert them into a two-dimensional vector data according to the time series.

After the voice generation module obtains the Mel spectrogram, it uses the Mel spectrogram as a conditional input to generate the voice of the target speaker. In this embodiment, the voice generation module is a WaveNet vocoder, which is determined by A non-public speech database is trained, and the database is the same as the speech database used for training the Mel language spectrum generation module.

As shown in Figure 9, the prosody extraction module uses a cyclic neural network to realize the conversion between real-person recordings and prosody vectors. The specific steps are as follows:

First, perform a short-time Fourier transform on the input real person recording, and then use a mel filter to obtain its mel language spectrogram. The Mel language spectrogram will be input into a six-layer pre-trained convolutional neural network, the purpose is to compress the time sequence and better represent the features in the Mel language spectrum. The processed Mel language spectrogram will be input into the GRU unit-based recurrent neural network for processing according to time sequence, and the recurrent neural network will output according to the time sequence. After getting the output of each moment, a fully connected network converts the output of the recurrent neural network into a two-dimensional vector, which is the prosody vector.

As shown in FIG. 10, the third aspect of the present application further provides a terminal device 7. The terminal device 7 includes: a processor 71, a memory 72, and a computer stored in the memory 72 and running on the processor 71 Program 73, such as a program. When the processor 71 executes the computer program 73, the steps in the foregoing embodiments of the speech synthesis method are implemented.

In the embodiment of the present application, the computer program 73 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 72 and executed by the processor 71 To complete this application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 73 in the terminal device 7. For example, the computer program 73 can be divided into a text embedding module, a prosody extraction module, a Mel language spectrum generation module, and a speech generation module. The specific functions of each module are as follows:

The text embedding module is used to obtain text data and generate a text vector from the text data;

The terminal device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud management server. The terminal device 7 may include, but is not limited to, a processor 71 and a memory 72. Those skilled in the art can understand that FIG. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7. It may include more or less components than shown in the figure, or a combination of certain components, or different components. For example, the terminal device may also include input and output devices, network access devices, buses, etc.

The so-called processor 71 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Ready-made programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.

The memory 72 may be an internal storage unit of the terminal device 7, such as a hard disk or memory of the terminal device 7. The memory 72 may also be an external storage device of the terminal device 7, such as a plug-in hard disk equipped on the terminal device 7, a smart memory card (Smart Media Card, SMC), or a Secure Digital (SD). Card, Flash Card, etc. Further, the memory 72 may also include both an internal storage unit of the terminal device 7 and an external storage device. The memory 72 is used to store the computer program and other programs and data required by the terminal device. The memory 72 can also be used to temporarily store data that has been output or will be output.

The fourth aspect of the present application also provides a computer non-volatile readable storage medium, the computer non-volatile readable storage medium includes a computer program, and when the computer program is executed by a processor, the above-mentioned voice is realized Steps of synthetic method.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, such as: multiple units or components can be combined, or It can be integrated into another system, or some features can be ignored or not implemented. In addition, the coupling, or direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or other forms of.

The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed on multiple network units; Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present application can all be integrated into one processing unit, or each unit can be individually used as a unit, or two or more units can be integrated into one unit; The unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.

Those of ordinary skill in the art can understand that all or part of the steps in the above method embodiments can be implemented by a program instructing relevant hardware. The foregoing program can be stored in a computer readable storage medium. When the program is executed, the execution includes The steps of the foregoing method embodiment; and the foregoing storage medium includes: removable storage devices, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks or optical disks, etc. The medium storing the program code.

Alternatively, if the above-mentioned integrated unit of this application is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application essentially or the part that contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for A computer device (which may be a personal computer, a server, or a network device, etc.) executes all or part of the methods described in the various embodiments of the present application. The aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks or optical disks and other media that can store program codes.

The above are only specific implementations of this application, but the protection scope of this application is not limited to this. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed in this application. Should be covered within the scope of protection of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

A method for speech synthesis, characterized in that it comprises:

Acquiring text data, and generating a text vector according to the text data;

Obtain real-person recordings, and model the prosody of the real-person recordings to generate a prosody vector;

Combining the text vector and the prosody vector to generate a Mel language spectrogram;

The target voice is generated according to the Mel language spectrogram.
The speech synthesis method according to claim 1, wherein obtaining text data and generating a text vector from the text data comprises:

Acquire Chinese character data, and perform word segmentation processing on the Chinese character data;

Translate the Chinese character data after word segmentation into Chinese Pinyin with tones;

Convert the tonal Chinese pinyin obtained by the translation into one-dimensional vector data;

Convert one-dimensional vector data into two-dimensional vector data according to time series.
The speech synthesis method according to claim 1, characterized in that acquiring a real person recording and modeling the prosody of the real person recording to generate a prosody vector comprises:

Perform short-time Fourier transform on the acquired real person recordings to obtain the corresponding spectrogram;

Performing mel filtering on the spectrogram to obtain a mel language spectrogram;

Compress the sequence of the Mel language spectrogram and optimize the feature representation;

Perform cyclic neural network processing on the Mel language spectrogram, and output according to time sequence;

Obtain the output at each moment, and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.
The speech synthesis method according to claim 1, wherein generating a Mel language spectrogram in combination with the text vector and the prosodic vector comprises:

The text vector will be used as a local condition, and the prosody vector will be used as a global condition, and after the sequence-to-sequence model is mapped, the Mel language spectrogram is generated.
The speech synthesis method according to claim 2, wherein the tones include the first, second, third, fourth and soft sounds of Mandarin, using Arabic numerals 1, 2, 3, 4, 5 respectively represent the tone codes of the first, second, third, fourth and soft tone of Mandarin.
A speech synthesis system is characterized in that it comprises:

The text embedding module is used to obtain text data and generate a text vector according to the text data;

The prosody extraction module is used to obtain real-person recordings, and to model the prosody of the real-person recordings to generate a prosody vector;

A mel language spectrum generating module, configured to combine the text vector and the prosody vector to generate a mel language spectrum map;

The voice generation module is used to generate the target voice according to the Mel language spectrogram.
The speech synthesis system according to claim 6, wherein the text embedding module comprises:

The word segmentation unit is used to obtain Chinese character data and perform word segmentation processing on the Chinese character data;

The language model unit is used to translate the Chinese character data after word segmentation into tonal Chinese pinyin;

One-hot encoding unit for converting the tonal Chinese pinyin obtained by translation into one-dimensional vector data;

The text vector generating unit is used to convert one-dimensional vector data into two-dimensional vector data according to time series.
The speech synthesis system according to claim 7, wherein the tones include the first, second, third, fourth and soft sounds of Mandarin, using Arabic numerals 1, 2, 3, 4, 5 respectively represent the tone codes of the first, second, third, fourth and soft tone of Mandarin.
The speech synthesis system according to claim 6, wherein the prosody extraction module comprises:

The short-time Fourier transform unit is used to perform short-time Fourier transform on the acquired real person recordings to obtain the corresponding spectrogram;

A mel filtering unit, configured to perform mel filtering on the spectrogram to obtain a mel language spectrogram;

The convolutional neural network unit is used to perform time-series compression and feature representation optimization on the Mel language spectrogram;

The GRU unit is used to perform cyclic neural network processing on the Mel language spectrogram and output according to time sequence;

The prosody vector generating unit is used to obtain the output at each moment and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.
The speech synthesis system according to claim 6, wherein the Mel language spectrum generation module uses the text vector as a local condition and the prosody vector as a global condition, and passes the sequence to sequence model After the mapping, the Mel language spectrogram is generated.
A terminal device, comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that the processor implements the following steps when the processor executes the computer program:

Acquiring text data, and generating a text vector according to the text data;

Obtain real-person recordings, and model the prosody of the real-person recordings to generate a prosody vector;

Combining the text vector and the prosody vector to generate a Mel language spectrogram;

The target voice is generated according to the Mel language spectrogram.
The terminal device according to claim 11, wherein obtaining text data and generating a text vector from the text data comprises:

Acquire Chinese character data, and perform word segmentation processing on the Chinese character data;

Translate the Chinese character data after word segmentation into Chinese Pinyin with tones;

Convert the tonal Chinese pinyin obtained by the translation into one-dimensional vector data;

Convert one-dimensional vector data into two-dimensional vector data according to time series.
The terminal device according to claim 11, wherein obtaining a real person recording and modeling the prosody of the real person recording to generate a prosody vector comprises:

Perform short-time Fourier transform on the acquired real person recordings to obtain the corresponding spectrogram;

Performing mel filtering on the spectrogram to obtain a mel language spectrogram;

Compress the sequence of the Mel language spectrogram and optimize the feature representation;

Perform cyclic neural network processing on the Mel language spectrogram, and output according to time sequence;

Obtain the output at each moment, and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.
The terminal device according to claim 11, wherein generating a Mel language spectrogram in combination with the text vector and the prosodic vector comprises:

The text vector will be used as a local condition, and the prosody vector will be used as a global condition, and after the sequence-to-sequence model is mapped, the Mel language spectrogram is generated.
The terminal device according to claim 12, wherein the tones include the first, second, third, fourth and soft sounds of Mandarin, using Arabic numerals 1, 2, 3, 4, and 5. They represent the tone codes of the first tone, second tone, third tone, fourth tone, and soft tone of Mandarin.
A computer non-volatile readable storage medium, wherein the computer non-volatile readable storage medium includes a computer program, and when the computer program is executed by a processor, the following steps are implemented:

Acquiring text data, and generating a text vector according to the text data;

Obtain real-person recordings, and model the prosody of the real-person recordings to generate a prosody vector;

Combining the text vector and the prosody vector to generate a Mel language spectrogram;

The target voice is generated according to the Mel language spectrogram.
The computer non-volatile readable storage medium according to claim 16, wherein obtaining text data and generating a text vector from the text data comprises:

Acquire Chinese character data, and perform word segmentation processing on the Chinese character data;

Translate the Chinese character data after word segmentation into Chinese Pinyin with tones;

Convert the tonal Chinese pinyin obtained by the translation into one-dimensional vector data;

Convert one-dimensional vector data into two-dimensional vector data according to time series.
The computer non-volatile readable storage medium according to claim 16, wherein acquiring a real person recording and modeling the prosody of the real person recording to generate a prosody vector comprises:

Perform short-time Fourier transform on the acquired real person recordings to obtain the corresponding spectrogram;

Performing mel filtering on the spectrogram to obtain a mel language spectrogram;

Compress the sequence of the Mel language spectrogram and optimize the feature representation;

Perform cyclic neural network processing on the Mel language spectrogram, and output according to time sequence;

Obtain the output at each moment, and convert all the outputs of the recurrent neural network into a two-dimensional prosody vector.
The computer non-volatile readable storage medium according to claim 16, wherein generating a Mel language spectrogram in combination with the text vector and the prosodic vector comprises:

The text vector will be used as a local condition, and the prosody vector will be used as a global condition, and after the sequence-to-sequence model is mapped, the Mel language spectrogram is generated.
The computer non-volatile readable storage medium according to claim 17, wherein the tones include the first, second, third, fourth and soft sounds of Mandarin, using Arabic numerals 1. 2, 3, 4, and 5 respectively represent the tone codes of the first, second, third, fourth, and soft tone of Putonghua.