CN109616093A - End-to-end speech synthesis method, device, equipment and storage medium - Google Patents
End-to-end speech synthesis method, device, equipment and storage medium
- Publication number
- CN109616093A (application number CN201811482781.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- audio
- length
- training sample
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L2013/083—Special characters, e.g. punctuation marks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L2013/105—Duration
Abstract
The invention discloses an end-to-end speech synthesis method belonging to the field of speech synthesis technology. The method comprises: obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices; performing regularization and vectorization processing on the text samples corresponding to the voice training samples to obtain text training samples containing text vectors of a preset text length; and performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model. The invention uses a DNN-based attention mechanism which, compared with CNN and RNN training models, reduces model complexity while also accelerating convergence.
Description
Technical field
The present invention relates to the field of speech synthesis technology, and in particular to an end-to-end speech synthesis method, device, equipment and storage medium.
Background technique
In the field of speech synthesis, there has not yet been an end-to-end technical framework whose input is text and whose output is a voice signal. For example, WaveNet (Aaron van den Oord et al., 2016) needs to predict multiple frequency-domain parameters such as mel-cepstral coefficients and the fundamental frequency F0, and its text pre-processing is also relatively complicated (Jonathan Shen et al., 2017). Recent deep learning frameworks for speech synthesis such as Tacotron (Yuxuan Wang et al., 2017), Tacotron 2 (Jonathan Shen et al., 2017) and Deep Voice 3 (Wei Ping et al., 2017) use a vocoder (Griffin-Lim, WORLD or WaveNet) as a speech synthesis post-processing module, so these models are not truly end-to-end and their model complexity is also high.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the high complexity of speech synthesis pre-processing and post-processing in the prior art. To this end, an end-to-end speech synthesis method, device, equipment and storage medium are proposed which, by using an attention mechanism based on a DNN (deep neural network), reduce model complexity while also accelerating convergence.
The present invention solves the above technical problem through the following technical solutions:
An end-to-end speech synthesis method, comprising the following steps:
Obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
Performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
Performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
Preferably, the pre-processing includes cutting the audio in the speech samples and padding the audio in the speech samples, so that the length of the audio equals the preset audio length.
Preferably, the pre-processing specifically includes the following steps:
Calculating the difference between the length of the audio in the speech sample and the preset audio length;
Judging the data type of the difference, the data type including negative number, positive number and zero;
If the difference is negative, appending silence at the end of the speech sample, the length of the silence being equal to the absolute value of the difference;
If the difference is positive, cutting off the part of the audio in the speech sample that exceeds the preset audio length, the length of the audio removed being equal to the absolute value of the difference;
If the difference is zero, leaving the speech sample unprocessed.
Preferably, the regularization processing includes converting non-Chinese character strings into Chinese character strings, and determining the pronunciation of the non-Chinese character strings according to the converted Chinese character strings.
Preferably, the vectorization processing includes:
Converting, according to a preset text dictionary, each character in the text sample into the serial number corresponding to that character in the text dictionary;
Arranging the serial numbers of the characters into a vector in the order of the characters in the text sample, so as to obtain the text vector corresponding to the text sample;
Cutting or padding any text vector whose length differs from the preset text length, so that the length of the text vector equals the preset text length.
Preferably, the cutting is used when the difference between the length of the text vector and the preset text length is positive; the cutting step includes removing the part of the text vector that exceeds the preset text length, the length of the removed part being equal to the absolute value of the difference.
The padding is used when the difference between the length of the text vector and the preset text length is negative; the padding step includes appending zeros at the end of the text vector, the number of appended zeros being equal to the absolute value of the difference.
Preferably, the speech synthesis training comprises the following steps:
Adding a positional mask to the audio matrix contained in the voice training sample to obtain an initial audio matrix, and successively applying to the initial audio matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain a processed audio matrix;
Shifting the text vector contained in the text training sample right by two positions and adding a positional mask to obtain an initial text matrix, and successively applying to the initial text matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain a processed text matrix;
Jointly passing the processed audio matrix and the processed text matrix through self-attention processing, layer normalization and one feed-forward neural network layer, to obtain a voice prediction signal;
Computing a loss function from the voice prediction signal and the processed text matrix, performing back-propagation using the chain rule, and iterating continuously until an optimal speech synthesis model is obtained.
The invention also discloses an end-to-end speech synthesis device, comprising:
a speech processing module, for obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
a text processing module, for performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
a training module, for performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
The invention also discloses a computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the aforementioned end-to-end speech synthesis method.
The invention also discloses a computer-readable storage medium storing a computer program which can be executed by at least one processor to implement the steps of the aforementioned end-to-end speech synthesis method.
The positive effects of the present invention are: the complexity of pre-processing and post-processing is reduced; and, by using a DNN-based attention mechanism, model complexity is reduced compared with CNN (convolutional neural network) and RNN (recurrent neural network) training models while convergence is also accelerated.
Brief description of the drawings
Fig. 1 shows a flowchart of embodiment one of the end-to-end speech synthesis method of the present invention;
Fig. 2 shows a flowchart of the pre-processing in embodiment one of the end-to-end speech synthesis method of the present invention;
Fig. 3 shows a flowchart of the vectorization processing in embodiment one of the end-to-end speech synthesis method of the present invention;
Fig. 4 shows a flowchart of the speech synthesis training in embodiment one of the end-to-end speech synthesis method of the present invention;
Fig. 5 shows a structural diagram of an embodiment of the end-to-end speech synthesis device of the present invention;
Fig. 6 shows a schematic diagram of the hardware architecture of an embodiment of the computer device of the present invention.
Specific embodiment
The present invention is further illustrated below by way of embodiments, but is not thereby limited to the scope of these embodiments.
Firstly, the present invention proposes an end-to-end speech synthesis method.
In embodiment one, as shown in Fig. 1, the end-to-end speech synthesis method includes the following steps:
Step 01: obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices.
The speech samples are usually original audio recordings collected in advance. These original recordings have not been processed, their audio lengths vary, and they are usually stored in a corpus after collection. Obtaining speech samples here usually means obtaining them from such a corpus.
The difference between a speech sample and a voice training sample is that the audio lengths of speech samples are not uniform, whereas the audio lengths in the voice training samples are uniform, for example uniformly 10 seconds. The pre-processing is precisely the process of turning speech samples into voice training samples.
The pre-processing specifically includes cutting the audio in the speech samples and padding the audio in the speech samples, so that the length of the audio equals the preset audio length; it mainly targets the samples whose audio length is not equal to the preset audio length. For example, if the audio length of the voice training samples is uniformly preset to 10 seconds, then a speech sample that is too long, such as a 12.5-second recording, has its last 2.5 seconds cut off, while a speech sample that is too short, such as an 8.4-second recording, is padded by appending 1.6 seconds of silence at its end, bringing it up to 10 seconds.
In one embodiment, as shown in Fig. 2, the pre-processing specifically includes the following steps:
Step 11: calculating the difference between the length of the audio in the speech sample and the preset audio length.
Step 12: judging the data type of the difference, the data type including negative number, positive number and zero.
Step 13: if the difference is negative, appending silence at the end of the speech sample, the length of the silence being equal to the absolute value of the difference.
Step 14: if the difference is positive, cutting off the part of the audio in the speech sample that exceeds the preset audio length, the length of the audio removed being equal to the absolute value of the difference.
Step 15: if the difference is zero, leaving the speech sample unprocessed, i.e. using the speech sample directly as the voice training sample.
The dimension of the audio matrix into which the audio is converted is set as needed. Taking a 10-second audio sample at a sample rate of 8 kHz as an example, there are 80,000 sample points in total; if each frame takes 800 points and adjacent frames do not overlap, the audio can be converted into an audio matrix of 100 frames of 800 sample points each, i.e. an audio matrix of dimension 100 by 800.
Step 02: performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length.
Regularization here mainly means converting non-Chinese character strings into Chinese character strings, and determining the pronunciation of the non-Chinese character strings according to the converted Chinese character strings. For example, the digit string "123" is converted into the corresponding Chinese characters "一二三".
The vectorization processing is mainly intended to obtain text training samples containing text vectors of the preset text length, and specifically includes the following steps (as shown in Fig. 3):
Step 21: converting, according to a preset text dictionary, each character in the text sample into the serial number corresponding to that character in the text dictionary.
A text dictionary is a set of Chinese characters, each of which corresponds to a serial number. Taking the text sample "你吃饭没有" ("Have you eaten?") as an example, in a certain text dictionary the serial number of "你" (you) is 12, that of "吃" (eat) is 66, that of "饭" (meal) is 35, that of "没" (not) is 973, and that of "有" (have) is 465.
Step 22: arranging the serial numbers of the characters into a vector in the order of the characters in the text sample, so as to obtain the text vector corresponding to the text sample.
Continuing the example, the text sample "你吃饭没有" is converted into the text vector [12, 66, 35, 973, 465].
Step 23: cutting or padding any text vector whose length differs from the preset text length, so that the length of the text vector equals the preset text length.
Specifically, padding is used when the difference between the length of the text vector and the preset text length is negative: zeros are appended at the end of the text vector, the number of appended zeros being equal to the absolute value of the difference.
Continuing the example, assume the text length of the training text vectors is uniformly set to 10; then the text vector [12, 66, 35, 973, 465] corresponding to the text sample above needs to be padded with zeros, giving the text vector [12, 66, 35, 973, 465, 0, 0, 0, 0, 0], which can then be used as a text training sample.
Cutting is used when the difference between the length of the text vector and the preset text length is positive: the part of the text vector that exceeds the preset text length is removed, the length of the removed part being equal to the absolute value of the difference.
Similarly, if the length of a text vector exceeds 10, such as the text vector [99, 331, 55, 62, 2355, 888, 999, 535, 676, 2, 22, 36, 68], the text vector needs to be cut, giving the text vector [99, 331, 55, 62, 2355, 888, 999, 535, 676, 2] that can be used as a text training sample. It should be noted that padding and cutting are done because deep learning requires standard samples for batch training, in order to guarantee that every dimension of the samples is consistent in size; the cut-off part of the text vector is simply discarded.
Step 03: performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
The speech synthesis training specifically includes the following four steps (as shown in Fig. 4):
Step 31: adding a positional mask to the audio matrix contained in the voice training sample to obtain an initial audio matrix, and successively applying to the initial audio matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain the processed audio matrix.
The technique of adding a positional mask (positional encoding) belongs to the prior art and is not repeated here.
The positional mask encodes the position information of the original matrix. For example, for the matrix X = [[0.1, 0.5], [0.4, -0.3], [-0.9, 0.6]], the position indices of the three vectors in the matrix are 1, 2 and 3 respectively; three vectors such as [0, -0.5], [-0.4, -0.6], [0.3, 0.5] are randomly generated in the range -1 to 1, and the new matrix formed by these three vectors, [[0, -0.5], [-0.4, -0.6], [0.3, 0.5]], is the positional mask of the original matrix.
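A sketch of this positional mask, following the random-vector example above (a deterministic sinusoidal positional encoding is the more common prior-art choice; the uniform sampling here simply mirrors the example in this description):

```python
import numpy as np

def positional_mask(matrix, rng=np.random.default_rng(0)):
    """One random vector in [-1, 1] per row position of the original matrix, with the same shape."""
    return rng.uniform(-1.0, 1.0, size=matrix.shape)

X = np.array([[0.1, 0.5], [0.4, -0.3], [-0.9, 0.6]])
mask = positional_mask(X)   # three random 2-dimensional vectors, one per position index
```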
The processing of the initial audio matrix is specifically as follows:
The initial audio matrix undergoes non-sparsification processing to obtain a first audio matrix, which is a non-sparsified matrix. The so-called non-sparsification processing means directly summing the positional mask information with the audio matrix; the result is the non-sparsified matrix.
Self-attention processing is then applied to the first audio matrix to obtain a second audio matrix.
Next, using the residual-network technique, the second audio matrix is added to the first audio matrix to obtain a third audio matrix.
Finally, the third audio matrix is passed through one feed-forward neural network layer and a further residual-network step to obtain a fourth audio matrix; this fourth audio matrix is the processed audio matrix mentioned above. The so-called feed-forward neural network layer processing means that, given an input matrix M and a neural network weight matrix K, the result obtained after the feed-forward layer is M*K.
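The audio-side block of step 31 can be sketched as below, reusing positional_mask from the sketch above. The self-attention step is passed in as a function (one possible realisation is sketched after the self-attention description further on), K is the feed-forward weight matrix, assumed square so the shapes line up, and the exact residual wiring is a reading of the description rather than a verbatim specification.

```python
def audio_block(audio_matrix, self_attention, K):
    """Non-sparsification, self-attention, residual add, feed-forward layer (M*K), residual add."""
    first = audio_matrix + positional_mask(audio_matrix)   # non-sparsification: sum with positional mask
    second = self_attention(first)                          # self-attention processing
    third = first + second                                  # first residual-network step
    fourth = third + third @ K                              # feed-forward layer, then second residual step
    return fourth                                           # the "processed audio matrix"
```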
Step 32: shifting the text vector contained in the text training sample right by two positions and adding a positional mask to obtain an initial text matrix, and successively applying to the initial text matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain the processed text matrix.
The processing of the text training sample is similar to the processing of the voice training sample described in step 31. The processing of the initial text matrix is specifically as follows:
The initial text matrix undergoes non-sparsification processing to obtain a first text matrix, which is a non-sparsified matrix.
Self-attention processing is then applied to the first text matrix to obtain a second text matrix.
Next, using the residual-network technique, the second text matrix is added to the first text matrix to obtain a third text matrix.
Finally, the third text matrix is passed through one feed-forward neural network layer and a further residual-network step to obtain a fourth text matrix; this fourth text matrix is the processed text matrix mentioned above.
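The only text-side step not covered by the audio-side sketch is the right shift by two positions; a minimal version is below, assuming the vacated leading positions are filled with zeros (the fill value is an assumption, the patent only states the shift):

```python
def shift_right(text_vector, positions=2, fill=0):
    """Shift a text vector right by `positions` slots, keeping its length."""
    return [fill] * positions + list(text_vector)[:-positions]

print(shift_right([12, 66, 35, 973, 465, 0, 0, 0, 0, 0]))  # -> [0, 0, 12, 66, 35, 973, 465, 0, 0, 0]
```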
Step 33: jointly passing the processed audio matrix and the processed text matrix through self-attention processing, layer normalization and one feed-forward neural network layer, to obtain the voice prediction signal.
The so-called self-attention mechanism refers to a mechanism that selects, from a large amount of information, the information most relevant to the current task target.
The processing of the self-attention mechanism is specifically as follows: the self-attention matrix of the first N frames of the audio matrix, after passing through a residual network and layer normalization, is multiplied with the self-attention matrix of the text matrix after the latter has passed through a feed-forward neural network, a residual network and layer normalization, yielding an attention matrix; this attention matrix is then processed by a residual network, layer normalization, a feed-forward neural network, a residual network and layer normalization again, and after flattening yields frame N+1 of the audio matrix. In this autoregressive manner the fourth audio matrix and the fourth text matrix are processed by the attention mechanism.
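The patent does not spell out the attention computation itself; a standard scaled dot-product self-attention, sketched below with random projection matrices, is one common way such a step is realised and could serve as the self_attention argument of the earlier block sketch (the projections Wq, Wk, Wv are assumptions, not part of the patent):

```python
import numpy as np

def self_attention_factory(dim, rng=np.random.default_rng(1)):
    """Return a scaled dot-product self-attention function over the rows of a (frames x dim) matrix."""
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    def attend(M):
        Q, K_, V = M @ Wq, M @ Wk, M @ Wv
        scores = Q @ K_.T / np.sqrt(dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ V
    return attend
```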
The layer normalization and the feed-forward neural network layer operate in a frame-level autoregressive manner, predicting frame by frame: each predicted voice signal frame is fed back as input, together with the text, to predict the next frame of the voice signal.
The so-called layer normalization, applied to a matrix M, works as follows: assume the standard deviation of M is Ma and the mean of M is Mu; then the updated M after layer normalization is M = (M - Mu) / Ma.
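As a quick sketch of this layer normalization (a small epsilon is added to the denominator here for numerical safety; the formula above omits it):

```python
def layer_norm(M, eps=1e-6):
    """Layer normalization over the whole matrix: (M - mean) / standard deviation."""
    return (M - M.mean()) / (M.std() + eps)
```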
Step 34: computing the loss function from the voice prediction signal and the processed text matrix, performing back-propagation using the chain rule, and iterating continuously until the optimal speech synthesis model is obtained.
The loss function is defined as the mean of the absolute value of the difference between the predicted signal and the original signal. Here the optimization is specifically carried out with the TensorFlow 1.4.0 framework under Python 3.6 (TensorFlow is an open-source software library that uses dataflow graphs for numerical computation); only the forward computation, the loss function and the number of iteration steps need to be defined, and the chain-rule back-propagation that iterates towards the optimal model is completed automatically by TensorFlow. It should be noted that, in terms of optimization, "optimal" does not mean "best" (see the operations research literature).
Here, the training learning rate is updated dynamically using exponential decay, and gradient descent is performed with the AdamOptimizer optimizer.
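A minimal TensorFlow 1.x sketch of this training setup follows. The forward computation shown is only a trivial stand-in, not the patent's model; the placeholder shapes follow the 100 x 800 audio matrices and length-10 text vectors of the examples above, and the vocabulary size and decay hyper-parameters are illustrative assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x, as in the description (1.4.0 under Python 3.6)

text_input = tf.placeholder(tf.int32, [None, 10])             # text training samples (preset length 10)
audio_target = tf.placeholder(tf.float32, [None, 100, 800])   # original audio matrices (100 frames x 800)

# Stand-in forward computation (NOT the patent's model): one-hot characters -> dense -> frame matrix.
embedded = tf.layers.dense(tf.one_hot(text_input, depth=5000), 256, activation=tf.nn.relu)
summary = tf.reduce_mean(embedded, axis=1)
prediction = tf.reshape(tf.layers.dense(summary, 100 * 800), [-1, 100, 800])

# Loss: mean absolute difference between the predicted and the original signal.
loss = tf.reduce_mean(tf.abs(prediction - audio_target))

# Exponentially decayed learning rate with Adam-based gradient descent.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(1e-3, global_step, decay_steps=10000, decay_rate=0.96)
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)
```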
Secondly, the invention proposes an end-to-end speech synthesis device; the device 20 may be divided into one or more modules.
For example, Fig. 5 shows a structural diagram of an embodiment of the end-to-end speech synthesis device 20. In this embodiment, the device 20 may be divided into a speech processing module 201, a text processing module 202 and a training module 203. The specific functions of the modules 201-203 are introduced below.
The speech processing module 201 is used for obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices.
The text processing module 202 is used for performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length.
The training module 203 is used for performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
Thirdly, the present invention also proposes a computer device.
Fig. 6 is a schematic diagram of the hardware architecture of an embodiment of the computer device of the present invention. In this embodiment, the computer device 2 is a device capable of automatically performing numerical computation and/or information processing according to instructions set or stored in advance. For example, it may be a smartphone, a tablet computer, a laptop, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of multiple servers), etc. As shown, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22 and a network interface 23, which can be communicatively connected to each other through a system bus. Wherein:
The memory 21 includes at least one type of computer-readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the computer device 2. Of course, the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device. In this embodiment, the memory 21 is generally used for storing the operating system and various application software installed on the computer device 2, such as the computer program for implementing the end-to-end speech synthesis method. In addition, the memory 21 may also be used for temporarily storing various data that have been output or are to be output.
The processor 22 may in some embodiments be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 22 is generally used for controlling the overall operation of the computer device 2, for example performing control and processing related to data interaction or communication with the computer device 2. In this embodiment, the processor 22 is used for running the program code stored in the memory 21 or processing data, for example running the computer program for implementing the end-to-end speech synthesis method.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used for establishing a communication connection between the computer device 2 and other computer devices. For example, the network interface 23 is used for connecting the computer device 2 with an external terminal through a network, and establishing a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be an intranet, the Internet, a Global System for Mobile communication (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
It should be pointed out that Fig. 6 only shows the computer device 2 with components 21-23; it should be understood that not all of the shown components need be implemented, and more or fewer components may be implemented instead.
In this embodiment, the computer program stored in the memory 21 for implementing the end-to-end speech synthesis method may be executed by one or more processors (in this embodiment, the processor 22) to complete the following steps:
Step 01: obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
Step 02: performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
Step 03: performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
In addition, the computer-readable storage medium of the present invention is a non-volatile readable storage medium storing a computer program; the computer program can be executed by at least one processor to implement the operations of the above end-to-end speech synthesis method or device.
The computer-readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the computer-readable storage medium may be the internal storage unit of a computer device, such as a hard disk or memory of the computer device. In other embodiments, it may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the computer device. Of course, the computer-readable storage medium may also include both the internal storage unit of the computer device and its external storage device. In this embodiment, the computer-readable storage medium is generally used for storing the operating system and various application software installed on the computer device, for example the computer program for implementing the end-to-end speech synthesis method. In addition, the computer-readable storage medium may also be used for temporarily storing various data that have been output or are to be output.
Although specific embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are only examples, and the protection scope of the present invention is defined by the appended claims. Those skilled in the art may make many changes and modifications to these embodiments without departing from the principle and substance of the present invention, but all such changes and modifications fall within the protection scope of the present invention.
Claims (10)
1. An end-to-end speech synthesis method, characterized by comprising the following steps:
obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
2. The end-to-end speech synthesis method according to claim 1, characterized in that the pre-processing includes cutting the audio in the speech samples and padding the audio in the speech samples, so that the length of the audio equals the preset audio length.
3. The end-to-end speech synthesis method according to claim 2, characterized in that the pre-processing specifically includes the following steps:
calculating the difference between the length of the audio in the speech sample and the preset audio length;
judging the data type of the difference, the data type including negative number, positive number and zero;
if the difference is negative, appending silence at the end of the speech sample, the length of the silence being equal to the absolute value of the difference;
if the difference is positive, cutting off the part of the audio in the speech sample that exceeds the preset audio length, the length of the audio removed being equal to the absolute value of the difference;
if the difference is zero, leaving the speech sample unprocessed.
4. The end-to-end speech synthesis method according to claim 1, characterized in that the regularization processing includes converting non-Chinese character strings into Chinese character strings, and determining the pronunciation of the non-Chinese character strings according to the converted Chinese character strings.
5. The end-to-end speech synthesis method according to claim 4, characterized in that the vectorization processing includes:
converting, according to a preset text dictionary, each character in the text sample into the serial number corresponding to that character in the text dictionary;
arranging the serial numbers of the characters into a vector in the order of the characters in the text sample, so as to obtain the text vector corresponding to the text sample;
cutting or padding any text vector whose length differs from the preset text length, so that the length of the text vector equals the preset text length.
6. The end-to-end speech synthesis method according to claim 5, characterized in that the cutting is used when the difference between the length of the text vector and the preset text length is positive, the cutting step including removing the part of the text vector that exceeds the preset text length, the length of the removed part being equal to the absolute value of the difference;
and the padding is used when the difference between the length of the text vector and the preset text length is negative, the padding step including appending zeros at the end of the text vector, the number of appended zeros being equal to the absolute value of the difference.
7. The end-to-end speech synthesis method according to claim 1, characterized in that the speech synthesis training comprises the following steps:
adding a positional mask to the audio matrix contained in the voice training sample to obtain an initial audio matrix, and successively applying to the initial audio matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain a processed audio matrix;
shifting the text vector contained in the text training sample right by two positions and adding a positional mask to obtain an initial text matrix, and successively applying to the initial text matrix non-sparsification processing, self-attention processing, a first residual-network step, one feed-forward neural network layer and a second residual-network step, to obtain a processed text matrix;
jointly passing the processed audio matrix and the processed text matrix through self-attention processing, layer normalization and one feed-forward neural network layer, to obtain a voice prediction signal;
computing a loss function from the voice prediction signal and the processed text matrix, performing back-propagation using the chain rule, and iterating continuously until an optimal speech synthesis model is obtained.
8. An end-to-end speech synthesis device, characterized by comprising:
a speech processing module, for obtaining speech samples and pre-processing them to obtain voice training samples containing audio of a preset audio length, and converting the audio in the voice training samples into audio matrices;
a text processing module, for performing regularization and vectorization processing on the text samples corresponding to the voice training samples, to obtain text training samples containing text vectors of a preset text length;
a training module, for performing speech synthesis training with the voice training samples and the corresponding text training samples as the input of a self-attention end-to-end model based on a feed-forward neural network, so as to obtain an optimal speech synthesis model.
9. A computer device, comprising a memory and a processor, characterized in that the memory stores a computer program which, when executed by the processor, implements the steps of the end-to-end speech synthesis method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which can be executed by at least one processor to implement the steps of the end-to-end speech synthesis method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811482781.7A CN109616093B (en) | 2018-12-05 | 2018-12-05 | End-to-end speech synthesis method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811482781.7A CN109616093B (en) | 2018-12-05 | 2018-12-05 | End-to-end speech synthesis method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109616093A true CN109616093A (en) | 2019-04-12 |
CN109616093B CN109616093B (en) | 2024-02-27 |
Family
ID=66006036
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811482781.7A Active CN109616093B (en) | 2018-12-05 | 2018-12-05 | End-to-end speech synthesis method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109616093B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110070852A (en) * | 2019-04-26 | 2019-07-30 | 平安科技(深圳)有限公司 | Synthesize method, apparatus, equipment and the storage medium of Chinese speech |
CN110264991A (en) * | 2019-05-20 | 2019-09-20 | 平安科技(深圳)有限公司 | Training method, phoneme synthesizing method, device, equipment and the storage medium of speech synthesis model |
CN110299131A (en) * | 2019-08-01 | 2019-10-01 | 苏州奇梦者网络科技有限公司 | A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion |
CN110310626A (en) * | 2019-05-23 | 2019-10-08 | 平安科技(深圳)有限公司 | Voice training data creation method, device, equipment and readable storage medium storing program for executing |
CN110827791A (en) * | 2019-09-09 | 2020-02-21 | 西北大学 | Edge-device-oriented speech recognition-synthesis combined modeling method |
CN111753133A (en) * | 2020-06-11 | 2020-10-09 | 北京小米松果电子有限公司 | Video classification method, device and storage medium |
CN112802443A (en) * | 2019-11-14 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Speech synthesis method and apparatus, electronic device, and computer-readable storage medium |
CN113362218A (en) * | 2021-05-21 | 2021-09-07 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN116030789A (en) * | 2022-12-28 | 2023-04-28 | 南京硅基智能科技有限公司 | Method and device for generating speech synthesis training data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654940A (en) * | 2016-01-26 | 2016-06-08 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device |
CN107464554A (en) * | 2017-09-28 | 2017-12-12 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
US20180114522A1 (en) * | 2016-10-24 | 2018-04-26 | Semantic Machines, Inc. | Sequence to sequence transformations for speech synthesis via recurrent neural networks |
CN108630190A (en) * | 2018-05-18 | 2018-10-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating phonetic synthesis model |
2018-12-05: CN application CN201811482781.7A filed; patent CN109616093B (en), status: active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654940A (en) * | 2016-01-26 | 2016-06-08 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device |
US20180114522A1 (en) * | 2016-10-24 | 2018-04-26 | Semantic Machines, Inc. | Sequence to sequence transformations for speech synthesis via recurrent neural networks |
CN107464554A (en) * | 2017-09-28 | 2017-12-12 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
CN108630190A (en) * | 2018-05-18 | 2018-10-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating phonetic synthesis model |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110070852B (en) * | 2019-04-26 | 2023-06-16 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for synthesizing Chinese voice |
CN110070852A (en) * | 2019-04-26 | 2019-07-30 | 平安科技(深圳)有限公司 | Synthesize method, apparatus, equipment and the storage medium of Chinese speech |
CN110264991A (en) * | 2019-05-20 | 2019-09-20 | 平安科技(深圳)有限公司 | Training method, phoneme synthesizing method, device, equipment and the storage medium of speech synthesis model |
CN110264991B (en) * | 2019-05-20 | 2023-12-22 | 平安科技(深圳)有限公司 | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium |
CN110310626A (en) * | 2019-05-23 | 2019-10-08 | 平安科技(深圳)有限公司 | Voice training data creation method, device, equipment and readable storage medium storing program for executing |
CN110299131B (en) * | 2019-08-01 | 2021-12-10 | 苏州奇梦者网络科技有限公司 | Voice synthesis method and device capable of controlling prosodic emotion and storage medium |
CN110299131A (en) * | 2019-08-01 | 2019-10-01 | 苏州奇梦者网络科技有限公司 | A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion |
CN110827791B (en) * | 2019-09-09 | 2022-07-01 | 西北大学 | Edge-device-oriented speech recognition-synthesis combined modeling method |
CN110827791A (en) * | 2019-09-09 | 2020-02-21 | 西北大学 | Edge-device-oriented speech recognition-synthesis combined modeling method |
CN112802443A (en) * | 2019-11-14 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Speech synthesis method and apparatus, electronic device, and computer-readable storage medium |
CN112802443B (en) * | 2019-11-14 | 2024-04-02 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device, electronic equipment and computer readable storage medium |
CN111753133A (en) * | 2020-06-11 | 2020-10-09 | 北京小米松果电子有限公司 | Video classification method, device and storage medium |
CN113362218A (en) * | 2021-05-21 | 2021-09-07 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN116030789A (en) * | 2022-12-28 | 2023-04-28 | 南京硅基智能科技有限公司 | Method and device for generating speech synthesis training data |
CN116030789B (en) * | 2022-12-28 | 2024-01-26 | 南京硅基智能科技有限公司 | Method and device for generating speech synthesis training data |
Also Published As
Publication number | Publication date |
---|---|
CN109616093B (en) | 2024-02-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109616093A (en) | End-to-end phoneme synthesizing method, device, equipment and storage medium | |
WO2021047286A1 (en) | Text processing model training method, and text processing method and apparatus | |
CN107273503B (en) | Method and device for generating parallel text in same language | |
US10810993B2 (en) | Sample-efficient adaptive text-to-speech | |
CN112259089B (en) | Speech recognition method and device | |
CN111401037B (en) | Natural language generation method and device, electronic equipment and storage medium | |
CN112466314A (en) | Emotion voice data conversion method and device, computer equipment and storage medium | |
CN113434683B (en) | Text classification method, device, medium and electronic equipment | |
CN111382270A (en) | Intention recognition method, device and equipment based on text classifier and storage medium | |
CN109558605A (en) | Method and apparatus for translating sentence | |
WO2022257454A1 (en) | Speech synthesis method, apparatus and terminal, and storage medium | |
JP2023025126A (en) | Training method and apparatus for deep learning model, text data processing method and apparatus, electronic device, storage medium, and computer program | |
CN111243574A (en) | Voice model adaptive training method, system, device and storage medium | |
CN110598210A (en) | Entity recognition model training method, entity recognition device, entity recognition equipment and medium | |
CN112699213A (en) | Speech intention recognition method and device, computer equipment and storage medium | |
CN112634919A (en) | Voice conversion method and device, computer equipment and storage medium | |
CN115129831A (en) | Data processing method and device, electronic equipment and computer storage medium | |
CN115687934A (en) | Intention recognition method and device, computer equipment and storage medium | |
WO2018014537A1 (en) | Voice recognition method and apparatus | |
CN104679733A (en) | Voice conversation translation method, device and system | |
CN114495977A (en) | Speech translation and model training method, device, electronic equipment and storage medium | |
CN117971487A (en) | High-performance operator generation method, device, equipment and storage medium | |
CN111508481B (en) | Training method and device of voice awakening model, electronic equipment and storage medium | |
WO2020153159A1 (en) | Series labeling device, series labeling method, and program | |
JP6633556B2 (en) | Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |