CN108806665A - Speech synthesis method and device - Google Patents
Speech synthesis method and device
- Publication number
- CN108806665A
- Authority
- CN
- China
- Prior art keywords
- information
- sample
- voice
- text
- spectrum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
Embodiments of the present application disclose a speech synthesis method and device. One specific implementation of the method includes: obtaining prediction information corresponding to a target text to be converted into speech, the prediction information including the predicted duration information, the predicted spectrum information, and the predicted fundamental frequency information corresponding to the target text; and inputting the obtained prediction information into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text. This embodiment improves the accuracy of the synthesized speech.
Description
Technical field
Embodiments of the present application relate to the field of computer technology, and in particular to a speech synthesis method and device.
Background technology
With the development of artificial intelligence, speech processing technology is widely used. Speech processing typically comprises speech recognition and speech synthesis. Speech recognition enables an intelligent machine to convert a speech signal into corresponding text or commands through recognition and understanding. Speech synthesis converts text, generated by the computer itself or supplied externally, into fluent spoken output that a user can listen to and understand.

In related speech synthesis technology, features are usually extracted from information describing vocal cord vibration and information characterizing mouth shape, and the text is converted into a speech signal by simulating human vocal production.
Summary of the invention
Embodiments of the present application propose a speech synthesis method and device.
In a first aspect, an embodiment of the present application provides a speech synthesis method, including: obtaining prediction information corresponding to a target text to be converted into speech, the prediction information including the predicted duration information, the predicted spectrum information, and the predicted fundamental frequency information corresponding to the target text; and inputting the obtained prediction information into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text.
In some embodiments, the speech synthesis model is trained as follows: a training sample set is obtained, in which each training sample includes a sample text and sample speech information corresponding to the sample text; the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, where the fundamental frequency information is obtained by fundamental frequency extraction from the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text with a pre-trained spectrum prediction model; the acoustic feature information in the sample speech information corresponding to each sample text is used as the input of the speech synthesis model, the sample speech waveform is used as the desired output, and the speech synthesis model is trained by the method of machine learning.
In some embodiments, the spectrum information corresponding to the target text is obtained by performing spectrum prediction on the speech corresponding to the target text with a pre-trained spectrum prediction model; and the spectrum prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial spectrum prediction model, the spectrum information extracted from the sample speech corresponding to each sample text is used as the desired output, and the spectrum prediction model is trained by the method of machine learning.
In some embodiments, the duration information corresponding to the target text is obtained by performing duration prediction on the speech corresponding to the target text with a pre-trained duration prediction model; and the duration prediction model is trained as follows: the sample texts and duration information in the training sample set are used as the input and desired output of the duration prediction model, respectively, and the duration prediction model is trained by the method of machine learning.
In some embodiments, the fundamental frequency information corresponding to the target text is obtained by performing pitch prediction on the speech corresponding to the target text with a pitch prediction model, and the pitch prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial pitch prediction model, the fundamental frequency information extracted from the sample speech corresponding to each sample text is used as the desired output, and the pitch prediction model is trained by the method of machine learning.
In some embodiments, the speech synthesis model is a WaveRNN.
In a second aspect, an embodiment of the present application provides a speech synthesis device, including: an acquiring unit configured to obtain prediction information corresponding to a target text to be converted into speech, the prediction information including the predicted duration information, the predicted spectrum information, and the predicted fundamental frequency information corresponding to the target text; and a synthesis unit configured to input the obtained prediction information into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text.
In some embodiments, the speech synthesis model is trained as follows: a training sample set is obtained, in which each training sample includes a sample text and sample speech information corresponding to the sample text; the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, where the fundamental frequency information is obtained by fundamental frequency extraction from the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text with a pre-trained spectrum prediction model; the acoustic feature information in the sample speech information corresponding to each sample text is used as the input of the speech synthesis model, the sample speech waveform is used as the desired output, and the speech synthesis model is trained by the method of machine learning.
In some embodiments, the spectrum information corresponding to the target text is obtained by performing spectrum prediction on the speech corresponding to the target text with a pre-trained spectrum prediction model; and the spectrum prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial spectrum prediction model, the spectrum information extracted from the sample speech corresponding to each sample text is used as the desired output, and the spectrum prediction model is trained by the method of machine learning.
In some embodiments, the duration information corresponding to the target text is obtained by performing duration prediction on the speech corresponding to the target text with a pre-trained duration prediction model; and the duration prediction model is trained as follows: the sample texts and duration information in the training sample set are used as the input and desired output of the duration prediction model, respectively, and the duration prediction model is trained by the method of machine learning.
In some embodiments, the fundamental frequency information corresponding to the target text is obtained by performing pitch prediction on the speech corresponding to the target text with a pitch prediction model, and the pitch prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial pitch prediction model, the fundamental frequency information extracted from the sample speech corresponding to each sample text is used as the desired output, and the pitch prediction model is trained by the method of machine learning.

In some embodiments, the speech synthesis model is a WaveRNN.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method of any of the above embodiments.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method of any of the above embodiments is implemented.
The speech synthesis method and device provided by the embodiments of the present application obtain the duration information, spectrum information, and fundamental frequency information corresponding to a target text to be converted into speech, and then input the obtained prediction information into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text, thereby improving the accuracy of the synthesized speech.
Description of the drawings
Other features, objects, and advantages of the present application will become more apparent by reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the speech synthesis method according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the speech synthesis method according to the present application;
Fig. 4 is a flowchart of another embodiment of the speech synthesis method according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of the speech synthesis device according to the present application;
Fig. 6 is a structural schematic diagram of a computer system suitable for implementing a server of embodiments of the present application.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the related invention, not to limit it. It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings.

It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 in which embodiments of the speech synthesis method or speech synthesis device of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
Speech devices (such as microphones and loudspeakers), text input applications, speech synthesis applications, and the like may be installed on the terminal devices 101, 102, 103. Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, so as to receive or send messages.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and supporting data input, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a speech synthesis server that analyzes target text sent by the terminal devices 101, 102, 103 and generates synthesized speech matching the target text. The speech synthesis server may analyze the obtained target text to determine the predicted spectrum information, predicted fundamental frequency information, and predicted duration information corresponding to the target text, and then perform synthesis processing on the determined information, so as to generate synthesized speech corresponding to the target text.
It should be noted that the speech synthesis method provided by the embodiments of the present application is generally executed by the server 105; accordingly, the speech synthesis device is generally arranged in the server 105.
It should be pointed out that the text of the speech to be synthesized may also be stored locally on the server 105, and the server 105 may directly extract the local text of the speech to be synthesized; in this case, the exemplary system architecture 100 may not include the terminal devices 101, 102, 103 and the network 104.
It should be noted that the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. Depending on implementation needs, there may be any number of terminal devices, networks, and servers.
Continuing to refer to Fig. 2, a flow 200 of one embodiment of the speech synthesis method according to the present application is shown. The speech synthesis method includes the following steps:
Step 201: obtain the prediction information corresponding to the target text to be converted into speech.
In the present embodiment, the executing body of the speech synthesis method (such as the server shown in Fig. 1) may receive, by a wired or wireless connection, the target text for speech synthesis sent by a terminal device. The executing body may then obtain the prediction information corresponding to the target text to be converted into speech. Here, the prediction information may include: the predicted duration information, the predicted spectrum information, and the predicted fundamental frequency information corresponding to the target text. The predicted duration information corresponding to the target text is the duration information corresponding to each phoneme in the phoneme sequence obtained after converting the target text into a phoneme sequence. A phoneme is the smallest unit of speech, usually analyzed according to the articulatory actions within a syllable; one articulatory action constitutes one phoneme. Phonemes are generally divided into two broad classes, vowels and consonants. Since the pronunciation of each word or letter in a text generally includes multiple articulatory actions, each word or letter includes multiple phonemes. Thus, a target text usually comprises multiple consecutive phonemes ordered by pronunciation time, and these phonemes constitute a phoneme sequence.
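As a minimal illustration of the conversion described above, the following sketch maps a text to its phoneme sequence using a tiny hand-written lexicon. The lexicon entries and phoneme symbols are hypothetical stand-ins for a real grapheme-to-phoneme dictionary or model.

```python
# Hypothetical pronunciation lexicon: word -> list of phoneme symbols
# (ARPAbet-like labels, invented here for illustration only).
LEXICON = {
    "my": ["M", "AY"],
    "motherland": ["M", "AH", "DH", "ER", "L", "AE", "N", "D"],
}

def text_to_phonemes(text):
    """Concatenate per-word phonemes in pronunciation order."""
    phonemes = []
    for word in text.lower().split():
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes

if __name__ == "__main__":
    # The resulting phoneme sequence is what the duration model consumes.
    print(text_to_phonemes("my motherland"))
```

A real system would also attach stress, tone, and position features to each phoneme before feeding the sequence to the prediction models.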
In the present embodiment, the spectrum of speech is commonly used to express the relationship between sound frequency and loudness. In general, different words are produced at different articulation positions, so the frequency of vocal cord vibration and the loudness of pronunciation are embodied in the time-domain speech waveform. The spectrum is the embodiment of the speech information in the frequency domain, obtained by applying a Fourier transform to the continuous speech waveform, which contains multiple frequencies and amplitudes. Thus, by performing spectrum prediction on the audio corresponding to the text, the pronunciation of each word in the text can be simulated.
In the present embodiment, speech does not consist of a single frequency; it is formed by the superposition of simple harmonic oscillations of many frequencies. Through superposition, the multiple oscillation frequencies form multiple peaks of different amplitudes. Among these peaks, the first peak is the fundamental tone, and the frequency of the first peak is the fundamental frequency. The fundamental frequency of speech thus determines the pitch of the speech.
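The superposition described above can be illustrated numerically: the sketch below builds a signal from simple harmonic oscillations at 100, 200, and 300 Hz and locates the first spectral peak, whose frequency is the fundamental frequency. The naive discrete Fourier transform and the half-of-peak threshold are illustrative choices, not part of the patent.

```python
import math

SR = 1000   # sample rate (Hz)
N = 500     # half a second of signal; DFT bin k corresponds to k * SR / N Hz

# Superposition of simple harmonic oscillations at 100, 200, and 300 Hz
# with decreasing amplitudes, standing in for a voiced speech frame.
signal = [
    1.00 * math.sin(2 * math.pi * 100 * n / SR)
    + 0.50 * math.sin(2 * math.pi * 200 * n / SR)
    + 0.25 * math.sin(2 * math.pi * 300 * n / SR)
    for n in range(N)
]

def dft_magnitude(x, k):
    """Magnitude of the k-th bin of a naive discrete Fourier transform."""
    n = len(x)
    re = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
    im = sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
    return math.hypot(re, im)

def fundamental_frequency(x, sr):
    """Frequency of the first (lowest-frequency) dominant spectral peak."""
    n = len(x)
    mags = [dft_magnitude(x, k) for k in range(1, n // 2)]
    peak = max(mags)
    for k, m in enumerate(mags, start=1):
        if m > 0.5 * peak:   # first bin carrying a dominant share of energy
            return k * sr / n
    return 0.0

print(fundamental_frequency(signal, SR))  # 100.0 -- the first peak
```

A production system would use an FFT library instead of this O(N²) transform; the point here is only that the lowest strong peak, not the largest one, defines the fundamental.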
In some optional implementations of the present embodiment, each phoneme in the phoneme sequence corresponding to the target text has a pronunciation duration, and since different pronunciation durations of the same phoneme exhibit different tones, the meanings expressed are usually also different. Therefore, the executing body may use a pre-trained duration prediction model to perform duration prediction on the target text to be converted into speech, so as to obtain the predicted duration information corresponding to each phoneme in the phoneme sequence corresponding to the target text. Here, the duration prediction model can be used to characterize the correspondence between target texts and duration information. The duration prediction model may be a convolutional neural network. A pre-trained convolutional neural network can perform various feature extractions on the phoneme sequence, and can determine the predicted duration information corresponding to each phoneme according to features learned in advance, such as the relationship between each phoneme and its adjacent phonemes and the position at which each phoneme occurs in the phoneme sequence.
The duration prediction model can be trained as follows: the sample texts and duration information in the training sample set are used as the input and desired output of the duration prediction model, respectively, and the duration prediction model is trained by the method of machine learning. Specifically:

First, a training sample set is obtained. Here, each training sample in the training sample set may include a sample text and sample audio information corresponding to the sample text; the sample audio information may include the duration information of each phoneme in the phoneme sequence corresponding to the sample text.
Then, the executing body may select training samples from the training sample set and execute the following training steps. First, the sample text in each selected training sample is used as the input of an initial convolutional neural network, and the duration information of each phoneme in the phoneme sequence corresponding to the sample text is used as the desired output; the initial convolutional neural network is trained to obtain predicted values of the duration information of each phoneme in the phoneme sequence. Then, based on a preset loss function, it is determined whether the loss value of the preset loss function reaches a preset target value. In response to determining that the loss value of the preset loss function reaches the preset target value, it can be determined that training of the initial neural network is complete, and the trained initial neural network is determined to be the duration prediction model. Here, the preset loss function can be used to characterize the difference between the predicted duration information of each phoneme in the phoneme sequence and the labeled duration information.
In response to determining that the loss value of the preset loss function has not reached the preset target value, the executing body adjusts the parameters of the initial convolutional neural network (for example, the number of convolutional layers or the size of the convolution kernels), selects samples again from the training sample set, and continues to execute the above training steps with the adjusted initial convolutional neural network as the initial convolutional neural network.
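The train-check-adjust cycle described above can be sketched as a loop that keeps updating parameters until the loss reaches the preset target value. For the sake of a self-contained example, the model below is a toy linear duration predictor trained by gradient descent rather than the convolutional neural network of the embodiment, and the (feature, duration) pairs are invented.

```python
# Hypothetical training set: (phoneme feature, labeled duration in ms).
samples = [(0.2, 54.0), (0.5, 90.0), (0.8, 126.0), (1.0, 150.0)]

w, b = 0.0, 0.0      # initial model parameters
LR = 0.1             # learning rate
TARGET_LOSS = 1.0    # preset target value for the loss function

def mse(w, b):
    """Preset loss: mean squared difference between predicted and labeled durations."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)

steps = 0
while mse(w, b) > TARGET_LOSS and steps < 10_000:
    # Loss target not reached: adjust the parameters (gradient of the MSE)
    # and run the training step again.
    gw = sum(2 * (w * x + b - y) * x for x, y in samples) / len(samples)
    gb = sum(2 * (w * x + b - y) for x, y in samples) / len(samples)
    w, b = w - LR * gw, b - LR * gb
    steps += 1

print(f"training complete after {steps} steps, loss = {mse(w, b):.4f}")
```

The patent's adjustment step (changing layer counts or kernel sizes) is a hyperparameter change rather than a gradient update; the loop structure, with a loss threshold deciding when training is complete, is the same.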
In some optional implementations of the present embodiment, the executing body may use a pre-trained spectrum prediction model to perform spectrum prediction on the speech corresponding to the target text, so as to obtain the predicted spectrum information corresponding to the target text to be converted into speech. Here, the predicted spectrum information may include the formants and the spectral envelope of the spectrum. The spectrum prediction model can be used to characterize the correspondence between target texts and predicted spectrum information. The spectrum prediction model may be a convolutional neural network. A pre-trained convolutional neural network can perform spectrum prediction on the target text according to features learned in advance, such as the pronunciation annotation information corresponding to each character or word in the text and the relationship between adjacent words, so as to obtain the predicted spectrum information. Here, a word may refer to a character, word, vowel, or consonant in Chinese, or to a word or the letters composing a word in other languages (such as English or French). No limitation is made here.
The spectrum prediction model can be trained as follows: the sample texts in the training sample set are used as the input of an initial spectrum prediction model, the spectrum information extracted from the sample speech corresponding to each sample text is used as the desired output, and the spectrum prediction model is trained by the method of machine learning. Specifically:

First, a training sample set is obtained. Here, each training sample in the training sample set may include a sample text and the spectrum information of the sample audio corresponding to the sample text.
Then, the executing body may select training samples from the training sample set and execute the following training steps. First, the sample text in each selected training sample is used as the input of an initial convolutional neural network, and the spectrum information of the sample audio corresponding to the sample text is used as the desired output; the initial convolutional neural network is trained to obtain the spectrum information corresponding to the sample text. Then, based on a preset loss function, it is determined whether the loss value of the preset loss function reaches a preset target value. In response to determining that the loss value of the preset loss function reaches the preset target value, it can be determined that training of the initial neural network is complete, and the trained initial neural network is determined to be the spectrum prediction model. Here, the loss function can be used to characterize the difference between the spectrum information actually output by the initial convolutional neural network and the spectrum information of the desired output.
In response to determining that the loss value of the preset loss function has not reached the preset target value, the executing body adjusts the parameters of the initial convolutional neural network (for example, the number of convolutional layers or the size of the convolution kernels), selects samples again from the training sample set, and continues to execute the above training steps with the adjusted initial convolutional neural network as the initial convolutional neural network.
In some optional implementations of the present embodiment, the executing body may use a pre-trained pitch prediction model to perform pitch prediction on the target text, so as to obtain the predicted fundamental frequency information of the target text. Here, the pitch prediction model can be used to characterize the correspondence between target texts and predicted fundamental frequency information. The pitch prediction model may be a convolutional neural network.
In some optional implementations of the present embodiment, the pitch prediction model can be trained as follows: the sample texts in the training sample set are used as the input of an initial pitch prediction model, the fundamental frequency information extracted from the sample speech corresponding to each sample text is used as the desired output, and the pitch prediction model is trained by the method of machine learning. Here, methods such as the autocorrelation algorithm, parallel processing method, cepstrum method, and simplified inverse filtering method can be used to extract the fundamental frequency from speech.
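Of the extraction methods listed above, the autocorrelation algorithm is the simplest to sketch: the fundamental period is the lag at which a voiced frame best correlates with itself. The frame below is a synthetic 200 Hz tone standing in for real speech; production systems add windowing, voicing decisions, and peak interpolation on top of this.

```python
import math

SR = 8000  # sample rate (Hz)

def autocorr_f0(frame, sr, f0_min=50, f0_max=500):
    """Estimate F0 as the lag of the strongest autocorrelation peak."""
    lag_min = sr // f0_max   # shortest candidate period, in samples
    lag_max = sr // f0_min   # longest candidate period, in samples
    n = len(frame)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, n - 1) + 1):
        # Correlation of the frame with itself shifted by `lag` samples.
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sr / best_lag     # period in samples -> frequency in Hz

# 50 ms synthetic voiced frame at 200 Hz (period = 40 samples at 8 kHz).
frame = [math.sin(2 * math.pi * 200 * n / SR) for n in range(400)]
print(autocorr_f0(frame, SR))  # 200.0
```

The `f0_min`/`f0_max` bounds restrict the search to the plausible pitch range of human speech, which also keeps the algorithm from locking onto a harmonic.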
Step 202: input the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model to obtain synthesized speech corresponding to the target text.
In the present embodiment, the executing body may use the pre-trained speech synthesis model to perform speech synthesis on the obtained prediction information corresponding to the target text, so as to obtain synthesized speech corresponding to the target text. Here, the speech synthesis model can be used to characterize the correspondence between prediction information and synthesized speech.
In the present embodiment, the speech synthesis model may be, for example, a convolutional neural network. This network can perform feature extraction through convolution kernels, place each phoneme in the phoneme sequence in one-to-one correspondence with the predicted spectrum information and predicted fundamental frequency information, and determine, based on the pronunciation duration information of each phoneme, the segment of predicted spectrum information and the segment of predicted fundamental frequency information corresponding to that phoneme. Finally, the synthesized speech is generated based on the segments of predicted spectrum information and predicted fundamental frequency information corresponding to each phoneme in the phoneme sequence.
In some optional implementations of this embodiment, the speech synthesis model may also be a WaveRNN.
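The duration-based correspondence between phonemes and segments of frame-level predictions described above can be sketched as follows; the helper name, durations, and feature values are invented purely for illustration and do not come from the application:

```python
import numpy as np

def align_features(durations_frames, frame_features):
    """Split a frame-level feature sequence into per-phoneme segments,
    where durations_frames[i] is the number of frames of phoneme i."""
    assert sum(durations_frames) == len(frame_features)
    segments, start = [], 0
    for d in durations_frames:
        segments.append(frame_features[start:start + d])
        start += d
    return segments

# Three phonemes lasting 2, 3, and 1 frames; 6 frames of 2-dim features
# standing in for (predicted spectrum, predicted F0) pairs.
durations = [2, 3, 1]
features = np.arange(6 * 2).reshape(6, 2)
segments = align_features(durations, features)
```

Each phoneme thus owns exactly the span of predicted spectrum and fundamental frequency frames that its predicted duration dictates.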
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the speech synthesis method according to this embodiment. In the application scenario of Fig. 3, a user sends, through a computer 301, a request to a server 302 to convert the text "my motherland" into speech. Upon receiving the request, the server 302 may obtain the prediction information corresponding to the text "my motherland", the prediction information including predicted duration information, predicted spectrum information, and predicted fundamental frequency information. Then, the server 302 inputs the obtained predicted spectrum information, predicted fundamental frequency information, and predicted duration information into a pre-trained speech synthesis model 303, thereby obtaining synthesized speech corresponding to the text "my motherland", which is output through a loudspeaker 304.
The speech synthesis method and apparatus provided by the embodiments of the present application obtain the duration information, spectrum information, and fundamental frequency information corresponding to a target text to be converted into speech, and then input the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text, thereby improving the accuracy of the synthesized speech.
With further reference to Fig. 4, it illustrates a flow 400 of an embodiment of a training method for the speech synthesis model according to the present application. The flow 400 includes the following steps:
Step 401: obtain a training sample set, wherein each training sample includes a sample text and sample speech information corresponding to the sample text, the sample speech information includes acoustic feature information and a speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information.
In this embodiment, the execution body of the speech synthesis method (e.g., the server shown in Fig. 1) may obtain the training sample set, through a wired or wireless connection, from a storage server storing sample texts and sample speech information. Here, a training sample in the training sample set may include a sample text and sample speech information corresponding to the sample text. The sample speech information may include a speech waveform, which may be obtained by recording a natural person reading the text aloud. The sample speech information may also include acoustic feature information of the sample speech. Here, the acoustic feature information may include fundamental frequency information, spectrum information, and duration information.
In this embodiment, the fundamental frequency information is typically obtained by performing fundamental frequency extraction on the sample speech. The extraction method may include, for example, the autocorrelation algorithm, parallel processing, the cepstrum method, or simplified inverse filtering. The phoneme sequence corresponding to a sample text is typically obtained by segmenting the sample text using a hidden Markov model. The spectrum information is obtained by performing spectrum prediction on the sample text using a pre-trained spectrum prediction model.
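As a hedged sketch of how frame-level spectrum information might be computed from a recorded sample waveform (the application does not fix a particular transform; a short-time Fourier transform with illustrative frame and hop sizes is assumed here, and the function name is invented):

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=512, hop=128):
    """Frame the waveform, apply a Hann window, and take the magnitude
    of the real FFT of each frame (the non-redundant half-spectrum)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# A random waveform of 4096 samples gives 29 frames of 257 magnitude bins
# (frame_len // 2 + 1 bins per frame).
wave = np.random.default_rng(0).standard_normal(4096)
spec = magnitude_spectrogram(wave)
```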
Step 402: use the acoustic feature information in the sample speech information corresponding to the sample texts in the training sample set as the input of the speech synthesis model, use the sample speech waveforms as the desired output, and train the speech synthesis model using a machine learning method.
In this embodiment, the execution body may select training samples from the training sample set and perform the following training steps:
First, the acoustic feature information in the sample speech information of each selected training sample is used as the input of an initial convolutional neural network, the sample speech waveform corresponding to the sample text is used as the desired output, and the initial convolutional neural network is trained, obtaining a predicted speech waveform corresponding to the sample text. Then, based on a preset loss function, it is determined whether the loss value of the preset loss function reaches a preset target value. In response to determining that the loss value of the preset loss function reaches the preset target value, it may be determined that training of the initial convolutional neural network is complete, and the trained network is determined as the speech synthesis model. Here, the preset loss function may be used to characterize the difference between the predicted speech waveform and the sample speech waveform.
In response to determining that the loss value of the preset loss function does not reach the preset target value, the execution body adjusts the parameters of the initial convolutional neural network, selects samples from the training sample set again, and continues the above training steps with the adjusted network as the initial convolutional neural network. Here, adjusting the parameters of the initial convolutional neural network may include, for example, adjusting the number of convolutional layers and the size of the convolution kernels.
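The loop described above, training until a preset loss function reaches a preset target value and otherwise adjusting parameters and repeating, can be sketched with a toy stand-in model; a linear model on synthetic data replaces the convolutional network here, and all names and numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((64, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w                       # stand-in for the sample speech waveform

w = np.zeros(4)                      # initial parameters
target_loss = 1e-4                   # preset target value
for step in range(10000):
    pred = x @ w                     # stand-in for the predicted waveform
    loss = np.mean((pred - y) ** 2)  # preset loss: mean squared difference
    if loss <= target_loss:          # target reached: training is complete
        break
    # Otherwise adjust the parameters (one gradient-descent step) and repeat.
    w -= 0.05 * (2 / len(x)) * x.T @ (pred - y)
```

In the patent's setting, "adjusting parameters" may also mean changing the architecture itself (layer count, kernel size), which this gradient step does not capture.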
As can be seen from Fig. 4, unlike the embodiment shown in Fig. 2, this embodiment highlights the training steps of the speech synthesis model, so that the synthesized speech is more accurate.
With further reference to Fig. 5, as an implementation of the method shown in Fig. 4 above, the present application provides an embodiment of a speech synthesis apparatus. The apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied to various electronic devices.
As shown in Fig. 5, the speech synthesis apparatus 500 of this embodiment includes an acquiring unit 501 and a synthesis unit 502. The acquiring unit 501 is configured to obtain prediction information corresponding to a target text to be converted into speech, the prediction information including: predicted duration information corresponding to the target text, predicted spectrum information corresponding to the target text, and predicted fundamental frequency information corresponding to the target text. The synthesis unit 502 is configured to input the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text.
In this embodiment, for the specific processing of the acquiring unit 501 and the synthesis unit 502 in the speech synthesis apparatus 500, and the advantageous effects brought thereby, reference may be made to the related descriptions of the implementations of step 201 and step 202 in the embodiment corresponding to Fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the speech synthesis model is trained as follows: a training sample set is obtained, wherein each training sample includes a sample text and sample speech information corresponding to the sample text, the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, wherein the fundamental frequency information is obtained by performing fundamental frequency extraction on the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text using a pre-trained spectrum prediction model; then the acoustic feature information in the sample speech information corresponding to the sample texts in the training sample set is used as the input of the speech synthesis model, the sample speech waveforms are used as the desired output, and the speech synthesis model is obtained by training with a machine learning method.
In some optional implementations of this embodiment, the spectrum information corresponding to the target text is obtained by using a pre-trained spectrum prediction model to predict the spectrum of the speech corresponding to the target text; and the spectrum prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial spectrum prediction model, the spectrum information extracted from the sample speech corresponding to each sample text is used as the desired output, and the spectrum prediction model is obtained by training with a machine learning method.
In some optional implementations of this embodiment, the duration information corresponding to the target text is obtained by using a pre-trained duration prediction model to predict the duration of the speech corresponding to the target text; and the duration prediction model is trained as follows: the sample texts and duration information in the training sample set are used as the input and desired output of the duration prediction model, respectively, and the duration prediction model is obtained by training with a machine learning method.
In some optional implementations of this embodiment, the fundamental frequency information corresponding to the target text is obtained by using a pitch prediction model to predict the fundamental frequency of the speech corresponding to the target text, and the pitch prediction model is trained as follows: the sample texts in the training sample set are used as the input of an initial pitch prediction model, the fundamental frequency information extracted from the sample speech corresponding to each sample text is used as the desired output, and the pitch prediction model is obtained by training with a machine learning method.
Referring now to Fig. 6, it shows a schematic structural diagram of a computer system 600 of an electronic device (e.g., the server shown in Fig. 1) suitable for implementing the embodiments of the present application. The electronic device shown in Fig. 6 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required by the operations of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, etc.; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage portion 608 including a hard disk, etc.; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which comprises a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are performed. It should be noted that the computer-readable medium in the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, the computer-readable storage medium may be any tangible medium containing or storing a program, and the program may be used by, or used in combination with, an instruction execution system, apparatus, or device. In the present application, the computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, in which computer-readable program code is carried. The propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and the computer-readable medium can send, propagate, or transmit a program for use by, or use in combination with, an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
The computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, and also include conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on a user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In a scenario involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architectures, functions, and operations that may be implemented according to the systems, methods, and computer program products of the various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions denoted by the boxes may occur in a sequence different from that shown in the drawings. For example, two successive boxes may in practice be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system performing the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, for example, described as: a processor comprising an acquiring unit and a synthesis unit. The names of these units do not in some cases constitute a limitation to the units themselves. For example, the acquiring unit may also be described as "a unit for obtaining prediction information corresponding to a target text to be converted into speech".
In another aspect, the present application further provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments, or may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device: obtains prediction information corresponding to a target text to be converted into speech, the prediction information including predicted duration information corresponding to the target text, predicted spectrum information corresponding to the target text, and predicted fundamental frequency information corresponding to the target text; and inputs the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text.
The above description only provides an explanation of the preferred embodiments of the present application and the technical principles employed. It should be appreciated by those skilled in the art that the inventive scope of the present application is not limited to the technical solutions formed by the particular combinations of the above-described technical features; it should also cover other technical solutions formed by any combination of the above-described technical features or equivalent features thereof without departing from the inventive concept, for example, technical solutions formed by replacing the features disclosed herein with (but not limited to) technical features having similar functions.
Claims (14)
1. A speech synthesis method, comprising:
obtaining prediction information corresponding to a target text to be converted into speech, the prediction information including: predicted duration information corresponding to the target text, predicted spectrum information corresponding to the target text, and predicted fundamental frequency information corresponding to the target text; and
inputting the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text.
2. The method according to claim 1, wherein the speech synthesis model is trained as follows:
obtaining a training sample set, wherein each training sample includes a sample text and sample speech information corresponding to the sample text, the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, wherein the fundamental frequency information is obtained by performing fundamental frequency extraction on the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text using a pre-trained spectrum prediction model; and
using the acoustic feature information in the sample speech information corresponding to the sample texts in the training sample set as an input of the speech synthesis model, using the sample speech waveforms as a desired output, and training the speech synthesis model using a machine learning method.
3. The method according to claim 2, wherein the spectrum information corresponding to the target text is obtained by using the pre-trained spectrum prediction model to predict the spectrum of the speech corresponding to the target text; and
the spectrum prediction model is trained as follows:
using the sample texts in the training sample set as an input of an initial spectrum prediction model, using the spectrum information extracted from the sample speech corresponding to each sample text as a desired output, and training the spectrum prediction model using a machine learning method.
4. The method according to claim 2, wherein the duration information corresponding to the target text is obtained by using a pre-trained duration prediction model to predict the duration of the speech corresponding to the target text; and
the duration prediction model is trained as follows:
using the sample texts and the duration information in the training sample set as an input and a desired output of the duration prediction model, respectively, and training the duration prediction model using a machine learning method.
5. The method according to claim 2, wherein the fundamental frequency information corresponding to the target text is obtained by using a pitch prediction model to predict the fundamental frequency of the speech corresponding to the target text; and
the pitch prediction model is trained as follows:
using the sample texts in the training sample set as an input of an initial pitch prediction model, using the fundamental frequency information extracted from the sample speech corresponding to each sample text as a desired output, and training the pitch prediction model using a machine learning method.
6. The method according to any one of claims 1-5, wherein the type of the speech synthesis model is WaveRNN.
7. A speech synthesis apparatus, comprising:
an acquiring unit, configured to obtain prediction information corresponding to a target text to be converted into speech, the prediction information including: predicted duration information corresponding to the target text, predicted spectrum information corresponding to the target text, and predicted fundamental frequency information corresponding to the target text; and
a synthesis unit, configured to input the obtained prediction information corresponding to the target text into a pre-trained speech synthesis model, obtaining synthesized speech corresponding to the target text.
8. The apparatus according to claim 7, wherein the speech synthesis model is trained as follows:
obtaining a training sample set, wherein each training sample includes a sample text and sample speech information corresponding to the sample text, the sample speech information includes acoustic feature information and a sample speech waveform, and the acoustic feature information includes fundamental frequency information, spectrum information, and duration information, wherein the fundamental frequency information is obtained by performing fundamental frequency extraction on the sample speech, and the spectrum information is obtained by performing spectrum prediction on the sample text using a pre-trained spectrum prediction model; and
using the acoustic feature information in the sample speech information corresponding to the sample texts in the training sample set as an input of the speech synthesis model, using the sample speech waveforms as a desired output, and training the speech synthesis model using a machine learning method.
9. The apparatus according to claim 8, wherein the spectrum information corresponding to the target text is obtained by using the pre-trained spectrum prediction model to predict the spectrum of the speech corresponding to the target text; and
the spectrum prediction model is trained as follows:
using the sample texts in the training sample set as an input of an initial spectrum prediction model, using the spectrum information extracted from the sample speech corresponding to each sample text as a desired output, and training the spectrum prediction model using a machine learning method.
10. The apparatus according to claim 8, wherein the duration information corresponding to the target text is obtained by using a pre-trained duration prediction model to predict the duration of the speech corresponding to the target text; and
the duration prediction model is trained as follows:
using the sample texts and the duration information in the training sample set as an input and a desired output of the duration prediction model, respectively, and training the duration prediction model using a machine learning method.
11. The apparatus according to claim 8, wherein the fundamental frequency information corresponding to the target text is obtained by using a pitch prediction model to predict the fundamental frequency of the speech corresponding to the target text; and
the pitch prediction model is trained as follows:
using the sample texts in the training sample set as an input of an initial pitch prediction model, using the fundamental frequency information extracted from the sample speech corresponding to each sample text as a desired output, and training the pitch prediction model using a machine learning method.
12. The apparatus according to any one of claims 7-11, wherein the type of the speech synthesis model is WaveRNN.
13. An electronic device, comprising:
one or more processors; and
a storage device, storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
14. A computer-readable medium, storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811061208.9A CN108806665A (en) | 2018-09-12 | 2018-09-12 | Phoneme synthesizing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108806665A true CN108806665A (en) | 2018-11-13 |
Family
ID=64082342
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811061208.9A Pending CN108806665A (en) | 2018-09-12 | 2018-09-12 | Phoneme synthesizing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108806665A (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109599092A (en) * | 2018-12-21 | 2019-04-09 | 秒针信息技术有限公司 | A kind of audio synthetic method and device |
CN109801608A (en) * | 2018-12-18 | 2019-05-24 | 武汉西山艺创文化有限公司 | A kind of song generation method neural network based and system |
CN110473515A (en) * | 2019-08-29 | 2019-11-19 | 郝洁 | A kind of end-to-end speech synthetic method based on WaveRNN |
CN110751940A (en) * | 2019-09-16 | 2020-02-04 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and computer storage medium for generating voice packet |
CN111192566A (en) * | 2020-03-03 | 2020-05-22 | 云知声智能科技股份有限公司 | English speech synthesis method and device |
CN111402855A (en) * | 2020-03-06 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111429881A (en) * | 2020-03-19 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Sound reproduction method, device, readable medium and electronic equipment |
CN111583904A (en) * | 2020-05-13 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111599338A (en) * | 2020-04-09 | 2020-08-28 | 云知声智能科技股份有限公司 | Stable and controllable end-to-end speech synthesis method and device |
CN111599343A (en) * | 2020-05-14 | 2020-08-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN111613224A (en) * | 2020-04-10 | 2020-09-01 | 云知声智能科技股份有限公司 | Personalized voice synthesis method and device |
CN111785247A (en) * | 2020-07-13 | 2020-10-16 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN111899720A (en) * | 2020-07-30 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN112289298A (en) * | 2020-09-30 | 2021-01-29 | 北京大米科技有限公司 | Processing method and device for synthesized voice, storage medium and electronic equipment |
CN112420015A (en) * | 2020-11-18 | 2021-02-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method, device, equipment and computer readable storage medium |
CN112530400A (en) * | 2020-11-30 | 2021-03-19 | 清华珠三角研究院 | Method, system, device and medium for generating voice based on text of deep learning |
CN112542153A (en) * | 2020-12-02 | 2021-03-23 | 北京沃东天骏信息技术有限公司 | Duration prediction model training method and device, and speech synthesis method and device |
CN112767957A (en) * | 2020-12-31 | 2021-05-07 | 科大讯飞股份有限公司 | Method for obtaining prediction model, method for predicting voice waveform and related device |
CN112786013A (en) * | 2021-01-11 | 2021-05-11 | 北京有竹居网络技术有限公司 | Voice synthesis method and device based on album, readable medium and electronic equipment |
CN112951204A (en) * | 2021-03-29 | 2021-06-11 | 北京大米科技有限公司 | Speech synthesis method and device |
CN113012680A (en) * | 2021-03-03 | 2021-06-22 | 北京太极华保科技股份有限公司 | Speech technology synthesis method and device for speech robot |
CN113299272A (en) * | 2020-02-06 | 2021-08-24 | 菜鸟智能物流控股有限公司 | Speech synthesis model training method, speech synthesis apparatus, and storage medium |
CN113763924A (en) * | 2021-11-08 | 2021-12-07 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
CN113823256A (en) * | 2020-06-19 | 2021-12-21 | 微软技术许可有限责任公司 | Self-generated text-to-speech (TTS) synthesis |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1835075A (en) * | 2006-04-07 | 2006-09-20 | 安徽中科大讯飞信息科技有限公司 | Speech synthetizing method combined natural sample selection and acaustic parameter to build mould |
CN102122505A (en) * | 2010-01-08 | 2011-07-13 | 王程程 | Modeling method for enhancing expressive force of text-to-speech (TTS) system |
CN105529023A (en) * | 2016-01-25 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device |
EP3021318A1 (en) * | 2014-11-17 | 2016-05-18 | Samsung Electronics Co., Ltd. | Speech synthesis apparatus and control method thereof |
CN108182936A (en) * | 2018-03-14 | 2018-06-19 | 百度在线网络技术(北京)有限公司 | Voice signal generation method and device |
- 2018-09-12: Application CN201811061208.9A filed in China (CN); published as CN108806665A (en); status: Pending
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801608A (en) * | 2018-12-18 | 2019-05-24 | 武汉西山艺创文化有限公司 | A neural-network-based song generation method and system |
CN109599092A (en) * | 2018-12-21 | 2019-04-09 | 秒针信息技术有限公司 | A kind of audio synthetic method and device |
CN109599092B (en) * | 2018-12-21 | 2022-06-10 | 秒针信息技术有限公司 | Audio synthesis method and device |
CN110473515A (en) * | 2019-08-29 | 2019-11-19 | 郝洁 | An end-to-end speech synthesis method based on WaveRNN |
CN110751940A (en) * | 2019-09-16 | 2020-02-04 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and computer storage medium for generating voice packet |
US11527233B2 (en) | 2019-09-16 | 2022-12-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device and computer storage medium for generating speech packet |
CN113299272B (en) * | 2020-02-06 | 2023-10-31 | 菜鸟智能物流控股有限公司 | Speech synthesis model training and speech synthesis method, equipment and storage medium |
CN113299272A (en) * | 2020-02-06 | 2021-08-24 | 菜鸟智能物流控股有限公司 | Speech synthesis model training method, speech synthesis apparatus, and storage medium |
CN111192566A (en) * | 2020-03-03 | 2020-05-22 | 云知声智能科技股份有限公司 | English speech synthesis method and device |
CN111192566B (en) * | 2020-03-03 | 2022-06-24 | 云知声智能科技股份有限公司 | English speech synthesis method and device |
CN111402855A (en) * | 2020-03-06 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111402855B (en) * | 2020-03-06 | 2021-08-27 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111429881B (en) * | 2020-03-19 | 2023-08-18 | 北京字节跳动网络技术有限公司 | Speech synthesis method and device, readable medium and electronic equipment |
CN111429881A (en) * | 2020-03-19 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Sound reproduction method, device, readable medium and electronic equipment |
CN111599338A (en) * | 2020-04-09 | 2020-08-28 | 云知声智能科技股份有限公司 | Stable and controllable end-to-end speech synthesis method and device |
CN111613224A (en) * | 2020-04-10 | 2020-09-01 | 云知声智能科技股份有限公司 | Personalized voice synthesis method and device |
CN111583904A (en) * | 2020-05-13 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111583904B (en) * | 2020-05-13 | 2021-11-19 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111599343B (en) * | 2020-05-14 | 2021-11-09 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN111599343A (en) * | 2020-05-14 | 2020-08-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN113823256A (en) * | 2020-06-19 | 2021-12-21 | 微软技术许可有限责任公司 | Self-generated text-to-speech (TTS) synthesis |
CN111785247A (en) * | 2020-07-13 | 2020-10-16 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN111899720B (en) * | 2020-07-30 | 2024-03-15 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN111899720A (en) * | 2020-07-30 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN112289298A (en) * | 2020-09-30 | 2021-01-29 | 北京大米科技有限公司 | Processing method and device for synthesized voice, storage medium and electronic equipment |
CN112420015A (en) * | 2020-11-18 | 2021-02-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method, device, equipment and computer readable storage medium |
CN112530400A (en) * | 2020-11-30 | 2021-03-19 | 清华珠三角研究院 | Method, system, device and medium for generating voice based on text of deep learning |
CN112542153A (en) * | 2020-12-02 | 2021-03-23 | 北京沃东天骏信息技术有限公司 | Duration prediction model training method and device, and speech synthesis method and device |
CN112767957A (en) * | 2020-12-31 | 2021-05-07 | 科大讯飞股份有限公司 | Method for obtaining prediction model, method for predicting voice waveform and related device |
CN112767957B (en) * | 2020-12-31 | 2024-05-31 | 中国科学技术大学 | Method for obtaining prediction model, prediction method of voice waveform and related device |
CN112786013A (en) * | 2021-01-11 | 2021-05-11 | 北京有竹居网络技术有限公司 | Voice synthesis method and device based on album, readable medium and electronic equipment |
CN113012680A (en) * | 2021-03-03 | 2021-06-22 | 北京太极华保科技股份有限公司 | Speech technology synthesis method and device for speech robot |
CN113012680B (en) * | 2021-03-03 | 2021-10-15 | 北京太极华保科技股份有限公司 | Speech technology synthesis method and device for speech robot |
CN112951204A (en) * | 2021-03-29 | 2021-06-11 | 北京大米科技有限公司 | Speech synthesis method and device |
CN113763924A (en) * | 2021-11-08 | 2021-12-07 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108806665A (en) | Phoneme synthesizing method and device | |
CN108597492B (en) | Phoneme synthesizing method and device | |
US10553201B2 (en) | Method and apparatus for speech synthesis | |
CN109036384B (en) | Audio recognition method and device | |
CN108182936B (en) | Voice signal generation method and device | |
WO2020073944A1 (en) | Speech synthesis method and device | |
CN111369971B (en) | Speech synthesis method, device, storage medium and electronic equipment | |
CN111276120B (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
CN110033755A (en) | Phoneme synthesizing method, device, computer equipment and storage medium | |
CN108630190A (en) | Method and apparatus for generating phonetic synthesis model | |
CN111402843B (en) | Rap music generation method and device, readable medium and electronic equipment | |
CN110197655A (en) | Method and apparatus for synthesizing voice | |
CN112382270A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN111477210A (en) | Speech synthesis method and device | |
CN111627420A (en) | Specific-speaker emotion voice synthesis method and device under extremely low resources | |
CN113421584B (en) | Audio noise reduction method, device, computer equipment and storage medium | |
CN107910005A (en) | The target service localization method and device of interaction text | |
CN113345416B (en) | Voice synthesis method and device and electronic equipment | |
CN113539239B (en) | Voice conversion method and device, storage medium and electronic equipment | |
JP2020013008A (en) | Voice processing device, voice processing program, and voice processing method | |
JP2022133447A (en) | Speech processing method and device, electronic apparatus, and storage medium | |
Płonkowski | Using bands of frequencies for vowel recognition for Polish language | |
CN114464163A (en) | Method, device, equipment, storage medium and product for training speech synthesis model | |
CN112382274A (en) | Audio synthesis method, device, equipment and storage medium | |
Matsumoto et al. | Speech-like emotional sound generation using wavenet |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20181113 |
|