CN110164413A - Speech synthesis method, apparatus, computer device and storage medium - Google Patents

Speech synthesis method, apparatus, computer device and storage medium

Info

Publication number
CN110164413A
CN110164413A (application CN201910394665.8A)
Authority
CN
China
Prior art keywords
processed
word
text
sound
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910394665.8A
Other languages
Chinese (zh)
Other versions
CN110164413B (en)
Inventor
熊皓
张睿卿
张传强
何中军
李芝
吴华
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910394665.8A (granted as CN110164413B)
Publication of CN110164413A
Application granted
Publication of CN110164413B
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application proposes a speech synthesis method, apparatus, computer device and storage medium. In the method, the text-to-speech conversion result of only one word to be processed is generated at a time, while the acoustic features of the words already processed are taken into account, so that the generated conversion result is smooth and free of abrupt prosodic breaks. In other words, the conversion result of a single word can be returned immediately, and when the per-word conversion segments of a sentence are merged, the overall effect is not degraded, which guarantees sound quality while improving synthesis efficiency. This solves the technical problems in the prior art that splitting a sentence into multiple text-to-speech conversion results tends to produce choppy speech signals with very poor transitions, and that waiting for the synthesis system to generate the complete synthesis result before delivering it to the relevant device for playback leads to large latency.

Description

Speech synthesis method, apparatus, computer device and storage medium
Technical field
This application relates to the field of speech processing technologies, and in particular to a speech synthesis method, apparatus, computer device and storage medium.
Background art
In general, a traditional speech synthesis system can only synthesize a complete sentence and cannot accept a single word or phrase. Splicing together the synthesis results of individual words or phrase fragments gives a very poor listening experience: the speech tends to sound choppy, and the transitions between segments are very unnatural. This is especially problematic in real-time scenarios such as simultaneous interpretation, where a speech signal must be generated in real time from the speaker's translation result; waiting for the speaker to finish a sentence, or splicing partial translation results, both lead to unsatisfactory synthesis quality.
In the related art, either the text-to-speech conversion results of several words are generated as needed and played back separately, or the synthesis system waits for the complete text sentence and generates a single conversion result for it. However, if a sentence is split into multiple conversion results, the resulting speech signal is prone to choppiness and the transitions are very poor; and waiting for the synthesis system to process the complete sentence introduces a large delay, because the complete synthesis result must be generated before it can be delivered to the relevant device for playback.
Summary of the invention
The application is intended to solve at least one of the technical problems in the related art.
To this end, the application proposes a speech synthesis method, apparatus, computer device and storage medium, to solve the technical problems in the prior art that splitting a sentence into multiple text-to-speech conversion results tends to produce choppy speech signals with very poor transitions, and that waiting for the synthesis system to generate the complete synthesis result before it can be delivered to the relevant device for playback leads to large latency. By generating the text-to-speech conversion result of only one word to be processed at a time while taking the acoustic features of the already-processed words into account, the generated conversion result is smooth and free of abrupt prosodic breaks; that is, the conversion result of a single word can be returned, and merging the per-word conversion segments of a sentence does not degrade the overall effect, which guarantees sound quality while improving synthesis efficiency.
To achieve the above object, an embodiment of a first aspect of the application proposes a speech synthesis method, including:
obtaining a text to be processed, and performing word segmentation on the text to be processed to generate multiple words to be processed;
encoding the Nth word to be processed to generate an Nth semantic space vector, where N is a positive integer;
obtaining N-1 acoustic features of the processed words preceding the Nth word to be processed;
decoding the Nth semantic space vector together with the N-1 acoustic features of the processed words to generate a target acoustic feature corresponding to the Nth word to be processed; and
generating an Nth speech segment corresponding to the Nth word to be processed according to the target acoustic feature, and synthesizing speech information corresponding to the text to be processed from the speech segments of all the words.
In the speech synthesis method of this embodiment, a text to be processed is obtained and segmented into multiple words to be processed; the Nth word to be processed is encoded to generate an Nth semantic space vector, where N is a positive integer; the N-1 acoustic features of the processed words preceding the Nth word are obtained; the Nth semantic space vector and the N-1 acoustic features are decoded together to generate a target acoustic feature corresponding to the Nth word; an Nth speech segment corresponding to the Nth word is generated from the target acoustic feature; and the speech information corresponding to the text to be processed is synthesized from the speech segments of all the words. This solves the technical problems in the prior art that splitting a sentence into multiple text-to-speech conversion results tends to produce choppy speech with very poor transitions, and that waiting for the synthesis system to generate the complete result before delivering it to the relevant device for playback leads to large latency. Because only one word's conversion result is generated at a time while the acoustic features of the processed words are taken into account, the generated conversion result is smooth and free of abrupt prosodic breaks; the conversion result of a single word can be returned, and merging the per-word conversion segments of a sentence does not degrade the overall effect, guaranteeing sound quality while improving synthesis efficiency.
To achieve the above object, an embodiment of a second aspect of the application proposes a speech synthesis apparatus, including:
a first obtaining module, configured to obtain a text to be processed and perform word segmentation on the text to be processed to generate multiple words to be processed;
an encoding module, configured to encode the Nth word to be processed to generate an Nth semantic space vector, where N is a positive integer;
a second obtaining module, configured to obtain N-1 acoustic features of the processed words preceding the Nth word to be processed;
a decoding module, configured to decode the Nth semantic space vector together with the N-1 acoustic features of the processed words to generate a target acoustic feature corresponding to the Nth word to be processed; and
a processing module, configured to generate an Nth speech segment corresponding to the Nth word to be processed according to the target acoustic feature, and to synthesize speech information corresponding to the text to be processed from the multiple speech segments.
The speech synthesis apparatus of this embodiment obtains a text to be processed and segments it into multiple words to be processed; encodes the Nth word to be processed to generate an Nth semantic space vector, where N is a positive integer; obtains the N-1 acoustic features of the processed words preceding the Nth word; decodes the Nth semantic space vector together with the N-1 acoustic features to generate a target acoustic feature corresponding to the Nth word; and generates an Nth speech segment from the target acoustic feature, synthesizing the speech information corresponding to the text from the speech segments of all the words. This solves the prior-art problems that splitting a sentence into multiple text-to-speech conversion results tends to produce choppy speech with very poor transitions, and that waiting for the complete synthesis result before delivering it to the relevant device for playback leads to large latency: because only one word's conversion result is generated at a time while the acoustic features of the processed words are taken into account, the generated result is smooth and free of abrupt prosodic breaks, the conversion result of a single word can be returned, and merging the per-word segments of a sentence does not degrade the overall effect, guaranteeing sound quality while improving synthesis efficiency.
To achieve the above object, an embodiment of a third aspect of the application proposes a computer device, including a processor and a memory, where the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code, so as to implement the speech synthesis method described in the embodiment of the first aspect.
To achieve the above object, an embodiment of a fourth aspect of the application proposes a non-transitory computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the speech synthesis method described in the embodiment of the first aspect.
To achieve the above object, an embodiment of a fifth aspect of the application proposes a computer program product, where the speech synthesis method described in the embodiment of the first aspect is implemented when instructions in the computer program product are executed by a processor.
Additional aspects and advantages of the application will be set forth in part in the following description, and will in part become apparent from the description or be learned through practice of the application.
Detailed description of the invention
The above and/or additional aspects and advantages of the application will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the application;
Fig. 2 is a schematic flowchart of another speech synthesis method provided by an embodiment of the application;
Fig. 3 is an example diagram of a speech synthesis method provided by an embodiment of the application;
Fig. 4 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the application; and
Fig. 5 is a schematic structural diagram of a computer device provided by an embodiment of the application.
Specific embodiment
Embodiments of the application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements, or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary and are intended to explain the application; they should not be construed as limiting the application.
The speech synthesis method, apparatus, computer device and storage medium of the embodiments of the application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the application.
As shown in Fig. 1, the speech synthesis method may include the following steps.
Step 101: obtain a text to be processed, and perform word segmentation on the text to be processed to generate multiple words to be processed.
In practical applications, there are many scenarios in which speech information needs to be generated and played back in real time. In the prior art, if a sentence is split into multiple text-to-speech conversion results, the resulting speech is prone to choppiness and the transitions are very poor; and waiting for the complete synthesis result to be generated before delivering it to the relevant device for playback introduces a large delay.
Therefore, the application proposes a speech synthesis method that generates the text-to-speech conversion result of only one word to be processed at a time while taking the acoustic features of the processed words into account, so that the generated conversion result is smooth and free of abrupt prosodic breaks. In other words, the conversion result of a single word can be returned, and merging the segments into the speech of a complete sentence does not degrade the overall effect, guaranteeing sound quality while improving synthesis efficiency.
Specifically, the text to be processed, namely the text on which speech synthesis is to be performed, is obtained. It can be understood that the text to be processed may be obtained according to factors such as different users and different scenarios; for example, in a simultaneous interpretation scenario the translated text may serve as the text to be processed, and in another example the text a user enters through a relevant device may serve as the text to be processed.
Further, after the text to be processed is obtained, word segmentation is performed on it to generate multiple words to be processed. Many segmentation strategies may be used: for example, the text may be analyzed and segmented according to the parts of speech and the correlations between words, or it may be segmented according to features such as the user's habits and the intention of the text to be processed. A minimal sketch of this step is given below.
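As a concrete illustration of this step, the following sketch uses the open-source jieba segmenter; the patent does not mandate any particular segmentation tool or strategy, so this is only one possible choice.

```python
# A minimal sketch of the word-segmentation step (step 101), assuming the
# open-source jieba segmenter; the patent itself does not prescribe a
# particular segmentation tool or strategy.
import jieba

def segment_text(text_to_process: str) -> list:
    """Split the text to be processed into a list of words to be processed."""
    # jieba.lcut returns the words of the text in reading order.
    return [w for w in jieba.lcut(text_to_process) if w.strip()]

words_to_process = segment_text("语音合成系统需要实时生成语音信号")
print(words_to_process)  # e.g. ['语音', '合成', '系统', '需要', '实时', '生成', '语音', '信号']
```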
Step 102: encode the Nth word to be processed to generate an Nth semantic space vector, where N is a positive integer.
Step 103: obtain the N-1 acoustic features of the processed words preceding the Nth word to be processed.
Specifically, in this embodiment of the application, the speech synthesis model may be implemented with an end-to-end architecture consisting of two main parts, an encoder and a decoder. The encoder mainly encodes the word to be processed and maps it to a semantic space vector, while the decoder mainly decodes the semantic space vector into an acoustic feature. The acoustic feature may be a Mel spectrum, linear predictive coding coefficients, or the like; the Mel spectrum is preferred here to guarantee the quality of the synthesized audio. A sketch of such an encoder-decoder pair follows.
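As an illustration of the encoder/decoder split described above, the following sketch is written with PyTorch. The module layout, the GRU encoder, the single-layer decoder and all dimensions are assumptions made for readability; the patent only requires that the encoder map a word to a semantic space vector and that the decoder turn that vector, together with the acoustic features of the processed words, into an acoustic feature such as a Mel spectrum.

```python
# An illustrative PyTorch sketch of the encoder/decoder split described above.
# The GRU encoder, the single-layer decoder and every dimension (80 Mel bins,
# 256-dimensional semantic vectors, 20 output frames per word) are assumptions
# made for readability, not the patent's prescribed model.
import torch
import torch.nn as nn

N_MELS = 80    # assumed Mel-spectrum dimensionality
EMB_DIM = 256  # assumed embedding / semantic-space-vector size

class Encoder(nn.Module):
    """Maps a sequence of word ids to a single semantic space vector."""
    def __init__(self, vocab_size: int = 5000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, EMB_DIM, batch_first=True)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, seq_len) -> semantic space vector: (batch, EMB_DIM)
        _, hidden = self.rnn(self.embed(word_ids))
        return hidden[-1]

class Decoder(nn.Module):
    """Decodes a semantic vector plus a context acoustic feature into Mel frames."""
    def __init__(self, out_frames: int = 20):
        super().__init__()
        self.out_frames = out_frames
        self.proj = nn.Linear(EMB_DIM + N_MELS, out_frames * N_MELS)

    def forward(self, semantic_vec: torch.Tensor, prev_context: torch.Tensor) -> torch.Tensor:
        # prev_context: reduced Mel feature of the processed words, (batch, N_MELS)
        x = torch.cat([semantic_vec, prev_context], dim=-1)
        return self.proj(x).view(-1, self.out_frames, N_MELS)

# Usage: encode the Nth word and decode it together with the previous context.
encoder, decoder = Encoder(), Decoder()
word_ids = torch.randint(0, 5000, (1, 3))       # placeholder token ids for one word
prev_context = torch.zeros(1, N_MELS)           # N = 1: no processed words yet
mel = decoder(encoder(word_ids), prev_context)  # Mel frames of shape (1, 20, 80)
```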
Therefore, when processing the Nth word to be processed, the Nth word is first encoded to generate the corresponding Nth semantic space vector. For example, the 1st word to be processed is encoded to generate the corresponding 1st semantic space vector, and the 5th word to be processed is encoded to generate the corresponding 5th semantic space vector.
Further, the N-1 acoustic features of the processed words preceding the Nth word to be processed are obtained. It can be understood that the 1st word to be processed has no processed word before it, and therefore no acoustic features of processed words; the 5th word to be processed, by contrast, has 4 processed words before it, so the 4 obtained acoustic features are the 1st, 2nd, 3rd and 4th acoustic features, i.e., 4 Mel spectra when the acoustic feature is a Mel spectrum.
It should be noted that the N-1 acoustic features of the processed words preceding the Nth word to be processed may be obtained in many ways, two of which are illustrated below (see the sketch after the second example):
In a first example, the N-1 acoustic features corresponding to the processed words are searched for in a preset database.
Specifically, after a word has been processed and its acoustic feature generated, the feature may be stored in a corresponding database, so that the N-1 acoustic features corresponding to the processed words can be looked up directly in the preset database.
In a second example, the N-1 processed words are obtained, and each processed word is encoded and decoded in real time to generate the N-1 acoustic features.
That is, the N-1 acoustic features may also be regenerated by encoding and decoding each processed word in real time.
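The sketch below illustrates the two strategies side by side as a lookup-with-fallback: a preset store is consulted first, and a missing feature is regenerated in real time. The store layout, the toy word_to_ids tokenizer and the reuse of the Encoder/Decoder sketch above are illustrative assumptions.

```python
# A lookup-with-fallback sketch of the two strategies above: consult a preset
# store of acoustic features first, and re-encode/re-decode a processed word in
# real time if its feature is missing. The store layout, the toy word_to_ids
# tokenizer and the reuse of the Encoder/Decoder sketch above are assumptions.
import torch

feature_db = {}  # word index -> cached Mel feature of that processed word

def word_to_ids(word: str) -> torch.Tensor:
    # Toy hashing tokenizer standing in for a real vocabulary lookup.
    return torch.tensor([[hash(ch) % 5000 for ch in word]])

def get_previous_features(words, n, encoder, decoder):
    """Return the acoustic features of the first n-1 (already processed) words."""
    features = []
    for i in range(n - 1):
        if i in feature_db:                          # first example: database lookup
            features.append(feature_db[i])
        else:                                        # second example: recompute in real time
            semantic_vec = encoder(word_to_ids(words[i]))
            mel = decoder(semantic_vec, torch.zeros(1, 80))  # context simplified to zeros
            feature_db[i] = mel                      # store it so later lookups succeed
            features.append(mel)
    return features
```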
Step 104: decode the Nth semantic space vector together with the N-1 acoustic features of the processed words to generate a target acoustic feature corresponding to the Nth word to be processed.
Specifically, as in the examples above, there is no processed word before the 1st word to be processed, and hence no acoustic feature of a processed word, so the target acoustic feature corresponding to the 1st word is generated directly from the 1st semantic space vector. For the 5th word to be processed, the target acoustic feature is generated from the 5th semantic space vector together with the 1st, 2nd, 3rd and 4th acoustic features.
It can be understood that, to further improve the efficiency and accuracy of speech synthesis, the decoding of the Nth semantic space vector and the N-1 acoustic features of the processed words into the target acoustic feature corresponding to the Nth word may be performed in several ways, two of which are illustrated below (see the sketch after the second example):
In a first example, the N-1 acoustic features are summed, and the summation result is decoded together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
In a second example, the N-1 acoustic features are averaged, and the averaging result is decoded together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
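The following sketch shows the two context-reduction strategies on Mel-spectrum features; the per-word frame averaging that reduces the frame count (also used in the Fig. 3 example later) is an assumed simplification, and the shapes are illustrative.

```python
# A sketch of the two context-reduction strategies above applied to Mel-spectrum
# features: the features of the N-1 processed words are combined by summation or
# averaging before being handed to the decoder as its additional input. The
# per-word frame averaging that reduces the frame count is an assumed detail.
import torch

def reduce_context(prev_mels, mode: str = "mean") -> torch.Tensor:
    """Collapse the Mel features of the processed words into one context frame.

    prev_mels: list of tensors of shape (1, frames_i, 80), one per processed word.
    Returns a (1, 80) tensor used as the decoder's additional input.
    """
    if not prev_mels:                    # N = 1: no processed words yet
        return torch.zeros(1, 80)
    # Average each word's Mel over its frames first (reducing the frame count),
    # then combine across the N-1 words by summation or averaging.
    per_word = torch.stack([m.mean(dim=1) for m in prev_mels], dim=0)  # (N-1, 1, 80)
    if mode == "sum":
        return per_word.sum(dim=0)       # first example: summation
    return per_word.mean(dim=0)          # second example: averaging
```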
Step 105: generate an Nth speech segment corresponding to the Nth word to be processed according to the target acoustic feature, and synthesize speech information corresponding to the text to be processed from the multiple speech segments.
Specifically, after the target acoustic feature is obtained, the corresponding Nth speech segment may be synthesized from it using Griffin-Lim, a WaveNet vocoder, or a similar method, and the speech information corresponding to the text to be processed is synthesized from the speech segments of all the words. As one scenario, the multiple speech segments are spliced in a preset order to generate a target phrase speech, and the target phrase speech serves as the speech information corresponding to the text to be processed, where the preset order is set according to the actual application. A vocoder sketch is given below.
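As an illustration of the vocoding step, the sketch below turns a Mel spectrum into a waveform with the Griffin-Lim algorithm via librosa; the patent equally allows a neural vocoder such as WaveNet. The sample rate and iteration count are assumptions.

```python
# A sketch of the vocoding step, turning the target Mel feature into a waveform
# with the Griffin-Lim algorithm via librosa; the patent equally allows a neural
# vocoder such as WaveNet. The sample rate and iteration count are assumptions.
import numpy as np
import librosa

SR = 22050  # assumed sampling rate

def mel_to_waveform(mel: np.ndarray) -> np.ndarray:
    """mel: non-negative Mel spectrogram of shape (n_mels, frames) -> mono waveform."""
    # librosa inverts the Mel filterbank and then runs Griffin-Lim internally.
    return librosa.feature.inverse.mel_to_audio(mel, sr=SR, n_iter=32)
```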
In the speech synthesis method of this embodiment, the text to be processed is obtained and segmented into multiple words to be processed; the Nth word to be processed is encoded to generate an Nth semantic space vector, where N is a positive integer; the N-1 acoustic features of the processed words preceding the Nth word are obtained; the Nth semantic space vector and the N-1 acoustic features are decoded together to generate a target acoustic feature corresponding to the Nth word; an Nth speech segment corresponding to the Nth word is generated from the target acoustic feature; and the speech information corresponding to the text to be processed is synthesized from the speech segments of all the words. This solves the technical problems in the prior art that splitting a sentence into multiple text-to-speech conversion results tends to produce choppy speech signals with very poor transitions, and that waiting for the synthesis system to generate the complete synthesis result before delivering it to the relevant device for playback leads to large latency. Because only one word's conversion result is generated at a time while the acoustic features of the processed words are taken into account, the generated conversion result is smooth and free of abrupt prosodic breaks; the conversion result of a single word can be returned, and merging the per-word conversion segments of a sentence does not degrade the overall effect, guaranteeing sound quality while improving synthesis efficiency.
Fig. 2 is a schematic flowchart of another speech synthesis method provided by an embodiment of the application.
As shown in Fig. 2, the speech synthesis method may include the following steps.
Step 201: obtain a text to be processed, and perform word segmentation on the text to be processed to generate multiple words to be processed.
Step 202: encode the Nth word to be processed to generate an Nth semantic space vector, where N is a positive integer.
It should be noted that steps 201 and 202 are identical to steps 101 and 102 of the above embodiment and are not described in detail again here; refer to the description of steps 101 and 102.
Step 203: search the preset database for the N-1 acoustic features corresponding to the processed words.
Specifically, after each word to be processed is encoded into a semantic space vector and that vector is decoded into an acoustic feature, the acoustic feature is recorded and stored in the preset database.
Thus, the N-1 acoustic features corresponding to the processed words can be looked up directly in the preset database, which further improves synthesis efficiency.
Step 204: average the N-1 acoustic features, and decode the averaging result together with the Nth semantic space vector to generate a target acoustic feature corresponding to the Nth word to be processed.
Specifically, the acoustic features of words 1, 2, ..., N-1 are averaged to reduce the number of frames, and the result is fed to the decoder as an additional input and decoded together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
Step 205: generate an Nth speech segment corresponding to the Nth word to be processed according to the target acoustic feature, splice the multiple speech segments in a preset order to generate a target phrase speech, and use the target phrase speech as the speech information corresponding to the text to be processed.
Specifically, the generated speech segments are spliced in the preset order into one target phrase speech, namely the speech information corresponding to the text to be processed, which can then be sent to the relevant device for playback. A concatenation sketch follows.
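A minimal sketch of the splicing in step 205, assuming the soundfile library as the output sink; the per-word waveforms are concatenated in their original order to form the target phrase speech.

```python
# A minimal sketch of the splicing in step 205: the per-word waveforms are
# concatenated in their original (preset) order into the target phrase speech
# and written out for playback. soundfile is an assumed choice of output library.
import numpy as np
import soundfile as sf

SR = 22050  # assumed sampling rate, matching the vocoder sketch above

def splice_segments(segments, out_path: str = "sentence.wav") -> np.ndarray:
    """Concatenate per-word waveforms into the target phrase speech and save it."""
    target_phrase = np.concatenate(segments)  # preset order = list order here
    sf.write(out_path, target_phrase, SR)
    return target_phrase
```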
To help those skilled in the art better understand the above process, an example is described below with reference to Fig. 3.
Specifically, the entire text-to-speech conversion model may be implemented with an end-to-end architecture consisting of two main parts, an encoder and a decoder: the encoder mainly encodes the word sequence to be synthesized and maps it into the semantic space, while the decoder mainly decodes the semantic space into Mel spectra.
If the word currently being processed is the 1st word, the Mel spectrum of the 1st word is generated with the standard text-to-speech conversion model, and audio is synthesized from that Mel spectrum with WaveNet.
If the word currently being processed is the 2nd word, the encoder input is the two words 1+2; at decoding time, the Mel spectrum of the 1st word is retrieved and continues to be fed to the decoder as an additional input while the Mel spectrum of the 2nd word is generated.
By extension, the Nth word is handled in the same way as the 2nd word: the encoder input is the N words 1+2+...+N; at decoding time, the Mel spectra of words 1 to N-1 are first force-decoded, then summed or averaged to reduce the frame count and used as the additional decoder input to generate the Mel spectrum of the Nth word.
As shown in Fig. 3, the content W1 of the first word is used as the input text to generate the Mel spectrum Mel1 of the 1st word. The content W2 of the 2nd word is then used as the input text, while the Mel spectrum Mel1 generated for the first word is used as the input of Dense Pre-net. Next, the content W3 of the 3rd word is used as the input text, while the Mel spectra Mel1 and Mel2 generated for the first and second words are used as the input of Dense Pre-net; here Mel1 and Mel2 can be summed or averaged to reduce the sequence length. The text-to-speech conversion of subsequent words can be implemented with the same process; a sketch of this incremental loop is given below.
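The following sketch strings the pieces above together into the incremental, word-by-word loop illustrated in Fig. 3. It reuses the Encoder/Decoder, word_to_ids, reduce_context, mel_to_waveform and splice_segments sketches from earlier in this description; the loop structure follows the Fig. 3 description, while all module internals remain illustrative assumptions rather than the patent's prescribed implementation.

```python
# An end-to-end sketch of the incremental, word-by-word loop illustrated in
# Fig. 3, reusing the Encoder/Decoder, word_to_ids, reduce_context,
# mel_to_waveform and splice_segments sketches above. The loop structure follows
# the description; the module internals remain illustrative assumptions.
import torch

def synthesize_incrementally(words, encoder, decoder):
    cached_mels, waveforms = [], []
    for n in range(1, len(words) + 1):
        ids = word_to_ids("".join(words[:n]))          # encoder input: words 1+2+...+N
        semantic_vec = encoder(ids)                    # Nth semantic space vector
        context = reduce_context(cached_mels, "mean")  # averaged Mel of words 1..N-1
        mel_n = decoder(semantic_vec, context)         # Mel spectrum of the Nth word
        cached_mels.append(mel_n)                      # store it for the following words
        mel_np = mel_n[0].detach().clamp(min=0).T.numpy()  # (80, frames), non-negative
        waveforms.append(mel_to_waveform(mel_np))      # this segment can be played at once
    return splice_segments(waveforms)                  # merged sentence, overall effect intact

# Usage: speech = synthesize_incrementally(words_to_process, encoder, decoder)
```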
In the speech synthesis method of this embodiment, a text to be processed is obtained and segmented into multiple words to be processed; the Nth word to be processed is encoded to generate an Nth semantic space vector, where N is a positive integer; in step 203 the N-1 acoustic features corresponding to the processed words are looked up in the preset database; the N-1 acoustic features are averaged, and the averaging result is decoded together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word; the Nth speech segment corresponding to the Nth word is generated from the target acoustic feature; and the multiple speech segments are spliced in the preset order into the target phrase speech, which serves as the speech information corresponding to the text to be processed. Text-to-speech conversion results can thus be generated dynamically in real time: a sentence can be split into several words and synthesized piece by piece without degrading the final user experience.
To implement the above embodiments, the application further proposes a speech synthesis apparatus.
Fig. 4 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the application.
As shown in Fig. 4, the speech synthesis apparatus may include a first obtaining module 410, an encoding module 420, a second obtaining module 430, a decoding module 440 and a processing module 450.
The first obtaining module 410 is configured to obtain a text to be processed and perform word segmentation on the text to be processed to generate multiple words to be processed.
The encoding module 420 is configured to encode the Nth word to be processed to generate an Nth semantic space vector, where N is a positive integer.
The second obtaining module 430 is configured to obtain the N-1 acoustic features of the processed words preceding the Nth word to be processed.
The decoding module 440 is configured to decode the Nth semantic space vector together with the N-1 acoustic features of the processed words to generate a target acoustic feature corresponding to the Nth word to be processed.
The processing module 450 is configured to generate an Nth speech segment corresponding to the Nth word to be processed according to the target acoustic feature, and to synthesize speech information corresponding to the text to be processed from the multiple speech segments.
In a possible implementation of this embodiment, the second obtaining module 430 is specifically configured to: search the preset database for the N-1 acoustic features corresponding to the processed words; or obtain the N-1 processed words and encode and decode each processed word in real time to generate the N-1 acoustic features.
In a possible implementation of this embodiment, the decoding module 440 is specifically configured to: average the N-1 acoustic features; and decode the averaging result together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
In a possible implementation of this embodiment, the decoding module 440 is further specifically configured to: sum the N-1 acoustic features; and decode the summation result together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
In a possible implementation of this embodiment, the processing module 450 is specifically configured to: splice the multiple speech segments in a preset order to generate a target phrase speech; and use the target phrase speech as the speech information corresponding to the text to be processed.
It should be noted that the foregoing explanation of the speech synthesis method embodiments also applies to the speech synthesis apparatus of this embodiment; the implementation principles are similar and are not described again here.
The speech synthesis apparatus of this embodiment of the application obtains a text to be processed and segments it into multiple words to be processed; encodes the Nth word to be processed to generate an Nth semantic space vector, where N is a positive integer; obtains the N-1 acoustic features of the processed words preceding the Nth word; decodes the Nth semantic space vector together with the N-1 acoustic features to generate a target acoustic feature corresponding to the Nth word; and generates an Nth speech segment from the target acoustic feature, synthesizing the speech information corresponding to the text from the speech segments of all the words. This solves the prior-art problems that splitting a sentence into multiple text-to-speech conversion results tends to produce choppy speech with very poor transitions, and that waiting for the complete synthesis result before delivering it to the relevant device for playback leads to large latency: because only one word's conversion result is generated at a time while the acoustic features of the processed words are taken into account, the generated result is smooth and free of abrupt prosodic breaks, the conversion result of a single word can be returned, and merging the per-word segments of a sentence does not degrade the overall effect, guaranteeing sound quality while improving synthesis efficiency.
To implement the above embodiments, the application further proposes a computer device, including a processor and a memory, where the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code, so as to implement the speech synthesis method of the foregoing embodiments.
Fig. 5 is a schematic structural diagram of a computer device provided by an embodiment of the application, showing a block diagram of an exemplary computer device 90 suitable for implementing embodiments of the application. The computer device 90 shown in Fig. 5 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the application.
As shown in Fig. 5, the computer device 90 takes the form of a general-purpose computing device. Its components may include, but are not limited to, one or more processors or processing units 906, a system memory 910, and a bus 908 connecting the different system components (including the system memory 910 and the processing unit 906).
The bus 908 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 90 typically includes a variety of computer-system-readable media. These media may be any available media accessible by the computer device 90, including volatile and non-volatile media and removable and non-removable media.
The system memory 910 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 911 and/or a cache memory 912. The computer device 90 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 913 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 5, commonly referred to as a "hard disk drive"). Although not shown in Fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM) or other optical media) may be provided. In these cases, each drive may be connected to the bus 908 through one or more data media interfaces. The system memory 910 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the application.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted by any suitable medium, including, but not limited to, wireless, wire, optical cable, RF, or any suitable combination thereof.
Computer program code for carrying out the operations of the application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
A program/utility 914 having a set of (at least one) program modules 9140 may be stored, for example, in the system memory 910; such program modules 9140 include, but are not limited to, an operating system, one or more application programs, other program modules and program data, each or some combination of which may include an implementation of a network environment. The program modules 9140 generally carry out the functions and/or methods of the embodiments described herein.
The computer device 90 may also communicate with one or more external devices 10 (such as a keyboard, a pointing device or a display 100), with one or more devices that enable a user to interact with the computer device 90, and/or with any device (such as a network card or a modem) that enables the computer device 90 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 902. Moreover, the computer device 90 may communicate through a network adapter 900 with one or more networks, such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet. As shown in Fig. 5, the network adapter 900 communicates with the other modules of the computer device 90 through the bus 908. It should be understood that, although not shown in Fig. 5, other hardware and/or software modules may be used in conjunction with the computer device 90, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
The processing unit 906 runs the programs stored in the system memory 910, thereby executing various functional applications and speech synthesis (for example, speech synthesis in a vehicle-mounted scenario), such as implementing the speech synthesis method mentioned in the foregoing embodiments.
To implement the above embodiments, the application further proposes a non-transitory computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the speech synthesis method of the foregoing embodiments.
To implement the above embodiments, the application further proposes a computer program product, where the speech synthesis method of the foregoing embodiments is implemented when instructions in the computer program product are executed by a processor.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic expressions of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine and integrate different embodiments or examples, or features of different embodiments or examples, described in this specification.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features concerned. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the application, "plurality" means at least two, for example two or three, unless explicitly and specifically defined otherwise.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing custom logic functions or steps of the process, and the scope of the preferred embodiments of the application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the application belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions that may be considered to implement logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection with one or more wires (an electronic device), a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that parts of the application may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following technologies known in the art, or a combination thereof, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments may be completed by instructing relevant hardware through a program, which may be stored in a computer-readable storage medium and which, when executed, includes one of, or a combination of, the steps of the method embodiments.
In addition, the functional units in the embodiments of the application may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Although the embodiments of the application have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the application; those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the application.

Claims (12)

1. A speech synthesis method, characterized by comprising the following steps:
obtaining a text to be processed, and performing word segmentation on the text to be processed to generate multiple words to be processed;
encoding an Nth word to be processed to generate an Nth semantic space vector, wherein N is a positive integer;
obtaining N-1 acoustic features of processed words preceding the Nth word to be processed;
decoding the Nth semantic space vector together with the N-1 acoustic features of the processed words to generate a target acoustic feature corresponding to the Nth word to be processed; and
generating an Nth speech segment corresponding to the Nth word to be processed according to the target acoustic feature, and synthesizing speech information corresponding to the text to be processed from the multiple Nth speech segments.
2. The method according to claim 1, wherein the obtaining N-1 acoustic features of the processed words preceding the Nth word to be processed comprises:
searching a preset database for the N-1 acoustic features corresponding to the processed words; or
obtaining the N-1 processed words, and encoding and decoding each processed word in real time to generate the N-1 acoustic features.
3. The method according to claim 1, wherein the decoding the Nth semantic space vector together with the N-1 acoustic features of the processed words to generate the target acoustic feature corresponding to the Nth word to be processed comprises:
averaging the N-1 acoustic features; and
decoding the averaging result together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
4. The method according to claim 1, wherein the decoding the Nth semantic space vector together with the N-1 acoustic features of the processed words to generate the target acoustic feature corresponding to the Nth word to be processed comprises:
summing the N-1 acoustic features; and
decoding the summation result together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
5. The method according to claim 1, wherein the synthesizing speech information corresponding to the text to be processed from the multiple Nth speech segments comprises:
splicing the multiple Nth speech segments in a preset order to generate a target phrase speech; and
using the target phrase speech as the speech information corresponding to the text to be processed.
6. A speech synthesis apparatus, characterized by comprising:
a first obtaining module, configured to obtain a text to be processed and perform word segmentation on the text to be processed to generate multiple words to be processed;
an encoding module, configured to encode an Nth word to be processed to generate an Nth semantic space vector, wherein N is a positive integer;
a second obtaining module, configured to obtain N-1 acoustic features of processed words preceding the Nth word to be processed;
a decoding module, configured to decode the Nth semantic space vector together with the N-1 acoustic features of the processed words to generate a target acoustic feature corresponding to the Nth word to be processed; and
a processing module, configured to generate an Nth speech segment corresponding to the Nth word to be processed according to the target acoustic feature, and to synthesize speech information corresponding to the text to be processed from the multiple Nth speech segments.
7. The apparatus according to claim 6, wherein the second obtaining module is specifically configured to:
search a preset database for the N-1 acoustic features corresponding to the processed words; or
obtain the N-1 processed words, and encode and decode each processed word in real time to generate the N-1 acoustic features.
8. The apparatus according to claim 6, wherein the decoding module is specifically configured to:
average the N-1 acoustic features; and
decode the averaging result together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
9. The apparatus according to claim 6, wherein the decoding module is further specifically configured to:
sum the N-1 acoustic features; and
decode the summation result together with the Nth semantic space vector to generate the target acoustic feature corresponding to the Nth word to be processed.
10. The apparatus according to claim 6, wherein the processing module is specifically configured to:
splice the multiple Nth speech segments in a preset order to generate a target phrase speech; and
use the target phrase speech as the speech information corresponding to the text to be processed.
11. A computer device, characterized by comprising a processor and a memory;
wherein the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code, so as to implement the speech synthesis method according to any one of claims 1 to 5.
12. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 5.
CN201910394665.8A 2019-05-13 2019-05-13 Speech synthesis method, apparatus, computer device and storage medium Active CN110164413B (en)

Priority Applications (1)

Application Number: CN201910394665.8A (published as CN110164413B)
Priority Date: 2019-05-13
Filing Date: 2019-05-13
Title: Speech synthesis method, apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number: CN201910394665.8A (published as CN110164413B)
Priority Date: 2019-05-13
Filing Date: 2019-05-13
Title: Speech synthesis method, apparatus, computer device and storage medium

Publications (2)

Publication Number / Publication Date:
CN110164413A (en): 2019-08-23
CN110164413B (en): 2021-06-04

Family

ID=67634315

Family Applications (1)

Application Number: CN201910394665.8A (Active; granted as CN110164413B)
Title: Speech synthesis method, apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN110164413B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473516A (en) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device and electronic equipment
CN111524500A (en) * 2020-04-17 2020-08-11 浙江同花顺智能科技有限公司 Speech synthesis method, apparatus, device and storage medium
CN111951781A (en) * 2020-08-20 2020-11-17 天津大学 Chinese prosody boundary prediction method based on graph-to-sequence
CN113595868A (en) * 2021-06-28 2021-11-02 深圳云之家网络有限公司 Voice message processing method and device based on instant messaging and computer equipment

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104867491A (en) * 2015-06-17 2015-08-26 百度在线网络技术(北京)有限公司 Training method and device for prosody model used for speech synthesis
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN105355193A (en) * 2015-10-30 2016-02-24 百度在线网络技术(北京)有限公司 Speech synthesis method and device
US20160343366A1 (en) * 2015-05-19 2016-11-24 Google Inc. Speech synthesis model selection
US9508338B1 (en) * 2013-11-15 2016-11-29 Amazon Technologies, Inc. Inserting breath sounds into text-to-speech output
CN106297766A (en) * 2015-06-04 2017-01-04 科大讯飞股份有限公司 Phoneme synthesizing method and system
WO2017094500A1 (en) * 2015-12-02 2017-06-08 株式会社電通 Determination device and voice provision system provided therewith
CN107564511A (en) * 2017-09-25 2018-01-09 平安科技(深圳)有限公司 Electronic installation, phoneme synthesizing method and computer-readable recording medium
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN108492818A (en) * 2018-03-22 2018-09-04 百度在线网络技术(北京)有限公司 Conversion method, device and the computer equipment of Text To Speech
CN109036371A (en) * 2018-07-19 2018-12-18 北京光年无限科技有限公司 Audio data generation method and system for speech synthesis
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9508338B1 (en) * 2013-11-15 2016-11-29 Amazon Technologies, Inc. Inserting breath sounds into text-to-speech output
US20160343366A1 (en) * 2015-05-19 2016-11-24 Google Inc. Speech synthesis model selection
CN106297766A (en) * 2015-06-04 2017-01-04 科大讯飞股份有限公司 Phoneme synthesizing method and system
CN104867491A (en) * 2015-06-17 2015-08-26 百度在线网络技术(北京)有限公司 Training method and device for prosody model used for speech synthesis
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105355193A (en) * 2015-10-30 2016-02-24 百度在线网络技术(北京)有限公司 Speech synthesis method and device
WO2017094500A1 (en) * 2015-12-02 2017-06-08 株式会社電通 Determination device and voice provision system provided therewith
CN107564511A (en) * 2017-09-25 2018-01-09 平安科技(深圳)有限公司 Electronic installation, phoneme synthesizing method and computer-readable recording medium
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN108492818A (en) * 2018-03-22 2018-09-04 百度在线网络技术(北京)有限公司 Conversion method, device and the computer equipment of Text To Speech
CN109036371A (en) * 2018-07-19 2018-12-18 北京光年无限科技有限公司 Audio data generation method and system for speech synthesis
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WANG X, LORENZO-TRUEBA J, TAKAKI S, ET AL: "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis", 《2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 *
周志平: "基于深度学习的小尺度单元拼接语音合成方法研究" [Research on small-scale unit concatenation speech synthesis methods based on deep learning], 《中国优秀硕士学位论文全文数据库》 [China Masters' Theses Full-text Database] *
张斌, 全昌勤, 任福继: "语音合成方法和发展综述" [A survey of speech synthesis methods and their development], 《小型微型计算机系统》 [Journal of Chinese Computer Systems] *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473516A (en) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device and electronic equipment
CN110473516B (en) * 2019-09-19 2020-11-27 百度在线网络技术(北京)有限公司 Voice synthesis method and device and electronic equipment
US11417314B2 (en) 2019-09-19 2022-08-16 Baidu Online Network Technology (Beijing) Co., Ltd. Speech synthesis method, speech synthesis device, and electronic apparatus
CN111524500A (en) * 2020-04-17 2020-08-11 浙江同花顺智能科技有限公司 Speech synthesis method, apparatus, device and storage medium
CN111524500B (en) * 2020-04-17 2023-03-31 浙江同花顺智能科技有限公司 Speech synthesis method, apparatus, device and storage medium
CN111951781A (en) * 2020-08-20 2020-11-17 天津大学 Chinese prosody boundary prediction method based on graph-to-sequence
CN113595868A (en) * 2021-06-28 2021-11-02 深圳云之家网络有限公司 Voice message processing method and device based on instant messaging and computer equipment

Also Published As

Publication number Publication date
CN110164413B (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN110164413A (en) Phoneme synthesizing method, device, computer equipment and storage medium
US7831432B2 (en) Audio menus describing media contents of media players
US9761219B2 (en) System and method for distributed text-to-speech synthesis and intelligibility
CN105845125B (en) Phoneme synthesizing method and speech synthetic device
US7483832B2 (en) Method and system for customizing voice translation of text to speech
CN1889170B (en) Method and system for generating synthesized speech based on recorded speech template
CN109389968B (en) Waveform splicing method, device, equipment and storage medium based on double syllable mixing and lapping
JP2002530703A (en) Speech synthesis using concatenation of speech waveforms
CN109147831A (en) A kind of voice connection playback method, terminal device and computer readable storage medium
CN107437413A (en) voice broadcast method and device
CN109599090B (en) Method, device and equipment for voice synthesis
US11908448B2 (en) Parallel tacotron non-autoregressive and controllable TTS
CN108492818A (en) Conversion method, device and the computer equipment of Text To Speech
CN110059313A (en) Translation processing method and device
CN107766325A (en) Text joining method and its device
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN1945692B (en) Intelligent method for improving prompting voice matching effect in voice synthetic system
US7089187B2 (en) Voice synthesizing system, segment generation apparatus for generating segments for voice synthesis, voice synthesizing method and storage medium storing program therefor
KR100710600B1 (en) The method and apparatus that createdplayback auto synchronization of image, text, lip's shape using TTS
CN114911973A (en) Action generation method and device, electronic equipment and storage medium
JP2002062890A (en) Method and device for speech synthesis and recording medium which records voice synthesis processing program
US20030216920A1 (en) Method and apparatus for processing number in a text to speech (TTS) application
CN113870833A (en) Speech synthesis related system, method, device and equipment
JP2000020744A (en) Method for producing contents by moving image and synthesized voice and program recording medium
JP4648183B2 (en) Continuous media data shortening reproduction method, composite media data shortening reproduction method and apparatus, program, and computer-readable recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant