
CN109448694A - Method and device for rapidly synthesizing TTS voice

Info

Publication number: CN109448694A
Application number: CN201811611687.7A
Authority: CN
Inventors: 林婷; 郭志煌
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2019-03-08

Abstract

The invention discloses a method for rapidly synthesizing TTS voice, comprising the following steps: obtaining response text information; determining a convergence strategy according to the response text information; and generating TTS voice according to the determined convergence strategy. The invention also discloses a device for rapidly synthesizing TTS voice. The disclosed method and device can shorten the voice interaction time between an intelligent voice device and a user, thereby improving voice interaction performance, and can provide a good voice interaction experience even on devices with low hardware configurations.

Description

Method and device for rapidly synthesizing TTS voice

Technical field

The present invention relates to the technical field of voice interaction, and in particular to a method and device for rapidly synthesizing TTS voice.

Background art

With the continuous development of voice interaction technology, voice interaction is used more and more widely. Voice interaction in the prior art works as follows: the user speaks a voice instruction; the device recognizes the instruction, performs semantic understanding on it, outputs the text information needed to respond to the instruction according to the semantics, converts the text information into TTS voice, and plays it. This realizes voice interaction between the intelligent voice device and the user, so that questions are answered and human-machine communication is smooth.

In this voice interaction scenario, however, TTS synthesis speed is an important factor affecting user experience. In the prior art in particular, the hardware configurations that carry voice technology vary widely, so the voice interaction function must adapt to device models with both high and low configurations. On low-configuration models, TTS synthesis is often slow during voice interaction, which degrades the user's voice interaction experience.

Summary of the invention

To solve the above problem, the inventors considered the TTS synthesis process itself: TTS synthesis is performed according to a convergence strategy so as to improve voice response speed.

According to a first aspect of the invention, a method for rapidly synthesizing TTS voice is provided, comprising the following steps:

obtaining response text information;

determining a convergence strategy according to the response text information; and

generating TTS voice according to the determined convergence strategy.

According to a second aspect of the invention, a device for rapidly synthesizing TTS voice is provided, comprising:

a response information obtaining module, configured to obtain response text information;

a strategy determining module, configured to determine a convergence strategy according to the response text information; and

a voice output module, configured to generate TTS voice according to the determined convergence strategy.

According to a third aspect of the invention, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the above method.

According to a fourth aspect of the invention, a storage medium is provided on which a computer program is stored, the program implementing the steps of the above method when executed by a processor.

The device and method provided by the invention perform TTS synthesis according to a convergence strategy, and the convergence strategy is determined from the response text information. Speech synthesis can therefore be handled flexibly based on the response information, which shortens the voice interaction time between the intelligent voice device and the user and improves voice interaction performance. Moreover, the device and method provided by the invention can provide a good voice interaction experience even on devices with low hardware configurations.

Brief description of the drawings

Fig. 1 is a flowchart of a method for rapidly synthesizing TTS voice according to an embodiment of the invention;

Fig. 2 is a functional block diagram of a device for rapidly synthesizing TTS voice according to an embodiment of the invention;

Fig. 3 is a functional block diagram of a device for rapidly synthesizing TTS voice according to another embodiment of the invention;

Fig. 4 is a block diagram of an electronic device according to an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.

It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.

The invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

In the invention, terms such as "module", "device", and "system" refer to computer-related entities, such as hardware, a combination of hardware and software, software, or software in execution. In particular, an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable element, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may be run from various computer-readable media. Elements may also communicate by way of local and/or remote processes in accordance with a signal having one or more data packets, for example a signal from data interacting with another element in a local system or distributed system, and/or interacting with other systems by way of the signal across a network such as the internet.

Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements not explicitly listed, as well as elements inherent to the process, method, article, or device. Without further limitation, an element defined by the phrase "including ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The method for rapidly synthesizing TTS voice of the embodiments of the invention can be applied to any terminal device equipped with a voice function, such as a smartphone, tablet computer, or smart home device; the invention is not limited in this respect. Users of these terminal devices thereby receive more timely and accurate responses, which improves the user experience.

The invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 schematically shows a flowchart of a method for rapidly synthesizing TTS voice according to an embodiment of the invention. As shown in Fig. 1, this embodiment includes the following steps:

Step S101: obtain response text information. The response text information is the text that is to be given as the response; for example, it may be the response text output according to the semantics during speech recognition. How it is obtained depends on the application scenario and may follow the prior art. During voice interaction, for example, preconfigured response text information may be fetched from a database according to the speech recognition result; alternatively, a calling interface for response text information may be provided, and the response text information passed in is received directly from that interface.

Step S102: determine a convergence strategy according to the response text information, wherein the convergence strategies include a high-frequency strategy, a local synthesis strategy, and a cloud synthesis strategy.

For example, a high-frequency voice library is configured first. The high-frequency voice library contains frequently used corpus entries and their corresponding voices. When configuring the library, the frequently used corpus entries and their voices may be determined from experience. On an in-vehicle device, for instance, the voice instruction to listen to a song is common, so the reply voice "will play music" can be set as a high-frequency voice and configured in the library: the corpus entry is "will play music", and the audio file is the audio that speaks that entry. During configuration, each voice is stored as an audio file, and a mapping list is generated that maps each corpus entry one-to-one to an audio file name or ID.
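The mapping list described above can be sketched as a small lookup table. This is a minimal illustration in Python, not the implementation disclosed by the patent: the class name `HighFrequencyLibrary`, its methods, and the file name `hf_0001.pcm` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HighFrequencyLibrary:
    """Maps each corpus entry (reply text) to the name of its stored audio file.

    Hypothetical structure for illustration; the patent only requires a
    one-to-one mapping between corpus entries and audio file names or IDs.
    """

    entries: Dict[str, str] = field(default_factory=dict)

    def add(self, corpus: str, audio_file: str) -> None:
        # Bind the corpus entry to its audio file name.
        self.entries[corpus] = audio_file

    def lookup(self, reply_text: str) -> Optional[str]:
        # Return the audio file name for an exact match, or None if absent.
        return self.entries.get(reply_text)


library = HighFrequencyLibrary()
library.add("will play music", "hf_0001.pcm")  # file name is made up
```

An exact-match dictionary is the simplest reading of "matching" here; the source does not say whether fuzzy or normalized matching is intended.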

After the response text information is obtained, it is matched against the corpus entries in the high-frequency voice library. A successful match means that the corresponding audio content is stored in the high-frequency voice library, that is, the response text information is frequently used text, and the convergence strategy is determined to be the high-frequency strategy.

If the match fails, the current response text information is not yet stored in the high-frequency voice library. The network state is then obtained, and the convergence strategy is determined to be the local synthesis strategy or the cloud synthesis strategy according to that state: if the network is not connected, the local synthesis strategy is chosen; if the network is connected, the cloud synthesis strategy is chosen.
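The decision in step S102 can be sketched as a single function. The following Python is a hedged illustration only: the names `Strategy` and `determine_strategy` are assumptions, and the network state is passed in as a boolean rather than probed, since the source does not say how the state is obtained.

```python
from enum import Enum
from typing import Collection


class Strategy(Enum):
    HIGH_FREQUENCY = "high_frequency"
    LOCAL = "local"
    CLOUD = "cloud"


def determine_strategy(
    reply_text: str,
    corpus: Collection[str],
    network_connected: bool,
) -> Strategy:
    """Pick a strategy following the order described in step S102."""
    # A match in the high-frequency library takes priority over everything.
    if reply_text in corpus:
        return Strategy.HIGH_FREQUENCY
    # Otherwise the network state decides between cloud and local synthesis.
    return Strategy.CLOUD if network_connected else Strategy.LOCAL
```

For example, with `corpus = {"will play music"}`, the reply "will play music" selects the high-frequency strategy regardless of network state, while any other reply selects cloud synthesis when connected and local synthesis when not.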

Step S103: generate TTS voice according to the determined convergence strategy. Once the convergence strategy has been determined in step S102, the corresponding TTS voice is generated according to it, achieving a fast response. Specifically, this can be implemented in the following cases:

In the first case, when the convergence strategy is the high-frequency strategy, generating TTS voice according to the determined convergence strategy is implemented as follows: the voices corresponding to the corpus entries stored in the high-frequency voice library are queried to obtain the voice for the corpus entry that matches the current response text information (that is, the audio file corresponding to the matched entry), and the obtained voice is output directly as the TTS voice (that is, the audio file is played). This achieves an effectively instantaneous response.

In the second case, when the convergence strategy is the local synthesis strategy, generating TTS voice according to the determined convergence strategy is implemented as follows: the response text information is synthesized into TTS voice by a local synthesis engine. The local synthesis engine is prior art; it converts text to TTS voice automatically on the device and processes very quickly.

In the third case, when the convergence strategy is the cloud synthesis strategy, generating TTS voice according to the determined convergence strategy is implemented as follows: the response text information is sent to a cloud synthesis engine, and the voice information returned by the cloud synthesis engine is obtained. The cloud synthesis engine may be implemented with reference to the prior art. To guarantee response speed and limit data traffic, the voice information returned by the cloud synthesis engine is in compressed form, so after it is received it must be decoded according to the agreed protocol into TTS voice in a format the device can play. For example, the cloud synthesis engine synthesizes the voice information in MP3 format, and the returned voice information is decoded into TTS voice in PCM format.
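The three cases of step S103 amount to a dispatch on the chosen strategy. The sketch below is an assumption-laden illustration in Python: `generate_tts` and the `fake_*` stand-ins are hypothetical names, strategies are plain strings, and the engines are injected as callables because the source does not specify any engine API. The stand-in "decoder" only relabels bytes; it does not perform real MP3-to-PCM decoding.

```python
from typing import Callable, Dict


def generate_tts(
    strategy: str,
    reply_text: str,
    library: Dict[str, bytes],
    local_engine: Callable[[str], bytes],
    cloud_engine: Callable[[str], bytes],
    decode: Callable[[bytes], bytes],
) -> bytes:
    """Return playable audio for reply_text under the given strategy."""
    if strategy == "high_frequency":
        # Case 1: reuse the stored audio for the matched corpus entry.
        return library[reply_text]
    if strategy == "local":
        # Case 2: synthesize on the device.
        return local_engine(reply_text)
    if strategy == "cloud":
        # Case 3: the cloud returns compressed audio that must be decoded.
        return decode(cloud_engine(reply_text))
    raise ValueError(f"unknown strategy: {strategy!r}")


# Stand-in engines so the sketch runs without any real TTS backend.
def fake_local(text: str) -> bytes:
    return b"local:" + text.encode("utf-8")


def fake_cloud(text: str) -> bytes:
    return b"compressed:" + text.encode("utf-8")


def fake_decode(payload: bytes) -> bytes:
    # Placeholder for decoding the compressed payload into a playable format.
    return payload.replace(b"compressed:", b"pcm:", 1)
```

Injecting the engines keeps the dispatch independent of any particular synthesis library, which suits the source's statement that both engines may follow the prior art.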

In a preferred embodiment, after TTS voice is generated according to the determined convergence strategy, it is also judged whether the currently generated TTS voice is a high-frequency voice. One way to judge is to count how many times the response text information has been used and decide from the count; for example, when the count reaches 10 or more, the voice is judged to be high-frequency. When it is determined to be a high-frequency voice, the current TTS voice and its response text information are stored in the high-frequency voice library: the response text information serves as the corpus entry, the audio file name of the TTS voice is bound to the entry and stored, and the audio file is saved under the corresponding storage path for use in subsequent voice interactions. The high-frequency voice library thus keeps growing, which makes processing more efficient.
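The usage-count judgment can be sketched as follows. This Python example is illustrative only: the class `HighFrequencyTracker`, its `record` method, and the audio file names are assumptions, and the threshold of 10 is the example value given in the description rather than a fixed requirement. Saving the audio file itself to a storage path is omitted.

```python
from collections import Counter
from typing import Dict

HIGH_FREQUENCY_THRESHOLD = 10  # example value taken from the description


class HighFrequencyTracker:
    """Counts uses of each reply text and promotes frequent ones to the library."""

    def __init__(self, threshold: int = HIGH_FREQUENCY_THRESHOLD) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()
        self.library: Dict[str, str] = {}  # corpus entry -> audio file name

    def record(self, reply_text: str, audio_file: str) -> bool:
        """Record one use; return True if the entry was just added to the library."""
        self.counts[reply_text] += 1
        already_stored = reply_text in self.library
        if self.counts[reply_text] >= self.threshold and not already_stored:
            # Bind the reply text (as corpus entry) to its audio file name.
            self.library[reply_text] = audio_file
            return True
        return False
```

Under these assumptions, the first nine calls to `record` for a given reply return `False`, and the tenth adds the entry to the library and returns `True`.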

The method of this embodiment shortens the voice interaction time between the intelligent voice device and the user, improving voice interaction performance, and can provide a good voice interaction experience even on devices with low hardware configurations.

Fig. 2 schematically shows a functional block diagram of a device for rapidly synthesizing TTS voice according to an embodiment of the invention. As shown in Fig. 2, the device for rapidly synthesizing TTS voice includes: a response information obtaining module 2, a strategy determining module 3, a voice output module 4, a high-frequency voice library 6, a local synthesis engine 5, and a speech processing module 7.

The speech processing module 7 is configured to receive a user voice instruction, recognize and parse it, generate response text information according to the recognition and parsing result, and output the response text information to the response information obtaining module 2. Speech recognition may be implemented with reference to the prior art.

The response information obtaining module 2 is configured to receive response text information; it may obtain the current response text information in response to a trigger instruction from the speech processing module 7.

The strategy determining module 3 is configured to determine a convergence strategy according to the response text information; the convergence strategies include a high-frequency strategy, a local synthesis strategy, and a cloud synthesis strategy. The way the convergence strategy is determined may follow the method section above and is not repeated here.

The voice output module 4 is configured to generate TTS voice according to the determined convergence strategy.

The high-frequency voice library 6 is configured to store high-frequency voices and their corresponding corpus entries.

The local synthesis engine 5 is configured to synthesize TTS voice from input text information.

The voice output module 4 includes a high-frequency synthesis unit 401, a local synthesis unit 402, and a cloud synthesis unit 403. The high-frequency synthesis unit 401 is configured to obtain the corresponding voice from the high-frequency voice library 6 according to the response text information and output it as TTS voice. The local synthesis unit 402 is configured to call the local synthesis engine 5 to synthesize the response text information into TTS voice and output it. The cloud synthesis unit 403 is configured to output the response text information to the cloud synthesis engine, receive the voice information returned by the cloud synthesis engine, and decode it into TTS voice for output. The voice output module 4 can thus perform TTS speech synthesis in all three cases, thereby improving voice response speed.

To use the device, the user only needs to speak a voice instruction to an audio capture device with a pickup function, such as a microphone, on equipment fitted with the device. The speech processing module 7 processes the voice instruction to obtain the response text information and outputs it to the response information obtaining module 2, which passes the text information to the strategy determining module 3 to determine the strategy. Once determined, the strategy is output to the voice output module 4, which applies the speech synthesis mode corresponding to that strategy to produce the final TTS voice output.

The device of this embodiment shortens the voice interaction time between the intelligent voice device and the user, improving voice interaction performance, and can provide a good voice interaction experience even on devices with low hardware configurations.

Fig. 3 schematically shows a functional block diagram of a device for rapidly synthesizing TTS voice according to another embodiment of the invention. As shown in Fig. 3, the device further includes a high-frequency voice adding module 8, which is configured to judge whether the TTS voice output by the local synthesis unit 402 or the cloud synthesis unit 403, together with its response text information, is a high-frequency voice; the way of judging may follow the method section above. When it is determined to be a high-frequency voice, the TTS voice and its response text information are added to the high-frequency voice library 6.

With the high-frequency voice adding module 8, the device of this embodiment can accumulate each response result and extend the high-frequency voice library with it, so that resources are continuously added to the library and voice response during processing becomes considerably faster.

It should be noted that, in other embodiments, the device for rapidly synthesizing TTS voice may omit the speech processing module and instead expose the response information obtaining module directly as an external calling interface, so that response text information is received directly through that module for processing. The embodiments of the invention do not limit how the modules of the device are combined; the above is only one specific implementation. In a specific application, a person skilled in the art may combine the above modules in any way as required to serve the intended purpose.

In some embodiments, an embodiment of the invention provides a non-volatile computer-readable storage medium storing one or more programs that include executable instructions. The executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above methods for rapidly synthesizing TTS voice of the invention.

In some embodiments, an embodiment of the invention also provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium. The computer program includes program instructions that, when executed by a computer, cause the computer to perform any of the above methods for rapidly synthesizing TTS voice.

In some embodiments, an embodiment of the invention also provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method for rapidly synthesizing TTS voice.

In some embodiments, an embodiment of the invention also provides a storage medium on which a computer program is stored, wherein the program implements the method for rapidly synthesizing TTS voice when executed by a processor.

The device for rapidly synthesizing TTS voice of the above embodiments can be used to perform the method for rapidly synthesizing TTS voice of the embodiments of the invention, and accordingly achieves the technical effects achieved by that method, which are not repeated here. In the embodiments of the invention, the relevant functional modules may be implemented by a hardware processor.

Fig. 4 is a schematic diagram of the hardware structure of an electronic device that performs the method for rapidly synthesizing TTS voice according to an embodiment of the invention. As shown in Fig. 4, the device includes:

one or more processors 410 and a memory 420; Fig. 4 takes one processor 410 as an example.

The device that performs the method for rapidly synthesizing TTS voice may further include an input device 430 and an output device 440.

The processor 410, memory 420, input device 430, and output device 440 may be connected by a bus or in other ways; Fig. 4 takes a bus connection as an example.

As a non-volatile computer-readable storage medium, the memory 420 can store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for rapidly synthesizing TTS voice in the embodiments of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the server, that is, implements the method for rapidly synthesizing TTS voice of the above method embodiments.

The memory 420 may include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required by at least one function; the data storage area may store data created through use of the device for rapidly synthesizing TTS voice. In addition, the memory 420 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 420 optionally includes memory located remotely from the processor 410, and such remote memory may be connected to the device for rapidly synthesizing TTS voice through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 can receive input numeric or character information and generate signals related to the user settings and function control of the device for rapidly synthesizing TTS voice. The output device 440 may include a display device such as a display screen.

The one or more modules are stored in the memory 420 and, when executed by the one or more processors 410, perform the method for rapidly synthesizing TTS voice in any of the above method embodiments.

The above product can perform the method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.

The electronic device of the embodiments of the present application exists in many forms, including but not limited to:

(1) Mobile communication devices: these devices have mobile communication functions, and their main purpose is to provide voice and data communication. Such terminals include smartphones (such as the iPhone), multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.

(3) In-vehicle devices: these devices are used in vehicle driving and can connect to, for example, other auxiliary systems of the automobile.

(4) Servers: devices that provide computing services. A server comprises a processor, hard disk, memory, system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capability, stability, reliability, security, scalability, and manageability.

(5) Other electronic devices with data interaction functions.

The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

From the above description of the embodiments, a person skilled in the art can clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, or of course by hardware. On this understanding, the above technical solutions, in essence or in the part that contributes over the related art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present application and do not limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. the method for rapid synthesis TTS voice, which comprises the steps of:

Obtain response text information；

Convergence strategy is determined according to response text information；

TTS voice is generated according to determining convergence strategy.

2. The method according to claim 1, characterized in that the convergence strategies include a high-frequency strategy, a local synthesis strategy, and a cloud synthesis strategy, and the method further comprises:

configuring a high-frequency voice library, the high-frequency voice library including corpus entries and corresponding voices;

wherein determining the convergence strategy according to the response text information comprises:

matching the response text information against the corpus entries, and determining the convergence strategy to be the high-frequency strategy when the match succeeds; and

when the match fails, obtaining the network state and determining the convergence strategy to be the local synthesis strategy or the cloud synthesis strategy according to the network state.

3. The method according to claim 2, wherein, when the convergence strategy is determined to be the high-frequency strategy, generating TTS voice according to the determined convergence strategy comprises:

obtaining the voice corresponding to the corpus entry that matches the current response text information, and outputting the obtained voice as TTS voice;

when the convergence strategy is determined to be the local synthesis strategy, generating TTS voice according to the determined convergence strategy comprises:

synthesizing the response text information into TTS voice by a local synthesis engine; and

when the convergence strategy is determined to be the cloud synthesis strategy, generating TTS voice according to the determined convergence strategy comprises:

outputting the response text information to a cloud synthesis engine, and obtaining the voice information returned by the cloud synthesis engine; and

decoding the returned voice information to generate TTS voice.

4. The method according to claim 2 or 3, characterized in that, after TTS voice is generated according to the local synthesis strategy or the cloud synthesis strategy, the method further comprises:

judging whether the currently generated TTS voice is a high-frequency voice, and, when it is determined to be a high-frequency voice, storing the current TTS voice and its corresponding response text information in the high-frequency voice library.

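5. A device for rapidly synthesizing TTS voice, characterized by comprising:

a response information obtaining module, configured to obtain response text information;

a strategy determining module, configured to determine a convergence strategy according to the response text information; and

a voice output module, configured to generate TTS voice according to the determined convergence strategy.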

6. The device according to claim 5, characterized in that the convergence strategies include a high-frequency strategy, a local synthesis strategy, and a cloud synthesis strategy, and the device further comprises:

a high-frequency voice library, configured to store high-frequency voices and their corresponding corpus entries; and

a local synthesis engine, configured to synthesize TTS voice from input text information;

wherein the voice output module comprises:

a high-frequency synthesis unit, configured to obtain the corresponding voice from the high-frequency voice library according to the response text information and output it as TTS voice;

a local synthesis unit, configured to call the local synthesis engine to synthesize the response text information into TTS voice and output it; and

a cloud synthesis unit, configured to output the response text information to a cloud synthesis engine, receive the voice information returned by the cloud synthesis engine, and decode it into TTS voice for output.

7. The device according to claim 6, characterized by further comprising:

a high-frequency voice adding module, configured to judge the TTS voice output by the local synthesis unit and the cloud synthesis unit together with its corresponding response text information, and, when it is determined to be a high-frequency voice, to add the TTS voice and its corresponding response text information to the high-frequency voice library.

8. The device according to any one of claims 5 to 7, characterized by further comprising:

a speech processing module, configured to receive a user voice instruction, recognize and parse it, generate response text information according to the recognition and parsing result, and output the response text information to the response information obtaining module.

9. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the method according to any one of claims 1 to 4.

10. A storage medium on which a computer program is stored, characterized in that the program implements the steps of the method according to any one of claims 1 to 4 when executed by a processor.