CN110534088A

CN110534088A - Speech synthesis method, electronic device and storage medium

Info

Publication number: CN110534088A
Application number: CN201910915659.2A
Authority: CN
Inventors: 李晋; 叶子云; 周成成
Original assignee: China Merchants Finance Technology Co Ltd
Current assignee: China Merchants Finance Technology Co Ltd
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2019-12-03

Abstract

The present invention relates to the field of speech semantics technology and provides a speech synthesis method, an electronic device and a computer storage medium. The method comprises: obtaining a preset script for a response scenario, the script comprising fixed text and variable text; screening, from a preset voice library and according to the response scenario, a timbre feature corresponding to the response scenario, and recording the fixed text in the script according to that timbre feature to obtain fixed speech; performing speech synthesis on the variable text in the script according to the screened timbre feature to obtain variable speech having the same timbre feature as the fixed speech; and finally splicing the fixed speech and the variable speech to generate synthesized speech having the timbre feature. The invention synthesizes speech with a unified timbre according to the response scenario, making speech in human-computer interaction natural and coherent and thereby improving the user experience.

Description

Speech synthesis method, electronic device and storage medium

Technical field

The present invention relates to the field of speech semantics technology, and in particular to a speech synthesis method, an electronic device and a computer-readable storage medium.

Background technique

With the development of artificial intelligence technology, voice broadcasts in human-computer interaction are required to be coherent and natural, and a voice broadcast works by replaying recordings of the script text.

However, existing script recording does not take the actual business scenario into account: the scripts of different business scenarios are all recorded in a single tone, so the recorded scripts do not fit the business scenario in which they are used, and the recorded speech sounds stiff and unnatural and is broadcast incoherently. In addition, sentence transitions or intonation changes during script recording easily cause timbre differences between earlier and later parts of the recorded speech, which degrades the user experience during human-computer interaction.

Therefore, how to synthesize, according to the response scenario, speech with a unified timbre and natural interaction has become a technical problem that those skilled in the art urgently need to solve.

Summary of the invention

In view of the foregoing, the present invention provides a speech synthesis method, an electronic device and a computer-readable storage medium, whose main purpose is to synthesize speech with a unified timbre according to the response scenario, making speech in human-computer interaction natural and coherent and thereby improving the user experience.

To achieve the above object, the present invention provides a speech synthesis method applied to an electronic device, the method comprising:

Obtaining step: obtaining a preset script for a response scenario, the script comprising fixed text and variable text;

First recording step: screening, from a preset voice library and according to the response scenario, a timbre feature matching the response scenario, and recording the fixed text in the script according to the screened timbre feature to obtain fixed speech containing the timbre feature, the timbre feature comprising one or more of the fundamental frequency, speech rate, pitch and punctuation pause duration of the voice;

Second recording step: performing speech synthesis on the variable text in the script according to the screened timbre feature to obtain variable speech having the same timbre feature as the fixed speech; and

Splicing step: splicing the fixed speech and the variable speech to generate synthesized speech having the timbre feature.

Preferably, the step of establishing the voice library comprises:

obtaining script samples corresponding to the various response scenarios;

receiving the speech segments recorded for the script samples of each response scenario; and

extracting the corresponding timbre features from the speech segments and establishing the voice library.

Preferably, the second recording step comprises:

performing parameter setting on the variable text in the script, the parameter setting comprising adjusting the parameters of the fundamental frequency, speech rate, pitch and/or punctuation pause duration of the timbre feature.

Preferably, the splicing step comprises:

testing the voice quality MOS value of the generated synthesized speech;

when the voice quality MOS value is lower than a preset threshold, judging that the voice quality of the synthesized speech is unqualified, and generating log information indicating unqualified voice quality; and

determining, according to the log information indicating unqualified voice quality, the position of the unqualified speech segment in the synthesized speech, and editing and correcting it.

Preferably, after the splicing step, the method further comprises:

Broadcast step: implanting the generated synthesized speech into an automatic answering system, so that the automatic answering system performs the corresponding voice broadcast according to the received user input.

In addition, to achieve the above object, the present invention also provides an electronic device comprising a memory and a processor, the memory storing a voice interaction program runnable on the processor, the voice interaction program implementing the following steps when executed by the processor:

In addition, to achieve the above object, the present invention also provides a computer-readable storage medium containing a voice interaction program which, when executed by a processor, can implement any of the steps of the speech synthesis method described above.

With the speech synthesis method, electronic device and computer-readable storage medium proposed by the present invention, a preset script for a response scenario is obtained, the script comprising fixed text and variable text; a timbre feature matching the response scenario is screened from a preset voice library according to the response scenario, and the fixed text in the script is recorded with it to obtain fixed speech; speech synthesis is then performed on the variable text in the script according to the screened timbre feature to obtain variable speech having the same timbre feature as the fixed speech; finally, the fixed speech and the variable speech are spliced to generate synthesized speech having the timbre feature. The invention synthesizes speech with a unified timbre according to the response scenario, making speech in human-computer interaction natural and coherent and thereby improving the user experience.

Brief description of the drawings

Fig. 1 is a schematic diagram of a preferred embodiment of the electronic device of the present invention;

Fig. 2 is a program module diagram of a preferred embodiment of the voice interaction program in Fig. 1;

Fig. 3 is a program module diagram of another preferred embodiment of the voice interaction program in Fig. 1;

Fig. 4 is a flowchart of a preferred embodiment of the speech synthesis method of the present invention;

Fig. 5 is a flowchart of another preferred embodiment of the speech synthesis method of the present invention;

The realization of the object, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.

It should be noted that descriptions involving "first", "second" and the like in the present invention are for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that the combination can be realized by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination shall be deemed not to exist and falls outside the protection scope claimed by the present invention.

Referring to Fig. 1, a schematic diagram of a preferred embodiment of the electronic device of the present invention is shown. The electronic device 1 is a device that can automatically perform numerical calculation and/or information processing according to instructions set or stored in advance. The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud-computing-based cloud composed of a large number of hosts or network servers, where cloud computing is a type of distributed computing in which a group of loosely coupled computers forms one super virtual computer.

In this embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 that are communicatively connected to one another through a system bus, the memory 11 storing a voice interaction program 10 runnable on the processor 12. It should be pointed out that Fig. 1 only shows the electronic device 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

The memory 11 includes an internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (for example, an SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk or an optical disc. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as the hard disk of the electronic device 1; in other embodiments, it may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card (Flash Card) equipped on the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed on the electronic device 1, for example the voice interaction program 10 of an embodiment of the present invention. In addition, the memory 11 may also be used to temporarily store various data that has been output or is to be output.

In some embodiments, the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 12 is generally used to control the overall operation of the electronic device 1, for example to perform control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the voice interaction program 10.

The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.

The voice interaction program 10 is stored in the memory 11 and includes computer-readable instructions stored in the memory 11; these instructions can be executed by the processor 12 to implement the methods of the embodiments of the present application.

In one embodiment, the following steps are implemented when the voice interaction program 10 is executed by the processor 12:

Obtaining step: obtaining a preset script for a response scenario, the script comprising fixed text and variable text.

In human-computer interaction, the interaction mechanism of the various response scenarios is drafted in advance according to business needs, including setting the preset script corresponding to each response scenario.

In this embodiment, the script includes fixed text and variable text. The fixed text consists of fixed words and sentences such as greetings and closing remarks, and may be common interrogative or declarative sentences; the variable text consists of changing words and phrases such as names, telephone numbers, addresses or amounts.
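The split between fixed and variable text can be pictured as a small data structure. The following is a minimal sketch, not the patent's implementation: the `Segment` class, the `render` function and the slot name `name` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    """One piece of a script: fixed text, or a named slot for variable text."""
    text: str       # literal text (fixed) or slot name (variable)
    variable: bool  # True when the text is filled in per interaction


def render(script: List[Segment], values: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (kind, text) pairs with every variable slot replaced by its value."""
    parts = []
    for seg in script:
        if seg.variable:
            parts.append(("variable", values[seg.text]))
        else:
            parts.append(("fixed", seg.text))
    return parts


# Hypothetical script: a fixed question followed by a variable name.
script = [
    Segment("May I ask, are you", variable=False),
    Segment("name", variable=True),
]
parts = render(script, {"name": "Mr. Wei Sili"})
```

Keeping the kind of each part explicit lets a later stage route fixed parts to recordings and variable parts to speech synthesis.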

For example, in the response scenario of an insured loan, one section of the script used in the interaction is:

"Hello, this is People's Bank of China. What business would you like to handle?";

"I would like a loan";

"How much would you like to borrow?";

"One million".

First recording step: screening, from a preset voice library and according to the response scenario, a timbre feature matching the response scenario, and recording the fixed text in the script according to the screened timbre feature to obtain fixed speech containing the timbre feature, the timbre feature comprising one or more of the fundamental frequency, speech rate, pitch and punctuation pause duration of the voice.

The fundamental frequency (commonly written F0) is the frequency of the fundamental tone and determines the pitch of the whole sound; the fundamental tone is the pure tone with the lowest frequency and the greatest intensity in each musical tone.

The speech rate is the amount of vocabulary contained per unit of time. For example, ordinary spoken Chinese runs at about 240 syllables per minute, and television news broadcasts run at about 300 characters per minute.
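As a worked illustration of the speech rate definition, the rate follows from a syllable count and a duration. This is a minimal sketch; the function name `speech_rate` and the sample numbers are assumptions chosen to reproduce the 240 syllables per minute figure mentioned above.

```python
def speech_rate(syllables: int, seconds: float) -> float:
    """Speech rate in syllables per minute."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return syllables * 60.0 / seconds


# 120 syllables spoken in 30 seconds correspond to 240 syllables per minute.
rate = speech_rate(120, 30.0)
```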

The pitch (English: Pitch) is how high or low the frequency of a sound is, and indicates the highness or lowness of a tone as perceived by human hearing.

The punctuation pause duration is the length of the pause at a comma or full stop in the script.

For example, in one embodiment, a timbre feature corresponding to the response scenario is screened from the preset voice library, the timbre feature including a speech rate (for example, 280 syllables per minute) and a fundamental frequency (a standard frequency with an approachable quality). A professional customer-service agent whose voice has this timbre feature then records the fixed text in the script to obtain fixed speech containing the timbre feature. The recording is made with professional recording equipment to ensure the sound quality of the resulting speech.
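The screening step can be sketched as a lookup keyed by response scenario. This is only an illustration under stated assumptions: the patent does not specify how the voice library is stored, so the `TimbreFeature` fields, the `VOICE_LIBRARY` dictionary and every numeric value except the 280 syllables per minute example are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TimbreFeature:
    fundamental_hz: float     # fundamental frequency (hypothetical value below)
    syllables_per_min: float  # speech rate
    pitch_offset: float       # relative pitch adjustment (assumed representation)
    pause_ms: int             # pause at commas and full stops


# Hypothetical preset voice library, one timbre feature per response scenario.
VOICE_LIBRARY: Dict[str, TimbreFeature] = {
    "loan": TimbreFeature(210.0, 280.0, 0.0, 300),
    "insurance": TimbreFeature(200.0, 260.0, 0.0, 350),
}


def select_timbre(library: Dict[str, TimbreFeature], scenario: str) -> TimbreFeature:
    """Return the timbre feature that matches the given response scenario."""
    try:
        return library[scenario]
    except KeyError:
        raise LookupError(f"no timbre feature for scenario {scenario!r}") from None


loan_timbre = select_timbre(VOICE_LIBRARY, "loan")
```

The same `TimbreFeature` object would then guide both the human recording of fixed text and the synthesis of variable text, which is what keeps the two consistent.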

It should be noted that, by having a professional customer-service agent make an emotionally expressive recording according to the timbre feature and the script of the response scenario, the resulting speech matches the emotion and tone of a real person speaking, which ensures that the speech is realistic and natural.

Further, the step of establishing the voice library comprises:

obtaining script samples corresponding to the various response scenarios;

Specifically, the obtained script samples of each response scenario are recorded: professional customer-service agents with the timbre features record the scripts of the different response scenarios to obtain the speech segments. A professional customer-service agent is one who has received dedicated training in grammar, pronunciation and language and has passed the corresponding assessment. The abilities of such an agent include, for example: understanding users' common needs in a given field, being very familiar with the business, knowing the professional interaction scripts, having a sweet and approachable voice, and being able to communicate and record with emotion according to the business scenario. After recording, the corresponding timbre features are extracted from the speech segments, and the extracted timbre features are assembled into the voice library. Preferably, the same professional customer-service agent, or professional agents with the same timbre feature, make the recordings, to ensure that the timbre features extracted from the recorded speech are uniform.
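One way to picture the library construction is to group per-recording measurements by scenario and keep one representative feature set for each. The sketch below averages the measurements; averaging is an assumption made here for illustration, since the text only says that features are extracted and assembled into a library. The feature names and numbers are hypothetical, and the acoustic measurement itself is outside the sketch.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Features = Dict[str, float]


def build_voice_library(recordings: Iterable[Tuple[str, Features]]) -> Dict[str, Features]:
    """Group measured features by scenario and average each feature."""
    grouped: Dict[str, List[Features]] = defaultdict(list)
    for scenario, features in recordings:
        grouped[scenario].append(features)

    library: Dict[str, Features] = {}
    for scenario, items in grouped.items():
        # Assumes every recording of a scenario reports the same feature keys.
        library[scenario] = {key: mean(f[key] for f in items) for key in items[0]}
    return library


# Hypothetical measurements from two recordings of the loan scenario.
library = build_voice_library([
    ("loan", {"fundamental_hz": 200.0, "syllables_per_min": 270.0}),
    ("loan", {"fundamental_hz": 220.0, "syllables_per_min": 290.0}),
])
```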

Second recording step: performing speech synthesis on the variable text in the script according to the screened timbre feature to obtain variable speech having the same timbre feature as the fixed speech.

In this embodiment, the speech synthesis uses TTS (Text To Speech) technology, which uses neural network technology to convert text into speech intelligently. The conversion yields speech with natural, fluent prosody and without a cold or stiff feel.

There are many existing speech synthesis tools, such as iFLYTEK, SinoVoice (Jietong Huasheng) and Unisound (Yunzhisheng), but the tool an enterprise chooses for its own business has certain shortcomings in use. For example, the functions the existing tools provide do not meet the enterprise's needs, or few of the functions the enterprise can use are relevant to its purposes. The enterprise therefore needs to use speech synthesis technology to build a voice capability that meets its own needs, for example speech that is emotionally expressive, natural and suited to the business.

In an optional embodiment, the variable text in the script is input into a TTS engine for speech synthesis according to the screened timbre feature, yielding variable speech. The variable speech includes speech synthesized by the TTS engine directly from a single variable word (for example, "Wei Sili" or "one million"), and speech synthesized from the variable text with part of the fixed text embedded in it (for example, the synthesized variable speech "May I ask, are you Mr. Wei Sili?", where "May I ask, are you" is the embedded part of the fixed text). Because the variable speech is synthesized with the same timbre feature and used in the script of the same response scenario, the timbre features of the synthesized variable speech and the fixed speech are uniform and natural.
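How the screened timbre feature reaches a TTS engine is not specified in the text. One common way to express rate, pitch and pause settings is SSML markup, so the sketch below builds an SSML fragment from those values. Using SSML at all is an assumption, as is the function name `to_ssml`; a real engine may require additional attributes on `<speak>` or accept a different parameter format.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text: str, rate_percent: int, pitch_hz: int, pause_ms: int) -> str:
    """Wrap variable text in SSML prosody markup followed by a pause."""
    rate = quoteattr(f"{rate_percent}%")
    pitch = quoteattr(f"{pitch_hz}Hz")
    pause = quoteattr(f"{pause_ms}ms")
    return (
        "<speak>"
        f"<prosody rate={rate} pitch={pitch}>{escape(text)}</prosody>"
        f"<break time={pause}/>"
        "</speak>"
    )


# Hypothetical settings derived from a screened timbre feature.
ssml = to_ssml("Wei Sili", rate_percent=110, pitch_hz=210, pause_ms=300)
```

Deriving the markup from the same feature values that guided the human recording is what would keep the synthesized variable speech close to the fixed speech.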

Further, the second recording step comprises:

For example, in one embodiment, TTS parameter setting is performed on the phrase "Mr. Wei Sili" of the variable text. Speech synthesized for this five-character phrase with durations allocated at the original average speech rate sounds somewhat stiff, so the system automatically adjusts the TTS parameters so that "Wei Sili" and "Mr." have the same duration, which makes the resulting variable speech more natural. As another example, the TTS parameters are raised for the variable text "Mr. Wei Sili" in the script "May I ask, are you Mr. Wei Sili?": the speech rate of "Mr. Wei Sili" in that script can be slightly increased so that the synthesized variable speech is more natural and coherent.

In one embodiment, the obtained fixed speech is spliced with the variable speech; the splicing follows the fixed text and variable text in the script and the response scenario in which the script is used.

For example, the script of a response scenario is "Hello, this is China Insurance. May I ask, are you Mr. Wei Sili? How can I help you?", in which "Wei Sili" and "Mr." are the variable text. First, the fixed text "Hello, this is China Insurance", "May I ask, are you" and "How can I help you" is recorded to obtain fixed sentences; then TTS speech synthesis is performed on "Wei Sili" and "Mr." to obtain variable sentences; finally, the fixed sentences and variable sentences are spliced according to the response scenario to obtain synthesized speech with a unified timbre feature.
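At the audio level, splicing can be sketched as concatenating the recorded and synthesized segments in script order, with an optional silence between them. The sketch assumes every segment is raw mono PCM with the same sample rate and sample width; the function name `splice` and the tiny sample data are hypothetical, and the text does not describe the actual splicing mechanics.

```python
from typing import List


def splice(segments: List[bytes], sample_rate: int, pause_ms: int = 0,
           sample_width: int = 2) -> bytes:
    """Concatenate mono PCM segments, inserting silence between neighbours."""
    silent_samples = (sample_rate * pause_ms) // 1000
    silence = b"\x00" * (silent_samples * sample_width)
    return silence.join(segments)


# Two hypothetical 4-sample segments (16-bit PCM at 8000 Hz) with a 1 ms gap.
fixed_part = b"\x01\x00" * 4
variable_part = b"\x02\x00" * 4
audio = splice([fixed_part, variable_part], sample_rate=8000, pause_ms=1)
```

Because both kinds of segment are meant to carry the same timbre feature, plain concatenation is expected to sound continuous; mismatched formats would need conversion first.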

Further, the splicing step comprises:

testing the voice quality MOS value of the generated synthesized speech;

MOS is the mean opinion score (Mean Opinion Score). In existing voice quality grading standards, the MOS value is used to evaluate the quality of received speech; the speech quality scoring standard is prior art and is not described in detail here.

In one embodiment, the mean opinion score is used to assess the voice quality MOS value of the synthesized speech. When the MOS value is greater than or equal to a first preset threshold (for example, 4 points), the voice quality of the synthesized speech is judged qualified; when the MOS value is lower than the first preset threshold (for example, 4 points), the voice quality is judged unqualified and log information indicating unqualified voice quality is generated. It should be noted that the MOS scoring standard has 5 grades: a MOS score of 5 means the speech quality is natural, fluent, coherent and flawless, and a MOS score of 1 means the voice quality is very poor. The position of the unqualified speech segment in the synthesized speech is determined from the log information so that it can be edited and corrected; the editing and correction include applying gain to the sound, adding silence to lengthen pauses, cutting out abnormal speech segments, or filtering noise. The corrected synthesized speech is then obtained and saved.
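The threshold check and the log can be sketched as follows. The text compares a MOS value with a threshold such as 4 and uses a log to locate the unqualified segment; scoring each segment separately, the log entry fields and the function name `check_quality` are assumptions made for this illustration. How the MOS values themselves are obtained is outside the sketch.

```python
from typing import Dict, List, Tuple

ScoredSegment = Tuple[float, float, float]  # (start_s, end_s, mos)


def check_quality(scores: List[ScoredSegment],
                  threshold: float = 4.0) -> Tuple[bool, List[Dict[str, object]]]:
    """Return (qualified, log) where the log lists segments scoring below threshold."""
    log = [
        {"start_s": start, "end_s": end, "mos": mos, "status": "unqualified"}
        for start, end, mos in scores
        if mos < threshold
    ]
    return (not log), log


# Hypothetical scores: the second segment falls below the threshold of 4.
qualified, log = check_quality([(0.0, 1.5, 4.5), (1.5, 2.0, 3.2)])
```

Each log entry carries the time span of the weak segment, which is the information an editor would need to apply gain, add silence, cut or denoise at the right position.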

Referring to Fig. 2, a program module diagram of a preferred embodiment of the voice interaction program 10 in Fig. 1 is shown.

In one embodiment, the voice interaction program 10 includes an obtaining module 101, a first recording module 102, a second recording module 103 and a splicing module 104. The functions or operation steps implemented by modules 101-104 are similar to those of the speech synthesis method described below and are not detailed here. Illustratively:

the obtaining module 101 is used to obtain a preset script for a response scenario, the script comprising fixed text and variable text;

the first recording module 102 is used to screen, from a preset voice library and according to the response scenario, a timbre feature matching the response scenario, and to record the fixed text in the script according to the screened timbre feature to obtain fixed speech containing the timbre feature, the timbre feature comprising one or more of the fundamental frequency, speech rate, pitch and punctuation pause duration of the voice;

the second recording module 103 is used to perform speech synthesis on the variable text in the script according to the screened timbre feature to obtain variable speech having the same timbre feature as the fixed speech; and

the splicing module 104 is used to splice the fixed speech and the variable speech to generate synthesized speech having the timbre feature.

Referring to Fig. 3, a program module diagram of another preferred embodiment of the voice interaction program 10 in Fig. 1 is shown. After the splicing module 104, the voice interaction program 10 further includes a broadcast module 105. Illustratively:

the broadcast module 105 is used to implant the generated synthesized speech into an automatic answering system, so that the automatic answering system performs the corresponding voice broadcast according to the received user input.

Referring to Fig. 4, a flowchart of a preferred embodiment of the speech synthesis method of the present invention is shown. Disclosed herein is a speech synthesis method applied to the above electronic device, the method comprising:

Step S210: obtaining a preset script for a response scenario, the script comprising fixed text and variable text.

"I would like a loan";

"How much would you like to borrow?";

"One million".

Step S220: screening, from a preset voice library and according to the response scenario, a timbre feature matching the response scenario, and recording the fixed text in the script according to the screened timbre feature to obtain fixed speech containing the timbre feature, the timbre feature comprising one or more of the fundamental frequency, speech rate, pitch and punctuation pause duration of the voice.

Further, the step of establishing the voice library comprises:

obtaining script samples corresponding to the various response scenarios;

Step S230: performing speech synthesis on the variable text in the script according to the screened timbre feature to obtain variable speech having the same timbre feature as the fixed speech.

Further, step S230 comprises:

Step S240: splicing the fixed speech and the variable speech to generate synthesized speech having the timbre feature.

Further, step S240 comprises:

testing the voice quality MOS value of the generated synthesized speech;

Referring to Fig. 5, a flowchart of another preferred embodiment of the speech synthesis method of the present invention is shown. After step S240, the method further comprises:

Step S250: implanting the generated synthesized speech into an automatic answering system, so that the automatic answering system performs the corresponding voice broadcast according to the received user input.

In this embodiment, the automatic answering system (IVR) is a system that can automatically handle incoming user calls. It interacts with the user according to the script of the preset response scenario and provides the user with the corresponding business information. The automatic answering system (IVR) is suitable for financial credit hotlines, customer service, return visits, surveys and telephone appointments.
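The broadcast behaviour can be sketched as choosing one of the pre-generated synthesized prompts from what the user says. Keyword matching, the function name `choose_prompt` and the file names are all hypothetical; the text does not describe how the automatic answering system maps user input to a broadcast.

```python
from typing import Dict


def choose_prompt(user_text: str, keyword_prompts: Dict[str, str], fallback: str) -> str:
    """Return the audio file for the first keyword found in the user's text."""
    lowered = user_text.lower()
    for keyword, audio_file in keyword_prompts.items():
        if keyword in lowered:
            return audio_file
    return fallback


# Hypothetical mapping from keywords to synthesized prompts stored as files.
PROMPTS = {
    "loan": "loan_amount_question.wav",
    "repay": "repayment_info.wav",
}
chosen = choose_prompt("I would like a loan", PROMPTS, fallback="ask_again.wav")
```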

In an optional implementation, the script is for the response scenario of a loan call. The generated speech is implanted into the automatic answering system, and the automatic answering system (IVR) broadcasts different interactive scripts according to what the user says. The following is an example of the voice responses in a human-computer interaction:

In addition, the present invention also provides a computer-readable storage medium containing a voice interaction program which, when executed by a processor, can implement the following operations:

The specific embodiments of the computer-readable storage medium of the present invention are substantially the same as the embodiments of the speech synthesis method and the electronic device described above, and are not repeated here.

The serial number of the above embodiments of the invention is only for description, does not represent the advantages or disadvantages of the embodiments.

It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article or method that includes the element.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims

1. A speech synthesis method applied to an electronic device, characterized in that the method comprises:

First recording step: screening, from a preset voice library and according to the response scenario, a timbre feature matching the response scenario, and recording the fixed text in the script according to the screened timbre feature to obtain fixed speech containing the timbre feature, the timbre feature comprising one or more of the fundamental frequency, speech rate, pitch and punctuation pause duration of the voice;

Second recording step: performing speech synthesis on the variable text in the script according to the screened timbre feature to obtain variable speech having the same timbre feature as the fixed speech; and

Splicing step: splicing the fixed speech and the variable speech to generate synthesized speech having the timbre feature.

2. The speech synthesis method according to claim 1, characterized in that the step of establishing the voice library comprises:

obtaining script samples corresponding to the various response scenarios;

3. The speech synthesis method according to claim 1, characterized in that the second recording step comprises:

performing parameter setting on the variable text in the script, the parameter setting comprising adjusting the parameters of the fundamental frequency, speech rate, pitch and/or punctuation pause duration of the timbre feature.

4. The speech synthesis method according to claim 1, characterized in that the splicing step comprises:

testing the voice quality MOS value of the generated synthesized speech;

when the voice quality MOS value is lower than a preset threshold, judging that the voice quality of the synthesized speech is unqualified, and generating log information indicating unqualified voice quality; and

determining, according to the log information indicating unqualified voice quality, the position of the unqualified speech segment in the synthesized speech, and editing and correcting it.

5. The speech synthesis method according to any one of claims 1-4, characterized in that, after the splicing step, the method further comprises:

Broadcast step: implanting the generated synthesized speech into an automatic answering system, so that the automatic answering system performs the corresponding voice broadcast according to the received user input.

6. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a voice interaction program runnable on the processor, the voice interaction program implementing the following steps when executed by the processor:

7. The electronic device according to claim 6, characterized in that the step of establishing the voice library comprises:

obtaining script samples corresponding to the various response scenarios;

8. The electronic device according to claim 6, characterized in that the second recording step comprises:

9. The electronic device according to claim 6, characterized in that the splicing step comprises:

testing the voice quality MOS value of the generated synthesized speech;

10. A computer-readable storage medium, characterized in that the computer-readable storage medium contains a voice interaction program which, when executed by a processor, can implement the steps of the speech synthesis method according to any one of claims 1 to 5.