CN103474075B - Voice signal sending method and system, receiving method and system - Google Patents
Voice signal sending method and system, receiving method and system
- Publication number
- CN103474075B CN103474075B CN201310362024.7A CN201310362024A CN103474075B CN 103474075 B CN103474075 B CN 103474075B CN 201310362024 A CN201310362024 A CN 201310362024A CN 103474075 B CN103474075 B CN 103474075B
- Authority
- CN
- China
- Prior art keywords
- unit
- model
- synthesis unit
- synthesis
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a voice signal sending method and system. The sending method comprises: determining the text content corresponding to a continuous voice signal to be sent; determining the speech synthesis parameter model of each synthesis unit according to the text content; splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence; determining the serial number string corresponding to the speech synthesis parameter model sequence; and sending the serial number string to a receiving end, so that the receiving end recovers the continuous voice signal according to the serial number string. The invention also discloses a voice signal receiving method and system. With the invention, signal transmission at an extremely low bit rate can be achieved while keeping the loss in recovered sound quality to a minimum.
Description
Technical field
The present invention relates to the field of signal transmission technology, and in particular to a voice signal sending method and system, and a voice signal receiving method and system.
Background technology
With the spread of the Internet and the popularity of portable devices, chat software based on handheld devices has emerged in large numbers. The naturalness of voice interaction is unmatched by other interaction means, particularly on small-screen handheld devices that are ill-suited to handwriting or keypad input. Many such products therefore support voice interaction, i.e. transmitting a voice signal received at one terminal to a destination terminal; an example is the voice message transmission function of Tencent's WeChat product. However, the data volume of a directly transmitted voice signal is often very large, and over channels billed by traffic, such as the Internet or mobile communication networks, it imposes a considerable financial burden on the user. Compressing the transmitted data volume as far as possible without affecting voice quality is therefore a precondition for improving the practical value of voice signal transmission.
To address the problem of voice signal transmission, researchers have tried a variety of speech coding methods that digitally quantize and compress the voice signal, reducing the coding bit rate and improving transmission efficiency while maintaining the quality of the recovered speech. The most common speech signal compression methods include waveform coding and parameter coding. Among them:

Waveform coding samples, quantizes, and encodes the time-domain analogue signal waveform to form a digital signal. This scheme is highly adaptable and yields high speech quality. However, because it must preserve the waveform shape of the original voice signal for recovery, it demands a high bit rate, above 16 kb/s, to obtain good sound quality.
Parameter coding extracts parameters characterizing the pronunciation features from the original voice signal and encodes those feature parameters. The aim of this scheme is to keep the semantics of the original speech and guarantee intelligibility. Its advantage is a relatively low bit rate, but the recovered sound quality suffers more.
In the traditional voice communication era, billing was usually by time, and coding methods were chiefly concerned with algorithmic delay and call quality. In the mobile Internet era, voice, as one kind of data signal, is generally billed by traffic, so the bit rate of coded speech directly affects the user's cost. In addition, traditional telephone-channel speech uses only an 8 kHz sampling rate; it is narrowband speech whose sound quality is impaired and has a ceiling. Clearly, continuing to process wideband or ultra-wideband speech with traditional coding schemes would require a higher bit rate, multiplying traffic consumption.
Summary of the invention
On one hand, embodiments of the present invention provide a voice signal sending method and system that achieve signal transmission at an extremely low bit rate while keeping the loss in recovered sound quality to a minimum.

On the other hand, embodiments of the present invention provide a voice signal receiving method and system that reduce the loss in recovered sound quality.

To this end, the present invention provides the following technical scheme:
A voice signal sending method, comprising:

determining the text content corresponding to a continuous voice signal to be sent;

determining the speech synthesis parameter model of each synthesis unit according to the text content;

splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;

determining the serial number string corresponding to the speech synthesis parameter model sequence;

sending the serial number string to a receiving end, so that the receiving end recovers the continuous voice signal according to the serial number string.
A voice signal sending system, comprising:

a text acquisition module, for determining the text content corresponding to a continuous voice signal to be sent;

a parameter model determination module, for determining the speech synthesis parameter model of each synthesis unit according to the text content;

a splicing module, for splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;

a serial number string determination module, for determining the serial number string corresponding to the speech synthesis parameter model sequence;

a sending module, for sending the serial number string to a receiving end, so that the receiving end recovers the continuous voice signal according to the serial number string.
The voice signal sending method and system provided by the embodiments of the present invention use statistical parametric model coding, whose processing is independent of the speech sampling rate. They greatly reduce the transmission bit rate and traffic consumption while keeping the loss in recovered sound quality to a minimum, solving the problem that traditional speech coding methods cannot balance sound quality against traffic, and improving the user communication experience in the mobile network era.

Correspondingly, with the voice signal receiving method and system provided by the embodiments of the present invention, the receiver obtains the speech synthesis parameter model sequence from a codebook according to the received serial number string corresponding to that sequence, and uses the sequence to obtain the voice signal by speech synthesis, greatly reducing the loss in recovered sound quality and achieving enormous compression of the voice signal with minimal signal loss.
Brief description of the drawings
To explain the embodiments of the present application or the technical schemes of the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below represent only some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them.
Fig. 1 is a flowchart of the voice signal sending method of an embodiment of the present invention;

Fig. 2 is a flowchart of one way of determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention;

Fig. 3 is a flowchart of building a binary decision tree in an embodiment of the present invention;

Fig. 4 is a schematic diagram of a binary decision tree in an embodiment of the present invention;

Fig. 5 is a flowchart of another way of determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention;

Fig. 6 is a flowchart of the voice signal receiving method of an embodiment of the present invention;

Fig. 7 is a structural block diagram of the voice signal sending system in an embodiment of the present invention;

Fig. 8 is a structural block diagram of the parameter model determination module in an embodiment of the present invention;

Fig. 9 is a structural block diagram of the binary decision tree building module in an embodiment of the present invention;

Fig. 10 is a structural block diagram of one form of the pitch model determination unit in the voice signal sending system of an embodiment of the present invention;

Fig. 11 is a structural block diagram of one form of the spectrum model determination unit in the voice signal sending system of an embodiment of the present invention;

Fig. 12 is a structural block diagram of another form of the pitch model determination unit in the voice signal sending system of an embodiment of the present invention;

Fig. 13 is a structural block diagram of another form of the spectrum model determination unit in the voice signal sending system of an embodiment of the present invention;

Fig. 14 is a structural block diagram of the voice signal receiving system of an embodiment of the present invention.
Detailed description of the invention
To help those skilled in the art better understand the schemes of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.

Addressing the problem that traditional coding schemes need a higher bit rate and heavy traffic consumption to process wideband or ultra-wideband speech, the embodiments of the present invention provide a voice signal sending method and system, and a voice signal receiving method and system, applicable to the coding of various types of speech (such as ultra-wideband speech at a 16 kHz sampling rate and narrowband speech at an 8 kHz sampling rate), achieving signal transmission at an extremely low bit rate while keeping the loss in recovered sound quality to a minimum.
As shown in Fig. 1, the flowchart of the voice signal sending method of an embodiment of the present invention comprises the following steps:

Step 101: determine the text content corresponding to the continuous voice signal to be sent.

Specifically, the text content can be obtained automatically by a speech recognition algorithm, or of course by manual annotation. In addition, to further guarantee the correctness of text content obtained by speech recognition, it can be corrected by manual editing.

Step 102: determine the speech synthesis parameter model of each synthesis unit according to the text content.

A synthesis unit is the smallest synthesis object set in advance, for example a syllable unit, a phoneme unit, or even a state unit in a phoneme HMM.

To minimize the loss in recovered sound quality at the receiving end and enable it to recover the continuous voice signal by speech synthesis, the speech synthesis parameter models the sending end obtains from the original voice signal should match the features of the original signal as closely as possible, to reduce the loss from signal compression and recovery.

Specifically, according to the text content, the continuous voice signal can be segmented into the speech segment corresponding to each synthesis unit, from which the duration, pitch model, and spectrum model corresponding to each synthesis unit are obtained; the detailed process is described later.

Step 103: splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence.

Step 104: determine the serial number string corresponding to the speech synthesis parameter model sequence.

Step 105: send the serial number string to the receiving end, so that the receiving end recovers the continuous voice signal according to the serial number string.
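Steps 101 to 105 can be sketched end to end as a small pipeline. This is a hedged illustration only: the recognizer, the per-unit model selection, and the codebook layout are placeholder assumptions, not the patent's actual implementation.

```python
def send_speech(signal, recognize, pick_model, codebook_index):
    """Sending-side sketch: signal -> serial number string."""
    text = recognize(signal)                       # step 101: text content
    models = [pick_model(unit) for unit in text]   # step 102: one model per synthesis unit
    model_sequence = models                        # step 103: spliced model sequence
    serials = [codebook_index[m] for m in model_sequence]  # step 104: serial numbers
    return serials                                 # step 105: what gets transmitted

# Toy usage: units are characters, "models" are just labels.
codebook_index = {"model-a": 0, "model-b": 1}
serials = send_speech("ab", lambda s: s, lambda u: f"model-{u}", codebook_index)
```

Only the short serial number string crosses the network; everything else stays on the sending end.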
The voice signal sending method of the embodiment of the present invention uses statistical parametric model coding, whose processing is independent of the speech sampling rate: coding 16 kHz ultra-wideband speech incurs no extra bit rate cost, its sound quality is good, and its coding traffic is low. Take a typical Chinese speech fragment as an example: its effective speech section lasts 10 s and contains 80 initials/finals (phonemes), and each phoneme has 5 pitch states, 5 spectrum states, and 1 duration state, with each state coded in 1 byte (8 bits). Its bit rate m is: m = [80 × (5 + 5 + 1)] × 8 bits / 10 s = 704 b/s, less than 1 kb/s. This is an extremely-low-bit-rate coding method whose bit rate is far below every coding standard in today's mainstream speech communication field, so network communication traffic will drop markedly. Compared with mainstream communication-field speech coding methods, the speech coding mode of the inventive method can process ultra-wideband speech (16 kHz sampling rate) with higher sound quality, and has a lower bit rate (below 1 kb/s), effectively reducing network traffic.
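The bit-rate arithmetic above can be reproduced directly; the figures (80 phonemes, 5 + 5 + 1 states per phoneme, 1 byte per state, 10 s of speech) are those given in the text.

```python
# Bit rate of the example Chinese speech fragment.
phonemes = 80
states_per_phoneme = 5 + 5 + 1   # pitch states + spectrum states + duration state
bits_per_state = 8               # 1 byte per state
duration_s = 10                  # effective speech duration

bitrate = phonemes * states_per_phoneme * bits_per_state / duration_s
print(bitrate)  # 704.0 b/s, well under 1 kb/s
```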
As shown in Fig. 2, one flow for determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention comprises the following steps:

Step 201: segment the continuous voice signal according to the text content to obtain the speech segment corresponding to each synthesis unit.

Specifically, the continuous voice signal can be force-aligned with the acoustic model sequence corresponding to the synthesis units in the text content, i.e. the voice signal is decoded by speech recognition against that acoustic model sequence, thereby obtaining the speech segment corresponding to each synthesis unit.

It should be noted that synthesis units of different sizes can be chosen according to different application requirements. In general, if the bit rate requirement is stricter, a larger speech unit is chosen, such as a syllable unit or a phoneme unit; conversely, if the sound quality requirement is stricter, a smaller speech unit can be chosen, such as a model state unit or a feature stream unit.

When HMM-based (Hidden Markov Model) acoustic models are used, each state of the HMM can further be chosen as a synthesis unit, and the corresponding state-level speech segments obtained. Then, for each state, the pitch model and spectrum model corresponding to that state are determined from its pitch binary decision tree and spectrum binary decision tree respectively. This lets the obtained speech synthesis parameter models describe the features of the voice signal more finely.
Step 202: obtain the synthesis unit currently under examination.

Step 203: count the duration of the speech segment corresponding to the synthesis unit currently under examination.

Step 204: determine the pitch model of the synthesis unit currently under examination.

Specifically, first obtain the pitch binary decision tree corresponding to the synthesis unit currently under examination; parse the text of the synthesis unit to obtain its context information, such as the phoneme unit, tone, part of speech, and position in the phrase; then make path decisions in the pitch binary decision tree according to the context information to reach the corresponding leaf node, and take the pitch model of that leaf node as the pitch model of the synthesis unit.

Specifically, the path decision process is as follows: according to the context information of the synthesis unit, answer each node's split question in turn starting from the root node of the pitch binary decision tree; obtain a top-down matching path according to the answers; and obtain the leaf node according to that matching path.
Step 205: determine the spectrum model of the synthesis unit currently under examination.

Specifically, first obtain the spectrum binary decision tree corresponding to the synthesis unit currently under examination; parse the text of the synthesis unit to obtain its context information, such as the phoneme unit, tone, part of speech, and position in the phrase; then make path decisions in the spectrum binary decision tree according to the context information to reach the corresponding leaf node, and take the spectrum model of that leaf node as the spectrum model of the synthesis unit.

Specifically, the path decision process is as follows: according to the context information of the synthesis unit, answer each node's split question in turn starting from the root node of the spectrum binary decision tree; obtain a top-down matching path according to the answers; and obtain the leaf node according to that matching path.
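The top-down path decision described for both trees can be sketched as follows. The node layout and the example question are illustrative assumptions; only the traversal logic, answering each internal node's question from the unit's context until a leaf and its attached model are reached, comes from the text.

```python
class Node:
    """Internal nodes carry a yes/no question; leaves carry a model."""
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question, self.yes, self.no, self.model = question, yes, no, model

def find_model(root, context):
    node = root
    while node.model is None:                     # internal node: ask its question
        node = node.yes if node.question(context) else node.no
    return node.model                             # leaf: its model is the answer

# Toy tree with one split question, "is the right adjacent phoneme a nasal?"
tree = Node(question=lambda c: c["right_is_nasal"],
            yes=Node(model="pitch-model-A"),
            no=Node(model="pitch-model-B"))
chosen = find_model(tree, {"right_is_nasal": True})
```

The same traversal serves both the pitch tree and the spectrum tree; only the questions and leaf models differ.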
Step 206: judge whether the synthesis unit currently under examination is the last synthesis unit. If so, perform step 207; otherwise, perform step 202.

Step 207: output the speech segment duration, pitch model, and spectrum model corresponding to each synthesis unit.

The quality of the speech synthesis parameter models corresponding to the synthesis units is directly related to the construction of the binary decision trees (including the pitch binary decision tree and the spectrum binary decision tree). In an embodiment of the present invention, the binary decision trees are built by a top-down clustering method.
As shown in Fig. 3, the flow for building a binary decision tree in an embodiment of the present invention comprises the following steps:

Step 301: obtain training data.

Specifically, a large amount of voice training data can be gathered and annotated with text; then, according to the annotated text content, the speech is segmented into basic speech units or even synthesis units (such as the state units of the basic speech unit models), yielding the speech segment set corresponding to each synthesis unit; the speech segments in the set corresponding to a synthesis unit serve as the training data of that synthesis unit.

Step 302: extract the synthesis parameters of the speech segment set corresponding to the synthesis unit from the training data.

The synthesis parameters include pitch features, spectrum features, and so on.

Step 303: initialize the binary decision tree corresponding to the synthesis unit according to the extracted synthesis parameters, and set the root node as the node currently under examination.

Initializing the binary decision tree means building a binary decision tree containing only the root node.

Step 304: judge whether the node currently under examination needs to be split. If so, perform step 305; otherwise perform step 306.

A remaining question chosen from a preset question set is used to attempt a split of the data of the node currently under examination, obtaining child nodes. A remaining question is one that has not yet been asked.

Specifically, the sample concentration of the node currently under examination can first be calculated, i.e. a measure of the degree of scatter of the samples in its speech segment set. In general, the greater the scatter, the more likely the node is to be split, and conversely the less likely. Sample variance can be used to measure a node's sample concentration, i.e. the average of the distances (or squared distances) of all the node's samples from the class centre. Then the sample concentration of the child nodes after splitting is calculated, and the question giving the largest drop in sample concentration is chosen as the preferred question.

A split is then attempted according to the preferred question, obtaining child nodes. If the drop in concentration from splitting on the preferred question is below a set threshold, or the training data in a child node after splitting is below a minimum set threshold, it is decided that the node currently under examination is not split further.

Step 305: split the node currently under examination, obtaining the child nodes after splitting and the training data corresponding to those child nodes. Then perform step 307.

Specifically, the node currently under examination can be split according to the preferred question.

Step 306: mark the node currently under examination as a leaf node.

Step 307: judge whether any unexamined non-leaf node remains in the binary decision tree. If so, perform step 308; otherwise perform step 309.

Step 308: obtain the next unexamined non-leaf node as the node currently under examination. Then return to step 304.

Step 309: output the binary decision tree.

It should be noted that, in the embodiments of the present invention, both the pitch binary decision tree and the spectrum binary decision tree can be built according to the flow shown in Fig. 3.
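The clustering in steps 301 to 309 can be sketched as a recursive splitter. This is a minimal sketch under stated assumptions: samples are one-dimensional values paired with a context dictionary, scatter is measured by variance as the text suggests, and real systems would instead cluster multi-dimensional pitch or spectrum feature vectors.

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def build_tree(samples, questions, min_gain=1e-6, min_size=1):
    """samples: list of (value, context); questions: context -> bool."""
    if len(samples) <= min_size or not questions:
        return {"leaf": sum(x for x, _ in samples) / len(samples)}
    best = None
    for i, q in enumerate(questions):
        yes = [s for s in samples if q(s[1])]
        no = [s for s in samples if not q(s[1])]
        if not yes or not no:
            continue
        # Weighted child scatter; smaller means a better split.
        child = (len(yes) * variance([x for x, _ in yes]) +
                 len(no) * variance([x for x, _ in no])) / len(samples)
        gain = variance([x for x, _ in samples]) - child
        if best is None or gain > best[0]:
            best = (gain, i, yes, no)       # preferred question so far
    if best is None or best[0] < min_gain:  # stop: node becomes a leaf
        return {"leaf": sum(x for x, _ in samples) / len(samples)}
    gain, i, yes, no = best
    rest = questions[:i] + questions[i + 1:]  # a question is asked only once
    return {"q": i, "yes": build_tree(yes, rest, min_gain, min_size),
            "no": build_tree(no, rest, min_gain, min_size)}

# Toy data: two pitch clusters separated by one context question.
data = [(100.0, {"nasal": True}), (102.0, {"nasal": True}),
        (200.0, {"nasal": False}), (198.0, {"nasal": False})]
tree = build_tree(data, [lambda c: c["nasal"]])
```

Each leaf here stores only a mean; in the patent's scheme a leaf would hold a trained statistical model such as a Gaussian.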
As shown in Fig. 4, a schematic diagram of a binary decision tree in an embodiment of the present invention.

Fig. 4 illustrates the construction of the binary decision tree for the third state of the phoneme "*-aa+". As shown in Fig. 4, when the root node splits, the training data corresponding to the root node can be split by the answer to the preset question "is the right adjacent phoneme a nasal"; subsequently, when the nodes of the next layer split, for instance when the left node splits, the training data corresponding to that node can be split further by the answer to the preset question "is the left adjacent phoneme a voiced consonant". Finally, when a node cannot split further it is set as a leaf node, and its corresponding training data is used to train a mathematical statistical model, such as a Gaussian model; this statistical model serves as the synthesis parameter model corresponding to that leaf node.
Obviously, in the embodiment shown in Fig. 2, the selection of speech synthesis parameter models relies on binary decision trees based on text analysis, for example on the phoneme classes in the context of the synthesis unit currently under examination, or the pronunciation type of the current phoneme. Selecting speech synthesis parameter models this way is convenient and fast, but for a specific speech input, this universal model-selection method cannot embody the pronunciation characteristics well.

To this end, Fig. 5 shows another flow for determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention, comprising the following steps:
Step 501: segment the continuous voice signal according to the text content to obtain the speech segment corresponding to each synthesis unit.

Specifically, the continuous voice signal can be force-aligned with the acoustic model sequence corresponding to the preset synthesis units, i.e. the voice signal is decoded by speech recognition against that acoustic model sequence, thereby obtaining the speech segment corresponding to each synthesis unit.

It should be noted that synthesis units of different sizes can be chosen according to different application requirements. In general, if the bit rate requirement is stricter, a larger speech unit is chosen, such as a syllable unit or a phoneme unit; conversely, if the sound quality requirement is stricter, a smaller speech unit can be chosen, such as a model state unit or a feature stream unit.

When HMM-based (Hidden Markov Model) acoustic models are used, each state of the HMM can further be chosen as a synthesis unit, and the corresponding state-level speech segments obtained. Then, for each state, the pitch model and spectrum model corresponding to that state are determined from its pitch binary decision tree and spectrum binary decision tree respectively. This lets the obtained speech synthesis parameter models describe the features of the voice signal more finely.
Step 502: determine the duration of the speech segment corresponding to each synthesis unit, and the pitch feature sequence and spectrum feature sequence corresponding to the continuous voice signal.

Step 503: determine the pitch model of each synthesis unit according to the pitch feature sequence and the pitch model set corresponding to the synthesis unit.

Specifically, determine the pitch feature sequence corresponding to the synthesis unit, and obtain the pitch model set corresponding to the synthesis unit, i.e. the pitch models corresponding to all leaf nodes of the synthesis unit's pitch binary decision tree. Then calculate the likelihood of the pitch feature sequence under each pitch model in the set, and select the pitch model with the maximum likelihood as the pitch model of the synthesis unit.

Step 504: determine the spectrum model of each synthesis unit according to the spectrum feature sequence and the spectrum model set corresponding to the synthesis unit.

Specifically, determine the spectrum feature sequence corresponding to the synthesis unit, and obtain the spectrum model set corresponding to the synthesis unit, i.e. the spectrum models corresponding to all leaf nodes of the synthesis unit's spectrum binary decision tree. Then calculate the likelihood of the spectrum feature sequence under each spectrum model in the set, and select the spectrum model with the maximum likelihood as the spectrum model of the synthesis unit.
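The maximum-likelihood selection in steps 503 and 504 can be sketched with one-dimensional Gaussian leaf models standing in for the patent's pitch and spectrum models; the model parameters below are illustrative assumptions.

```python
import math

def log_likelihood(features, mean, var):
    # Sum of log N(x; mean, var) over the observed feature sequence.
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in features)

def best_model(features, models):
    """models: {name: (mean, var)}; return the argmax-likelihood model name."""
    return max(models, key=lambda name: log_likelihood(features, *models[name]))

# Two candidate leaf models for a unit; the observed pitch values fit "low".
models = {"low": (100.0, 25.0), "high": (200.0, 25.0)}
choice = best_model([98.0, 103.0, 101.0], models)
```

Because the winner is chosen by scoring the actual observed features, this variant can track the speaker's pronunciation better than the purely text-driven selection of Fig. 2.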
It can be seen that the voice signal sending method of the embodiment of the present invention significantly reduces the transmission bit rate and traffic consumption while keeping the loss in recovered sound quality to a minimum, solving the problem that traditional speech coding methods cannot balance sound quality against traffic, and improving the user communication experience in the mobile network era.
Correspondingly, an embodiment of the present invention also provides a voice signal receiving method, whose flowchart is shown in Fig. 6, comprising the following steps:

Step 601: receive the serial number string corresponding to a speech synthesis parameter model sequence.

Step 602: obtain the speech synthesis parameter model sequence from a codebook according to the serial number string.

Each speech synthesis parameter model has a unique serial number, and the sender and receiver both hold the same codebook, which contains all the speech synthesis parameter models. Therefore the receiver can obtain from the codebook the speech synthesis parameter model corresponding to each received serial number, and splice these models to obtain the speech synthesis parameter model sequence.
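The codebook lookup in steps 601 and 602 reduces to indexing a shared table; the codebook contents below are placeholders for the real synthesis parameter models.

```python
# Identical codebook held by sender and receiver.
codebook = {0: "pitch-model-A", 1: "pitch-model-B", 2: "spectrum-model-A"}

def decode_serial_string(serials, codebook):
    # Each serial number names exactly one synthesis parameter model;
    # splicing the lookups rebuilds the model sequence.
    return [codebook[n] for n in serials]

model_sequence = decode_serial_string([2, 0, 1], codebook)
```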
Step 603: determine the speech synthesis parameter sequence according to the speech synthesis parameter model sequence.

Specifically, the speech synthesis parameters can be determined from the speech synthesis parameter model sequence and the duration sequence corresponding to the synthesis units, generating the speech synthesis parameter sequence.

For example, the speech synthesis parameter sequence is obtained from the following formula:

O_max = argmax_O P(O | λ, T)

where O is a parameter sequence, λ is the given speech synthesis parameter model sequence, and T is the duration sequence corresponding to the synthesis units. O_max is the finally generated pitch parameter sequence or spectrum parameter sequence: within the range of the unit duration sequence T, the parameter sequence O_max with the maximum likelihood under the given speech synthesis parameter model sequence λ is sought, yielding the parameter sequence for speech synthesis.
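Under the simplifying assumption that each model is an independent Gaussian per frame, the sequence maximising P(O | λ, T) is just each model's mean repeated for that unit's duration. This is a hedged sketch only: practical parameter generation adds dynamic-feature constraints to smooth the trajectory, which is omitted here.

```python
def generate_parameters(model_means, durations):
    """model_means[i]: mean of the i-th model; durations[i]: its frame count."""
    frames = []
    for mean, dur in zip(model_means, durations):
        # Per-frame argmax of an independent Gaussian is its mean.
        frames.extend([mean] * dur)
    return frames

# Two units: 2 frames at pitch 100.0, then 3 frames at pitch 150.0.
o_max = generate_parameters([100.0, 150.0], [2, 3])
```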
Step 604: recover the voice signal according to the speech synthesis parameter sequence.

The speech synthesis parameter sequence O_max obtained in the previous step is fed into a vocoder to obtain the corresponding speech. A vocoder is an analysis-and-recovery tool for voice signals that can recover a high-quality speech waveform from parameterized speech data (such as pitch parameters and spectrum parameters).

It can be seen that the voice signal sending method and receiving method of the embodiments of the present invention, through the extraction of the speech synthesis parameter models corresponding to the continuous voice signal and through signal synthesis, achieve enormous compression of the voice signal with minimal signal loss, i.e. effectively reduce signal distortion.
Correspondingly, an embodiment of the present invention further provides a voice signal sending system. Fig. 7 is a structural block diagram of this system.
In this embodiment, the voice signal sending system includes:
a text acquisition module 701, configured to determine the text content corresponding to the continuous speech signal to be sent;
a parameter model determining module 702, configured to determine the speech synthesis parameter model of each synthesis unit according to the text content;
a concatenation module 703, configured to splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;
a sequence-number string determining module 704, configured to determine the sequence-number string corresponding to the speech synthesis parameter model sequence;
a sending module 705, configured to send the sequence-number string to a receiving terminal, so that the receiving terminal recovers the continuous speech signal according to the sequence-number string.
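The flow through modules 702-705 can be sketched end to end as follows; the toy codebook, the pinyin unit names and the comma-separated index encoding are assumptions for illustration only, not the patent's actual codebook or wire format:

```python
# Hypothetical codebook mapping each synthesis unit's chosen parameter
# model to its sequence number in the shared code book.
CODEBOOK = {"ni3": 17, "hao3": 42}

def send(text_units):
    """Modules 702-704: look up each unit's model index and encode the
    resulting model sequence as a sequence-number string (module 705
    would then transmit this string)."""
    indices = [CODEBOOK[u] for u in text_units]
    return ",".join(str(i) for i in indices)

payload = send(["ni3", "hao3"])
```

Only the short index string crosses the network, which is what makes the sub-1 kb/s bitrate claimed later in the description plausible.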
In practical applications, the text acquisition module 701 may obtain the text content automatically through a speech recognition algorithm, or obtain it by way of manual annotation. To this end, a speech recognition unit and/or an annotation acquisition unit may be arranged in the text acquisition module 701, so that a user can choose either way to obtain the text content corresponding to the continuous speech signal to be sent. The speech recognition unit is configured to determine, through a speech recognition algorithm, the text content corresponding to the continuous speech signal to be sent; the annotation acquisition unit is configured to obtain, by way of manual annotation, the text content corresponding to the continuous speech signal to be sent.
The synthesis unit is a preset minimum synthesis object, such as a syllable unit, a phoneme unit, or even a state unit in a phoneme HMM model.
In order to reduce the loss of recovered sound quality at the receiving terminal as far as possible and enable the receiving terminal to recover the continuous speech signal by speech synthesis, the speech synthesis parameter models that the parameter model determining module 702 obtains from the original speech signal should match the characteristics of the original speech signal as closely as possible, so as to reduce the loss from signal compression and recovery. Specifically, the continuous speech signal may be segmented into voice segments according to the text content, obtaining the voice segment corresponding to each synthesis unit, and then the duration, fundamental frequency model and spectral model corresponding to each synthesis unit are obtained.
The voice signal sending system of the embodiment of the present invention employs statistical-model-based coding, whose processing is independent of the speech sampling rate, so that 16 kHz ultra-wideband speech can be coded without paying any additional bitrate cost, with good sound quality and a low coding bitrate. Compared with the mainstream speech coding systems in the communications field, the speech coding mode of the present system can handle ultra-wideband speech (16 kHz sampling rate) with higher sound quality, and has a lower bitrate (below 1 kb/s), effectively reducing network traffic.
Fig. 8 is a structural block diagram of the parameter model determining module in an embodiment of the present invention. The parameter model determining module includes:
a segmentation unit 801, configured to segment the continuous speech signal into voice segments according to the text content, obtaining the voice segment corresponding to each synthesis unit.
Specifically, the continuous speech signal may be force-aligned with the acoustic model sequence corresponding to the synthesis units in the text content, i.e., a speech recognition decoding of the speech signal is computed against the acoustic model sequence, thereby obtaining the voice segment corresponding to each synthesis unit.
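A minimal sketch of such forced alignment, assuming one-dimensional Gaussian unit models with unit variance (a drastic simplification of real acoustic models; the dynamic-programming layout and all values are illustrative):

```python
import numpy as np

def force_align(frames, unit_means):
    """Toy forced alignment: assign each frame to one of the units,
    monotonically and in order, choosing segment boundaries that
    maximise total log-likelihood under 1-D unit-variance Gaussians."""
    n, k = len(frames), len(unit_means)
    ll = -0.5 * (np.subtract.outer(frames, unit_means) ** 2)  # frame x unit
    NEG = -1e18
    best = np.full((n + 1, k + 1), NEG)
    best[0][0] = 0.0
    back = np.zeros((n + 1, k + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            # frame i-1 either starts unit j or continues it
            stay, start = best[i - 1][j], best[i - 1][j - 1]
            if start >= stay:
                best[i][j], back[i][j] = start + ll[i - 1][j - 1], j - 1
            else:
                best[i][j], back[i][j] = stay + ll[i - 1][j - 1], j
    labels, j = [], k                 # trace back the unit label per frame
    for i in range(n, 0, -1):
        labels.append(j - 1)
        j = back[i][j]
    return labels[::-1]

seg = force_align([1.0, 1.1, 5.0, 5.2], [1.0, 5.0])
```

In a real system the per-frame scores would come from the HMM acoustic models of the recognizer, but the monotonic dynamic-programming structure is the same.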
It should be noted that the synthesis unit may be chosen at different granularities according to different application demands. In general, if a lower bitrate is required, a larger voice unit is selected, such as a syllable unit or a phoneme unit; conversely, if higher sound quality is required, a smaller voice unit can be selected, such as a state unit or a feature-stream unit of a model. Under an acoustic model setting based on HMMs (Hidden Markov Models), each state of the HMM can further be chosen as the synthesis unit, and the corresponding voice segment obtained at the state level. Then, for each state, the corresponding fundamental frequency model and spectral model are determined from its fundamental frequency binary decision tree and spectrum binary decision tree, respectively. In this way, the obtained speech synthesis parameter models can describe the characteristics of the speech signal more finely.
a duration determining unit 802, configured to determine in turn the duration of the voice segment corresponding to each synthesis unit;
a fundamental frequency model determining unit 803, configured to determine in turn the fundamental frequency model of the voice segment corresponding to each synthesis unit;
a spectral model determining unit 804, configured to determine in turn the spectral model of the voice segment corresponding to each synthesis unit.
In practical applications, the fundamental frequency model determining unit 803 and the spectral model determining unit 804 may have multiple implementations. For example, the fundamental frequency model and the spectral model may be obtained according to binary decision trees; to this end, in another embodiment of the voice signal sending system of the present invention, the system further includes a binary decision tree building module, configured to build the fundamental frequency binary decision tree and the spectrum binary decision tree. In addition, the fundamental frequency model determining unit 803 and the spectral model determining unit 804 may also obtain the fundamental frequency model and the spectral model based on signal-feature optimization, which will be described in detail later.
Fig. 9 is a structural block diagram of the binary decision tree building module in the voice signal sending system in an embodiment of the present invention.
The binary decision tree building module includes:
a training data acquiring unit 901, configured to obtain training data;
a parameter extraction unit 902, configured to extract from the training data the synthesis parameters of the voice segment set corresponding to the synthesis unit, the synthesis parameters including fundamental frequency features and spectrum features;
an initialization unit 903, configured to initialize, according to the synthesis parameters, the binary decision tree corresponding to the synthesis unit, i.e., to build a binary decision tree with only a root node;
a node inspection unit 904, configured to inspect each non-leaf node in turn, starting from the root node of the binary decision tree; if the currently inspected node needs splitting, split the currently inspected node and obtain the child nodes after splitting and the training data corresponding to the child nodes; otherwise, mark the currently inspected node as a leaf node;
a binary decision tree output unit 905, configured to output the binary decision tree of the synthesis unit after the node inspection unit has inspected all non-leaf nodes.
In this embodiment, the training data acquiring unit 901 may specifically collect a large amount of voice training data and annotate it with text, then segment it into voice segments of basic voice units or even of synthesis units (such as the state units of basic voice unit models) according to the annotated text content, obtaining the voice segment set corresponding to each synthesis unit, and take the voice segments in the set corresponding to each synthesis unit as the training data corresponding to that synthesis unit.
When judging whether the currently inspected node needs splitting, the node inspection unit 904 may, according to the sample concentration of the currently inspected node, select the question with the maximum drop in sample concentration as the optimal question and attempt a split with it, obtaining child nodes. If the drop in concentration from splitting on the optimal question is less than a set threshold, or the training data of a child node after splitting is less than a set threshold, it is determined that the currently inspected node is not split further.
For the above inspection and splitting process, reference may be made to the description in the foregoing voice signal sending method embodiment of the present invention; details are not repeated here.
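The split-selection step of the node inspection unit can be sketched as follows, measuring "sample concentration" as within-node variance (one common choice; the patent does not fix the measure), with purely illustrative question names and sample values:

```python
import numpy as np

def best_split(samples, questions):
    """Sketch of unit 904's split search: among the yes/no context
    questions, pick the one giving the largest drop in total within-node
    variance (treated here as the 'sample concentration' gain)."""
    node_var = np.var([v for _, v in samples]) * len(samples)
    best = None
    for name, test in questions.items():
        yes = [v for ctx, v in samples if test(ctx)]
        no = [v for ctx, v in samples if not test(ctx)]
        if not yes or not no:          # degenerate split, skip
            continue
        gain = node_var - (np.var(yes) * len(yes) + np.var(no) * len(no))
        if best is None or gain > best[1]:
            best = (name, gain)
    return best

# Hypothetical samples: (context, F0 feature value) pairs.
samples = [({"tone": 1}, 4.9), ({"tone": 1}, 5.0),
           ({"tone": 4}, 4.0), ({"tone": 4}, 4.1)]
questions = {"is-tone-1": lambda ctx: ctx["tone"] == 1}
q, gain = best_split(samples, questions)
```

The stopping rules in the text (minimum gain, minimum child data) would simply be threshold checks on `gain` and on `len(yes)`/`len(no)`.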
It should be noted that, in the embodiments of the present invention, both the fundamental frequency binary decision tree and the spectrum binary decision tree can be built by this binary decision tree building module; their building processes are similar and are not detailed here.
Based on the above fundamental frequency binary decision tree and spectrum binary decision tree, the implementations of the fundamental frequency model determining unit and the spectral model determining unit in embodiments of the present invention are described in detail below.
Fig. 10 is one structural block diagram of the fundamental frequency model determining unit in the voice signal sending system in an embodiment of the present invention.
In this embodiment, the fundamental frequency model determining unit includes:
a first acquiring unit 161, configured to obtain the fundamental frequency binary decision tree corresponding to the synthesis unit;
a first parsing unit 162, configured to perform text parsing on the synthesis unit to obtain its contextual information, for example contextual information such as phoneme unit, tone, part of speech and prosodic level;
a first decision unit 163, configured to perform path decision in the fundamental frequency binary decision tree according to the contextual information, obtaining the corresponding leaf node.
Specifically, the path decision process is as follows: according to the contextual information of the synthesis unit, answer each node's split question in turn starting from the root node of the fundamental frequency binary decision tree; obtain a top-down matching path according to the answers; and obtain the leaf node according to the matching path.
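The top-down path decision just described can be sketched as a simple tree walk; the dictionary-based tree layout and the model names are purely illustrative:

```python
# Sketch of path decision (unit 163): starting from the root, answer each
# node's split question with the unit's context and follow the yes/no
# branch until a leaf is reached.
def decide_path(node, context):
    while "question" in node:              # non-leaf node
        branch = "yes" if node["question"](context) else "no"
        node = node[branch]
    return node["model"]                   # leaf holds the F0 model

tree = {"question": lambda c: c["tone"] == 1,
        "yes": {"model": "f0_model_A"},
        "no": {"model": "f0_model_B"}}
m = decide_path(tree, {"tone": 4})
```

The same walk, run over the spectrum binary decision tree, yields the spectral model in the mirrored unit described below.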
a first output unit 164, configured to take the fundamental frequency model corresponding to the leaf node as the fundamental frequency model of the synthesis unit.
Similarly to the fundamental frequency model determining unit, Fig. 11 is one structural block diagram of the spectral model determining unit in the voice signal sending system in an embodiment of the present invention.
In this embodiment, the spectral model determining unit includes:
a second acquiring unit 171, configured to obtain the spectrum binary decision tree corresponding to the synthesis unit;
a second parsing unit 172, configured to perform text parsing on the synthesis unit to obtain its contextual information, such as phoneme unit, tone, part of speech and prosodic level;
a second decision unit 173, configured to perform path decision in the spectrum binary decision tree according to the contextual information of the synthesis unit, obtaining the corresponding leaf node.
Specifically, the path decision process is as follows: according to the contextual information of the synthesis unit, answer each node's split question in turn starting from the root node of the spectrum binary decision tree; obtain a top-down matching path according to the answers; and obtain the leaf node according to the matching path.
a second output unit 174, configured to take the spectral model corresponding to the leaf node as the spectral model of the synthesis unit.
It should be noted that, in practical applications, the fundamental frequency model determining unit shown in Fig. 10 and the spectral model determining unit shown in Fig. 11 may each be realized by mutually independent physical units, or may be realized together by one physical unit. When a fundamental frequency model needs to be generated, the fundamental frequency binary decision tree corresponding to the synthesis unit is obtained, and the synthesis unit is parsed and a decision made accordingly, obtaining the fundamental frequency model corresponding to the synthesis unit. When a spectral model needs to be generated, the spectrum binary decision tree corresponding to the synthesis unit is obtained, and the synthesis unit is parsed and a decision made accordingly, obtaining the spectral model corresponding to the synthesis unit.
Fig. 12 is another structural block diagram of the fundamental frequency model determining unit in the voice signal sending system in an embodiment of the present invention.
In this embodiment, the fundamental frequency model determining unit includes:
a first determining unit 181, configured to determine the fundamental frequency feature sequence corresponding to the synthesis unit;
a first set acquiring unit 182, configured to obtain the fundamental frequency model set corresponding to the synthesis unit, i.e., the fundamental frequency models corresponding to all leaf nodes of the fundamental frequency binary decision tree of the synthesis unit;
a first computing unit 183, configured to compute the likelihood of the fundamental frequency feature sequence against each fundamental frequency model in the fundamental frequency model set;
a first selection unit 184, configured to select the fundamental frequency model with the maximum likelihood as the fundamental frequency model of the synthesis unit.
Similarly to the fundamental frequency model determining unit, Fig. 13 is another structural block diagram of the spectral model determining unit in the voice signal sending system in an embodiment of the present invention.
In this embodiment, the spectral model determining unit includes:
a second determining unit 191, configured to determine the spectrum feature sequence corresponding to the synthesis unit;
a second set acquiring unit 192, configured to obtain the spectral model set corresponding to the synthesis unit, i.e., the spectral models corresponding to all leaf nodes of the spectrum binary decision tree of the synthesis unit;
a second computing unit 193, configured to compute the likelihood of the spectrum feature sequence against each spectral model in the spectral model set;
a second selection unit 194, configured to select the spectral model with the maximum likelihood as the spectral model of the synthesis unit.
It should be noted that, in practical applications, the fundamental frequency model determining unit shown in Fig. 12 and the spectral model determining unit shown in Fig. 13 may each be realized by mutually independent physical units, or may be realized together by one physical unit. When a fundamental frequency model needs to be generated, the fundamental frequency feature sequence and the fundamental frequency model set corresponding to the synthesis unit are obtained, and the fundamental frequency model with the maximum likelihood is selected as the fundamental frequency model of the synthesis unit. When a spectral model needs to be generated, the spectrum feature sequence and the spectral model set corresponding to the synthesis unit are obtained, and the spectral model with the maximum likelihood is selected as the spectral model of the synthesis unit.
It can be seen that the voice signal sending system of the embodiment of the present invention significantly reduces the transmission bitrate while ensuring minimal loss of recovered sound quality, reducing traffic consumption; it solves the problem that traditional speech coding methods cannot balance sound quality and traffic, and improves the user communication experience in the mobile network era.
Correspondingly, an embodiment of the present invention further provides a voice signal receiving system. Fig. 14 is a structural block diagram of this system.
In this embodiment, the voice signal receiving system includes:
a receiving module 141, configured to receive the sequence-number string corresponding to a speech synthesis parameter model sequence;
an extraction module 142, configured to obtain the speech synthesis parameter model sequence from a code book according to the sequence-number string;
a determining module 143, configured to determine a speech synthesis parameter sequence according to the speech synthesis parameter model sequence;
a signal recovery module 144, configured to recover the voice signal according to the speech synthesis parameter sequence.
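Mirroring the sending side, modules 141-143 can be sketched as follows; the codebook contents and string format are illustrative only, and module 144's vocoder step is omitted:

```python
# Hypothetical receiving-side codebook: sequence number -> parameter
# model (here a mean value plus a duration in frames).
CODEBOOK = {17: {"mean": 4.8, "dur": 2}, 42: {"mean": 5.1, "dur": 3}}

def receive(serial_string):
    """Decode the sequence-number string (141), look the models up in
    the code book (142), and expand them into a frame-level parameter
    sequence (143); a vocoder (144) would turn this into a waveform."""
    models = [CODEBOOK[int(s)] for s in serial_string.split(",")]
    params = []
    for m in models:
        params.extend([m["mean"]] * m["dur"])
    return params

params = receive("17,42")
```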
The determining module 143 may determine the speech synthesis parameters according to the speech synthesis parameter model sequence and the model sequence durations, generating the speech synthesis parameter sequence. For the specific implementation, reference may be made to the description in the foregoing voice signal receiving method embodiment of the present invention; details are not repeated here.
Since the recovery of the voice signal by the voice signal receiving system of the embodiment of the present invention is independent of the speech sampling rate, signal transmission at an extremely low bitrate can be achieved on the premise of ensuring minimal loss of recovered sound quality, which better solves the sound-quality and traffic problems of traditional speech coding methods, improves the user communication experience in the mobile network era, and saves network fees.
The voice signal sending and receiving schemes of the embodiments of the present invention are applicable to the coding of various types of speech (such as ultra-wideband speech at a 16 kHz sampling rate and narrowband speech at an 8 kHz sampling rate), and can obtain good sound quality.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively simple, and relevant parts may refer to the description of the method embodiments. The system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules therein may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.
The embodiments of the present invention have been described in detail above; specific examples are used herein to set forth the present invention, and the description of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (20)
1. A voice signal sending method, characterised in that it comprises:
determining the text content corresponding to a continuous speech signal to be sent;
determining the speech synthesis parameter model of each synthesis unit according to the text content;
splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;
determining the sequence-number string corresponding to the speech synthesis parameter model sequence;
sending the sequence-number string to a receiving terminal, so that the receiving terminal recovers the continuous speech signal according to the sequence-number string.
2. The method according to claim 1, characterised in that determining the text content corresponding to the continuous speech signal to be sent comprises:
determining, through a speech recognition algorithm, the text content corresponding to the continuous speech signal to be sent; or
obtaining, by way of manual annotation, the text content corresponding to the continuous speech signal to be sent.
3. The method according to claim 1, characterised in that determining the speech synthesis parameter model of each synthesis unit according to the text content comprises:
segmenting the continuous speech signal into voice segments according to the text content, obtaining the voice segment corresponding to each synthesis unit;
determining in turn the duration, fundamental frequency model and spectral model of the voice segment corresponding to each synthesis unit.
4. The method according to claim 3, characterised in that determining the fundamental frequency model corresponding to a synthesis unit comprises:
obtaining the fundamental frequency binary decision tree corresponding to the synthesis unit;
performing text parsing on the synthesis unit to obtain the contextual information of the synthesis unit;
performing path decision in the fundamental frequency binary decision tree according to the contextual information, obtaining the corresponding leaf node;
taking the fundamental frequency model corresponding to the leaf node as the fundamental frequency model of the synthesis unit.
5. The method according to claim 3, characterised in that determining the spectral model corresponding to a synthesis unit comprises:
obtaining the spectrum binary decision tree corresponding to the synthesis unit;
performing text parsing on the synthesis unit to obtain its contextual information, including phoneme unit, tone, part of speech and prosodic level;
performing path decision in the spectrum binary decision tree according to the contextual information of the synthesis unit, obtaining the corresponding leaf node;
taking the spectral model corresponding to the leaf node as the spectral model of the synthesis unit.
6. The method according to claim 4 or 5, characterised in that the method further comprises: building the binary decision tree corresponding to the synthesis unit in the following manner:
obtaining training data;
extracting from the training data the synthesis parameters of the voice segment set corresponding to the synthesis unit, the synthesis parameters including fundamental frequency features and spectrum features;
initializing, according to the synthesis parameters, the binary decision tree corresponding to the synthesis unit;
starting from the root node of the binary decision tree, inspecting each non-leaf node in turn;
if the currently inspected node needs splitting, splitting the currently inspected node and obtaining the child nodes after splitting and the training data corresponding to the child nodes; otherwise, marking the currently inspected node as a leaf node;
after all non-leaf nodes have been inspected, obtaining the binary decision tree of the synthesis unit.
7. The method according to claim 3, characterised in that determining the fundamental frequency model corresponding to a synthesis unit comprises:
determining the fundamental frequency feature sequence corresponding to the synthesis unit;
obtaining the fundamental frequency model set corresponding to the synthesis unit;
computing the likelihood of the fundamental frequency feature sequence against each fundamental frequency model in the fundamental frequency model set;
selecting the fundamental frequency model with the maximum likelihood as the fundamental frequency model of the synthesis unit.
8. The method according to claim 3, characterised in that determining the spectral model corresponding to a synthesis unit comprises:
determining the spectrum feature sequence corresponding to the synthesis unit;
obtaining the spectral model set corresponding to the synthesis unit;
computing the likelihood of the spectrum feature sequence against each spectral model in the spectral model set;
selecting the spectral model with the maximum likelihood as the spectral model of the synthesis unit.
9. A voice signal receiving method, characterised in that it comprises:
receiving the sequence-number string corresponding to a speech synthesis parameter model sequence;
obtaining the speech synthesis parameter model sequence from a code book according to the sequence-number string;
determining a speech synthesis parameter sequence according to the speech synthesis parameter model sequence;
recovering the voice signal according to the speech synthesis parameter sequence.
10. The method according to claim 9, characterised in that determining the speech synthesis parameter sequence according to the speech synthesis parameter model sequence comprises:
determining speech synthesis parameters according to the speech synthesis parameter model sequence and the model sequence durations, generating the speech synthesis parameter sequence.
11. A voice signal sending system, characterised in that it comprises:
a text acquisition module, configured to determine the text content corresponding to a continuous speech signal to be sent;
a parameter model determining module, configured to determine the speech synthesis parameter model of each synthesis unit according to the text content;
a concatenation module, configured to splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;
a sequence-number string determining module, configured to determine the sequence-number string corresponding to the speech synthesis parameter model sequence;
a sending module, configured to send the sequence-number string to a receiving terminal, so that the receiving terminal recovers the continuous speech signal according to the sequence-number string.
12. The system according to claim 11, characterised in that the text acquisition module comprises:
a speech recognition unit, configured to determine, through a speech recognition algorithm, the text content corresponding to the continuous speech signal to be sent; or
an annotation acquisition unit, configured to obtain, by way of manual annotation, the text content corresponding to the continuous speech signal to be sent.
13. The system according to claim 11, characterised in that the parameter model determining module comprises:
a segmentation unit, configured to segment the continuous speech signal into voice segments according to the text content, obtaining the voice segment corresponding to each synthesis unit;
a duration determining unit, configured to determine in turn the duration of the voice segment corresponding to each synthesis unit;
a fundamental frequency model determining unit, configured to determine in turn the fundamental frequency model of the voice segment corresponding to each synthesis unit;
a spectral model determining unit, configured to determine in turn the spectral model of the voice segment corresponding to each synthesis unit.
14. The system according to claim 13, characterised in that the fundamental frequency model determining unit comprises:
a first acquiring unit, configured to obtain the fundamental frequency binary decision tree corresponding to the synthesis unit;
a first parsing unit, configured to perform text parsing on the synthesis unit to obtain the contextual information of the synthesis unit;
a first decision unit, configured to perform path decision in the fundamental frequency binary decision tree according to the contextual information, obtaining the corresponding leaf node;
a first output unit, configured to take the fundamental frequency model corresponding to the leaf node as the fundamental frequency model of the synthesis unit.
15. The system according to claim 13, characterised in that the spectral model determining unit comprises:
a second acquiring unit, configured to obtain the spectrum binary decision tree corresponding to the synthesis unit;
a second parsing unit, configured to perform text parsing on the synthesis unit to obtain its contextual information, including phoneme unit, tone, part of speech and prosodic level;
a second decision unit, configured to perform path decision in the spectrum binary decision tree according to the contextual information of the synthesis unit, obtaining the corresponding leaf node;
a second output unit, configured to take the spectral model corresponding to the leaf node as the spectral model of the synthesis unit.
16. The system according to claim 14 or 15, characterised in that the system further comprises a binary decision tree building module, and the binary decision tree building module comprises:
a training data acquiring unit, configured to obtain training data;
a parameter extraction unit, configured to extract from the training data the synthesis parameters of the voice segment set corresponding to the synthesis unit, the synthesis parameters including fundamental frequency features and spectrum features;
an initialization unit, configured to initialize, according to the synthesis parameters, the binary decision tree corresponding to the synthesis unit;
a node inspection unit, configured to inspect each non-leaf node in turn, starting from the root node of the binary decision tree; if the currently inspected node needs splitting, split the currently inspected node and obtain the child nodes after splitting and the training data corresponding to the child nodes; otherwise, mark the currently inspected node as a leaf node;
a binary decision tree output unit, configured to output the binary decision tree of the synthesis unit after the node inspection unit has inspected all non-leaf nodes.
17. The system according to claim 13, characterised in that the fundamental frequency model determining unit comprises:
a first determining unit, configured to determine the fundamental frequency feature sequence corresponding to the synthesis unit;
a first set acquiring unit, configured to obtain the fundamental frequency model set corresponding to the synthesis unit;
a first computing unit, configured to compute the likelihood of the fundamental frequency feature sequence against each fundamental frequency model in the fundamental frequency model set;
a first selection unit, configured to select the fundamental frequency model with the maximum likelihood as the fundamental frequency model of the synthesis unit.
18. The system according to claim 13, characterized in that the spectral model determination unit comprises:
A second determination unit, configured to determine the spectrum feature sequence corresponding to the synthesis unit;
A second set acquisition unit, configured to obtain the spectral model set corresponding to the synthesis unit;
A second calculation unit, configured to calculate the likelihood between the spectrum feature sequence and each spectral model in the spectral model set;
A second selection unit, configured to select the spectral model with the maximum likelihood as the spectral model of the synthesis unit.
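Claims 17 and 18 describe the same selection rule for fundamental frequency and spectrum: score the feature sequence under every candidate model and keep the best. A minimal sketch, assuming diagonal Gaussian models as a stand-in for whatever statistical models the patent's system actually uses:

```python
# Illustrative maximum-likelihood selection: score a feature sequence
# under each candidate model (1-D Gaussians here, an assumption) and
# return the model with the highest total log-likelihood.
import math

def log_likelihood(seq, model):
    """Sum of per-frame log densities under a 1-D Gaussian model."""
    mean, var = model["mean"], model["var"]
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in seq)

def select_model(feature_seq, model_set):
    """Return the model with the maximum likelihood for the sequence."""
    return max(model_set, key=lambda m: log_likelihood(feature_seq, m))

f0_models = [{"name": "low", "mean": 120.0, "var": 100.0},
             {"name": "high", "mean": 220.0, "var": 100.0}]
best = select_model([210.0, 225.0, 218.0], f0_models)
print(best["name"])  # high
```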
19. A voice signal receiving system, characterized by comprising:
A receiving module, configured to receive the sequence number string corresponding to a speech synthesis parameter model sequence;
An extraction module, configured to obtain the speech synthesis parameter model sequence from a codebook according to the sequence number string;
A determination module, configured to determine a speech synthesis parameter sequence according to the speech synthesis parameter model sequence;
A signal recovery module, configured to recover the voice signal according to the speech synthesis parameter sequence.
20. The system according to claim 19, characterized in that:
The determination module is specifically configured to determine the speech synthesis parameters according to the speech synthesis parameter model sequence and the model durations, generating the speech synthesis parameter sequence.
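The receiving pipeline of claims 19 and 20 can be sketched end to end: the sequence number string indexes a codebook shared with the sender, and each retrieved model is expanded over its duration to form the synthesis parameter sequence. The codebook contents and the single-parameter models below are invented for illustration; a real system would feed the resulting sequence to a vocoder to recover the waveform.

```python
# Hedged sketch of the claimed receiver: sequence numbers -> codebook
# lookup -> parameter model sequence -> parameter sequence expanded by
# model duration (claim 20). Codebook entries here are assumptions.

CODEBOOK = {  # assumed codebook shared between sender and receiver
    7:  {"f0_mean": 120.0, "duration": 3},   # duration in frames
    42: {"f0_mean": 220.0, "duration": 2},
}

def receive(sequence_numbers):
    # Extraction module: look up each sequence number in the codebook.
    models = [CODEBOOK[n] for n in sequence_numbers]
    # Determination module: hold each model's parameter for its duration.
    params = [m["f0_mean"] for m in models for _ in range(m["duration"])]
    # Signal recovery module (a vocoder) would consume this sequence.
    return params

print(receive([7, 42]))  # [120.0, 120.0, 120.0, 220.0, 220.0]
```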
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310362024.7A CN103474075B (en) | 2013-08-19 | 2013-08-19 | Voice signal sending method and system, method of reseptance and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310362024.7A CN103474075B (en) | 2013-08-19 | 2013-08-19 | Voice signal sending method and system, method of reseptance and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103474075A CN103474075A (en) | 2013-12-25 |
CN103474075B true CN103474075B (en) | 2016-12-28 |
Family
ID=49798896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310362024.7A Active CN103474075B (en) | 2013-08-19 | 2013-08-19 | Voice signal sending method and system, method of reseptance and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103474075B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106373581A (en) * | 2016-09-28 | 2017-02-01 | 成都奥克特科技有限公司 | Data encoding processing method for speech signals |
CN108346423B (en) * | 2017-01-23 | 2021-08-20 | 北京搜狗科技发展有限公司 | Method and device for processing speech synthesis model |
CN108389592B (en) * | 2018-02-27 | 2021-10-08 | 上海讯飞瑞元信息技术有限公司 | Voice quality evaluation method and device |
CN111147444B (en) * | 2019-11-20 | 2021-08-06 | 维沃移动通信有限公司 | Interaction method and electronic equipment |
CN116469405A (en) * | 2023-04-23 | 2023-07-21 | 富韵声学科技(深圳)有限公司 | Noise reduction conversation method, medium and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0360265A2 (en) * | 1988-09-21 | 1990-03-28 | Nec Corporation | Communication system capable of improving a speech quality by classifying speech signals |
CN1256001A (en) * | 1998-01-27 | 2000-06-07 | 松下电器产业株式会社 | Method and device for coding lag parameter and code book preparing method |
CN1321297A (en) * | 1999-08-23 | 2001-11-07 | 松下电器产业株式会社 | Voice encoder and voice encoding method |
CN1486486A (en) * | 2000-11-27 | 2004-03-31 | 日本电信电话株式会社 | Method, device and program for coding and decoding acoustic parameter, and method, device and program for coding and decoding sound |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008139631A (en) * | 2006-12-04 | 2008-06-19 | Nippon Telegr & Teleph Corp <Ntt> | Voice synthesis method, device and program |
- 2013-08-19: CN application CN201310362024.7A filed | patent/CN103474075B/en | status: Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0360265A2 (en) * | 1988-09-21 | 1990-03-28 | Nec Corporation | Communication system capable of improving a speech quality by classifying speech signals |
CN1256001A (en) * | 1998-01-27 | 2000-06-07 | 松下电器产业株式会社 | Method and device for coding lag parameter and code book preparing method |
CN1321297A (en) * | 1999-08-23 | 2001-11-07 | 松下电器产业株式会社 | Voice encoder and voice encoding method |
CN1486486A (en) * | 2000-11-27 | 2004-03-31 | 日本电信电话株式会社 | Method, device and program for coding and decoding acoustic parameter, and method, device and program for coding and decoding sound |
Also Published As
Publication number | Publication date |
---|---|
CN103474075A (en) | 2013-12-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103474075B (en) | Voice signal sending method and system, method of reseptance and system | |
CN101447185B (en) | Audio frequency rapid classification method based on content | |
CN101510424B (en) | Method and system for encoding and synthesizing speech based on speech primitive | |
CN102723078B (en) | Emotion speech recognition method based on natural language comprehension | |
CN103700370B (en) | A kind of radio and television speech recognition system method and system | |
CN103065620B (en) | Method with which text input by user is received on mobile phone or webpage and synthetized to personalized voice in real time | |
CN108053823A (en) | A kind of speech recognition system and method | |
CN103761975B (en) | Method and device for oral evaluation | |
CN102496364A (en) | Interactive speech recognition method based on cloud network | |
CN102446504B (en) | Voice/Music identifying method and equipment | |
CN106453043A (en) | Multi-language conversion-based instant communication system | |
CN102568469B (en) | G.729A compressed pronunciation flow information hiding detection device and detection method | |
CN103474067B (en) | speech signal transmission method and system | |
CN107564533A (en) | Speech frame restorative procedure and device based on information source prior information | |
CN106356054A (en) | Method and system for collecting information of agricultural products based on voice recognition | |
CN109036387A (en) | Video speech recognition methods and system | |
CN112420079B (en) | Voice endpoint detection method and device, storage medium and electronic equipment | |
WO2019119552A1 (en) | Method for translating continuous long speech file, and translation machine | |
CN103077705B (en) | Method for optimizing local synthesis based on distributed natural rhythm | |
CN101814289A (en) | Digital audio multi-channel coding method and system of DRA (Digital Recorder Analyzer) with low bit rate | |
CN102314878A (en) | Automatic phoneme splitting method | |
CN108010533A (en) | The automatic identifying method and device of voice data code check | |
CN113724690B (en) | PPG feature output method, target audio output method and device | |
CN111312211A (en) | Dialect speech recognition system based on oversampling technology | |
CN109192197A (en) | Big data speech recognition system Internet-based |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: No. 666 Wangjiang Road, High-tech Development Zone, Hefei City, Anhui Province, 230088
Applicant after: Iflytek Co., Ltd.
Address before: Xunfei Building, No. 666 Wangjiang Road, High-tech Development Zone, Hefei City, Anhui Province, 230088
Applicant before: Anhui USTC iFLYTEK Co., Ltd.
COR | Change of bibliographic data | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |