CN103474067B - speech signal transmission method and system - Google Patents


Info

Publication number
CN103474067B
Authority
CN
China
Prior art keywords
model
fundamental frequency
unit
sequence
synthesis unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310361783.1A
Other languages
Chinese (zh)
Other versions
CN103474067A (en)
Inventor
江源
周明
凌震华
何婷婷
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201310361783.1A priority Critical patent/CN103474067B/en
Publication of CN103474067A publication Critical patent/CN103474067A/en
Application granted granted Critical
Publication of CN103474067B publication Critical patent/CN103474067B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech signal transmission method and system. The method comprises: determining the text content corresponding to a continuous speech signal to be sent; determining a speech synthesis parameter model for each synthesis unit according to the text content and the continuous speech signal; splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence; determining the sequence-number string corresponding to the speech synthesis parameter model sequence; and sending the sequence-number string to a receiving end, so that the receiving end recovers the continuous speech signal according to the sequence-number string. The invention achieves signal transmission at an extremely low bit rate while keeping the loss in recovered sound quality to a minimum.

Description

Speech signal transmission method and system
Technical field
The present invention relates to the field of signal transmission technology, and in particular to a speech signal transmission method and system.
Background technology
With the spread of the Internet and the popularity of portable devices, chat applications based on handheld devices have emerged in large numbers. The naturalness of voice interaction is unmatched by other interaction means, especially on small-screen handheld devices that are ill-suited to handwriting or keypad input. Many of these products therefore support voice interaction, transmitting the speech signal received at one terminal to a destination terminal; for example, the WeChat product released by Tencent supports voice message transmission. However, the amount of data in a directly transmitted speech signal is often very large, imposing a considerable financial burden on users of channels billed by traffic volume, such as the Internet or mobile communication networks. Compressing the transmitted data volume as far as possible without affecting speech quality is therefore a precondition for improving the practical value of speech signal transmission.
To address the problem of speech signal transmission, researchers have tried a variety of speech coding methods that digitally quantize and compress the speech signal, reducing the coding bit rate and improving transmission efficiency while preserving the quality of the recovered speech. The most common speech signal compression methods are waveform coding and parametric coding. Specifically:
Waveform coding samples, quantizes, and encodes the analog time-domain waveform to form a digital signal. This coding method is highly adaptable and yields high speech quality. However, because the waveform of the original speech signal must be preserved for recovery, this scheme requires a relatively high bit rate, above 16 kb/s, to achieve acceptable sound quality.
Parametric coding extracts parameters characterizing the vocal features from the original speech signal and encodes these feature parameters. The aim of this scheme is to preserve the meaning of the original speech and guarantee intelligibility. Its advantage is a relatively low bit rate, but the recovered sound quality is more severely degraded.
In the traditional voice communication era, time-based billing was common, and coding methods were chiefly concerned with algorithmic delay and communication quality. In the mobile Internet era, speech is transmitted as one kind of data signal and is generally billed by traffic volume, so the coding bit rate directly affects the user's cost. In addition, conventional telephone channels use only an 8 kHz sampling rate; such speech is narrowband, and its sound quality is degraded and bounded. Continuing to use traditional coding schemes for wideband or ultra-wideband speech would clearly require a higher bit rate and multiply the traffic consumption.
Summary of the invention
Embodiments of the present invention provide a speech signal transmission method and system that achieve signal transmission at an extremely low bit rate while minimizing the loss in recovered sound quality.
An embodiment of the present invention provides a speech signal transmission method, comprising:
determining the text content corresponding to a continuous speech signal to be sent;
determining a speech synthesis parameter model for each synthesis unit according to the text content and the continuous speech signal;
splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;
determining the sequence-number string corresponding to the speech synthesis parameter model sequence;
sending the sequence-number string to a receiving end, so that the receiving end recovers the continuous speech signal according to the sequence-number string.
An embodiment of the present invention also provides a speech signal transmission system, comprising:
a text acquisition module, configured to determine the text content corresponding to a continuous speech signal to be sent;
a parameter model determination module, configured to determine a speech synthesis parameter model for each synthesis unit according to the text content and the continuous speech signal;
a splicing module, configured to splice the speech synthesis parameter models of the synthesis units into a speech synthesis parameter model sequence;
a sequence-number string determination module, configured to determine the sequence-number string corresponding to the speech synthesis parameter model sequence;
a sending module, configured to send the sequence-number string to a receiving end, so that the receiving end recovers the continuous speech signal according to the sequence-number string.
The speech signal transmission method and system provided by the embodiments of the present invention use statistical-model-based coding, whose processing is independent of the speech sampling rate. They significantly reduce the transmission bit rate and traffic consumption while minimizing the loss in recovered sound quality, resolving the inability of traditional speech coding methods to balance sound quality against traffic, and improving the communication experience of users in the mobile Internet era.
Accompanying drawing explanation
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below cover only some embodiments of the present invention; a person of ordinary skill in the art may derive other drawings from them.
Fig. 1 is a flowchart of the speech signal transmission method of an embodiment of the present invention;
Fig. 2 is a flowchart of determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention;
Fig. 3 is a flowchart of constructing a binary decision tree in an embodiment of the present invention;
Fig. 4 is a schematic diagram of a binary decision tree in an embodiment of the present invention;
Fig. 5 is a flowchart of jointly optimizing the initial fundamental frequency models in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the speech signal transmission system of an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the parameter model determination module in an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of the binary decision tree construction module in the speech signal transmission system in an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a first optimization unit in an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a second optimization unit in an embodiment of the present invention.
Detailed description of the invention
To make the solutions of the embodiments of the present invention better understood by those skilled in the art, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings and implementations.
To address the problems that traditional coding schemes need a higher bit rate and consume heavy traffic when processing wideband or ultra-wideband speech, the embodiments of the present invention provide a speech signal transmission method and system applicable to coding speech of various types (e.g., ultra-wideband speech at a 16 kHz sampling rate, narrowband speech at an 8 kHz sampling rate), achieving signal transmission at an extremely low bit rate while minimizing the loss in recovered sound quality.
As shown in Fig. 1, the speech signal transmission method of an embodiment of the present invention comprises the following steps:
Step 101: determine the text content corresponding to the continuous speech signal to be sent.
Specifically, the text content can be obtained automatically by a speech recognition algorithm, or, of course, by manual annotation. In addition, to further ensure the correctness of the text content obtained by speech recognition, the recognized text content may also be corrected by manual editing.
Step 102: determine the speech synthesis parameter model of each synthesis unit according to the text content and the continuous speech signal.
A synthesis unit is the preset minimal synthesis object, such as a syllable unit, a phoneme unit, or even a state unit of a phoneme HMM.
To minimize the loss of recovered sound quality at the receiving end and enable it to recover the continuous speech signal by speech synthesis, the speech synthesis parameter models that the sending end obtains from the original speech signal should match the characteristics of the original speech signal as closely as possible, reducing the loss incurred by signal compression and recovery.
Specifically, the continuous speech signal can be segmented according to the text content to obtain the speech segment corresponding to each synthesis unit, and thereby the duration of each synthesis unit, and the speech synthesis parameter models can be initialized; the collected speech signal is then used to jointly optimize the initialized speech synthesis parameter models. The detailed process is described below.
Step 103: splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence.
Step 104: determine the sequence-number string corresponding to the speech synthesis parameter model sequence.
Step 105: send the sequence-number string to the receiving end, so that the receiving end recovers the continuous speech signal according to the sequence-number string.
Correspondingly, after receiving the sequence-number string sent by the sender, the receiver can retrieve the speech synthesis parameter model sequence from a codebook according to the sequence-number string.
Each speech synthesis parameter model has a unique sequence number, and the sender and receiver both keep an identical codebook containing all speech synthesis parameter models. After receiving the sequence-number string, the receiver can therefore fetch from the codebook the speech synthesis parameter model corresponding to each sequence number and splice these models into the speech synthesis parameter model sequence. Speech synthesis parameter sequences are then determined from the model sequence, and the speech signal is recovered by speech synthesis.
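As a minimal sketch, the receiver-side recovery from a shared codebook amounts to a lookup of each sequence number. The codebook contents below are invented for illustration, and each "model" is reduced to a vector of fundamental frequency means:

```python
# Shared codebook: sequence number -> synthesis parameter model.
# Entries are illustrative; a real codebook holds statistical models.
CODEBOOK = {
    0: [110.0, 112.0],   # e.g. fundamental frequency means (Hz) for one state
    1: [180.0, 175.0],
    2: [95.0, 96.5],
}

def decode(sequence_numbers):
    """Map a received sequence-number string back to a model sequence."""
    return [CODEBOOK[n] for n in sequence_numbers]

received = [2, 0, 1]               # sequence-number string from the channel
model_sequence = decode(received)  # spliced speech synthesis parameter models
```

Because both ends hold the same codebook, only the short sequence-number string crosses the channel, which is what makes the extremely low bit rate possible.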
The speech signal transmission method of the embodiment of the present invention uses statistical-model-based coding, whose processing is independent of the speech sampling rate: coding 16 kHz ultra-wideband speech incurs no additional bit-rate cost, yielding good sound quality at a low coding rate. Take a typical Chinese speech fragment as an example: its effective speech lasts 10 s and contains 80 initials and finals (phonemes); each phoneme is counted as 5 fundamental frequency states, 5 spectrum states, and 1 duration state, and each state is coded with 1 byte (8 bits). The bit rate m is then m = [80 × (5 + 5 + 1)] × 8 bit / 10 s = 704 b/s, below 1 kb/s, which qualifies as an extremely-low-bit-rate coding method; the rate is far below the coding standards currently prevailing in the speech communication field, and network traffic will be substantially reduced. Compared with current mainstream speech coding methods in the communications field, the coding scheme of the present method can process ultra-wideband speech (16 kHz sampling rate) with higher sound quality, and has a lower bit rate (below 1 kb/s), effectively reducing network traffic.
The speech signal transmission method of the embodiment of the present invention, by extracting the speech synthesis parameter models corresponding to the continuous speech signal and synthesizing the signal from them, achieves enormous compression of the speech signal while minimizing signal loss, i.e., it effectively reduces signal distortion.
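The back-of-the-envelope bit-rate figure above can be reproduced directly from the quantities the example states:

```python
# Bit-rate estimate for the 10 s Chinese fragment described in the text.
phonemes = 80                     # initials/finals in the fragment
states_per_phoneme = 5 + 5 + 1    # fundamental frequency + spectrum + duration states
bits_per_state = 8                # one byte codes each state's sequence number
duration_s = 10                   # effective speech length in seconds

bitrate = phonemes * states_per_phoneme * bits_per_state / duration_s
print(bitrate)  # 704.0 (b/s), well under 1 kb/s
```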
As shown in Fig. 2, one flow of determining the speech synthesis parameter model of each synthesis unit in the embodiment of the present invention comprises the following steps:
Step 201: segment the continuous speech signal into speech segments according to the text content, obtaining the speech segment corresponding to each synthesis unit.
Specifically, the continuous speech signal can be force-aligned against the acoustic models corresponding to the preset synthesis units, i.e., the speech signal is decoded by speech recognition against the acoustic model sequence, thereby obtaining the speech segment corresponding to each synthesis unit.
It should be noted that the synthesis unit can be chosen at different granularities according to the application requirements. In general, if a lower bit rate is required, a larger speech unit is selected, such as a syllable unit or phoneme unit; conversely, if higher sound quality is required, a smaller speech unit can be selected, such as a model state unit or feature stream unit.
When acoustic models based on HMMs (Hidden Markov Models) are used, each state of the HMM can further be chosen as a synthesis unit, and the corresponding speech segments obtained at the state level. For each state, the corresponding fundamental frequency model and spectrum model are then determined from the fundamental frequency binary decision tree and spectrum binary decision tree associated with it. The speech synthesis parameter models obtained in this way describe the characteristics of the speech signal more finely.
Step 202: determine in turn the duration of the speech segment corresponding to each synthesis unit and its initial speech synthesis parameter models, which comprise an initial fundamental frequency model and an initial spectrum model, and obtain the fundamental frequency model sequence and spectrum model sequence corresponding to the continuous speech signal.
Specifically, the fundamental frequency binary decision tree corresponding to the synthesis unit currently under consideration is first obtained; the synthesis unit is subjected to text parsing to obtain its contextual information, such as phoneme identity, tone, part of speech, and phrase-level context. Then a path decision is made in the fundamental frequency binary decision tree according to the contextual information to reach the corresponding leaf node, and the fundamental frequency model of that leaf node is taken as the fundamental frequency model of the synthesis unit.
The path decision process is as follows:
according to the contextual information of the synthesis unit, the splitting question at each node is answered in turn, starting from the root node of the fundamental frequency binary decision tree; a top-down matching path is obtained from the answers, and the leaf node is reached along the matching path.
Similarly, a leaf node can be found by querying the spectrum binary decision tree corresponding to the synthesis unit currently under consideration, and the spectrum model of that leaf node taken as the initial spectrum model of the unit. Specifically, the spectrum binary decision tree corresponding to the synthesis unit is first obtained, and the synthesis unit is subjected to text parsing to obtain its contextual information. Then a path decision is made in the spectrum binary decision tree according to the contextual information to reach the corresponding leaf node, and the spectrum model of that leaf node is taken as the initial spectrum model of the synthesis unit.
The path decision process is as follows: according to the contextual information of the synthesis unit, the splitting question at each node is answered in turn, starting from the root node of the spectrum binary decision tree; a top-down matching path is obtained from the answers, and the decision yields the corresponding leaf node.
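The top-down path decision described above can be sketched as a simple tree walk. The tree layout, question, and model names below are invented for illustration:

```python
# A node either asks a splitting question (internal) or holds a model (leaf).
class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question  # predicate over the contextual information
        self.yes, self.no = yes, no
        self.model = model        # synthesis parameter model at a leaf

def path_decision(node, context):
    """Answer each node's question in turn, from root to a leaf."""
    while node.model is None:
        node = node.yes if node.question(context) else node.no
    return node.model

# Toy tree: the root asks whether the right-adjacent phoneme is a nasal.
tree = Node(
    question=lambda ctx: ctx["right_phone_nasal"],
    yes=Node(model="gauss_A"),
    no=Node(model="gauss_B"),
)
model = path_decision(tree, {"right_phone_nasal": True})
```

The same traversal serves both the fundamental frequency tree and the spectrum tree; only the questions and leaf models differ.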
It should be noted that the fundamental frequency model sequence corresponding to the continuous speech signal is simply the sequence of initial fundamental frequency models of the synthesis units; likewise, the spectrum model sequence corresponding to the continuous speech signal is the sequence of their initial spectrum models.
Step 203: use the continuous speech signal and the fundamental frequency model sequence to jointly optimize the initial fundamental frequency model of each synthesis unit, obtaining the fundamental frequency model of each synthesis unit.
Step 204: use the continuous speech signal and the spectrum model sequence to jointly optimize the initial spectrum model of each synthesis unit, obtaining the spectrum model of each synthesis unit.
In the embodiments of the present invention, the quality of the initial speech synthesis parameter models of the synthesis units is directly related to the construction of the binary decision trees (comprising the fundamental frequency binary decision tree and the spectrum binary decision tree). In the embodiments of the present invention, the binary decision tree is built by a split-based clustering method, growing the tree from the root node downward.
As shown in Fig. 3, the flow of constructing a binary decision tree in the embodiment of the present invention comprises the following steps:
Step 301: obtain training data.
Specifically, a large amount of speech training data can be collected and annotated with text; according to the annotated text content, the speech is segmented into speech segments of basic speech units or even synthesis units (e.g., state units of the basic speech unit models), obtaining the speech segment set corresponding to each synthesis unit; the speech segments in the set corresponding to a synthesis unit serve as the training data of that synthesis unit.
Step 302: extract the synthesis parameters of the speech segment set corresponding to the synthesis unit from the training data.
The synthesis parameters include fundamental frequency features, spectrum features, and the like.
Step 303: initialize the binary decision tree corresponding to the synthesis unit according to the extracted synthesis parameters, and set the root node as the node currently under consideration.
Initializing the binary decision tree means building a binary decision tree containing only the root node.
Step 304: judge whether the node currently under consideration needs to be split. If so, perform step 305; otherwise perform step 306.
A remaining question is selected from the preset question set to attempt splitting the data of the current node into child nodes. A remaining question is one that has not yet been asked.
Specifically, the sample concentration of the current node can first be computed, i.e., the degree of scatter of the samples in its speech segment set. In general, the greater the scatter, the more likely the node is to be split, and conversely the less likely. The sample variance can be used to measure the sample concentration of a node, i.e., the mean of the distances (or squared distances) of all samples under the node from the class center. The sample concentration of the child nodes after splitting is then computed, and the question yielding the greatest drop in sample concentration is selected as the optimal question.
A split is then attempted according to the optimal question to obtain child nodes. If the drop in concentration achieved by the optimal question is below a set threshold, or if the training data in a child node after splitting falls below a set threshold, the current node is not split further.
Step 305: split the node currently under consideration, obtaining the child nodes after splitting and the training data corresponding to each child node. Then perform step 307.
Specifically, the node currently under consideration can be split according to the optimal question.
Step 306: mark the node currently under consideration as a leaf node.
Step 307: judge whether any unvisited non-leaf node remains in the binary decision tree. If so, perform step 308; otherwise perform step 309.
Step 308: take the next unvisited non-leaf node as the node currently under consideration. Then return to step 304.
Step 309: output the binary decision tree.
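The split decision in steps 304-305 can be sketched with variance as the scatter measure. The data, the question, and the use of one-dimensional samples are illustrative assumptions, not the patent's actual features:

```python
# Variance-reduction split selection, as in step 304.
import statistics

def scatter(samples):
    """Mean squared distance of 1-D samples from their mean (sample variance)."""
    if len(samples) < 2:
        return 0.0
    return statistics.pvariance(samples)

def best_split(data, questions):
    """Pick the unused question giving the largest drop in weighted scatter."""
    base, best = scatter([x for x, _ in data]), None
    for q in questions:
        yes = [x for x, ctx in data if q(ctx)]
        no = [x for x, ctx in data if not q(ctx)]
        if not yes or not no:          # question does not divide the data
            continue
        child = (scatter(yes) * len(yes) + scatter(no) * len(no)) / len(data)
        drop = base - child
        if best is None or drop > best[0]:
            best = (drop, q)
    return best

# Toy data: (fundamental frequency sample, context) pairs.
data = [(100.0, {"nasal": True}), (102.0, {"nasal": True}),
        (180.0, {"nasal": False}), (178.0, {"nasal": False})]
questions = [lambda ctx: ctx["nasal"]]
drop, q = best_split(data, questions)   # node splits only if drop > threshold
```

In the full flow, a node is marked as a leaf when the best drop falls below the threshold or a child would receive too little training data.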
It should be noted that, in the embodiments of the present invention, both the fundamental frequency binary decision tree and the spectrum binary decision tree can be built according to the flow shown in Fig. 3.
Fig. 4 is a schematic diagram of a binary decision tree in the embodiment of the present invention.
Fig. 4 illustrates the construction of the binary decision tree for the third state of the phoneme "*-aa+". As shown in Fig. 4, when the root node is split, its training data can be divided according to the answer to the preset question "is the right-adjacent phoneme a nasal"; subsequently, when the next layer of nodes is split, e.g., the left node, its training data can be divided further according to the answer to the preset question "is the left-adjacent phoneme a voiced consonant". When a node can no longer be split, it is set as a leaf node, and its training data are used to train a statistical model, such as a Gaussian model; this statistical model serves as the synthesis parameter model of that leaf node.
Obviously, in the embodiment shown in Fig. 2, the selection of the initial speech synthesis parameter models depends on binary decision trees based on text analysis, e.g., on the phoneme classes in the context of the synthesis unit under consideration and the pronunciation type of the current phoneme; in this way the initial speech synthesis parameter models can be obtained conveniently and efficiently.
Further, based on the principle of minimizing the loss between the actual speech signal and the speech signal synthesized from the coding models, the embodiments of the present invention also jointly optimize the initial fundamental frequency models and initial spectrum models. The joint optimization process is elaborated below.
As it is shown in figure 5, be the flow chart that in the embodiment of the present invention, initial fundamental frequency model is carried out combined optimization, including following Step:
Step 501, extracts the original fundamental frequency characteristic sequence that continuous speech signal is corresponding.
Step 502, obtains first synthesis unit synthesis unit for current optimization.
Step 503, obtains initial fundamental frequency model corresponding to the current synthesis unit optimized and relevant fundamental frequency model set, institute State all or part of leaf segment of fundamental frequency binary decision tree corresponding to synthesis unit that relevant fundamental frequency model set includes currently optimizing Point.
Step 504, according to described original fundamental frequency characteristic sequence select from described relevant fundamental frequency model set described initially The optimization model of fundamental frequency model.
It is to say, according to described original fundamental frequency characteristic sequence and described relevant fundamental frequency model set to described initial fundamental frequency Model carries out combined optimization.
Specifically, the fundamental frequency models in the related fundamental frequency model set can in turn replace the corresponding initial fundamental frequency model in the fundamental frequency model sequence, each yielding a new fundamental frequency model sequence; a new synthesized fundamental frequency feature sequence is then determined from the new model sequence; the distance between the new fundamental frequency feature sequence and the original fundamental frequency feature sequence is computed; and the fundamental frequency model with the minimum distance is selected as the preferred model for the initial fundamental frequency model.
When determining the new synthesized fundamental frequency feature sequence from the new fundamental frequency model sequence, the fundamental frequency model parameters can specifically be determined from the new fundamental frequency model sequence and the duration sequence of the synthesis units, generating the new synthesized fundamental frequency feature sequence.
For example, the new synthesized fundamental frequency feature sequence is obtained according to the following formula:
O_max = argmax_O P(O | λ, T)
where O is the feature sequence, λ is the given fundamental frequency model sequence, and T is the duration sequence corresponding to the synthesis units.
O_max is the finally generated fundamental frequency feature sequence: within the range of the unit duration sequence T, the fundamental frequency feature sequence O_max with the maximum likelihood given the fundamental frequency model sequence λ is sought.
When computing the distance between the new fundamental frequency feature sequence and the original fundamental frequency feature sequence, the Euclidean distance can be used, namely:
D(O, C) = Σ_{i=1}^{N} (O_i − C_i)^T (O_i − C_i)
where O_i and C_i are the i-th original fundamental frequency feature vector and the i-th new fundamental frequency feature vector, respectively.
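The replace-synthesize-compare loop of step 504 can be sketched as follows. The "synthesis" here is simplified to holding each model's mean over its duration, standing in for the maximum-likelihood parameter generation O_max = argmax_O P(O | λ, T); all data are invented:

```python
def synthesize(model_sequence, durations):
    """Generate a 1-D feature trajectory by holding each model mean for its duration."""
    traj = []
    for mean, frames in zip(model_sequence, durations):
        traj.extend([mean] * frames)
    return traj

def distance(o, c):
    """Euclidean-style distance: sum of squared per-frame differences."""
    return sum((oi - ci) ** 2 for oi, ci in zip(o, c))

def optimize_unit(index, model_seq, candidates, original, durations):
    """Try each candidate model at position `index`; keep the one whose
    synthesized trajectory is closest to the original feature sequence."""
    best_model, best_d = None, float("inf")
    for cand in candidates:
        trial = list(model_seq)
        trial[index] = cand          # replace the initial model with a candidate
        d = distance(synthesize(trial, durations), original)
        if d < best_d:
            best_model, best_d = cand, d
    return best_model

durations = [2, 2]                       # frames per synthesis unit
original = [100.0, 100.0, 150.0, 150.0]  # original fundamental frequency track
model_seq = [110.0, 150.0]               # initial model means per unit
best = optimize_unit(0, model_seq, [90.0, 101.0, 120.0], original, durations)
```

Running the loop over every synthesis unit in turn, replacing each initial model with its preferred model, mirrors steps 505-508.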
Step 505: take the preferred model as the fundamental frequency model of the synthesis unit currently being optimized, and replace the corresponding initial fundamental frequency model in the fundamental frequency model sequence with the preferred model.
Step 506: judge whether any synthesis unit remains unoptimized. If so, perform step 507; otherwise perform step 508.
Step 507: take the next synthesis unit as the synthesis unit currently being optimized. Then return to step 503.
Step 508: output the fundamental frequency model of each synthesis unit.
As noted above, the related fundamental frequency model set may comprise all leaf nodes of the fundamental frequency binary decision tree corresponding to the synthesis unit. Considering, however, that the number of leaf nodes is often large and comparing them one by one would consume substantial computation, which is unfavorable to the real-time requirement of coding, a subset of leaf nodes that are more likely to be preferred may instead be selected from all leaf nodes as the related fundamental frequency model set, to participate in the subsequent fundamental frequency model optimization. The detailed process can be as follows:
(1) original fundamental frequency characteristic sequence corresponding to described synthesis unit and all leaves of fundamental frequency binary decision tree are first calculated Likelihood score between the fundamental frequency model of node.
Suppose the original fundamental frequency feature sequence is o = (o_1, o_2, …, o_N) (N being the number of frames of the voice signal), and the fundamental frequency model currently under investigation is λ_j (j = 1, …, J, J being the size of the whole model set); then the likelihood between the two is:
P(o | λ_j) = Π_{i=1}^{N} (2π)^{−D/2} |Σ_j|^{−1/2} exp(−(1/2)(o_i − m_j)^T Σ_j^{−1} (o_i − m_j));
Wherein D is the dimension of the feature sequence o, m_j is the mean of the fundamental frequency model, and Σ_j is the corresponding variance.
In this embodiment, a Gaussian model is selected as the synthetic parameter model, so the model is determined by the two parameters of mean and variance.
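The Gaussian likelihood above can be sketched as follows; the log domain, the diagonal-variance assumption, and all names are illustrative choices for this sketch, not taken from the patent:

```python
import numpy as np

def log_likelihood(o: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Log-likelihood of an N x D feature sequence o under one Gaussian
    with diagonal variance: sum_i log N(o_i; mean, var)."""
    D = o.shape[1]
    per_frame = -0.5 * (D * np.log(2 * np.pi)
                        + np.sum(np.log(var))
                        + np.sum((o - mean) ** 2 / var, axis=1))
    return float(np.sum(per_frame))

# Two 1-dimensional frames scored against a standard Gaussian model
o = np.array([[0.0], [1.0]])
print(log_likelihood(o, mean=np.array([0.0]), var=np.array([1.0])))
```

Working in the log domain keeps the product over frames numerically stable, which matters for the negative thresholds (e.g. −200) discussed below.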
(2) Select the fundamental frequency models with the highest likelihood to form the relevant fundamental frequency model set.
Specifically, the M fundamental frequency models with the highest likelihood may be selected as preselected models, or all fundamental frequency models whose likelihood exceeds a predetermined threshold may be selected as preselected models. The parameter M (usually a positive integer, such as 50) and the threshold (usually negative in a log-likelihood setting, such as -200) are preset by the system, and control is typically exercised through the number M of preselected models.
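A minimal sketch of this preselection step, combining the two rules just described (top-M plus a log-likelihood floor); the function and variable names are hypothetical:

```python
import numpy as np

def preselect(log_likelihoods, M=50, threshold=-200.0):
    """Return indices of preselected models: the top-M by log-likelihood,
    further restricted to those above the threshold."""
    order = np.argsort(log_likelihoods)[::-1]   # indices, best score first
    top_m = order[:M]
    return [int(j) for j in top_m if log_likelihoods[j] > threshold]

# Five candidate leaf-node models scored against one unit's F0 sequence
scores = np.array([-150.0, -250.0, -120.0, -300.0, -180.0])
print(preselect(scores, M=3))  # → [2, 0, 4]
```

With M small relative to the number of leaf nodes, only the surviving candidates enter the subsequent synthesize-and-compare loop, which is what makes real-time encoding feasible.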
It should be noted that the foregoing synthesis unit may take different sizes according to different application demands. In general, if the requirement on bit rate is stricter, a larger synthesis unit is selected, such as a syllable unit or a phoneme unit; conversely, if the requirement on tone quality is stricter, a smaller synthesis unit may be selected, such as a state unit of the model or a feature stream unit. Under a setting that uses an HMM-based acoustic model, each state of the HMM model may further be chosen as the basic synthesis unit, and the corresponding voice snippet obtained on the state layer. In this way, the obtained synthetic parameter model can describe the features of the voice signal more finely, further improving voice signal transmission quality.
In addition, in the embodiment of the present invention, the process of jointly optimizing the initial spectrum model is similar to the above process of jointly optimizing the initial fundamental frequency model, and is not described in detail here.
It can be seen that the speech signal transmission method of the embodiment of the present invention significantly reduces the transmission bit rate while ensuring minimal loss of recovered tone quality, reduces traffic consumption, solves the problem that traditional voice coding methods cannot balance tone quality and traffic, and improves the user communication experience in the mobile network era.
Correspondingly, an embodiment of the present invention also provides a voice signal transmission system; Figure 6 is a structural schematic diagram of this system.
In this embodiment, the system includes:
Text acquisition module 601, configured to determine the text content corresponding to a continuous voice signal to be sent;
Parameter model determination module 602, configured to determine the speech synthesis parameter model of each synthesis unit according to the text content and the continuous voice signal;
Concatenation module 603, configured to splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;
Sequence number string determination module 604, configured to determine the sequence number string corresponding to the speech synthesis parameter model sequence;
Sending module 605, configured to send the sequence number string to a receiving terminal, so that the receiving terminal recovers the continuous voice signal according to the sequence number string.
In practical applications, the above text acquisition module 601 may obtain the text content automatically through a speech recognition algorithm, or may obtain the text content by way of manual annotation. For this purpose, a voice recognition unit and/or an annotation information acquisition unit may be provided in the text acquisition module 601, so that the user can select different ways to obtain the text content corresponding to the continuous voice signal to be sent. The voice recognition unit is configured to determine, through a speech recognition algorithm, the text content corresponding to the continuous voice signal to be sent; the annotation information acquisition unit is configured to obtain, by way of manual annotation, the text content corresponding to the continuous voice signal to be sent.
The synthesis unit is a preset minimum synthesis object, such as a syllable unit, a phoneme unit, or even a state unit in the HMM model of a phoneme.
In order to reduce the loss of recovered tone quality at the receiving terminal as much as possible and enable the receiving terminal to recover the continuous voice signal by way of speech synthesis, the speech synthesis parameter model obtained by the parameter model determination module 602 should match the features of the original voice signal as closely as possible, so as to reduce the loss caused by signal compression and recovery. Specifically, the continuous voice signal may be segmented into voice snippets according to the text content, obtaining the voice snippet corresponding to each synthesis unit, and thereby the duration, initial fundamental frequency model, and initial spectrum model of each synthesis unit; then the collected voice signal is used to jointly optimize the initialized speech synthesis parameter models, obtaining the fundamental frequency model and the spectrum model of each synthesis unit.
Correspondingly, the recipient can obtain the speech synthesis parameter model sequence from a codebook according to the sequence number string. Each speech synthesis parameter model has a unique sequence number, and the sender and the recipient both keep an identical codebook containing all speech synthesis parameter models. Therefore, after receiving the sequence number string, the recipient can obtain from the codebook the speech synthesis parameter model corresponding to each sequence number, and splice these speech synthesis parameter models to obtain the speech synthesis parameter model sequence. Then, a speech synthesis parameter sequence is determined according to the speech synthesis parameter model sequence, and the voice signal is recovered by way of speech synthesis.
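The shared-codebook lookup on the recipient side can be illustrated with a toy sketch; the codebook contents, the comma-separated string format, and all names here are assumptions for illustration only:

```python
# Hypothetical codebook shared by sender and recipient: sequence number -> model.
codebook = {
    0: {"mean": 200.0, "var": 25.0},
    1: {"mean": 180.0, "var": 16.0},
    2: {"mean": 220.0, "var": 36.0},
}

def encode(model_ids):
    """Sender side: a model sequence is transmitted only as its sequence
    number string, which is what keeps the bit rate extremely low."""
    return ",".join(str(i) for i in model_ids)

def decode(serial_string):
    """Recipient side: look each sequence number up in the codebook and
    splice the models back into a speech synthesis parameter model sequence."""
    return [codebook[int(tok)] for tok in serial_string.split(",")]

sent = encode([2, 0, 1])
print(sent)              # → 2,0,1
print(decode(sent)[0])   # → {'mean': 220.0, 'var': 36.0}
```

Because only integer sequence numbers cross the channel, the payload size is independent of the model dimensionality; correctness depends entirely on both ends holding the identical codebook.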
It can be seen that the voice signal transmission system of the embodiment of the present invention significantly reduces the transmission bit rate while ensuring minimal loss of recovered tone quality, reduces traffic consumption, solves the problem that traditional voice coding methods cannot balance tone quality and traffic, and improves the user communication experience in the mobile network era.
Figure 7 is a structural schematic diagram of the parameter model determination module in the embodiment of the present invention.
The parameter model determination module includes:
Cutting unit 701, configured to segment the continuous voice signal into voice snippets according to the text content, obtaining the voice snippet corresponding to each synthesis unit.
Specifically, forced alignment may be performed between the continuous voice signal and the acoustic model sequence corresponding to the synthesis units in the text content, that is, speech recognition decoding of the voice signal against the acoustic model sequence, thereby obtaining the voice snippet corresponding to each synthesis unit.
It should be noted that the synthesis unit may take different sizes according to different application demands. In general, if the requirement on bit rate is stricter, a larger voice unit is selected, such as a syllable unit or a phoneme unit; conversely, if the requirement on tone quality is stricter, a smaller voice unit may be selected, such as a state unit of the model or a feature stream unit. Under a setting that uses an acoustic model based on an HMM (Hidden Markov Model), each state of the HMM model may further be chosen as the synthesis unit, and the corresponding voice snippet obtained on the state layer. Subsequently, for each state, the corresponding fundamental frequency model and spectrum model are determined from its fundamental frequency binary decision tree and spectrum binary decision tree, respectively. In this way, the obtained speech synthesis parameter model can describe the features of the voice signal more finely.
Duration determination unit 702, configured to determine in turn the duration of the voice snippet corresponding to each synthesis unit.
Model determination unit 703, configured to determine in turn the initial speech synthesis parameter model corresponding to each synthesis unit, the initial speech synthesis parameter model including: an initial fundamental frequency model and an initial spectrum model.
Model sequence acquisition unit 704, configured to obtain the fundamental frequency model sequence and the spectrum model sequence corresponding to the continuous voice signal.
First optimization unit 705, configured to jointly optimize the initial fundamental frequency model corresponding to each synthesis unit by using the continuous voice signal and the fundamental frequency model sequence, obtaining the fundamental frequency model of each synthesis unit.
Second optimization unit 706, configured to jointly optimize the initial spectrum model corresponding to each synthesis unit by using the continuous voice signal and the spectrum model sequence, obtaining the spectrum model of each synthesis unit.
It should be noted that, in the embodiment of the present invention, the model determination unit 703 may determine the initial speech synthesis parameter model corresponding to each synthesis unit based on a binary decision tree.
To this end, in the embodiment of the present invention, the system may further include: a binary decision tree building module.
Figure 8 is a structural schematic diagram of the binary decision tree building module in the voice signal transmission system of the embodiment of the present invention.
The binary decision tree building module includes:
Training data acquisition unit 801, configured to obtain training data;
Parameter extraction unit 802, configured to extract from the training data the synthetic parameters of the voice snippet set corresponding to the synthesis unit, the synthetic parameters including: fundamental frequency features and spectrum features;
Initialization unit 803, configured to initialize the binary decision tree corresponding to the synthesis unit according to the synthetic parameters, that is, build a binary decision tree with only a root node;
Node review unit 804, configured to investigate each non-leaf node in turn, starting from the root node of the binary decision tree; if the node currently under investigation needs to be split, split it, obtaining the child nodes after splitting and the training data corresponding to the child nodes; otherwise, mark the node currently under investigation as a leaf node;
Binary decision tree output unit 805, configured to output the binary decision tree of the synthesis unit after the node review unit has investigated all non-leaf nodes.
In this embodiment, the training data acquisition unit 801 may specifically collect a large amount of voice training data and annotate it with text, then segment the voice snippets of the basic voice units or even the synthesis units (such as the state units of the basic voice unit model) according to the annotated text content, obtain the voice snippet set corresponding to each synthesis unit, and take the voice snippets in the voice snippet set corresponding to each synthesis unit as the training data corresponding to that synthesis unit.
When judging whether the node currently under investigation needs to be split, the above node review unit 804 may, according to the sample concentration degree of the node currently under investigation, select the question with the largest drop in sample concentration degree as the optimal splitting question and attempt a split, obtaining child nodes. If the drop in concentration degree produced by splitting according to the optimal splitting question is less than a set threshold, or the training data in a child node after splitting is less than a set threshold, it is determined that the node currently under investigation is not split further.
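The stopping rule above can be sketched as follows, using total variance as a stand-in for the patent's "sample concentration degree"; the impurity measure and all names are illustrative assumptions, not the patent's definitions:

```python
import numpy as np

def best_split(samples, questions, min_gain=1.0, min_samples=2):
    """Greedy node split: pick the yes/no question with the largest drop in
    total variance. Returns (question_index, yes_samples, no_samples), or
    None when the node should be marked as a leaf."""
    def impurity(x):
        return float(np.var(x) * len(x)) if len(x) else 0.0

    parent = impurity(samples)
    best = None
    for qi, q in enumerate(questions):
        yes = [s for s in samples if q(s)]
        no = [s for s in samples if not q(s)]
        gain = parent - impurity(yes) - impurity(no)
        if best is None or gain > best[0]:
            best = (gain, qi, yes, no)
    gain, qi, yes, no = best
    if gain < min_gain or len(yes) < min_samples or len(no) < min_samples:
        return None  # stop: insufficient gain or too little data per child
    return qi, yes, no

# Two clearly separated pitch clusters; one context question separates them.
data = [100.0, 102.0, 98.0, 200.0, 205.0, 195.0]
result = best_split(data, [lambda s: s < 150.0])
print(result[0])  # → 0
```

The two thresholds mirror the patent's two stopping conditions: a minimum gain in concentration degree and a minimum amount of training data per child node.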
The above investigation and splitting process may refer to the description in the above embodiment of the voice signal transmission method of the present invention, and is not repeated here.
It should be noted that, in the embodiment of the present invention, the fundamental frequency binary decision tree and the spectrum binary decision tree may both be established by this binary decision tree building module; their building processes are similar and are not described in detail one by one here.
In the embodiment of the present invention, the model determination unit 703 may include: an initial fundamental frequency model determination unit and an initial spectrum model determination unit (not shown).
The initial fundamental frequency model determination unit includes:
First acquisition unit, configured to obtain the fundamental frequency binary decision tree corresponding to the synthesis unit;
First parsing unit, configured to perform text parsing on the synthesis unit, obtaining the context information of the synthesis unit, such as the phoneme identity, tone, part of speech, prosodic layer, and other context information of the unit;
First decision unit, configured to perform path decision in the fundamental frequency binary decision tree according to the context information, obtaining the corresponding leaf node; the process of path decision is as follows:
According to the context information of the synthesis unit, answer the splitting question of each node in turn, starting from the root node of the fundamental frequency binary decision tree; obtain a top-down matching path according to the answers; and obtain the leaf node according to the matching path;
First output unit, configured to take the fundamental frequency model corresponding to the leaf node as the initial fundamental frequency model corresponding to the synthesis unit.
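The path decision described above can be sketched as a simple tree walk; the tree contents, context fields, and model names below are purely hypothetical:

```python
# Hypothetical binary decision tree: each internal node asks a yes/no question
# about the synthesis unit's context; each leaf stores a model identifier.
class Node:
    def __init__(self, question=None, yes=None, no=None, leaf_model=None):
        self.question = question      # function(context) -> bool; None at a leaf
        self.yes, self.no = yes, no
        self.leaf_model = leaf_model  # model id stored at a leaf

def path_decision(root, context):
    """Walk from the root, answering each node's splitting question from the
    context, until a leaf is reached; its model is the initial model."""
    node = root
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.leaf_model

tree = Node(question=lambda c: c["tone"] == 1,
            yes=Node(leaf_model="pitch_model_A"),
            no=Node(question=lambda c: c["pos"] == "noun",
                    yes=Node(leaf_model="pitch_model_B"),
                    no=Node(leaf_model="pitch_model_C")))

print(path_decision(tree, {"tone": 3, "pos": "noun"}))  # → pitch_model_B
```

The sequence of yes/no answers is exactly the top-down matching path of the patent text, and the reached leaf supplies the initial fundamental frequency (or spectrum) model.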
Similar to the initial fundamental frequency model determination unit, the initial spectrum model determination unit includes:
Second acquisition unit, configured to obtain the spectrum binary decision tree corresponding to the synthesis unit;
Second parsing unit, configured to perform text parsing on the synthesis unit, obtaining the context information of the synthesis unit;
Second decision unit, configured to perform path decision in the spectrum binary decision tree according to the context information, obtaining the corresponding leaf node; the process of path decision is as follows:
According to the context information of the synthesis unit, answer the splitting question of each node in turn, starting from the root node of the spectrum binary decision tree; obtain a top-down matching path according to the answers; and obtain the leaf node by decision;
Second output unit, configured to take the spectrum model corresponding to the leaf node as the initial spectrum model corresponding to the synthesis unit.
It should be noted that the above initial fundamental frequency model determination unit and initial spectrum model determination unit may each be realized by an independent physical unit, or may both be realized by one physical unit; the embodiment of the present invention does not limit this.
In the embodiment of the present invention, the above first optimization unit 705 and second optimization unit 706 jointly optimize the initial fundamental frequency model and the initial spectrum model, respectively, based on the principle of minimizing the loss between the actual voice signal and the voice signal synthesized from the coding models.
Figure 9 is a structural schematic diagram of the first optimization unit in the embodiment of the present invention.
The first optimization unit includes:
Fundamental frequency feature sequence extraction unit 901, configured to extract the original fundamental frequency feature sequence corresponding to the continuous voice signal;
First acquisition unit 902, configured to obtain in turn the initial fundamental frequency model corresponding to each synthesis unit and a relevant fundamental frequency model set, the relevant fundamental frequency model set including all or part of the leaf nodes of the fundamental frequency binary decision tree corresponding to the synthesis unit;
First selection unit 903, configured to select the preferred model of the initial fundamental frequency model from the relevant fundamental frequency model set according to the original fundamental frequency feature sequence; that is, to jointly optimize the initial fundamental frequency model according to the original fundamental frequency feature sequence and the relevant fundamental frequency model set;
First replacement unit 904, configured to take the preferred model as the fundamental frequency model of the synthesis unit, and replace the corresponding initial fundamental frequency model in the fundamental frequency model sequence with the preferred model.
Wherein, the first selection unit 903 includes:
Fundamental frequency model sequence update unit, configured to select in turn a fundamental frequency model in the relevant fundamental frequency model set to replace the corresponding initial fundamental frequency model in the fundamental frequency model sequence, obtaining a new fundamental frequency model sequence; and to determine the synthesized new fundamental frequency feature sequence according to the new fundamental frequency model sequence;
First calculation unit, configured to calculate the distance between the new fundamental frequency feature sequence and the original fundamental frequency feature sequence;
Fundamental frequency model selection unit, configured to select the fundamental frequency model corresponding to the minimum distance as the preferred model of the initial fundamental frequency model.
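The substitute-synthesize-compare loop performed by the first selection unit can be sketched as follows; the toy "model" (a single mean pitch value) and all names are illustrative assumptions:

```python
import numpy as np

def optimize_unit(original_segment, candidates, synthesize):
    """For one synthesis unit: try each candidate model in place of the
    initial one, synthesize the fundamental frequency feature sequence, and
    keep the model whose synthesized sequence is closest (squared Euclidean
    distance) to the original sequence."""
    best_model, best_dist = None, np.inf
    for model in candidates:
        synth = synthesize(model)
        dist = float(np.sum((original_segment - synth) ** 2))
        if dist < best_dist:
            best_model, best_dist = model, dist
    return best_model

# Toy setup: a "model" is just a mean pitch; synthesis repeats it per frame.
original = np.array([200.0, 201.0, 199.0])
candidates = [180.0, 200.0, 220.0]
best = optimize_unit(original, candidates, lambda m: np.full(3, m))
print(best)  # → 200.0
```

In the full scheme the synthesize step would regenerate the whole sequence from the updated model sequence, so each unit's choice is evaluated in the context of its neighbors, which is what makes the optimization "joint".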
Figure 10 is a structural schematic diagram of the second optimization unit in the embodiment of the present invention.
The second optimization unit includes:
Spectrum feature sequence extraction unit 1001, configured to extract the original spectrum feature sequence corresponding to the continuous voice signal;
Second acquisition unit 1002, configured to obtain in turn the initial spectrum model corresponding to each synthesis unit and a relevant spectrum model set, the relevant spectrum model set including all or part of the leaf nodes of the spectrum binary decision tree corresponding to the synthesis unit;
Second selection unit 1003, configured to select the preferred model of the initial spectrum model from the relevant spectrum model set according to the original spectrum feature sequence;
Second replacement unit 1004, configured to take the preferred model as the spectrum model of the synthesis unit, and replace the corresponding initial spectrum model in the spectrum model sequence with the preferred model.
Wherein, the second selection unit 1003 includes:
Spectrum model sequence update unit, configured to select in turn a spectrum model in the relevant spectrum model set to replace the corresponding initial spectrum model in the spectrum model sequence, obtaining a new spectrum model sequence; and to determine the synthesized new spectrum feature sequence according to the new spectrum model sequence;
Second calculation unit, configured to calculate the distance between the new spectrum feature sequence and the original spectrum feature sequence;
Spectrum model selection unit, configured to select the spectrum model corresponding to the minimum distance as the preferred model of the initial spectrum model.
It should be noted that the above first optimization unit 705 and second optimization unit 706 may each be realized by an independent physical unit, or may both be realized by one physical unit; the embodiment of the present invention does not limit this.
It should be noted that the synthesis unit described in the embodiment of the present invention may take different sizes according to different application demands. In general, if the requirement on bit rate is stricter, a larger synthesis unit is selected, such as a syllable unit or a phoneme unit; conversely, if the requirement on tone quality is stricter, a smaller synthesis unit may be selected, such as a state unit of the model or a feature stream unit. Under a setting that uses an HMM-based acoustic model, each state of the HMM model may further be chosen as the basic synthesis unit, and the corresponding voice snippet obtained on the state layer. In this way, the obtained synthetic parameter model can describe the features of the voice signal more finely, further improving voice signal transmission quality.
It can be seen that the voice signal transmission system of the embodiment of the present invention significantly reduces the transmission bit rate while ensuring minimal loss of recovered tone quality, reduces traffic consumption, solves the problem that traditional voice coding methods cannot balance tone quality and traffic, and improves the user communication experience in the mobile network era.
The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively simple, and relevant parts may refer to the description of the method embodiment. The system embodiments described above are merely schematic: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to realize the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative work.
The embodiments of the present invention have been described in detail above; specific implementations are used herein to illustrate the present invention, and the description of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (19)

1. A speech signal transmission method, characterized by comprising:
determining the text content corresponding to a continuous voice signal to be sent;
determining the speech synthesis parameter model of each synthesis unit according to the text content and the continuous voice signal;
splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;
determining the sequence number string corresponding to the speech synthesis parameter model sequence;
sending the sequence number string to a receiving terminal, so that the receiving terminal recovers the continuous voice signal according to the sequence number string.
2. The method according to claim 1, characterized in that determining the text content corresponding to the continuous voice signal to be sent comprises:
determining, through a speech recognition algorithm, the text content corresponding to the continuous voice signal to be sent; or
obtaining, by way of manual annotation, the text content corresponding to the continuous voice signal to be sent.
3. The method according to claim 1, characterized in that determining the speech synthesis parameter model of each synthesis unit according to the text content and the continuous voice signal comprises:
segmenting the continuous voice signal into voice snippets according to the text content, obtaining the voice snippet corresponding to each synthesis unit;
determining in turn the duration and the initial speech synthesis parameter model of the voice snippet corresponding to each synthesis unit, the initial speech synthesis parameter model including: an initial fundamental frequency model and an initial spectrum model, and obtaining the fundamental frequency model sequence and the spectrum model sequence corresponding to the continuous voice signal;
jointly optimizing the initial fundamental frequency model corresponding to each synthesis unit by using the continuous voice signal and the fundamental frequency model sequence, obtaining the fundamental frequency model of each synthesis unit;
jointly optimizing the initial spectrum model corresponding to each synthesis unit by using the continuous voice signal and the spectrum model sequence, obtaining the spectrum model of each synthesis unit.
4. The method according to claim 3, characterized in that determining the initial fundamental frequency model corresponding to the synthesis unit comprises:
obtaining the fundamental frequency binary decision tree corresponding to the synthesis unit;
performing text parsing on the synthesis unit, obtaining the context information of the synthesis unit;
performing path decision in the fundamental frequency binary decision tree according to the context information, obtaining the corresponding leaf node;
taking the fundamental frequency model corresponding to the leaf node as the initial fundamental frequency model corresponding to the synthesis unit.
5. The method according to claim 3, characterized in that determining the initial spectrum model corresponding to the synthesis unit comprises:
obtaining the spectrum binary decision tree corresponding to the synthesis unit;
performing text parsing on the synthesis unit, obtaining the context information of the synthesis unit;
performing path decision in the spectrum binary decision tree according to the context information, obtaining the corresponding leaf node;
taking the spectrum model corresponding to the leaf node as the initial spectrum model corresponding to the synthesis unit.
6. The method according to claim 4 or 5, characterized in that the method further comprises: building the binary decision tree corresponding to the synthesis unit in the following manner:
obtaining training data;
extracting from the training data the synthetic parameters of the voice snippet set corresponding to the synthesis unit, the synthetic parameters including: fundamental frequency features or spectrum features;
initializing the binary decision tree corresponding to the synthesis unit according to the synthetic parameters;
starting from the root node of the binary decision tree, investigating each non-leaf node in turn;
if the node currently under investigation needs to be split, splitting it, obtaining the child nodes after splitting and the training data corresponding to the child nodes; otherwise, marking the node currently under investigation as a leaf node;
after all non-leaf nodes have been investigated, obtaining the binary decision tree of the synthesis unit.
7. The method according to claim 3, characterized in that jointly optimizing the initial fundamental frequency model corresponding to each synthesis unit by using the continuous voice signal and the fundamental frequency model sequence to obtain the fundamental frequency model of each synthesis unit comprises:
extracting the original fundamental frequency feature sequence corresponding to the continuous voice signal;
performing the following processing on each synthesis unit in turn:
obtaining the initial fundamental frequency model corresponding to the synthesis unit and a relevant fundamental frequency model set, the relevant fundamental frequency model set including all or part of the leaf nodes of the fundamental frequency binary decision tree corresponding to the synthesis unit;
selecting the preferred model of the initial fundamental frequency model from the relevant fundamental frequency model set according to the original fundamental frequency feature sequence;
taking the preferred model as the fundamental frequency model of the synthesis unit, and replacing the corresponding initial fundamental frequency model in the fundamental frequency model sequence with the preferred model.
8. The method according to claim 7, characterized in that selecting the preferred model of the initial fundamental frequency model from the relevant fundamental frequency model set according to the original fundamental frequency feature sequence comprises:
selecting in turn a fundamental frequency model in the relevant fundamental frequency model set to replace the corresponding initial fundamental frequency model in the fundamental frequency model sequence, obtaining a new fundamental frequency model sequence;
determining the synthesized new fundamental frequency feature sequence according to the new fundamental frequency model sequence;
calculating the distance between the new fundamental frequency feature sequence and the original fundamental frequency feature sequence;
selecting the fundamental frequency model corresponding to the minimum distance as the preferred model of the initial fundamental frequency model.
9. The method according to claim 3, characterized in that jointly optimizing the initial spectrum model corresponding to each synthesis unit by using the continuous voice signal and the spectrum model sequence to obtain the spectrum model of each synthesis unit comprises:
extracting the original spectrum feature sequence corresponding to the continuous voice signal;
performing the following processing on each synthesis unit in turn:
obtaining the initial spectrum model corresponding to the synthesis unit and a relevant spectrum model set, the relevant spectrum model set including all or part of the leaf nodes of the spectrum binary decision tree corresponding to the synthesis unit;
selecting the preferred model of the initial spectrum model from the relevant spectrum model set according to the original spectrum feature sequence;
taking the preferred model as the spectrum model of the synthesis unit, and replacing the corresponding initial spectrum model in the spectrum model sequence with the preferred model.
10. The method according to claim 9, wherein said selecting, according to said original spectrum feature sequence, a preferred model for said initial spectrum model from said related spectrum model set comprises:
selecting, in turn, each spectrum model in said related spectrum model set to replace the corresponding initial spectrum model in said spectrum model sequence, obtaining a new spectrum model sequence;
determining a new synthesized spectrum feature sequence according to said new spectrum model sequence;
calculating the distance between said new spectrum feature sequence and said original spectrum feature sequence;
selecting the spectrum model corresponding to the minimum distance as the preferred model for said initial spectrum model.
11. A speech signal transmission system, comprising:
a text acquisition module, configured to determine the text content corresponding to a continuous speech signal to be sent;
a parameter model determination module, configured to determine a speech synthesis parameter model of each synthesis unit according to said text content and said continuous speech signal;
a concatenation module, configured to splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;
a serial number string determination module, configured to determine the serial number string corresponding to said speech synthesis parameter model sequence;
a sending module, configured to send said serial number string to a receiving terminal, so that said receiving terminal recovers said continuous speech signal according to said serial number string.
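The serial number string of claim 11 is where the extremely low bit rate comes from: the channel carries only codebook indices of parameter models, never waveform samples, and the receiver holds the same codebook. A hypothetical sketch (the codebook contents and model names are illustrative, not from the patent):

```python
# Shared codebook: both transmitter and receiver hold the same ordered
# list of leaf-node parameter models (here represented by string labels).
codebook = ["f0_leaf_012", "f0_leaf_047", "spec_leaf_003", "spec_leaf_091"]
index_of = {model: i for i, model in enumerate(codebook)}

def encode(model_sequence):
    """Sender: replace each model in the sequence with its codebook index
    (the 'serial number string' that is actually transmitted)."""
    return [index_of[m] for m in model_sequence]

def decode(serial_numbers):
    """Receiver: look the parameter models back up from the shared codebook,
    then re-synthesize speech from the recovered model sequence."""
    return [codebook[i] for i in serial_numbers]

sequence = ["f0_leaf_047", "spec_leaf_091"]
serial = encode(sequence)
assert decode(serial) == sequence  # lossless w.r.t. the model sequence
```

With a codebook of, say, 65,536 models, each synthesis unit costs two bytes regardless of its duration; the quality loss is confined to the model-selection step, which claims 7–10 minimize against the original signal.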
12. The system according to claim 11, wherein said text acquisition module comprises:
a speech recognition unit, configured to determine, by a speech recognition algorithm, the text content corresponding to the continuous speech signal to be sent; or
an annotation information acquisition unit, configured to obtain, by means of manual annotation, the text content corresponding to the continuous speech signal to be sent.
13. The system according to claim 11, wherein said parameter model determination module comprises:
a segmentation unit, configured to segment said continuous speech signal into speech segments according to said text content, obtaining the speech segment corresponding to each synthesis unit;
a duration determination unit, configured to determine, in turn, the duration of the speech segment corresponding to each synthesis unit;
a model determination unit, configured to determine, in turn, the initial speech synthesis parameter model corresponding to each synthesis unit, said initial speech synthesis parameter model comprising an initial fundamental frequency model and an initial spectrum model;
a model sequence acquisition unit, configured to obtain the fundamental frequency model sequence and the spectrum model sequence corresponding to said continuous speech signal;
a first optimization unit, configured to jointly optimize, by using said continuous speech signal and said fundamental frequency model sequence, the initial fundamental frequency model corresponding to each synthesis unit, to obtain the fundamental frequency model of each synthesis unit;
a second optimization unit, configured to jointly optimize, by using said continuous speech signal and said spectrum model sequence, the initial spectrum model corresponding to each synthesis unit, to obtain the spectrum model of each synthesis unit.
14. The system according to claim 13, wherein said model determination unit comprises an initial fundamental frequency model determination unit and an initial spectrum model determination unit;
said initial fundamental frequency model determination unit comprises:
a first acquisition unit, configured to obtain the fundamental frequency binary decision tree corresponding to said synthesis unit;
a first parsing unit, configured to parse the text of said synthesis unit to obtain the contextual information of said synthesis unit;
a first decision unit, configured to perform a path decision in said fundamental frequency binary decision tree according to said contextual information, to obtain a corresponding leaf node;
a first output unit, configured to take the fundamental frequency model corresponding to said leaf node as the initial fundamental frequency model corresponding to said synthesis unit;
said initial spectrum model determination unit comprises:
a second acquisition unit, configured to obtain the spectrum binary decision tree corresponding to said synthesis unit;
a second parsing unit, configured to parse the text of said synthesis unit to obtain the contextual information of said synthesis unit;
a second decision unit, configured to perform a path decision in said spectrum binary decision tree according to said contextual information, to obtain a corresponding leaf node;
a second output unit, configured to take the spectrum model corresponding to said leaf node as the initial spectrum model corresponding to said synthesis unit.
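The path decision of claim 14 is an ordinary decision-tree walk: at each non-leaf node a yes/no question is asked about the unit's contextual information, and the leaf reached supplies the initial model. A minimal sketch — the tree shape, the context keys, and the questions are illustrative assumptions, not taken from the patent:

```python
class Node:
    """A binary decision tree node: non-leaf nodes carry a context question,
    leaf nodes carry a parameter model."""
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question   # predicate over the contextual information
        self.yes, self.no = yes, no
        self.model = model         # set only on leaf nodes

def path_decision(root, context):
    """Walk the tree from the root, answering each node's question from the
    contextual information, until a leaf (and its model) is reached."""
    node = root
    while node.model is None:
        node = node.yes if node.question(context) else node.no
    return node.model

# Toy tree: split first on tone, then on phrase position (hypothetical questions).
leaf_a, leaf_b, leaf_c = (Node(model="f0_model_A"),
                          Node(model="f0_model_B"),
                          Node(model="f0_model_C"))
root = Node(question=lambda c: c["tone"] == 1,
            yes=Node(question=lambda c: c["phrase_initial"],
                     yes=leaf_a, no=leaf_b),
            no=leaf_c)
```

The same traversal serves both trees of claim 14; only the question set and the models stored at the leaves differ between the fundamental frequency tree and the spectrum tree.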
15. The system according to claim 14, wherein said system further comprises a binary decision tree construction module, said binary decision tree construction module comprising:
a training data acquisition unit, configured to obtain training data;
a parameter extraction unit, configured to extract, from said training data, the synthesis parameters of the speech segment set corresponding to said synthesis unit, said synthesis parameters comprising a fundamental frequency feature or a spectrum feature;
an initialization unit, configured to initialize the binary decision tree corresponding to said synthesis unit according to said synthesis parameters;
a node examination unit, configured to examine each non-leaf node in turn, starting from the root node of said binary decision tree; if the currently examined node needs to be split, split it to obtain the child nodes after splitting and the training data corresponding to said child nodes; otherwise, mark the currently examined node as a leaf node;
a binary decision tree output unit, configured to obtain the binary decision tree of said synthesis unit after all non-leaf nodes have been examined.
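The construction in claim 15 can be sketched as a standard greedy splitting loop. In this sketch a node is split while the best threshold split reduces variance by at least `min_gain` — a hypothetical criterion standing in for whatever stopping rule (e.g. MDL) an implementation uses — and a leaf's model is simply the mean of its training samples, here reduced to a scalar feature for brevity:

```python
import numpy as np

def build_tree(samples, min_gain=1.0, min_size=2):
    """Greedy construction sketch: split a node at the threshold giving the
    largest variance reduction; if no split gains at least min_gain (or too
    few samples remain), mark the node as a leaf modeled by the sample mean."""
    samples = np.sort(np.asarray(samples, dtype=float))
    if len(samples) < min_size:
        return {"leaf": True, "model": samples.mean()}
    best = None
    for i in range(1, len(samples)):                  # candidate split points
        left, right = samples[:i], samples[i:]
        gain = (samples.var() * len(samples)
                - (left.var() * len(left) + right.var() * len(right)))
        if best is None or gain > best[0]:
            best = (gain, left, right, samples[i])
    gain, left, right, threshold = best
    if gain < min_gain:
        return {"leaf": True, "model": samples.mean()}
    return {"leaf": False, "threshold": threshold,    # child nodes inherit their
            "no": build_tree(left, min_gain, min_size),   # share of the training data
            "yes": build_tree(right, min_gain, min_size)}

tree = build_tree([1, 1, 1, 9, 9, 9])  # two clearly separated clusters
```

In the patent's setting the splits are on context questions rather than feature thresholds, and one such tree is built per synthesis unit, separately for fundamental frequency and spectrum features.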
16. The system according to claim 13, wherein said first optimization unit comprises:
a fundamental frequency feature sequence extraction unit, configured to extract the original fundamental frequency feature sequence corresponding to said continuous speech signal;
a first acquisition unit, configured to obtain, in turn, the initial fundamental frequency model corresponding to each synthesis unit and a related fundamental frequency model set, said related fundamental frequency model set comprising all or some of the leaf nodes of the fundamental frequency binary decision tree corresponding to said synthesis unit;
a first selection unit, configured to select, according to said original fundamental frequency feature sequence, a preferred model for said initial fundamental frequency model from said related fundamental frequency model set;
a first replacement unit, configured to take said preferred model as the fundamental frequency model of said synthesis unit, and replace the corresponding initial fundamental frequency model in said fundamental frequency model sequence with said preferred model.
17. The system according to claim 16, wherein said first selection unit comprises:
a fundamental frequency model sequence updating unit, configured to select, in turn, each fundamental frequency model in said related fundamental frequency model set to replace the corresponding initial fundamental frequency model in said fundamental frequency model sequence, obtaining a new fundamental frequency model sequence; and to determine a new synthesized fundamental frequency feature sequence according to said new fundamental frequency model sequence;
a first calculation unit, configured to calculate the distance between said new fundamental frequency feature sequence and said original fundamental frequency feature sequence;
a fundamental frequency model selection unit, configured to select the fundamental frequency model corresponding to the minimum distance as the preferred model for said initial fundamental frequency model.
18. The system according to claim 13, wherein said second optimization unit comprises:
a spectrum feature sequence extraction unit, configured to extract the original spectrum feature sequence corresponding to said continuous speech signal;
a second acquisition unit, configured to obtain, in turn, the initial spectrum model corresponding to each synthesis unit and a related spectrum model set, said related spectrum model set comprising all or some of the leaf nodes of the spectrum binary decision tree corresponding to said synthesis unit;
a second selection unit, configured to select, according to said original spectrum feature sequence, a preferred model for said initial spectrum model from said related spectrum model set;
a second replacement unit, configured to take said preferred model as the spectrum model of said synthesis unit, and replace the corresponding initial spectrum model in said spectrum model sequence with said preferred model.
19. The system according to claim 18, wherein said second selection unit comprises:
a spectrum model sequence updating unit, configured to select, in turn, each spectrum model in said related spectrum model set to replace the corresponding initial spectrum model in said spectrum model sequence, obtaining a new spectrum model sequence; and to determine a new synthesized spectrum feature sequence according to said new spectrum model sequence;
a second calculation unit, configured to calculate the distance between said new spectrum feature sequence and said original spectrum feature sequence;
a spectrum model selection unit, configured to select the spectrum model corresponding to the minimum distance as the preferred model for said initial spectrum model.
CN201310361783.1A 2013-08-19 2013-08-19 speech signal transmission method and system Active CN103474067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310361783.1A CN103474067B (en) 2013-08-19 2013-08-19 speech signal transmission method and system

Publications (2)

Publication Number Publication Date
CN103474067A CN103474067A (en) 2013-12-25
CN103474067B true CN103474067B (en) 2016-08-24

Family

ID=49798888

Country Status (1)

Country Link
CN (1) CN103474067B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105304081A (en) * 2015-11-09 2016-02-03 上海语知义信息技术有限公司 Smart household voice broadcasting system and voice broadcasting method
CN106373581A (en) * 2016-09-28 2017-02-01 成都奥克特科技有限公司 Data encoding processing method for speech signals
CN108346423B (en) * 2017-01-23 2021-08-20 北京搜狗科技发展有限公司 Method and device for processing speech synthesis model
CN109064789A (en) * 2018-08-17 2018-12-21 重庆第二师范学院 A kind of adjoint cerebral palsy speaks with a lisp supplementary controlled system and method, assistor

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0360265A2 (en) * 1988-09-21 1990-03-28 Nec Corporation Communication system capable of improving a speech quality by classifying speech signals
CN1964244A (en) * 2005-11-08 2007-05-16 厦门致晟科技有限公司 A method to receive and transmit digital signal using vocoder
JP2008139631A (en) * 2006-12-04 2008-06-19 Nippon Telegr & Teleph Corp <Ntt> Voice synthesis method, device and program
CN102592594A (en) * 2012-04-06 2012-07-18 苏州思必驰信息科技有限公司 Incremental-type speech online synthesis method based on statistic parameter model
CN102664021A (en) * 2012-04-20 2012-09-12 河海大学常州校区 Low-rate speech coding method based on speech power spectrum
CN102867516A (en) * 2012-09-10 2013-01-09 大连理工大学 Speech coding and decoding method using high-order linear prediction coefficient grouping vector quantization




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 666, Wangjiang Road, High-tech Development Zone, Hefei, Anhui Province, 230088

Applicant after: iFlytek Co., Ltd.

Address before: Xunfei Building, No. 666, Wangjiang Road, High-tech Development Zone, Hefei, Anhui Province, 230088

Applicant before: Anhui USTC iFLYTEK Co., Ltd.

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant