CA2083469A1

CA2083469A1 - Voicing decision method and device for vocoder with very low bit rate

Info

Publication number: CA2083469A1
Application number: CA002083469A
Authority: CA
Inventors: Pierre-Andre Laurent
Original assignee: Individual
Current assignee: Thales SA
Priority date: 1991-11-22
Filing date: 1992-11-20
Publication date: 1993-05-23
Also published as: EP0543719A1; FR2684226B1; FR2684226A1

Abstract

ABSTRACT OF THE DISCLOSURE
The voicing decision method consists in:
considering the evolution or progress of the speech signal over a determined number K of successive frames in assigning, at each current frame n, a score to each of the possible states of progress of the speech signal as a function of the rates of correlation of samples distributed in each frame; determining, in each current frame, the state of progress of the speech signal that has the maximum score to make a trace-back, from this state, to the most probable state of the frame n-K in passing successively through the states of maximum score of each preceding frame. Application to vocoders with low bit rates.
FIGURE 1.

Description

VOICING DECISION METHOD AND DEVICE FOR
VOCODER WITH VERY LOW BIT RATE
BACKGROUND OF THE INVENTION
The present invention relates to a voicing decision method and device for vocoders with very low bit rates.
In vocoders with very low bit rates of 1200 bits/second and less, the speech signal is segmented into frames of constant duration, from 10 milliseconds to 30 milliseconds, in such a way as to determine the pitch of the speech signal within these frames. These frames are positioned arbitrarily in the speech signal and a single voicing value is given for each frame. Since the frames are grouped together in blocks of 2, 6 or 8 frames, for example, and in order that the bit rate allocated to the voicing may be reduced, not all the possible voicing combinations are generally allowed.
Indeed, it is generally considered to be improbable that there might be one or two voiced frames isolated in the middle of a packet of unvoiced frames, or vice versa.
In the usual structures, this leads to the implementing of a voicing decision process that works in two steps: a first step in which a first local voicing decision is taken at the level of each frame, said decision resulting from the examination of the values of different parameters of the speech signal, and a second step to correct the decisions taken during the first step and to eliminate the small packets of unvoiced frames from the voiced parts, and vice versa.
Naturally, the working of the corresponding devices relies greatly on a heuristic approach (based on experiments, learning, etc.). This is most usually satisfactory under conditions that are not very stringent, but the situation deteriorates rapidly once the speech signal is disturbed, for example by noise.
The aim of the invention is to overcome the above-mentioned drawbacks.
SUMMARY OF THE INVENTION
To this effect, an object of the invention is a voicing decision method for a vocoder with a very low bit rate, by which the speech signal is sampled and segmented into frames of constant duration, wherein said method consists in:
- considering the evolution or progress of the speech signal over a determined number K of successive frames in assigning, at each current frame n, a score to each of the possible states of progress of the speech signal as a function of the rates of correlation of samples distributed in each frame,
- determining, in each current frame, the state of progress of the speech signal that has the maximum score to make a trace-back, from this state, to the most probable state of the frame n-K in passing successively through the states of maximum score of each preceding frame.

BRIEF DESCRIPTION OF THE DRAWINGS
Other features and advantages of the present invention shall appear from the following description, made with reference to the appended drawings, of which:
- Figure 1 shows the different steps of the method according to the invention, placed in the form of a flow chart;
- Figure 2 shows a diagram of states to show the mechanism of the transitions implemented in the method according to the invention, providing for the making of voicing decisions with maximum likelihood;
- Figure 3 shows an example of the operation of the mechanism of states shown in figure 2 on five successive frames;
- Figure 4 is a graph showing how a voicing decision can be taken into account by a routing of the states on five successive frames;
- Figure 5 is an embodiment of a device for the implementation of the method according to the invention.
MORE DETAILED DESCRIPTION
The method according to the invention, which is represented schematically by means of the blocks 1 to 5 in figure 1, makes a voicing decision on speech signal intervals having a duration that is a multiple of the pitch. This duration is computed between two predetermined extreme values, which are a minimum value to take sufficient account of signals and a maximum value designed, at the same time, to limit the burden of computation and take account of the speed of natural variation of the characteristics of the speech. The processing starts with the computation of three parameters, which are a long-term correlation rate referenced $R_M$, a first-order correlation rate referenced $R_1$ and a rate of passing through zero (hereinafter called a cross-over rate) referenced $T_{ppz}$. These computations are shown in the blocks 1, 2 and 3 of figure 1.
The long-term correlation rate $R_M$ is computed according to the expression:

$$R_M = \max_{m = M-\Delta,\,\ldots,\,M+\Delta} \; \frac{\sum_{n=1}^{N} S_n S_{n-m}}{K + \sqrt{\sum_{n=1}^{N} S_n^2 \; \sum_{n=1}^{N} S_{n-m}^2}} \tag{1}$$

wherein M represents a pitch value in number of samples, $S_n$ and $S_{n-m}$ are amplitudes of signal samples, N designates the number of analyzed samples, K is a constant and $\Delta$ represents a fraction of M. This computation can be used to give the true value of the pitch when the correlation rate that is obtained shows a maximum value. For a perfectly voiced sound, $R_M$ is equal to 1 while it is practically zero for a random (unvoiced) sound. The constant K prevents the divisions by zero and gives a very low value of $R_M$ in the absence of speech, where the power is very low.
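As an illustration of relationship (1), the following Python sketch searches the lags around a pitch candidate and keeps the best normalized correlation. The function name, the argument names, the default value of K and the exact form of the normalization (the constant K added to the geometric mean of the two window energies) are assumptions made for the example, not details confirmed by the text.

```python
import math


def long_term_correlation(s, start, n_samples, pitch, delta, k=1.0):
    """Return (R_M, best lag) over the lags pitch-delta .. pitch+delta.

    s         : sequence of signal samples
    start     : index of the first analyzed sample
    n_samples : number N of analyzed samples
    pitch     : pitch candidate M, in samples
    delta     : half-width of the lag search (a fraction of M)
    k         : positive constant K guarding against division by zero
    """
    if pitch - delta < 1 or start < pitch + delta:
        raise ValueError("not enough past samples for the requested lags")
    if start + n_samples > len(s):
        raise ValueError("analysis window runs past the end of the signal")

    best_r, best_m = float("-inf"), None
    for m in range(pitch - delta, pitch + delta + 1):
        num = e_cur = e_lag = 0.0
        for n in range(start, start + n_samples):
            num += s[n] * s[n - m]          # cross term S_n * S_(n-m)
            e_cur += s[n] * s[n]            # energy of the current window
            e_lag += s[n - m] * s[n - m]    # energy of the lagged window
        r = num / (k + math.sqrt(e_cur * e_lag))
        if r > best_r:
            best_r, best_m = r, m
    return best_r, best_m
```

With a strictly periodic input the returned rate approaches 1 at the true period, and with an all-zero input the constant K forces the rate to 0, which matches the behavior described for voiced sounds and for silence.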


The first-order correlation rate $R_1$ is computed according to the expression:

$$R_1 = \frac{\sum_{n=1}^{N} S_n S_{n-1}}{K + \sqrt{\sum_{n=1}^{N} S_n^2 \; \sum_{n=1}^{N} S_{n-1}^2}} \tag{2}$$

in which $S_n$ and $S_{n-1}$ represent, as above, the amplitudes of N samples and K is a constant.
For a voiced sound, $R_1$ is very close to 1 and it approaches -1 for an unvoiced sound. The constant K makes it possible to give $R_1$ a value that is practically zero during the silences.
The cross-over rate $T_{ppz}$ gives the ratio between the number of changes in sign for each sample $S_n$ and N.
This value changes between a value that is almost zero for a voiced sound and a value that is close to 1 for an unvoiced sound.
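The first-order correlation rate and the cross-over rate can be sketched in the same spirit. In this illustration the input holds N+1 samples, the first being the sample that precedes the analyzed frame. The function names, the treatment of a zero sample as positive and the normalization of the correlation are assumptions made for the example.

```python
import math


def first_order_correlation(s, k=1.0):
    """Normalized correlation between each sample and the previous one.

    s holds N+1 samples; k is the positive constant K that keeps the
    result near zero when the frame carries almost no power.
    """
    n_range = range(1, len(s))
    num = sum(s[n] * s[n - 1] for n in n_range)
    e_cur = sum(s[n] * s[n] for n in n_range)
    e_prev = sum(s[n - 1] * s[n - 1] for n in n_range)
    return num / (k + math.sqrt(e_cur * e_prev))


def crossover_rate(s):
    """Number of sign changes between consecutive samples, divided by N."""
    n_total = len(s) - 1
    changes = sum(
        1 for n in range(1, len(s)) if (s[n] >= 0) != (s[n - 1] >= 0)
    )
    return changes / n_total
```

A slowly varying waveform gives a correlation close to 1 and very few sign changes, while a sample-to-sample alternating waveform gives a correlation close to -1 and a cross-over rate of 1.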
The processing continues, in the manner depicted by the block 4 in figure 1, with a standardization of the parameters $R_M$, $R_1$ and $T_{ppz}$ determined here above.
This standardization has the effect of fixing the values of the parameters $R_M$, $R_1$ and $T_{ppz}$ between the two values 0 and 1, the value 0 corresponding to the perfectly unvoiced sound and the value 1 corresponding to the perfectly voiced sound. The standardized values obtained are then weighted by weighting coefficients $a_1$, $a_2$ and $a_3$ to form a quantity Z defined by the relationship:

$$Z = a_1 R_M + a_2 \left(\frac{1 + R_1}{2}\right) + a_3 \left(1 - T_{ppz}\right) \tag{3}$$

By taking care to choose the values of the weighting coefficients in such a way that the relationship $a_1 + a_2 + a_3 = 1$ is verified, Z then takes the value 1 for the perfectly voiced sounds and the value 0 for the perfectly unvoiced sounds. The parameters $a_1$, $a_2$ and $a_3$ may be determined, for example, as follows: $a_1 = 0.45$, $a_2 = 0.35$ and $a_3 = 0.2$, but their optimum values cannot really be found except by an experimental adjustment of the values of the coefficients $a_i$ as a function notably of the filterings on the signal samples $S_n$.
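A minimal sketch of the weighted sum of relationship (3), using the example weights quoted in the text. The function and parameter names are illustrative, and the mappings (1 + R1)/2 and 1 - Tppz are the standardization assumed here to bring each parameter into the range 0 to 1 with 1 meaning "voiced".

```python
def voicing_quantity(r_m, r_1, t_ppz, weights=(0.45, 0.35, 0.20)):
    """Combine the three standardized parameters into the quantity Z.

    r_m    : long-term correlation rate, already between 0 and 1
    r_1    : first-order correlation rate, between -1 and 1
    t_ppz  : cross-over rate, between 0 and 1 (1 means unvoiced)
    weights: coefficients (a1, a2, a3), expected to sum to 1
    """
    a1, a2, a3 = weights
    r_1_std = (1.0 + r_1) / 2.0   # maps [-1, 1] onto [0, 1]
    t_std = 1.0 - t_ppz           # inverted so that 1 means voiced
    return a1 * r_m + a2 * r_1_std + a3 * t_std
```

For a perfectly voiced frame (R_M = 1, R_1 = 1, T_ppz = 0) the sum of the weights is returned, which is 1 when the weights are chosen as stated; for a perfectly unvoiced frame (R_M = 0, R_1 = -1, T_ppz = 1) the result is 0.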
The process of adapting to a variable level of ambient noise takes place through the observation that adding noise to the voiced signals has the consequence of reducing the correlation rates $R_M$ and $R_1$ and of increasing the cross-over rate $T_{ppz}$, i.e. of bringing about an overall reduction in the maximum value $Z_{MAX}$ possible for Z at a given instant. To this end, an updating of an estimate of $Z_{MAX}$ is achieved and the relative quantity

$$z = \frac{Z}{Z_{MAX}} \tag{4}$$

then serves as a rough indicator of voicing given the noise.


The quantity $Z_{MAX}$ may be estimated at each new value of Z by updating $Z_{MAX}$ by each new value of Z if Z is greater than the value $Z_{MAX}$ or, if the new value of Z is smaller than $Z_{MAX}$, by computing the quantity $Z_{MAX}$ according to the relationship:

$$Z_{MAX} = (1 - \varepsilon)\, Z_{MAX} \tag{5}$$

the decrease of which is defined by a time constant that is a function of $\varepsilon$ to follow the progress or development of the noise, this time constant being all the greater as $\varepsilon$ is small.
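The tracking of the maximum and the normalization of relationship (4) can be sketched as a small stateful object. The class name, the initial value of the maximum, the default decay coefficient and the lower bound `floor` (added only so that a long silence cannot drive the divisor to zero) are assumptions of this example, and the division is assumed to use the maximum after it has been updated.

```python
class NoiseAdaptiveNormalizer:
    """Track a slowly decaying maximum of Z and return z = Z / Z_MAX."""

    def __init__(self, epsilon=0.01, z_max_init=1.0, floor=1e-6):
        self.epsilon = epsilon      # decay coefficient of relationship (5)
        self.z_max = z_max_init     # current estimate of Z_MAX
        self.floor = floor          # hypothetical guard, not from the text

    def update(self, big_z):
        if big_z > self.z_max:
            # A new maximum replaces the stored estimate directly.
            self.z_max = big_z
        else:
            # Otherwise the estimate decays by the factor (1 - epsilon).
            self.z_max = max((1.0 - self.epsilon) * self.z_max, self.floor)
        return big_z / self.z_max
```

A smaller decay coefficient makes the estimate forget old maxima more slowly, which corresponds to the longer time constant mentioned in the text.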
The voicing decision depicted by the block 5 in figure 1 is made on the basis of a criterion of decision by maximum likelihood, in considering the possible progress of the speech signal on several successive frames, in assigning, in each new frame, a score to each of the possible states of progress depending on the value of the number z obtained from the relationship (4).
Taking into account a voicing state for a minimum duration, for example of three consecutive frames, implies considering a combinational logic with six possible voicing states, as defined in the table of figure 2, wherein "V" signifies "voiced" and "N" signifies "unvoiced", and considering the graph of the transitions of figure 3 representing the transitions between possible states to reach the current frame from the possible states of the preceding frame n-1. It must be noted with reference to figure 3 that each state of a frame n is preceded by one or two states in the preceding frame n-1. Thus, in the frame n, the state NNV is preceded by the state NNN, the state NVV is preceded by the state NNV, the state VVV is preceded either by the state NVV or by the state VVV, the state NNN is preceded either by the state NNN itself or by the state VNN, the state VNN is preceded by the state VVN and the state VVN is preceded by the state VVV.
Naturally, this type of mechanism can be very easily adapted to different durations on which the number of states to be considered will depend in each case.
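The six states and their permitted predecessors can be derived mechanically from the minimum-duration constraint, which also shows how the mechanism adapts to other durations. The sketch below is one possible construction, not the table of figure 2 itself: a state is taken to be the voicing of the last `min_run` frames, and a change of voicing is allowed only when all of those frames agree. The function names are hypothetical.

```python
from itertools import product


def build_states(min_run=3):
    """List the voicing windows containing at most one voicing change."""
    states = []
    for labels in product("NV", repeat=min_run):
        window = "".join(labels)
        changes = sum(a != b for a, b in zip(window, window[1:]))
        if changes <= 1:
            states.append(window)
    return states


def build_predecessors(min_run=3):
    """Map each state to the states allowed in the preceding frame."""
    states = build_states(min_run)
    preds = {state: [] for state in states}
    for state in states:
        for label in "NV":
            same_voicing = label == state[-1]
            run_complete = len(set(state)) == 1
            # Voicing may continue at any time, but may only switch
            # once the current run has lasted min_run frames.
            if same_voicing or run_complete:
                preds[state[1:] + label].append(state)
    return preds
```

With `min_run=3` this yields the six states named in the text and the same predecessor relations, for instance VVV preceded by NVV or VVV and NNV preceded only by NNN.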
A score is assigned to each of the possible states of arrival at the frame n. For each of them, the score chosen is the greatest possible among those obtained by the addition or subtraction, according to the signs indicated on the arrows representing the transitions between states in figure 3, of the value of z, computed by the preceding relationship (4), a predetermined threshold value $Z_0$ of 0 to 1 being possibly deducted from this value of z. To take account of the fact that certain initial states become forbidden owing to the fact that their score value becomes infinite, the corresponding arrival states are also forbidden.
For example, the score $SA_2$ of the arrival state 2 (VVV) in figure 3 is determined by one of the relationships:
- $SA_2 = \max\left(SD_1 + (z - Z_0),\; SD_2 + (z - Z_0)\right)$ if the initial states 1 and 2 are both permitted,
- $SA_2 = SD_1 + (z - Z_0)$ if the initial state 2 is forbidden,
- $SA_2 = SD_2 + (z - Z_0)$ if the initial state 1 is forbidden,
where $SA_2 = -\infty$ if the initial states 1 and 2 are both forbidden. In the latter case, the state 2 is forbidden as an initial state in the next frame.
Naturally, at each new frame, the method memorizes the arrival scores, which will then serve as initial scores for the next frame, and the table of the preceding states is also memorized.
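The score update for one frame can be sketched as follows. The state numbering (0 = NNV, 1 = NVV, 2 = VVV, 3 = NNN, 4 = VNN, 5 = VVN) is inferred from the examples given in the description, and the sign convention (the increment is added when the arrival state ends in a voiced frame and subtracted otherwise) is an assumption standing in for the signs of figure 3, which is not reproduced here. The function name and the default threshold are illustrative.

```python
NEG_INF = float("-inf")   # score of a forbidden state

# State numbering inferred from the description.
STATES = ("NNV", "NVV", "VVV", "NNN", "VNN", "VVN")
# Permitted initial states for each arrival state.
PRED = {0: (3,), 1: (0,), 2: (1, 2), 3: (3, 4), 4: (5,), 5: (2,)}


def update_scores(sd, z, z0=0.5):
    """Compute arrival scores SA and best predecessors from scores SD.

    sd : initial scores, one per state (NEG_INF marks a forbidden state)
    z  : normalized voicing indicator of the current frame
    z0 : threshold deducted from z
    """
    sa, prev = [], []
    for j, name in enumerate(STATES):
        # Assumed sign convention: voiced arrival adds, unvoiced subtracts.
        sign = 1.0 if name.endswith("V") else -1.0
        best_i = max(PRED[j], key=lambda i: sd[i])
        if sd[best_i] == NEG_INF:
            # Every permitted initial state is forbidden.
            sa.append(NEG_INF)
            prev.append(None)
        else:
            sa.append(sd[best_i] + sign * (z - z0))
            prev.append(best_i)
    return sa, prev
```

The returned `sa` list becomes the `sd` list of the next frame, and the returned `prev` list is the table of preceding states that is stored for the later trace-back.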
The final decision takes place on all the frames with a certain delay, i.e. the voicing decision at the instant of arrival of the frame n will be that of the frame n-K in the manner shown by the flow chart of figure 4. The method consists in making a search, among the six arrival states, for the one that has the highest score, then a trace-back is done K minus one times in succession from the state that it has reached at the frame n-i (i = 0, 1, ..., K-1) to the previous state which is memorized for this frame. In the example of figure 4, which shows a situation with a delay of five frames, at the frame n, the state 1 is supposed to have the best score. Owing to the fact that, according to the diagram of transitions of figure 3, this state has, as its preceding state, the state 0 at the frame n-1, the method comprises a trace-back to this state and the procedure is continued similarly until the frame n-5, as shown by the unbroken bold lines in figure 4. The state to which the method thus makes a trace-back is the state VVN, which corresponds to the unvoiced frame n-5 of the state 5. However, the method is of no value unless the constraint of three successive frames of a same voicing state is complied with, which is not always the case if the method stops at the preceding stage. Indeed, given that the frame n-5 corresponds to the state 5, the frame n-4, in this case, can only be in the state 4, the only state that it is possible to reach according to figure 3 starting from the state 5, it being obligatorily necessary for the trace-back in the states during the next frame to reach the state 4 in the frame n-4. The rest of the processing then consists in the elimination, from the frame n-4 and for all the possible states at the frame n, of the states (herein the states 0, 3, 4 and 5) from which it is not possible to make a trace-back to the state 4, in positioning their scores at infinity.
In the example of figure 4, the state 1 is naturally not eliminated, for it is from this state 1 that the trace-back in the states has been begun, and the state 2 is preserved, for the method can make a trace-back from this state 2 to the state 4 in the frame n-4 (shown in bold dots).
If, in another example, the initial trace-back had led to the state 3 (frame n-5), it being known that the state 3 can be followed only by the states 0 or 3 (frame n-4), all the states in the frame n that do not go back to the states 0 or 3 in the frame n-4 would have been eliminated. This method of eliminating states has, among other advantages, that of giving, at output, a sequence of voicing values that methodically meet the constraint requiring that there should be no isolated voiced/unvoiced islands. If, on the reception side, this constraint is not verified, it means that there has been a transmission error, which makes it possible to take corrective measures, for example by positioning the voicing indicator, in the identified error zone, at "voiced".
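The delayed decision and the elimination of incompatible states can be sketched as below. The sketch follows the memorized predecessors for i = 0, ..., K-1 (K look-ups here, so that the decided frame is n-K), then forbids every state of the frame n whose memorized path does not pass, at the frame n-K+1, through a permitted successor of the decided state. The state numbering is the one inferred from the description, and the function names and the layout of `prev_tables` (oldest frame first, frame n last) are assumptions of the example.

```python
NEG_INF = float("-inf")   # score of a forbidden state

STATES = ("NNV", "NVV", "VVV", "NNN", "VNN", "VVN")
# Permitted following states, the reverse of the predecessor relations.
SUCC = {0: (1,), 1: (2,), 2: (2, 5), 3: (0, 3), 4: (3,), 5: (4,)}


def trace_back(state, prev_tables, steps):
    """Follow memorized predecessors `steps` frames back from frame n."""
    for i in range(steps):
        if state is None:
            return None
        state = prev_tables[-1 - i][state]
    return state


def decide_and_prune(scores, prev_tables, k):
    """Return (decided state at n-k, voiced flag, pruned scores at n)."""
    if len(prev_tables) < k:
        raise ValueError("need at least k memorized predecessor tables")
    best = max(range(len(scores)), key=lambda j: scores[j])
    decided = trace_back(best, prev_tables, k)
    allowed = SUCC[decided]
    pruned = [
        score if trace_back(j, prev_tables, k - 1) in allowed else NEG_INF
        for j, score in enumerate(scores)
    ]
    voiced = STATES[decided].endswith("V")
    return decided, voiced, pruned
```

Fed with predecessor tables that reproduce the situation described for figure 4 (best state 1 at the frame n, path 1, 0, 3, 3, 4, 5), the sketch decides the state 5 (VVN, unvoiced) for the frame n-5 and keeps only the states 1 and 2 at the frame n.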
A device for the implementation of the method according to the invention is shown in figure 5. This device comprises the following elements, shown respectively inside boxes: a device 6 for the computation of the parameter Z, a standardization device 7, and a final decision device 8. The device for the computation of the parameter Z has a device 9 for the computation of long-term self-correlation of the value $R_M$, a device 10 for the first-order correlation of the coefficient $R_1$ and a device 11 to compute the cross-over rate $T_{ppz}$. The computation devices 9, 10 and 11 each simultaneously receive the signal samples $S_n$ at a first input and carry out the computations corresponding to the previous relationships (1), (2) and (3) within a window of analysis given by a device 12 which determines a duration of analysis for each frame n as a function of the value of the pitch M. The multiplier circuits 13, 14, 15 are coupled to the output of the computation devices 9, 10 and 11 to apply the weighting coefficients $a_1$, $a_2$ and $a_3$ to the correlation rates $R_M$ and $R_1$ and to the cross-over rate $T_{ppz}$. The results of the computations performed by the multiplier circuits 13, 14, 15 are applied to the inputs of a summator circuit 16 giving the value Z at its output. The value Z of the relationship (3) is applied to the input of the standardization block 7. The standardization block 7 comprises a register 17 for the memorizing of the maximum number $Z_{MAX}$. The output of the register 17 is coupled to a first input of a comparator circuit 18, the second input of which receives the value Z given by the computation device 6. The input of the register 17 is connected to the output of a multiplexer circuit 19 controlled by the output of the comparator circuit 18 to apply, to the input of the register 17, either the value Z as such or the result of the computation given by a multiplier circuit 20 having a first operand input connected to the output of the register 17 and a second operand input to which there is applied the coefficient $1 - \varepsilon$ of the relationship (5) described here above.
A divider circuit 21 is coupled by a first operand input to the output of the register 17 and by a second operand input to the output of the device 6 to compute the quantity "z" equal to $Z/Z_{MAX}$. The quantity "z" is applied to the input of the final decision device 8, which comprises a device 22 for computing the scores, carrying out trace-backs to the states, and eliminating the forbidden states, this computation device 22 being connected to a memory 23 of the initial scores SDi, a memory 24 of the arrival scores SAi and a memory 25 of the previous states. The device 22 for the computation of the scores furthermore receives the threshold value $Z_0$ at a second input and the number of frames n at a third input and gives the voicing decisions for each of the frames n-K at an output.
Naturally, the implementation of the method according to the invention that has just been described is not unique, and it goes without saying that other embodiments implementing notably signal processing microprocessors can equally well be envisaged, their programming being within the scope of those skilled in the art.


Claims

1. A voicing decision method for a vocoder with a very low bit rate, by which the speech signal is sampled and segmented into frames of constant duration, wherein said method consists in:
- considering the evolution or progress of the speech signal over a determined number K of successive frames in assigning, at each current frame n, a score to each of the possible states of progress of the speech signal as a function of the rates of correlation of samples distributed in each frame, - determining, in each current frame, the state of progress of the speech signal that has the maximum score to make a trace-back, from this state, to the most probable state of the frame n-K in passing successively through the states of maximum score of each preceding frame.

2. A method according to claim 1, consisting in computing a long-term correlation rate RM for an interval M equal to the value of the pitch of the speech signal.

3. A method according to claim 2, consisting in computing a first-order correlation rate R1 to determine the nature, whether voiced or unvoiced, of the sound corresponding to the correlated samples.

4. A method according to claim 3, consisting in determining the rate of cross-overs of the speech signal.

5. A method according to claim 3, consisting in carrying out a weighted summation of the long-term correlation rate, the first-order correlation rate and the cross-over rate to obtain a score increment "z" and in making a computation, from the possible scores of each frame preceding a current frame, of the scores of the possible states of arrival.

6. A method according to claim 5, consisting in giving infinite score values to the forbidden states.

7. A method according to claim 5, wherein the increment value is equal to the weighted sum of the long-term correlation rate, the first-order correlation rate and the cross-over rate minus a predetermined threshold value Z0.

8. A device for the implementation of the method according to any one of the claims 1 to 7, comprising a device for the computation of the score increment value coupled to a final decision device comprising a device for the computation of scores, the tracing back of states and the elimination of forbidden states by means of a standardization device.

9. A device according to claim 8, wherein the score increment value computing device comprises a long-term correlation computing device, a first-order correlation computing device and a cross-over rate computing device, these computing devices being coupled to a summator circuit.

10. A device according to claim 9, wherein the score value computing device is formed by a signal processor.