WO1981003392A1 - Improvements in signal processing - Google Patents

Improvements in signal processing

Info

Publication number
WO1981003392A1
Authority
WO
WIPO (PCT)
Prior art keywords
coefficients
data
filter
pef
formant
Prior art date
Application number
PCT/AU1981/000060
Other languages
English (en)
French (fr)
Inventor
J Reid
Original Assignee
J Reid
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1980-05-19
Filing date
1981-05-18
Publication date
1981-11-26
Application filed by J Reid
Priority to AU71550/81A
Priority to BR8108616A
Publication of WO1981003392A1
Priority to DK21282A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information

Definitions

  • The present invention relates to signal processing, and more particularly to signal processing by means of linear prediction coding.
  • The invention is primarily concerned with speech signals but is not limited thereto.
  • A fundamental problem in the analysis of speech, and in similar fields, concerns the identification of the broad features, or trend, of the power spectrum of a signal in the presence of fine structure brought about by the harmonics of low frequency components and by the presence of noise.
  • The broad peaks in the envelope of the power spectrum of a signal are known as formants.
  • The problem is therefore one of formant identification.
  • Present techniques apply either in the frequency domain or in the time domain.
  • In the frequency domain approach, the power spectrum of the signal is first approximated in some way, such as by feeding the signal through a parallel array of filters or by computing the Fourier transform of a segment of the input data. Smoothing procedures are then applied, and peaks in the smoothed frequency domain data are taken as estimates of the formant peaks of the spectrum.
  • A more sophisticated time domain approach, used in the analysis of speech and of which the invention is a special case, is the linear prediction method. Under this method, successive time segments of the signal are each assumed to be the outcome of a stationary autoregressive random process.
  • PEF: prediction error filter.
  • ⁇ ' is the transpose of ⁇ .
  • Equation (2) is solved for the vector which comprises an estimate of the population prediction error filter (PEF) coefficient vector, ⁇ .
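For orientation, the conventional estimate that the invention sets out to avoid can be sketched as follows. Equation (2) is not reproduced in this text, so the sketch assumes the usual normal equations built from lagged products of one data segment; the function name `pef_autocorrelation` and the use of NumPy are illustrative choices, not part of the patent.

```python
import numpy as np

def pef_autocorrelation(x, order):
    """Estimate PEF coefficients {1, g_1, ..., g_n} from one data segment.

    Assumption: the normal equations are formed from lagged products
    r[k] = sum_i x[i] * x[i-k] (a common choice, not taken from the patent).
    """
    x = np.asarray(x, dtype=float)
    # Lagged products for lags 0..order.
    r = np.array([np.dot(x[k:], x[:len(x) - k]) for k in range(order + 1)])
    # Symmetric Toeplitz matrix R[j][k] = r[|j - k|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]
    # One order-n linear solve: the step whose cost grows with the order.
    g = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], g))

# Impulse response of a known second order resonance with PEF {1, -1.6, 0.81}.
h = [1.0, 1.6]
for _ in range(598):
    h.append(1.6 * h[-1] - 0.81 * h[-2])
print(pef_autocorrelation(h, 2))  # close to [1, -1.6, 0.81]
```

For a fully decayed impulse response of an all-pole filter, this estimate recovers the filter's own coefficients, which makes it a convenient sanity check.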
  • $\Delta t$ is the sampling interval
  • $f_N$ is the Nyquist frequency
  • z transform of the PEF coefficients given by
  • The linear prediction method of estimating spectral trends has several advantages over other methods, viz: (i) only a small number of parameters are required to represent spectral trends, (ii) the estimated spectrum at low orders represents a smooth approximation to the population spectrum and is completely unaffected by the presence of pitch harmonics, since these cannot affect the values of the covariances, for moderate values of the order, $n$,
  • The invention is an implementation of a multiprocessing approach to real time spectral analysis, analogous to that which must occur in living organisms, and is intended to exploit the cheapness and power of contemporary silicon chip technology.
  • A linear prediction method is used, but it is a method which avoids the time consuming aspects of conventional linear prediction methods. Essentially it involves a process by which the inversion of a single high order matrix is replaced by the simultaneous inversion of a number of second order matrices to yield a solution to equation (2) above. This solution, although mathematically imprecise, is sufficiently accurate for practical purposes.
  • The spectrum, $S_n$, derived for a particular data sequence is completely controlled by the roots of the polynomial $Y(z)$ defined in equation (4).
  • The polynomial may be factorized into a number of binomial factors having real coefficients, viz:
  • Each binomial factor in (7) corresponds to a pair of roots in the complex number plane. Those root pairs which are close to the unit circle are related to peaks in the spectrum, $S_n$, which are called formants in the case of speech data. Thus some of the binomial coefficient pairs $(a_1, a_2)$, $(b_1, b_2)$ etc. are associated with particular formants, while others, which are of less practical importance, merely control the spectral trend. Now suppose that the binomial coefficient pair $(a_1, a_2)$ associated with a particular formant is known precisely. The pair can be used as the coefficients of a non-recursive filter which acts on the data according to equation (32) below to yield a new data sequence whose linear prediction spectrum $S^*$ is given by
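As an aside on the factorization in (7), the link between root pairs and coefficient pairs can be sketched numerically. This is only an illustration and not the patent's method (which avoids explicit root finding); it assumes each factor has the form $1 + a_1 z^{-1} + a_2 z^{-2}$ and that all roots are complex. The function name `second_order_sections` is hypothetical.

```python
import numpy as np

def second_order_sections(pef):
    """Split 1 + g1 z^-1 + ... + gn z^-n into real coefficient pairs (a1, a2).

    Assumes every root is one of a complex conjugate pair; real roots are ignored.
    """
    roots = np.roots(pef)
    # Keep one root from each conjugate pair and order the pairs by angle,
    # i.e. by the frequency of the corresponding spectral feature.
    upper = sorted((z for z in roots if z.imag > 0), key=np.angle)
    # (1 - z0 z^-1)(1 - conj(z0) z^-1) = 1 - 2 Re(z0) z^-1 + |z0|^2 z^-2
    return [(-2.0 * z.real, abs(z) ** 2) for z in upper]

# Two known factors multiplied together, then recovered.
pef = np.convolve([1.0, -1.6, 0.81], [1.0, 0.5, 0.9])
for a1, a2 in second_order_sections(pef):
    print(round(a1, 6), round(a2, 6))
```

Since $a_2 = |z_0|^2$, a pair whose second coefficient is close to unity corresponds to a root pair close to the unit circle, that is, to a narrow peak.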
  • If the second order PEF coefficients are found for this second data sequence, they will summarize the trend in the spectrum after the partial removal of the dominant peak. Convolution of the original data sequence with this second set of coefficients will have the effect of reducing the residual peaks in this data sequence, leaving the dominant peak more isolated than before, and the sequence will yield PEF coefficients less contaminated by the residual peaks. In general, if this process of convoluting a pair of data sequences with the PEF coefficients derived from the alternate sequence is continued, one sequence of sequences will converge to a limit sequence in which the dominant peak or formant is present in isolation, and the other will converge to a limit sequence which has the spectrum of the original sequence but with the dominant peak removed, and which can itself be operated on in the same way so as to remove further peaks.
  • The data is divided into a plurality of data streams, for example streams A and B, whereby the PEF coefficients computed from stream A are convoluted with the original data stream to yield stream B, while the PEF coefficients from stream B are convoluted with the original stream to yield stream A.
  • In such a symmetric arrangement the PEF coefficients yielded by both streams are identical and are a poor approximation to the PEF coefficients of the dominant peak. Some asymmetry must be deliberately introduced into such a network.
  • A method of introducing an asymmetry is to filter stream A with a recursive filter (see equation (33) below) whose coefficients are equal to or derived from the PEF coefficients computed from stream A itself.
  • This method is effective in isolating the dominant formant in stream A, while stream B comprises a stream in which the dominant formant has been removed and which can be further processed in order to isolate remaining formants.
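The two-stream arrangement with its deliberate asymmetry can be sketched in batch form as below. This is a loose illustration under several assumptions that are not in the source: whole segments are re-filtered on each pass rather than sample by sample, the PEF pair is estimated from lagged products of the segment, and the coefficient modification is taken to be a scaling of $(C_1, C_2)$ by $(g, g^2)$ with a hypothetical factor `g`. It is untuned and is not claimed to reproduce the device's convergence behaviour.

```python
import numpy as np

def pef2(x):
    """Second order PEF pair (C1, C2) from lagged products of one segment."""
    r0, r1, r2 = np.dot(x, x), np.dot(x[1:], x[:-1]), np.dot(x[2:], x[:-2])
    c1, c2 = np.linalg.solve([[r0, r1], [r1, r0]], [-r1, -r2])
    return c1, c2

def fir(x, c1, c2):
    """Non-recursive filter: y[i] = x[i] + c1*x[i-1] + c2*x[i-2]."""
    y = x.copy()
    y[1:] += c1 * x[:-1]
    y[2:] += c2 * x[:-2]
    return y

def iir(u, c1, c2):
    """Recursive filter: v[i] = u[i] - c1*v[i-1] - c2*v[i-2]."""
    v = np.zeros_like(u)
    for i in range(len(u)):
        v1 = v[i - 1] if i >= 1 else 0.0
        v2 = v[i - 2] if i >= 2 else 0.0
        v[i] = u[i] - c1 * v1 - c2 * v2
    return v

def two_stream_module(x, n_iter=10, g=0.5):
    """Iterate streams A and B; g is an assumed 'coefficient modifier' factor."""
    x = np.asarray(x, dtype=float)
    ca = cb = (0.0, 0.0)
    for _ in range(n_iter):
        a = fir(x, *cb)                       # stream A: original convolved with B's PEF
        a = iir(a, g * ca[0], g * g * ca[1])  # asymmetry: weakened recursive filter from A's PEF
        b = fir(x, *ca)                       # stream B: original convolved with A's PEF
        ca, cb = pef2(a), pef2(b)
    return ca, cb, a, b
```

Scaling the pair by $(g, g^2)$ with $g < 1$ moves the roots of the recursive filter toward the origin, which is one simple way to keep the positive feedback bounded.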
  • The network selected to perform a particular task will depend upon the application: the type of spectral information required, the available hardware and so on. Even in the case of speech analysis, different networks may be appropriate to each of three distinct applications, viz: phoneme recognition, speaker identification, and speech compression for storage and transmission.
  • A cascaded two stream embodiment will yield, at any instant, several pairs of coefficients: one pair from each module, associated with the formant isolated in the A stream by that module, plus a final coefficient pair associated with the B stream of the final module, from which all the formants have been removed.
  • Each pair of PEF coefficients, $C_1$ and $C_2$, contains information about both the centre frequency and the bandwidth of the corresponding formant (see equation (34) below).
  • This second dimension of information is extremely useful in practice, since the second coefficient can be used directly as a criterion for accepting or rejecting a particular formant during phoneme recognition; small values of $C_2$, that is, values less than some threshold, correspond to peaks which are too broad and weak to be considered valid as formants.
  • Fig. 1 shows a circuit block diagram of an embodiment of the device as a formant tracker.
  • Fig. 2 shows a circuit block diagram of one of the second order prediction filter estimators depicted by circles in Fig. 1.
  • Fig. 3 is a graph showing the Fourier transform log power spectrum of a segment of the data sequence fed to the device at line 1.
  • Fig. 4 is a graph showing the Fourier transform log power spectrum of a segment of the data sequence fed to the device at line 4, after the first formant has been removed.
  • Fig. 5 is a graph showing the Fourier transform log power spectrum of a segment of the data sequence appearing at line 5. The second formant occurs in isolation.
  • Fig. 6 is a graph showing the Fourier transform log power spectrum of a segment of the data sequence appearing at line 6. In Fig.
  • Peripheral devices such as microphones, analogue filters and clocks are not shown.
  • Inspection of Fig. 1 reveals that it comprises three modules fed respectively by data streams 2, 4 and 6.
  • The modules are identical except for the first and are arranged in the form of a hierarchy or cascade, each module except the last passing a data sequence to the next in line.
  • The module comprises two non-recursive filters 11 and 12, one recursive filter 13, two prediction error filter (PEF) estimators 14 and 15, a coefficient modifier 16 and an output buffer 17.
  • The PEF coefficients computed by estimator 15 are passed to the non-recursive filter 11, and the PEF coefficients computed by estimator 14 are passed to non-recursive filter 12, to recursive filter 13 via the coefficient modifier 16, and to the buffer 17.
  • The PEF estimator 15 will now compute PEF coefficients related to the residual peaks in data sequence 6 and pass these coefficients to filter 11.
  • The residual peaks will then be attenuated in data stream 5, allowing PEF estimator 14 to compute coefficients relating to the dominant peak which are less contaminated by the residual peaks than was previously the case. This effect of isolating the dominant peak will be further reinforced by the action of filter 13.
  • Coefficient buffers 17, 18, 19 and 20 contain PEF coefficient pairs associated with each of the peaks in the input sequence 1.
  • The coefficients used in the filters 11, 12 and 13 are initially set to values close to the values they are expected to assume, thus allowing more rapid convergence to take place.
  • The action of the non-recursive filters constitutes a type of negative feedback, since they nullify an effect, while the recursive filters constitute a form of positive feedback, since they exaggerate an effect detected by the PEF estimators. If the filter 13 were not present there would be no asymmetry in the module and data streams 5 and 6 would remain identical.
  • The effect of the recursive filter as described above is too strong: although the action of the device in separating formants commences adequately, the device soon ceases to be responsive to changes in the incoming data stream. This effect can be overcome by lessening the degree of "positive feedback", i.e. by modifying the coefficients which are passed to the recursive filters.
  • The first formant is not completely removed by the action of a module such as the one described, which leaves behind two small residual peaks not present in the original data spectrum.
  • The first formant behaves as a double resonance, and the effect is easily overcome by the inclusion of two non-recursive filters (22 and 23) on one side of the first module.
  • A block diagram of one specific embodiment of the second order PEF estimators 14 and 15 referred to above is depicted in Fig. 2. This description is given in digital terms, although analogue embodiments are equally feasible.
  • The data are presented one at a time by some external device such as a digitizer (not shown) to line 41 in the diagram.
  • Delays 42 and 43 and lines 41, 44 and 45 constitute a three word shift register, so that at time $i\,\Delta t$ the quantities $x_i$, $x_{i-1}$ and $x_{i-2}$ appear at lines 41, 44 and 45 respectively.
  • The attenuating factor $p$ will usually be very close to unity, and the attenuation of previous covariance values may best be carried out less frequently than once every clock cycle; that is, previous values can be multiplied by $p^N$ every $N$ clock cycles.
  • The PEF coefficients themselves do not change rapidly with time, and they too need be computed less frequently than once every clock cycle.
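A sample-by-sample estimator in the spirit of Fig. 2 might look like the sketch below. The update rule is an assumption, since the patent's covariance equations are not reproduced in this text: each running product is taken to be `r = p * r + (new product)`, applied every sample rather than every $N$ samples, and the second order system is solved by Cramer's rule. The class name `Pef2Estimator` and its methods are hypothetical.

```python
import math

class Pef2Estimator:
    """Running second order PEF estimate with attenuated lagged products."""

    def __init__(self, p=0.99):
        self.p = p                      # attenuating factor, close to unity
        self.x1 = self.x2 = 0.0         # shift register contents x[i-1], x[i-2]
        self.n = 0                      # number of samples seen so far
        self.r01 = self.r02 = self.r11 = self.r12 = self.r22 = 0.0

    def update(self, x0):
        if self.n >= 2:                 # accumulate only once the register is full
            p, x1, x2 = self.p, self.x1, self.x2
            self.r01 = p * self.r01 + x0 * x1
            self.r02 = p * self.r02 + x0 * x2
            self.r11 = p * self.r11 + x1 * x1
            self.r12 = p * self.r12 + x1 * x2
            self.r22 = p * self.r22 + x2 * x2
        self.x2, self.x1 = self.x1, x0  # shift the register by one word
        self.n += 1

    def coefficients(self):
        """Solve the 2x2 system for (C1, C2); call only when needed."""
        det = self.r11 * self.r22 - self.r12 ** 2
        c1 = (self.r02 * self.r12 - self.r01 * self.r22) / det
        c2 = (self.r01 * self.r12 - self.r02 * self.r11) / det
        return c1, c2

est = Pef2Estimator(p=0.99)
for i in range(500):
    est.update(math.cos(0.3 * i))       # a pure tone at 0.3 radians per sample
print(est.coefficients())               # close to (-2*cos(0.3), 1.0)
```

Keeping `coefficients()` separate from `update()` reflects the remark that the coefficients need not be recomputed on every clock cycle.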
  • Fig. 3 shows the input signal on line 1 of Fig. 1, wherein it can be seen that there are four formants, $F_1$ to $F_4$, present in the spectrum.
  • This diagram comprises a 1024 point Fourier transform log power spectrum of the utterance "i" in the word "television".
  • Fig. 4 shows the signal on line 4 of Fig. 1, wherein it can be seen that the first formant $F_1$ has been removed and the remaining formants are more pronounced.
  • Fig. 6 shows the signal on line 6 of Fig. 1, wherein the first two formants have been removed.
  • Fig. 5 shows the second formant in isolation, that is, the spectrum of the data stream on line 5 of Fig. 1.
  • The coefficient pair summarizing this spectrum appears in buffer 17 of Fig. 1.
  • The data sequence comprises a discrete set of quantities, $\{x_i\}$; one definition of the elements, $r_{ij}$, of the variances and covariances,
  • $\Delta t$ is a constant lag or separation of the data in the domain.
  • Another simplification is to compute each variance or covariance recursively, viz:
  • The PEF coefficients may be used as the coefficients of a non-recursive (or "finite impulse response") filter which acts on a data sequence $\{x_i\}$ to yield a new data sequence $\{y_i\}$, viz:
  • $y_i = x_i + C_1 x_{i-1} + C_2 x_{i-2}$ (32)
  • This operation is also referred to as the "convolution" of the data sequence $\{x_i\}$ with the PEF coefficients $\{1, C_1, C_2\}$ to yield a new data sequence $\{y_i\}$.
  • The PEF coefficients may also be used as the coefficients of a recursive (or "infinite impulse response") filter which acts on a data sequence $\{u_i\}$ to yield a new data sequence $\{V_i\}$, viz:
  • $V_i = -C_1 V_{i-1} - C_2 V_{i-2} + u_i$ (33)
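The two filter forms can be sketched directly. The sketch reads the non-recursive filter as the convolution of the data with $\{1, C_1, C_2\}$ and the recursive filter as $V_i = -C_1 V_{i-1} - C_2 V_{i-2} + u_i$, with samples before the start of the sequence taken as zero (an assumption about start-up). The function names are illustrative.

```python
def non_recursive(x, c1, c2):
    """FIR form: y[i] = x[i] + c1*x[i-1] + c2*x[i-2]."""
    y = []
    for i, xi in enumerate(x):
        x1 = x[i - 1] if i >= 1 else 0.0
        x2 = x[i - 2] if i >= 2 else 0.0
        y.append(xi + c1 * x1 + c2 * x2)
    return y

def recursive(u, c1, c2):
    """IIR form: v[i] = u[i] - c1*v[i-1] - c2*v[i-2]."""
    v = []
    for i, ui in enumerate(u):
        v1 = v[i - 1] if i >= 1 else 0.0
        v2 = v[i - 2] if i >= 2 else 0.0
        v.append(ui - c1 * v1 - c2 * v2)
    return v

x = [1.0, 0.5, -0.25, 2.0, 0.0, -1.0]
y = non_recursive(x, -1.2, 0.8)
back = recursive(y, -1.2, 0.8)   # undoes the non-recursive filter
print(back)
```

With the same coefficient pair the two filters are inverses of one another, which is why one attenuates a spectral peak while the other exaggerates it.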
  • The coefficients $C_1$ and $C_2$ summarize the gross features of the power spectrum (i.e. the power spectral density function) of the data sequence from which they were derived.
  • The frequency of the peak, $f_0$, is given by
  • $\Delta t$ is the sampling interval in the discrete case.
  • The coefficient $C_2$ is controlled by the half power width of the peak. It is close to unity when the peak is narrow and is closer to zero when the peak is broad or where more than one peak is present in the spectrum.
  • $C_2$ can be used in some threshold criterion to decide whether a peak is sufficiently narrow to be classified as a single formant, and the quantity $f_0$ can be used to determine the frequency of that formant.
  • $C_1$ and $C_2$ themselves, or simple functions of them, can be checked against population ranges in order to classify a formant.
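The use of a coefficient pair for frequency estimation and acceptance can be sketched as follows. Equation (34) is not reproduced in this text, so the sketch assumes the standard relation for a second order filter $1 + C_1 z^{-1} + C_2 z^{-2}$ with complex roots $r e^{\pm j\theta}$, namely $C_1 = -2r\cos\theta$ and $C_2 = r^2$, giving $f_0 = \theta / (2\pi\,\Delta t)$; the patent's own expression may differ. The threshold value 0.5 and the function name are hypothetical.

```python
import math

def formant_from_pef(c1, c2, dt, c2_threshold=0.5):
    """Return (f0 in Hz, accepted) for a PEF pair, or None if no resonance.

    Assumes roots r*exp(+-j*theta) with C1 = -2 r cos(theta), C2 = r**2.
    """
    if c2 <= 0.0:
        return None
    cos_theta = -c1 / (2.0 * math.sqrt(c2))
    if abs(cos_theta) > 1.0:
        return None                      # real roots: no spectral peak
    f0 = math.acos(cos_theta) / (2.0 * math.pi * dt)
    return f0, c2 >= c2_threshold        # small C2 means a broad, weak peak

dt = 1.0 / 8000.0                        # 8 kHz sampling, chosen for illustration
theta = 2.0 * math.pi * 1200.0 * dt
print(formant_from_pef(-2.0 * 0.95 * math.cos(theta), 0.95 ** 2, dt))
```

A peak with root radius 0.95 passes the illustrative threshold, while one with radius 0.5 (so $C_2 = 0.25$) would be rejected as too broad.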
  • A network for the recognition of different phonemes in speech data could dispense with the coefficient modifiers such as 16 and replace them with switches, so that full positive feedback is maintained for a short time, causing rapid convergence in the isolation of the formants.
  • The recursive filters such as 13 can be switched out of the network for the remainder of the duration of the phoneme.
  • Statistically significant changes in the coefficient values appearing in the buffers 17, 18, 19 and 20 can be used to detect the onset of a new phoneme and so cause the recursive filters to again become operative. In this way a speech recognition device can be constructed which is phoneme synchronous, thus avoiding the need for time axis normalization.
  • The simplest embodiment of the invention comprises the PEF estimator of Fig. 2.
  • This device alone is unsuited to the analysis of spectrally complex signals such as speech, but it can be used for the recognition of simpler signals such as the signalling tones used in telephone switching systems. In this way a single device can be used to distinguish between a wide variety of tones when the input (line 1 in Fig. 2) is fed from an appropriate line in the telephone system.
  • The coefficients $C_1$ and $C_2$ generated by the device are then compared with predefined values in order to classify the incoming signal and to cause the remainder of the telephone system to take appropriate action.
  • An analogue embodiment of the device utilizing fixed delays in place of shift registers would be more suited to this application.
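The tone recognition use can be sketched in digital form as below. The tone table, the 8 kHz sampling rate and the tolerance are all hypothetical, and the predefined values are taken to be those of an ideal pure tone, for which $C_1 = -2\cos(2\pi f\,\Delta t)$ and $C_2 = 1$; a real system would use values measured for its own signalling plan.

```python
import math

def pef2_block(x):
    """Second order PEF pair (C1, C2) from lagged products over one block."""
    r01 = r02 = r11 = r12 = r22 = 0.0
    for i in range(2, len(x)):
        x0, x1, x2 = x[i], x[i - 1], x[i - 2]
        r01 += x0 * x1
        r02 += x0 * x2
        r11 += x1 * x1
        r12 += x1 * x2
        r22 += x2 * x2
    det = r11 * r22 - r12 * r12          # zero for silence; callers should guard
    return (r02 * r12 - r01 * r22) / det, (r01 * r12 - r02 * r11) / det

def classify_tone(x, dt, tones, tol=0.05):
    """Return the name of the nearest predefined tone, or None if none is close."""
    c1, c2 = pef2_block(x)
    best, best_d = None, tol
    for name, f in tones.items():
        ref_c1 = -2.0 * math.cos(2.0 * math.pi * f * dt)  # ideal tone: C2 = 1
        d = math.hypot(c1 - ref_c1, c2 - 1.0)
        if d < best_d:
            best, best_d = name, d
    return best

TONES = {"tone_a": 400.0, "tone_b": 1000.0, "tone_c": 2000.0}  # hypothetical table
dt = 1.0 / 8000.0
signal = [math.sin(2.0 * math.pi * 1000.0 * i * dt) for i in range(200)]
print(classify_tone(signal, dt, TONES))
```

Comparing in coefficient space avoids any explicit frequency computation, which matches the description of checking $C_1$ and $C_2$ against predefined values.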

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
PCT/AU1981/000060 1980-05-19 1981-05-18 Improvements in signal processing WO1981003392A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU71550/81A AU7155081A (en) 1980-05-19 1981-05-18 Improvements in signal processing
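BR8108616A BR8108616A (pt) 1980-05-19 1981-05-18 Aperfeicoamentos em processamento de sinais (Improvements in signal processing)
DK21282A DK21282A (da) 1980-05-19 1982-01-19 Fremgangsmaade ved databehandling med lineaer forudsigelse samt apparat til udoevelse af fremgangsmaaden (Method of data processing using linear prediction, and apparatus for carrying out the method)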

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AUPE360580 1980-05-19
AU5556/80 1980-09-12
AUPE555680 1980-09-12

Publications (1)

Publication Number Publication Date
WO1981003392A1 (en) 1981-11-26

Family

ID=25642381

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU1981/000060 WO1981003392A1 (en) 1980-05-19 1981-05-18 Improvements in signal processing

Country Status (5)

Country Link
EP (1) EP0052120A4
JP (1) JPS57500901A
BR (1) BR8108616A
DK (1) DK21282A
WO (1) WO1981003392A1

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3296374A (en) * 1963-06-28 1967-01-03 Ibm Speech analyzing system
US3327057A (en) * 1963-11-08 1967-06-20 Bell Telephone Labor Inc Speech analysis
US3335225A (en) * 1964-02-20 1967-08-08 Melpar Inc Formant period tracker
US3369076A (en) * 1964-05-18 1968-02-13 Ibm Formant locating system
US3649765A (en) * 1969-10-29 1972-03-14 Bell Telephone Labor Inc Speech analyzer-synthesizer system employing improved formant extractor
US4032711A (en) * 1975-12-31 1977-06-28 Bell Telephone Laboratories, Incorporated Speaker recognition arrangement
GB1571139A (en) * 1976-11-30 1980-07-09 Western Electric Co Speech recognition
US4220819A (en) * 1979-03-30 1980-09-02 Bell Telephone Laboratories, Incorporated Residual excited predictive speech coding system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IBM Technical Disclosure Bulletin, Volume 18, No. 11, issued April 1976 (New York), J.K. and J.M. BAKER, "Continuous Formant Tracker", see pages 3869-3872. *
Proceedings IEEE, Volume 63, No. 4, issued April 1975 (New York), J. MAKHOUL, "Linear Prediction: A Tutorial Review", see pages 561-580. *
See also references of EP0052120A4 *

Also Published As

Publication number Publication date
BR8108616A (pt) 1982-04-06
EP0052120A1 (en) 1982-05-26
DK21282A (da) 1982-01-19
JPS57500901A 1982-05-20
EP0052120A4 (en) 1983-12-09

Legal Events

Date Code Title Description
AK Designated states

Designated state(s): AU BR DK JP US

AL Designated countries for regional patents

Designated state(s): AT CH DE FR GB NL SE

WWE Wipo information: entry into national phase

Ref document number: 1981901295

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1981901295

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1981901295

Country of ref document: EP