EP0976124A2 - Speech detection - Google Patents

Speech detection

Info

Publication number
EP0976124A2
Authority
EP
European Patent Office
Prior art keywords
signal
speech
neural network
noise
peak value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP98917143A
Other languages
German (de)
French (fr)
Inventor
Samu Kaajas
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Oyj
Original Assignee
Nokia Networks Oy
Nokia Oyj
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Networks Oy and Nokia Oyj
Publication of EP0976124A2
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/06 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being correlation coefficients
    • G10L25/12 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • the invention relates to a method for speech detection in a telecommunication system comprising a signal source producing a signal, a signal processor including a neural network for processing said signal, in which method said neural network is trained to distinguish between a speech signal and a noise signal using speech and noise samples.
  • Non-speech is generally either silence or background noise, but may also be an information signal like the DTMF (dual tone multi frequency) signal to be transferred on a telephone channel.
  • the capacity of the classifier is often decisive for the operation of the system.
  • the classification of a signal as speech and non-speech is, for example, included in a prehandling part of each speech identification system. It is essential in speech identification to find the precise starting point of speech sounds, words or sentences to ensure reliable identification.
  • speech recording the amount of data to be processed can be significantly reduced by recording only speech-containing data.
  • the load and power consumption of a mobile station or another telecommunication apparatus or system can be reduced if only information corresponding to real speech is coded and transmitted, not unnecessary background noise.
  • a speech signal is composed of consecutive speech sounds. Pressure pulses deriving from the vocal cords pass through the oral cavity and/or the nasal cavity forming a speech sound.
  • the shape of the cavities affects the pronunciation result, as well as which of the cavities is open, thus distinguishing between different speech sounds.
  • the speech sounds can be grouped in various ways, a basic division being to group them as vowels and consonants.
  • background noise refers to all sound signals that are not speech and that will be processed by a speech detector.
  • background noise is often modelled as white noise. This accurately models noise occurring in analog mobile or radio phones, but for modelling background noise occurring in various real life situations the model is too simple. Background noise can roughly be divided into two categories. Stationary noise is fairly even, continuous noise that can be caused by, for example, a ventilator, a copying machine, a restaurant environment or traffic. A common feature is the continuity of the noise. Observed in the time domain, the noise occurs continuously throughout the entire observation period.
  • the other type of background noise is dynamic background noise. It is composed of random, heavy noise peaks. For example, all bangs and slams can be classified as dynamic noise. In the time domain, such noise is seen as fairly short noise sequences having a strong amplitude.
  • Speech can be distinguished from background noise by determining values for various properties describing the signal. Examples of such properties are the signal energy or spectrum.
  • US patent 5,611,019 describes a speech detection arrangement.
  • the arrangement comprises a reference model, a parameter extractor for extracting parameters from an input signal and decision-making means for deciding whether the signal is speech or non-speech.
  • the presented solution also includes the idea of training the arrangement to detect and distinguish speech from non-speech.
  • US patent 5,598,466 describes a method for detecting speech activity in a semi-duplex audio telecommunication system. In this method an average peak value representing the envelope of an audio signal is determined.
  • US patent 5,596,680 shows a method and an apparatus for detecting speech activity in an input signal.
  • a starting point is detected using power/zero crossing.
  • For detecting the end point of the speech sound, the cepstrum of the signal is determined.
  • vector quantization is used to classify the speech sound as either speech or noise.
  • US patent 5,572,623 shows a method for speech detection from a signal containing noise.
  • a frame containing sound is detected, noise-containing frames preceding said frame are searched for, and an autoregressive model of noise and a spectrum of average noise are formed. Then the frames preceding the sound are whitened in the spectrum, wherefrom the real starting point of speech is searched for.
  • US patent 5,276,765 shows a voice activity detector (VAD) to be used in an LPC coder in a mobile system.
  • VAD voice activity detector
  • the solution operates in connection with the LPC coder but does not use LPC coefficients when deciding whether the signal to be observed is speech or noise. Instead, the solution uses autocorrelation coefficients of an input signal, weighted and combined, for classifying the signal parts as speech or correspondingly noise.
  • ETSI GSM 06.32, v.4.1.0, July 1995: Voice Activity Detection. ETSI.
  • a problem with prior art solutions is that they generally classify heavy background noise as speech. Classifying dynamic noise in particular, for example bangs, by the simple methods described often produces incorrect results.
  • a problem particularly in semi-duplex trunking mobile communication systems is that although the mobile stations comprise a push-to-talk (PTT) switch for indicating the start and the end of a speech item, a subscriber talking from another telecommunication system to said semi-duplex mobile communication system cannot inform the mobile communication system when he/she will use his/her speech item.
  • PTT push-to-talk switch
  • speech detection is required at least for all speech signals arriving from outside said semi-duplex mobile communication system.
  • An object of the invention is to provide a method and an apparatus for speech detection, the method and the apparatus operating in real time, taking an input signal from a telephone channel and being able to distinguish speech from noise, particularly from dynamic noise, as reliably as possible.
  • the invention relates to a method for speech detection in a telecommunication system, characterized by comprising the following steps: determining from said signal identification numbers comprising at least the following identification numbers
  • the main idea of the invention is to employ the neural network in speech detection.
  • the properties determined from the signal to be observed are given as input to the neural network making the classification decision.
  • the neural network to be used in this inventive method is trained to distinguish speech from background noise.
  • An advantage with the invention is that the neural network of the invention can also be trained to distinguish heavy background noise from speech.
  • Another advantage with the invention is that the neural network is simple and easy to implement also as a real time solution, thus being applicable to speech transfer in almost real time, for example, in a mobile communication system.
  • Figure 1 is a flow chart describing a method of the invention
  • Figure 2 is a block diagram showing a neural network used in the implementation of the invention.
  • a neural network is used in the decision-making only when the signal average exceeds a particular threshold value.
  • the performance of the entire speech detection method and apparatus is thus improved at low noise levels.
  • Autocorrelation is calculated and 0-level exceedings are counted from a 60 ms time window.
  • the LPC coefficients are obtained directly from a TETRA speech codec.
  • Hangover time refers to time after the last speech classification during which the signal is still classified as speech although the window to be observed no longer includes speech. As a result the weak consonants in the middle and end of a sentence are detected.
  • a local (short-time) autocorrelation function is determined as follows:
  • w(n) is a finite length window function.
  • the parameters are again window length N and window function shape.
  • Autocorrelation indicates how the signal correlates with itself. Periodicities can be detected in the signal by means of autocorrelation. Since voiced speech sounds comprise basic frequency sequences, the autocorrelation function should have peaks at intervals corresponding to the basic frequency. The existence of these peaks as well as their location and amplitude are properties by which speech can be distinguished from background noise using the autocorrelation function. Though a telephone channel attenuates frequencies below 300 Hz, the autocorrelation function peaks more probably correspond to the first resonance, or formant, frequency than to the basic frequency.
  • a linear predictor is a system that predicts a sample value as a linear combination of previous samples.
  • Neural networks are nonlinear calculation networks which are tuned using training data.
  • the neural network is loosely based on brain function, but is nevertheless inadequate to model the real function of the brain.
  • the neural network can be implemented as a neurocomputer, referring to a highly parallel adaptive network consisting of simple calculation elements. Such a network functions like a real biological neural network.
  • the neural network consists of several calculation elements, or neurons, and the connections between them.
  • a neuron is given one or more input values and based on these values the neuron produces a result that can be forwarded to one or more neurons.
  • a dot product is thus calculated between the weighting coefficients and the input, the bias is added thereto and the result is given as an argument to the transfer function producing the final result.
  • the neurons can be combined into neuron layers.
  • a neuron layer comprises one or more neurons.
  • the neuron layer neurons all obtain the same input vector and the neurons have the same transfer function.
  • the neuron layer result vector can be given to another layer as input, in which case a multilayer neural network is obtained.
  • the neural network has to be trained to solve a given problem. During the training phase the weighting coefficients and the bias values of the network neurons are tuned according to a training algorithm. In the invention some real life recordings are used to represent background noise. These recordings are chosen so as to represent real situations particularly well and extensively. After training, the neural network is used by giving the network an input vector for which the network calculates a corresponding output. A properly trained neural network always calculates a similar result for an input vector group possessing certain properties.
  • the neural network is well applicable to solve the classification problem.
  • the neural network decides into which category the given input vector belongs i.e. whether the signal sample to be observed is speech or noise.
  • a suitably trained neural network succeeds in classifying the signal as speech or background noise.
  • the input to be brought into the system can be the signal as such or parameters determined from it.
  • An example of the former is found in the publication presented above, Electronic Design, March 22, 1990, Newman W.C., Detecting Speech with an Adaptive Neural Network, pp. 79-89, presenting a neural network which is given a 25 ms signal as input, i.e. 250 samples at a 10 kHz sampling frequency.
  • the LPC coefficients that are calculated from the signal and that describe the signal are given as input to the neural network.
  • the inventive method and apparatus are designed to be implemented in a digital signal processor. This is why the training data is also assembled using a signal processor.
  • the signal processor calculates the LPC coefficients and the autocorrelation and counts the number of zero exceedings and stores the values in memory.
  • Eight neurons are selected for a hidden layer of the neural network of the invention.
  • An increase in the number of neurons does not necessarily improve the function of the network.
  • for the output layer of the neural network two neurons are chosen which are trained in such a way that one reacts to speech and the other to noise.
  • the transfer function of the output layer is linear.
  • the signal is classified as speech when the output of the neuron tuned for speech is higher than the threshold value and the output of the noise neuron is smaller than that of the speech neuron.
  • the performance of the network improves since the threshold value can be increased without increasing speech clipping. The erroneous interpretations of noise as speech are thus reduced.
  • the neural network performance at low noise levels is poor. Therefore an observation of the signal amplitude average is added to the method and apparatus of the invention.
  • the neural network is not used in the classification until the signal average exceeds a predetermined threshold value. A quieter signal is classified directly as silence.
  • the software implementing the method and apparatus of the invention is carried out as software run in real time in the digital signal processor.
  • the Texas Instruments TMS320C50 signal processor is used as the processor, which is able to receive and transmit audio data sampled at an 8 kHz sampling frequency.
  • the software is implemented in such a way that speech travels as such from the input to the output of the signal processor I/O port only when the algorithm classifies the signal as speech. Otherwise silence is transmitted.
  • when the method and apparatus of the invention are implemented in a telecommunication system the practical solution can naturally be carried out in another way, for example in such a way that white noise or another sound, like an audible tone, can be heard instead of noise from the speech detector.
  • the method and apparatus of the invention are used for speech detection.
  • the function of the method and apparatus of the invention can be significantly improved by adding a so-called hangover time to them.
  • hangover time is started, during which a final classification decision is made, i.e. a decision on whether the signal is speech or noise. It is thus easier to include the weak consonants in the middle of the sentence.
  • a possible hangover time added to the method and apparatuses of the invention and to the algorithms implementing them is 500 ms.
  • the speech detector can be located in the exchange or transmission parts of the telecommunication system.
  • the speech detector can also be located in a telecommunication terminal.
  • the speech detector of the invention can be located, for example, in a system exchange, a base station controller, a base station or a mobile station, or in several of the above mentioned network elements.
  • a mobile communication system typically has an exchange connected to other telecommunication and telephone networks using an interface unit.
  • the mobile communication exchange is connected to a fixed telephone network using the interface unit of the fixed network.
  • the speech detector of the invention can be located in this interface unit of the fixed network in particular. Naturally the speech detector can also be located elsewhere in the mobile communication system or in its exchange.
  • the classification information produced by the speech detector of the mobile communication system, on whether the signal arriving from the telephone line or the speech channel of the fixed telephone network is classified as speech or non-speech, is transferred through a central processing unit (CPU) of the interface unit of the fixed telephone network to a call control computer (CCC) of the exchange.
  • the CCC distributes the speech items between the radio telephone subscriber and the fixed network subscriber on the basis of the information.
  • the speech detection method and apparatus of the invention can be carried out for example in the Texas Instruments TMS320C50 (40 MIPS) digital signal processor, whose main task is to convert or code speech arriving from the telephone line and correspondingly decode speech going to the fixed telephone line in the TETRA (Trans European Trunked Radio) mobile communication system.
  • Texas Instruments TMS320C50 40 MIPS
  • TETRA Trans European Trunked Radio
  • the speech coding algorithm is based on the ACELP (Algebraic Code-Excited Linear Predictive) coding model.
  • ACELP Algebraic Code-Excited Linear Predictive
  • a speech sample frame is synthesized by filtering a suitable excitation sequence with two time-dependent filters.
  • the first filter is a long-term prediction filter by which the periodicities of the speech signal are modelled.
  • the second filter is a short-term prediction filter by which the envelope of the speech spectrum is modelled.
  • the short-term filtering is performed as follows:
  • the aᵢ are the linear prediction coefficients and p is the order of the predictor, which is 10 in the TETRA codec.
  • the linear prediction analysis is performed at 30 ms intervals using a 32 ms asymmetrical window.
  • the window consists of two Hamming windows of different lengths, the first comprising 216 samples of the frame to be processed and the second 40 samples of the next frame.
  • the LPC coefficients are solved employing an autocorrelation method using the Levinson-Durbin recursion.
  • the LPC coefficients calculated by the TETRA codec can, in accordance with the invention, be delivered directly as the speech detector input.
  • speech is always coded two frames at a time, or two times successively at 60 ms intervals.
  • the LPC coefficients used by the speech detector are calculated on the basis of the first frame in accordance with the asymmetrical window described above.
  • the other speech detector inputs can be determined for coding from a buffered 60 ms window. If signal was gathered into the buffer during a time period over 60 ms and the buffered data was given as delayed to the speech codec, the window length used by the speech detector could be extended.
  • the combined performance time of the encoder and decoder used in the TETRA mobile communication system has been measured as 27 ms. Since the coder and at the same time the speech detector are called at 60 ms intervals, 33 ms at the most remains at the speech detector's disposal.
  • the performance time of the speech detector implemented in a manner according to the invention has been measured as 18 ms, the system thus meeting the real time requirements.
  • the speech detector is implemented as a specific program block as a part of the entire digital signal processor program code.
  • the implementation is performed using the assembly language of the TMS320C50 processor.
  • a subprogram makes all the arrangements for encoding in the TETRA mobile communication system.
  • the subprogram copies the 480 samples to be coded into the buffer for the speech detector.
  • the speech detector is called from the subprogram after the first telecommunication frame (240 samples) is coded.
  • the encoder has then calculated the LPC coefficients corresponding to the first frame needed by the speech detector.
  • the method of the invention can be carried out as a software to be run in the digital signal processor.
  • the speech detector is in the same processor with the speech codec of the mobile communication system.
  • Figure 1 shows a flow chart describing the method of the invention.
  • the first operation the speech detector performs is to calculate the average of the samples in step 101.
  • the function of the entire speech detection system improves significantly when only a signal whose average exceeds a certain level is given to the neural network for classification.
  • although the neural network inputs are in principle independent of the signal level, in practice the input values determined from the same speech sample at a different signal level deviate from one another. This is caused by the inaccuracy of the AD/DA converter biasing used in the arrangement, the inaccuracy effects becoming apparent at low signal levels.
  • the average is calculated in the loop where a sum of absolute values of all 480 samples fed into the buffer is calculated.
  • the result of the average calculation is compared 102 to a fixed threshold value and if the average is smaller 103A than the threshold value a jump 103 is performed directly to the decision logic part 104 and the signal is classified as non-speech.
  • the threshold value is determined in such a manner that noise arriving from a normal speech channel is removed, and the threshold value is set as high as possible without cutting off the speech signal. Then the hangover time can be increased in the next step 105 of the method. If, on the other hand, the average exceeds 106 the threshold value, the process continues with the neural network part.
  • said average calculation and result comparison are complemented with a prehandling part whose function is to remove the DC offset from the signal, i.e. to center the signal as closely around the 0-level as possible.
  • This is carried out by filtering the signal with a high-pass filter, the cut-off frequency of which is very low.
  • the LPC coefficients of the signal are determined 106A.
  • the values of the LPC coefficients, for example coefficients 1, 3, 4, 5 and 10, calculated by the speech encoder of the TETRA mobile communication system are copied into the input buffer of the neural network.
  • the encoder stores the coefficients LPC1, LPC3, LPC4, LPC5 and LPC10 calculated on the basis of the first frame in memory, from where they are transferred to be used by the neural network.
  • autocorrelation is calculated 107 using two internal loops.
  • in the inner loop the actual sum of products is calculated, and the outer loop sets the indexing in place.
  • the outer loop sets two C50 processor auxiliary registers indicating the beginning of the sample buffer and a position preceding the beginning of the buffer by the lag to be calculated, corresponding to the value k in equation (1).
  • the lag is the value of the autocorrelation function argument, or k, described in equations (1) and (2).
  • the number of repetitions of both the outer and inner loops is stored in one auxiliary register.
  • the inner loop is repeated 480 times when the first lag is being calculated, 479 times when the second lag is being calculated, 478 times for the third etc.
  • the outer loop stores the result calculated by the inner loop in the autocorrelation buffer.
  • the inner loop includes three commands. The first one performs the multiplication, the second adds the result into the sum and loads the multiplication register for the following multiplication.
  • the third command is NOP (no operation), by which the loop is padded to the minimum length of three commands required by the C50.
  • the multiplication result is shifted 6 bits to the right, by which 128 product sum operations can be performed without the risk of overflow. There is a small risk of overflow when the lags in the beginning are calculated, but the average low signal levels and the use of saturation arithmetic further reduce the risk. Thus, two more bits of calculation precision are obtained compared to a scaling that is totally without risk (see the sketch below).
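As a rough illustration of the loop structure and product scaling described above, the following Python sketch computes the autocorrelation lags of a sample buffer with each product shifted 6 bits to the right before accumulation. It illustrates the logic only and is not the patent's TMS320C50 assembly implementation; the function name and the num_lags parameter are hypothetical.

    # Sketch of the nested autocorrelation loops; buf holds 16-bit integer
    # samples (e.g. 480 of them), and each product is pre-shifted 6 bits
    # to the right to avoid 32-bit overflow, as in the description above.
    def autocorrelation_lags(buf, num_lags):
        acor = []
        for k in range(num_lags):          # outer loop: one lag per round
            s = 0
            for m in range(len(buf) - k):  # inner loop: sum of products
                s += (buf[m] * buf[m + k]) >> 6
            acor.append(s)
        return acor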
  • the highest peak of the autocorrelation function is searched for 108, i.e. the highest buffer value after the value corresponding to lag 0. In practice some values succeeding lag 0 are also high, therefore the search for a peak value is not started until lag 9.
  • the entire autocorrelation buffer is gone through in the loop and the memory address of the highest value is stored.
  • when the start address of the buffer is subtracted from said address, the lag corresponding to the peak value is obtained, and this information Acor1 is copied into the neural network input buffer.
  • the peak value is divided 109 by the energy, or by the lag 0 value. This is performed using the conditional subtraction command of the C50 processor.
  • the result obtained, Acor2, is transferred to the neural network input buffer (see the sketch below).
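A minimal sketch of this peak search and normalization in Python; the start lag of 9 and the division by the lag-0 energy follow the description above, while the function and variable names are hypothetical:

    def acor_features(acor):
        # acor: autocorrelation buffer, acor[0] being the lag-0 energy
        START_LAG = 9                      # skip the high values right after lag 0
        peak_lag = max(range(START_LAG, len(acor)), key=lambda k: acor[k])
        acor1 = peak_lag                   # lag of the highest peak
        acor2 = acor[peak_lag] / acor[0]   # peak value divided by the energy
        return acor1, acor2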
  • the 0-level exceedings in the signal are counted in the method of the invention.
  • when counting 110 the 0-level exceedings, the signs of the previous and the following sample are stored in an accumulator and an accumulator buffer. These are compared by summing the signs and by comparing with a bit mask. If the signs differ, the counter in the auxiliary register is increased by one. The entire sample buffer is thus gone through in the loop. The result is transferred to the neural network input buffer, as sketched below.
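The sign-comparison loop can be sketched as follows in Python; the accumulator and bit-mask mechanics of the C50 are replaced by a plain sign test, so this illustrates the logic rather than the assembly code:

    def count_zero_crossings(buf):
        # Compare the sign of each sample with the sign of the previous one
        # and count how many times the signal crosses the 0 level.
        count = 0
        for prev, cur in zip(buf, buf[1:]):
            if (prev < 0) != (cur < 0):    # signs differ: a 0-level exceeding
                count += 1
        return count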
  • FIG. 2 shows a block diagram of the neural network to be used in the implementation of the invention.
  • Figure 2 shows a vector 201 fed to the neural network, the vector 201 comprising the five LPC coefficients calculated above (LPC1, LPC3, LPC4, LPC5 and LPC10), the two autocorrelation values Acor1 and Acor2 mentioned above, and the number of 0-level exceedings ZeroR in the signal.
  • the neural network comprises a hidden layer 202 with eight neurons 203, 204, 205, 206, 207, 208, 209, 210 and an output layer 211 with two neurons, a speech neuron 212 and a noise neuron 213.
  • the block diagram shows a decision logic 214 of the invention interpreting the values of the output layer 211.
  • the neural network in Figure 2, in the method shown in Figure 1, calculates 111 on the basis of the input values the output values for the speech neuron 212 and the noise neuron 213, on the basis of which the decision logic 214 makes 104 a classification decision on the signal to be processed.
  • the operating principle of the neural network is the following. Only a dot product according to equation (5), which is easy to implement with the C50 processor MAC command (multiply and accumulate), is calculated in all neurons. Fixed-point calculation then presents a problem.
  • the weighting coefficients obtained during the training phase of the neural network vary over a wide range. Typically the weighting coefficients of a neuron can be between 0.05 and 90. The coefficients have to be scaled neuron-specifically to maintain the best possible calculation precision.
  • the intermediate results of the hidden layer 202 can also vary over a wide range.
  • the hidden layer 202 neuron first calculates the product sum between the weighting coefficients and the input values of the neuron.
  • the coefficients are scaled between [-0.25, 0.25] by dividing by a suitable power of two that is higher than the highest coefficient.
  • An additional division by four is performed to reduce the risk of overflow in the MAC operation.
  • a bias is added to the sum in the MAC loop.
  • the bias is multiplied by one and added to the sum as the last component.
  • the sum obtained is then shifted to the left so that the result corresponds to the result calculated with unscaled weighting coefficients.
  • the upper accumulator then comprises the fractional part of the result and the lower accumulator the integral part. Sign bits are set and trash bits are removed using bit masks.
  • the 32-bit end result is stored in the hidden layer 202 buffer.
  • All eight neurons 203-210 of the hidden layer 202 are implemented in the manner described above.
  • the weighting coefficient scaling causes neuron-specific differences affecting the amount of shifting and the bit masks, as illustrated in the sketch below.
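The per-neuron scaling can be illustrated with the following Python sketch, in which floating point stands in for the Q15 fixed-point arithmetic of the C50. The choice of the scale factor follows the description above; including the bias in the maximum and the function name itself are simplifying assumptions.

    import math

    def scaled_neuron_sum(weights, inputs, bias):
        # Scale the coefficients into [-0.25, 0.25]: divide by a power of
        # two strictly higher than the largest magnitude, then divide by
        # an additional factor of four (assumes non-zero coefficients).
        mx = max(abs(w) for w in list(weights) + [bias])
        scale = 2.0 ** (math.floor(math.log2(mx)) + 1) * 4
        s = sum((w / scale) * x for w, x in zip(weights, inputs))
        s += bias / scale                  # bias added as the last component
        return s * scale                   # shift back to the unscaled result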
  • for negative arguments the table search is performed with the corresponding positive value and the result is negated.
  • the table search is performed in the loop for the results of all neurons 203-210 of the hidden layer 202.
  • the results are stored in the hidden layer buffer. Since the results are between [-1, 1] they can now be represented as 16 bits and stored in every second memory address in the hidden layer 202 buffer.
  • the implementation of the speech 212 and noise 213 neurons of the output layer 211 is similar to that of the hidden layer neurons, but somewhat simpler. Since the output values are determined in the network training data as either zero or one, the output value varies in practice between [-0.1, 1.1]. Values higher than one can be saturated to correspond to one, and thus the results of the output neurons 212, 213 can be represented as 16 bits. This 16-bit result is ready in the upper accumulator of the processor, and thus the setting of sign bits using bit masks can be avoided. The end results are stored in the output buffer. Since the transfer function of the output layer 211 is linear, it is not necessary to perform an actual transfer function calculation.
  • the decision logic 214 reads the output value of the speech neuron 212. If the value is smaller than the threshold value, the frame to be processed is classified as non-speech. If the value is higher, the speech neuron 212 output is compared with the noise neuron 213 output. If the noise neuron 213 value is higher, the signal is classified as non-speech, otherwise as speech. If the program execution arrives 103 at the decision logic block 214 directly from the average calculation block 101, the signal is directly classified as non-speech.
  • a hangover time increase block 105 shown in Figure 1 is the last part of the speech detector.
  • a signal classified as speech initializes the _hang_over variable, which adjusts the hangover time, to one, and the _speech_flag variable, i.e. the speech detector output, is set to the value 0x1111 corresponding to speech. If a frame is classified as non-speech, the value of the _hang_over variable is checked. If the value is zero or the maximum value, the _hang_over variable is set to zero and the _speech_flag variable is set to the value 0x0000 corresponding to non-speech. Otherwise the _hang_over variable is increased by one and the _speech_flag is set to the value 0x1111. When the speech detector is started, the _speech_flag and the _hang_over variables are set to zero.
  • the hangover time can be changed via the maximum value of the _hang_over variable. Value 9 is used in the implementation, in which case the hangover time after the end of speech is 480 ms (8 * 60 ms). The hangover time can be changed in 60 ms steps. After an increase in hangover time the _speech_detector subprogram is closed. The decision and hangover logic is summarized in the sketch below.
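The decision logic and hangover handling described above can be summarized in the following Python sketch. The variable names mirror the _hang_over and _speech_flag variables, and the maximum value 9 and the flag values 0x1111/0x0000 are taken from the description; the threshold value and the class structure are hypothetical conveniences.

    SPEECH, NON_SPEECH = 0x1111, 0x0000

    class DecisionLogic:
        def __init__(self, threshold, hang_over_max=9):  # 9 -> 480 ms hangover
            self.threshold = threshold
            self.hang_over_max = hang_over_max
            self.hang_over = 0
            self.speech_flag = NON_SPEECH

        def decide(self, speech_out, noise_out, level_ok=True):
            is_speech = (level_ok and speech_out > self.threshold
                         and noise_out < speech_out)
            if is_speech:
                self.hang_over = 1
                self.speech_flag = SPEECH
            elif self.hang_over == 0 or self.hang_over >= self.hang_over_max:
                self.hang_over = 0
                self.speech_flag = NON_SPEECH
            else:
                self.hang_over += 1        # still within the hangover time
                self.speech_flag = SPEECH
            return self.speech_flag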
  • the speech detector needs approximately 1 kiloword of memory for variables and buffers, of which the sample buffer and the autocorrelation buffer take up the greatest part. Data tables are also needed. They include the neural network weighting coefficient table and the tangent sigmoid function memory table. These are located in program memory and their total size is 730 words. The total memory need of the speech detector is thus approximately 1.7 kilowords.
  • the best speech detector is a solution based on a neural network using autocorrelation function properties and LPC coefficients as input.
  • the performance of the speech detector is ultimately determined by the neural network training. The best results are naturally achieved with particularly extensive training material.
  • the neural network is able to make very reliable classification decisions if the background noise is of the type used in training.

Abstract

A method for speech detection in a telecommunication system comprising a signal source and a signal processor for processing a signal to be observed. In the method desired identification numbers are determined (107-110) from said signal, said identification numbers are fed into a neural network, output values are calculated (111) for speech and noise neurons on the basis of said identification numbers in said neural network and a decision is made (104) on whether said signal is speech or noise.

Description

SPEECH DETECTION IN A TELECOMMUNICATION SYSTEM
BACKGROUND OF THE INVENTION
The invention relates to a method for speech detection in a telecommunication system comprising a signal source producing a signal, a signal processor including a neural network for processing said signal, in which method said neural network is trained to distinguish between a speech signal and a noise signal using speech and noise samples.
An important problem in different speech processing telecommunication systems is speech detection, i.e. classifying a signal as speech or non-speech. Non-speech is generally either silence or background noise, but may also be an information signal like the DTMF (dual tone multi frequency) signal to be transferred on a telephone channel. The capacity of the classifier is often decisive for the operation of the system.
The classification of a signal as speech and non-speech is, for example, included in a prehandling part of each speech identification system. It is essential in speech identification to find the precise starting point of speech sounds, words or sentences to ensure reliable identification. In speech recording the amount of data to be processed can be significantly reduced by recording only speech-containing data. The load and power consumption of a mobile station or another telecommunication apparatus or system can be reduced if only information corresponding to real speech is coded and transmitted, not unnecessary background noise.
A speech signal is composed of consecutive speech sounds. Pressure pulses deriving from the vocal cords pass through the oral cavity and/or the nasal cavity forming a speech sound. The shape of the cavities affects the pronunciation result, as well as which of the cavities is open, thus distinguishing between different speech sounds. The speech sounds can be grouped in various ways, a basic division being to group them as vowels and consonants. Here background noise refers to all sound signals that are not speech and that will be processed by a speech detector.
In different publications background noise is often modelled as white noise. This accurately models noise occurring in analog mobile or radio phones, but for modelling background noise occurring in various real life situations the model is too simple. Background noise can roughly be divided into two categories. Stationary noise is fairly even, continuous noise that can be caused by, for example, a ventilator, a copying machine, a restaurant environment or traffic. A common feature is the continuity of the noise. Observed in the time domain, the noise occurs continuously throughout the entire observation period.
The other type of background noise is dynamic background noise. It is composed of random, heavy noise peaks. For example, all bangs and slams can be classified as dynamic noise. In the time domain, such noise is seen as fairly short noise sequences having a strong amplitude.
Speech can be distinguished from background noise by determining values for various properties describing the signal. Examples of such properties are the signal energy or spectrum.
In the following some prior art speech detection methods and systems are described.
US patent 5,611,019 describes a speech detection arrangement. The arrangement comprises a reference model, a parameter extractor for extracting parameters from an input signal and decision-making means for deciding whether the signal is speech or non-speech. The presented solution also includes the idea of training the arrangement to detect and distinguish speech from non-speech.
US patent 5,598,466 describes a method for detecting speech activity in a semi-duplex audio telecommunication system. In this method an average peak value representing the envelope of an audio signal is determined.
US patent 5,596,680 shows a method and an apparatus for detecting speech activity in an input signal. In this method a starting point is detected using power/zero crossing. For detecting the end point of the speech sound, the cepstrum of the signal is determined. When the starting and end points of the speech sound are determined, vector quantization is used to classify the speech sound as either speech or noise.
US patent 5,572,623 shows a method for speech detection from a signal containing noise. In this method a frame containing sound is detected, noise-containing frames preceding said frame are searched for, and an autoregressive model of noise and a spectrum of average noise are formed. Then the frames preceding the sound are whitened in the spectrum, wherefrom the real starting point of speech is searched for.
US patent 5,276,765 shows a voice activity detector (VAD) to be used in an LPC coder in a mobile system. The solution operates in connection with the LPC coder but does not use LPC coefficients when deciding whether the signal to be observed is speech or noise. Instead, the solution uses autocorrelation coefficients of an input signal, weighted and combined, for classifying the signal parts as speech or correspondingly noise.
A speech detector used in present mobile communication systems is shown in the ETSI GSM standard ETSI GSM 06.32, v.4.1.0, July 1995: Voice Activity Detection.
A method and an arrangement for speech detection is shown in the publication Electronic Design, March 22, 1990, Newman W.C., Detecting Speech with an Adaptive Neural Network, pp. 79-89. Since speech is composed of very different speech sounds, and background noise can also be very different, it is difficult to find common properties for both background noise groups, i.e. for stationary and dynamic noise. On this account it is also difficult to make a classification decision.
A problem with prior art solutions is that they generally classify heavy background noise as speech. Classifying dynamic noise in particular, for example bangs, by the simple methods described often produces incorrect results.
A problem particularly in semi-duplex trunking mobile communication systems is that although the mobile stations comprise a push-to-talk (PTT) switch for indicating the start and the end of a speech item, a subscriber talking from another telecommunication system to said semi-duplex mobile communication system cannot inform the mobile communication system when he/she will use his/her speech item. In this case speech detection is required at least for all speech signals arriving from outside said semi-duplex mobile communication system.
BRIEF DESCRIPTION OF THE INVENTION
An object of the invention is to provide a method and an apparatus for speech detection, the method and the apparatus operating in real time, taking an input signal from a telephone channel and being able to distinguish speech from noise, particularly from dynamic noise, as reliably as possible.
The invention relates to a method for speech detection in a telecommunication system, characterized by comprising the following steps: determining from said signal identification numbers comprising at least the following identification numbers
- LPC coefficients of said signal,
- a peak value lag of an autocorrelation function of said signal,
- an autocorrelation function peak value of said signal divided by said signal energy, and
- a number of 0-level exceedings in said signal during a determined observation period, feeding said identification numbers as input vectors into said neural network previously trained to distinguish between speech and noise signals using speech and noise samples, calculating output values for speech and noise neurons on the basis of the identification numbers included in said input vectors in said neural network, deciding whether said signal is speech or noise on the basis of said output values.
The main idea of the invention is to employ the neural network in speech detection.
In accordance with the invention the properties determined from the signal to be observed, particularly the LPC coefficients, autocorrelation function properties and 0-level exceedings of the signal, are given as input to the neural network making the classification decision. The neural network to be used in this inventive method is trained to distinguish speech from background noise.
An advantage with the invention is that the neural network of the invention can also be trained to distinguish heavy background noise from speech.
Another advantage with the invention is that the neural network is simple and easy to implement also as a real time solution, thus being applicable to speech transfer in almost real time, for example, in a mobile communication system.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following the invention will be described in greater detail with reference to accompanying drawings, in which
Figure 1 is a flow chart describing a method of the invention, and Figure 2 is a block diagram showing a neural network used in the implementation of the invention.
DETAILED DESCRIPTION OF THE INVENTION
In accordance with the invention a neural network is used in the decision-making only when the signal average exceeds a particular threshold value. Thus the performance of the entire speech detection method and apparatus is improved at low noise levels. Autocorrelation is calculated and 0-level exceedings are counted from a 60 ms time window. The LPC coefficients are obtained directly from a TETRA speech codec. Hangover time refers to the time after the last speech classification during which the signal is still classified as speech although the window to be observed no longer includes speech. As a result the weak consonants in the middle and end of a sentence are detected.
Autocorrelation
An autocorrelation function of the signal x(n) is determined:

φ(k) = ∑ x(m)x(m + k), summed over all m    (1)

A local (short-time) autocorrelation function is determined as follows:

Rₙ(k) = ∑ x(n + m)x(n + m + k)w(m)w(m + k), summed over m = 0, ..., N − 1 − k    (2)

In equation (2) w(n) is a finite-length window function. The parameters are again the window length N and the window function shape.
Autocorrelation indicates how the signal correlates with itself. Periodicities can be detected in the signal by means of autocorrelation. Since voiced speech sounds comprise basic frequency sequences, the autocorrelation function should have peaks at intervals corresponding to the basic frequency. The existence of these peaks as well as their location and amplitude are properties by which speech can be distinguished from background noise using the autocorrelation function. Though a telephone channel attenuates frequencies below 300 Hz, the autocorrelation function peaks more probably correspond to the first resonance, or formant, frequency than to the basic frequency.
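As a small numerical illustration of equation (2), the following Python/NumPy sketch computes the short-time autocorrelation of a crudely voiced-like test signal; the signal, the window choice and the lag range are arbitrary examples, not values prescribed by the patent:

    import numpy as np

    fs = 8000                                  # 8 kHz sampling frequency
    n = np.arange(480)                         # one 60 ms window
    x = np.sign(np.sin(2 * np.pi * 160 * n / fs))  # voiced-like signal, F0 = 160 Hz
    w = np.hamming(len(x))
    xw = x * w
    # Short-time autocorrelation per equation (2), lags k = 0..199
    R = np.array([np.sum(xw[:len(xw) - k] * xw[k:]) for k in range(200)])
    peak = 9 + np.argmax(R[9:])                # ignore the lags right after 0
    print(peak, fs / peak)                     # peak near lag 50, i.e. ~160 Hz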
Linear prediction and LPC coefficients

A linear predictor is a system that predicts a sample value as a linear combination of previous samples. The output of the system is thus:

ŝ(n) = ∑ aₖ s(n − k), summed over k = 1, ..., p    (3)

Here p is the order of the predictor and the aₖ coefficients are weighting coefficients, or so-called LPC coefficients, of the previous samples. These coefficients can be solved by minimizing the sum E of the squares of the differences between the real samples and the predicted samples over a finite interval n in accordance with equation (4):

E = ∑ₙ ( s(n) − ∑ aₖ s(n − k) )², inner sum over k = 1, ..., p    (4)

In practice the coefficients are generally solved using an autocorrelation method where the matrix equation is solved using the Levinson-Durbin recursion.
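A compact Python sketch of the Levinson-Durbin recursion used to solve the LPC coefficients from the autocorrelation values; this is the standard textbook form given for illustration, not the TETRA codec's exact routine:

    def levinson_durbin(r, p):
        # r: autocorrelation values r[0..p]; returns LPC coefficients a_1..a_p
        a = [0.0] * (p + 1)
        e = r[0]                               # prediction error energy
        for i in range(1, p + 1):
            acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
            k = acc / e                        # reflection coefficient
            new_a = a[:]
            new_a[i] = k
            for j in range(1, i):
                new_a[j] = a[j] - k * a[i - j]
            a = new_a
            e *= (1.0 - k * k)
        return a[1:]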
Autocorrelation is the best one of the methods studied. Car noise and white noise were distinguished from speech fairly reliably. Cepstrum gives similar results to autocorrelation. LPC was strong with the theater interval noise sample representing general noise. A bang sample was the most difficult noise sample.
On the basis of what has been presented above, autocorrelation and LPC coefficients are worth taking as the detection methods of the speech detection method and apparatus of the invention.
Neural networks
Neural networks are nonlinear calculation networks which are tuned using training data. The neural network is loosely based on brain function, but is nevertheless inadequate to model the real function of the brain. The neural network can be implemented as a neurocomputer, referring to a highly parallel adaptive network consisting of simple calculation elements. Such a network functions like a real biological neural network.
The neural network consists of several calculation elements, or neurons, and the connections between them. A neuron is given one or more input values and based on these values the neuron produces a result that can be forwarded to one or more neurons. The neuron calculates the result as follows:

a = F(w · p + b)    (5)

where a is the result, p is an input vector, w is a weighting coefficient vector, b is a bias and F is the transfer function of the neuron. A dot product is thus calculated between the weighting coefficients and the input, the bias is added thereto and the result is given as an argument to the transfer function producing the final result.
The neurons can be combined into neuron layers. A neuron layer comprises one or more neurons. The neuron layer neurons all obtain the same input vector and the neurons have the same transfer function. The neuron layer result vector can be given to another layer as input, in which case a multilayer neural network is obtained.
The neural network has to be trained to solve a given problem. During the training phase the weighting coefficients and the bias values of the network neurons are tuned according to a training algorithm. In the invention some real life recordings are used to represent background noise. These recordings are chosen so as to represent real situations particularly well and extensively. After training, the neural network is used by giving the network an input vector for which the network calculates a corresponding output. A properly trained neural network always calculates a similar result for an input vector group possessing certain properties.
The neural network is well applicable to solve the classification problem. In this invention the neural network decides into which category the given input vector belongs i.e. whether the signal sample to be observed is speech or noise.
A suitably trained neural network succeeds in classifying the signal as speech or background noise. The input to be brought into the system can be the signal as such or parameters determined from it. An example of the former is found in the publication presented above, Electronic Design, March 22, 1990, Newman W.C., Detecting Speech with an Adaptive Neural Network, pp. 79-89, presenting a neural network which is given a 25 ms signal as input, i.e. 250 samples at a 10 kHz sampling frequency. In the method and apparatus of the invention the LPC coefficients that are calculated from the signal and that describe the signal are given as input to the neural network.
The inventive method and apparatus are designed to be implemented in a digital signal processor. This is why the training data is also assembled using a signal processor. The signal processor calculates the LPC coefficients and the autocorrelation and counts the number of zero exceedings and stores the values in memory.
Eight neurons are selected for a hidden layer of the neural network of the invention. An increase in the number of neurons does not necessarily improve the function of the network. A hyperbolic tangent sigmoid transfer function is chosen as the transfer function of the hidden layer: F(n) = tanh(n) (6)
For the output layer of the neural network two neurons are chosen which are trained in such a way that one reacts to speech and the other to noise. The transfer function of the output layer is linear. According to the method of the invention the signal is classified as speech when the output of the neuron tuned for speech is higher than the threshold value and the output of the noise neuron is smaller than that of the speech neuron.
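The whole network and its classification rule can be sketched in Python/NumPy as follows. The 8-8-2 structure, the tanh hidden layer, the linear output layer and the decision rule follow the description; the random weights are placeholders for trained values and the threshold constant is hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(8, 8)), rng.normal(size=8)  # hidden layer (placeholder weights)
    W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)  # speech and noise neurons
    THRESHOLD = 0.5                                       # hypothetical decision threshold

    def classify(x):
        # x: input vector [LPC1, LPC3, LPC4, LPC5, LPC10, Acor1, Acor2, ZeroR]
        h = np.tanh(W1 @ x + b1)               # hidden layer, tanh sigmoid transfer
        speech_out, noise_out = W2 @ h + b2    # linear output layer
        return speech_out > THRESHOLD and noise_out < speech_out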
When the number of speech vectors is increased in the training data the performance of the network improves, since the threshold value can be increased without increasing speech clipping. The erroneous interpretations of noise as speech are thus reduced. By changing the threshold value a compromise can be reached between speech clipping and erroneous speech detection. The neural network performance at low noise levels is poor. Therefore an observation of the signal amplitude average is added to the method and apparatus of the invention. The neural network is not used in the classification until the signal average exceeds a predetermined threshold value. A quieter signal is classified directly as silence.

In the example case the software implementing the method and apparatus of the invention is carried out as software run in real time in the digital signal processor. The Texas Instruments TMS320C50 signal processor is used as the processor, which is able to receive and transmit audio data sampled at an 8 kHz sampling frequency. The software is implemented in such a way that speech travels as such from the input to the output of the signal processor I/O port only when the algorithm classifies the signal as speech. Otherwise silence is transmitted. When the method and apparatus of the invention are implemented in a telecommunication system the practical solution can naturally be carried out in another way, for example in such a way that white noise or another sound, like an audible tone, can be heard instead of noise from the speech detector.
The method and apparatus of the invention are used for speech detection. The function of the method and apparatus of the invention can be significantly improved by adding a so-called hangover time to them. When the method or the apparatus performs a first non-speech classification after speech detection, the hangover time is started, during which a final classification decision is made, i.e. a decision on whether the signal is speech or noise. It is thus easier to include the weak consonants in the middle of the sentence. A possible hangover time added to the method and apparatuses of the invention and to the algorithms implementing them is 500 ms.

In telecommunication systems it is possible to locate the speech detector in the exchange or transmission parts of the telecommunication system. Naturally the speech detector can also be located in a telecommunication terminal. In mobile communication systems the speech detector of the invention can be located, for example, in a system exchange, a base station controller, a base station or a mobile station, or in several of the above mentioned network elements.
The invention is here described with reference to the mobile communication system in particular. A mobile communication system typically has an exchange connected to other telecommunication and telephone networks using an interface unit. The mobile communication exchange is connected to a fixed telephone network using the interface unit of the fixed network. The speech detector of the invention can be located in this interface unit of the fixed network in particular. Naturally the speech detector can also be located elsewhere in the mobile communication system or in its exchange. The classification information produced by the speech detector of the mobile communication system, on whether the signal arriving from the telephone line or the speech channel of the fixed telephone network is classified as speech or non-speech, is transferred through a central processing unit (CPU) of the interface unit of the fixed telephone network to a call control computer (CCC) of the exchange. The CCC distributes the speech items between the radio telephone subscriber and the fixed network subscriber on the basis of the information.
The speech detection method and apparatus of the invention can be carried out for example in the Texas Instruments TMS320C50 (40 MIPS) digital signal processor, whose main task is to convert or code speech arriving from the telephone line and correspondingly decode speech going to the fixed telephone line in the TETRA (Trans European Trunked Radio) mobile communication system.
In the TETRA mobile communication system the speech coding algorithm is based on the ACELP (Algebraic Code-Excited Linear Predictive) coding model. In this model a speech sample frame is synthesized by filtering a suitable excitation sequence with two time-dependent filters. The first filter is a long-term prediction filter by which the periodicities of the speech signal are modelled. The second filter is a short-term prediction filter by which the envelope of the speech spectrum is modelled. The short-term filtering is performed with the all-pole synthesis filter:

1/A(z) = 1 / (1 − ∑ aᵢ z⁻ⁱ), summed over i = 1, ..., p    (7)

where the aᵢ are the linear prediction coefficients and p is the order of the predictor, which is 10 in the TETRA codec. The linear prediction analysis is performed at 30 ms intervals using a 32 ms asymmetrical window. The window consists of two Hamming windows of different lengths, the first comprising 216 samples of the frame to be processed and the second 40 samples of the next frame. The LPC coefficients are solved employing an autocorrelation method using the Levinson-Durbin recursion.
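One plausible construction of the 256-sample (32 ms) asymmetrical window, assuming each part is half of a symmetric Hamming window; the exact TETRA window definition is not given in the text, so this Python/NumPy sketch is only consistent with the 216 + 40 sample split described above:

    import numpy as np

    # First part: rising half of a 432-sample Hamming window (216 samples);
    # second part: falling half of an 80-sample Hamming window (40 samples).
    part1 = np.hamming(2 * 216)[:216]
    part2 = np.hamming(2 * 40)[40:]
    window = np.concatenate([part1, part2])    # 256 samples = 32 ms at 8 kHz
    assert len(window) == 256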
The LPC coefficients calculated by the TETRA codec can, in accordance with the invention, be delivered directly as the speech detector input.
When implementing the TETRA codec in practice speech is always coded two frames at a time, or two times successively at 60 ms intervals. The LPC coefficients used by the speech detector are calculated on the basis of the first frame in accordance with the asymmetrical window described above. The other speech detector inputs can be determined for coding from a buffered 60 ms window. If signal was gathered into the buffer during a time period over 60 ms and the buffered data was given as delayed to the speech codec, the window length used by the speech detector could be extended.
The combined execution time of the encoder and decoder used in the TETRA mobile communication system has been measured as 27 ms. Since the coder, and with it the speech detector, is called at 60 ms intervals, at most 33 ms remains at the speech detector's disposal. The execution time of the speech detector implemented in the manner according to the invention has been measured as 18 ms, so the system meets the real-time requirements.
The speech detector is implemented as a separate program block within the overall digital signal processor program code. The implementation is written in the assembly language of the TMS320C50 processor.
A subprogram makes all the arrangements for encoding in the TETRA mobile communication system. It copies the 480 samples to be coded into the buffer for the speech detector. The speech detector is called from the subprogram after the first telecommunication frame (240 samples) has been coded; by then the encoder has calculated the LPC coefficients of the first frame needed by the speech detector. The method of the invention can thus be carried out as software running in the digital signal processor, with the speech detector residing in the same processor as the speech codec of the mobile communication system.
Figure 1 shows a flow chart describing the method of the invention. According to the first embodiment of the invention, the first operation the speech detector performs is to calculate the average of the samples in step 101. The function of the entire speech detection system improves significantly when only a signal whose average exceeds a certain level is given to the neural network for classification. Although the neural network inputs are in principle independent of the signal level, in practice the input values determined from the same speech sample at different signal levels deviate from one another. This is caused by the biasing inaccuracy of the A/D and D/A converters used in the arrangement, the effects of which become apparent at low signal levels. The average is calculated in a loop in which the sum of the absolute values of all 480 samples fed into the buffer is formed. This is conveniently done at 32-bit precision utilizing the 32-bit accumulator buffer of the C50 processor presented above. The sum obtained is divided by 512 by shifting it to the right; this yields 0.9375 times (480/512) the true average, which is sufficiently close to the real average for the comparison that follows. In the following step of the method the result of the average calculation is compared 102 to a fixed threshold value, and if the average is smaller 103A than the threshold value, a jump 103 is performed directly to the decision logic part 104 and the signal is classified as non-speech. The threshold value is set as high as possible without cutting off the speech signal, so that noise arriving from a normal speech channel is removed. The hangover time can then be increased in the next step 105 of the method. If, on the other hand, the average exceeds 106 the threshold value, the process continues with the neural network part.
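A minimal sketch of this level gate, assuming 16-bit samples; AVG_THRESHOLD is a placeholder, since the patent does not disclose the actual threshold value:

#include <stdint.h>

#define FRAME_LEN 480
#define AVG_THRESHOLD 64               /* hypothetical value */

/* Sum the absolute values of all 480 buffered samples in a 32-bit
 * accumulator, then divide by 512 with a right shift; the result is
 * 480/512 = 0.9375 times the true average, close enough for the
 * threshold comparison. Returns nonzero if the frame passes the gate. */
static int frame_passes_level_gate(const int16_t s[FRAME_LEN])
{
    int32_t sum = 0;
    for (int n = 0; n < FRAME_LEN; n++)
        sum += (s[n] < 0) ? -(int32_t)s[n] : (int32_t)s[n];

    int32_t avg = sum >> 9;            /* divide by 512 instead of 480 */
    return avg >= AVG_THRESHOLD;       /* below threshold: non-speech */
}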
Alternatively, in the second embodiment of the invention, said average calculation and comparison are replaced with a pre-processing part whose function is to remove the DC offset from the signal, i.e. to centre the signal as closely around the 0 level as possible. This is carried out by filtering the signal with a high-pass filter whose cut-off frequency is very low. The filtering algorithm in the time domain is the following: s'(n) = s(n) - s(n-1) + alpha * s'(n-1) (8) where s(n) is the original signal and s'(n) is the filtered signal, alpha determining the cut-off frequency. After both embodiments the process continues by calculating 107 the autocorrelation function of the signal to be observed. Before the autocorrelation is calculated, the LPC coefficients of the signal are determined 106A. In the implementation of the invention the values of selected LPC coefficients, for example coefficients 1, 3, 4, 5 and 10 calculated by the speech encoder of the TETRA mobile communication system, are copied into the input buffer of the neural network. The encoder stores the coefficients LPC1, LPC3, LPC4, LPC5 and LPC10 calculated on the basis of the first frame in memory, from where they are transferred for use by the neural network.
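Equation (8) translates directly into code. The following sketch assumes floating-point samples and an illustrative alpha close to one; the patent does not give a value for alpha.

#define ALPHA 0.99f                    /* assumed example; sets cut-off */

/* First-order high-pass filter of equation (8), removing the DC
 * offset: s'(n) = s(n) - s(n-1) + alpha * s'(n-1). */
static void remove_dc_offset(const float *s, float *out, int n)
{
    float prev_in = 0.0f, prev_out = 0.0f;
    for (int i = 0; i < n; i++) {
        out[i] = s[i] - prev_in + ALPHA * prev_out;
        prev_in = s[i];
        prev_out = out[i];
    }
}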
In the method of the invention autocorrelation is calculated 107 using two nested loops. The inner loop calculates the actual sum of products and the outer loop sets the indexing in place. The outer loop sets two auxiliary registers of the C50 processor, one indicating the beginning of the sample buffer and the other a position preceding the beginning of the buffer by the lag to be calculated, corresponding to the value k in equation (1). The lag is the value of the autocorrelation function argument, i.e. k, described in equations (1) and (2). In addition, the number of repetitions of both the outer and the inner loop is stored in an auxiliary register. The inner loop is repeated 480 times when the first lag is calculated, 479 times for the second, 478 times for the third, and so on. The outer loop stores the result calculated by the inner loop in the autocorrelation buffer. The inner loop consists of three instructions: the first performs the multiplication, the second adds the result to the sum and loads the multiplication register for the following multiplication, and the third is a NOP (no operation), which pads the loop to the minimum length of three instructions required by the C50. The multiplication result is shifted 6 bits to the right, whereby 128 product sum operations can be performed without risk of overflow. There is a small risk of overflow when the first lags are calculated, but the typically low signal levels and the use of saturation arithmetic reduce the risk further. Thus, two more bits of calculation precision are obtained compared to a scaling that is entirely without risk.
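The same computation, stripped of the assembly-level details, can be sketched in C as follows. The number of lags is left as a parameter since the text does not fix it here, and the 6-bit right shift of each product mirrors the scaling described above.

#include <stdint.h>

#define FRAME_LEN 480

/* Nested-loop autocorrelation: r[k] = sum over n of s[n]*s[n-k].
 * Each product is shifted 6 bits right before accumulation, as in
 * the text; as noted above, a small overflow risk remains for the
 * first lags, which the real implementation handles with saturation. */
static void autocorrelate(const int16_t s[FRAME_LEN],
                          int32_t *r, int num_lags)
{
    for (int k = 0; k < num_lags; k++) {       /* outer loop: one lag   */
        int32_t sum = 0;
        for (int n = k; n < FRAME_LEN; n++)    /* inner loop: products  */
            sum += ((int32_t)s[n] * s[n - k]) >> 6;
        r[k] = sum;
    }
}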
After the autocorrelation has been calculated 107, the highest peak of the autocorrelation function is searched for 108, i.e. the highest buffer value apart from the value corresponding to lag 0. In practice some of the values immediately following lag 0 are also high, and therefore the search for the peak value is not started until lag 9. The entire autocorrelation buffer is gone through in a loop and the memory address of the highest value is stored. When the start address of the buffer is subtracted from said address, the lag corresponding to the peak value is obtained, and this information Acor1 is copied into the neural network input buffer. As the last calculation operation connected with the autocorrelation, the peak value is divided 109 by the energy, i.e. by the lag 0 value. This is performed using the conditional subtraction command of the C50 processor. The result obtained, Acor2, is transferred to the neural network input buffer.
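A sketch of the peak search and the energy normalization; a floating-point division stands in for the conditional-subtraction division of the C50, and the function name is an assumption of this illustration:

#include <stdint.h>

/* Skip lags 0..8, find the highest autocorrelation value, and
 * report both its lag (Acor1) and the peak normalized by the
 * energy r[0] (Acor2). Assumes num_lags > 9. */
static void autocorr_features(const int32_t *r, int num_lags,
                              int *acor1_lag, float *acor2_norm)
{
    int best = 9;                         /* search starts at lag 9 */
    for (int k = 10; k < num_lags; k++)
        if (r[k] > r[best])
            best = k;

    *acor1_lag = best;
    *acor2_norm = (r[0] != 0) ? (float)r[best] / (float)r[0] : 0.0f;
}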
Next, the number of 0-level exceedings in the signal is counted in the method of the invention. When counting 110 the 0-level exceedings, the signs of the previous and the current sample are stored in the accumulator and the accumulator buffer. These are compared by summing the signs and comparing the result with a bit mask. If the signs differ, the counter in the auxiliary register is incremented by one. The entire sample buffer is thus gone through in the loop, and the result is transferred to the neural network input buffer.
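A sketch of the 0-level exceeding count; a simple sign comparison replaces the accumulator and bit-mask mechanics of the assembly implementation:

#include <stdint.h>

/* Compare the sign of each sample with the sign of its predecessor
 * and count the changes; the result is the ZeroR input of the
 * neural network. */
static int count_zero_level_exceedings(const int16_t *s, int n)
{
    int count = 0;
    for (int i = 1; i < n; i++)
        if ((s[i] >= 0) != (s[i - 1] >= 0))
            count++;
    return count;
}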
Figure 2 shows a block diagram of the neural network used in the implementation of the invention. The figure shows the vector 201 fed to the neural network, the vector 201 comprising the five LPC coefficients calculated above, LPC1, LPC3, LPC4, LPC5 and LPC10, the two autocorrelation values Acor1 and Acor2 mentioned above, and the number of 0-level exceedings ZeroR in the signal. The figure further shows a hidden layer 202 comprising eight neurons 203, 204, 205, 206, 207, 208, 209 and 210, and an output layer 211 with two neurons, a speech neuron 212 and a noise neuron 213. Finally, the block diagram shows the decision logic 214 of the invention, which interprets the values of the output layer 211.
In the method shown in Figure 1, the neural network of Figure 2 calculates 111, on the basis of the input values, the output values of the speech neuron 212 and the noise neuron 213, on the basis of which the decision logic 214 makes 104 a classification decision on the signal to be processed.
The operating principle of the neural network is the following. In every neuron only a dot product according to equation (5) is calculated, which is easy to implement with the MAC (multiply and accumulate) command of the C50 processor. Fixed-point calculation, however, presents a problem. The weighting coefficients obtained during the training phase of the neural network vary over a wide range; typically the weighting coefficients of a neuron can lie between 0.05 and 90. The coefficients therefore have to be scaled neuron-specifically to maintain the best possible calculation precision. The intermediate results of the hidden layer 202 can also vary over a wide range.
A hidden layer 202 neuron first calculates the product sum between its weighting coefficients and input values. The coefficients are scaled into [-0.25, 0.25] by dividing by a suitable power of two that is higher than the highest coefficient; an additional division by four is performed to reduce the risk of overflow in the MAC operation. A bias is added to the sum in the MAC loop: the bias is multiplied by one and added to the sum as the last component. The sum obtained is then shifted to the left so that the result corresponds to the result that unscaled weighting coefficients would have produced. The upper accumulator then holds the fractional part of the result and the lower accumulator the integer part. Sign bits are set and trash bits are removed using bit masks. The 32-bit end result is stored in the hidden layer 202 buffer.
All eight neurons 203-210 of the hidden layer 202 are implemented in the manner described above. The weighting coefficient scaling causes neuron-specific differences in the amount of shifting and in the bit masks.
The results in the hidden layer buffer should next be given as arguments to the tangent sigmoid transfer function presented above in equation (6). In practice the only sensible way to perform the transfer function is to tabulate the function values; a compromise then has to be reached between the memory space needed and the table accuracy. In the implementation, 640 tabulated values corresponding to the function values in the argument range [0, 2.5] were chosen. Negative values are obtained as complements of the tabulated values. If the argument is higher than 2.5 the function value is already so close to one that it is approximated by one. In the implementation an index for the table search is formed from the 32-bit result comprising the integer part and the fraction. If the index exceeds the end address of the table, the search result is set to one. If the starting value is negative, the search is performed with the corresponding positive value and the retrieved value is negated. The table search is performed in a loop for all the neuron 203-210 results of the hidden layer 202, and the results are stored in the hidden layer buffer. Since the results are within [-1, 1], they can now be represented as 16 bits and are stored in every second memory address of the hidden layer 202 buffer.
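The tabulation scheme can be illustrated with the following sketch. The table size and argument range mirror the description above, while the floating-point types, the mid-point sampling of each table cell and the function names are assumptions of this illustration.

#include <math.h>

#define TANH_TABLE_SIZE 640
#define TANH_ARG_MAX    2.5f

static float tanh_table[TANH_TABLE_SIZE];

/* Fill the table with tanh values over [0, 2.5]; negative
 * arguments are handled by symmetry, since tanh is odd. */
static void init_tanh_table(void)
{
    for (int i = 0; i < TANH_TABLE_SIZE; i++)
        tanh_table[i] = tanhf((i + 0.5f) * TANH_ARG_MAX / TANH_TABLE_SIZE);
}

/* Table lookup with clamping: arguments beyond 2.5 are
 * approximated by 1, as in the description above. */
static float tanh_lookup(float x)
{
    float ax = (x < 0.0f) ? -x : x;
    if (ax >= TANH_ARG_MAX)
        return (x < 0.0f) ? -1.0f : 1.0f;
    int idx = (int)(ax * (TANH_TABLE_SIZE / TANH_ARG_MAX));
    float y = tanh_table[idx];
    return (x < 0.0f) ? -y : y;
}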
The implementation of the speech 212 and noise 213 neurons of the output layer 211 is similar to that of the hidden layer neurons, but somewhat simpler. Since the output values in the network training data are set to either zero or one, the output value in practice varies within [-0.1, 1.1]. Values higher than one can be saturated to one, and thus the results of the output neurons 212, 213 can be represented as 16 bits. This 16-bit result is ready in the upper accumulator of the processor, so setting sign bits with bit masks can be avoided. The end results are stored in the output buffer. Since the transfer function of the output layer 211 is linear, the actual transfer function need not be performed.
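As an overview of the complete forward pass, the following floating-point sketch computes the hidden layer and the linear output layer for the network of Figure 2. The fixed-point scaling and the tabulated transfer function of the real implementation are abstracted away (tanhf stands in for the table above), and the weight and bias arrays are assumed to come from training.

#include <math.h>

#define N_IN     8   /* LPC1, LPC3, LPC4, LPC5, LPC10, Acor1, Acor2, ZeroR */
#define N_HIDDEN 8
#define N_OUT    2   /* speech neuron 212, noise neuron 213 */

/* Forward pass: a dot product plus bias per neuron (equation (5)),
 * tangent sigmoid in the hidden layer (equation (6)), linear
 * output layer. */
static void forward_pass(const float in[N_IN],
                         const float w_h[N_HIDDEN][N_IN],
                         const float b_h[N_HIDDEN],
                         const float w_o[N_OUT][N_HIDDEN],
                         const float b_o[N_OUT],
                         float out[N_OUT])
{
    float hidden[N_HIDDEN];

    for (int i = 0; i < N_HIDDEN; i++) {       /* hidden layer */
        float sum = b_h[i];
        for (int j = 0; j < N_IN; j++)
            sum += w_h[i][j] * in[j];
        hidden[i] = tanhf(sum);                /* tangent sigmoid */
    }
    for (int i = 0; i < N_OUT; i++) {          /* linear output layer */
        float sum = b_o[i];
        for (int j = 0; j < N_HIDDEN; j++)
            sum += w_o[i][j] * hidden[j];
        out[i] = sum;
    }
}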
The decision logic 214 reads the output value of the speech neuron 212. If the value is smaller than the threshold value, the frame to be processed is classified as non-speech. If the value is higher, the speech neuron 212 output is compared with the noise neuron 213 output: if the noise neuron 213 value is higher, the signal is classified as non-speech, otherwise as speech. If the program execution arrives 103 at the decision logic block 214 directly from the average calculation block 101, the signal is directly classified as non-speech.

A hangover time increase block 105 shown in Figure 1 is the last part of the speech detector. A signal classified as speech initializes the _hang_over variable, which adjusts the hangover time, to one, and the _speech_flag variable, i.e. the speech detector output, is set to the value 0x1111 corresponding to speech. If a frame is classified as non-speech, the value of the _hang_over variable is checked: if the value is zero or the maximum value, the _hang_over variable is set to zero and the _speech_flag variable to the value 0x0000 corresponding to non-speech; otherwise the _hang_over variable is incremented by one and the _speech_flag is set to the value 0x1111. When the speech detector is started, the _speech_flag and the _hang_over are set to zero. The hangover time can be adjusted in 60 ms steps through the maximum value of the _hang_over variable; the value 9 is used in the implementation, in which case the hangover time after the end of speech is 480 ms (8 * 60 ms). After the hangover time handling the _speech_detector subprogram exits.

The speech detector needs approximately 1 kiloword of memory for variables and buffers, of which the sample buffer and the autocorrelation buffer take up the greatest part. Data tables are also needed: the neural network weighting coefficient table and the tangent sigmoid function table. They are located in program memory and their total size is 730 words. The total memory need of the speech detector is thus approximately 1.7 kilowords.
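The decision logic and the hangover handling can be summarized in the following sketch. DECISION_THRESHOLD is a placeholder, as the patent does not disclose its value; the variable and function names otherwise follow the description above.

#define SPEECH      0x1111
#define NON_SPEECH  0x0000
#define HANG_MAX    9             /* max _hang_over: 8 * 60 ms = 480 ms */
#define DECISION_THRESHOLD 0.5f   /* hypothetical value */

static int hang_over;             /* _hang_over: frames within hangover */
static int speech_flag;           /* _speech_flag: detector output word */

/* Classify one frame from the output-neuron values and update the
 * hangover state exactly as described in the text. */
static int classify_frame(float out_speech, float out_noise)
{
    int is_speech = (out_speech >= DECISION_THRESHOLD) &&
                    (out_speech > out_noise);

    if (is_speech) {
        hang_over = 1;                    /* (re)start hangover */
        speech_flag = SPEECH;
    } else if (hang_over == 0 || hang_over >= HANG_MAX) {
        hang_over = 0;                    /* hangover elapsed */
        speech_flag = NON_SPEECH;
    } else {
        hang_over++;                      /* still within hangover */
        speech_flag = SPEECH;
    }
    return speech_flag;
}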
On the basis of the simulation results, the best speech detector is a solution based on a neural network using autocorrelation function properties and LPC coefficients as input. The performance of the speech detector is determined entirely by the neural network training; the best results are naturally achieved with particularly extensive training material. The neural network is able to make very reliable classification decisions if the background noise is of a type used in training.
It is clear that the parameters of the method and apparatus, for example the number of neurons and the length of the signal window, can be changed, thus affecting the operation of the speech detector.
It is obvious to those skilled in the art that, as technology progresses, the basic idea of the invention can be implemented in various ways. The invention and its embodiments are thus not restricted to the examples described above but may vary within the scope of the claims.

Claims

1. A method for speech detection in a telecommunication system comprising a signal source producing a signal, a signal processor including a neural network for processing said signal, in which method said neural network is trained to distinguish between a speech signal and a noise signal using speech and noise samples, characterized by the method comprising the following steps:
determining (107-110) from said signal identification numbers comprising at least the following identification numbers
- LPC coefficients (LPC1, LPC3, LPC4, LPC5, LPC10) of said signal,
- a peak value lag (Acor1) of an autocorrelation function of said signal,
- an autocorrelation function peak value of said signal divided by said signal energy (Acor2), and
- a number of 0-level exceedings (ZeroR) in said signal during a determined observation period,
feeding said identification numbers as input vectors into said neural network (Figure 2) previously trained to distinguish between speech and noise signals using speech and noise samples,
calculating (111) output values for speech (212) and noise (213) neurons on the basis of the identification numbers included in said input vectors in said neural network, and
deciding whether said signal is speech or noise on the basis of said output values.
2. A method as claimed in claim 1, characterized by said neural network input layer (201) and hidden layer (202) comprising eight neurons (203-210).
3. A method as claimed in claim 1, characterized by said neural network output layer comprising one speech neuron (212) and one noise neuron (213).
4. A method as claimed in claim 1, characterized by a decision logic (214) located after the neural network making said decision (104) on whether said signal is speech or noise.
5. A method as claimed in claim 1, characterized by comprising the following steps:
a) calculating (101) an amplitude average of the signal to be processed,
b) comparing (102) said average to a predetermined threshold value,
c) classifying (104) said signal as non-speech on the basis of said comparison if said average is smaller (103A) than said threshold value,
d) proceeding to process said signal on the basis of said comparison if said average is higher (105) than said threshold value, whereby
e) determining (106A) the predetermined LPC coefficients of said signal,
f) feeding LPC coefficient values into a neural network input buffer,
g) calculating (107) the autocorrelation function of said signal,
h) searching for (108) the highest peak value of the autocorrelation function,
i) subtracting a buffer starting address from said peak address, whereby a lag corresponding to the peak value is obtained,
j) feeding the lag corresponding to said peak value into said neural network input buffer,
k) dividing said peak value by said signal energy and obtaining a quotient,
l) feeding said quotient into said neural network input buffer,
m) counting (110) a number of 0-level exceedings in said signal during a determined observation period,
n) feeding said number into said neural network input buffer,
o) performing (111) said calculation in said neural network, and
p) making said decision (104) on whether said signal is speech or noise.
6. A method as claimed in claim 1, characterized by comprising the following steps:
a) filtering the signal to be processed by a high-pass filter to remove a DC offset of the signal,
b) determining (106A) the predetermined LPC coefficients of said signal,
c) feeding LPC coefficient values into a neural network input buffer,
d) calculating (107) the autocorrelation function of said signal,
e) searching for (108) a highest peak value of the autocorrelation function,
f) subtracting a buffer starting address from said peak address, whereby a lag corresponding to the peak value is obtained,
g) feeding a lag corresponding to said peak value to said neural network input buffer,
h) dividing said peak value by said signal energy, obtaining a quotient,
i) feeding said quotient into said neural network input buffer,
j) counting (110) a number of 0-level exceedings in said signal during a determined observation period,
k) feeding said number into said neural network input buffer,
l) performing (111) said calculation in said neural network, and
m) making said decision (104) on whether said signal is speech or noise.
EP98917143A 1997-04-18 1998-04-17 Speech detection Withdrawn EP0976124A2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
FI971679 1997-04-18
FI971679A FI971679A (en) 1997-04-18 1997-04-18 Detection of speech in a telecommunication system
PCT/FI1998/000345 WO1998048407A2 (en) 1997-04-18 1998-04-17 Speech detection in a telecommunication system

Publications (1)

Publication Number Publication Date
EP0976124A2 true EP0976124A2 (en) 2000-02-02

Family

ID=8548676

Family Applications (1)

Application Number Title Priority Date Filing Date
EP98917143A Withdrawn EP0976124A2 (en) 1997-04-18 1998-04-17 Speech detection

Country Status (6)

Country Link
EP (1) EP0976124A2 (en)
AU (1) AU736133B2 (en)
CA (1) CA2286770A1 (en)
FI (1) FI971679A (en)
NZ (1) NZ500272A (en)
WO (1) WO1998048407A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10381020B2 (en) 2017-06-16 2019-08-13 Apple Inc. Speech model-based neural network-assisted signal enhancement

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5276765A (en) * 1988-03-11 1994-01-04 British Telecommunications Public Limited Company Voice activity detection
JP2776848B2 (en) * 1988-12-14 1998-07-16 株式会社日立製作所 Denoising method, neural network learning method used for it
JPH03111898A (en) * 1989-09-26 1991-05-13 Sekisui Chem Co Ltd Voice detection system
JP2643593B2 (en) * 1989-11-28 1997-08-20 日本電気株式会社 Voice / modem signal identification circuit
IT1270438B (en) * 1993-06-10 1997-05-05 Sip PROCEDURE AND DEVICE FOR THE DETERMINATION OF THE FUNDAMENTAL TONE PERIOD AND THE CLASSIFICATION OF THE VOICE SIGNAL IN NUMERICAL CODERS OF THE VOICE
GB2278984A (en) * 1993-06-11 1994-12-14 Redifon Technology Limited Speech presence detector

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9848407A3 *

Also Published As

Publication number Publication date
NZ500272A (en) 2001-03-30
AU736133B2 (en) 2001-07-26
WO1998048407A3 (en) 1999-02-11
FI971679A (en) 1998-10-19
FI971679A0 (en) 1997-04-18
AU7045398A (en) 1998-11-13
WO1998048407A2 (en) 1998-10-29
CA2286770A1 (en) 1998-10-29


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19991022

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): BE DE DK ES FR GB IT NL SE

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: NOKIA CORPORATION

GRAH Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOS IGRA

RIC1 Information provided on ipc code assigned before grant

Free format text: 7G 10L 3/00 A

RTI1 Title (correction)

Free format text: SPEECH DETECTION IN A TELECOMMUNICATION SYSTEM

RIC1 Information provided on ipc code assigned before grant

Free format text: 7G 10L 11/02 A

RTI1 Title (correction)

Free format text: SPEECH DETECTION

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20030415